Abstract
Working effectively with breast cancer data to aid early detection is important to the health and wellbeing of women around the world. This study explores whether common statistical methods can be used to analyze breast cancer datasets. Because of the impact of breast cancer on women's wellbeing, the results of such analyses need to be both accurate and interpretable. The statistical methods investigated in this study are logistic regression, principal component analysis, and cluster analysis. Logistic regression was applied to Haberman’s Breast Cancer Survival dataset to model the relationship between the variables in the dataset. Principal component analysis was used to reduce the dimensionality of the anthropometric data of breast cancer patients and controls in the Coimbra breast cancer dataset. Finally, hierarchical and K-means clustering were used to group the Wisconsin breast cancer data into benign and malignant breast masses.