Exploratory Data Analysis on Breast Cancer Patient Data

title: "Exploratory Data Analysis on Breast Cancer Patient Data"
date: "2024-02-10"
description: "A walkthrough of my exploratory data analysis project on the Wisconsin Breast Cancer dataset: feature correlation, visualization, and what the data actually tells us about diagnosis."
tags: ["python", "ml", "data-analysis"]
published: true
image: ""

EDA projects have a reputation for being boring — load the dataset, check .info(), draw a heatmap, done. This one turned out to be more interesting than I expected, mostly because the data itself tells a clear story if you know what you're looking for, and because some of what I found initially looked like good news but was actually a warning sign.

The Dataset

The Wisconsin Breast Cancer dataset is one of the canonical ML datasets: 569 samples, 30 features, binary classification (Malignant / Benign). The features are computed from digitized images of fine needle aspirate (FNA) biopsies and describe characteristics of the cell nuclei: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each measurement comes in three variants: mean, standard error, and worst (the mean of the three largest values), giving you 10 measurements × 3 = 30 features.

from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
 
data = load_breast_cancer()
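# sklearn names the columns "mean radius", "worst area", and so on. Move the
# leading "mean"/"worst" to the end so the columns match the radius_mean /
# area_worst style used in the rest of this post ("radius error" becomes
# "radius_error").
def to_snake(name):
    words = name.split()
    if words[0] in ('mean', 'worst'):
        words = words[1:] + words[:1]
    return '_'.join(words)

df = pd.DataFrame(data.data, columns=[to_snake(c) for c in data.feature_names])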
df['diagnosis'] = data.target  # 0 = Malignant, 1 = Benign
 
print(df.shape)          # (569, 31)
print(df['diagnosis'].value_counts())
# 1    357  (Benign)
# 0    212  (Malignant)

The class distribution — 357 benign vs 212 malignant — is worth noting immediately. It's not wildly imbalanced, but it's enough that accuracy alone is a misleading metric. A model that predicts "Benign" for everything scores 62.7% accuracy while being completely useless.
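The always-"Benign" baseline is easy to check directly. This is a minimal sketch using scikit-learn's `DummyClassifier`; the variable names `baseline_accuracy` and `malignant_recall` are mine, not part of the original project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = load_breast_cancer(return_X_y=True)  # 0 = Malignant, 1 = Benign

# A "model" that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
# Recall on the malignant class: share of malignant cases that get flagged
malignant_recall = recall_score(y, y_pred, pos_label=0)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"malignant recall: {malignant_recall:.3f}")
```

The accuracy looks respectable, while malignant recall shows that not a single malignant case is caught. That gap is why per-class metrics matter here.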

EDA Steps

My process was roughly:

Check for missing values (there are none in this dataset, which is one reason it's a teaching classic)
Examine distributions per class
Build the correlation matrix
Identify multicollinear features
Look at feature separation between classes

# No missing values, but worth confirming
print(df.isnull().sum().sum())  # 0
 
# Basic stats split by diagnosis
print(df.groupby('diagnosis').describe().T)

The describe-by-group output immediately showed that malignant tumours have higher average values on almost every feature. This isn't surprising medically, but it's useful confirmation that the features are actually informative.
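The full `describe()` output is long, so one way to condense it is to compare the per-class means directly. The sketch below is an illustration rather than part of the original analysis. It uses scikit-learn's native column names (`mean area` is what this post calls area_mean), and `ratio` and `higher_in_malignant` are names I made up for it.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['diagnosis'] = data.target  # 0 = Malignant, 1 = Benign

# Mean of every feature within each class
group_means = df.groupby('diagnosis').mean()

# Ratio of malignant mean to benign mean; above 1 means higher in malignant tumours
ratio = (group_means.loc[0] / group_means.loc[1]).sort_values(ascending=False)
higher_in_malignant = int((ratio > 1).sum())

print(f"{higher_in_malignant} of {len(ratio)} features have a higher malignant mean")
print(ratio.round(2).head())
```

A ratio of means ignores spread, so it only shows the direction of the difference, not how well a feature separates the classes.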

Key Visualizations

Violin plots by diagnosis. For the _mean features, I plotted side-by-side violins split by diagnosis:

import matplotlib.pyplot as plt
import seaborn as sns
 
mean_features = [col for col in df.columns if 'mean' in col]
 
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
axes = axes.flatten()
 
for i, feature in enumerate(mean_features):
    sns.violinplot(
        data=df, x='diagnosis', y=feature,
        palette={0: '#e74c3c', 1: '#2ecc71'},
        ax=axes[i]
    )
    axes[i].set_title(feature)
    axes[i].set_xticklabels(['Malignant', 'Benign'])
 
plt.tight_layout()
plt.savefig('violin_plots.png', dpi=150)

The violins showed that radius_mean, perimeter_mean, and area_mean have the least overlap between classes: the bulk of each distribution is clearly separated, with only the tails overlapping. smoothness_mean and fractal_dimension_mean, on the other hand, overlap significantly, suggesting they contribute less to classification.
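Reading overlap off a violin plot is subjective, so it can help to put a number on it. One option, which I'm assuming here and which was not part of the original project, is the ROC AUC of each feature used on its own as a score: 0.5 means the two classes are indistinguishable on that feature, 1.0 means perfectly separated. The sketch uses scikit-learn's native column names (`mean radius` rather than radius_mean).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
is_malignant = (data.target == 0).astype(int)  # 1 = Malignant for this check

mean_features = [c for c in df.columns if c.startswith('mean ')]

separation = {}
for feature in mean_features:
    auc = roc_auc_score(is_malignant, df[feature])
    # Direction does not matter for separability, so fold values below 0.5
    separation[feature] = max(auc, 1 - auc)

separation = pd.Series(separation).sort_values(ascending=False)
print(separation.round(3))
```

This only measures how well each feature separates the classes by itself. A feature that looks weak alone can still help in combination with others.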

Correlation heatmap. This is where the interesting finding appeared:

corr_matrix = df[mean_features].corr()
 
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    square=True,
    linewidths=0.5
)
plt.title('Feature Correlation Matrix (Mean Features)')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)

The heatmap showed correlations of about 0.99 between radius_mean, perimeter_mean, and area_mean. That makes geometric sense: for a roughly circular nucleus, the perimeter (2πr) and the area (πr²) are both determined by the radius. From a machine learning perspective, feeding all three into a model adds almost no new information and can make the coefficients of linear models unstable and hard to interpret.
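To put a number on the redundancy, the sketch below prints the pairwise correlations for the three size features and their variance inflation factors (VIF). The VIF step is my addition, not something from the original analysis. It relies on the identity that each feature's VIF is the matching diagonal entry of the inverse correlation matrix, and a common rule of thumb treats values above roughly 10 as a sign of problematic collinearity. Column names are scikit-learn's native ones.

```python
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

size_features = ['mean radius', 'mean perimeter', 'mean area']
size_corr = df[size_features].corr()
print(size_corr.round(3))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the
# other features; this equals the j-th diagonal entry of the inverse
# correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(size_corr.values)), index=size_features)
print(vif.round(1))
```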

Pair plot. I ran a pair plot on five of the mean features (the three size features plus texture and concavity) to look at the pairwise separability:

top_features = ['radius_mean', 'texture_mean', 'perimeter_mean',
                'area_mean', 'concavity_mean', 'diagnosis']
 
sns.pairplot(
    df[top_features],
    hue='diagnosis',
    palette={0: '#e74c3c', 1: '#2ecc71'},
    plot_kws={'alpha': 0.5, 's': 20}
)
plt.savefig('pair_plot.png', dpi=150)

The pair plot made the radius_mean / perimeter_mean / area_mean redundancy visually obvious — those three scatter plots are almost identical. But it also showed that concavity_mean combined with any size feature gives strong separation.

Feature Engineering Notes

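The high multicollinearity between the size features (radius, perimeter, area) is a problem for some models. Naive Bayes assumes the features are conditionally independent given the class, so strongly correlated features get their evidence counted more than once. Logistic regression makes no independence assumption, but its coefficients become unstable and hard to interpret when features are nearly collinear. There are a few ways to handle it: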

Drop two of the three and keep only radius_mean (the most interpretable)
Apply PCA to collapse them into a single component
Use a tree-based model, whose predictions are largely unaffected by correlated features (although importance scores get split across the redundant ones)
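As a rough illustration of the PCA option, the sketch below standardises the three size features and collapses them into one component. The column name `size_pc1` is hypothetical. For brevity the PCA is fit on the full dataset; in a real modelling pipeline it would be fit on the training split only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
size_features = ['mean radius', 'mean perimeter', 'mean area']

# Standardise first so the large numeric scale of area does not dominate
size_pca = make_pipeline(StandardScaler(), PCA(n_components=1))
size_component = size_pca.fit_transform(df[size_features])

explained = size_pca.named_steps['pca'].explained_variance_ratio_[0]
print(f"variance explained by one component: {explained:.3f}")

# Swap the three correlated columns for the single combined component
df_reduced = df.drop(columns=size_features)
df_reduced['size_pc1'] = size_component[:, 0]
print(df_reduced.shape)
```

The tradeoff is interpretability: a combined size component is harder to explain to a clinician than radius_mean, which is the argument for simply dropping two of the three columns instead.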

I also noticed that the _worst features are highly correlated with the corresponding _mean features. So you're not really getting 30 independent signals — the effective dimensionality is much lower.
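Both observations can be checked with a few lines. The sketch below correlates each mean feature with its worst counterpart, then uses the number of principal components needed to reach 95% of the variance as a rough proxy for effective dimensionality. The 95% cut-off is an arbitrary choice of mine, and the column names are scikit-learn's native ones (`mean radius`, `worst radius`).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Correlation between each "mean X" column and its "worst X" counterpart
base_names = [c[len('mean '):] for c in df.columns if c.startswith('mean ')]
mean_vs_worst = pd.Series(
    {name: df[f'mean {name}'].corr(df[f'worst {name}']) for name in base_names}
).sort_values(ascending=False)
print(mean_vs_worst.round(2))

# Rough effective dimensionality: how many principal components of the
# standardised features are needed to reach 95% of the total variance
X_scaled = StandardScaler().fit_transform(df)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"components needed for 95% of the variance: {n_components_95} of {df.shape[1]}")
```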

For the baseline model I didn't do any feature selection, just to see what raw performance looks like.

Baseline Model

A logistic regression with StandardScaler preprocessing and an 80/20 split:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
 
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
 
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

Results on the test set:

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Malignant | 0.95 | 0.93 | 0.94 |
| Benign | 0.96 | 0.97 | 0.97 |
| Accuracy | | | 0.956 |

95.6% accuracy with logistic regression, no feature engineering, no hyperparameter tuning. This dataset is famously "easy" for classifiers, which is also why it's used for teaching.

More important than accuracy is the recall on the malignant class. A false negative here (predicting Benign when the tumour is actually Malignant) is far more costly than a false positive. At 0.93 recall on malignant, the model misses about 7% of malignant cases, which in a real screening context would be unacceptable. The model's default cut-off is a probability of 0.5, so you'd shift that decision threshold to flag borderline cases as malignant, pushing malignant recall higher at the cost of more false positives.
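Here is one way the threshold adjustment could look. It is a sketch under my own assumptions, not what the original project did: `TARGET_RECALL` is a hypothetical target, and the threshold is chosen from out-of-fold predictions on the training data so that the test set stays untouched until the final check.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

TARGET_RECALL = 0.99  # hypothetical target for malignant recall

X, y = load_breast_cancer(return_X_y=True)  # 0 = Malignant, 1 = Benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Out-of-fold P(Malignant) on the training data (column 0 is class 0)
p_malignant_train = cross_val_predict(
    pipeline, X_train, y_train, cv=5, method='predict_proba'
)[:, 0]

# Walk the threshold down from 0.5 and stop at the first value that reaches
# the target recall; fall back to the lowest threshold if none does
thresholds = np.linspace(0.5, 0.01, 50)
chosen_threshold = thresholds[-1]
for t in thresholds:
    flagged = p_malignant_train >= t
    if flagged[y_train == 0].mean() >= TARGET_RECALL:
        chosen_threshold = t
        break

# Compare the default and the tuned threshold on the held-out test set
pipeline.fit(X_train, y_train)
p_malignant_test = pipeline.predict_proba(X_test)[:, 0]
test_recall = {}
for name, t in [('default', 0.5), ('tuned', chosen_threshold)]:
    flagged = p_malignant_test >= t
    test_recall[name] = flagged[y_test == 0].mean()
    false_alarm_rate = flagged[y_test == 1].mean()
    print(f"{name} (t={t:.2f}): malignant recall={test_recall[name]:.3f}, "
          f"benign flagged={false_alarm_rate:.3f}")
```

Lowering the threshold can only keep malignant recall the same or raise it, and the share of benign cases flagged moves in the same direction. The right operating point depends on how the two kinds of error are weighed.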

Surprising Findings

The redundancy between radius_mean, perimeter_mean, and area_mean was expected once I thought about the geometry. What I didn't expect was how much it showed up visually — the pair plot panels for those three features are so similar they look like the same plot.

What genuinely surprised me was how little fractal_dimension_mean contributed. It has almost no separation between classes in the violin plot and a correlation with diagnosis that is close to zero. It's a sophisticated-sounding measurement, but among the mean features it turns out to be the least useful for this particular classification task.

The lesson from this dataset is that domain knowledge and visualisation reinforce each other in EDA. The geometric relationship between radius, perimeter, and area is obvious once stated, but I only noticed the multicollinearity problem because I built the correlation heatmap. Visualisation surfaced what intuition alone had missed.

What Comes Next

The logical next steps from an EDA like this would be:

Feature selection — drop the redundant size features or apply PCA
Compare classifiers: random forest, SVM, gradient boosting
ROC and precision-recall curve analysis to understand the tradeoff between malignant recall and false positives
Cross-validation instead of a single train/test split
Calibration — logistic regression probabilities are relatively well-calibrated, but it's worth checking
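The cross-validation item from the list above could be sketched as follows. Wrapping the scaler and the model in one pipeline means the scaler is refit inside each fold, so no statistics from a validation fold leak into training. The fold count and the choice of metrics are my assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 0 = Malignant, 1 = Benign

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scoring = {
    'accuracy': 'accuracy',
    'roc_auc': 'roc_auc',
    # Recall measured on the malignant class (label 0), not the default label 1
    'malignant_recall': make_scorer(recall_score, pos_label=0),
}
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

for name in scoring:
    scores = results[f'test_{name}']
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds is the useful part: with only around 114 samples per fold, a single misclassified malignant case moves recall by a couple of percentage points, which a single train/test split hides.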

For a production screening tool you'd also want to think carefully about the cost matrix — the penalty for a false negative in cancer diagnosis is much higher than a false positive, and that should directly influence the model selection and threshold choice.
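To make the cost-matrix idea concrete, the sketch below assigns a cost to each kind of error and picks the threshold that minimises the total over cross-validated predictions. The two cost values are invented for illustration. Real values would have to come from clinical and economic input, not from the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical costs, in arbitrary units
COST_FALSE_NEGATIVE = 20.0  # malignant case predicted as benign
COST_FALSE_POSITIVE = 1.0   # benign case predicted as malignant

X, y = load_breast_cancer(return_X_y=True)  # 0 = Malignant, 1 = Benign
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Out-of-fold P(Malignant) for every sample (column 0 is class 0)
p_malignant = cross_val_predict(pipeline, X, y, cv=5, method='predict_proba')[:, 0]

def total_cost(threshold):
    """Total misclassification cost when flagging P(Malignant) >= threshold."""
    pred = np.where(p_malignant >= threshold, 0, 1)
    # labels=[0, 1]: rows are true classes, columns are predicted classes
    cm = confusion_matrix(y, pred, labels=[0, 1])
    false_negatives = cm[0, 1]
    false_positives = cm[1, 0]
    return COST_FALSE_NEGATIVE * false_negatives + COST_FALSE_POSITIVE * false_positives

thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[costs.argmin()]

print(f"lowest-cost threshold on this grid: {best_threshold:.2f}")
print(f"total cost there: {costs.min():.0f}")
```

The same cost function could also be used to compare different classifiers, which is one way the cost matrix can feed into model selection as well as threshold choice.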

This project reminded me that EDA isn't just a preliminary step before the "real" modelling work. Understanding your data well — its structure, its redundancies, its class balance — directly determines which models are appropriate and how you should evaluate them.