This page was generated from docs/examples/Breast_Cancer.ipynb.

Breast Cancer Diagnosis

In this notebook we will apply the LogitBoost algorithm to a toy dataset to classify cases of breast cancer as benign or malignant.

Imports

[1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', palette='colorblind', color_codes=True)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.manifold import TSNE

from logitboost import LogitBoost

Loading the Data

The breast cancer dataset imported from scikit-learn contains 569 samples with 30 real-valued, positive features (cancer mass attributes such as mean radius, mean texture, and mean perimeter). Of the samples, 212 are labeled “malignant” and 357 are labeled “benign”. We load this data into a 569-by-30 feature matrix and a 569-dimensional target vector, then randomly split it, stratified by class, into two-thirds training and one-third testing.

[2]:
data = load_breast_cancer()
X = data.data
y = data.target_names[data.target]
n_classes = data.target_names.size  # 2 classes: benign, malignant

# Shuffle data and split it into training/testing samples
test_size = 1 / 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
                                                    shuffle=True, stratify=y,
                                                    random_state=0)
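
As a quick sanity check, the stratified split should yield 379 training samples and 190 test samples (two-thirds and one-third of 569, respectively); this one-liner is ours, not part of the original notebook:

print(X_train.shape, X_test.shape)  # expected: (379, 30) (190, 30)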

Visualizing the Training Set

Although the features are 30-dimensional, we can visualize the training set by using t-distributed stochastic neighbor embedding (t-SNE) to project the features onto a 2-dimensional space.

[3]:
tsne = TSNE(n_components=2, random_state=0)
X_train_tsne = tsne.fit_transform(X_train)

plt.figure(figsize=(10, 8))
mask_benign = (y_train == 'benign')
mask_malignant = (y_train == 'malignant')

plt.scatter(X_train_tsne[mask_benign, 0], X_train_tsne[mask_benign, 1],
            marker='s', c='g', label='benign', edgecolor='k', alpha=0.7)
plt.scatter(X_train_tsne[mask_malignant, 0], X_train_tsne[mask_malignant, 1],
            marker='o', c='r', label='malignant', edgecolor='k', alpha=0.7)

plt.title('t-SNE plot of the training data')
plt.xlabel('1st embedding axis')
plt.ylabel('2nd embedding axis')
plt.legend(loc='best', frameon=True, shadow=True)

plt.tight_layout()
plt.show()
plt.close()
[Figure: t-SNE plot of the training data, benign (green squares) vs. malignant (red circles)]

Fitting the LogitBoost Model

Next, we initialize a LogitBoost classifier and fit it to the training data. By default, LogitBoost uses decision stumps (decision trees with depth 1, i.e., a single split) as its base estimator.

[4]:
lboost = LogitBoost(n_estimators=200, random_state=0)
lboost.fit(X_train, y_train)
[4]:
LogitBoost(base_estimator=None, bootstrap=False, learning_rate=1.0,
           max_response=4.0, n_estimators=200, random_state=0,
           weight_trim_quantile=0.05)
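
The base estimator is not limited to stumps: any scikit-learn regressor that supports sample weights should work, since LogitBoost fits regressors to weighted working responses at each iteration. As a sketch (lboost_deep is our name, and this cell is not run here), depth-2 regression trees could be used instead:

from sklearn.tree import DecisionTreeRegressor

# Sketch: replace the default stumps with depth-2 regression trees.
lboost_deep = LogitBoost(base_estimator=DecisionTreeRegressor(max_depth=2),
                         n_estimators=200, random_state=0)
lboost_deep.fit(X_train, y_train)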

Prediction Accuracy

As a first indicator of how well the model predicts the correct labels, we can check its accuracy score (the number of correct predictions divided by the total number of predictions) on the training and test data. A good classifier should have an accuracy score close to 1.

[5]:
y_pred_train = lboost.predict(X_train)
y_pred_test = lboost.predict(X_test)

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

print('Training accuracy: %.4f' % accuracy_train)
print('Test accuracy:     %.4f' % accuracy_test)
Training accuracy: 0.9947
Test accuracy:     0.9737
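
Since accuracy is just the fraction of matching labels, the test score can be reproduced directly from the arrays in the previous cell (a minimal check, not in the original notebook):

# Fraction of test predictions that match the true labels.
print('Manual test accuracy: %.4f' % np.mean(y_pred_test == y_test))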

Precision and Recall

We can also report our LogitBoost model’s precision and recall. For each class, precision is the fraction of samples predicted to belong to that class that truly do, and recall is the fraction of samples truly in that class that the model correctly identifies.

[6]:
report_train = classification_report(y_train, y_pred_train)
report_test = classification_report(y_test, y_pred_test)
print('Training\n%s' % report_train)
print('Testing\n%s' % report_test)
Training
              precision    recall  f1-score   support

      benign       0.99      1.00      1.00       238
   malignant       1.00      0.99      0.99       141

    accuracy                           0.99       379
   macro avg       1.00      0.99      0.99       379
weighted avg       0.99      0.99      0.99       379

Testing
              precision    recall  f1-score   support

      benign       0.97      0.98      0.98       119
   malignant       0.97      0.96      0.96        71

    accuracy                           0.97       190
   macro avg       0.97      0.97      0.97       190
weighted avg       0.97      0.97      0.97       190
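
To pull out single numbers for the clinically important class, precision and recall for “malignant” can also be computed directly; these should match the “malignant” row of the test report above (a sketch using scikit-learn functions not imported at the top of this notebook):

from sklearn.metrics import precision_score, recall_score

# Treat 'malignant' as the positive class.
prec = precision_score(y_test, y_pred_test, pos_label='malignant')
rec = recall_score(y_test, y_pred_test, pos_label='malignant')
print('Malignant precision: %.2f' % prec)
print('Malignant recall:    %.2f' % rec)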

Visualizing Accuracy During Boosting

The staged_score method returns the ensemble’s accuracy on a given dataset after each boosting iteration, letting us track how the training and test scores evolve as more estimators are added.

[7]:
iterations = np.arange(1, lboost.n_estimators + 1)
staged_accuracy_train = list(lboost.staged_score(X_train, y_train))
staged_accuracy_test = list(lboost.staged_score(X_test, y_test))

plt.figure(figsize=(10, 8))
plt.plot(iterations, staged_accuracy_train, label='Training', marker='.')
plt.plot(iterations, staged_accuracy_test, label='Test', marker='.')

plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Ensemble accuracy during each boosting iteration')
plt.legend(loc='best', shadow=True, frameon=True)

plt.tight_layout()
plt.show()
plt.close()
[Figure: ensemble accuracy at each boosting iteration, training vs. test]
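
If we wanted to truncate the ensemble based on this curve, a simple heuristic is to take the iteration with the highest test accuracy (a sketch, and somewhat optimistic because it peeks at the test set; a proper version would use a separate validation split):

# Iterations are 1-based in the plot above, hence the +1.
best_iter = int(np.argmax(staged_accuracy_test)) + 1
print('Best test accuracy %.4f at iteration %d'
      % (staged_accuracy_test[best_iter - 1], best_iter))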

Contribution of Each Estimator in the Ensemble

Like other ensemble models, the LogitBoost model can suffer from over-specialization: estimators added to the ensemble in later boosting iterations make relatively small or even negligible contributions toward improving the overall predictions on the training set. This can be quantified by computing, for each estimator in the ensemble, the mean absolute value of its predictions over the training set (its average absolute contribution to the ensemble’s decision function).

[8]:
contrib_train = lboost.contributions(X_train)

plt.figure(figsize=(10, 8))
plt.plot(iterations, contrib_train, lw=2)
plt.xlabel('Estimator Number')
plt.ylabel('Average Absolute Contribution')
plt.title('Average absolute contribution of the estimators in the ensemble')
plt.show()
plt.close()
[Figure: average absolute contribution of each estimator in the ensemble]
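
A standard mitigation for over-specialization in boosting is shrinkage: scaling down each estimator’s contribution via the learning_rate parameter (visible in the model repr above), so that later iterations keep doing useful work, typically at the cost of needing more estimators. A sketch, not run here (lboost_shrunk is our name):

# Sketch: same ensemble size, but each estimator's contribution is
# scaled by 0.1 before being added to the decision function.
lboost_shrunk = LogitBoost(n_estimators=200, learning_rate=0.1,
                           random_state=0)
lboost_shrunk.fit(X_train, y_train)
contrib_shrunk = lboost_shrunk.contributions(X_train)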

Appendix: System Information

This is included for reproducibility.

[9]:
# sys_info.py is a file in the same directory as these example notebooks:
# doc/source/examples
import sys_info

Machine
=======
      Platform: Darwin-18.7.0-x86_64-i386-64bit
  Machine Type: x86_64
     Processor: i386

Python
======
       Version: 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
                [Clang 6.0 (clang-600.0.57)]
Implementation: CPython

Packages
========
         numpy: 1.17.2
         scipy: 1.3.1
    matplotlib: 3.1.1
       seaborn: 0.9.0
       sklearn: 0.21.3