15.2 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 1

  • To process mail efficiently and route each letter to the correct destination, postal service computers must be able to scan handwritten names, addresses and zip codes and recognize the letters and digits
  • Scikit-learn enables even novice programmers to make such machine-learning problems manageable

Supervised Machine Learning: Classification

  • Attempt to predict the distinct class (category) to which a sample belongs
    • Binary classificationtwo classes (e.g., “dog” or “cat”)
  • Digits dataset bundled with scikit-learn
    • 8-by-8 pixel images representing 1797 hand-written digits (0 through 9)
  • Goal: Predict which digit an image represents
    • Multi-classification10 possible digits (the classes)
  • Train a classification model using labeled data—know in advance each digit’s class
  • We’ll use one of the simplest machine-learning classification algorithms, k-nearest neighbors (k-NN), to recognize handwritten digits

15.2.1 k-Nearest Neighbors Algorithm (k-NN)

  • Predict a sample’s class by looking at the k training samples nearest in "distance" to the sample
  • Filled dots represent four distinct classes—A (blue), B (green), C (red) and D (purple)
  • Class with the most “votes” wins
    • Odd k value avoids ties — there’s never an equal number of votes

Diagram for the discussion of the k-nearest neighbors algorithm


15.2.2 Loading the Dataset with the load_digits Function

  • Returns a Bunch object containing digit samples and metadata
  • A Bunch is a dictionary with additional dataset-specific attributes
In [2]:
from sklearn.datasets import load_digits
In [3]:
digits = load_digits()  

Displaying Digits Dataset's Description

  • Digits dataset is a subset of the UCI (University of California Irvine) ML hand-written digits dataset
    • Original dataset: 5620 samples—3823 for training and 1797 for testing
    • Digits dataset: Only the 1797 testing samples
  • A Bunch’s DESCR attribute contains dataset's description
    • Each sample has 64 features (Number of Attributes) that represent an 8-by-8 image with pixel values 016 (Attribute Information)
    • No missing values (Missing Attribute Values)
  • 64 features may seem like a lot
    • Datasets can have hundreds, thousands or even millions of features
    • Processing datasets like these can require enormous computing capabilities
In [4]:
print(digits.DESCR)
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.
In [5]:
type(digits)
Out[5]:
sklearn.utils.Bunch

Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘images’, the images corresponding to each sample, ‘target’, the classification labels for each sample, ‘target_names’, the meaning of the labels, and ‘DESCR’, the full description of the dataset.

The input data can be accessed by:

digits.data

The target variable data can be accessed by:

digits.target

Checking the Sample and Target Sizes (1 of 2)

  • Bunch object’s data and target attributes are NumPy arrays:

    • data array: The 1797 samples (digit images), each with 64 features with values 0 (white) to 16 (black), representing pixel intensities Pixel intensities in grayscale shades from white (0) to black (16)

    • target array: The images’ labels, (classes) indicating which digit each image represents

In [6]:
digits.target[::100]  # target values of every 100th sample
Out[6]:
array([0, 4, 1, 7, 4, 8, 2, 2, 4, 4, 1, 9, 7, 3, 2, 1, 2, 5])

Remember:
The slicing parameters are aptly named

slice[start:stop:step]

so the slice starts at the location defined by start, stops before the location stop is reached, and moves from one position to the next by step items.

>>> "ABCD"[0:4:2]
'AC'

Checking the Sample and Target Sizes (2 of 2)

  • Confirm number of samples and features (per sample) via data array’s shape
In [5]:
digits.data.shape
Out[5]:
(1797, 64)
  • Confirm that number of target values matches number of samples via target array’s shape
In [7]:
digits.target.shape
Out[7]:
(1797,)
In [8]:
digits.data
Out[8]:
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
In [38]:
digits.data[0]
Out[38]:
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
In [9]:
digits.target[0]
Out[9]:
0
In [40]:
digits.data[1]
Out[40]:
array([ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
        9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
       15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
        0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
       16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.])
In [42]:
digits.target[0],digits.target[1]
Out[42]:
(0, 1)

A Sample Digit Image

  • Images are two-dimensional—width and a height in pixels
  • Digits dataset's Bunch object has an images attribute
    • Each element is an 8-by-8 array representing a digit image’s pixel intensities
  • Scikit-learn stores the intensity values as NumPy type float64
In [7]:
digits.images[13]  # show array for sample image at index 13
Out[7]:
array([[ 0.,  2.,  9., 15., 14.,  9.,  3.,  0.],
       [ 0.,  4., 13.,  8.,  9., 16.,  8.,  0.],
       [ 0.,  0.,  0.,  6., 14., 15.,  3.,  0.],
       [ 0.,  0.,  0., 11., 14.,  2.,  0.,  0.],
       [ 0.,  0.,  0.,  2., 15., 11.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  2., 15.,  4.,  0.],
       [ 0.,  1.,  5.,  6., 13., 16.,  6.,  0.],
       [ 0.,  2., 12., 12., 13., 11.,  0.,  0.]])
  • Visualization of digits.images[13]

    Image of a handwritten digit 3

In [43]:
digits.target[0],digits.target[1], digits.target[13]
Out[43]:
(0, 1, 3)
In [44]:
import matplotlib.pyplot as plt 
plt.gray() 
plt.matshow(digits.images[0]) 
plt.show()
<Figure size 432x288 with 0 Axes>
In [11]:
digits.data[0]
Out[11]:
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
In [45]:
import matplotlib.pyplot as plt 
plt.gray() 
plt.matshow(digits.images[1]) 
plt.show()
<Figure size 432x288 with 0 Axes>
In [46]:
import matplotlib.pyplot as plt 
plt.gray() 
plt.matshow(digits.images[13]) 
plt.show()
<Figure size 432x288 with 0 Axes>

Preparing the Data for Use with Scikit-Learn (1 of 2)

  • Scikit-learn estimators require samples to be stored in a two-dimensional array of floating-point values (or list of lists or pandas DataFrame):
    • Each row represents one sample
    • Each column in a given row represents one feature for that sample
  • Multi-dimensional data samples must be flattened into a one-dimensional array
  • For categorical features (e.g., strings like 'spam' or 'not-spam'), you’d have to preprocess those features into numerical values—known as one-hot encoding (discussed later in deep learning)

Preparing the Data for Use with Scikit-Learn (2 of 2)

  • load_digits returns the preprocessed data ready for machine learning
  • 8-by-8 array digits.images[13] corresponds to 1-by-64 array digits.data[13]:
In [47]:
digits.data[13]
Out[47]:
array([ 0.,  2.,  9., 15., 14.,  9.,  3.,  0.,  0.,  4., 13.,  8.,  9.,
       16.,  8.,  0.,  0.,  0.,  0.,  6., 14., 15.,  3.,  0.,  0.,  0.,
        0., 11., 14.,  2.,  0.,  0.,  0.,  0.,  0.,  2., 15., 11.,  0.,
        0.,  0.,  0.,  0.,  0.,  2., 15.,  4.,  0.,  0.,  1.,  5.,  6.,
       13., 16.,  6.,  0.,  0.,  2., 12., 12., 13., 11.,  0.,  0.])

15.2.3 Visualizing the Data (1 of 2)

  • Always familiarize yourself with your data—called data exploration
  • Let's visualize the dataset’s first 24 images with Matplotlib
  • To see how difficult a problem handwritten digit recognition is, consider the variations among the images of the 3s in the first, third and fourth rows, and look at the images of the 2s in the first, third and fourth rows.

15.2.3 Visualizing the Data (2 of 2)

First 24 digit images in the digits dataset


Creating the Diagram

In [12]:
import matplotlib.pyplot as plt
In [13]:
figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))

for item in zip(axes.ravel(), digits.images, digits.target):
    axes, image, target = item 
    axes.imshow(image, cmap=plt.cm.gray_r)
    axes.set_xticks([])  # remove x-axis tick marks
    axes.set_yticks([])  # remove y-axis tick marks
    axes.set_title(target)
plt.tight_layout()

15.2.4 Splitting the Data for Training and Testing (1 of 2)

  • Typically train a model with a subset of a dataset
  • Save a portion for testing, so you can evaluate a model’s performance using unseen data
  • Function train_test_split shuffles the data to randomize it, then splits the samples in the data array and the target values in the target array into training and testing sets
    • Shuffling helps ensure that the training and testing sets have similar characteristics
    • Returns a tuple of four elements in which the first two are the samples split into training and testing sets, and the last two are the corresponding target values split into training and testing sets

15.2.4 Splitting the Data for Training and Testing (2 of 2)

  • Convention:
    • Uppercase X represents samples
    • Lowercase y represents target values
In [16]:
from sklearn.model_selection import train_test_split
In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)  # random_state for reproducibility
  • Scikit-learn bundled classification datasets have balanced classes
    • Samples are divided evenly among the classes
    • Unbalanced classes could lead to incorrect results

Training and Testing Set Sizes

In [18]:
X_train.shape
Out[18]:
(1347, 64)
In [19]:
X_test.shape
Out[19]:
(450, 64)

15.2.5 Creating the Model

  • In scikit-learn, models are called estimators
  • KNeighborsClassifier estimator implements the k-nearest neighbors algorithm
In [20]:
from sklearn.neighbors import KNeighborsClassifier
In [21]:
knn = KNeighborsClassifier()

15.2.6 Training the Model with the KNeighborsClassifier Object’s fit method (1 of 2)

  • Load sample training set (X_train) and target training set (y_train) into the estimator
In [22]:
knn.fit(X=X_train, y=y_train)
Out[22]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

15.2.6 Training the Model with the KNeighborsClassifier Object’s fit method (2 of 2)

  • fit normally loads data into an estimator then performs complex calculations behind the scenes that learn from the data to train a model
  • KNeighborsClassifier’s fit method just loads the data
    • No initial learning process
    • The estimator is lazy — work is performed only when you use it to make predictions
  • Lots of models have significant training phases that can take minutes, hours, days or more
    • High-performance GPUs and TPUs can significantly reduce model training time

15.2.7 Predicting Digit Classes with the KNeighborsClassifier’s predict method (1 of 2)

  • Returns an array containing the predicted class of each test image:
In [23]:
predicted = knn.predict(X=X_test)
In [24]:
expected = y_test
  • predicted digits vs. expected digits for the first 20 test samples—see index 18
In [25]:
predicted[:20]
Out[25]:
array([0, 4, 9, 9, 3, 1, 4, 1, 5, 0, 4, 9, 4, 1, 5, 3, 3, 8, 5, 6])
In [26]:
expected[:20]
Out[26]:
array([0, 4, 9, 9, 3, 1, 4, 1, 5, 0, 4, 9, 4, 1, 5, 3, 3, 8, 3, 6])

15.2.7 Predicting Digit Classes with the KNeighborsClassifier’s predict method (2 of 2)

  • Locate all incorrect predictions for the entire test set:
In [27]:
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
In [66]:
wrong
Out[66]:
[(5, 3),
 (8, 9),
 (4, 9),
 (7, 3),
 (7, 4),
 (2, 8),
 (9, 8),
 (3, 8),
 (3, 8),
 (1, 8)]
  • Incorrectly predicted only 10 of the 450 test samples

15.3 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 2

15.3.1 Metrics for Measuring Model Accuracy

Estimator Method score

  • Returns an indication of how well the estimator performs on test data
  • For classification estimators, returns the prediction accuracy for the test data:
In [67]:
print(f'{knn.score(X_test, y_test):.2%}')
97.78%
  • kNeighborsClassifier with default k of 5 achieved 97.78% prediction accuracy using only the estimator’s default parameters
  • Can use hyperparameter tuning to try to determine the optimal value for k

Confusion Matrix (1 of 2)

  • Shows correct and incorrect predicted values (the hits and misses) for a given class
In [28]:
from sklearn.metrics import confusion_matrix
In [29]:
confusion = confusion_matrix(y_true=expected, y_pred=predicted)
In [30]:
confusion
Out[30]:
array([[45,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 45,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 54,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 42,  0,  1,  0,  1,  0,  0],
       [ 0,  0,  0,  0, 49,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0, 38,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0, 42,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 45,  0,  0],
       [ 0,  1,  1,  2,  0,  0,  0,  0, 39,  1],
       [ 0,  0,  0,  0,  1,  0,  0,  0,  1, 41]], dtype=int64)

Confusion Matrix (2 of 2)

  • Correct predictions shown on principal diagonal from top-left to bottom-right
  • Nonzero values not on principal diagonal indicate incorrect predictions
  • Each row represents one distinct class (0–9)
  • Columns specify how many test samples were classified into classes 0–9
  • Row 0 shows digit class 0all 0s were predicted correctly

    [45, 0, 0, 0, 0, 0, 0, 0, 0, 0]

  • Row 8 shows digit class 8five 8s were predicted incorrectly

    [ 0, 1, 1, 2, 0, 0, 0, 0, 39, 1]

    • Correctly predicted 88.63% (39 of 44) of 8s
    • 8s harder to recognize

Visualizing the Confusion Matrix

  • A heat map displays values as colors
  • Convert the confusion matrix into a DataFrame, then graph it
  • Principal diagonal and incorrect predictions stand out nicely in heat map
In [31]:
import pandas as pd
In [32]:
confusion_df = pd.DataFrame(confusion, index=range(10), columns=range(10))
In [33]:
import seaborn as sns
In [34]:
figure = plt.figure(figsize=(7, 6))
axes = sns.heatmap(confusion_df, annot=True, 
                   cmap=plt.cm.nipy_spectral_r) 

15.3.2 K-Fold Cross-Validation

  • Uses all of your data for training and testing
  • Gives a better sense of how well your model will make predictions
  • Splits the dataset into k equal-size folds (unrelated to k in the k-nearest neighbors algorithm)
  • Repeatedly trains your model with k – 1 folds and test the model with the remaining fold
  • Consider using k = 10 with folds numbered 1 through 10
    • train with folds 1–9, then test with fold 10
    • train with folds 1–8 and 10, then test with fold 9
    • train with folds 1–7 and 9–10, then test with fold 8
    • ...

KFold Class

  • KFold class and function cross_val_score perform k-fold cross validation
  • n_splits=10 specifies the number of folds
  • shuffle=True randomizes the data before splitting it into folds
    • Particularly important if the samples might be ordered or grouped (as in Iris dataset we'll see later)
In [35]:
from sklearn.model_selection import KFold
In [36]:
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

Calling Function cross_val_score to Train and Test Your Model (1 of 2)

  • estimator=knnestimator to validate
  • X=digits.datasamples to use for training and testing
  • y=digits.targettarget predictions for the samples
  • cv=kfoldcross-validation generator that defines how to split the samples and targets for training and testing
In [37]:
from sklearn.model_selection import cross_val_score
In [38]:
scores = cross_val_score(estimator=knn, X=digits.data, y=digits.target, cv=kfold)

Calling Function cross_val_score to Train and Test Your Model (2 of 2)

  • Lowest accuracy was 97.78% — one was 100%
In [ ]:
scores  # array of accuracy scores for each fold
In [39]:
print(f'Mean accuracy: {scores.mean():.2%}')
Mean accuracy: 98.72%
  • Mean accuracy even better than the 97.78% we achieved when we trained the model with 75% of the data and tested the model with 25% earlier

15.3.3 Running Multiple Models to Find the Best One (1 of 3)

  • Difficult to know in advance which machine learning model(s) will perform best for a given dataset
    • Especially when they hide the details of how they operate
  • Even though the KNeighborsClassifier predicts digit images with a high degree of accuracy, it’s possible that other estimators are even more accurate
  • Let’s compare KNeighborsClassifier, SVC and GaussianNB

15.3.3 Running Multiple Models to Find the Best One (2 of 3)

In [40]:
from sklearn.svm import SVC
In [41]:
from sklearn.naive_bayes import GaussianNB
  • Create the estimators
  • To avoid a scikit-learn warning, we supplied a keyword argument when creating the SVC estimator
    • This argument’s value will become the default in scikit-learn version 0.22
In [43]:
estimators = {
    'KNeighborsClassifier': knn, 
    'SVC': SVC(gamma='scale'),
    'GaussianNB': GaussianNB()}

15.3.3 Running Multiple Models to Find the Best One (3 of 3)

  • Execute the models:
In [44]:
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object, 
        X=digits.data, y=digits.target, cv=kfold)
    print(f'{estimator_name:>20}: ' + 
          f'mean accuracy={scores.mean():.2%}; ' +
          f'standard deviation={scores.std():.2%}')
KNeighborsClassifier: mean accuracy=98.72%; standard deviation=0.75%
                 SVC: mean accuracy=98.72%; standard deviation=0.79%
          GaussianNB: mean accuracy=84.48%; standard deviation=3.47%
  • KNeighborsClassifier and SVC estimators’ accuracies are identical so we might want to perform hyperparameter tuning on each to determine the best

15.3.4 Hyperparameter Tuning (1 of 3)

  • In real-world machine learning studies, you’ll want to tune hyperparameters to choose values that produce the best possible predictions
  • To determine the best value for k in the kNN algorithm, try different values and compare performance
  • Scikit-learn also has automated hyperparameter tuning capabilities

15.3.4 Hyperparameter Tuning (2 of 3)

  • Create KNeighborsClassifiers with odd k values from 1 through 19
  • Perform k-fold cross-validation on each
In [84]:
for k in range(1, 20, 2):  # k is an odd value 1-19; odds prevent ties
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator=knn, 
        X=digits.data, y=digits.target, cv=kfold)
    print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' +
          f'standard deviation={scores.std():.2%}')
k=1 ; mean accuracy=98.83%; standard deviation=0.58%
k=3 ; mean accuracy=98.78%; standard deviation=0.78%
k=5 ; mean accuracy=98.72%; standard deviation=0.75%
k=7 ; mean accuracy=98.44%; standard deviation=0.96%
k=9 ; mean accuracy=98.39%; standard deviation=0.80%
k=11; mean accuracy=98.39%; standard deviation=0.80%
k=13; mean accuracy=97.89%; standard deviation=0.89%
k=15; mean accuracy=97.89%; standard deviation=1.02%
k=17; mean accuracy=97.50%; standard deviation=1.00%
k=19; mean accuracy=97.66%; standard deviation=0.96%

15.3.4 Hyperparameter Tuning (3 of 3)

  • Machine learning is not without its costs, especially in big data and deep learning
  • Compute time grows rapidly with k, because k-NN needs to perform more calculations to find the nearest neighbors
  • Can use function cross_validate to perform cross-validation and time the results

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.