load_digits Function¶
load_digits returns a Bunch object containing the digit samples and metadata. A Bunch is a dictionary with additional dataset-specific attributes.
from sklearn.datasets import load_digits
digits = load_digits()
The DESCR attribute contains the dataset's description: 64 features (Number of Attributes) that represent an 8-by-8 image with pixel values 0–16 (Attribute Information), and no missing values (Missing Attribute Values).
print(digits.DESCR)
type(digits)
Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘images’, the images corresponding to each sample, ‘target’, the classification labels for each sample, ‘target_names’, the meaning of the labels, and ‘DESCR’, the full description of the dataset.
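Because a Bunch behaves like a dictionary, you can inspect its keys directly (a quick check; the exact set of keys can vary by scikit-learn version):
digits.keys()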
The input data can be accessed by:
digits.data
The target variable data can be accessed by:
digits.target
The Bunch object's data and target attributes are NumPy arrays:
data array: the 1797 samples (digit images), each with 64 features whose values range from 0 (white) to 16 (black), representing pixel intensities.
target array: the images' labels (classes), indicating which digit each image represents.
digits.target[::100] # target values of every 100th sample
Remember:
The slicing parameters are aptly named
slice[start:stop:step]
so the slice starts at the location defined by start
, stops before the location stop
is reached, and moves from one position to the next by step
items.
>>> "ABCD"[0:4:2]
'AC'
The data array's shape:
digits.data.shape
The target array's shape:
digits.target.shape
digits.data
digits.data[0]
digits.target[0]
digits.data[1]
digits.target[0], digits.target[1]
The Bunch object also has an images attribute; each element is an 8-by-8 array of float64 values representing one image's pixel intensities.
digits.images[13]  # show array for sample image at index 13
Visualization of digits.images[13]
digits.target[0], digits.target[1], digits.target[13]
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
plt.show()
digits.data[0]
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[1])
plt.show()
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[13])
plt.show()
Scikit-learn's machine-learning algorithms require the samples to be stored in a two-dimensional array of numbers (or a two-dimensional array-like collection, such as a list of lists or a pandas DataFrame). If a dataset contained categorical features (e.g., the strings 'spam' or 'not-spam'), you'd have to preprocess those features into numerical values, known as one-hot encoding (discussed later in deep learning). load_digits returns the preprocessed data ready for machine learning.
The 8-by-8 array digits.images[13] corresponds to the 1-by-64 array digits.data[13]:
digits.data[13]
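As a quick sanity check (a small sketch, assuming NumPy is imported as np), flattening the 8-by-8 image reproduces the corresponding sample row:
import numpy as np
np.array_equal(digits.images[13].ravel(), digits.data[13])  # True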
The cmap keyword argument can receive a colormap from the plt.cm object or a string, like 'gray_r'. plt.cm.gray_r is a reversed grayscale colormap in which 0 is white and larger values are darker.
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))
for ax, image, target in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_xticks([])  # remove x-axis tick marks
    ax.set_yticks([])  # remove y-axis tick marks
    ax.set_title(target)
plt.tight_layout()
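Because cmap also accepts a string, the imshow call in the loop above could equivalently be written as:
ax.imshow(image, cmap='gray_r')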
train_test_split shuffles the data to randomize it, then splits the samples in the data array and the target values in the target array into training and testing sets.
X represents the samples; y represents the target values.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)  # random_state for reproducibility
By default, train_test_split reserves 75% of the data for training and 25% for testing.
X_train.shape
X_test.shape
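The proportions can be customized with train_test_split's test_size (or train_size) keyword argument; for example, to hold out 20% of the samples for testing instead:
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11, test_size=0.20)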
The KNeighborsClassifier estimator implements the k-nearest neighbors algorithm.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
KNeighborsClassifier Object’s fit method (1 of 2)¶
Load the sample training set (X_train) and the target training set (y_train) into the estimator:
knn.fit(X=X_train, y=y_train)
n_neighbors corresponds to k in the k-nearest neighbors algorithm; here we use KNeighborsClassifier's default settings (k=5).
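If you wanted a different k, you could pass it when creating the estimator (a hypothetical alternative, not used in this walkthrough):
knn3 = KNeighborsClassifier(n_neighbors=3)  # consider only the 3 nearest neighbors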
KNeighborsClassifier Object’s fit method (2 of 2)¶
fit normally loads data into an estimator, then performs complex calculations behind the scenes that learn from the data to train a model. KNeighborsClassifier's fit method just loads the data; k-nearest neighbors does its real work at prediction time.
KNeighborsClassifier’s predict method (1 of 2)¶
predicted = knn.predict(X=X_test)
expected = y_test
Compare the predicted digits vs. the expected digits for the first 20 test samples; they differ at index 18.
predicted[:20]
expected[:20]
KNeighborsClassifier’s predict method (2 of 2)¶
Locate all the (predicted, expected) pairs that differ:
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
wrong
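From this list you can compute the prediction accuracy by hand (a quick check using the predicted and expected arrays from above):
print(f'{(len(expected) - len(wrong)) / len(expected):.2%}')  # fraction predicted correctly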
score¶
print(f'{knn.score(X_test, y_test):.2%}')
KNeighborsClassifier with its default k of 5 achieved 97.78% prediction accuracy using only the estimator's default parameters.
Confusion Matrix¶
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_true=expected, y_pred=predicted)
confusion
Row 0 shows digit class 0; all 45 of the 0s were predicted correctly:
[45, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Row 8 shows digit class 8; five of the 8s were predicted incorrectly:
[ 0, 1, 1, 2, 0, 0, 0, 0, 39, 1]
The model correctly predicted only 88.64% (39 of 44) of the 8s.
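Each class's accuracy is the diagonal entry of its row divided by the row total; a quick sketch using the confusion array from above:
per_class_accuracy = confusion.diagonal() / confusion.sum(axis=1)
per_class_accuracy[8]  # about 0.8864 for the 8s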
To visualize the confusion matrix, place it in a DataFrame, then graph it:
import pandas as pd
confusion_df = pd.DataFrame(confusion, index=range(10), columns=range(10))
import seaborn as sns
figure = plt.figure(figsize=(7, 6))
axes = sns.heatmap(confusion_df, annot=True,
cmap=plt.cm.nipy_spectral_r)
KFold Class¶
The KFold class and the function cross_val_score perform k-fold cross-validation.
n_splits=10 specifies the number of folds.
shuffle=True randomizes the data before splitting it into folds.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, random_state=11, shuffle=True)
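To see the folds KFold produces, you can iterate over the (train, test) index arrays it generates (a small sketch, assuming digits is already loaded):
for train_indices, test_indices in kfold.split(digits.data):
    print(train_indices.shape, test_indices.shape)  # roughly 90% train, 10% test per fold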
cross_val_score to Train and Test Your Model (1 of 2)¶
estimator=knn: the estimator to validate.
X=digits.data: the samples to use for training and testing.
y=digits.target: the target predictions for the samples.
cv=kfold: the cross-validation generator that defines how to split the samples and targets for training and testing.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=digits.data, y=digits.target, cv=kfold)
cross_val_score to Train and Test Your Model (2 of 2)¶
scores  # array of accuracy scores for each fold
print(f'Mean accuracy: {scores.mean():.2%}')
Even though KNeighborsClassifier predicts digit images with a high degree of accuracy, it's possible that other estimators are even more accurate. Let's compare KNeighborsClassifier, SVC and GaussianNB.
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
We pass gamma='scale' when creating the SVC estimator; this avoids a warning in some scikit-learn versions (and is the default in recent ones).
estimators = {
    'KNeighborsClassifier': knn,
    'SVC': SVC(gamma='scale'),
    'GaussianNB': GaussianNB()}
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object,
        X=digits.data, y=digits.target, cv=kfold)
    print(f'{estimator_name:>20}: ' +
          f'mean accuracy={scores.mean():.2%}; ' +
          f'standard deviation={scores.std():.2%}')
The KNeighborsClassifier and SVC estimators' accuracies are nearly identical, so we might want to perform hyperparameter tuning on each to determine the best. Here we run the cross-validation for KNeighborsClassifiers with odd k values from 1 through 19.
for k in range(1, 20, 2):  # k is an odd value 1-19; odds prevent ties
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator=knn,
        X=digits.data, y=digits.target, cv=kfold)
    print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' +
          f'standard deviation={scores.std():.2%}')
You can also use the cross_validate function to perform cross-validation and time the results.
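A minimal sketch (assuming the knn and kfold objects from above):
from sklearn.model_selection import cross_validate
results = cross_validate(estimator=knn, X=digits.data, y=digits.target, cv=kfold)
results['test_score']  # accuracy for each fold
results['fit_time'], results['score_time']  # per-fold training and scoring times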
©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.