15.6 Case Study: Unsupervised Machine Learning, Part 1—Dimensionality Reduction

  • We’ve focused on getting to know your data
  • Unsupervised machine learning and visualization can help you find patterns and relationships among unlabeled samples
  • Visualizing data with two variables is easy
    • Plot data in 2D with one variable along each axis
    • Visualization libraries also can plot datasets with three variables in 3D
  • But how do you visualize data with more than three dimensions?
    • Digits dataset samples each have 64 features (dimensions) and a target value
    • Big data samples can have hundreds, thousands or even millions of features (dimensions)

15.6 Case Study: Unsupervised Machine Learning, Part 1—Dimensionality Reduction (cont.)

  • To visualize, must reduce the data to two or three dimensions
  • Use an unsupervised machine learning technique called dimensionality reduction
  • Visualizing the results can reveal patterns in the data that will help you choose the most appropriate machine learning algorithms to use
  • For example, clusters of points might indicate distinct classes of information within the dataset
    • So a classification algorithm might be appropriate
    • Still need to determine the class of the samples in each cluster
    • Might require consulting with a domain expert and studying samples in a cluster to see what they have in common

15.6 Case Study: Unsupervised Machine Learning, Part 1—Dimensionality Reduction (cont.)

  • Dimensionality reduction also serves other purposes
    • Training estimators on big data with large numbers of dimensions can take hours, days, weeks or longer
    • Difficult for humans to think about highly dimensional data
    • Could eliminate or combine closely correlated features to improve training performance
      • Might reduce the accuracy of the model

Loading the Digits Dataset

  • Ignore Digits dataset labels and use dimensionality reduction to help visualize the data in two dimensions
  • We added %matplotlib inline to enable Matplotlib in this notebook.
In [1]:
%matplotlib inline
from sklearn.datasets import load_digits
In [2]:
digits = load_digits()
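As a quick check before reducing dimensions, the following sketch confirms the shape of the loaded data: one row per sample and 64 feature columns, plus one target value per sample (which this case study ignores until the final plot).

```python
from sklearn.datasets import load_digits

digits = load_digits()

# one row per sample, one column per pixel feature
print(digits.data.shape)    # (1797, 64)

# one label per sample
print(digits.target.shape)  # (1797,)
```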

Creating a TSNE Estimator for Dimensionality Reduction

  • TSNE estimator uses an algorithm called t-distributed Stochastic Neighbor Embedding (t-SNE) to analyze a dataset’s features and reduce them to the specified number of dimensions
    • The algorithm’s details are beyond this case study’s scope
    • We first tried the popular PCA (principal components analysis) estimator but did not like the results, so we switched to TSNE
  • Create a TSNE object that reduces a dataset’s features to two dimensions
  • random_state for reproducibility of the “render sequence” when we display the digit clusters
In [3]:
from sklearn.manifold import TSNE
In [4]:
tsne = TSNE(n_components=2, random_state=11)
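For comparison, the PCA alternative mentioned above could be sketched as follows. This is not the code behind the authors' PCA experiment, which the text does not show; the variable names pca and pca_data are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# reduce the 64 features to two principal components
pca = PCA(n_components=2, random_state=11)
pca_data = pca.fit_transform(digits.data)

print(pca_data.shape)  # (1797, 2)

# fraction of the dataset's total variance kept by the two components
print(pca.explained_variance_ratio_.sum())
```

Plotting pca_data the same way as the TSNE output lets you compare how well each technique separates the digits.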

Transforming the Digits Dataset’s Features into Two Dimensions

  • Two steps
    • Train the estimator with the dataset
    • Use the estimator to transform the data into the specified number of dimensions
  • Many estimators perform these steps separately with methods fit and transform, but scikit-learn’s TSNE does not provide a transform method
  • Perform both steps in one statement using fit_transform
    • Returns array with same number of rows as digits.data and two columns
In [5]:
reduced_data = tsne.fit_transform(digits.data)
In [6]:
reduced_data.shape
Out[6]:
(1797, 2)
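To our knowledge, scikit-learn's TSNE has no standalone transform method. If you call fit by itself, the reduced data is stored in the fitted estimator's embedding_ attribute. The sketch below shows this on the first 300 samples only, an arbitrary subset chosen to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
subset = digits.data[:300]  # assumption: small subset to keep this quick

tsne = TSNE(n_components=2, random_state=11)
tsne.fit(subset)  # train only; the return value is the estimator itself

# the two-dimensional result is stored on the fitted estimator
embedding = tsne.embedding_
print(embedding.shape)  # (300, 2)

# check whether this scikit-learn version offers a transform method
print(hasattr(tsne, 'transform'))
```

Because the result depends on the samples used for fitting, a fitted TSNE cannot simply be applied to new samples, which is one reason fit_transform is the usual entry point.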

Visualizing the Reduced Data

  • Rather than Seaborn’s scatterplot function, use Matplotlib’s scatter function
    • Returns collection of plotted items, which we’ll use in a second scatter plot
In [7]:
import matplotlib.pyplot as plt
figure = plt.figure(figsize=(5, 5))
dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c='black')

Visualizing the Reduced Data (cont.)

  • Did not label axes — they do not correspond to specific features of the original dataset
  • New features produced by TSNE could be quite different from dataset’s original features
  • Clear clusters of related data points
  • Appear to be 11 main clusters, rather than 10
  • Some "loose" data points
    • Makes sense because, as you saw, some digits were difficult to classify

Visualizing the Reduced Data with Different Colors for Each Digit

  • Don’t know whether all the items in each cluster represent the same digit
    • If not, then the clusters are not helpful
  • Use targets in Digits dataset to color the dots to see whether clusters indeed represent specific digits
  • c=digits.target: use target values to determine dot colors
  • cmap=plt.get_cmap('nipy_spectral_r', 10): color map to use
    • Specifically use 10 distinct colors for the 10 digits
    • plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap takes the same arguments
  • Last statement adds color bar key
In [8]:
figure = plt.figure(figsize=(6, 5))

# color each dot by its digit label, using 10 discrete colors
dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
    c=digits.target, cmap=plt.get_cmap('nipy_spectral_r', 10))
colorbar = plt.colorbar(dots)  # color bar key mapping colors to digits
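To make it easier to see which digit each cluster represents, one option is to write each digit at the median position of its points. The sketch below is an assumption-laden illustration rather than part of the original case study: N_SAMPLES, centers and the text styling are hypothetical choices, and only the first 500 samples are used to keep the run short. It calls plt.get_cmap because plt.cm.get_cmap is unavailable in recent Matplotlib releases.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

N_SAMPLES = 500  # hypothetical subset size; use len(digits.data) for all
data = digits.data[:N_SAMPLES]
targets = digits.target[:N_SAMPLES]

reduced = TSNE(n_components=2, random_state=11).fit_transform(data)

figure = plt.figure(figsize=(6, 5))
dots = plt.scatter(reduced[:, 0], reduced[:, 1], c=targets,
    cmap=plt.get_cmap('nipy_spectral_r', 10))
colorbar = plt.colorbar(dots)

# write each digit at the median position of the points with that label
centers = {}
for digit in np.unique(targets):
    x, y = np.median(reduced[targets == digit], axis=0)
    centers[int(digit)] = (float(x), float(y))
    plt.text(x, y, str(digit), fontsize=14, weight='bold',
        ha='center', va='center')
```

The median is used rather than the mean so that a few "loose" points far from a cluster do not pull its label away from the cluster's center.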

