Iris setosa: https://commons.wikimedia.org/wiki/File:Wild_iris_KEFJ_(9025144383).jpg. Credit: Courtesy of the National Park Service.
Iris versicolor: https://commons.wikimedia.org/wiki/Iris_versicolor#/media/File:IrisVersicolor-FoxRoost-Newfoundland.jpg. Credit: Courtesy of Jefficus, https://commons.wikimedia.org/w/index.php?title=User:Jefficus&action=edit&redlink=1

Iris virginica: https://commons.wikimedia.org/wiki/File:IMG_7911-Iris_virginica.jpg. Credit: Christer T Johansson.

from sklearn.datasets import load_iris
iris = load_iris()
print(iris.DESCR)
target_names contains names for the target array’s numeric labels. Its dtype='<U10' indicates that the elements are strings with a maximum of 10 characters. feature_names contains names for each column in the data array.
iris.data.shape
iris.target.shape
iris.target
iris.target_names
iris.feature_names
import pandas as pd
# pd.set_option('max_columns', 5)  # needed only in IPython interactive mode
# pd.set_option('display.width', None)  # needed only in IPython interactive mode
Next, create a DataFrame containing the data array’s contents, with feature_names as the column names:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
Use each value in the target array to look up the corresponding species name in the target_names array:
iris_df['species'] = [iris.target_names[i] for i in iris.target]
iris_df.head()
pd.set_option('display.precision', 2)  # the shorthand 'precision' was removed in pandas 2.0
iris_df.describe()
Calling describe on the 'species' column confirms that it contains three unique values:
iris_df['species'].describe()
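As a quick cross-check, the standard pandas value_counts method (not used in the original listing) tallies the samples per species; each of the three species should appear 50 times:

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]

# tally how many samples belong to each species
counts = iris_df['species'].value_counts()
print(counts)
```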
Seaborn’s pairplot function creates a grid of graphs:
import seaborn as sns
# sns.set(font_scale=1.1)
sns.set_style('whitegrid')
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4], hue='species')
data—The DataFrame (or two-dimensional array or list) containing the data to plot.
vars—A sequence containing the names of the variables to plot. For a DataFrame, these are the names of the columns to plot. Here, we use the first four DataFrame columns, representing the sepal length, sepal width, petal length and petal width, respectively.
hue—The DataFrame column that’s used to determine the colors of the plotted data. In this case, we’ll color the data by Iris species.

pairplot in One Color
Without the hue keyword argument, pairplot uses only one color to plot all the data because it does not know how to distinguish the species:
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4])
pairplot in One Color (cont.)
pairplot diagrams work well for a small number of features (or a subset of features), so that you have a small number of rows and columns, and for a relatively small number of samples, so you can see the data points.

KMeans Estimator
Next, we’ll use the KMeans estimator to place each sample in the Iris dataset into a cluster. We create the KMeans estimator with its default arguments, except n_clusters. When you train a KMeans estimator, it calculates for each cluster a centroid representing the cluster’s center data point (the number of centroids is specified by n_clusters).
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=11)
KMeans object’s fit Method
After training, the KMeans object contains:
a labels_ array with values from 0 to n_clusters - 1, indicating the clusters to which the samples belong
a cluster_centers_ array in which each row represents a centroid
kmeans.fit(iris.data)
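A minimal sketch of those two attributes’ shapes (n_init is pinned explicitly here because its default changed to 'auto' in recent scikit-learn versions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11, n_init=10)
kmeans.fit(iris.data)

print(kmeans.cluster_centers_.shape)  # one row per cluster, one column per feature
print(kmeans.labels_.shape)           # one cluster label per sample
```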
Let’s compare the cluster labels to the dataset’s target array values to get a sense of how well the k-means algorithm clustered the samples. The target array represents the three species with the values 0–2. If KMeans chose the clusters perfectly, then each group of 50 elements in the estimator’s labels_ array should have a distinct label. Note that the KMeans labels are not related to the dataset’s target array values.
print(kmeans.labels_[0:50])
print(kmeans.labels_[50:100])
print(kmeans.labels_[100:150])
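Instead of eyeballing the three 50-sample slices, one way to quantify the agreement is scikit-learn’s adjusted_rand_score (this scoring step is our addition, not part of the original listing); it is insensitive to the arbitrary cluster numbering:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11, n_init=10)
kmeans.fit(iris.data)

# adjusted_rand_score compares two labelings while ignoring how the
# clusters are numbered; 1.0 means perfect agreement, values near 0
# mean essentially random clustering
score = adjusted_rand_score(iris.target, kmeans.labels_)
print(f'ARI = {score:.2f}')
```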
Rather than relying on pairplot diagrams, we can use a PCA estimator to perform dimensionality reduction from 4 dimensions to 2 for visualization:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=11)
pca.fit(iris.data)  # trains estimator once
iris_pca = pca.transform(iris.data)  # can be called many times to reduce data
We’ll call transform again later to reduce the cluster centroids from four dimensions to two for plotting. transform returns an array with the same number of rows as iris.data, but only two columns:
iris_pca[0:5,:]
iris_pca.shape
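To check how much information the reduction discards, explained_variance_ratio_ reports the fraction of the original variance each principal component retains; for Iris, two components retain most of it. (fit_transform below is simply a one-step equivalent of the separate fit/transform calls, used here so the sketch stands alone.)

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2, random_state=11)
iris_pca = pca.fit_transform(iris.data)  # fit and transform in one step

# fraction of the original variance each principal component retains
print(pca.explained_variance_ratio_)
print(f'total retained: {pca.explained_variance_ratio_.sum():.3f}')
```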
Next, place the reduced data in a DataFrame and add a species column that we’ll use to determine the dot colors:
iris_pca_df = pd.DataFrame(iris_pca, 
                           columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species
The cluster_centers_ array has the same number of features (four) as the dataset’s samples, so the centroids must be reduced to two dimensions by the same PCA estimator as the other samples.
iris_pca_df.head()
axes = sns.scatterplot(data=iris_pca_df, x='Component1', 
    y='Component2', hue='species', legend='brief') 
# reduce centroids to 2 dimensions
iris_centers = pca.transform(kmeans.cluster_centers_)
# plot centroids as larger black dots
import matplotlib.pyplot as plt
dots = plt.scatter(iris_centers[:,0], iris_centers[:,1], s=100, c='k')
We used KMeans here on the small Iris dataset. If you use KMeans on larger datasets, consider MiniBatchKMeans instead; MiniBatchKMeans is faster on large datasets, and the results are almost as good. Note that for the DBSCAN and MeanShift estimators, we do not specify the number of clusters in advance.
from sklearn.cluster import DBSCAN, MeanShift,\
    SpectralClustering, AgglomerativeClustering
estimators = {
    'KMeans': kmeans,
    'DBSCAN': DBSCAN(),
    'MeanShift': MeanShift(),
    'SpectralClustering': SpectralClustering(n_clusters=3),
    'AgglomerativeClustering': 
        AgglomerativeClustering(n_clusters=3)
}
import numpy as np
for name, estimator in estimators.items():
    estimator.fit(iris.data)
    print(f'\n{name}:')
    for i in range(0, 101, 50):
        labels, counts = np.unique(
            estimator.labels_[i:i+50], return_counts=True)
        print(f'{i}-{i+50}:')
        for label, count in zip(labels, counts):
            print(f'   label={label}, count={count}')          
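As a rough single-number comparison of the estimators (including MiniBatchKMeans, mentioned above as the faster alternative for large datasets), adjusted_rand_score measures each labeling’s agreement with the known species while ignoring the arbitrary label numbering. This scoring sketch is our addition, not part of the original listing:

```python
from sklearn.cluster import (KMeans, MiniBatchKMeans, DBSCAN, MeanShift,
                             SpectralClustering, AgglomerativeClustering)
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
estimators = {
    'KMeans': KMeans(n_clusters=3, random_state=11, n_init=10),
    'MiniBatchKMeans': MiniBatchKMeans(n_clusters=3, random_state=11,
                                       n_init=10),
    'DBSCAN': DBSCAN(),
    'MeanShift': MeanShift(),
    'SpectralClustering': SpectralClustering(n_clusters=3, random_state=11),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3)
}

for name, estimator in estimators.items():
    estimator.fit(iris.data)  # every estimator exposes labels_ after fit
    score = adjusted_rand_score(iris.target, estimator.labels_)
    print(f'{name}: ARI = {score:.2f}')
```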
DBSCAN correctly predicted three clusters (labeled -1, 0 and 1), whereas MeanShift predicted only two clusters (labeled 0 and 1). (In DBSCAN, the label -1 marks samples the algorithm considers noise.)

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.