Iris setosa: https://commons.wikimedia.org/wiki/File:Wild_iris_KEFJ_(9025144383).jpg. Credit: Courtesy of the National Park Service.
Iris versicolor: https://commons.wikimedia.org/wiki/Iris_versicolor#/media/File:IrisVersicolor-FoxRoost-Newfoundland.jpg. Credit: Courtesy of Jefficus, https://commons.wikimedia.org/w/index.php?title=User:Jefficus&action=edit&redlink=1
Iris virginica: https://commons.wikimedia.org/wiki/File:IMG_7911-Iris_virginica.jpg. Credit: Christer T Johansson.
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.DESCR)
target_names contains the names for the target array's numeric labels. Its dtype='<U10' indicates that the elements are strings with a maximum of 10 characters.
feature_names contains the names for each column in the data array.
iris.data.shape
iris.target.shape
iris.target
iris.target_names
iris.feature_names
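As a quick sanity check, the attributes above can be inspected programmatically. A minimal sketch using only scikit-learn's bundled Iris data:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# 150 samples, 4 features; one integer label (0-2) per sample
print(iris.data.shape)            # (150, 4)
print(iris.target.shape)          # (150,)

# target_names holds one string per class; dtype '<U10' means
# Unicode strings of at most 10 characters
print(iris.target_names)          # ['setosa' 'versicolor' 'virginica']
print(iris.target_names.dtype)    # <U10

# feature_names is a plain Python list of column-name strings
print(iris.feature_names)
```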
import pandas as pd
# pd.set_option('display.max_columns', 5)  # needed only in IPython interactive mode
# pd.set_option('display.width', None)  # needed only in IPython interactive mode
Next, create a DataFrame containing the data array's contents, using feature_names as the column names:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
Use each value in the target array to look up the corresponding species name in the target_names array:
iris_df['species'] = [iris.target_names[i] for i in iris.target]
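The list comprehension above can also be written with NumPy fancy indexing: indexing target_names with the entire target array looks up every species name in one step. An equivalent alternative, not the text's code:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# fancy indexing maps every numeric label to its species name at once
iris_df['species'] = iris.target_names[iris.target]

print(iris_df['species'].head())
```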
iris_df.head()
pd.set_option('display.precision', 2)
iris_df.describe()
Calling describe on the 'species' column confirms that it contains three unique values:
iris_df['species'].describe()
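describe reports the count of unique values and the most frequent one; as an optional cross-check, value_counts shows the full breakdown, 50 samples per species:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]

# each of the three species appears exactly 50 times
print(iris_df['species'].value_counts())
```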
Seaborn's pairplot function creates a grid of graphs:
import seaborn as sns
# sns.set(font_scale=1.1)
sns.set_style('whitegrid')
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4], hue='species')
data—The DataFrame (or two-dimensional array or list) containing the data to plot.
vars—A sequence containing the names of the variables to plot. For a DataFrame, these are the names of the columns to plot. Here, we use the first four DataFrame columns, representing the sepal length, sepal width, petal length and petal width, respectively.
hue—The DataFrame column that's used to determine the colors of the plotted data. In this case, we'll color the data by Iris species.

pairplot in One Color
Without the hue keyword argument, pairplot uses only one color to plot all the data because it does not know how to distinguish the species:
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4])
pairplot in One Color (cont.)
pairplot diagrams work well for a small number of features or a subset of features, so that you have a small number of rows and columns, and for a relatively small number of samples, so you can see the data points.

Using the KMeans Estimator
We'll use the KMeans estimator to place each sample in the Iris dataset into a cluster. When you train a KMeans estimator, it calculates for each cluster a centroid representing the cluster's center data point. Here we create the KMeans object with mostly default arguments, specifying only the number of clusters (n_clusters) and a seed for reproducibility:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=11)
Training the KMeans Object with Its fit Method
After training, the KMeans object contains:
a labels_ array with values from 0 to n_clusters - 1, indicating the clusters to which the samples belong
a cluster_centers_ array in which each row represents a centroid
kmeans.fit(iris.data)
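After fit returns, the two attributes have predictable shapes. A small sketch (random_state=11 matches the text):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11)
kmeans.fit(iris.data)

# one cluster label per sample, each in the range 0..n_clusters-1
print(kmeans.labels_.shape)           # (150,)
print(sorted(set(kmeans.labels_)))    # [0, 1, 2]

# one centroid per cluster, with one value per feature
print(kmeans.cluster_centers_.shape)  # (3, 4)
```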
Comparing Cluster Labels to the target Array
We can compare the estimator's labels_ array to the dataset's target array values to get a sense of how well the k-means algorithm clustered the samples. The dataset groups the samples by species, and the target array represents these with the values 0–2. If KMeans chose the clusters perfectly, then each group of 50 elements in the estimator's labels_ array should have a distinct label. Note that KMeans labels are not related to the dataset's target array:
print(kmeans.labels_[0:50])
print(kmeans.labels_[50:100])
print(kmeans.labels_[100:150])
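Because cluster labels are arbitrary, comparing them to target by eye is awkward. scikit-learn's adjusted_rand_score is permutation-invariant, so it measures agreement regardless of which numeric label each cluster received. An optional check, not from the text:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11)
kmeans.fit(iris.data)

# 1.0 would mean perfect agreement with the species labels,
# 0.0 is the agreement expected by chance
score = adjusted_rand_score(iris.target, kmeans.labels_)
print(f'{score:.2f}')
```

On Iris, k-means typically agrees well with the species labels, though not perfectly (a handful of versicolor/virginica samples end up in the wrong cluster).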
Dimensionality Reduction with PCA
Rather than visualizing the data with pairplot diagrams, we can use the PCA estimator to perform dimensionality reduction from 4 to 2 dimensions:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=11)
pca.fit(iris.data)  # trains estimator once
iris_pca = pca.transform(iris.data)  # can be called many times to reduce data
We'll call transform again later to reduce the cluster centroids from four dimensions to two for plotting. transform returns an array with the same number of rows as iris.data, but only two columns:
iris_pca[0:5,:]
iris_pca.shape
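PCA's explained_variance_ratio_ attribute shows how much of the original variance each component retains; for Iris, the first two components capture most of it. A quick sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2, random_state=11)
iris_pca = pca.fit_transform(iris.data)  # fit and transform in one call

print(iris_pca.shape)  # (150, 2)

# fraction of the dataset's variance captured by each component;
# for Iris the two-component total is close to 0.98
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```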
Place the reduced data in a DataFrame and add a species column that we'll use to determine the dot colors:
iris_pca_df = pd.DataFrame(iris_pca, columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species
Each centroid in the cluster_centers_ array has the same number of features (four) as the dataset's samples, so the centroids must be reduced with the same PCA estimator as the other samples before plotting:
iris_pca_df.head()
axes = sns.scatterplot(data=iris_pca_df, x='Component1',
y='Component2', hue='species', legend='brief')
# reduce centroids to 2 dimensions
iris_centers = pca.transform(kmeans.cluster_centers_)
# plot centroids as larger black dots
import matplotlib.pyplot as plt
dots = plt.scatter(iris_centers[:,0], iris_centers[:,1], s=100, c='k')
Choosing the Best Clustering Estimator
We used KMeans here on the small Iris dataset. If you experience performance problems with KMeans on larger datasets, consider MiniBatchKMeans, which is faster on large datasets and produces results that are almost as good. For the DBSCAN and MeanShift estimators, we do not specify the number of clusters in advance:
from sklearn.cluster import DBSCAN, MeanShift,\
    SpectralClustering, AgglomerativeClustering
estimators = {
'KMeans': kmeans,
'DBSCAN': DBSCAN(),
'MeanShift': MeanShift(),
'SpectralClustering': SpectralClustering(n_clusters=3),
'AgglomerativeClustering':
AgglomerativeClustering(n_clusters=3)
}
import numpy as np
for name, estimator in estimators.items():
estimator.fit(iris.data)
print(f'\n{name}:')
for i in range(0, 101, 50):
labels, counts = np.unique(
estimator.labels_[i:i+50], return_counts=True)
print(f'{i}-{i+50}:')
for label, count in zip(labels, counts):
print(f' label={label}, count={count}')
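Since the estimators above choose different numbers of clusters (and arbitrary labelings), an internal metric such as silhouette_score, which needs no ground-truth labels, offers one way to compare them. A sketch using two of the estimators; for simplicity, DBSCAN's noise label -1 is kept as its own group here:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

iris = load_iris()

estimators = {
    'KMeans': KMeans(n_clusters=3, random_state=11),
    'DBSCAN': DBSCAN()
}

for name, estimator in estimators.items():
    estimator.fit(iris.data)
    # silhouette_score requires at least two distinct labels
    if len(np.unique(estimator.labels_)) > 1:
        score = silhouette_score(iris.data, estimator.labels_)
        print(f'{name}: {score:.2f}')
```

Higher scores (closer to 1.0) indicate tighter, better-separated clusters.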
DBSCAN correctly predicted three clusters (labeled -1, 0 and 1; DBSCAN's label -1 denotes samples it considers noise), whereas MeanShift predicted only two clusters (labeled 0 and 1).

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.