Clustering for dataset exploration#
Learn how to discover the underlying groups (or “clusters”) in a dataset. By the end of this chapter, you’ll be clustering companies using their stock market prices, and distinguishing different species by clustering their measurements.
Unsupervised Learning#
How many clusters?#
You are given an array points
of size 300x2, where each row
gives the (x, y) co-ordinates of a point on a map. Make a scatter plot
of these points, and use the scatter plot to guess how many clusters
there are.
matplotlib.pyplot
has already been imported as
plt
. In the IPython Shell:
- Create an array called xs that contains the values of points[:,0] - that is, column 0 of points.
- Create an array called ys that contains the values of points[:,1] - that is, column 1 of points.
- Pass xs and ys to the plt.scatter() function.
- Call the plt.show() function to show your plot.

How many clusters do you see?
# edited/added
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
points = np.array(pd.read_csv("archive/Unsupervised-Learning-in-Python/datasets/points.csv", header = None))
xs = points[:,0]
ys = points[:,1]
plt.scatter(xs, ys, alpha=0.5)
plt.show()
Possible answers:

- 2
- 3
- 300
Correct! The scatter plot suggests that there are 3 distinct clusters.
Clustering 2D points#
From the scatter plot of the previous exercise, you saw that the points
seem to separate into 3 clusters. You’ll now create a KMeans model to
find 3 clusters, and fit it to the data points from the previous
exercise. After the model has been fit, you’ll obtain the cluster labels
for some new points using the .predict()
method.
You are given the array points
from the previous exercise,
and also an array new_points
.
- Import KMeans from sklearn.cluster.
- Using KMeans(), create a KMeans instance called model to find 3 clusters. To specify the number of clusters, use the n_clusters keyword argument.
- Use the .fit() method of model to fit the model to the array of points points.
- Use the .predict() method of model to predict the cluster labels of new_points, assigning the result to labels.
- Hit submit to see the cluster labels of new_points.
# edited/added
import numpy as np
import pandas as pd
points = np.array(pd.read_csv("archive/Unsupervised-Learning-in-Python/datasets/points.csv", header = None))
new_points = np.array(pd.read_csv("archive/Unsupervised-Learning-in-Python/datasets/new_points.csv", header = None))
# Import KMeans
from sklearn.cluster import KMeans
# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)
# Fit model to points
model.fit(points)
## KMeans(n_clusters=3)
# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)
# Print cluster labels of new_points
print(labels)
## [0 1 2 0 1 0 1 1 1 2 0 1 1 2 2 1 2 2 1 1 2 1 0 1 0 2 1 2 2 0 0 1 1 1 2 0 1
## 1 0 1 2 0 0 2 0 1 2 2 1 1 1 1 2 2 0 0 2 2 2 0 0 1 1 1 0 1 2 1 0 2 0 0 0 1
## 0 2 2 0 1 2 0 2 0 1 2 1 2 0 1 1 1 0 1 1 0 2 2 2 2 0 1 0 2 2 0 0 1 0 2 2 0
## 2 2 2 1 1 1 1 2 2 1 0 1 2 1 0 2 1 2 2 1 2 1 2 0 1 0 0 1 2 0 1 0 0 2 1 1 0
## 2 0 2 1 0 2 2 0 2 1 1 2 1 2 2 1 1 0 1 1 2 0 2 0 0 1 0 1 1 0 0 2 0 0 0 2 1
## 1 0 2 0 2 2 1 1 1 0 1 1 1 2 2 0 1 0 0 0 2 1 1 1 1 1 1 2 2 1 2 2 2 2 1 2 2
## 1 1 0 2 0 0 2 0 2 0 2 1 1 2 1 1 1 2 0 0 2 1 1 2 1 2 2 1 2 2 0 2 0 0 0 1 2
## 2 2 0 1 0 2 0 2 2 1 0 0 0 2 1 1 1 0 1 2 2 1 0 0 2 0 0 2 0 1 0 2 2 2 2 1 2
## 2 1 1 0]
Great work! You’ve successfully performed k-Means clustering and predicted the labels of new points. But it is not easy to inspect the clustering by just looking at the printed labels. A visualization would be far more useful. In the next exercise, you’ll inspect your clustering with a scatter plot!
Inspect your clustering#
Let’s now inspect the clustering you performed in the previous exercise!
A solution to the previous exercise has already run, so
new_points
is an array of points and labels
is
the array of their cluster labels.
- Import matplotlib.pyplot as plt.
- Assign column 0 of new_points to xs, and column 1 of new_points to ys.
- Make a scatter plot of xs and ys, specifying the c=labels keyword argument to color the points by their cluster label. Also specify alpha=0.5.
- Assign the cluster centers to centroids using the .cluster_centers_ attribute of model.
- Assign column 0 of centroids to centroids_x, and column 1 of centroids to centroids_y.
- Make a scatter plot of centroids_x and centroids_y, using 'D' (a diamond) as a marker by specifying the marker parameter. Set the size of the markers to be 50 using s=50.
# Import pyplot
from matplotlib import pyplot as plt
# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)
# Assign the cluster centers: centroids
centroids = model.cluster_centers_
# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()
Fantastic! The clustering looks great! But how can you be sure that 3 clusters is the correct choice? In other words, how can you evaluate the quality of a clustering? Tune into the next video in which Ben will explain how to evaluate a clustering!
Evaluating a clustering#
How many clusters of grain?#
In the video, you learned how to choose a good number of clusters for a
dataset using the k-means inertia graph. You are given an array
samples
containing the measurements (such as area,
perimeter, length, and several others) of samples of grain. What’s a
good number of clusters in this case?
KMeans
and PyPlot (plt
) have already been
imported for you.
This dataset was sourced from the UCI Machine Learning Repository.
For each of the given values of k, perform the following steps:

- Create a KMeans instance called model with k clusters.
- Fit the model to the grain data samples.
- Append the value of the inertia_ attribute of model to the list inertias.

The code to plot ks vs inertias has been written for you, so hit submit to see the plot!
# edited/added
grains = pd.read_csv("archive/Unsupervised-Learning-in-Python/datasets/grains.csv")
samples = np.array(grains)[:,:7]
varieties = list(np.array(grains)[:,8])
ks = range(1, 6)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(samples)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
## KMeans(n_clusters=1)
## KMeans(n_clusters=2)
## KMeans(n_clusters=3)
## KMeans(n_clusters=4)
## KMeans(n_clusters=5)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
Excellent job! The inertia decreases very slowly from 3 clusters to 4, so it looks like 3 clusters would be a good choice for this data.
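For intuition: inertia is the sum of squared distances from each sample to the centroid of its assigned cluster, which is why it typically shrinks as k grows. A minimal sketch (assuming the samples array from above; the random_state is my addition, to make the run reproducible) verifying model.inertia_ by hand:

# edited/added - verify inertia by hand
import numpy as np
from sklearn.cluster import KMeans
X = np.asarray(samples, dtype=float)  # ensure a float array
km = KMeans(n_clusters=3, random_state=42)
labels3 = km.fit_predict(X)
# sum of squared distances from each sample to its assigned centroid
manual_inertia = np.sum((X - km.cluster_centers_[labels3]) ** 2)
print(np.isclose(manual_inertia, km.inertia_))  # expected: True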
Evaluating the grain clustering#
In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: “Kama”, “Rosa” and “Canadian”. In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.
You have the array samples
of grain samples, and a list
varieties
giving the grain variety for each sample. Pandas
(pd
) and KMeans
have already been imported for
you.
- Create a KMeans model called model with 3 clusters.
- Use the .fit_predict() method of model to fit it to samples and derive the cluster labels. Using .fit_predict() is the same as using .fit() followed by .predict().
- Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you.
- Use the pd.crosstab() function on df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label. Assign the result to ct.
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)
# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)
# Create a DataFrame with clusters and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
# Display ct
print(ct)
## varieties Canadian wheat Kama wheat Rosa wheat
## labels
## 0 2 60 10
## 1 0 1 60
## 2 68 9 0
Great work! The cross-tabulation shows that the 3 varieties of grain separate really well into 3 clusters. But depending on the type of data you are working with, the clustering may not always be this good. Is there anything you can do in such situations to improve your clustering? You’ll find out in the next video!
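As a quick check of the equivalence noted in the instructions, here is a minimal sketch (assuming samples from this exercise; the random_state is my addition) showing that .fit_predict() gives the same labels as .fit() followed by .predict() on the same data:

# edited/added - sketch of the .fit_predict() equivalence
import numpy as np
from sklearn.cluster import KMeans
check = KMeans(n_clusters=3, random_state=0)
labels_a = check.fit_predict(samples)  # fit and label in one step
labels_b = check.predict(samples)      # re-label with the already-fitted model
print(np.array_equal(labels_a, labels_b))  # expected: True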
Transforming features for better clusterings#
Scaling fish data for clustering#
You are given an array samples
giving measurements of fish.
Each row represents an individual fish. The measurements, such as weight
in grams, length in centimeters, and the percentage ratio of height to
length, have very different scales. In order to cluster this data
effectively, you’ll need to standardize these features first. In this
exercise, you’ll build a pipeline to standardize and cluster the data.
These fish measurement data were sourced from the Journal of Statistics Education.
- Import make_pipeline from sklearn.pipeline.
- Import StandardScaler from sklearn.preprocessing.
- Import KMeans from sklearn.cluster.
- Create an instance of StandardScaler called scaler.
- Create an instance of KMeans with 4 clusters called kmeans.
- Create a pipeline called pipeline that chains scaler and kmeans. To do this, you just need to pass them in as arguments to make_pipeline().
# edited/added
fish = np.array(pd.read_csv("archive/Unsupervised-Learning-in-Python/datasets/fish.csv", header = None))
samples = fish[:,1:]
species = fish[:,0]
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Create scaler: scaler
scaler = StandardScaler()
# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)
Great work! Now that you’ve built the pipeline, you’ll use it in the next exercise to cluster the fish by their measurements.
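To see what the scaler in this pipeline will do, here is a minimal sketch on toy numbers (not the fish data): StandardScaler transforms each feature column to mean 0 and unit variance, so a weight in grams and a percentage ratio end up on the same footing.

# edited/added - StandardScaler on a toy array
import numpy as np
from sklearn.preprocessing import StandardScaler
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])
scaled = StandardScaler().fit_transform(toy)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # [1. 1.]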
Clustering the fish data#
You’ll now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.
As before, samples
is the 2D array of fish measurements.
Your pipeline is available as pipeline
, and the species of
every fish sample is given by the list species
.
- Import pandas as pd.
- Fit the pipeline to the fish measurements samples.
- Obtain the cluster labels for samples by using the .predict() method of pipeline.
- Using pd.DataFrame(), create a DataFrame df with two columns named 'labels' and 'species', using labels and species, respectively, for the column values.
- Using pd.crosstab(), create a cross-tabulation ct of df['labels'] and df['species'].
# Import pandas
import pandas as pd
# Fit the pipeline to samples
pipeline.fit(samples)
## Pipeline(steps=[('standardscaler', StandardScaler()),
##                 ('kmeans', KMeans(n_clusters=4))])
# Calculate the cluster labels: labels
labels = pipeline.predict(samples)
# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': labels, 'species': species})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])
# Display ct
print(ct)
## species Bream Pike Roach Smelt
## labels
## 0 1 0 19 1
## 1 0 17 0 0
## 2 0 0 0 13
## 3 33 0 1 0
Excellent! It looks like the fish data separates really well into 4 clusters!
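As an optional follow-up (assuming the pipeline from above), you can pull the fitted KMeans step back out of the pipeline to inspect its centroids. Note that make_pipeline names each step after its lowercased class name, and that the centroids live in the scaled feature space, not the original units:

# edited/added - inspect the fitted KMeans step inside the pipeline
kmeans_step = pipeline.named_steps['kmeans']
print(kmeans_step.cluster_centers_.shape)  # (4, number of features), scaled units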
Clustering stocks using KMeans#
In this exercise, you’ll cluster companies using their daily stock price
movements (i.e. the dollar difference between the closing and opening
prices for each trading day). You are given a NumPy array
movements
of daily price movements from 2010 to 2015
(obtained from Yahoo! Finance), where each row corresponds to a company,
and each column corresponds to a trading day.
Some stocks are more expensive than others. To account for this, include
a Normalizer
at the beginning of your pipeline. The
Normalizer will separately transform each company’s stock price to a
relative scale before the clustering begins.
Note that Normalizer() is different from StandardScaler(), which you used in the previous exercise. While StandardScaler() standardizes features (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, Normalizer() rescales each sample - here, each company's stock price movements - independently of the others. The toy sketch below makes this concrete.
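Here, on illustrative numbers only (not the stock data): StandardScaler works down each column, while Normalizer works across each row, so a cheap stock and an expensive stock with the same movement pattern become identical rows.

# edited/added - StandardScaler (per feature) vs Normalizer (per sample)
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler
X = np.array([[1.0, 2.0, 2.0],       # a cheap stock's movements
              [10.0, 20.0, 20.0]])   # the same pattern at 10x the price
print(StandardScaler().fit_transform(X))  # scales each column (feature)
print(Normalizer().fit_transform(X))      # scales each row to unit L2 norm:
                                          # both rows become [1/3, 2/3, 2/3]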
KMeans
and make_pipeline
have already been
imported for you.
- Import Normalizer from sklearn.preprocessing.
- Create an instance of Normalizer called normalizer.
- Create an instance of KMeans called kmeans with 10 clusters.
- Using make_pipeline(), create a pipeline called pipeline that chains normalizer and kmeans.
- Fit the pipeline to the movements array.
# edited/added
stock = np.array(pd.read_csv("archive/Unsupervised-Learning-in-Python/datasets/company-stock-movements-2010-2015-incl.csv", header = None, skiprows=1))
movements = stock[:,1:]
companies = list(stock[:,0])
# Import Normalizer
from sklearn.preprocessing import Normalizer
# Create a normalizer: normalizer
normalizer = Normalizer()
# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)
# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)
# Fit pipeline to the daily price movements
pipeline.fit(movements)
## Pipeline(steps=[('normalizer', Normalizer()),
## ('kmeans', KMeans(n_clusters=10))])
Great work - you’re really getting the hang of this. Now that your pipeline has been set up, you can find out which stocks move together in the next exercise!
Which stocks move together?#
In the previous exercise, you clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way? You'll now inspect the cluster labels from your clustering to find out.
Your solution to the previous exercise has already been run. Recall that
you constructed a Pipeline pipeline
containing a
KMeans
model and fit it to the NumPy array
movements
of daily stock movements. In addition, a list
companies
of the company names is available.
- Import pandas as pd.
- Use the .predict() method of the pipeline to predict the labels for movements.
- Align the cluster labels with the list companies by creating a DataFrame df with labels and companies as columns. This has been done for you.
- Use the .sort_values() method of df to sort the DataFrame by the 'labels' column, and print the result.
# Import pandas
import pandas as pd
# Predict the cluster labels: labels
labels = pipeline.predict(movements)
# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})
# Display df sorted by cluster label
print(df.sort_values('labels'))
## labels companies
## 26 0 JPMorgan Chase
## 1 0 AIG
## 3 0 American express
## 5 0 Bank of America
## 18 0 Goldman Sachs
## 16 0 General Electrics
## 55 0 Wells Fargo
## 15 0 Ford
## 30 1 MasterCard
## 44 1 Schlumberger
## 32 1 3M
## 53 1 Valero Energy
## 13 1 DuPont de Nemours
## 59 1 Yahoo
## 10 1 ConocoPhillips
## 8 1 Caterpillar
## 35 1 Navistar
## 57 1 Exxon
## 2 1 Amazon
## 12 1 Chevron
## 42 2 Royal Dutch Shell
## 43 2 SAP
## 41 2 Philip Morris
## 46 2 Sanofi-Aventis
## 28 2 Coca Cola
## 19 2 GlaxoSmithKline
## 20 2 Home Depot
## 37 2 Novartis
## 52 2 Unilever
## 54 2 Walgreen
## 6 2 British American Tobacco
## 49 2 Total
## 39 2 Pfizer
## 24 3 Intel
## 47 3 Symantec
## 23 3 IBM
## 50 3 Taiwan Semiconductor Manufacturing
## 51 3 Texas instruments
## 56 4 Wal-Mart
## 29 5 Lookheed Martin
## 36 5 Northrop Grumman
## 4 5 Boeing
## 34 6 Mitsubishi
## 7 6 Canon
## 45 6 Sony
## 58 6 Xerox
## 48 6 Toyota
## 21 6 Honda
## 40 7 Procter Gamble
## 9 7 Colgate-Palmolive
## 25 7 Johnson & Johnson
## 38 7 Pepsi
## 27 7 Kimberly-Clark
## 31 8 McDonalds
## 17 9 Google/Alphabet
## 14 9 Dell
## 11 9 Cisco
## 33 9 Microsoft
## 22 9 HP
## 0 9 Apple
Fantastic job - you have completed Chapter 1! Take a look at the clusters. Are you surprised by any of the results? In the next chapter, you’ll learn about how to communicate results such as this through visualizations.