Python: An Unsupervised Learning Task Using K-Means Clustering


In the previous post, we performed a supervised machine learning task to classify Iris flowers, and did pretty well at predicting the labels (kinds) of flowers. We could evaluate the performance of our model because we had the “species” column with the names of the three Iris kinds. Now, let's imagine we were not given this column and wanted to know whether there are different kinds of Iris flower based only on the measurements of the length and width of the sepals and petals. Well, this is called an unsupervised learning task.

Unsupervised learning means there is no class (label) column we can use to test and evaluate how well a model is performing. Since there is no outcome to predict, the goal is to find patterns in the data and reach a reasonable conclusion.

We will use the K-means clustering algorithm on our Iris data, assuming we do not have the “species” column, and investigate whether the data can be grouped into 3 clusters representing the three species of Iris (Iris setosa, Iris virginica and Iris versicolor).

We will use Python in this post; here is the R version. So let's dive in :).

Preparing the Data

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Load the Iris dataset and drop the label column,
# since we pretend not to know the species
iris = sns.load_dataset("iris")
data = iris.drop(columns="species")
data.head()
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
data.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
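
One thing worth noting in these summary statistics: the features are on quite different scales (petal_length has more than twice the standard deviation of sepal_width). K-means relies on Euclidean distance, so larger-scale features dominate the clustering. We will keep the raw measurements here to stay comparable with the previous post, but standardizing first is common practice; a minimal sketch using scikit-learn's StandardScaler:

# Standardize each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)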

Exploring the Data

Let's now do some exploratory data analysis to get an idea about the data. Remember, we assume we don't know anything about how many clusters (kinds) of flowers there are in the dataset.

sns.lmplot(x='sepal_length',y='petal_width',data=data,fit_reg=False)
[Scatter plot: sepal_length vs. petal_width]

sns.lmplot(x='sepal_width',y='petal_width',data=data,fit_reg=False)
[Scatter plot: sepal_width vs. petal_width]

sns.lmplot(x='petal_length',y='sepal_width',data=data,fit_reg=False)
[Scatter plot: petal_length vs. sepal_width]

sns.lmplot(x='petal_length',y='petal_width',data=data,fit_reg=False)
[Scatter plot: petal_length vs. petal_width]

From the previous plots, it seems that the data can be grouped into at least two clusters. One cluster will clearly be easy for K-means to find, while the others might be trickier to separate.
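
Rather than drawing every pair of variables one by one, we could also get the same overview in a single figure with seaborn's pairplot:

# All pairwise scatter plots of the four measurements in one grid
sns.pairplot(data)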

The K-means clustering algorithm starts by randomly assigning each observation to a cluster and computing the centroid of each cluster.

It then iterates through the following two steps until the within-cluster variation can no longer be reduced (a minimal sketch follows the list):

  • Reassign data points to the cluster whose centroid is closest.
  • Calculate the new centroid of each cluster.
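
Here is that loop written out using only NumPy (for illustration only; the function name and defaults are our own, and it assumes no cluster ever ends up empty):

import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly assign each observation to one of the clusters
    labels = rng.integers(n_clusters, size=len(X))
    for _ in range(n_iter):
        # Find the centroid (mean) of each cluster
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(n_clusters)])
        # Reassign each point to the cluster with the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids

labels, centers = kmeans_sketch(data.values, n_clusters=2)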

Since this is an unsupervised task, there is no training or testing step; we will simply try different numbers of clusters on our data and visualize the results.

K-Means Clustering

# Importing KMeans from scikit-learn
from sklearn.cluster import KMeans
# Create an instance of a K-means model with 2 clusters,
# since we are supposed not to know there are 3 species
kmeans_two_md = KMeans(n_clusters=2)
# Fitting the model
kmeans_two_md.fit(data)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
# Cluster center vectors
kmeans_two_md.cluster_centers_
array([[ 5.00566038,  3.36981132,  1.56037736,  0.29056604],
       [ 6.30103093,  2.88659794,  4.95876289,  1.69587629]])
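The centers are easier to read with the feature names attached (a small convenience step, not part of the original output):

# Put the centroid coordinates in a DataFrame with the feature names
pd.DataFrame(kmeans_two_md.cluster_centers_, columns=data.columns)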
plt.scatter(x='petal_length',y='sepal_width',data=data,c=kmeans_two_md.labels_)
[Scatter plot: petal_length vs. sepal_width, colored by the 2-cluster assignment]

# Create an instance of a K-means model with 3 clusters
kmeans_three_md = KMeans(n_clusters=3)
# Fitting the model
kmeans_three_md.fit(data)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
# Cluster center vectors
kmeans_three_md.cluster_centers_
array([[ 6.85      ,  3.07368421,  5.74210526,  2.07105263],
       [ 5.006     ,  3.428     ,  1.462     ,  0.246     ],
       [ 5.9016129 ,  2.7483871 ,  4.39354839,  1.43387097]])
# Side-by-side comparison of the 3-cluster solution and the true species
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(12, 6))
# Map species names to integers so they can be used as point colors
iris_map = {'virginica': 1, 'setosa': 2, 'versicolor': 3}
ax1.set_title('K Means')
ax1.scatter(x='petal_length', y='sepal_width', data=data, c=kmeans_three_md.labels_)
ax2.set_title("Original")
ax2.scatter(x='petal_length', y='sepal_width', data=iris, c=iris['species'].apply(lambda x: iris_map[x]))
[Side-by-side scatter plots: K-means clusters (left) vs. original species (right)]
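
Since we do in fact have the true species, we can also quantify the agreement with a cross-tabulation (a quick sanity check we could not do in a genuinely unsupervised setting; note that the cluster numbers are arbitrary and may change between runs):

# Compare the 3-cluster assignments against the held-back species labels
pd.crosstab(iris['species'], kmeans_three_md.labels_)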

As expected, Setosa was grouped correctly. However, as seen in the previous post, there is always some overlap between Versicolor and Virginica, which explains the points assigned to the wrong cluster.
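
Finally, in a real unsupervised problem we would not know that there are three species. A common heuristic for choosing the number of clusters is the elbow method: fit K-means for a range of k and look for the point where the within-cluster sum of squares (inertia_ in scikit-learn) stops dropping sharply. A minimal sketch:

# Within-cluster sum of squares for k = 1..8
inertias = [KMeans(n_clusters=k).fit(data).inertia_ for k in range(1, 9)]
plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squares')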