K Means Clustering Algorithm: Explained

Classification problems are solved by objective segmentation and subjective segmentation.

A non technical explanation ( http://dni-institute.in/blogs/segmentation-a-perspective-2/ ) on when to use subjective segmentation technique such as K means clustering and when to use objective segmentation methods such as Decision Tree.

One of the most frequently used unsupervised algorithms is K Means. K Means Clustering is exploratory data analysis technique. This is non-hierarchical method of grouping objects together.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

In this blog, we aim to explain the algorithm in  a simple steps and with an example.

Business Scenario: We have height and weight information. Using these two variables, we need to group the objects based on height and weight information.

k means clusters

If you look at the above chart, you will expect that there are two visible clusters/segments  and we want these to be identified using K Means algorithm.

Data Sample

Height
Weight
185
72
170
56
168
60
179
68
182
72
188
77
180
71
180
70
183
84
180
88
180
67
177
76
Step 1: Input
Dataset, Clustering Variables and Maximum Number of Clusters (K in Means Clustering)
In this dataset, only two variables –height and weight – are considered for clustering
Height
Weight
185
72
170
56
168
60
179
68
182
72
188
77
180
71
180
70
183
84
180
88
180
67
177
76
Step 2:  Initialize cluster centroid
In this example, value of K is considered as 2.  Cluster centroids are initialized with first 2 observations.
Cluster
Initial Centroid
Height
Weight
K1
185
72
K2
170
56
Step 3: Calculate Euclidean Distance
Euclidean  is one of the distance measures used on K Means algorithm. Euclidean distance between of a observation and initial cluster centroids  1 and 2 is calculated. Based on euclidean distance each observation is assigned to one of the clusters - based on minimum distance.

Euclidean Distance

 

First two observations

Height
Weight
185
72
170
56
Now initial cluster centroids are :
Cluster
Updated
Centroid
Height
Weight
K1
185
72
K2
170
56

Euclidean Distance Calculation from each of the clusters is calculated.

Euclidian Distance from Cluster 1
Euclidian Distance from Cluster 2
Assignment
(185-185)2+(72-72)2
 =0
(185-170)2+(72-56)2
= 21.93
1
(170-185)2+(56-72)2
= 21.93
(170-170)2+(56-56)2
= 0
2
We have considered two observations for assignment only because we knew the assignment.  And there is no change in Centroids as these two observations were only considered as initial centroids
Step 4: Move on to next observation and calculate Euclidean Distance
Height
Weight
168
60
Euclidean Distance from Cluster 1
Euclidean Distance from Cluster 2
Assignment
(168-185)2+(60-72)2
 =20.808
(168-185)2+(60-72)2
= 4.472
2
 Since distance is minimum from cluster 2, so the observation is assigned to cluster 2. Now revise Cluster Centroid – mean value Height and Weight as Custer Centroids. Addition is only to cluster 2, so centroid of cluster 2 will be updated
Updated cluster centroids
Cluster
Updated Centroid
Height
Weight
K=1
185
72
K=2
(170+168)/2 = 169
(56+60)/2 = 58
Step 5: Calculate Euclidean Distance for the next observation, assign next observation based on minimum euclidean distance and update the cluster centroids.
Next Observation.
Height
Weight
179
68
Euclidean Distance Calculation and Assignment
Euclidain Distance from Cluster 1
Euclidain Distance from Cluster 2
Assignment
7.211103
14.14214
1
Update Cluster Centroid
Cluster
Updated Centroid
Height
Weight
K=1
182
70.6667
K=2
169
58
Continue the steps until all observations are assigned
Final assignments
 k means assignment
Cluster Centroids
Cluster
Updated Centroid
Height
Weight
K=1
182.8
72
K=2
169
58

 

 

 

 

This is what was expected initially based on two-dimensional plot.

Clusters

 

A few important considerations in K Means

  • Scale of measurements influences Euclidean Distance , so variable standardisation becomes necessary
  • Depending on expectations - you may require outlier treatment
  • K Means clustering may be biased on initial centroids - called cluster seeds
  • Maximum clusters  is typically inputs and may also impacts the clusters getting created

In the next blog, we focus on creating clusters using R. K Means Clustering using R

21 thoughts on “K Means Clustering Algorithm: Explained

  1. I am little confused. This is not an accurate depiction of k-Means algorithm. k-Means algorithm steps:

    K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids (2) chosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.

    One iteration:
    1. Assign labels (clusters) to all observations
    2. Calculate the new Centroid values using mean

    Ref: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

    In your example you are updating the centroid values even before assigning all the observations to clusters.

    Please clarify.

    • Thanks Kumar for your comment.. I do not think there is any overall approach-wise disconnect between steps we explained and mentioned in the link.. If you read the Step 3 - it calculates Euclidean Distance, Assign observation to a Cluster and Cluster Centroids are updated.Hope it helps

  2. Excellent explaination. I was stuck in making my own implementation of KMeans in R, but this post make it easier to implement. Thank you.

  3. There's probably a calculation mistake in the updated centroid values in step 5. But,apart from that, wonderfully explained. Such a complex thing made so easy!

  4. Hello,

    I have one doubt on K means clustering. How can any team work on K means clustering algorithm( team means in real time project) because if value of K will be multiple so cluster will also create multiple so can only one person will work on K means or how we use this also real-time project?

    • When a k means clustering project is being done, multiple values of k are considered. There are a few considerations to select the final clustering is selected.

        Observation % in each of the clusters
        R2 value of each of the variable
        Overall R2 value and other clustering performance statistics e.g. CCC
  5. Thanks, Dnl Institute for your reply. Have you any link or any site where we are using K means like this.

    Observation % in each of the clusters
    R2 value of each of the variable
    Overall R2 value and other clustering performance statistics

Leave a Comment