# K Means Clustering Algorithm: Explained

Segmentation problems can be approached through objective segmentation or through subjective segmentation.

For a non-technical explanation of when to use a subjective segmentation technique such as K Means clustering and when to use an objective segmentation method such as a Decision Tree, see http://dni-institute.in/blogs/segmentation-a-perspective-2/ .

One of the most frequently used unsupervised algorithms is K Means. K Means clustering is an exploratory data analysis technique and a non-hierarchical method of grouping objects together.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

In this blog, we aim to explain the algorithm in simple steps and with a worked example.

Business Scenario: We have height and weight information for a set of individuals. Using these two variables, we need to group the objects into segments.

If you plot height against weight, two clusters/segments are clearly visible, and we want the K Means algorithm to identify them.

Data Sample

| Height | Weight |
|--------|--------|
| 185    | 72     |
| 170    | 56     |
| 168    | 60     |
| 179    | 68     |
| 182    | 72     |
| 188    | 77     |
| 180    | 71     |
| 180    | 70     |
| 183    | 84     |
| 180    | 88     |
| 180    | 67     |
| 177    | 76     |
Step 1: Input
Dataset, Clustering Variables and Maximum Number of Clusters (K in Means Clustering)
In this dataset, only two variables – height and weight – are considered for clustering (see the data sample above).
Step 2: Initialize cluster centroids
In this example, the value of K is set to 2. The cluster centroids are initialized with the first 2 observations.

| Cluster | Height | Weight |
|---------|--------|--------|
| K1      | 185    | 72     |
| K2      | 170    | 56     |
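This seeding step can be sketched in code (a minimal Python sketch for illustration; the follow-up post uses R, and the variable names here are our own):

```python
# Data: (height, weight) pairs from the sample above.
data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

K = 2
# Initialize the K cluster centroids with the first K observations.
centroids = [list(data[i]) for i in range(K)]
print(centroids)  # [[185, 72], [170, 56]]
```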
Step 3: Calculate Euclidean Distance
Euclidean distance is one of the distance measures used in the K Means algorithm. The Euclidean distance between an observation and each of the initial cluster centroids 1 and 2 is calculated. Each observation is then assigned to the cluster from which its distance is minimum.

First two observations

| Height | Weight |
|--------|--------|
| 185    | 72     |
| 170    | 56     |
Now the initial cluster centroids are:

| Cluster | Height | Weight |
|---------|--------|--------|
| K1      | 185    | 72     |
| K2      | 170    | 56     |

The Euclidean distance from each of the cluster centroids is calculated.

| Euclidean Distance from Cluster 1 | Euclidean Distance from Cluster 2 | Assignment |
|---|---|---|
| √((185−185)² + (72−72)²) = 0 | √((185−170)² + (72−56)²) = 21.93 | 1 |
| √((170−185)² + (56−72)²) = 21.93 | √((170−170)² + (56−56)²) = 0 | 2 |
We considered only these two observations for assignment because we already knew their assignments. There is no change in the centroids, as these same two observations were used as the initial centroids.
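The distance calculation and minimum-distance assignment in Step 3 can be sketched as follows (a minimal Python sketch for illustration; the function name is our own):

```python
import math

def euclidean(a, b):
    # Straight-line distance between two (height, weight) points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

centroids = [(185, 72), (170, 56)]
obs = (185, 72)

distances = [euclidean(obs, c) for c in centroids]
# Assign to the cluster with the minimum distance (0-based index).
cluster = distances.index(min(distances))
print(round(distances[1], 2))  # 21.93, matching the table above
```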
Step 4: Move on to next observation and calculate Euclidean Distance
| Height | Weight |
|--------|--------|
| 168    | 60     |
| Euclidean Distance from Cluster 1 | Euclidean Distance from Cluster 2 | Assignment |
|---|---|---|
| √((168−185)² + (60−72)²) = 20.809 | √((168−170)² + (60−56)²) = 4.472 | 2 |
Since the distance from cluster 2 is the minimum, the observation is assigned to cluster 2. Now revise the cluster centroid: the mean values of Height and Weight of the cluster's members become the cluster centroid. The addition is only to cluster 2, so only the centroid of cluster 2 is updated.
Updated cluster centroids
| Cluster | Height | Weight |
|---------|--------|--------|
| K1      | 185    | 72     |
| K2      | (170+168)/2 = 169 | (56+60)/2 = 58 |
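The centroid update above is simply the mean of the cluster's current members, which can be sketched as (an illustrative Python helper, not from the original post):

```python
def update_centroid(members):
    # New centroid = per-dimension mean of the observations in the cluster.
    n = len(members)
    return tuple(sum(p[i] for p in members) / n for i in range(len(members[0])))

# Cluster 2 after (168, 60) joins its initial member (170, 56):
print(update_centroid([(170, 56), (168, 60)]))  # (169.0, 58.0)
```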
Step 5: Calculate the Euclidean distance for the next observation, assign it to the cluster with the minimum distance, and update the cluster centroids.
Next Observation.
| Height | Weight |
|--------|--------|
| 179    | 68     |
Euclidean Distance Calculation and Assignment
| Euclidean Distance from Cluster 1 | Euclidean Distance from Cluster 2 | Assignment |
|---|---|---|
| 7.211 | 14.142 | 1 |
Update Cluster Centroid
| Cluster | Height | Weight |
|---------|--------|--------|
| K1      | (185+179)/2 = 182 | (72+68)/2 = 70 |
| K2      | 169    | 58     |
Continue these steps until all observations have been assigned. The assignment and centroid-update steps are then repeated, reassigning each observation to its nearest centroid, until the assignments no longer change.
Final assignments

Cluster Centroids
| Cluster | Height | Weight |
|---------|--------|--------|
| K1      | 181.4  | 74.5   |
| K2      | 169    | 58     |

This matches what we expected initially from the two-dimensional plot.
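The full procedure, in its standard batch form (assign every observation to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing), can be sketched as follows. This is a minimal Python sketch for illustration; the follow-up post implements clustering in R, and the function and variable names here are our own:

```python
import math

def kmeans(data, k, max_iter=100):
    # Seed the centroids with the first k observations, as in Step 2.
    centroids = [list(data[i]) for i in range(k)]
    assignment = [None] * len(data)
    for _ in range(max_iter):
        # 1) Assign every observation to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in data
        ]
        if new_assignment == assignment:  # converged: no point moved
            break
        assignment = new_assignment
        # 2) Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]
centroids, labels = kmeans(data, 2)
# On this sample, cluster 2 keeps only (170, 56) and (168, 60),
# so its centroid converges to (169.0, 58.0).
```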

A few important considerations in K Means

• The scale of measurement influences Euclidean distance, so variable standardisation becomes necessary
• Depending on expectations, you may require outlier treatment
• K Means clustering can be sensitive to the initial centroids, called cluster seeds
• The maximum number of clusters is typically an input and also impacts the clusters that are created
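As an example of the standardisation point above, each variable can be converted to z-scores before clustering so that height (in cm) and weight (in kg) contribute comparably to the distance. A minimal Python sketch (the helper name is our own):

```python
def zscore(column):
    # Standardize a variable: subtract the mean and divide by the
    # (population) standard deviation, giving mean 0 and variance 1.
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

heights = [185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177]
scaled = zscore(heights)
```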

In the next blog, we focus on creating clusters using R. K Means Clustering using R

### 29 thoughts on “K Means Clustering Algorithm: Explained”

1. Excellent Example. No better example found

• Thanks Vishal

2. Very good..example..
but there is a text mistake in step 4.. euclidean distance from cluster 2

• Thanks Nitesh.. We have corrected the spelling.

3. I am a little confused. This is not an accurate depiction of the k-Means algorithm. k-Means algorithm steps:

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.

One iteration:
1. Assign labels (clusters) to all observations
2. Calculate the new Centroid values using mean

In your example you are updating the centroid values even before assigning all the observations to clusters.

• Thanks Kumar for your comment.. I do not think there is any overall approach-wise disconnect between the steps we explained and those mentioned in the link.. If you read Step 3 - it calculates the Euclidean distance, assigns the observation to a cluster, and the cluster centroids are updated. Hope it helps

• @Dnl Institute,
from step 3, the assignment is only done to the new data points, and the centroids are updated. But what if the data points assigned in the previous iterations have to change from one cluster to the other due to the change in the centroids. I mean the euclidian distance changes right. so, there could be a possibility that a data point in one cluster is more closer to the data point in the other cluster, than the data point in the same cluster. I hope my explanation is good.

4. Great, Professional.
Your demo is perfect!

Thanks Bro.

5. Perfect.....
Mistake on calculation
Step 5 - the updated centroid weight value is incorrect, I think.

• Thanks Hazim.. Let us check and correct if required..

6. Great introduction to k-mean clustering

7. Great ,easy to understand

8. Excellent explanation. I was stuck making my own implementation of K Means in R, but this post made it easier to implement. Thank you.

9. There's probably a calculation mistake in the updated centroid values in step 5. But,apart from that, wonderfully explained. Such a complex thing made so easy!

10. The simplest way of learning K Means..

11. Hello,

I have one doubt on K means clustering. How can a team work on the K means clustering algorithm in a real-time project? Since multiple values of K will create multiple clusterings, does only one person work on K means, or how do we use it in a real-time project?

• When a k means clustering project is being done, multiple values of k are considered. There are a few considerations for selecting the final clustering:

Observation % in each of the clusters
R2 value of each of the variable
Overall R2 value and other clustering performance statistics e.g. CCC

12. Thanks, Dnl Institute, for your reply. Do you have any link or site where K means is used like this?

Observation % in each of the clusters
R2 value of each of the variable
Overall R2 value and other clustering performance statistics

• These are practical steps and considerations.. So you may not see a lot of info on internet.

13. I am afraid your K-means method is not correct.
You are not updating the centroids of the clusters every time an observation changes a cluster but instead you update when all observations are assigned a cluster (or not!).

and here (page 388) : http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
Thanks
T.

• Thanks for the comments.. We have advised that it is directional and not a pure research blog. Also, we have mentioned that all objects are reconsidered for the reassignment..

14. What if we only have height.. then how will you apply it?
Like we have transactions of a person .... How to apply k means on that

15. I am afraid this is not a right version of k-means clustering. First, the centroid is revised every time a new data point is assigned; that may be OK. The most serious problem is that the clustering ends after all data points are assigned. Assigning all data points is only one step in k-means clustering; the next step is to update the centroids, and these two steps are repeated until no data point changes cluster.

16. I think there is a mistake in step 5, in the updated centroid: how do 182.8 and 72 come about? There is a calculation mistake. But the rest of the steps are well explained.

17. Good job !

Please fix the table in Step 4: the values used in the calculation of the distance to Cluster 2 are not correct (you used the Cluster 1 values again by mistake).

18. This is great!!!


20. Simplest and clearest example I ever found.

21. Thanks for ever