Classification problems can be approached through objective segmentation and subjective segmentation. A non-technical explanation ( http://dniinstitute.in/blogs/segmentationaperspective2/ ) covers when to use a subjective segmentation technique such as K-Means clustering and when to use an objective segmentation method such as a decision tree.
K-Means is one of the most frequently used unsupervised algorithms. K-Means clustering is an exploratory data analysis technique and a non-hierarchical method of grouping objects together.
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters).
In this blog, we aim to explain the algorithm in simple steps and with an example.
Business Scenario: We have height and weight information for a set of people. Using these two variables, we need to group the observations into clusters.
If you plot height against weight, you will see two visible clusters/segments, and we want the K-Means algorithm to identify them.
Data Sample

Height   Weight
185      72
170      56
168      60
179      68
182      72
188      77
180      71
180      70
183      84
180      88
180      67
177      76
Step 1: Dataset, Clustering Variables and Maximum Number of Clusters (K in K-Means Clustering)
The clustering variables are Height and Weight, and the maximum number of clusters is set to K = 2.
Step 2: Select Initial Cluster Centroids
The first two observations are taken as the initial centroids (cluster seeds).

Cluster   Initial Centroid (Height, Weight)
K1        (185, 72)
K2        (170, 56)
Euclidean distance is one of the distance measures used in the K-Means algorithm. The Euclidean distance between each observation and the initial cluster centroids 1 and 2 is calculated, and each observation is assigned to the cluster with the minimum distance.
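As a quick check in Python (the article itself contains no code; the helper name `euclidean` is ours), the distance between the second observation (170, 56) and the first seed (185, 72) works out to the 21.93 used below:

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two points given as (height, weight) tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance of observation (170, 56) from the initial centroid K1 = (185, 72):
print(round(euclidean((170, 56), (185, 72)), 2))  # 21.93
```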
Step 3: Calculate Distances and Assign the First Two Observations

Observation (185, 72):
  Distance from Cluster 1: √((185-185)² + (72-72)²) = 0
  Distance from Cluster 2: √((185-170)² + (72-56)²) = 21.93
  Assignment: Cluster 1

Observation (170, 56):
  Distance from Cluster 1: √((170-185)² + (56-72)²) = 21.93
  Distance from Cluster 2: √((170-170)² + (56-56)²) = 0
  Assignment: Cluster 2

Since each cluster still contains only its seed observation, the centroids are unchanged:

Cluster   Updated Centroid (Height, Weight)
K1        (185, 72)
K2        (170, 56)

Step 4: Assign the Next Observation

Observation (168, 60):
  Distance from Cluster 1: √((168-185)² + (60-72)²) = 20.808
  Distance from Cluster 2: √((168-170)² + (60-56)²) = 4.472
  Assignment: Cluster 2
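The nearest-centroid assignment can be sketched in Python (function names are ours, not from the blog): compute the distance to every current centroid and pick the smallest.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(point, centroids):
    """Return the 0-based index of the nearest centroid."""
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances))

centroids = [(185.0, 72.0), (170.0, 56.0)]  # current centroids of clusters 1 and 2
cluster = assign((168, 60), centroids)
print(cluster + 1)  # 2 -> the observation joins cluster 2
```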

The observation joins cluster 2, so the cluster 2 centroid is updated to the mean of its two members:

Cluster   Updated Centroid (Height, Weight)
K1        (185, 72)
K2        ((170+168)/2, (56+60)/2) = (169, 58)

Step 5: Assign the Next Observation

Observation (179, 68):
  Distance from Cluster 1: √((179-185)² + (68-72)²) = 7.211
  Distance from Cluster 2: √((179-169)² + (68-58)²) = 14.142
  Assignment: Cluster 1

The cluster 1 centroid is updated to the mean of its two members:

Cluster   Updated Centroid (Height, Weight)
K1        ((185+179)/2, (72+68)/2) = (182, 70)
K2        (169, 58)

The remaining observations are assigned in the same way, updating the centroid of the receiving cluster after each assignment; all observations are then reconsidered for reassignment until cluster membership no longer changes. With ten observations in cluster 1 and two in cluster 2, the final centroids are:

Cluster   Final Centroid (Height, Weight)
K1        (181.4, 74.5)
K2        (169, 58)

This matches the two clusters we expected from the two-dimensional plot.
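The whole procedure can be sketched end to end in Python (the blog has no code until the follow-up R post; all names here are ours). This sketch uses the standard batch iteration, assigning every observation to its nearest centroid and then recomputing centroids until nothing moves, rather than the observation-by-observation updates shown above; on this dataset both arrive at the same two groups.

```python
from math import sqrt

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iter=100):
    """Batch K-Means; assumes no cluster ever becomes empty, as here."""
    labels = []
    for _ in range(max_iter):
        # Step 1: assign every observation to its nearest centroid
        labels = [min(range(len(centroids)),
                      key=lambda k: euclidean(p, centroids[k])) for p in points]
        # Step 2: recompute each centroid as the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(data, [(185.0, 72.0), (170.0, 56.0)])
print(centroids)  # [(181.4, 74.5), (169.0, 58.0)]
```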
A few important considerations in K-Means:
- The scale of measurement influences Euclidean distance, so variable standardisation becomes necessary.
- Depending on expectations, you may require outlier treatment.
- K-Means clustering can be sensitive to the initial centroids, called cluster seeds.
- The maximum number of clusters is typically an input and also impacts the clusters that are created.
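On the first point, a minimal sketch of z-score standardisation in Python (names ours, not from the blog), applied to the article's height and weight columns before computing distances:

```python
from math import sqrt

def standardise(values):
    """z-score: subtract the mean, divide by the population standard deviation."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

heights = [185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177]
weights = [72, 56, 60, 68, 72, 77, 71, 70, 84, 88, 67, 76]

# Each variable now has mean 0 and standard deviation 1, so neither
# dominates the Euclidean distance simply because of its units.
scaled = list(zip(standardise(heights), standardise(weights)))
```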
In the next blog, we focus on creating clusters using R: K-Means Clustering using R.
Excellent Example. No better example found
Thanks Vishal
Very good example, but there is a text mistake in step 4: the Euclidean distance from cluster 2.
Thanks Nitesh. We have corrected the spelling.
I am a little confused. This is not an accurate depiction of the K-Means algorithm. K-Means algorithm steps:
K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
One iteration:
1. Assign labels (clusters) to all observations
2. Calculate the new centroid values using the mean
Ref: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
In your example you are updating the centroid values even before assigning all the observations to clusters.
Please clarify.
Thanks Kumar for your comment. I do not think there is any overall approach-wise disconnect between the steps we explained and those mentioned in the link. If you read Step 3, it calculates the Euclidean distance, assigns the observation to a cluster, and updates the cluster centroids. Hope it helps.
@Dnl Institute,
From step 3 onward, the assignment is only done for the new data points, and the centroids are updated. But what if the data points assigned in previous iterations have to change from one cluster to the other because of the change in the centroids? The Euclidean distances change, so a data point in one cluster could end up closer to the centroid of the other cluster. I hope my explanation is clear.
Great, Professional.
Your demo is perfect!
Thanks Bro.
Perfect.....
Mistake on calculation
Step 5's updated centroid weight value is incorrect, I think.
Thanks Hazim. Let us check and correct if required.
Great introduction to K-Means clustering
Great, easy to understand
Excellent explanation. I was stuck making my own implementation of K-Means in R, but this post makes it easier to implement. Thank you.
There's probably a calculation mistake in the updated centroid values in step 5. But apart from that, wonderfully explained. Such a complex thing made so easy!
The simplest way of learning K-Means.
Hello,
I have one doubt about K-Means clustering. How can a team work on the K-Means clustering algorithm in a real-time project? If multiple values of K are tried, multiple clusterings are created, so does only one person work on K-Means, or how is this used on a real project?
When a K-Means clustering project is being done, multiple values of K are considered. There are a few considerations used to select the final clustering:
Observation % in each of the clusters
R² value of each of the variables
Overall R² value and other clustering performance statistics, e.g. CCC
Thanks, Dnl Institute, for your reply. Do you have any link or site where K-Means is used like this?
Observation % in each of the clusters
R² value of each of the variables
Overall R² value and other clustering performance statistics
These are practical steps and considerations, so you may not see a lot of information on the internet.
I am afraid your K-Means method is not correct.
You are not updating the centroids of the clusters every time an observation changes cluster; instead you update when all observations are assigned a cluster (or not!).
Have a look here: https://pdfs.semanticscholar.org/99d0/ea088fb5f545c7ab4a0d77b2df7c68f031ae.pdf
and here (page 388): http://wwwbcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
Thanks
T.
Thanks for the comments. We have advised that it is directional and not a pure research blog. Also, we have mentioned that all objects are reconsidered for reassignment.
What if we only have height? Then how will you apply it?
Say we have the transactions of a person... how do we apply K-Means to that?
I am afraid this is not a correct version of K-Means clustering. First, the centroid is revised every time a new data point is assigned; that may be OK. The more serious problem is that after all data points are assigned, the clustering ends. Assigning all data points is only one step in K-Means clustering; the next step is to update the centroids, and these two steps are repeated until no data point changes cluster.
I think there is a mistake in step 5, in the updated centroid: how do 182.8 and 72 come about? There is a calculation mistake. But the rest of the steps are well explained.
Good job !
Please fix the table in Step 4: the values used in the calculation of the distance to Cluster 2 are not correct (you used the Cluster 1 values again by mistake).
This is great!!!
I appreciate your work on Data Science. It's such a wonderful read on a Data Science course. Keep sharing stuff like this. I am also educating people with similar Data Science training, so if you are interested to know more you can watch this Data Science tutorial: https://www.youtube.com/watch?v=h_GnVUIISk0&
Simplest and clearest example I ever found.
Thanks forever