K Means Clustering- Summary Statistics and Visualization

In this post, the kmeans function in R is used for K Means Clustering. The overall approach used in K Means Clustering was discussed in an earlier blog post.

Using London Athlete Data

setwd("~/training")

london <-read.csv("londonT.csv")

londonkm <- kmeans(london[4:5],
                   centers= 4)
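
Note that kmeans starts from randomly chosen centers, so the cluster numbering (and occasionally the solution itself) can change between runs. A minimal sketch of a reproducible, multi-start version of the same call; the seed value and nstart below are arbitrary choices, not part of the original analysis:

set.seed(123)                     # arbitrary seed, for reproducibility only
londonkm <- kmeans(london[4:5],
                   centers = 4,
                   nstart  = 25)  # try 25 random starts and keep the best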

k-means Output Statistics

The kmeans function returns the following components as its standard output (the short snippet after the list shows how to access them).

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centres.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e., totss - tot.withinss.
  • size: The number of points in each cluster.
  • iter: The number of (outer) iterations.
  • ifault: Integer indicator of a possible algorithm problem (for experts).
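
All of these components are accessed with the $ operator on the object returned by kmeans; a quick illustration on the londonkm object created above:

str(londonkm)          # overview of every component in one go
londonkm$size          # number of athletes in each cluster
londonkm$tot.withinss  # total within-cluster sum of squares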

Clustering Performance

We need a mechanism to assess how well separated the clusters we created are.

The SAS FASTCLUS procedure provides:

*   Variable-level R-Square
*   Overall R-Square
*   Pseudo F
*   CCC (Cubic Clustering Criterion)

kmeans, however, does not provide any of these statistics out of the box. We can compute them ourselves with a few lines of code.

The steps below calculate the overall R-Square and the Pseudo F statistic. Overall R-Square is the ratio of the between-cluster sum of squares to the total sum of squares. Pseudo F compares the between-cluster variance to the within-cluster variance: (betweenss / (k - 1)) divided by (tot.withinss / (n - k)), where k is the number of clusters and n the number of points.

# Overall R-Square: ratio of between-cluster SS to total SS
Overall_R_sq <- londonkm$betweenss/londonkm$totss

# Pseudo F: between-cluster mean square over within-cluster mean square
k <- length(londonkm$size)   # number of clusters
n <- sum(londonkm$size)      # number of points
F_Pseudo <- (londonkm$betweenss/(k-1))/(londonkm$tot.withinss/(n-k))

Overall_R_sq
## [1] 0.728

The higher the overall R-Square and Pseudo F statistics, the better the separation among the clusters created.
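
Since both statistics come straight from the kmeans output components, it is easy to compare them across candidate numbers of clusters. A sketch (the range 2:8, the seed and nstart are arbitrary choices):

# Overall R-Square and Pseudo F for a range of cluster counts
set.seed(123)
for (k in 2:8) {
  km   <- kmeans(london[4:5], centers = k, nstart = 25)
  n    <- sum(km$size)
  r_sq <- km$betweenss / km$totss
  f_ps <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
  cat("k =", k, " R-Square =", round(r_sq, 3), " Pseudo F =", round(f_ps, 1), "\n")
}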

Profiling and Visualization

Height and weight of the athletes were used for clustering, so first we compare heights and weights across clusters to understand what type of athletes each cluster contains.

par(mfrow=c(1,2))

## Plots on clustering variables
height.m <- aggregate(london$Height, by=list(londonkm$cluster), mean)
names(height.m) <- c("Cluster","Height") 

## Bar Plot: Height
cp <-barplot(height=height.m$Height ,
        names.arg=height.m$Cluster,
        xlab="Cluster",
        ylab="Avg Height",
        main="Height by Clusters",
        col=c("darkorchid1","firebrick2","darkseagreen2","goldenrod1"),
        ylim= c(0,max(height.m$Height)+10 ) ,
        border=NA 
)
text(cp, (max(height.m$Height)+5), round(height.m$Height, 1),cex=0.7,col="black") 

weight.m <- aggregate(london$Weight, by=list(londonkm$cluster), mean)
names(weight.m ) <- c("Cluster","Weight") 

## Bar Plot: Weight
cp <-barplot(height=weight.m$Weight ,
        names.arg=weight.m$Cluster,
        xlab="Cluster",
        ylab="Avg Weight",
        main="Weight by Clusters",
        col=c("darkorchid1","firebrick2","darkseagreen2","goldenrod1"),
        ylim= c(0,max(weight.m$Weight)+20 ) ,
        border=NA 
)
text(cp, (max(weight.m$Weight)+10), round(weight.m$Weight, 1),cex=0.7,col="black")
(Figure: bar plots of average Height and Weight by cluster.)
## The cluster centers can also be obtained directly
londonkm$centers
##     Age Height
## 1 27.07  192.4
## 2 23.53  178.5
## 3 33.52  175.3
## 4 24.13  164.3
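
Bar plots of cluster means hide the spread within each cluster. Boxplots of the clustering variables by cluster give a fuller picture; a minimal sketch using the same columns:

## Distribution of the clustering variables by cluster
par(mfrow=c(1,2))
boxplot(london$Height ~ londonkm$cluster,
        xlab="Cluster", ylab="Height", main="Height by Cluster")
boxplot(london$Weight ~ londonkm$cluster,
        xlab="Cluster", ylab="Weight", main="Weight by Cluster")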

Now, we can leverage other variables for profiling the athletes in each cluster.

*   Win Rate
*   Gender Distribution

par(mfrow=c(1,1))
# Win variable: TRUE if the athlete won at least one medal
Win <- london$Gold > 0 | london$Silver > 0 | london$Bronze > 0
Win.Cluster <- table(londonkm$cluster, Win)
# Win rate per cluster: % of athletes in the cluster with at least one medal
Win.Rate.Cluster <- 100 * Win.Cluster[,"TRUE"] / rowSums(Win.Cluster)

symbols(x=c(1,2,3,4), 
        y=Win.Rate.Cluster, 
        circles=Win.Rate.Cluster, 
        inches=1/3, 
        bg="firebrick2", 
        fg=NULL,
        xaxt='n',
        xlab="Cluster",
        ylab="Win Rate (%)",
        main="Win Rate by Cluster"
        )
        
axis(1, 
     at=c(1,2,3,4), 
     las=1,  # Angle of the label
     labels=c(1,2,3,4), # label of x axis
     cex.axis=1 #axis label size
     )

(Figure: bubble plot of Win Rate by Cluster.)

## Gender
gender <- table(london$Sex,londonkm$cluster)
par(xpd=TRUE)
barplot(as.matrix(gender),
        xlab="Cluster",
        ylab="Counts",
        col=c("brown1","cyan1"),
        main="Sex Distribution by Gender"
        )
legend( "topleft",  ## position of the legend
        inset=c(0.1,.001), ## allow tweaking of the legend position
        legend=row.names(gender), ## legends
        border="white", ## legend box line
       fill=c("brown1","cyan1"), ## legend colors
       lwd=0, ## line width around legend box
       box.col="white"  , ## legend box
       cex =0.8,  ## size of the legend text and box
       bg=NULL  ## legend box color, NULL indicate transparent
       )

(Figure: stacked bar plot of Sex Distribution by Cluster.)
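
Because the clusters differ in size, within-cluster proportions are often easier to read than raw counts. prop.table on the same gender table gives them directly:

## Share of each sex within every cluster (columns sum to 1)
round(prop.table(gender, margin=2), 2)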

Looking at the plots for Sex and Win Rate, it is clear that the clusters are quite distinct on these variables even though they were not used in creating the clusters.
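
As a final step, the separate profiling pieces can be collapsed into a single per-cluster summary table; a sketch, assuming the Height, Weight and medal columns used throughout this post:

## One-row-per-cluster profile of the clustering and profiling variables
profile <- data.frame(
  Cluster    = 1:length(londonkm$size),
  Size       = londonkm$size,
  Avg.Height = tapply(london$Height, londonkm$cluster, mean),
  Avg.Weight = tapply(london$Weight, londonkm$cluster, mean),
  Win.Rate   = tapply(Win, londonkm$cluster, mean)*100  # Win defined above
)
profile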


 
