In this post, the **kmeans** function is used for K Means Clustering. The overall approach to K Means Clustering was discussed in an earlier blog post.

#### Using London Athlete Data

```
setwd("~/training")
london <- read.csv("londonT.csv")
londonkm <- kmeans(london[4:5], centers = 4)
```
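Note that **kmeans** starts from randomly chosen centres, so repeated runs can give different clusters. A minimal sketch on synthetic data (the `height`/`weight` columns here are illustrative, not the London file) showing how `set.seed` and `nstart` make the result stable:

```
# kmeans uses random starting centres; set.seed() makes runs reproducible
# and nstart = 25 keeps the best of 25 random starts
set.seed(123)
x <- cbind(height = rnorm(300, mean = 178, sd = 10),
           weight = rnorm(300, mean = 75, sd = 12))
km <- kmeans(x, centers = 4, nstart = 25)
km$size  # number of points in each of the 4 clusters
```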

**k-means output statistics**

The **kmeans** function returns the following components as standard output:

- cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
- centers: A matrix of cluster centres.
- totss: The total sum of squares.
- withinss: Vector of within-cluster sum of squares, one component per cluster.
- tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
- size: The number of points in each cluster.
- iter: The number of (outer) iterations.
- ifault: Integer indicator of a possible algorithm problem; for experts.
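These components can be inspected directly on a fitted object. A minimal sketch on synthetic data (not the London file):

```
# two well-separated synthetic groups
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2)
km$cluster[1:5]  # cluster assignment of the first five points
km$centers       # 2 x 2 matrix of cluster centres
km$size          # number of points in each cluster
# the total sum of squares decomposes into between + within
all.equal(km$totss, km$betweenss + km$tot.withinss)
```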

#### Clustering Performance

We need a mechanism to evaluate the quality of the clusters created.

The SAS FASTCLUS procedure provides:

- Variable-level R Square
- Overall R Square
- Pseudo F
- CCC (Cubic Clustering Criterion)

But **kmeans** does not provide any of these statistics directly. We can write custom functions to calculate them.

We show the steps to calculate overall R Square and Pseudo F.

Overall R Square is the ratio of the between-cluster sum of squares to the total sum of squares. Pseudo F (the Calinski-Harabasz index) is the between-cluster mean square divided by the within-cluster mean square: (betweenss / (k - 1)) / (tot.withinss / (n - k)), where k is the number of clusters and n the number of points.

```
# Overall R Square: ratio of between-cluster to total sum of squares
Overall_R_sq <- londonkm$betweenss/londonkm$totss
# Pseudo F: between-cluster mean square over within-cluster mean square
k <- length(londonkm$size)
n <- sum(londonkm$size)
F_Pseudo <- (londonkm$betweenss/(k - 1))/(londonkm$tot.withinss/(n - k))
Overall_R_sq
```

`## [1] 0.728`

Higher values of overall R Square and Pseudo F indicate better separation among the clusters created.
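These statistics can also guide the choice of k. A sketch on synthetic data using the Calinski-Harabasz form of Pseudo F; the helper `pseudo_f` is illustrative, not part of base R:

```
# three well-separated synthetic groups
set.seed(7)
x <- rbind(matrix(rnorm(150, mean = 0), ncol = 2),
           matrix(rnorm(150, mean = 4), ncol = 2),
           matrix(rnorm(150, mean = 8), ncol = 2))
pseudo_f <- function(km) {
  k <- length(km$size)
  n <- sum(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}
# compute Pseudo F for k = 2..6; higher suggests better separation
sapply(2:6, function(k) pseudo_f(kmeans(x, centers = k, nstart = 10)))
```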

### Profiling and Visualization

Height and weight of the athletes were used for clustering, so first we compare heights and weights across clusters to understand what type of athletes each cluster contains.

```
par(mfrow = c(1, 2))
## Plots on clustering variables
height.m <- aggregate(london$Height, by = list(londonkm$cluster), mean)
names(height.m) <- c("Cluster", "Height")
## Bar Plot: Height
cp <- barplot(height = height.m$Height,
              names.arg = height.m$Cluster,
              xlab = "Cluster", ylab = "Avg Height",
              main = "Height by Clusters",
              col = c("darkorchid1", "firebrick2", "darkseagreen2", "goldenrod1"),
              ylim = c(0, max(height.m$Height) + 10),
              border = NA)
text(cp, max(height.m$Height) + 5, round(height.m$Height, 1), cex = 0.7, col = "black")
weight.m <- aggregate(london$Weight, by = list(londonkm$cluster), mean)
names(weight.m) <- c("Cluster", "Weight")
## Bar Plot: Weight
cp <- barplot(height = weight.m$Weight,
              names.arg = weight.m$Cluster,
              xlab = "Cluster", ylab = "Avg Weight",
              main = "Weight by Clusters",
              col = c("darkorchid1", "firebrick2", "darkseagreen2", "goldenrod1"),
              ylim = c(0, max(weight.m$Weight) + 20),
              border = NA)
text(cp, max(weight.m$Weight) + 10, round(weight.m$Weight, 1), cex = 0.7, col = "black")
```

```
## The cluster centers can also be obtained directly
londonkm$centers
```

```
## Age Height
## 1 27.07 192.4
## 2 23.53 178.5
## 3 33.52 175.3
## 4 24.13 164.3
```

Now, we can leverage other variables for profiling the athletes in each cluster.

- Win Rate
- Gender Distribution

```
par(mfrow = c(1, 1))
# Win indicator: athlete won at least one medal
Win <- london$Gold > 0 | london$Silver > 0 | london$Bronze > 0
Win.Cluster <- table(londonkm$cluster, Win)
# Win rate within each cluster: winners / all athletes in the cluster
Win.Rate.Cluster <- Win.Cluster[, 2] / rowSums(Win.Cluster)
symbols(x = c(1, 2, 3, 4),
        y = Win.Rate.Cluster,
        circles = Win.Rate.Cluster,
        inches = 1/3,
        bg = "firebrick2",
        fg = NULL,
        xaxt = 'n',
        xlab = "Cluster",
        ylab = "Win Rate",
        main = "Win Rate by Cluster")
axis(1,
     at = c(1, 2, 3, 4),
     las = 1,                 # angle of the labels
     labels = c(1, 2, 3, 4),  # x-axis labels
     cex.axis = 1)            # axis label size
```

```
## Gender
gender <- table(london$Sex, londonkm$cluster)
par(xpd = TRUE)
barplot(as.matrix(gender),
        xlab = "Cluster",
        ylab = "Counts",
        col = c("brown1", "cyan1"),
        main = "Sex Distribution by Cluster")
legend("topleft",                   ## position of the legend
       inset = c(0.1, .001),        ## tweak the legend position
       legend = row.names(gender),  ## legend labels
       border = "white",            ## legend box line
       fill = c("brown1", "cyan1"), ## legend colors
       lwd = 0,                     ## line width around legend box
       box.col = "white",           ## legend box
       cex = 0.8,                   ## size of the legend text and box
       bg = NULL)                   ## legend box color; NULL means transparent
```
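The counts above can also be read as within-cluster shares using `prop.table`. A sketch on synthetic labels (random `sex` and `cluster` vectors stand in for the real columns):

```
# synthetic stand-ins for london$Sex and londonkm$cluster
set.seed(1)
sex <- sample(c("F", "M"), size = 200, replace = TRUE)
cluster <- sample(1:4, size = 200, replace = TRUE)
gender <- table(sex, cluster)
# margin = 2 normalises each cluster column so it sums to 1
round(prop.table(gender, margin = 2), 2)
```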

Looking at the plots for sex and win rate, it is clear that the clusters are quite distinct on these variables even though they were not used in cluster creation.
