CART Algorithm: Best Split for a Categorical Variable

As with continuous variables, the CART decision tree algorithm also has to find the best split for a categorical variable.

The only difference is how the candidate cut-off values are found. For example, suppose a variable, education, has 4 levels: "University", "Graduate", "High School" and "Others".

We consider all possible two-way splits of these levels as the candidate cut-off points. They are listed here.

1 Level (Left Node) and 3 Levels (Right Node)

{"University"}  and  {"Graduate","High School","Others"}

{"Graduate"}  and  {"University","High School","Others"}

{"High School"}  and  {"Graduate","University","Others"}

{"Others"}  and  {"Graduate","High School","University"}

2 Levels (Left Node) and 2 Levels (Right Node)

{"University","Graduate"} and {"High School","Others"}
{"University","High School"} and {"Graduate","Others"}
{"University","Others"} and {"Graduate","High School"}
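
The listing above can be generated programmatically. Here is a minimal Python sketch, assuming the levels are distinct strings; the function name `two_way_splits` is made up for illustration and is not taken from any CART library.

```python
from itertools import combinations


def two_way_splits(levels):
    """Return every distinct (left, right) two-way split of the given levels."""
    levels = list(levels)
    k = len(levels)
    splits = []
    # The left node only needs sizes 1..k//2; larger left nodes are mirror images.
    for size in range(1, k // 2 + 1):
        for left in combinations(levels, size):
            # When both nodes have the same size, each split shows up twice
            # (once mirrored), so keep only the copy whose left node holds
            # the first level.
            if 2 * size == k and levels[0] not in left:
                continue
            right = tuple(lv for lv in levels if lv not in left)
            splits.append((left, right))
    return splits


education_levels = ["University", "Graduate", "High School", "Others"]
for left, right in two_way_splits(education_levels):
    print(set(left), "and", set(right))
```

For the 4 education levels this yields the same 7 splits as listed: four with 1 level in the left node and three with 2 levels in each node.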

The Gini Index for each of these splits is calculated and compared to select the best split for the categorical variable.

Left Node Split Value          | Gini Index for Split
{"University"}                 | 0.00120231
{"Graduate"}                   | 0.000717221
{"High School"}                | 3.54328E-05
{"Others"}                     | 0.00039842
{"University","Graduate"}      | 0.000125327
{"University","High School"}   | 0.000949849
{"University","Others"}        | 0.00093108
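
To make the comparison step concrete, here is a small Python sketch that scores each candidate split with the standard weighted Gini impurity (lower is better). The rows and the names `gini` and `gini_for_split` are hypothetical and are not the data or code behind the table above. The table may also use a different convention, such as the reduction in impurity, so the numbers are not meant to match.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_for_split(values, labels, left_levels):
    """Weighted Gini impurity when rows with a value in left_levels go left."""
    left = [y for x, y in zip(values, labels) if x in left_levels]
    right = [y for x, y in zip(values, labels) if x not in left_levels]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy data: education level and a binary target for 9 rows.
education = ["University", "University", "University", "Graduate", "Graduate",
             "High School", "High School", "Others", "Others"]
target = [0, 0, 0, 0, 1, 1, 1, 0, 1]

# Left-node level sets for the 7 candidate two-way splits.
left_candidates = [
    {"University"}, {"Graduate"}, {"High School"}, {"Others"},
    {"University", "Graduate"}, {"University", "High School"},
    {"University", "Others"},
]

scores = [(gini_for_split(education, target, left), sorted(left))
          for left in left_candidates]
for score, left in scores:
    print(left, round(score, 4))

# The split with the lowest weighted impurity is chosen.
best_score, best_left = min(scores)
print("best left node:", best_left)
```

On this toy data the split with {"University"} in the left node has the lowest weighted impurity, because that node contains only one class.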

When a variable has a very large number of levels, evaluating every possible split becomes computationally expensive, so one of the implementations limits the number of levels a variable can have.
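
To see how quickly the work grows: a variable with k levels has 2^(k-1) - 1 distinct two-way splits, which gives the 7 splits for the 4 education levels. The short Python sketch below prints this count for a few values of k. The helper name `count_two_way_splits` is made up for illustration.

```python
def count_two_way_splits(k):
    """Number of distinct two-way splits of k levels: 2^(k-1) - 1."""
    if k < 1:
        return 0
    return 2 ** (k - 1) - 1


for k in (4, 10, 20, 32):
    print(k, "levels ->", count_two_way_splits(k), "candidate splits")
```

Each extra level roughly doubles the number of candidate splits, which is why an exhaustive search stops being practical for variables with many levels.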
