Classification and Regression Tree (CART) is one of commonly used Decision Tree algorithms. In this post, we will explained the steps of CART algorithm using an example data.
Decision Tree is a recursive partitioning approach and CART split each of the input node into two child nodes, so CART decision tree is Binary Decision Tree. At each level of decision tree, the algorithm identify a condition - which variable and level to be used for splitting input node (data sample) into two child nodes.
CART Algorithm Steps
Decision Tree building algorithm involves a few simple steps and these are:
- Take Labelled Input data - with a Target Variable and a list of Independent Variables
- Best Split: Find Best Split for each of the independent variables
- Best Variable: Select the Best Variable for the split
- Split the input data into Left and Right Nodes
- Continue step 2-4 on each of the nodes until meet stopping criteria
- Decision Tree Pruning : Steps to prune Decision Tree built
In the previous blog, we have explained the Gini Index calculation.
We have considered a sample data for explaining the Decision Tree building process using CART algorithm. The data has a few variables and below is description of these variables.
Parent Node has below split of Target Variable.
Best Split for a Variable
We are considering Last Month Spend, Gender, Education and Last 3 month average spend for the decision tree building. Let's first consider variable "Last Month Spend" for finding out the "Best Split Cut Off" for this variables.
List of potential cut values have to be identified first. One of the common approach is to find splits /cut off point is to take middle values. For this variable, the distinct values and then finding middle points.
Below is an example of find unique values and then taking average value of two adjacent distinct values to find cut off point.
For each of these "Cut off Point" , we need to calculate Gini Index and Gini Index for a Split. The Gini Index Calculations are explained with worked out example - here.
Now we need to find cut off which has the highest Gini Gain(highest drop in Gini Index for a split or highest GiniGain) and pick up the split point as "Best Split Point" for the variable "Last Month Spend".
Similarly, we need to find the best split for each of the input independent variables.
On the next page, we will explain steps to find best split for a character /categorical variable.