In the previous posts, we explained how to select the best split for each independent variable. Now we need to select the best variable, again using the Gini index value as the criterion.

For each independent variable, we have its best split and the corresponding Gini index value. Here is the table.

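The variable Spend in the Last Month (`Last_Month_spend`) has the highest Gini index value, so we select it for splitting the input data. The split condition is "**Last_Month_spend <75**"; it is applied to the input data, which is split into two child nodes.

Note that "highest" is the right choice only when the Gini index is computed as the weighted sum of squared class proportions, where a larger value means purer child nodes. With the Gini impurity form (one minus that sum), the best split is the one with the lowest value. Both conventions select the same split.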
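The selection step can be sketched in R as below. This is a minimal illustration, not the calculation from the earlier posts: the `toy` data frame, its column names, and the two candidate splits are hypothetical, and the helper functions `gini_purity` and `split_gini` are introduced here only for the example. It assumes the Gini index is the weighted sum of squared class proportions and also shows the impurity form for comparison.

```r
# Gini "purity" of a binary node: p^2 + (1 - p)^2, where p is the share of target = 1.
gini_purity <- function(y) {
  p <- mean(y == 1)
  p^2 + (1 - p)^2
}

# Weighted Gini purity of a split: each child's purity weighted by its share of rows.
split_gini <- function(y, go_left) {
  w_left <- mean(go_left)
  w_left * gini_purity(y[go_left]) + (1 - w_left) * gini_purity(y[!go_left])
}

# Hypothetical toy data (not the dataset used in this post).
toy <- data.frame(
  target = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0),
  spend  = c(10, 20, 30, 80, 90, 100, 110, 120, 130, 140),
  gender = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)

# One candidate (best) split per variable, expressed as a logical "goes left" vector.
candidates <- list(
  "spend < 75"    = toy$spend < 75,
  "gender == 'M'" = toy$gender == "M"
)

purity   <- sapply(candidates, function(g) split_gini(toy$target, g))
impurity <- 1 - purity  # Gini impurity form of the same quantity

best_by_purity   <- names(which.max(purity))    # highest purity wins
best_by_impurity <- names(which.min(impurity))  # lowest impurity wins

print(round(purity, 4))
print(best_by_purity)
print(best_by_impurity)
```

Because the child weights sum to one, the weighted impurity is exactly one minus the weighted purity, so picking the highest value under one convention and the lowest under the other always lands on the same variable.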

The left node holds 6% of the observations and the right node 94%. The share of observations with target variable = 1 was 26.3% in the parent node and rises to 74.1% in the left node. In other words, a low spend in the last month is a good indication that the customer will reduce spend over the next 3 months.
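The node shares and target rates quoted above can be computed with a few lines of R. The sketch below uses a small hypothetical data frame (`toy`, with columns `target` and `spend`) and a hypothetical helper `node_summary`, so its numbers will not match the 6% / 94% and 26.3% / 74.1% figures from the real data.

```r
# Summarise a binary split: share of rows and target rate in parent, left and right nodes.
node_summary <- function(y, go_left) {
  data.frame(
    node          = c("parent", "left", "right"),
    share_of_rows = c(1, mean(go_left), mean(!go_left)),
    target_rate   = c(mean(y == 1), mean(y[go_left] == 1), mean(y[!go_left] == 1))
  )
}

# Hypothetical toy data (not the dataset used in this post).
toy <- data.frame(
  target = c(1, 1, 0, 0, 0, 0, 0, 1, 0, 0),
  spend  = c(20, 40, 60, 80, 90, 100, 110, 120, 130, 140)
)

summary_tbl <- node_summary(toy$target, toy$spend < 75)
print(summary_tbl)
```

Comparing the `target_rate` of each child with the parent shows how much the split concentrates the target = 1 cases on one side.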

Once we have split the input data into two samples, the left node and the right node, we perform the same steps again on each of these samples to grow the decision tree further. The process of splitting each node into child nodes continues until a stopping criterion is met.
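In `rpart`, stopping criteria are set through `rpart.control()`. The sketch below is only an illustration of how such limits are passed: it uses the `kyphosis` dataset that ships with the `rpart` package rather than the data from this post, and the particular values of `minsplit`, `cp` and `maxdepth` are arbitrary choices for the example.

```r
library(rpart)

# Stopping criteria (values chosen only for illustration):
#   minsplit - minimum number of rows a node must have before a split is attempted
#   cp       - a split must improve the fit by at least this factor to be kept
#   maxdepth - maximum depth of any node, with the root counted as depth 0
ctrl <- rpart.control(minsplit = 20, cp = 0.01, maxdepth = 2)

# kyphosis is an example dataset bundled with rpart, used here as a stand-in.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class", control = ctrl)

# Node numbers are stored as row names of fit$frame; the children of node n
# are 2n and 2n + 1, so a node's depth is floor(log2(node number)).
node_ids   <- as.numeric(rownames(fit$frame))
node_depth <- floor(log2(node_ids))
print(fit)
print(max(node_depth))
```

Tightening any of these limits stops the recursive splitting earlier and yields a smaller tree.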

Now we want to build a decision tree using the rpart package in R and validate our steps.

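    # Read data
    dt_data <- read.csv("dt_data.csv")

    # Build a decision tree
    library(rpart)

    # Distribution of the target variable and the available column names
    table(dt_data$Spend_Drop_over50pct) / nrow(dt_data)
    names(dt_data)

    # method is not specified, so rpart picks it from the response type
    # ("anova" for a numeric response, "class" for a factor)
    dt1 <- rpart(Spend_Drop_over50pct ~ Gender + Education_level +
                   Last_Month_spend + Last_3m_avg_spend,
                 data = dt_data)

    # Plot the fitted tree
    library(rpart.plot)
    rpart.plot(dt1)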

Here is the decision tree built on the sample dataset using rpart (which implements the CART algorithm).

If you look at the first split condition, it is the same as the condition we arrived at using the manual steps.
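Besides reading the plot, the first split can also be checked programmatically from the fitted object. The sketch below again uses the `kyphosis` dataset bundled with `rpart` as a stand-in, since `dt_data.csv` is not available here; the same expressions should apply to `dt1` under the assumption that it is a standard `rpart` fit with a numeric root split.

```r
library(rpart)

# Stand-in model on a dataset bundled with rpart (not the data from this post).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# The first row of fit$frame describes the root node; `var` is its split variable.
root_var <- as.character(fit$frame$var[1])

# fit$splits lists the splits node by node, primary split first, so its first
# row is the root's primary split; the "index" column holds the cut point
# for a numeric variable.
root_split_name  <- rownames(fit$splits)[1]
root_split_point <- fit$splits[1, "index"]

print(root_var)
print(root_split_point)
```

Comparing `root_var` and `root_split_point` with the manually derived variable and threshold gives a direct check that does not depend on reading the plot.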

Should we choose the highest gini or the lowest? Which one is correct?

Shouldn't we choose the lowest Gini value instead of the highest one?