A decision tree is a hierarchical, tree-like representation of decisions. The decision tree technique iteratively breaks the input data (a node) into two or more samples (child nodes), and this recursive partitioning continues until specified stopping condition(s) are met.
Decision trees are a method for objective segmentation [Segmentation – A Perspective].
The aim of decision-tree-based recursive partitioning is to improve the purity (i.e., reduce the impurity measure) of the output nodes. These output nodes are called child nodes, and the input node is the parent node. An algorithm that splits each parent node into exactly two child nodes builds a binary decision tree.
For example, banks and financial institutions grant credit facilities after evaluating the credit risk involved. This credit risk is evaluated using a credit scorecard [Credit Score: What is it and how is it developed?], and a few additional decisions are involved in credit underwriting [Credit Underwriting: Minimize credit risk losses using Data Science and Analytics].
We have the last 2 years of customer performance in meeting credit obligations. We want to understand which variable(s) explain the high risk of the customers who defaulted on a credit facility granted to them.
The sample has 24 customers. To keep it simple, only customer age and gender are considered: Age is a continuous variable and Gender is a nominal variable.
The input sample has 12 customers who have defaulted on the credit facility, so the overall default rate is 50%.
We want to understand whether customers in a certain age group have a higher chance of defaulting, or whether one gender has a higher default rate than the other.
A decision tree is one of the techniques that can help us answer these questions. The decision tree process has to find the variable, and the cut-off (for numeric variables) or grouping of values (for nominal variables), to use for each split. The aim of the split is to improve the purity of the child nodes, i.e., to separate their default rates more sharply than the parent's 50%.
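As a minimal sketch of how a split candidate can be scored, the code below computes the Gini impurity of the parent node and the weighted impurity of two child nodes. The function names (`gini`, `split_impurity`) are our own, not from any library, and the 10-of-16 male / 2-of-8 female counts are a hypothetical sample consistent with the 63% and 25% rates discussed in this article.

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_default = sum(labels) / n          # labels: 1 = default, 0 = good
    return 1.0 - (p_default ** 2 + (1.0 - p_default) ** 2)

def split_impurity(left, right):
    """Weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Parent node: 12 defaulters out of 24 -> impurity 0.5 (the worst case)
parent = [1] * 12 + [0] * 12
# Hypothetical Gender split: 10/16 males default, 2/8 females default
males = [1] * 10 + [0] * 6
females = [1] * 2 + [0] * 6

print(gini(parent))                    # 0.5
print(split_impurity(males, females))  # 0.4375 -> purer children
```

The split is kept if its weighted child impurity is lower than the parent's; the algorithm tries all candidate variables and cut-offs and picks the split with the lowest score.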
Exploratory analysis shows that the Male group has a higher default rate of 63%, whereas the Female group's rate is 25%. The average age of defaulting customers is around 39 years, compared with an average age of 47 for non-defaulting customers.
A decision tree can help find the cut-off for the Age variable and the interaction effect between Age and Gender. Also, when there are many input variables, exploratory data analysis alone becomes a less effective way to select variables or to find associations with the target variable.
In this example, the Gender variable is selected for partitioning the input sample. After the split, there are two samples (child nodes), one for each gender.
The default rate of one child node has risen to 63%, so that node has become purer. Each of these child nodes is now partitioned further to keep improving purity. Since each child node undergoes the same partitioning process as its parent, the process is called recursive partitioning.
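The recursion can be sketched as follows. `grow` is a hypothetical helper, not a library API: at each node it tries every cut-off of one numeric feature, keeps the cut with the lowest weighted Gini impurity, and recurses until no split improves purity or a child would be too small. The six (age, default) pairs are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a node's class labels (1 = default, 0 = good)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - (p * p + (1.0 - p) * (1.0 - p))

def grow(rows, min_size=2):
    """rows: list of (age, default) pairs. Returns a nested dict tree."""
    labels = [d for _, d in rows]
    node = {"n": len(rows), "default_rate": sum(labels) / len(labels)}
    best = None
    for cut in sorted({age for age, _ in rows}):
        left = [d for age, d in rows if age <= cut]
        right = [d for age, d in rows if age > cut]
        if len(left) < min_size or len(right) < min_size:
            continue  # child node would be too small
        n = len(rows)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        # Keep the cut only if it actually reduces impurity
        if score < gini(labels) and (best is None or score < best[0]):
            best = (score, cut)
    if best is not None:
        _, cut = best
        node["cut"] = cut
        node["left"] = grow([r for r in rows if r[0] <= cut], min_size)
        node["right"] = grow([r for r in rows if r[0] > cut], min_size)
    return node

# Hypothetical node: younger customers default, older ones do not
tree = grow([(25, 1), (30, 1), (35, 1), (42, 0), (48, 0), (55, 0)])
print(tree["cut"], tree["left"]["default_rate"], tree["right"]["default_rate"])
```

On this toy sample the best cut is Age <= 35, producing one child that is all defaulters and one that is all non-defaulters; both children are then pure, so the recursion stops there.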
The left node (Gender = Female) is partitioned on the condition Age > 40, and the right node (Gender = Male) on the condition Age > 50. In this example Age is the only remaining variable, so the left and right nodes are split on the same variable; in practice, all input variables are considered at every node, and the best variable and split point are selected for each node separately.
The default rate for male customers aged below 50 is 77%, compared with the 50% default rate of the overall sample.
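In practice, trees like this are built with a library rather than by hand. The sketch below uses scikit-learn's `DecisionTreeClassifier` (an assumption: scikit-learn may not be the author's tool of choice, and the ten Gender/Age rows and outcomes are entirely made up for illustration, not the article's 24-customer sample).

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: Gender encoded as 0 = Female, 1 = Male; 1 = defaulted
X = [[1, 28], [1, 35], [1, 45], [1, 55], [1, 62],
     [0, 30], [0, 38], [0, 44], [0, 52], [0, 60]]
y = [1, 1, 1, 0, 0,
     0, 1, 0, 0, 0]

# max_depth=2 mirrors the two-level Gender-then-Age tree above
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned split rules as text
print(export_text(clf, feature_names=["Gender", "Age"]))
```

`export_text` prints the fitted rules (e.g., a first split on one feature and a second-level split within each branch), which is a quick way to read a small tree without plotting it.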
This was a very simple example of building a decision tree. We need to understand and answer a few more questions related to decision trees:
- What are the impurity measures in decision tree?
- What are the decision tree algorithms?
- When does decision tree growth stop?
- How to interpret results of a decision tree?
- How do we build decision tree using SAS or SAS Enterprise Miner?
- How do we build decision tree using R?
- Decision Tree using rpart in R