A Decision Tree is one of the commonly used exploratory data analysis and objective segmentation techniques.
A great advantage of a Decision Tree is that its output is relatively easy to understand and interpret.
A simple way to understand a decision tree is as a hierarchical approach to partitioning the input data: at each node, one variable and a value of that variable are used for the partition.
If we are working on an objective segmentation problem, our aim is to find conditions which identify segments that are homogeneous with respect to the target variable.
For example, when a customer applies for a credit card, the bank or credit card provider accepts or rejects the application based on the predicted risk (probability of default) for the application.
To build rules for predicting the risk of a credit card application, we can use a Decision Tree. The decision tree can help in finding the segments which have a low risk (default probability).
Since this is an objective segmentation, we need a target/dependent variable. In this case it is whether a customer has defaulted (Bad) or is current with payments (Good) on the credit card.
So, we have input data which contains both good and bad customers. We want to find the rules or conditions which separate good customers from bad customers, because when the bank has to approve or reject an application, it only has the applicant's attributes and needs to estimate/predict the risk of default from those attribute values.
The input dataset has an overall percentage of good customers. The rule(s) should help you find segments with a significantly higher percentage of good customers.
In the example below, we will build a decision tree and examine the rules and their accuracy.
The objective of this blog is to give an overview of the points below:
- Packages for Decision Trees in R
- Decision Tree Classification in R
- CART and CHAID Decision Trees in R
- How to build a Decision Tree in R
- How to plot a Decision Tree in R
Some other useful information on Decision Trees:
CART Decision Tree Algorithm: Gini Index and CART Algorithm Explained with a worked out example
We are using the German Credit data for this example.
# Distribution of Good/Bad Customers
table(GermanCredit$Class)
##
##  Bad Good
##  300  700
In the sample of 1,000 rows/observations, 300 are 'Bad' and 700 are 'Good' customers. So, the aim of the Decision Tree is to find rules which can improve the classification rate.
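For completeness, a minimal sketch of loading the data. This assumes the GermanCredit data frame shipped with the caret package; other copies of the German Credit data use different column names.

```r
# Load the German Credit data bundled with the caret package
# (assumption: caret is installed; the data frame has a Class factor with
# levels Bad/Good)
library(caret)
data(GermanCredit)

# Distribution of the target variable
table(GermanCredit$Class)
```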
rpart for Decision Tree
We are using the rpart package, which implements a CART-type Decision Tree algorithm (Recursive Partitioning and Regression Trees). We can get more information about the package using the library help.
Some of the useful functions are
- rpart: the main function for building Recursive Partitioning and Regression Trees
- plot: a generic function that can be used for plotting an rpart object created by the rpart function
- summary: summarises the rpart decision tree
- predict: once the decision tree is built, this function helps in predicting values using the fitted rpart object
Split the sample
We may want to split the input data into development and validation samples. The validation sample helps in validating the decision tree built on the development sample and in comparing the accuracy of the decision tree rules.
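A simple way to do this split with base R is sketched below. The 70/30 ratio and the seed are arbitrary choices for illustration, and the GermanCredit data frame from the caret package is assumed.

```r
library(caret)
data(GermanCredit)

set.seed(100)                                         # reproducible sampling
n <- nrow(GermanCredit)
dev_rows <- sample(seq_len(n), size = round(0.7 * n)) # 70% for development
dev <- GermanCredit[dev_rows, ]                       # development sample
val <- GermanCredit[-dev_rows, ]                      # validation sample
```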
Build Decision Tree using rpart
Since the dependent variable takes two values, this is an example of classification.
There are a number of different approaches for building a Decision Tree. But typically, at each node a variable is selected for the split/partition, along with the best split value of that variable. This process is repeated at each node until the stopping criteria are met.
The actual process of selecting the variable and finding its split value depends on the algorithm used by the decision tree.
rpart uses the Classification and Regression Tree (CART) algorithm. The Gini impurity measure is used for selecting the variable and the best split value.
A CART-based decision tree is a binary decision tree: at each node, the input data is partitioned into two child nodes.
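To illustrate the Gini measure (a sketch of the arithmetic only, not rpart's internal code): for a node with class proportions p, the Gini impurity is 1 - sum(p^2), and a split is chosen to minimise the weighted impurity of the two child nodes.

```r
# Gini impurity of a node, given its vector of class proportions
gini <- function(p) 1 - sum(p^2)

# Root node of our sample: 70% Good, 30% Bad
gini(c(0.7, 0.3))   # 1 - (0.49 + 0.09) = 0.42

# A pure node has impurity 0; an even 50/50 node has the maximum 0.5
gini(c(1, 0))       # 0
gini(c(0.5, 0.5))   # 0.5
```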
dt1 <- rpart(Class ~ ., data = dev)
summary(dt1)
In the above code, the rpart function is used to build a decision tree on the data frame dev. The target variable is Class, and all the other variables in the data frame are used as independent variables.
The size of the decision tree can be controlled using the control option (rpart.control) of the rpart function. The minimum number of observations a node must have to be considered for a split is given using minsplit.
minbucket gives the minimum number of observations a child node must have for a split rule to be kept.
There are a few other options to control the Decision Tree building process, e.g. control = rpart.control(cp = ..., minsplit = ...).
dt2 <- rpart(Class ~ ., control = rpart.control(minsplit = 50, maxdepth = 5), data = dev)
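Using the fitted tree, a quick accuracy check on the validation sample can be sketched as below (prediction is covered in detail in the next blog). The sketch is self-contained, so it repeats the earlier loading, splitting, and fitting steps; the 70/30 split and seed are arbitrary illustration choices.

```r
# Assumes the caret and rpart packages are installed
library(caret)
library(rpart)
data(GermanCredit)

# Recreate the development/validation split
set.seed(100)
dev_rows <- sample(seq_len(nrow(GermanCredit)), size = 700)
dev <- GermanCredit[dev_rows, ]
val <- GermanCredit[-dev_rows, ]

# Fit the tree as above
dt2 <- rpart(Class ~ ., data = dev,
             control = rpart.control(minsplit = 50, maxdepth = 5))

# Predicted class for each validation observation
pred <- predict(dt2, newdata = val, type = "class")
table(Predicted = pred, Actual = val$Class)   # confusion matrix
mean(pred == val$Class)                        # classification rate
```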
Plotting Decision Tree Output
We can use plot and text to plot and label the Decision Tree output.
# Decision Tree Dendrogram
plot(dt2, uniform = T, compress = T, margin = 0.2, branch = 0.3)
# Label the Decision Tree
text(dt2, use.n = T, digits = 3, cex = 0.6)
If we want to see the labels of a Decision Tree object created by rpart, we can use the labels function.
labels(dt2)
##  "root"                            "CheckingAccountStatus.none< 0.5"
##  "Duration>=15.5"                  "SavingsAccountBonds.lt.100>=0.5"
##  "X>=444"                          "X< 444"
##  "CreditHistory.PaidDuly< 0.5"     "CreditHistory.PaidDuly>=0.5"
##  "SavingsAccountBonds.lt.100< 0.5" "Duration< 15.5"
##  "CheckingAccountStatus.none>=0.5"
In the next blog, we will cover how to predict the target value for another dataset. Read the next blog on CART Decision Trees.
DnI Institute offers a wide range of Data Science and Advanced Analytics trainings based on industry examples and case studies.