# Predict and Analyze Results of a CART Decision Tree

The initial steps to build a Decision Tree are explained and illustrated in our previous blog, Decision Tree using rpart in R.

The steps covered in the previous blog:

• Load Libraries - rpart and rpart.plot
• Read input data and split into development and validation samples
• Build Decision Tree
• Plot Decision Tree

We have also written a few blogs on related questions: What is a Decision Tree? What are the Gini Index and the CART algorithm?

The objective of this blog is to use some of the other R functions to build a decision tree and to explain the decision tree output:

• Improving and Interpreting Decision Tree Output
• Scoring a new dataset
• Calculating Accuracy of Decision Tree Classification

Decision Tree Output

The Decision Tree output of rpart, which implements the CART algorithm, is a binary decision tree: each parent node has exactly two child nodes. CART uses the Gini Index as its impurity measure; you can find a worked-out example of the Gini Index calculation in our earlier blog.
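
As a quick illustration of the impurity measure, the Gini Index of a node is 1 minus the sum of squared class proportions. A minimal sketch in R, using the root-node class counts from this example (178 ‘Bad’, 422 ‘Good’):

```r
# Gini impurity of a node: 1 - sum(p_i^2) over the class proportions
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p ^ 2)
}

# Root node of this example: 178 'Bad' and 422 'Good' observations
round(gini(c(178, 422)), 3)  # 0.417; a perfectly pure node would score 0
```

A split is chosen so that the weighted Gini impurity of the child nodes is lower than that of the parent node.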

A Decision Tree can be plotted using the plot and rpart.plot functions. Based on your preference, you could use either of these functions.

```
library(rpart)       # tree building
library(rpart.plot)  # nicer tree plot

par(mfrow = c(1, 2))
dt2 <- rpart(Class ~ .,
             control = rpart.control(minsplit = 50,
                                     maxdepth = 5),
             data = dev)

# Decision Tree dendrogram
plot(dt2,
     uniform = TRUE,
     compress = TRUE,
     margin = 0.1,
     branch = 0.1)
# Labels on the Decision Tree
text(dt2,
     use.n = TRUE,
     digits = 3,
     cex = 0.6)
# this function requires the rpart.plot package
rpart.plot(dt2)
```

Decision Tree Explanation

The leaf nodes of the output Decision Tree are tagged as ‘Good’ or ‘Bad’ based on the % mix of the target variable (in this case, the Class variable).

In this example, the Decision Tree was built to improve the classification rate of the input data. The input data has 600 observations/records: 178 (29.67%) belong to class ‘Bad’ and 422 (70.33%) to class ‘Good’. So, each split should improve the purity, or classification rate, of the resulting nodes.

To know the classification rate of each node, we can look at the plotted Decision Tree or print the Decision Tree object.

To print the Decision Tree object, we can use the print function, e.g. print(dt2).

```
##
## n= 600
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 600 178 Good (0.2967 0.7033)
##    2) CheckingAccountStatus.none< 0.5 362 152 Good (0.4199 0.5801)
##      4) Duration>=22.5 147  65 Bad (0.5578 0.4422)
##        8) Purpose.UsedCar< 0.5 130  51 Bad (0.6077 0.3923)
##         16) Housing.Rent>=0.5 30   6 Bad (0.8000 0.2000) *
##         17) Housing.Rent< 0.5 100  45 Bad (0.5500 0.4500)
##           34) ResidenceDuration>=1.5 82  33 Bad (0.5976 0.4024) *
##           35) ResidenceDuration< 1.5 18   6 Good (0.3333 0.6667) *
##        9) Purpose.UsedCar>=0.5 17   3 Good (0.1765 0.8235) *
##      5) Duration< 22.5 215  70 Good (0.3256 0.6744)
##       10) Property.Unknown>=0.5 18   5 Bad (0.7222 0.2778) *
##       11) Property.Unknown< 0.5 197  57 Good (0.2893 0.7107)
##         22) Amount< 1290 74  31 Good (0.4189 0.5811)
##           45) Purpose.Radio.Television>=0.5 24   4 Good (0.1667 0.8333) *
##         23) Amount>=1290 123  26 Good (0.2114 0.7886) *
##    3) CheckingAccountStatus.none>=0.5 238  26 Good (0.1092 0.8908) *
##
##             2      3
##   Good 0.5801 0.8908
```

Looking at the output, we get a clear picture of the % split of the target variable at each node.

The Root Node has a 29.67% / 70.33% split between the “Bad” and “Good” values of the Target Variable/Class respectively.

Nodes 2 and 3 are created based on the split CheckingAccountStatus.none < 0.5.

Node 2 has 152 (41.99%) and 210 (58.01%) observations for the ‘Bad’ and ‘Good’ Target Values respectively, an improvement in % Bad from 29.67% to 41.99%. Node 3 has 212 (89.08%) and 26 (10.92%) observations for the ‘Good’ and ‘Bad’ Target Values respectively, a significant improvement in Good %, from 70.33% to 89.08%.
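
These percentages can be re-derived directly from the n and loss columns of the printed tree; a quick check in R:

```r
# Node 2: n = 362, loss = 152 (the 'Bad' cases under the majority label 'Good')
round(152 / 362, 4)          # 0.4199 -> 'Bad' share
round((362 - 152) / 362, 4)  # 0.5801 -> 'Good' share

# Node 3: n = 238, loss = 26
round(26 / 238, 4)           # 0.1092 -> 'Bad' share
round((238 - 26) / 238, 4)   # 0.8908 -> 'Good' share
```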

The next partition is carried out on Node 2, and child nodes 4 and 5 are created based on Duration >= 22.5.

```
##
##             4      5
##   Good 0.4422 0.6744
```

Node 4 has 55.78% and 44.22% Bad and Good target values respectively; again a significant improvement in % Bad, from 41.99% in Node 2 to 55.78%.

Node 5 has 32.56% and 67.44% Bad and Good counts, an improvement in Good % from 58.01% in Node 2 to 67.44%.

This process of partitioning a node into child nodes continues until it meets a stopping criterion. The criteria are given using the control argument and the rpart.control function in rpart.
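
For reference, a minimal sketch of passing stopping criteria through rpart.control; the parameter values here are illustrative, not tuned recommendations:

```r
library(rpart)

# Illustrative stopping criteria
ctrl <- rpart.control(
  minsplit = 50,   # a node needs at least 50 observations before a split is attempted
  maxdepth = 5,    # maximum depth of any node of the tree
  cp       = 0.01  # a split must improve the fit by this complexity factor
)

# dt2 <- rpart(Class ~ ., data = dev, control = ctrl)
```

Relaxing these values grows a larger, more detailed tree; tightening them stops the partitioning earlier.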

Now, what does it all mean? It means that we should select the rules (from the root node to a specific leaf node) which give a better classification rate. In this example, we want to accept applications which have a low default rate (a higher % of Good).

So, Node 3 has 89.08%, Node 45 has 83.33% and Node 9 has 82.35% ‘Good’ applicants. The rules which lead to these nodes can be used for screening and approving incoming applications.

Decision Tree Validation and Predicting Class on Validation Sample

The rules identified so far are built on a development sample. We should validate the rules on a validation sample.

The predict function can be used to classify validation-sample observations as ‘Good’/‘Bad’ using the rules developed. By default it generates probabilities for ‘Good’ and ‘Bad’; with type = "class" it assigns the predicted class directly.

```
# class probabilities for each validation observation
p <- predict(dt2, val)
# predicted class, assigned with type = "class"
val$Predicted.Class <- predict(dt2, val, type = "class")
```

Then we can compare the actual and predicted classifications.

```
t <- table(val$Predicted.Class, val$Class)

Accuracy <- (t[1,1] + t[2,2]) / (t[1,1] + t[1,2] + t[2,1] + t[2,2])
```
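
Equivalently, accuracy can be computed from the diagonal of the confusion matrix, which also generalizes beyond 2x2 tables. A small sketch with hypothetical counts (chosen here to reproduce the 74.75% figure; the real table comes from table(val$Predicted.Class, val$Class)):

```r
# Accuracy = correctly classified / total observations
accuracy <- function(confusion) sum(diag(confusion)) / sum(confusion)

# Hypothetical 2x2 confusion table: rows = predicted, columns = actual
m <- matrix(c(220, 55, 46, 79), nrow = 2,
            dimnames = list(c("Bad", "Good"), c("Bad", "Good")))
accuracy(m)  # (220 + 79) / 400 = 0.7475
```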

The overall accuracy of the Decision Tree on the validation sample is 74.75%.

Other statistics and statistical graphs for assessing accuracy and model performance are:

• ROC Curve
• Lift Chart
• Gini
• KS
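
As a sketch of how two of these KPIs can be computed in base R from predicted ‘Good’ probabilities (the toy vectors below are illustrative; in practice use the probability column returned by predict and the actual classes from the validation sample):

```r
# Toy scores and actual classes, for illustration only
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
class <- c("Good", "Good", "Bad", "Good", "Good", "Bad", "Bad", "Bad")

# AUC via the rank-sum (Mann-Whitney) formulation
r      <- rank(prob)
n_good <- sum(class == "Good")
n_bad  <- sum(class == "Bad")
auc    <- (sum(r[class == "Good"]) - n_good * (n_good + 1) / 2) / (n_good * n_bad)
gini   <- 2 * auc - 1   # Gini coefficient derived from AUC

# KS: maximum gap between the cumulative score distributions of the two classes
ks <- max(abs(ecdf(prob[class == "Good"])(prob) - ecdf(prob[class == "Bad"])(prob)))

c(auc = auc, gini = gini, ks = ks)  # 0.875, 0.75, 0.75 for this toy data
```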

These measures are discussed in a detailed blog on performance KPIs for a predictive model.