Step by Step Tutorial on Decision Tree using Python

In this blog post, we walk through the steps of building a decision tree in Python using a Jupyter Notebook. If you would like to learn the decision tree algorithm itself, we have a separate tutorial on "Decision Tree Algorithm - CART".

We use the same data as in that tutorial to explain the steps involved in building a decision tree. The high-level steps are:

  1. Reading Data
  2. Preparing Data for the Decision Tree
  3. Splitting the Sample into Test and Train Samples
  4. Fitting the Decision Tree Classifier
  5. Scoring a New Data Sample
  6. Visualizing the Decision Tree

Reading Data

We have a sample data file which has a few independent variables and a binary target variable. The file is in CSV format.

We use the pandas library to read the file and create a data frame.

# Read data from the CSV file
import pandas as pd
binary = pd.read_csv('http://dni-institute.in/blogs/wp-content/uploads/2017/07/dt_data.csv')
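If the URL is not reachable, the same reading step can be tried on a small in-memory CSV. The sketch below is only an illustration: the rows and every column name except Spend_Drop_over50pct are made up and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. The rows and all column names
# other than Spend_Drop_over50pct are invented for illustration.
csv_text = """Cust_ID,Age,Balance,Spend_Drop_over50pct
1,34,1200,0
2,51,300,1
3,45,870,0
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape
column_names = list(sample.columns)

print(n_rows, n_cols)
print(sample.dtypes)
```

Checking the shape and the column types right after reading is a quick way to spot stray columns before modeling.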

Explore Data

We should explore the data in order to understand it. The describe() function gives summary statistics for a pandas data frame object (by default, for its numeric columns). Here, we have created the "binary" data frame and call describe() on it to get summary statistics for its variables.

binary.describe()

The result is not perfect, of course. The column "Unnamed: 3" is not relevant and should be removed. We also want to print a few rows to get to know the data better, and it is important to check the value distribution of the target variable.

# List the columns
binary.columns
# Drop the irrelevant column
binary.drop('Unnamed: 3', axis=1, inplace=True)
# Recode the target variable from {0, 1} to {'A', 'B'}
binary['Spend_Drop_over50pct'] = binary['Spend_Drop_over50pct'].replace({0: 'A', 1: 'B'})

# Print a few rows
binary.head()

# Count Target Variable Values
binary.Spend_Drop_over50pct.value_counts()
# Find % Values of Target Variable Levels
round(binary.Spend_Drop_over50pct.value_counts() * 100 / len(binary), 2)

So the target variable has roughly a 74% to 26% split between the levels [0, 1], now labelled ['A', 'B']:

A    73.68
B    26.32
Name: Spend_Drop_over50pct, dtype: float64
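The percentage calculation can also be reproduced on a toy series. This is a minimal sketch, assuming a made-up target of 14 zeros and 5 ones; these counts are chosen only so the arithmetic is easy to follow and are not the size of the real data.

```python
import pandas as pd

# Toy target column: 14 zeros and 5 ones (made-up counts)
target = pd.Series([0] * 14 + [1] * 5, name="Spend_Drop_over50pct")

# Recode 0/1 to the labels 'A'/'B'
labels = target.map({0: "A", 1: "B"})

# normalize=True returns shares instead of counts; scale to percent
pct = labels.value_counts(normalize=True).mul(100).round(2)

print(pct)
```

Using normalize=True avoids dividing by the row count by hand.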

Test and Train Sample

We now have a data sample which can be used for modeling. We split the data into a Test Sample (used for validating the model) and a Train Sample (used for developing the model).

We use the train_test_split function to create the test and train samples randomly, with the Test Sample sized at 30% of the observations. We can view a few rows using the head function.

# Split sample into Train and Test
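# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
Train, Test = train_test_split(binary, test_size=0.3, random_state=176)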
# Print a few rows
Train.head()

We can extract Independent Variables and Target variables into two arrays.

# Keep the independent variables and the target variable in separate arrays
# (columns are selected by position: 3 and 4 are the predictors, 5 is the target)
Train_IndepentVars = Train.values[:, 3:5]
Train_TargetVar = Train.values[:, 5]
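With an imbalanced target like this one, a stratified split is a common variation. The sketch below is not the split used in this tutorial: it runs on made-up data with hypothetical column names (x1, x2, target), selects columns by name rather than by position, and passes stratify so both samples keep roughly the same class mix.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: two features and an imbalanced binary target
df = pd.DataFrame({
    "x1": range(20),
    "x2": [v % 3 for v in range(20)],
    "target": ["A"] * 15 + ["B"] * 5,
})

# Select predictors and target by column name
X = df[["x1", "x2"]]
y = df["target"]

# stratify=y keeps the A/B proportions similar in both samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=176, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```

Selecting by name is less fragile than positional slicing if columns are added or dropped later.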

Decision Tree

We now have arrays for the independent variables and the target variable, so we can build the decision tree classifier. We use DecisionTreeClassifier from the sklearn library. Setting criterion to "gini" makes the tree choose its splits by Gini impurity, the measure used by the CART (Classification and Regression Tree) algorithm.

# Load library
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
# Build the decision tree - CART algorithm (Gini criterion)
dt_train_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                       max_depth=5, min_samples_leaf=5)
# Fit the tree on the training sample
dt_train_gini.fit(Train_IndepentVars, Train_TargetVar)
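To make the "gini" criterion more concrete, here is a minimal hand-written sketch of Gini impurity and of the weighted impurity of a candidate split. The helper names and the class counts are hypothetical, and this is not scikit-learn's internal code; it only illustrates the quantity the classifier tries to reduce.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(left_counts, right_counts):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    return (n_left / n_total) * gini_impurity(left_counts) \
        + (n_right / n_total) * gini_impurity(right_counts)


# Made-up node with 14 'A' and 5 'B' observations
parent = gini_impurity([14, 5])

# Made-up candidate split of that node into two children
split = weighted_gini([12, 1], [2, 4])

print(round(parent, 4), round(split, 4))
```

A split is attractive when its weighted impurity is lower than the impurity of the parent node, as it is for the made-up counts here.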

We can change the decision tree parameters, such as max_depth and min_samples_leaf, to control the size of the decision tree.
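The sketch below shows one way to see the effect of these parameters and, at the same time, to score a held-out sample (step 5 in the outline). It is an illustration under assumptions: the data is synthetic, generated with make_classification rather than taken from this tutorial's file, and the two max_depth values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the tutorial's dataset)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=176
)

results = {}
for depth in (2, 5):
    model = DecisionTreeClassifier(criterion="gini", random_state=100,
                                   max_depth=depth, min_samples_leaf=5)
    model.fit(X_train, y_train)

    # Score the held-out sample: predicted class for each test row
    predictions = model.predict(X_test)

    results[depth] = {
        "depth": model.get_depth(),        # actual depth of the fitted tree
        "leaves": model.get_n_leaves(),    # number of terminal nodes
        "test_accuracy": accuracy_score(y_test, predictions),
    }
    print(depth, results[depth])
```

A smaller max_depth or a larger min_samples_leaf gives a smaller tree; comparing accuracy on the test sample helps judge whether the extra size is worth it.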

Decision Tree Visualization

We would like to see a plot of the decision tree. There are a few options for plotting a decision tree in Python. One of the easier options is to use Graphviz.

First we can create a text file which stores all relevant information and then open a web link to get the decision tree plot.

# Write the fitted tree in Graphviz DOT format to a text file
with open("dt_train_gini.txt", "w") as f:
    tree.export_graphviz(dt_train_gini, out_file=f)

This step creates a text file, dt_train_gini.txt, in the default folder. We can give a full path as well. We can then copy the contents of this text file and paste them into the box on http://www.webgraphviz.com/

When we click on "Generate Graph", it shows the Decision Tree Plot.
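If writing a file is inconvenient, export_graphviz can also return the DOT text directly, and export_text prints the tree as indented rules without any plotting tool. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in, since the tutorial's feature names are not shown here; the feature and class names in it belong to iris, not to this tutorial's data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Stand-in data: the iris dataset shipped with scikit-learn
iris = load_iris()
model = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=2)
model.fit(iris.data, iris.target)

# out_file=None makes export_graphviz return the DOT source as a string
dot_text = export_graphviz(
    model,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

# Plain-text view of the same tree, one line per node
rules = export_text(model, feature_names=list(iris.feature_names))

print(rules)
```

The dot_text string can be pasted into a Graphviz viewer in the same way as the contents of the text file.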
