Step by Step Tutorial on Decision Tree using Python

In this blog, the aim is to show you the steps of building a decision tree using Python in a Jupyter Notebook. If you are interested in learning the decision tree algorithm itself, we have an excellent tutorial on "Decision Tree Algorithm - CART".

We are using the same data to explain the steps involved in building a decision tree. The high-level steps are:

  1. Reading the data
  2. Preparing the data for the decision tree
  3. Splitting the sample into test and train samples
  4. Fitting the decision tree classifier
  5. Scoring new sample data
  6. Visualizing the decision tree

Reading Data

We have a sample data file that has a few independent variables and a binary target variable. We can read this CSV file into Python.

We are using the pandas library to read the file and create a data frame.
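As a minimal sketch of this step, the snippet below reads CSV text into a data frame named "binary". The file contents and the column names (x1, x2, target) are hypothetical stand-ins, since the actual file is not shown here; with a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the sample CSV file (names and values are made up).
csv_text = """x1,x2,target
25,1,A
40,0,B
31,1,A
52,0,A
"""

# pd.read_csv accepts a file path or a file-like object.
# For a file on disk this would be pd.read_csv("path/to/your_file.csv").
binary = pd.read_csv(io.StringIO(csv_text))

print(binary.shape)
print(binary.columns.tolist())
```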

Explore Data

We should explore the data we have just read in order to understand it. The describe() function does a good job of giving us the summary statistics of a pandas data frame object. In this case, we have created a data frame named "binary" and want to describe it to get summary statistics of its variables.
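A small sketch of the describe() call is shown below. The data frame contents are hypothetical; only the name "binary" comes from the text. Note that describe() summarizes numeric columns by default, so a text-valued column would need include="all".

```python
import pandas as pd

# Hypothetical data standing in for the "binary" data frame.
binary = pd.DataFrame({
    "x1": [25, 40, 31, 52],
    "x2": [1, 0, 1, 0],
    "target": [0, 1, 0, 0],
})

# Count, mean, std, min, quartiles and max for each numeric column.
summary = binary.describe()
print(summary)
```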

Of course, the result is not perfect. The column "Unnamed:3" is not relevant and should be removed. We also want to print a few rows to know the data better, and it is important to check the distribution of the target variable's values.
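The cleanup described above might look like the following sketch. The data is hypothetical, and the column label is copied from the text; pandas usually labels such columns with a space (for example "Unnamed: 3"), so it is worth checking binary.columns for the exact name first.

```python
import numpy as np
import pandas as pd

# Hypothetical data with an empty extra column, as can happen with a stray delimiter.
binary = pd.DataFrame({
    "x1": [25, 40, 31, 52],
    "x2": [1, 0, 1, 0],
    "target": [0, 1, 0, 0],
    "Unnamed:3": [np.nan] * 4,
})

# Drop the irrelevant column, then print the first few rows.
binary = binary.drop(columns=["Unnamed:3"])
print(binary.head())
```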

So, the target variable has a split of about 74% to 27% between its two values, [0, 1] or ['A', 'B'].
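One way to check such a distribution is value_counts(), sketched here on a hypothetical target column (the values below are made up and do not reproduce the split reported above).

```python
import pandas as pd

# Hypothetical target values.
target = pd.Series([0, 0, 0, 1], name="target")

# Absolute counts per class.
counts = target.value_counts()

# Share of each class as a fraction of all rows.
shares = target.value_counts(normalize=True)

print(counts)
print(shares)
```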

Test and Train Sample

Now we have a data sample that can be used for modeling. We split the data into a Test Sample (which will be used for validating the developed model) and a Train Sample (which will be used for model development).

We are using the train_test_split function to create the test and train samples randomly, with the Test Sample holding 30% of the observations. We can view a few rows using the head function.
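A minimal sketch of the split, assuming a hypothetical ten-row data frame: test_size=0.3 matches the 30% mentioned above, while random_state is an added assumption that only makes the random split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the prepared data frame.
binary = pd.DataFrame({
    "x1": range(10),
    "x2": [0, 1] * 5,
    "target": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Randomly hold out 30% of the rows as the test sample.
train, test = train_test_split(binary, test_size=0.3, random_state=42)

print(train.head())
print(len(train), len(test))
```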

We can then extract the independent variables and the target variable into two arrays.
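The extraction could be sketched as follows. The column names are hypothetical, and the example assumes the target column is called "target".

```python
import pandas as pd

# Hypothetical train sample.
train = pd.DataFrame({
    "x1": [25, 40, 31, 52],
    "x2": [1, 0, 1, 0],
    "target": [0, 1, 0, 0],
})

# Every column except the target is treated as an independent variable.
feature_cols = [c for c in train.columns if c != "target"]

X_train = train[feature_cols].values  # 2-D array of independent variables
y_train = train["target"].values      # 1-D array of target values

print(X_train.shape, y_train.shape)
```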

Decision Tree

We now have arrays for the independent variables and the target variable, so we can build the decision tree classifier. We are using DecisionTreeClassifier from the sklearn library. The "gini" option selects Gini impurity as the split criterion, which is the measure the CART (Classification and Regression Tree) algorithm uses when fitting a classification tree.
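As a sketch of the fitting step, the snippet below trains a classifier with criterion="gini" on hypothetical arrays and then scores two new, made-up observations with predict(). The random_state argument is an added assumption for repeatability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training arrays: two independent variables and a binary target.
X_train = np.array([[25, 1], [40, 0], [31, 1], [52, 0], [46, 0], [23, 1]])
y_train = np.array([0, 1, 0, 1, 1, 0])

# criterion="gini" uses Gini impurity to choose the splits.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)

# Score new sample data (hypothetical observations).
X_new = np.array([[30, 1], [50, 0]])
predictions = clf.predict(X_new)
print(predictions)
```

In practice the same predict() call would be applied to the independent variables of the Test Sample to validate the model.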

We can change the decision tree parameters, such as the maximum depth or the minimum number of samples per leaf, to control the size of the decision tree.
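To illustrate the effect of such parameters, this sketch compares an unrestricted tree with one limited by max_depth and min_samples_leaf. The data is randomly generated and the parameter values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data: 200 rows, 3 independent variables.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Restricted tree: at most 3 levels deep, at least 10 samples in each leaf.
small_tree = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=10, random_state=0
).fit(X, y)

print(full_tree.get_depth(), full_tree.get_n_leaves())
print(small_tree.get_depth(), small_tree.get_n_leaves())
```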

Decision Tree Visualization

We would like to see a plot of the decision tree. There are a few options for plotting a decision tree in Python. Probably one of the easiest is to use Graphviz.

First we create a text file that stores all the relevant information, and then we open a web page to get the decision tree plot.
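One way to produce such a text file is scikit-learn's export_graphviz function, which writes the fitted tree in Graphviz DOT format. The sketch below uses hypothetical training data and feature names; only the output file name is taken from this tutorial.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical training arrays.
X_train = np.array([[25, 1], [40, 0], [31, 1], [52, 0], [46, 0], [23, 1]])
y_train = np.array([0, 1, 0, 1, 1, 0])

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)

# Write the tree description (DOT format) to a text file in the working folder.
export_graphviz(clf, out_file="dt_train_gini.txt", feature_names=["x1", "x2"])

# Read the file back to see the text that gets pasted into the web page.
with open("dt_train_gini.txt") as f:
    dot_text = f.read()
print(dot_text)
```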

This step creates a text file, dt_train_gini.txt, in the default folder. We can give a full path as well. We can copy the contents of this text file and paste them into the box at http://www.webgraphviz.com/

When we click on "Generate Graph", it shows the Decision Tree Plot.

 

 

 
