Tutorial on Random Forest using Python

In the previous blog, we explained the Random Forest algorithm and the steps involved in building a Random Forest model using R.

In this blog, we will show the high-level steps required to build a Machine Learning model in Python.

The Random Forest algorithm is based on the Classification and Regression Tree (CART) decision tree algorithm. Instead of a single tree, it builds a series of decision trees, each using a different random sample of the observations and random subsets of the features.

Some of the CART/Decision Tree parameters that remain relevant in Random Forest are min_samples_leaf, min_samples_split and max_depth.

But the user also has to consider additional parameters such as max_features (the number of variables/features considered when looking for the best split in a Decision Tree) and n_estimators (the number of Decision Trees to be built). We will discuss the importance of these parameters in the Random Forest model.
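As a minimal sketch, the parameters named above can be passed to scikit-learn's RandomForestClassifier as shown here. The values are illustrative assumptions, not recommendations, and the name rf_params is made up for this example.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only (assumptions); tune them for your own data.
rf_params = {
    "n_estimators": 100,      # number of Decision Trees to build
    "max_features": "sqrt",   # features considered when searching for a split
    "max_depth": 5,           # maximum depth of each tree
    "min_samples_split": 10,  # minimum samples needed to split a node
    "min_samples_leaf": 5,    # minimum samples required in a leaf
}

# The model is only configured here; nothing is fitted yet.
model = RandomForestClassifier(**rf_params)
print(model.get_params()["n_estimators"])
```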

Now let's get started on building a Random Forest model using Python.

We are using the same data to explain the steps involved in building a Random Forest model. The high-level steps are:

  1. Load libraries
  2. Read data
  3. Prepare/explore data
  4. Split sample into Test and Train samples
  5. Fit Random Forest classifier
  6. Score a data sample
  7. Find accuracy of the model

Load Python Libraries

We are using the Random Forest classifier from scikit-learn. We are also reading a CSV file, so we use the pandas package.
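The imports for the steps in this tutorial might look like the following. The metrics imports are an assumption based on the later sections on the confusion matrix and accuracy.

```python
# pandas for reading the CSV file and exploring the data
import pandas as pd

# scikit-learn pieces used in the later steps
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
```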

Reading Data

We have a small data sample that we can use for building a Random Forest model. Ideally, Random Forest is more useful when there is a longer list of features, but in this example the list of variables/features is not long.
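A sketch of reading a CSV into a pandas data frame is shown below. The data file used in this blog is not reproduced here, so the example reads a tiny made-up CSV from memory; the column names Age, Balance and Target are hypothetical. With a real file you would pass its path to read_csv instead.

```python
import io
import pandas as pd

# Made-up data standing in for the CSV file (hypothetical columns).
csv_text = """Age,Balance,Target
25,1200.5,0
40,300.0,1
33,50.0,0
52,2100.0,1
31,760.0,0
47,980.0,1
"""

# read_csv accepts a file path or any file-like object.
binary = pd.read_csv(io.StringIO(csv_text))
print(binary.shape)
print(binary.head())
```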

Explore Data

We should explore the data to understand it, and the describe() function does a good job of giving us the summary statistics of a pandas data frame object. In this case, we have created a data frame named "binary" and describe it to get summary statistics of all the variables.

Next, we take some actions to clean the data and prepare it for modeling.
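For illustration, describe() on a small made-up data frame named binary looks like this. The columns are hypothetical stand-ins for the real variables.

```python
import pandas as pd

# Made-up data frame (hypothetical columns), named as in the text.
binary = pd.DataFrame({
    "Age": [25, 40, 33, 52, 31, 47],
    "Balance": [1200.5, 300.0, 50.0, 2100.0, 760.0, 980.0],
    "Target": [0, 1, 0, 1, 0, 1],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = binary.describe()
print(summary)
```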
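The exact cleaning steps depend on the data, so the following is only a sketch under assumptions: it counts missing values, drops incomplete rows and makes sure the target is stored as an integer. The data frame and its columns are made up.

```python
import numpy as np
import pandas as pd

# Made-up data with a couple of missing values (hypothetical columns).
binary = pd.DataFrame({
    "Age": [25, 40, np.nan, 52, 31],
    "Balance": [1200.5, 300.0, 50.0, np.nan, 760.0],
    "Target": [0, 1, 0, 1, 0],
})

# Count missing values per column before deciding how to handle them.
print(binary.isnull().sum())

# One simple option: drop rows with any missing value.
clean = binary.dropna().reset_index(drop=True)

# Make sure the target is an integer class label.
clean["Target"] = clean["Target"].astype(int)
print(clean)
```

Dropping rows is the simplest choice; imputing missing values is an alternative when the sample is small.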

Test & Train Data Samples

Now we have a data sample that can be used for modeling. We split the data into a Test Sample (used for validating the model developed) and a Train Sample (used for model development).

We are using the train_test_split function to create the test and train samples randomly, with the Test Sample holding 30% of the observations. We can view a few rows using the head function.

Also, we are splitting the Target Variable and the Feature Set into two different arrays.
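The split described above can be sketched as follows. The data is randomly generated for illustration, and the column names and the array names (Train_IndependentVars and so on) are hypothetical. Selecting columns by name is used here as an assumption; slicing by position on .values works as well.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (hypothetical columns).
rng = np.random.default_rng(42)
binary = pd.DataFrame({
    "Age": rng.integers(20, 65, size=100),
    "Balance": rng.normal(1000, 300, size=100).round(2),
    "Target": rng.integers(0, 2, size=100),
})

# 30% of the observations go to the Test sample.
Train, Test = train_test_split(binary, test_size=0.3, random_state=42)
print(Train.head())

# Keep the target and the independent variables in different arrays.
feature_cols = ["Age", "Balance"]
Train_IndependentVars = Train[feature_cols].values
Train_TargetVar = Train["Target"].values
Test_IndependentVars = Test[feature_cols].values
Test_TargetVar = Test["Target"].values
```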

Fitting Random Forest Classifier

We now have arrays for the independent variables and the target variable, so we can build a Machine Learning model using the Random Forest classifier. We are using RandomForestClassifier from the sklearn library, with default parameters except for max_depth and n_estimators.
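A minimal sketch of the fitting step follows. The training arrays are generated with make_classification as stand-ins, and the max_depth and n_estimators values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training arrays.
X_train, y_train = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, random_state=42,
)

# Defaults everywhere except max_depth and n_estimators (illustrative values).
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

# The fitted forest holds one decision tree per estimator.
print(len(rf_model.estimators_))
```

Setting random_state is optional; it only makes the run repeatable.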

Scoring using Random Forest Classifier

Now that we have a random forest classifier, we can use it to score a data sample and validate the accuracy of the model. In this case, we are scoring the same sample that we used for training. We can, and should, also test on the "Test" data sample created above.

We are using the "predict" function, which needs only the feature array and not the target variable array.

Now we can check the accuracy of the model. One of the performance statistics is the Confusion Matrix.
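Scoring the training sample and building a confusion matrix could look like this sketch. The data is synthetic and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the training arrays.
X_train, y_train = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, random_state=42,
)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

# predict needs only the features, not the target array.
train_pred = rf_model.predict(X_train)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_train, train_pred)
print(cm)
```

Because the model is scored on the data it was trained on, this matrix will usually look better than one computed on unseen data.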

Random Forest Model Accuracy/Validate Model

We can validate the model to see whether it generalizes and is fit for future prediction, if that is the objective.
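One way to check generalization, sketched here on synthetic data, is to compare accuracy on the Train sample with accuracy on the held-out Test sample. A large gap between the two suggests the model is overfitting. The parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 70/30 into Train and Test.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_redundant=0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42,
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not.
train_acc = accuracy_score(y_train, rf_model.predict(X_train))
test_acc = accuracy_score(y_test, rf_model.predict(X_test))
print("Train accuracy:", round(train_acc, 3))
print("Test accuracy:", round(test_acc, 3))
```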

Variable Importance in Random Forest

If the input feature set is big and we want to find the relative importance of the variables, we can read the importance of each input variable from the fitted model (the feature_importances_ attribute in scikit-learn).
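As a sketch on synthetic data, the importances of a fitted RandomForestClassifier can be paired with the feature names and sorted. The feature names here are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up feature names.
feature_names = [f"feature_{i}" for i in range(5)]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, random_state=42,
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X, y)

# One importance score per input feature; the scores sum to 1.
importance = pd.Series(rf_model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance)
```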


2 thoughts on “Tutorial on Random Forest using Python”

  1. Hi,
    First thank you for posting such a helpful article on Random Forest Classifier.
    Can you please help me to understand this part:
    # Split Target and Feature Set
    # Keep Target and Independent Variable into different array
    Train_IndepentVars = Train.values[:, 3:5]
    Train_TargetVar = Train.values[:,5]
