In the previous blog, we explained the **Random Forest algorithm** and the steps involved in building a Random Forest model using R.

In this blog, we will show the high-level steps required to build a Machine Learning model in Python.

The Random Forest algorithm is based on the Classification and Regression Tree (CART) decision tree algorithm, but it builds a series of decision trees, each using a different feature set and a random sample of observations.

Some of the CART/decision tree parameters that remain relevant in Random Forest are: min_samples_leaf, min_samples_split and max_depth.

But the user has to consider additional parameters such as max_features (the number of variables/features to be considered for each decision tree) and n_estimators (the number of decision trees to be built). We will discuss the importance of these parameters in the Random Forest model.
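As a quick illustration of what these two parameters control, the sketch below fits a forest on synthetic data (the dataset and the parameter values are invented for illustration, not taken from this blog's example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# n_estimators: number of trees in the forest
# max_features: number of features considered at each split
rf = RandomForestClassifier(n_estimators=50, max_features=3, random_state=42)
rf.fit(X, y)

print(len(rf.estimators_))  # the fitted forest holds 50 individual trees
```

Each tree sees only max_features candidate features per split, which decorrelates the trees and is what makes the forest stronger than a single CART tree.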

Now let's get started on building a Random Forest model using Python.

We are using the same data for explaining the steps involved in building a Random Forest Model. Some of the high-level steps are:

- Load Library
- Reading data
- Preparing/Exploring Data
- Split Sample into Test and Train Samples
- Fitting Random Forest Classifier
- Score a data sample
- Find Accuracy of the Model

## Load Python Libraries

We are using the Random Forest classifier from **scikit-learn**. We are also reading a CSV file, so we use the **pandas** package.

```python
# Load Libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
```

## Reading Data

We have a small data sample which we can use for building a Random Forest model. Ideally, the Random Forest algorithm is more useful when we have a longer list of features, but in this example the list of variables/features is short.

```python
# Read Data
binary = pd.read_csv('http://dni-institute.in/blogs/wp-content/uploads/2017/07/dt_data.csv')
```

## Explore Data

We should explore the data in order to understand it, and the describe() function does a good job of getting us summary statistics of a pandas DataFrame object. In this case, we have created the "binary" DataFrame and want to describe it to get summary statistics of all the variables.

```python
# Explore Data
binary.describe()
```

Next, some actions to clean the data and prepare it for modeling.

```python
# Data Manipulations
# Columns
binary.dtypes.index
# Drop a column
binary.drop('Unnamed: 3', axis=1, inplace=True)
# Recode Target Variable {0, 1} to {'A', 'B'}
binary.Spend_Drop_over50pct.replace([0, 1], ['A', 'B'], inplace=True)
# Print a few rows
binary.head()
# Count Target Variable Values
binary.Spend_Drop_over50pct.value_counts()
# Find % Values of Target Variable Levels
round(binary.Spend_Drop_over50pct.value_counts()*100/len(binary.axes[0]), 2)
```
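As a side note, pandas can compute the same percentage breakdown directly with value_counts(normalize=True); the sketch below uses a made-up stand-in for the blog's "binary" frame:

```python
import pandas as pd

# Made-up stand-in for the blog's "binary" target column
s = pd.Series(['A', 'B', 'A', 'A'], name='Spend_Drop_over50pct')

# normalize=True returns proportions; multiply by 100 for percentages
pct = (s.value_counts(normalize=True) * 100).round(2)
print(pct)  # A: 75.0, B: 25.0
```

Calling .round() on the Series also sidesteps version differences in how the built-in round() handles a pandas Series.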

## Test & Train Data Samples

Now we have a data sample which can be used for modeling. We split the data into a **Test Sample** (which will be used for validating the developed model) and a **Train Sample** (which will be used for model development).

We are using the **train_test_split** function to create the test and train samples randomly, with the test sample size set to 30% of the observations. We can view a few rows using the head function.

```python
# Split sample into Train and Test
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is removed in newer scikit-learn versions
Train, Test = train_test_split(binary, test_size=0.3, random_state=176)
# Print a few rows
Train.head()
```
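If the target classes are imbalanced, train_test_split also accepts a stratify argument that keeps the class ratio similar in both samples; a minimal sketch on invented data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 70% class 0, 30% class 1
df = pd.DataFrame({'x': range(100), 'y': [0] * 70 + [1] * 30})

# stratify=df['y'] keeps the class ratio similar in Train and Test
train_df, test_df = train_test_split(df, test_size=0.3, random_state=176,
                                     stratify=df['y'])
print(test_df['y'].mean())  # close to the overall 30% positive rate
```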

We also split the Target Variable and the Feature Set into two different arrays.

```python
# Split Target and Feature Set
# Keep Target and Independent Variables in different arrays
Train_IndepentVars = Train.values[:, 3:5]
Train_TargetVar = Train.values[:, 5]
```
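Positional slicing like Train.values[:, 3:5] silently breaks if the column order changes; selecting by column name is a more robust alternative. A sketch with invented column names standing in for the blog's data:

```python
import pandas as pd

# Invented columns: two features and a target
train = pd.DataFrame({'id': [1, 2], 'f1': [0.1, 0.2],
                      'f2': [1.0, 2.0], 'target': ['A', 'B']})

# Name-based selection keeps working even if columns are reordered
X = train[['f1', 'f2']].values
y = train['target'].values
print(X.shape)  # (2, 2)
```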

## Fitting Random Forest Classifier

We have arrays for the independent variables and the target variable, so we can now build a Machine Learning model using the Random Forest classifier. We are using **RandomForestClassifier** from the sklearn library, with default parameters except max_depth and n_estimators.

```python
# Random Forest Model
rf_model = RandomForestClassifier(max_depth=10, n_estimators=10)
rf_model.fit(Train_IndepentVars, Train_TargetVar)
```
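Optionally, RandomForestClassifier can also give a built-in validation estimate from the bootstrap samples via oob_score=True; the sketch below uses synthetic data and illustrative parameter values, not the blog's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; parameter values are illustrative
X, y = make_classification(n_samples=300, random_state=0)
rf = RandomForestClassifier(max_depth=10, n_estimators=100,
                            oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is accuracy measured on out-of-bag samples
print(round(rf.oob_score_, 2))
```

Each tree is trained on a bootstrap sample, so the observations it never saw ("out-of-bag") act as a free validation set.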

## Scoring using Random Forest Classifier

Now that we have a random forest classifier, we can use it to score a data sample and validate the accuracy of the developed model. In this case, we are scoring the same sample we used for training. We can and should also test on the "Test" data sample created above.

We use the "predict" function; here we do not need the target variable array.

```python
# Scoring based on the trained RF Model
predictions = rf_model.predict(Train_IndepentVars)
```
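The roc_auc_score imported earlier works on predicted probabilities rather than class labels, so it pairs with predict_proba instead of predict; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data for illustration
X, y = make_classification(n_samples=200, random_state=1)
rf = RandomForestClassifier(max_depth=10, n_estimators=10, random_state=1)
rf.fit(X, y)

# predict_proba gives class probabilities; column 1 is the positive class
probs = rf.predict_proba(X)[:, 1]
print(round(roc_auc_score(y, probs), 2))
```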

Now we can check the accuracy of the model; one of the performance statistics is the Confusion Matrix.

## Random Forest Model Accuracy/Validate Model

We can validate the model and see whether it generalizes and is fit for future prediction, if that is the objective.

```python
from sklearn.metrics import confusion_matrix
# Confusion Matrix
print("Confusion matrix:\n", confusion_matrix(Train_TargetVar, predictions))
```
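The confusion matrix can also be summarized into a single number with accuracy_score; a small sketch on toy labels (not the blog's data):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration
y_true = ['A', 'A', 'B', 'B', 'A']
y_pred = ['A', 'B', 'B', 'B', 'A']

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))    # 4 of 5 correct -> 0.8
```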

## Variable Importance in Random Forest

If the input feature set is big and we want to find the relative importance of variables, we can use code similar to the below to find the importance of each input variable.

```python
importance = rf_model.feature_importances_
importance = pd.DataFrame(importance, index=Train.columns[3:5], columns=["Importance"])
```
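A usage sketch on synthetic data, with invented feature names, showing how the importance table can be sorted for reading:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with invented feature names
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=2, random_state=3)
rf = RandomForestClassifier(n_estimators=20, random_state=3).fit(X, y)

# Importances sum to 1; sort to rank features
imp = pd.DataFrame(rf.feature_importances_,
                   index=['f1', 'f2', 'f3', 'f4'],
                   columns=['Importance']).sort_values('Importance',
                                                       ascending=False)
print(imp)
```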

## Comments

Hi,

First, thank you for posting such a helpful article on the Random Forest classifier.

Can you please help me understand this part:

```python
# Split Target and Feature Set
# Keep Target and Independent Variable into different array
Train_IndepentVars = Train.values[:, 3:5]
Train_TargetVar = Train.values[:, 5]
```

Thank you, Tanveer. This part just splits the independent variables/predictors and the target variable into separate arrays, which are then given as inputs to the classifier.

A pandas error occurs on this line:

```python
round(binary.Spend_Drop_over50pct.value_counts()*100/len(binary.axes[0]),2)
```

Using the Series .round() method instead works:

```python
print((binary.Spend_Drop_over50pct.value_counts()*100/len(binary.axes[0])).round(2))
```

I have trouble running and understanding this part of the script:

```python
# Data Manipulations
# Columns
binary.dtypes.index
# Drop a column
binary.drop('Unnamed: 3', axis=1, inplace=True)
# Target Variable to be made {-1, 1}
binary.Spend_Drop_over50pct.replace([0, 1], ['A', 'B'], inplace=True)
# Print a few rows
binary.head()
# Count Target Variable Values
binary.Spend_Drop_over50pct.value_counts()
# Find % Values of Target Variable Levels
round(binary.Spend_Drop_over50pct.value_counts()*100/len(binary.axes[0]),2)
```

and it produces an error like this:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
     14 binary.Spend_Drop_over50pct.value_counts()
     15 # Find % Values of Target Variable Levels
---> 16 round(binary.Spend_Drop_over50pct.value_counts()*100/len(binary.axes[0]),2)

C:\anaconda2\envs\py27\lib\site-packages\pandas\core\series.py in wrapper(self)
    116     return converter(self.iloc[0])
    117     raise TypeError("cannot convert the series to "
--> 118                     "{0}".format(str(converter)))
    119
    120 return wrapper

TypeError: cannot convert the series to
```

Why? Any solution?

Import the numpy module first, and then the code below will work:

```python
import numpy as np

percentage = np.round(binary.Spend_Drop_over50pct.value_counts()*100/len(binary.axes[0]), 2)
percentage
```