Random Forest using R - Step by Step on a Sample Data

Some interested candidates have asked us to show the steps for building a Random Forest on sample data and scoring another sample with the model built. Here are the steps. # ---------------------- Random Forest using R -------------------------# # Author : Ram # #-----------------------------------------------------------------------# # Read a dataset which a target variable (binary: ...

Python Learning - Finding Answers

In this blog, we share some scenarios that arose while working on a project, along with the steps used to get the work done. There are surely multiple ways to achieve each outcome, but these are my solutions. Q1: How do we extract id value from html ...

Python - Data Manipulation Scenarios and Questions

In this blog, we have listed a few data manipulation scenarios or examples from data science projects. These examples can propel your Python learning for Data Science. Data manipulation is one of the significant activities of any Data Science or Predictive Modeling project. If you have any scenarios or examples, do share with us ...

Visualisation using R - Commonly used functions

We will be discussing some of the commonly used Base R graphics functions. Some of the commonly used functions are plot: Plotting Line Chart and Scatter Plot; boxplot: Box Whiskers Plot for a continuous variable or distributions by different groups; hist: Histogram. Scatter Plot: We will create sample data points and then use them for the scatter ...

K Means Clustering Examples and Practical Applications

Pricing Segmentation: E-retailers or e-commerce companies have taken the retail industry by storm. They are offering alluring offers and discounts. They aim to move from a discount-led to a convenience-led or differentiation-led offering over time. Some of them have been forced to start the journey. One of the large retailers wanted to ...

Multiple Regression Assumption- Multi-collinearity and Auto-correlation

In the previous blog, we discussed the "Linearity" assumption in multiple regression; now we discuss Multicollinearity and Auto-correlation. What is multicollinearity? Collinearity is a relationship between two variables, and it can be between a dependent variable and an independent variable. One way to measure it is the Pearson Correlation Coefficient. Correlation Analysis Overview. Multi-collinearity ...

Wealth Management and Analytics

Due to increased competition, customer expectations and regulatory requirements, there is an increased focus on data-driven analytics at Wealth Management firms. Some of the key topics and uses are listed below. Customer & Marketing Analytics; Cross-sell and up-sell analytics; Product Sequence Analysis to know how customers take up products and the linkage to their life stage; Fund Outflow ...

Confidence Interval and Random Forest

library(devtools) library(dplyr) library(randomForest) library(ggplot2) # Fetch data from the UCI Machine Learning Repository url <-"https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data" mpg <- read.table(url,stringsAsFactors = FALSE,na.strings="?") names(mpg) <- c("mpg","cyl","disp","hp","weight","accel","year","origin","name") head(mpg) dim(mpg); summary(mpg) sapply(mpg,class) mpg <- mutate(mpg, hp = as.numeric(hp), year = as.factor(year), origin = as.factor(origin)) head(mpg) # Function to divide data into training, and test sets index <- function(data=data,pctTrain=0.7) { N ...

Credit Risk Analyst: Job Roles and Interview Questions

The role of a credit risk analyst (Analytics) involves quite a few interesting and sophisticated analytics themes, and some of them are listed below. Building credit risk scorecards for screening new applicants (application scorecard); Developing behavioral scorecards for measuring the risk level of existing customers; Building PD, LGD and EAD Models; Model validation and documentation; Analyzing credit ...

CHAID - How does it work?

The Decision Tree Algorithm CHAID is explained with an example, and you can access the details here: CHAID. In this brief blog, we are sharing R code and steps to get a CHAID based decision tree for a dataset. CHAID using R # Decision Tree: CHAID install.packages("CHAID", repos="http://R-Forge.R-project.org") library(CHAID) library(help=CHAID) names(termCrosssell) table(termCrosssell$housing) dt.chaid <- chaid(y~ ...

Test of Association for categorical variables

Scale of measure plays an important role in selecting the right statistical technique or test for an analysis: "When to use what Statistical Technique".
In a previous blog, we discussed when to use a T-test and how to run a T-test in R.
A T-test is often used when you want to check whether two groups of data are significantly different from each other. We do this by comparing the means of the two groups. For example, whether patients who received medication have higher T-cell counts than patients who didn't, or whether students who attended special classes scored more than students who didn't. In all such cases we work with continuous data like height, weight, salary etc. But what if we are dealing with categorical variables? Suppose we want to test whether females are more likely than males to respond to a particular marketing campaign, or in other words, whether there is any association between gender and the response variable. Since both Gender and Response are categorical variables, we use the Chi-square test, which tests association between two categorical variables.

As in the example below, 45% of females respond to the campaign while only 30% of males are responders. The result could imply that there is some association between gender and response, but is this association due to chance or statistically significant? To ascertain this, we will use the Chi-square test.



Chi Square Test and Statistics

Let’s keep it short, as there is enough content available already on how to calculate the chi-square statistic, and most statistical tools such as R and SAS give you the Chi-square statistic directly. We will rather focus on how to interpret the results.
Chi-square measures the difference between the observed frequencies and the expected frequencies. The expected frequencies are those we would see if there were no association between the variables, in other words, if the null hypothesis (the hypothesis of no association) were true. If the observed frequencies equal the expected frequencies, the data show no association between the variables. Below is the formula for the Chi-square statistic. For given degrees of freedom, the higher the chi-square value, the smaller the p-value and hence the higher the chance of rejecting the null hypothesis.

Chi-square = ∑ (Observed Freq - Expected Freq)^2 / (Expected Freq)
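To make the formula concrete, here is a minimal Python sketch for the gender and response example. The counts are hypothetical: the article only gives the response rates (45% and 30%), so we assume 200 females and 200 males. The function names are ours, and the p-value shortcut shown is valid only for 1 degree of freedom (a 2x2 table).

```python
import math

def chi_square(observed):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Frequency expected under the null hypothesis of no association
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 1 degree
    of freedom (chi-square with 1 df is the square of a standard normal)."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts (assumed, not from the article):
# rows = female, male; columns = responded, did not respond
table = [[90, 110],   # 45% of 200 females responded
         [60, 140]]   # 30% of 200 males responded

stat, df = chi_square(table)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p:.4f}")
```

With these assumed counts the statistic is about 9.6 and the p-value is about 0.002, so at the usual 5% level we would reject the hypothesis of no association. With smaller samples and the same percentages, the conclusion could differ.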


Here is an interesting question: does a higher chi-square value indicate a stronger association between variables? The answer is No. Chi-square does not test for the strength of association between variables. Later in this article we will see how to measure strength of association.

Among categorical variables, some are called ordinal variables. An ordinal variable takes only a few distinct values, and its levels can be ordered in some meaningful way, like responses to a customer survey: extremely satisfied, somewhat satisfied, not satisfied at all.

Now, when we want to find association between ordinal variables, the Mantel-Haenszel Chi-square test is a more powerful test of ordinal association. What we discussed earlier is called the Pearson Chi-square test. Please note that the Mantel-Haenszel Chi-square test can be used only if both variables are ordinal. Interpretation of this statistic is similar to Pearson’s Chi-square: the higher the value, the smaller the p-value and hence the higher the chance of rejecting the null hypothesis.
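As a rough illustration, the Mantel-Haenszel Chi-square can be computed as (N - 1) * r^2, where N is the total count and r is the Pearson correlation between the row scores and column scores. This is, as far as we know, the definition SAS documents for PROC FREQ. The sketch below assumes integer scores 1, 2, 3, ... for the ordered levels, and the survey counts are made up for illustration.

```python
import math

def mantel_haenszel_chi_square(table, row_scores=None, col_scores=None):
    """Return (statistic, r, p_value) for a table of counts with ordered
    rows and columns. The statistic has 1 degree of freedom."""
    n_rows, n_cols = len(table), len(table[0])
    # Assumption: equally spaced integer scores unless the caller supplies others
    row_scores = row_scores or list(range(1, n_rows + 1))
    col_scores = col_scores or list(range(1, n_cols + 1))
    n = sum(sum(row) for row in table)

    # Weighted means of the row and column scores
    mean_r = sum(row_scores[i] * sum(table[i]) for i in range(n_rows)) / n
    mean_c = sum(col_scores[j] * table[i][j]
                 for i in range(n_rows) for j in range(n_cols)) / n

    cov = var_r = var_c = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            dr = row_scores[i] - mean_r
            dc = col_scores[j] - mean_c
            cov += table[i][j] * dr * dc
            var_r += table[i][j] * dr * dr
            var_c += table[i][j] * dc * dc

    r = cov / math.sqrt(var_r * var_c)      # Pearson correlation of the scores
    stat = (n - 1) * r * r                  # Mantel-Haenszel chi-square
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail, 1 degree of freedom
    return stat, r, p_value

# Hypothetical counts: rows = satisfaction (low, medium, high),
# columns = likelihood to repurchase (low, medium, high)
survey = [[20, 10, 5],
          [10, 20, 10],
          [5, 10, 20]]

mh_stat, mh_r, mh_p = mantel_haenszel_chi_square(survey)
print(f"MH chi-square = {mh_stat:.2f}, r = {mh_r:.3f}, p-value = {mh_p:.4g}")
```

Because the statistic uses the ordering of the levels, a different choice of scores can give a different value, so the scores should reflect how the levels are actually spaced.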

Measuring strength of Association - Cramer’s V statistic and Spearman correlation

This brings us to the last topic of today’s discussion. Cramer’s V statistic is used to measure the strength of association between categorical variables. Values closer to 1 show strong association, while values closer to 0 show weak or no association. Another important aspect of Cramer’s V is that it is not impacted by sample size, unlike the Chi-square statistic, which yields a higher value for a bigger sample showing the same pattern of association.
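A small Python sketch can show this sample size behaviour. It uses the common definition V = sqrt(chi-square / (N * (k - 1))), where k is the smaller of the number of rows and columns. The counts are the same hypothetical gender and response counts (200 females, 200 males), and the larger table simply multiplies every cell by 10.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )

def cramers_v(table):
    """Cramer's V: chi-square rescaled by sample size and table dimensions."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(pearson_chi_square(table) / (n * k))

# Hypothetical counts: rows = female, male; columns = responded, did not respond
small = [[90, 110], [60, 140]]
# Same proportions, ten times the sample size
large = [[10 * count for count in row] for row in small]

chi_small, chi_large = pearson_chi_square(small), pearson_chi_square(large)
v_small, v_large = cramers_v(small), cramers_v(large)
print(f"chi-square: {chi_small:.1f} vs {chi_large:.1f}")
print(f"Cramer's V: {v_small:.3f} vs {v_large:.3f}")
```

The chi-square value grows tenfold with the sample, while Cramer's V stays the same (about 0.15 here, a fairly weak association even though it is statistically significant).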
For ordinal variables, the Spearman correlation can be used to measure the strength of association. Values closer to 1 or -1 indicate strong positive or negative association respectively, while values closer to 0 indicate weaker association. As with Cramer’s V, the values are not impacted by sample size.
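The Spearman correlation is the Pearson correlation computed on ranks, with tied values sharing the average of their ranks. Here is a minimal Python sketch; the survey responses are made up for illustration, and the coding of the levels as 1, 2, 3 is our assumption.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical survey responses coded 1 = low, 2 = medium, 3 = high
satisfaction = [1, 2, 2, 3, 3, 3, 1, 2]
repurchase = [1, 1, 2, 2, 3, 3, 1, 3]

rho = spearman(satisfaction, repurchase)
print(f"Spearman correlation = {rho:.3f}")
```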

All of the above tests can be run in SAS using the proc freq procedure with the ‘chisq’ and ‘measures’ options. When the ‘chisq’ option is specified for a cross tabulation of two variables, SAS provides Pearson’s Chi-square statistic, Cramer’s V statistic and also the Mantel-Haenszel statistic. For the Spearman correlation, the ‘measures’ option is used.

Hope this article was helpful. Please share your comments or any questions you might have.

Chi-Square Statistics Calculation

Chi Square using R

Loss Forecasting Model: Steps

In the previous blog, we described what loss forecasting is and illustrated a few terms involved in loss forecasting and model building. Let’s build a loss forecasting model for a credit cards portfolio: Step 1 - Data preparation - Start with recent account originations and divide them into different cohorts based on their ...

Loss Forecasting Model: Overview and Definitions

What is Loss Forecasting and why is it important? Banks and other financial institutions need to set aside loan loss provisions as an allowance for loans that may turn bad or default. Loss forecasting, as the term suggests, means predicting future losses, which helps banks decide how much reserve is required to cover ...

Kappa statistic: What is it and how is it calculated?

The Kappa statistic is useful when the measured variables are categorical and we want to measure agreement between two parties or dimensions. Scenarios: For example, consulting team members are working with two key stakeholders from the client. Feedback on the team members is collected from both stakeholders on a rating scale of ...

How can I become a data analyst or Data Scientist?

I had answered a similar question on Quora. Data Analytics, Business Analytics and Data Science, with some differences, require the following skills. Tools and Technology: Commonly used analytics tools are SAS, R, Python etc. Visualization tools are Tableau, QlikView etc. As a first step, you could focus on one of the analytics tools. Statistical & Machine ...

Training: Retail Analytics

One of the top retail analytics courses in India. It is based on practical case studies and hands-on workshops conducted by experienced trainers. Contact us at info@dni-institute.in. We can amend or add more case studies based on specific requirements.