Test of Association for Categorical Variables

The scale of measurement plays an important role in selecting the right statistical technique or test for an analysis, that is, in knowing “when to use what statistical technique”.

In a previous blog post, we discussed when to use the T-test and how to run it in R.

A T-test is often used when you want to check whether two groups of data are significantly different from each other. We do this by comparing the means of the two groups. For example, we might ask whether patients who received medication have higher T-cell counts than patients who didn't, or whether students who attended special classes scored more than students who didn't. In all such cases we work with continuous data such as height, weight or salary.

But what if we are dealing with categorical variables? Suppose we want to test whether females are more likely than males to respond to a particular marketing campaign, or in other words, whether there is any association between gender and response. Since both gender and response are categorical variables, we use the Chi-square test, which tests for association between two categorical variables.

In our example, 45% of females respond to the campaign, while only 30% of males do. This suggests some association between gender and response, but is the difference just due to random chance or is it statistically significant? To find out, we will use the Chi-square test.

**Chi Square Test and Statistics**

Let’s keep this short, as there is already plenty of content available on how to calculate the chi-square statistic, and most statistical tools such as R and SAS give you the chi-square statistic directly. We will focus instead on how to interpret the results.

The chi-square statistic measures the difference between the observed frequencies and the expected frequencies, that is, the frequencies we would expect if the null hypothesis (the hypothesis of no association) were true. If the observed frequencies equal the expected frequencies, the statistic is zero and the data show no evidence of association between the variables. The formula for the chi-square statistic is given below. For a given number of degrees of freedom, the higher the chi-square value, the smaller the p-value and hence the stronger the evidence for rejecting the null hypothesis.

**χ² = ∑ (Observed freq − Expected freq)² / Expected freq**
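
To make the formula concrete, here is a minimal Python sketch that applies it to a 2x2 table. The counts are hypothetical (200 females and 200 males, chosen only so that the response rates match the 45% and 30% mentioned earlier), and the helper names `chi_square` and `p_value_df1` are made up for illustration. The p-value shortcut used here is valid only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math

def chi_square(table):
    """Return (statistic, expected counts) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under no association: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return stat, expected

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: rows = female, male; columns = responded, did not respond
observed = [[90, 110], [60, 140]]
stat, expected = chi_square(observed)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, p-value = {p:.4f}")
```

With these made-up counts the p-value falls well below 0.05, so we would reject the hypothesis of no association between gender and response.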

Here is an interesting question: does a higher chi-square value indicate a stronger association between the variables? The answer is no. The chi-square test tells us whether an association exists, not how strong it is, because the statistic also grows with sample size. Later in this article we will see how to measure the strength of association.

Some categorical variables are ordinal. An ordinal variable takes only a few distinct values, but its levels can be ordered in a meaningful way, like responses to a customer survey: extremely satisfied, somewhat satisfied, not satisfied at all.

When we want to test for association between ordinal variables, the Mantel-Haenszel Chi-square test is more powerful. What we discussed earlier is called the Pearson Chi-square test. Please note that the Mantel-Haenszel Chi-square test can be used only if both variables are ordinal. Interpretation of this statistic is similar to that of Pearson’s Chi-square: the higher the value, the smaller the p-value and hence the stronger the evidence for rejecting the null hypothesis.
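
As a rough illustration, the Mantel-Haenszel Chi-square for two ordinal variables can be computed as (n − 1) × r², where r is the Pearson correlation between the row scores and the column scores, and it has 1 degree of freedom. The Python sketch below assumes equally spaced scores 1, 2, 3, ... for the ordered levels. The table is hypothetical (for example, rows as three ordered customer segments and columns as three satisfaction levels), and the function name is made up for illustration.

```python
import math

def mantel_haenszel_chi_square(table):
    """Return ((n - 1) * r**2, r), where r correlates row scores with column scores."""
    n = sum(sum(row) for row in table)
    # Assumption: equally spaced integer scores 1, 2, 3, ... for the ordered levels
    cells = [
        (i + 1, j + 1, count)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
    ]
    mean_x = sum(x * c for x, _, c in cells) / n
    mean_y = sum(y * c for _, y, c in cells) / n
    sxy = sum(c * (x - mean_x) * (y - mean_y) for x, y, c in cells)
    sxx = sum(c * (x - mean_x) ** 2 for x, _, c in cells)
    syy = sum(c * (y - mean_y) ** 2 for _, y, c in cells)
    r = sxy / math.sqrt(sxx * syy)
    return (n - 1) * r ** 2, r

# Hypothetical 3x3 table of counts; both rows and columns are ordered low to high
ordinal_table = [[30, 20, 10], [20, 30, 20], [10, 20, 40]]
mh_stat, r = mantel_haenszel_chi_square(ordinal_table)
p = math.erfc(math.sqrt(mh_stat / 2))  # upper-tail p-value for 1 degree of freedom
print(f"Mantel-Haenszel chi-square = {mh_stat:.2f}, r = {r:.3f}, p-value = {p:.2g}")
```

Because the statistic is built from a correlation between ordered scores, it only makes sense when both variables are ordinal, which is why the test is restricted to that case.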

**Measuring strength of Association - Cramer’s V statistic and Spearman correlation**

This brings us to the last topic of today’s discussion. Cramer’s V statistic measures the strength of association between categorical variables. Values closer to 1 indicate a strong association, while values closer to 0 indicate weak or no association. Unlike the chi-square statistic, which yields higher values for bigger samples, Cramer’s V is not affected by sample size.
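
The sample-size point can be checked with a small Python sketch. It uses the common definition of Cramer’s V, the square root of chi-square divided by n × min(rows − 1, columns − 1), and compares a hypothetical 2x2 table with the same table scaled up tenfold. The counts and function names are made up for illustration.

```python
import math

def chi_square_stat(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )

def cramers_v(table):
    """Cramer's V: sqrt(chi-square / (n * min(rows - 1, columns - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square_stat(table) / (n * k))

# Hypothetical counts, and the same proportions with ten times as many observations
small = [[90, 110], [60, 140]]
large = [[cell * 10 for cell in row] for row in small]

for name, table in (("small", small), ("large", large)):
    print(f"{name}: chi-square = {chi_square_stat(table):.1f}, V = {cramers_v(table):.3f}")
```

The chi-square statistic grows tenfold with the larger sample while Cramer’s V stays the same, since the proportions in the table have not changed.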

For ordinal variables, the Spearman correlation can be used to measure the strength of association. Values closer to 1 or -1 indicate a strong positive or negative association respectively, while values closer to 0 indicate a weaker association. As with Cramer’s V, the values are not affected by sample size.
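
For readers who want to see what is being computed, here is a minimal Python sketch of the Spearman correlation: rank both variables (tied values share the average of their ranks) and take the Pearson correlation of the ranks. The data are hypothetical ordinal codes (1 = lowest level, 3 = highest), and the variable and function names are made up for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ordinal codes for eight survey respondents
satisfaction = [1, 1, 2, 2, 3, 3, 3, 2]
loyalty = [1, 2, 2, 3, 3, 3, 2, 1]
rho = spearman(satisfaction, loyalty)
print(f"Spearman correlation = {rho:.3f}")
```

Ordinal data usually contain many ties, which is why the sketch uses average ranks rather than the shortcut formula based on squared rank differences.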

All of the tests mentioned above can be run in SAS using the PROC FREQ procedure with the ‘chisq’ and ‘measures’ options on the tables statement. When the ‘chisq’ option is specified for a cross tabulation of two variables, SAS reports the Pearson Chi-square statistic, Cramer’s V statistic and the Mantel-Haenszel Chi-square statistic. For the Spearman correlation, use the ‘measures’ option.

I hope you found this article helpful. Please share your comments or any questions you might have.