Analysis of Variance (ANOVA) is used to compare means across multiple samples. The focus here is one-way ANOVA only; related variants apply similar concepts to other scenarios.
If there are only one or two samples (groups), a t-test can be used instead (see T Test using SAS).
One categorical variable defines the groups (samples), and one continuous variable is used for the mean comparison. So we have two variables: a categorical one that creates the groups and a continuous one whose mean is compared across them.
Scenario 1: A bank ran a marketing campaign with 5 different treatments. For each treatment we have the contacted customers and their spend values. We want to evaluate whether average spend differs across the 5 treatment groups or not. Here, treatment group is the categorical variable and spend is the continuous variable.
Scenario 2: An insurance provider wanted to understand the impact of policy inception year on claim amount, that is, whether the average claim amount differs by policy inception year.
ANOVA: How does it work?
ANOVA compares the means of different groups, but it does so through the concept of "sources of variance". It works with three variances: the overall variance, the variance due to groups, and the variance within groups.
The numerator of the variance formula (the sum of squared deviations from the mean) is called the Sum of Squares. There are three Sums of Squares: the Total Sum of Squares (SST), the Sum of Squares within groups, also called the Sum of Squares Error (SSW or SSE), and the Sum of Squares between (among) groups, also called the Sum of Squares Treatment (SSB).
SST = SSB + SSW
Total Sum of Squares = Sum of Squares between (among) samples + Sum of Squares within samples
The Sum of Squares within samples can also be considered the Sum of Squares Error (SSE).
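The three Sums of Squares can be written out explicitly. The notation below is an assumption introduced for illustration: k groups, n_i observations in group i, x_ij the j-th observation in group i, x-bar_i the mean of group i, and double x-bar the overall mean.

```latex
\[
SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(x_{ij} - \bar{\bar{x}}\right)^2
\]
\[
SSB = \sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{\bar{x}}\right)^2
\]
\[
SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2
\]
\[
SST = SSB + SSW
\]
```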
ANOVA: Explained using Calculations
A country is divided into 4 regions: East, West, North and South. We want to check whether marriage age differs across these regions. Sample data is below, and we want to perform ANOVA to test the null hypothesis that all regions have the same average marriage age.
We can calculate the overall mean (double X bar: 28.6) and then the Total Sum of Squares using the overall mean.
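As a worked sketch, the calculation below uses the 16 marriage-age values listed in the SAS data step of this article (4 per region). The shortcut form (sum of squared values minus N times the squared overall mean) is used only to keep the arithmetic short; it is algebraically the same as summing the squared deviations.

```latex
% Region totals: South 95, North 109, East 119, West 135
\[
\bar{\bar{x}} = \frac{95 + 109 + 119 + 135}{16} = \frac{458}{16} = 28.625
\]
% Sum of the 16 squared values is 13412
\[
SST = \sum_{i}\sum_{j} \left(x_{ij} - \bar{\bar{x}}\right)^2
    = \sum_{i}\sum_{j} x_{ij}^2 - N\,\bar{\bar{x}}^2
    = 13412 - 16\,(28.625)^2
    = 13412 - 13110.25
    = 301.75
\]
```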
Next, we calculate the mean of each sample (region); the Sum of Squares among groups is then as follows.
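A worked version of this step, again assuming the 16 values from the SAS data step in this article (4 observations per region, overall mean 28.625):

```latex
% Region means: South 23.75, North 27.25, East 29.75, West 33.75
\[
SSB = 4\left[(23.75 - 28.625)^2 + (27.25 - 28.625)^2 + (29.75 - 28.625)^2 + (33.75 - 28.625)^2\right]
\]
\[
SSB = 4\,(23.765625 + 1.890625 + 1.265625 + 26.265625) = 4 \times 53.1875 = 212.75
\]
```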
Similarly, we use the mean of each of the 4 samples (East, West, North and South) to calculate the Sum of Squares within each sample. Adding these together gives the Sum of Squares Error (SSE).
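A worked version of the within-group step, assuming the same 16 values from the SAS data step in this article. Each term is the sum of squared deviations of one region's values from that region's own mean.

```latex
% East:  mean 29.75, values 33, 29, 30, 27
% West:  mean 33.75, values 34, 35, 37, 29
% North: mean 27.25, values 24, 27, 28, 30
% South: mean 23.75, values 27, 22, 22, 24
\[
SSE = \underbrace{18.75}_{\text{East}} + \underbrace{34.75}_{\text{West}}
    + \underbrace{18.75}_{\text{North}} + \underbrace{16.75}_{\text{South}} = 89
\]
% Consistency check against SST = SSB + SSW
\[
212.75 + 89 = 301.75
\]
```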
From the Sums of Squares, we need to find the Mean Squares, and for that we need the degrees of freedom. We then compare the Mean Square due to groups with the Mean Square due to error.
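For the marriage-age example (k = 4 regions, N = 16 observations, with between-group and within-group Sums of Squares of 212.75 and 89), the degrees of freedom and Mean Squares work out as sketched below. The symbols k, N, MSB and MSW are labels introduced here for illustration.

```latex
\[
df_{\text{groups}} = k - 1 = 4 - 1 = 3, \qquad
df_{\text{error}} = N - k = 16 - 4 = 12
\]
\[
MSB = \frac{SSB}{df_{\text{groups}}} = \frac{212.75}{3} \approx 70.92, \qquad
MSW = \frac{SSW}{df_{\text{error}}} = \frac{89}{12} \approx 7.42
\]
```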
The ratio of the Mean Square due to groups to the Mean Square due to error is the F statistic. The higher the F statistic, the weaker the evidence in favour of the null hypothesis (meaning the groups are more likely to have different means). From the F statistic and the two degrees of freedom (one for groups, the other for error), we can find the P value.
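The F statistic and its P value can be checked with a short DATA step. This is a minimal sketch: the values 212.75 and 89 are the between-group and within-group Sums of Squares for the marriage-age data in this article, and the variable names are chosen here for illustration. PROBF returns the cumulative probability of the F distribution, so one minus that value is the upper-tail P value.

```sas
/* Sketch: F statistic and P value from the sums of squares */
data _null_;
    ssb  = 212.75;      /* sum of squares between groups */
    ssw  = 89;          /* sum of squares within groups (error) */
    df_b = 4 - 1;       /* number of groups minus 1 */
    df_w = 16 - 4;      /* number of observations minus number of groups */
    msb  = ssb / df_b;  /* mean square due to groups */
    msw  = ssw / df_w;  /* mean square due to error */
    f    = msb / msw;   /* F statistic */
    p    = 1 - probf(f, df_b, df_w);  /* upper-tail probability */
    put msb= msw= f= p=;              /* results are written to the SAS log */
run;
```

The F value written to the log should be about 9.56, and it can be compared with the F value that PROC ANOVA reports for the same data.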
Putting all the calculations together gives a table called the Analysis of Variance (ANOVA) table.
Now we will use SAS to perform all these calculations for us.
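For the marriage-age data in this article, the manual calculation can be laid out as follows. This is a sketch with Mean Squares and F rounded to two decimals; the P value column is left out here and is best read from software output.

```latex
\[
\begin{array}{lrrrr}
\text{Source} & \text{DF} & \text{Sum of Squares} & \text{Mean Square} & F \\
\hline
\text{Region (between groups)} & 3  & 212.75 & 70.92 & 9.56 \\
\text{Error (within groups)}   & 12 & 89.00  & 7.42  &      \\
\text{Total}                   & 15 & 301.75 &       &      \\
\end{array}
\]
```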
ANOVA using SAS
SAS has a procedure called PROC ANOVA that performs Analysis of Variance. First we read the data, then we call the procedure. We use two of its statements here: the CLASS statement names the categorical variable (Region in the example above), and the MODEL statement gives the structure of the model or analysis. In this example, marriage age is the target (dependent) variable and region is the independent variable.
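data marriage;
    /* Values are separated by blanks, the default delimiter for list input */
    input region $ age;
    cards;
S 27
S 22
S 22
S 24
N 24
N 27
N 28
N 30
E 33
E 29
E 30
E 27
W 34
W 35
W 37
W 29
;
run;

/* ANOVA using SAS */
proc anova data=marriage;
    class region;        /* categorical variable that defines the groups */
    model age = region;  /* continuous target = grouping variable */
run;
quit;                    /* PROC ANOVA is interactive; QUIT ends it */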
The SAS output of the ANOVA analysis has a few parts, but the most important for hypothesis testing is the ANOVA table. This table is similar to the one we calculated manually.
Looking at the P value, we can infer that the data do not support the null hypothesis (no difference among the mean marriage ages of the 4 regions). So the test indicates that average marriage age is not the same across all 4 regions.
We can also look at the box plot in the SAS output, which gives a visualisation of the data.
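If the box plot does not appear in the output (for example, when ODS Graphics is switched off), a similar plot can be requested directly. The sketch below uses PROC SGPLOT, which is a swapped-in alternative and not the plot PROC ANOVA draws itself. The data step is repeated in compact form so the example runs on its own, and the axis label is an illustrative choice.

```sas
/* Same marriage-age data, several observations per line */
data marriage;
    input region $ age @@;   /* @@ holds the line so it is read repeatedly */
    cards;
S 27 S 22 S 22 S 24 N 24 N 27 N 28 N 30
E 33 E 29 E 30 E 27 W 34 W 35 W 37 W 29
;
run;

/* One box per region, marriage age on the vertical axis */
proc sgplot data=marriage;
    vbox age / category=region;
    yaxis label="Marriage age";
run;
```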
Some of the key assumptions of ANOVA are:
- Independence: Observations are independent of each other.
- Normality: Values follow a normal distribution within each group (marriage age for each region).
- Homogeneity of Variances/Homoscedasticity: The variance of the data is the same or similar in all the groups/regions.
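The last two assumptions can be examined in SAS. The sketch below is one possible approach, not the only one: it uses the HOVTEST= option of the MEANS statement for Levene's test of equal variances, and PROC UNIVARIATE with the NORMAL option for per-region normality tests. With only 4 observations per region these tests have little power, so the results are indicative at best. Independence cannot be tested this way; it depends on how the data were collected.

```sas
/* Same marriage-age data, repeated so the example runs on its own */
data marriage;
    input region $ age @@;   /* @@ reads several observations per line */
    cards;
S 27 S 22 S 22 S 24 N 24 N 27 N 28 N 30
E 33 E 29 E 30 E 27 W 34 W 35 W 37 W 29
;
run;

/* Homogeneity of variances: Levene's test across regions */
proc anova data=marriage;
    class region;
    model age = region;
    means region / hovtest=levene;
run;
quit;

/* Normality: tests and summary statistics within each region */
proc univariate data=marriage normal;
    class region;
    var age;
run;
```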