From an analysis perspective, variables are either categorical or continuous (details on Variables Types). Categorical variables are summarised with counts and proportions. SAS provides the PROC FREQ procedure for this purpose; FREQ is short for the frequency of variable values.
In this blog, we will explore some of the commonly used options and statements of PROC FREQ.
PROC FREQ can be used for analysis and validation when the analysis variable(s) are categorical.
It helps in:
- Displaying count of variable values or distribution of the variable
- Finding missing values or % missing values of a categorical variable
- Creating a cross table or contingency table for two variables
- Analysing multi-dimensional (n-way) tables
- Finding association between variables using Chi-Square test
- Calculating overall %, row %, column % and cumulative % along with counts
- Computing Exact test statistics
SAS PROC FREQ produces a one-way frequency table for a single categorical variable. Let us consider a few scenarios to see where PROC FREQ is needed.
Context: We have information about the athletes who participated in the London Olympics. Some of the variables available are Name, Gender, State, Event Participated and the medal won.
Scenario 1: You may want to know the count of athletes from each country. This will help you understand the most & least represented countries.
Scenario 2: You may ask how the gender distribution differs across countries. Does country A have a higher % of “Female” athletes than other countries?
Scenario 3: Next, is there an association between Country and Gender? In other words, can we say that the Country variable influences whether more or fewer females become athletes?
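As a rough sketch, the three scenarios map onto PROC FREQ calls like the ones below. The variable names Country and Gender are assumptions made for illustration (the examples later in this post use Sex and Age), so adjust them to the columns actually present in your dataset.

```sas
/* Assumed variable names: Country and Gender in the dataset London */

/* Scenario 1: count of athletes per country, most represented first */
proc freq data=London order=freq;
  table Country;
run;

/* Scenario 2: gender split within each country (row percentages only) */
proc freq data=London;
  table Country*Gender / nopercent nocol;
run;

/* Scenario 3: chi-square test of association between Country and Gender */
proc freq data=London;
  table Country*Gender / chisq;
run;
```

Each option used here (ORDER=FREQ, NOPERCENT, NOCOL, CHISQ) is a standard PROC FREQ option; only the variable names are hypothetical.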
Find Count and % of Variable Values
In PROC FREQ, the TABLE statement names the variable(s) whose level frequencies are to be calculated. Here, we want the count of Male and Female athletes based on the Sex variable in the dataset London.
proc freq data=London;
  table sex;
run;
By default, PROC FREQ produces the Frequency (count), Percent, Cumulative Frequency and Cumulative Percent statistics. The NOPERCENT option on the TABLE statement suppresses the percentages (in a one-way table, both Percent and Cumulative Percent), and NOCUM suppresses the cumulative frequency and cumulative percent columns.
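For instance, a minimal sketch that keeps only the Frequency column for the Sex variable could look like this. It assumes the same London dataset as above; the options go after a slash on the TABLE statement.

```sas
/* One-way table of Sex with only the Frequency column */
proc freq data=London;
  table sex / nopercent nocum;  /* drop percentages and cumulative columns */
run;
```

Either option can also be used on its own, for example NOCUM alone to keep Frequency and Percent.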
Control Order of the Table
To sort the table by frequency, use the ORDER= option on the PROC FREQ statement. By default (without an ORDER= option), the rows are ordered by the internal (unformatted) value of the categorical variable.
Often we want the values with the highest counts on top; ORDER=FREQ does this by sorting the rows in descending order of frequency.
proc freq data=london order=freq;
  table Age;
run;
We can also group the age values with a custom format (defined using PROC FORMAT) and use ORDER=FORMATTED so that the rows are ordered by the formatted values.
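proc format;
  /* "-<" excludes the upper bound, so a boundary age such as 15 falls in "2:15-20", not "1:<15" */
  value agef
    low -< 15  = "1:<15"
    15  -< 20  = "2:15-20"
    20  -< 30  = "3:20-30"
    30  -< 40  = "4:30-40"
    40  -< 50  = "5:40-50"
    50  - high = "6:50+"
    other      = "Others";
run;

proc freq data=london order=formatted;
  table Age;
  format age agef.;
run;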
Creating Contingency or Cross Tab using Two Variables
The TABLE statement can also generate a cross tab (contingency table) for two variables. The two categorical variables have to be separated by "*". Here, we cross-tabulate the age groups against the sex of the athletes.
proc freq data=london order=formatted;
  table Age*Sex;  /* rows = age group, columns = sex */
  format age agef.;
run;
By default, it produces Frequency, Percent, Row Percent and Column Percent. To suppress the percentages, use the NOPERCENT, NOROW and NOCOL options.
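As an illustration, the sketch below keeps only the counts and row percentages, which is the view needed to compare the sex split within each age group. It assumes the agef format defined earlier has already been created in the session.

```sas
/* Cross tab of age group by sex showing Frequency and Row Percent only */
proc freq data=london order=formatted;
  table Age*Sex / nopercent nocol;  /* drop overall and column percentages */
  format age agef.;
run;
```

Swapping NOCOL for NOROW would instead show the age distribution within each sex.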
In some cases, we may want a list view instead of a cross tab. The LIST option on the TABLE statement produces it.
proc freq data=london order=formatted;
  table Age*Sex / nopercent nocol norow list;
  format age agef.;
run;
In the next set of blogs, we will focus on some additional options and statistical applications of PROC FREQ.