In a number of SAS PROCs or Procedures, we may want to group the observations together. For example if we are using BY statement in PROC PRINT (All About PROC PRINT), SAS prints observations grouped by BY variable(s) values.
When we are using Base SAS and have specified BY statement, the input dataset have to be sorted by BY variables.
In this blog, we will discuss different options of PROC SORT.
Sort a SAS dataset
%* Sample Dataset ; data cust; input custid name $ age; cards; 12 Rohan 23 11 Ronit 37 15 Soni 27 13 John 65 16 Samuel 39 12 Rohan 27 13 John 62 12 Rohan 37 ; run; %* Sort a SAS dataset; proc sort data=cust ; by custid; run;
Sort a SAS dataset using a particular variable and it can be defined using BY statement.
Sort a Dataset and Output to a new Dataset
Using OUT= option we can out the sorted data values to a new dataset instead of overwriting an existing input dataset.
%* Create sorted output dataset; proc sort data=cust out=cust1; by custid; run; Title "Input SAS dataset"; proc print data=cust noobs; run; Title "Sorted SAS dataset"; proc print data=cust1 noobs; run;
Sort using Multiple Variables
A SAS dataset can be sorted using multiple variables. These variables can be listed in BY statement. Based on sequence of variables in BY statement the dataset will be sorted.
%* Sorted SAS dataset with multiple variables; proc sort data=cust out=cust1; by custid age; run;
In this case first observations are arranged using "custid" and then "age" variable values. By default SAS arrange values in Ascending order.
We can arrange data in ascending order for one and descending order for other variable values.
Sort in Ascending and Descending Orders
In this example, we want output dataset to be sorted in ascending order (default option) for custid and descending order for age variable using "descending option" before variable name in BY statement.
%* Sorted SAS dataset with multiple variables; proc sort data=cust out=cust1; by custid descending age; run; Title "Sorted Ascending for custid and descending for age"; proc print data=cust1 noobs; run;
Duplicate Values and Records
A SAS dataset can be duplicate based on one or more variables, it is called duplicate keys , or it can be duplicates across all variables and this called duplicate records.
SAS has an option nodup or noduplicates which removes duplicate values if they are adjacent to each other or next to each other.
%* Remove Duplicates; Title "Input SAS dataset"; proc print data=cust noobs; run; proc sort data=cust nodup out=cust2; by custid; run; Title "Remove duplicate records"; proc print data=cust2 noobs; run;
In this example, notice that duplicate values for "Ronit" were deleted but not for "Rohan". Since Age values were different. SAS was comparing 37 vs 27 which are difference and then 27 vs 37 which were again different, so not deleted.
Now if sort both on CustID and Age, then one of the duplicate records of "Rohan" will be deleted.
proc sort data=cust nodup out=cust2; by custid; run; Title "Remove duplicate records"; proc print data=cust2 noobs; run;
Now if you wanted to keep only nonduplicate for By variable(s), we can get rid of duplicates using dupkey option.
proc sort data=cust nodup out=cust2; by custid age; run; Title "Remove dumplicate records"; proc print data=cust2 noobs; run;
SAS has introduced a few new options in PROC SAS. Next blog will be on a new options in SAS. Wait for the update..