Concatenating Datasets in SAS

Author: Mrinmoy Saikia


Data preparation is one of the most significant steps in Data Science, Analytics or Reporting Projects. In this blog, we focus on learning - "Combining 2 or more SAS datasets Vertically". Combining SAS datasets vertically is also referred as Concatenating Datasets.

Concatenating is combining 2 or more datasets one below another. From the first data set it will take all the observations followed by all observations from the second data set and so on. The total observations will be the sum of the both data set observations.

This can be done multiple ways in SAS and some of these options are

  • SET statement for vertically combining multiple datasets
  • PROC APPEND - appending a SAS dataset with another
  • PROC SQL -for concatenating SAS datasets

SET Statement for Concatenating SAS Datasets

SET data-set(s); After SET keyword, we can list name of SAS datasets to be combined vertically.

During Compilation Phase, SAS reads the information of each data set that is named in the SET statement and then creates a program data vector that contains data descriptors for all the variables from all data sets as well as variables created by the DATA step.

Execution
SAS reads the first observation from the first data set into the program data vector. It processes the first observation and executes other statements in the DATA step. Then it writes the content of the program data vector to the output SAS data set.

SAS continues to read one observation at a time from the first data set until it finds an end-of-file indicator. After that SAS begins reading observations from the second data set and so forth until it reads all observations from all data sets.

Scenario 1: We have two SAS datasets - School and Score. The datasets has information on Boards and Scores of Students. We want to combine /vertically join/concatenate these two datasets.

We are using SET statement. In the below step we are concatenating score data set with school(base data set) in data step by set statement, we can add two or more data set.

In the new data set called concatenate has 8 observations, which is the sum of the observations from the combined data sets.

Scenario 2: How to Identifying the data set source for each data set while concatenating the two or more data set?

One of the way is to create a variable in each of the input datasets and assign data identifier value and when combined , we can find which observation is coming from which dataset.

The another option is to concatenate the data set by Using the IN Option to create the SOURCE variable in the SET statement. The IN=datasetname option as shown below (without using SOURCE=”school”  and SOURCE=”score”  in the original DATA steps).

 

Concatenating using PROC SQL

Here is the step to concatenate datasets using PROC SQL.

 

Appending Datasets in SAS using PROC APPEND

We explored options to concatenate data sets in SAS.  We can also append one dataset to another dataset in SAS using PROC APPEND.  Appending may be efficient as it avoids reading all the observations,  it appends the second dataset to the first  and it can be done  PROC APPEND but it requires both the dataset to have same variables.

 

If the base data set has one variable which does not exist in the other data set then in the log it will show you warning message that the variable is not exist in the other data set.

If the second data set has one extra variable then it will not append the data set and will give the error in log.

WARNING: Variable score was not found on BASE file. The variable will not be added to the BASE file.

ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.

 

It gives following messages in the log window.
NOTE: Appending WORK.SCORE to WORK.SCORE1.
WARNING: Variable score was not found on BASE file. The variable will not be added to the BASE file.
ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.
Now, let's use FORCE option.

With the help of force option, PROC APPEND will copy the common variables and values to the base datasets. The dataset score had additional variable -score- which has been excluded from adding to the base dataset as that dataset can't be changed.

If no additional processing is necessary, using PROC APPEND or the APPEND statement in PROC DATASETS is more efficient than using a DATA step to concatenate data sets.

Interleaving Datasets in SAS

Interleaving Datasets is a concatenating steps where we will be concatenating two or more data sets using BY variable in each BY group or we can do it by using index.
Before interleave data sets, the data set has to be sorted or grouped by the same variable or variables that use in the BY statement, or use index for the data sets.
To interleave, we should specify the data sets we want interleaved in the SET statement, and indicate on which variable you want the final data set sorted in the BY statement. We can interleave as many data sets as we would like.

The resulting data set contains all of the variables and all of the observations from all of the input data sets.

Scenario: Interleaving in the Simplest Case

In the below both the datasets SCHOOL and SCORE, the observations are arranged in order of the values of the common BY variable “year”.

We want to interleave both the datasets values based on value of year variable.

The resulting data set INTERLEAVING has 8 observations, which is the sum of the observations from the combined data sets. The new data set contains all variables from both data sets. The value of variables found in one data set but not in the other are set to missing, and the observations are arranged by the values of the BY variable.

Leave a Comment