Data Science: Profile Screening Model for Mid-Management Roles

Business Context: The client was an executive search firm that had built a database of over a million candidate profiles. The client wanted to use this database for smarter candidate selection and a more efficient recruitment process.

For this project, the aim was to build a predictive model that identifies a list of 100 candidates for each open role at the middle management level. This would help close open positions faster, leading to a better client experience and faster payment realisation.


Model Development – Target Variable Definition

To define the target (dependent) variable, we selected all middle management positions between Mar-2015 and Feb-2016 and identified all the candidates who had been referred for each of these positions. Candidates who were selected for the position within the next 6 months were tagged as 1; all other candidates were tagged as 0.
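The tagging rule can be illustrated with a minimal sketch. It assumes the 6-month window starts at the referral date and approximates it as 183 days; neither detail is stated in the project description. The record layout and the `tag_target` helper are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical referral records:
# (candidate_id, position_id, referral_date, selection_date or None)
referrals = [
    ("C1", "P1", date(2015, 3, 10), date(2015, 6, 1)),
    ("C2", "P1", date(2015, 3, 12), None),              # never selected
    ("C3", "P2", date(2015, 9, 1), date(2016, 5, 15)),  # selected after the window
]

# Assumption: "6 months" is approximated as 183 days from the referral date.
WINDOW = timedelta(days=183)


def tag_target(referral_date, selection_date, window=WINDOW):
    """Return 1 if the candidate was selected within the window, else 0."""
    if selection_date is None:
        return 0
    return int(referral_date <= selection_date <= referral_date + window)


targets = {
    (cand, pos): tag_target(referred, selected)
    for cand, pos, referred, selected in referrals
}
print(targets)
```

Each (candidate, position) pair gets its own label, so a candidate referred for several positions can be a 1 for one of them and a 0 for the others.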


Model Development – Independent Variables

The data sources considered for creating independent variables included the Job Profile (the 10 most important keywords or phrases) and Candidate Experience (skills, experience and interview skills). External data from LinkedIn were also available.

Examples of variables

  • Total Years of Experience
  • Years of Experience in the same industry
  • Years in the current role
  • Salary in the current role
  • Gap between Current Salary and Job Role CTC Salary
  • Volume of LinkedIn Connections
  • Level of LinkedIn Connections - # of C-level connections, # of mid-level connections etc
  • # of recommendations
  • # of courses/training completed
  • # of articles or blogs written

Overall, we had around 170 variables available for building the model. All of these variables were extracted for the 6-month period before the candidate was screened for the job profile.

Model Development: Univariate and Bivariate Analysis

At this point, the data for the target (dependent) variable, which is binary, and a list of independent variables (both categorical and continuous) were available for model development. Before modelling, univariate and bivariate analyses are important for deciding variable treatments and for understanding the data and the relationship between the target variable and each independent variable.

Univariate Analysis: Each variable was summarised and checked for missing values and outliers. One of the challenges in this project was that many variables had missing values: many candidates were not very active on LinkedIn, had not written blogs, and so on. For most of these variables, the missing values were replaced with zeros.

Some continuous variables, such as salary in the previous job, were also checked for outliers. We capped values at the 1st and 99th percentiles, which reduced the undue influence of a few extreme values or candidates.
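The two treatments above (zero-filling and percentile capping) can be sketched as follows. This is an illustration on made-up numbers, not the project's actual code; the helper names are hypothetical, and the percentile uses linear interpolation, which is only one of several common definitions.

```python
def fill_missing_with_zero(values):
    """Replace missing entries (None) with 0."""
    return [0 if v is None else v for v in values]


def percentile(sorted_values, q):
    """Percentile q (0-100) of an already sorted list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("cannot take a percentile of an empty list")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def cap_outliers(values, low_q=1, high_q=99):
    """Clip every value into the [low_q, high_q] percentile range."""
    ordered = sorted(values)
    low, high = percentile(ordered, low_q), percentile(ordered, high_q)
    return [min(max(v, low), high) for v in values]


# Made-up example data.
blog_counts = fill_missing_with_zero([3, None, 0, None, 12])
salaries = list(range(1, 102))          # 101 values: 1, 2, ..., 101
capped_salaries = cap_outliers(salaries)
print(blog_counts, min(capped_salaries), max(capped_salaries))
```

Capping keeps every candidate in the sample, whereas dropping outlier rows would shrink an already small pool of selected candidates.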

Bivariate Analysis: We converted all continuous variables into categorical variables (by creating buckets) and used column charts to see the share of selected candidates (target = 1) at each level of each categorical variable. This helped us form hypotheses about the relationship between each independent variable and the target variable. Some examples of interesting patterns were:

  • When the gap between Current Salary and CTC was between 15% and 30%, candidates had a higher chance of being selected
  • Volume of connections had a positive relationship with the probability of selection
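The bucketing step behind these findings can be sketched as below. The bucket boundaries follow the 15% to 30% range mentioned above, but the data are invented purely to show the mechanics and do not reproduce the project's results; `bucket_gap` and `selection_rate_by_bucket` are hypothetical helper names.

```python
from collections import defaultdict


def bucket_gap(gap_pct):
    """Assign a salary gap (in percent) to one of three illustrative buckets."""
    if gap_pct < 15:
        return "<15%"
    if gap_pct <= 30:
        return "15-30%"
    return ">30%"


def selection_rate_by_bucket(values, targets, bucket_fn):
    """Return {bucket: share of candidates with target == 1}."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [selected, total]
    for value, target in zip(values, targets):
        bucket = bucket_fn(value)
        counts[bucket][0] += target
        counts[bucket][1] += 1
    return {bucket: sel / tot for bucket, (sel, tot) in counts.items()}


# Invented data: salary gap in percent and the 0/1 target for 8 candidates.
gaps = [5, 10, 20, 25, 28, 40, 50, 12]
selected = [0, 0, 1, 1, 0, 0, 1, 0]

rates = selection_rate_by_bucket(gaps, selected, bucket_gap)
for bucket, rate in rates.items():
    print(f"{bucket}: {rate:.0%} selected")
```

The resulting rates per bucket are what a column chart would display for each categorical variable.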


Model Development: Variables Reduction and Selection

Although we had only 170 variables, we used an information value (IV) based criterion to select the 30 most important ones. We also made sure to preserve diversity among the variables by keeping variables that came from different data sources and explained different candidate behaviours (for example, experience versus activity on LinkedIn).
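For reference, the information value of a bucketed variable is commonly computed as the sum over buckets of (share of events minus share of non-events) times the log of their ratio. The sketch below follows that common definition; the 0.5 smoothing for empty cells is an assumption added here to avoid division by zero and is not something the project description specifies.

```python
import math
from collections import Counter


def information_value(buckets, targets, smoothing=0.5):
    """IV of one bucketed variable against a 0/1 target.

    `smoothing` is added to every cell count so that empty cells do not
    produce log(0) or a division by zero (an assumption of this sketch).
    """
    events, non_events = Counter(), Counter()
    for bucket, target in zip(buckets, targets):
        if target == 1:
            events[bucket] += 1
        else:
            non_events[bucket] += 1

    labels = set(events) | set(non_events)
    total_events = sum(events.values()) + smoothing * len(labels)
    total_non_events = sum(non_events.values()) + smoothing * len(labels)

    iv = 0.0
    for bucket in labels:
        pct_event = (events[bucket] + smoothing) / total_events
        pct_non_event = (non_events[bucket] + smoothing) / total_non_events
        iv += (pct_event - pct_non_event) * math.log(pct_event / pct_non_event)
    return iv


# Toy variables: one unrelated to the target, one strongly related.
y = [1, 0, 1, 0]
iv_by_variable = {
    "uninformative": information_value(["a", "a", "b", "b"], y),
    "informative": information_value(["a", "b", "a", "b"], y),
}
# Rank variables by IV, highest first (the project kept the top 30).
ranked = sorted(iv_by_variable, key=iv_by_variable.get, reverse=True)
print(ranked, iv_by_variable)
```

A pure IV ranking does not enforce the diversity of data sources described above; that part would still need a manual or rule-based review of the shortlisted variables.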

For final model selection, we used both the ‘forward’ and ‘backward’ selection methods available in R packages and SAS.

The final model had 11 variables, all with a p-value below 0.001. This is strong evidence against the null hypothesis (H0: the parameter estimate for a variable is 0, i.e. there is no relationship between the target variable and that independent variable).

Model Development and Validation

We split the modelling sample into a development sample (70%) and a validation sample (30%). The model was developed using logistic regression on the development sample.
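The split-and-fit workflow can be illustrated with a small standard-library sketch. The project itself used R packages and SAS; the plain gradient-descent fit below is a stand-in for illustration only, run on synthetic data with a single made-up feature, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def split_dev_val(rows, dev_share=0.7, seed=42):
    """Shuffle a copy of the rows and split it into development/validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * dev_share))
    return shuffled[:cut], shuffled[cut:]


def predict_proba(weights, features):
    """Predicted probability of target == 1; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], features)))


def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, target in zip(X, y):
            err = predict_proba(weights, features) - target
            grad[0] += err
            for j, v in enumerate(features):
                grad[j + 1] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Synthetic data: one feature whose true effect on selection is positive.
rng = random.Random(0)
rows = []
for _ in range(200):
    x = rng.uniform(-2.0, 2.0)
    rows.append(([x], 1 if rng.random() < sigmoid(2.0 * x) else 0))

dev, val = split_dev_val(rows)
weights = fit_logistic([f for f, _ in dev], [t for _, t in dev])
val_scores = [predict_proba(weights, f) for f, _ in val]
print(len(dev), len(val), weights)
```

Only the development rows are used to estimate the weights; the validation rows are scored with those fixed weights, which is what makes the later comparison of performance statistics meaningful.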

Some of the model performance statistics checked were:

  • Concordance Percentage: 73% on the development sample
  • KS (Kolmogorov-Smirnov) statistic
  • Lift Chart
  • Gains Table & Chart

The model output (the selected variables and their parameter estimates) was then used to score the validation sample. We computed all the model performance statistics on the validation sample as well, and the values were comparable to those on the development sample.
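Two of the statistics listed above, concordance percentage and KS, can be computed from scores and 0/1 outcomes as in the sketch below. It uses the common definitions (share of event/non-event pairs where the event scores higher, and the maximum gap between the cumulative distributions of events and non-events); the scores are invented and the function names are hypothetical.

```python
def concordance_pct(scores, targets):
    """Percent of (event, non-event) pairs where the event has the higher score."""
    events = [s for s, t in zip(scores, targets) if t == 1]
    non_events = [s for s, t in zip(scores, targets) if t == 0]
    pairs = len(events) * len(non_events)
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    concordant = sum(1 for e in events for n in non_events if e > n)
    return 100.0 * concordant / pairs


def ks_statistic(scores, targets):
    """Maximum gap between cumulative shares of events and non-events."""
    n_events = sum(targets)
    n_non_events = len(targets) - n_events
    if n_events == 0 or n_non_events == 0:
        raise ValueError("need at least one event and one non-event")
    ordered = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    cum_events = cum_non_events = 0
    ks = 0.0
    i = 0
    while i < len(ordered):
        score = ordered[i][0]
        # Process tied scores together so the gap is measured between distinct scores.
        while i < len(ordered) and ordered[i][0] == score:
            if ordered[i][1] == 1:
                cum_events += 1
            else:
                cum_non_events += 1
            i += 1
        ks = max(ks, abs(cum_events / n_events - cum_non_events / n_non_events))
    return ks


# Invented scores and outcomes for six candidates.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
outcomes = [1, 1, 0, 1, 0, 0]
conc = concordance_pct(scores, outcomes)
ks = ks_statistic(scores, outcomes)
print(f"concordance = {conc:.1f}%, KS = {ks:.3f}")
```

Running the same two functions on the development and validation scores gives the side-by-side comparison described above.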

