Cox Regression: Overview
In the previous blog on Survival Modeling, we have given an overview on Survival Modeling and also described on Kaplan Meier estimator and Log Rank test.
In the this blog, the focus is on Cox Proportional Hazards (PH) Model. Cox Proportional Hazards model is also referred by Cox Model, Cox Regression, or Proportional Hazard Model.
Cox Model is used for analysis of Survival data and finding out relationship between Survival Time and covariates (also called explanatory variables or predictors) .
Survival Modeling involves both time (survival time) and covariates. Kaplan Meier focuses only time. Cox Proportional Hazard Model considers both time and explanatory variables.
Cox Model formulation has two parts - baseline hazard function and exponential transformation of Covariates.
PH(t,X) = h0(t) e (b0+b1x1 +b2x2+…)
One of the key assumptions in Cox Regression is that the proportional hazards does not depend on the covariates. And exponential expression involves only covariates and does not depend on time. So, covariates are time independent. When covariates are time dependent, parametric survival model or extended Cox Model is used.
Cox Model or Cox Regression is semi-parametric regression. It is semi-parametric as baseline hazard function could take different distribution.
Cox Regression model is different from Multiple Regression as the first involves both time and covariate but the later only covariates.
Cox Regression model has exponential formulation similar to logistic regression but baseline hazard function in addition to logistic regression.
Also, multiple regression use least square estimation for parameter estimation whereas logistic regression leverage maximum likelihood estimation. Cox Regression uses maximum likelihood but “partial” likelihood. Since, the likelihood formulation considers only “failed” cases and not the censored, it is called partial likelihood estimation.
Hazard Ratio in Cox Regression Model
In Cox Model, Hazard is formulated as
h(t,X) = h0(t) e (b0+b1x1 +b2x2+…)
Consider example for i and j cases
hi(t) = h0(t) e (b0+b1x1i +b2x2i+…) ..... (1)
hj(t) = h0(t) e (b0+b1x1j +b2x2j+…) ..... (2)
Hazard Ratio for i and j
hi(t)/hj(t) = e (b1(x1i - x1j ) +b2(x2i - x2j))
Hazards Ratio hi(t)/hj(t) does not depend on the time. This should be parallel across time points for the cases. If we want to check per unit change for a variable X1 keeping other variables constant then Hazard Ratio will be
hi(t)/hj(t) = e b1
It does not depend on value of X, similar to Odds Ratio in Logistic Regression.
Hazard Ratio is Hazard for one individual to Hazard for second individual. And it is dependent on a set of exploratory variable values for these individuals. Difference between Odd Ratio and Hazard Ratio is that Hazard Ratio is ratio of failure rates where as Odd Ratio is the ratio of proportion.
We will show with in an example later in the blog.
Business Context and Data
We are using same context of wealth management as discussed on previous blog. We want to understand the factors impacting attrition/retention rates across 24 months.
Data will have a set of independent/explanatory variables and two dependent/target variables - time to event/attrition and whether customer has attrited/censored.
Cox Regression in R
In survival package in R, coxph function can be used for building Cox Proportional Hazard Model.
Surv function is used for creating survival object before using coxph function.
Some of the main parameters in coxph function are
formula : Defines the input Cox Regression Model. - a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the surv function.
data : Input data frame which has the variables listed in formula, subset and weights argument.
weights : When input sample is biased sample and if analyst wants to provide weights of cases. This is a vector of case weights.
ties : Couple of methods to handle Ties in Cox Regression in R. This is a character string specifying the method for tie handling. Options are “efron”,“breslow”, and “exact”.
Build a Survival Model - using Cox Regression
A simple model is built using Cox Regression. Explanatory variable is customer age.
cox.ph.model <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE,
Summary of Survival Model
Summary of the model results.
## coxph(formula = Surv(Time2Event, attrition) ~ CUSTOMER_AGE, data = wsurv.data)
## n= 1957, number of events= 607
## coef exp(coef) se(coef) z Pr(>|z|)
## CUSTOMER_AGE -0.02067 0.97954 0.00239 -8.65 <2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## exp(coef) exp(-coef) lower .95 upper .95
## CUSTOMER_AGE 0.98 1.02 0.975 0.984
## Concordance= 0.606 (se = 0.012 )
## Rsquare= 0.041 (max possible= 0.99 )
## Likelihood ratio test= 82.9 on 1 df, p=0
## Wald test = 74.8 on 1 df, p=0
## Score (logrank) test = 74.6 on 1 df, p=0
Hazard Ratio Cox Regression Model estimates Hazard Ratio and does not calculate the baseline function.
The summary result has coef and exp(coef) . These shows coefficient for the variable Age and exponentiation of coefficient.
Hazard Ratio = H1/H0 = e(b0+b0Age1)/e(b1+b1Age0)
H1/H0 = e b1
So, Hazard Ratio for the variable age in the Cox Regression Model is exp(coef) . It indicates that attrition rate goes down by (1-0.979538) = 3% with one unit change in age.
The model should have a two component - baseline hazard function and exponential transformation. But the current summary out put has only one component explonential model.
We can get baseline hazard function using basehaz function. survfit function could also be used for baseline hazard function estimation.
base.hazard <- survfit(Surv(Time2Event, attrition)~1,
## hazard time
## 1 0.01630 3
## 2 0.03095 31
## 3 0.04586 59
## 4 0.06206 91
## 5 0.08955 122
## 6 0.11036 153
## 7 0.12452 181
## 8 0.13725 213
## 9 0.15985 244
## 10 0.17372 276
## 11 0.19020 304
## 12 0.20944 335
## 13 0.22419 367
## 14 0.24621 398
## 15 0.27344 426
## 16 0.30774 458
## 17 0.31740 486
## 18 0.32227 517
## 19 0.32577 549
## 20 0.32717 577
## 21 0.33351 608
## 22 0.33634 640
## 23 0.34061 671
## 24 0.34705 699
## 25 0.35281 731
We can run a Stratified Cox Model. Strata variable could be gender. We can either use strata option to define/give strata variables in this case it is gender. Or we can create dummy variables and use as explanatory variables.
cox.ph.model <- coxph(Surv(Time2Event, attrition)~(CUSTOMER_AGE)*strata(GENDER),
Next blog, will be on Cox Regression: How to interpret Result?
- David G. Kleinbaum,Mitchel Klein, Survival Analysis - A Self-Learning Text