In the last blog, overview to cox proportional model and building Cox Regression Model using R are discussed.

In this blog, the focus is on cox proportional hazards model interpretation or how to interpret Cox Regression Model output in R.

### Cox Regression in R

In survival package in R, coxph function can be used for building Cox Proportional Hazard Model.

1 2 |
install.packages("survival") library(survival) |

1 2 |
cox.ph.m <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE+MoB, data=wsurv.data) |

### Summary output

1 2 3 4 5 6 7 8 9 10 |
summary(cox.ph.m ) ## Call: ## coxph(formula = Surv(Time2Event, attrition) ~ CUSTOMER_AGE + ## MoB, data = wsurv.data) ## ## n= 1957, number of events= 607 ## ## coef exp(coef) se(coef) z Pr(>|z|) ## CUSTOMER_AGE -0.00310 0.99690 0.00443 -0.7 0.48 ## MoB -0.27636 0.75854 0.01824 -15.2 |

The summary output gives information on formula used for the Cox Regression Model. Also, it gives sample information, in this sample size is 1957 observations and number of events(attrition) as 607.

Similar to other regression output, Cox Regression output has beta coefficient, standard error, statistics and P value.

Cox Regression has additional column for e^{b} (exponentiation of beta coefficients).

Goodness of Fit Test for each variables is assess whether a variable is significant or not. It is done using Z statistics and its P Value.

For assessing overall Cox Regression Model, a few statistics are calculated and shown in the summary output.

**Concordance** : Similar to concordance level in logistic regression, concordance shows fraction of pairs in the sample, where the observations with the higher survival time has the higher probability of survival predicted by the model.

**Rsquare** : R^{2} is measure variance explained by the model under consideration.

**Likelihood ratio test, Wald test and Score (logrank)** test tests are using for testing Global Null Hypothesis Beta =0. In this example, P Value for each of these tests are 0, so the null hypothesis could not be accepted.

One of the great way to measure model predictive power is by comparing predicted vs actual results.

### Prediction using Cox Regression

**predict** function can be used for finding out the predicted values using the Cox Model developed. In this example, same sample have been used for calculating predicted values.

If we want to find Hazard Probability, we can using the model we have built. Below function gives Hazard Probability

Now, we will discuss steps to get values using R.

**type=“lp”** calculate linear predicted (xb). We can then use the baseline probability for a time and do the exponential transformation of linear prediction.

1 2 3 4 5 6 7 8 9 10 |
lp.pred <- predict(cox.ph.m, type="lp", data=wsurv.data) # Baseline Function base <- basehaz(cox.ph.m) # Base value H0 = 0.025880965 at 59 Days # Predicted Value at time 59 days Pred.val <- 0.025880965*exp(lp.pred) |

Using Predicted Probability, we can create predicted target level. Now, we have predicted probabilities at a time : 59 days. We can define target value (actual at a time 59 days). Similarly for different time periods, we can get base line probabilities using **basehaz** function output.

1 |
Pred.target <- ifelse(Pred.val>0.09,1,0) |

Considering that the input data has attrition/censoring values at different time, we have to define observed target values at 59 Day for appropriate comparison.

1 2 3 |
target <- ifelse(wsurv.data$Time2Event <=59, 0, wsurv.data$attrition) |

**Model Accuracy: Confusion Matrix**

Using Predicted and Observed target level, confusion matrix is created. For this example, accuracy level & confusion matrix is created at 59 days.

Overall accuracy of the model prediction is 79%. Since action will be on predicted group 1, the 61% of the 1s are actually ones.

**Gains Table and Lift Chart**

Gains Table can be created for each of the important time interval and validate accuracy of the model.

Gain Table for one of the time point is as follow.

Lift Chart shows the lift of the model at each of the deciles. For the selected time, the lift chart is as follow.

If we want to calculate Survival Probability, we can use below code.

1 2 3 4 5 6 |
install.packages("pec") library(pec) Pred_Prob <- predictSurvProb(cox.ph.m, newdata=wsurv.data, times=c(59)) # predicts the survival probability based on the model for the dataset -wsurv.data and for the time 59 days |

### References

- http://stats.stackexchange.com/questions/29815/how-to-interpret-the-output-for-calculating-concordance-index-c-index
- https://apha.confex.com/apha/134am/techprogram/paper_135906.htm

Hi, I have a question :- If I have done a study on individuals for around 120 days(suppose) and my data is right centred then is the base line hazard for cox model going to be different for each of the 120 days! Also, if I have to calculate the risk of event for a particular observation say within 50 days, can I just take out the baseline hazard for 50th day and multiply with the predicted "lp" from cox model!

Thanks for your question Debapriya.. You have to build only one Survival Model for any number of days.. You will have two function - base line function ( Time Dependent) and Covariates Function (X dependent). Cox Regression gives you estimated coefficient(s) for both. You could use both these to calculate probability for each case and at a given time..e.g. 50th day.

Hope this helps.

Hi, Thnx for the reply. IThe blog is really useful. But I am finding difficulty in understanding how to get the probability from these two estimates. Could you please help with the formula! Also, I understand the "Risk" Prediction in R is actually exp('lp'). But from the covariate estimates how are we getting the 'lp' or 'risk' prediction!

Hello Debapriya, It is available directly in the function output. But you could calculate as b1*x1+b2*x2.. I will use this example and show you the calculations

Hi, Thnx for the reply. IThe blog is really useful. But I am finding difficulty in understanding how to get the probability from these two estimates. Could you please help with the formula! Also, I understand the "Risk" Prediction in R is actually exp('lp'). But from the covariate estimates how are we getting the 'lp' or 'risk' prediction!

Hi, I am finding this blog helpful, but i have few questions

1) R has different options for type=(lp,risk,expected,terms)

* lp is linear prediction

* risk=exp(lp)

* but what is "expected" option, where do we use it

* "terms " option gives pred of independent variables used in the model, pretty curious why is this option and where do we use it

2) my main question is

when we run baseline model it will throw hazard for the corresponding time

* now since there are many times how do we get the pred score value( in the above example they have considered time 59 and its corresponding hazard,on what basis they have considered it)

* after getting the pred score value how are we going to tell the time to churn for the non-churned observations, which is the main point in survival analysis

Hi i want to know the future time to event probability-say for next month as to which users will move forward.i have applied survival model in R

Absolutely, that's what survival model is up for.. One key challenges is that probably can be calculated only for data point included in the development sample..

cox.ph.m <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE, data=wsurv.data) base <- basehaz(cox.ph.m) head(base ) Survival Model has two component - linear predictors and base line predictor.. We can get time coefficient using basehaz. It will give value for each of the time point.. Hope you get the answer.. We can have a call and discuss if required..

hi i am having only 22 vs 60 patients to compare their survival and multi factors with it. will that be sufficient to undergo survival analysis.

Yes, Deepak. Only limitation will be that for future also you will be able to calculate survival probability for these two time points..I assume 22 and 60 are two time points

Hi,

While using: Pred.val <- 0.025880965*exp(lp.pred), I receive values greater than 1. I think this is due to the fact that this formula gives hazard ratio which is not a probability. How can I transform it into probability of event?

Values should be less than 1 only, I will double check and confirm..

Hi, how exactly was the Base value H0 (0.025880965 at 59 Days)

calculated?

In words, what does this value represent? Is this the value of the survival function at time t=59? Is it the event rate at time t=59? (i.e. number of deaths in interval 0<t<59/total number of records)? Or is it something else entirely?

A few questions regarding the above analysis.

1. How exactly are predicted probabilities calculated? pred.val calculated above is not survival probability but the survival function and it can be more than 1.

2. In this code, Pred.target 0.09,1,0), where did the 0.09 come from?

Some more clarity on these will help me understand the concepts better.

Thanks for your comments.. will look at it

can you pls enlight me on how to interpret data by using latitude and longitude as a replacement for time according to the equation.

We need more details to help.. please send email to info@dni-institute.in

The interpretation was really helpful.Just have a few questions. I need to predict retirement for next 10 years from the current data.To do that I have used the same as stated above but the pred.prob is coming greater than 1.I think it is not prob but hazard rate.What does it signify and how do u get 0.09 as cut-off.In my case should I consider that 20% employees retire based on hazard rate. And to calculate for next year do I need to increase the age or just predict for the next year keeping the age as same.

Waiting for the answer.

Dear Ram,

This is indeed informative. Thank you.

Did you checked if h(t) can be greater than one or not? Even I am getting h(t) >1 while applying

h(t) = baseline_hazart* e^(BiXi)

if yes then what is proper interpretation?

I am also working on a similar kind of project where we are trying understand the survival curve of customers (attrition problem).

Regards,

Sultan Amed

Hi,

Thanks for the informative article. I am actually looking to predict survival times using cox ph regression method.

I believe that exp(lp) gives the hazard ratio relative to the sample average for all predictor variables, i.e. the value of exp[bi * (Xi-Xi')] where bi is the coefficient for the covariate and Xi is the individual sample, Xi' is the average of that covariate across entire population. When exp(lp) is multipled by ho(t), it still does not give the hazard function of a particular individual unless it is multiplied by exp[biX'] which can be computed separately.

Please let me know if the above understanding is correct and how can survival times be computed under such circumstances.

The basehaz function from the

`survival`

package does not give the baseline hazard. It actually computes the Cumulative Hazard for whatever reason.See the manual for the survival package: https://stat.ethz.ch/R-manual/R-devel/library/survival/html/basehaz.html

Hi, can you please look at my question here? Any help will be greatly appreciated.

https://stats.stackexchange.com/questions/281550/how-can-i-interpret-the-outcome-of-cox-regression-with-time-varying-variables

Hi thanks for the helpful blog. I have a question, the code, Pred.target 0.09,1,0), where did the 0.09 come from?

waiting for the answer

Hi,

Thanks for a great blog ! Please help me understand the purpose of following snippet -

target <- ifelse(wsurv.data$Time2Event <=59,

0,

wsurv.data$attrition)

1. Why are keeping all entries as 0 before Time2Event <=59 ?

2. How is the probability cutoff decided for a particular Time2Event ? For ex. you took .09 at 59 time

3. How do I calculate hazard probabilities for new data set once the model has been established ?

Would be great if you can help me with clarify my doubts !

Regards

-Saurabh Kapoor