Cox Regression - Interpret Result and Predict

In the last blog, overview to cox proportional model and building Cox Regression Model using R are discussed.

In this blog, the focus is on cox proportional hazards model interpretation or how to interpret Cox Regression Model output in R.

Cox Regression in R

In survival package in R, coxph function can be used for building Cox Proportional Hazard Model.

install.packages("survival")
library(survival)
cox.ph.m <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE+MoB,
                      data=wsurv.data)

Summary output

summary(cox.ph.m )
## Call:
## coxph(formula = Surv(Time2Event, attrition) ~ CUSTOMER_AGE + 
##     MoB, data = wsurv.data)
## 
##   n= 1957, number of events= 607 
## 
##                  coef exp(coef) se(coef)     z Pr(>|z|)    
## CUSTOMER_AGE -0.00310   0.99690  0.00443  -0.7     0.48    
## MoB          -0.27636   0.75854  0.01824 -15.2

cox_regression_output

The summary output gives information on formula used for the Cox Regression Model. Also, it gives sample information, in this sample size is 1957 observations and number of events(attrition) as 607.

Similar to other regression output, Cox Regression output has beta coefficient, standard error, statistics and P value.

Cox Regression has additional column for eb (exponentiation of beta coefficients).

Goodness of Fit Test for each variables is assess whether a variable is significant or not. It is done using Z statistics and its P Value.

For assessing overall Cox Regression Model, a few statistics are calculated and shown in the summary output.

Concordance : Similar to concordance level in logistic regression, concordance shows fraction of pairs in the sample, where the observations with the higher survival time has the higher probability of survival predicted by the model.

Rsquare : R2 is measure variance explained by the model under consideration.

Likelihood ratio test, Wald test and Score (logrank) test tests are using for testing Global Null Hypothesis Beta =0. In this example, P Value for each of these tests are 0, so the null hypothesis could not be accepted.

One of the great way to measure model predictive power is by comparing predicted vs actual results.

Prediction using Cox Regression

predict function can be used for finding out the predicted values using the Cox Model developed. In this example, same sample have been used for calculating predicted values.

If we want to find Hazard Probability, we can using the model we have built. Below function gives Hazard Probability

Hazard Function

Now, we will discuss steps to get values using R.

type=“lp” calculate linear predicted (xb). We can then use the baseline probability for a time and do the exponential transformation of linear prediction.

lp.pred <- predict(cox.ph.m,
                  type="lp",
                  data=wsurv.data)
# Baseline Function
base <- basehaz(cox.ph.m)

# Base value H0 = 0.025880965 at 59 Days
# Predicted Value at time 59 days

Pred.val <- 0.025880965*exp(lp.pred)

Using Predicted Probability, we can create predicted target level. Now, we have predicted probabilities at a time : 59 days. We can define target value (actual at a time 59 days). Similarly for different time periods, we can get base line probabilities using basehaz function output.

Pred.target <- ifelse(Pred.val>0.09,1,0)

Considering that the input data has attrition/censoring values at different time, we have to define observed target values at 59 Day for appropriate comparison.

target <- ifelse(wsurv.data$Time2Event <=59,
                 0,
                 wsurv.data$attrition)

Model Accuracy: Confusion Matrix

Using Predicted and Observed target level, confusion matrix is created. For this example, accuracy level & confusion matrix is created at 59 days.

ConfusionMatrix

 

Overall accuracy of the model prediction is 79%. Since action will be on predicted group 1, the 61% of the 1s are actually ones.

Gains Table and Lift Chart

Gains Table can be created for each of the important time interval and validate accuracy of the model.

Gain Table for one of the time point is as follow.

GainTable

 

Lift Chart shows the lift of the model at each of the deciles. For the selected time, the lift chart is as follow.

Lift Chart

If we want to calculate Survival Probability, we can use below code.

install.packages("pec")
library(pec)
Pred_Prob <- predictSurvProb(cox.ph.m,
                newdata=wsurv.data,
                times=c(59))      
# predicts the survival probability based on the model for the dataset -wsurv.data and for the time 59 days

References

  • http://stats.stackexchange.com/questions/29815/how-to-interpret-the-output-for-calculating-concordance-index-c-index
  • https://apha.confex.com/apha/134am/techprogram/paper_135906.htm

25 thoughts on “Cox Regression - Interpret Result and Predict”

  1. Hi, I have a question :- If I have done a study on individuals for around 120 days(suppose) and my data is right centred then is the base line hazard for cox model going to be different for each of the 120 days! Also, if I have to calculate the risk of event for a particular observation say within 50 days, can I just take out the baseline hazard for 50th day and multiply with the predicted "lp" from cox model!

    • Thanks for your question Debapriya.. You have to build only one Survival Model for any number of days.. You will have two function - base line function ( Time Dependent) and Covariates Function (X dependent). Cox Regression gives you estimated coefficient(s) for both. You could use both these to calculate probability for each case and at a given time..e.g. 50th day.

      Hope this helps.

      • Hi, Thnx for the reply. IThe blog is really useful. But I am finding difficulty in understanding how to get the probability from these two estimates. Could you please help with the formula! Also, I understand the "Risk" Prediction in R is actually exp('lp'). But from the covariate estimates how are we getting the 'lp' or 'risk' prediction!

  2. Hi, Thnx for the reply. IThe blog is really useful. But I am finding difficulty in understanding how to get the probability from these two estimates. Could you please help with the formula! Also, I understand the "Risk" Prediction in R is actually exp('lp'). But from the covariate estimates how are we getting the 'lp' or 'risk' prediction!

  3. Hi, I am finding this blog helpful, but i have few questions
    1) R has different options for type=(lp,risk,expected,terms)
    * lp is linear prediction
    * risk=exp(lp)
    * but what is "expected" option, where do we use it
    * "terms " option gives pred of independent variables used in the model, pretty curious why is this option and where do we use it
    2) my main question is
    when we run baseline model it will throw hazard for the corresponding time
    * now since there are many times how do we get the pred score value( in the above example they have considered time 59 and its corresponding hazard,on what basis they have considered it)
    * after getting the pred score value how are we going to tell the time to churn for the non-churned observations, which is the main point in survival analysis

    • Absolutely, that's what survival model is up for.. One key challenges is that probably can be calculated only for data point included in the development sample..

      cox.ph.m <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE, data=wsurv.data) base <- basehaz(cox.ph.m) head(base ) Survival Model has two component - linear predictors and base line predictor.. We can get time coefficient using basehaz. It will give value for each of the time point.. Hope you get the answer.. We can have a call and discuss if required..

  4. hi i am having only 22 vs 60 patients to compare their survival and multi factors with it. will that be sufficient to undergo survival analysis.

  5. Hi,
    While using: Pred.val <- 0.025880965*exp(lp.pred), I receive values greater than 1. I think this is due to the fact that this formula gives hazard ratio which is not a probability. How can I transform it into probability of event?

  6. Hi, how exactly was the Base value H0 (0.025880965 at 59 Days)
    calculated?

    In words, what does this value represent? Is this the value of the survival function at time t=59? Is it the event rate at time t=59? (i.e. number of deaths in interval 0<t<59/total number of records)? Or is it something else entirely?

  7. A few questions regarding the above analysis.

    1. How exactly are predicted probabilities calculated? pred.val calculated above is not survival probability but the survival function and it can be more than 1.
    2. In this code, Pred.target 0.09,1,0), where did the 0.09 come from?

    Some more clarity on these will help me understand the concepts better.

  8. can you pls enlight me on how to interpret data by using latitude and longitude as a replacement for time according to the equation.

  9. The interpretation was really helpful.Just have a few questions. I need to predict retirement for next 10 years from the current data.To do that I have used the same as stated above but the pred.prob is coming greater than 1.I think it is not prob but hazard rate.What does it signify and how do u get 0.09 as cut-off.In my case should I consider that 20% employees retire based on hazard rate. And to calculate for next year do I need to increase the age or just predict for the next year keeping the age as same.
    Waiting for the answer.

  10. Dear Ram,
    This is indeed informative. Thank you.
    Did you checked if h(t) can be greater than one or not? Even I am getting h(t) >1 while applying
    h(t) = baseline_hazart* e^(BiXi)
    if yes then what is proper interpretation?

    I am also working on a similar kind of project where we are trying understand the survival curve of customers (attrition problem).

    Regards,
    Sultan Amed

  11. Hi,
    Thanks for the informative article. I am actually looking to predict survival times using cox ph regression method.

    I believe that exp(lp) gives the hazard ratio relative to the sample average for all predictor variables, i.e. the value of exp[bi * (Xi-Xi')] where bi is the coefficient for the covariate and Xi is the individual sample, Xi' is the average of that covariate across entire population. When exp(lp) is multipled by ho(t), it still does not give the hazard function of a particular individual unless it is multiplied by exp[biX'] which can be computed separately.

    Please let me know if the above understanding is correct and how can survival times be computed under such circumstances.

  12. Hi thanks for the helpful blog. I have a question, the code, Pred.target 0.09,1,0), where did the 0.09 come from?

    waiting for the answer

  13. Hi,

    Thanks for a great blog ! Please help me understand the purpose of following snippet -

    target <- ifelse(wsurv.data$Time2Event <=59,
    0,
    wsurv.data$attrition)

    1. Why are keeping all entries as 0 before Time2Event <=59 ?
    2. How is the probability cutoff decided for a particular Time2Event ? For ex. you took .09 at 59 time
    3. How do I calculate hazard probabilities for new data set once the model has been established ?

    Would be great if you can help me with clarify my doubts !

    Regards
    -Saurabh Kapoor

  14. Hello,

    What if our covariates are time-dependent? I know that for using coxph(), we need to add cluster(id), but in predicting and finding the risk what should I do?

Leave a Comment