# Survival Modeling Tutorial using R - Part 1

Survival modeling is a family of techniques used when the time to an event is the quantity of interest.

Survival models can be used to predict the time of an event (for example, when a customer will take up a product) or to estimate the duration until the next event occurs (for example, a customer's next visit to a retail store).

Some of the applications of Survival Modeling across industry verticals.

Some of the concepts related to survival modeling are:

• Survival Function: The probability of surviving until time *t*. It is normally represented as S(t).
• Hazard Rate: The event rate at time *t*, given survival until *t*. This is also called the hazard or failure rate.
• Censoring: When the event time for a case under analysis is only partially known, for example because the customer had not yet attrited when tracking ended, the observation is said to be censored.
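As a small illustration of censoring, the sketch below uses the `Surv` function from the `survival` package to pair each duration with an event flag. The three values are made up for illustration; `1` marks an observed event and `0` a censored case.

```r
library(survival)

# Hypothetical durations in days and event flags (1 = event observed, 0 = censored)
days  <- c(671, 731, 400)
event <- c(1, 0, 1)

# Surv() bundles time and status; censored times print with a trailing "+"
s <- Surv(days, event)
print(s)
```

The censored case still carries information: we know that customer stayed at least 731 days, even though the actual attrition date was not observed.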

More information on survival modeling and its concepts can be found in the references at the end of this post.

### Survival Modeling in R

Quite a few R packages support survival modeling, and several of them also ship with sample datasets.
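For instance, the `survival` package bundles several example datasets. The sketch below loads the package and peeks at `lung`, one of its bundled datasets. It is picked here only as an illustration and is not the data used in this post.

```r
library(survival)

# 'lung' ships with the survival package; it is used here only as an example
head(lung[, c("time", "status", "age", "sex")])

# Names of the datasets bundled with the package
datasets_in_pkg <- data(package = "survival")$results[, "Item"]
head(datasets_in_pkg)
```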

### Business Scenario and Data Preparation

It is a known fact that acquiring a new customer is much more difficult than retaining an existing one. Organizations incur substantial cost in acquiring new customers and make money over the customer life cycle, so understanding the customer retention period is critical. It helps the organization acquire and promote the right customers.

Scenario: For a wealth management client, we wanted to estimate the time until a customer closes an investment relationship (attrition). At the time of acquisition, we know only basic demographic information about the customers. We tracked customers for 24 months and checked their attrition status. We want to understand whether Age and Gender impact their loyalty toward the wealth management client.

Survival analysis needs two variables: a response or event variable (whether the customer has attrited) and a time variable (when the customer attrited, that is, days to attrition). In this example the attrition rate is low, so we created a biased sample: all attrited customers, but only 5% of the non-attrited customers.
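The biased sampling step can be sketched as follows. The customer table, its size, and the 3% attrition rate are invented for illustration; only the rule (keep every attrited customer, keep 5% of the rest) comes from the description above.

```r
set.seed(42)

# Hypothetical customer base with a low attrition rate (3% is an assumption)
n <- 10000
customers <- data.frame(
  CustID    = seq_len(n),
  attrition = rbinom(n, size = 1, prob = 0.03)
)

attrited <- customers[customers$attrition == 1, ]
retained <- customers[customers$attrition == 0, ]

# Keep every attrited customer, but only a 5% random sample of the others
keep   <- sample(nrow(retained), size = round(0.05 * nrow(retained)))
biased <- rbind(attrited, retained[keep, ])

table(biased$attrition)
```

Because attrited customers are over-represented, rates computed on such a sample are useful for comparing groups but are not the attrition rates of the full customer base.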

### Model Build - Life Tables

One of the first steps in survival modeling is analyzing survival times. In this example, survival time is the number of days a customer has been with the wealth management provider. The survival table and the survival time plot are useful for this analysis. Sample view of the data:

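| CustID | GENDER | BIRTH_DATE | CUSTOMER_SINCE | attrition | Time2Event |
|--------|--------|------------|----------------|-----------|------------|
| 10259 | F | 1/1/1900 | 7/12/2009 | 1 | 671 |
| 10032 | F | 1/1/1900 | 7/5/2009 | 0 | 731 |
| 10061 | F | 1/1/1900 | 7/10/2009 | 0 | 731 |
| 10183 | F | 1/1/1900 | 7/7/2009 | 0 | 731 |
| 10617 | M | 1/1/1900 | 7/10/2009 | 0 | 731 |
| 10674 | M | 1/1/1900 | 6/24/2009 | 1 | 731 |
| 10739 | F | 1/1/1900 | 7/6/2009 | 0 | 731 |
| 10783 | F | 1/1/1900 | 3/8/2009 | 0 | 731 |
| 10886 | M | 1/1/1900 | 6/24/2009 | 0 | 731 |
| 10902 | F | 1/1/1900 | 7/5/2009 | 0 | 731 |
| 11000 | F | 1/1/1900 | 6/24/2009 | 0 | 731 |
| 11023 | F | 1/1/1900 | 3/8/2009 | 0 | 731 |
| 11032 | F | 1/1/1900 | 7/12/2009 | 0 | 731 |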
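A minimal sketch of getting rows like these into R is shown below. The five rows are typed in by hand from the sample view; how the full file is actually loaded is not shown in this post, so `read.table` with inline text is only a stand-in.

```r
# A few rows of the sample view, typed in by hand for illustration
sample_rows <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
CustID GENDER BIRTH_DATE CUSTOMER_SINCE attrition Time2Event
10259 F 1/1/1900 7/12/2009 1 671
10032 F 1/1/1900 7/5/2009  0 731
10061 F 1/1/1900 7/10/2009 0 731
10617 M 1/1/1900 7/10/2009 0 731
10674 M 1/1/1900 6/24/2009 1 731
")

# Dates arrive as month/day/year text; convert them to Date objects
sample_rows$CUSTOMER_SINCE <- as.Date(sample_rows$CUSTOMER_SINCE, format = "%m/%d/%Y")

str(sample_rows)

# Cross-tabulate gender against the event flag (1 = attrited)
table(sample_rows$GENDER, sample_rows$attrition)
```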

The survival curve gives the survival probability (or rate) at any given point in time. It starts at 100% and gradually declines over time.

S(t) = Number survived until time t / Total sample size

In the current example, S(t) is the retention rate (1 - attrition rate) at time t: the number of customers still with the wealth provider divided by the number of customers tracked. Equivalently, 1 - S(t) is the cumulative attrition from the start until time t. This simple ratio applies here because every customer in the sample was tracked for the full 24 months, so nobody is censored before the end of the window.

In the survival plot, at 213 days, 86.36% of customers have survived (are retained with the wealth provider). We started with 1957 customers and, 213 days after acquisition, 1690 customers are still with the wealth provider, so the survival rate at 213 days is 1690/1957 = 86.36%.
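The 213-day figure can be reproduced with `survfit` from the `survival` package. The data below are a hypothetical reconstruction that matches only the two counts quoted above (267 exits by day 213, 1690 customers still active). Placing all exits exactly on day 213 is a simplification, not how the real data look.

```r
library(survival)

# Hypothetical sample: 267 customers attrite by day 213, 1690 are still
# active when tracking ends at day 731 (censored)
d <- data.frame(
  Time2Event = c(rep(213, 267), rep(731, 1690)),
  attrition  = c(rep(1, 267), rep(0, 1690))
)

# Kaplan-Meier estimate for the whole sample
fit <- survfit(Surv(Time2Event, attrition) ~ 1, data = d)

# Survival probability at day 213
s213 <- summary(fit, times = 213)$surv
round(100 * s213, 2)  # 86.36

plot(fit, xlab = "Days since acquisition", ylab = "Proportion retained")
```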

The failure function F(t) is the cumulative distribution function of the time to event and gives the probability of failure by time t. Strictly speaking, it is not the same as the hazard rate defined earlier, which is an instantaneous rate conditional on survival up to t.

F(t) = P(T < t) = Probability of failure by time t
F(t) = 1 - S(t) = 1 - Probability of survival until time t

In the current example, F(t) is the cumulative attrition at time t, measured from the starting point of the study.

By 213 days, 267 customers have left the wealth provider out of the 1957 we started with, so the cumulative attrition rate at 213 days is 267/1957 = 13.64%.
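The arithmetic can be checked with a few lines of R. The helper `cum_attrition` is a name made up for this sketch, and the vectors are a hypothetical reconstruction that matches only the counts quoted in the text.

```r
# Cumulative attrition: share of customers whose event happened by day t.
# Assumes nobody is censored before t, so a simple proportion is enough.
cum_attrition <- function(time, status, t) {
  mean(status == 1 & time <= t)
}

# Hypothetical sample matching the counts in the text
time   <- c(rep(213, 267), rep(731, 1690))
status <- c(rep(1, 267), rep(0, 1690))

F213 <- cum_attrition(time, status, 213)
S213 <- 1 - F213

round(100 * c(F = F213, S = S213), 2)  # 13.64 and 86.36
```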

We now have the survival rate and the cumulative attrition rate for the whole sample. Next we need to understand how they differ between Male and Female customers, which answers the question of whether Gender impacts retention.

The survival tables for Male and Female customers are as follows.

The survival curves show that Male and Female customers have different levels of survival rate across the period.
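A sketch of how per-gender survival tables and curves might be produced with `survfit` is shown below. The data are simulated, and the attrition rates for each gender are invented, so the numbers will not match the real analysis.

```r
library(survival)

# Simulated stand-in for the customer data (rates are invented for illustration)
set.seed(1)
n <- 500
GENDER  <- sample(c("F", "M"), n, replace = TRUE)
t_event <- rexp(n, rate = ifelse(GENDER == "M", 1 / 2500, 1 / 4000))
d <- data.frame(
  GENDER     = GENDER,
  Time2Event = pmin(ceiling(t_event), 731),  # tracking stops at 731 days
  attrition  = as.integer(t_event <= 731)    # 0 = still a customer (censored)
)

# One Kaplan-Meier curve per gender
fit_gender <- survfit(Surv(Time2Event, attrition) ~ GENDER, data = d)

# Survival table at a few time points, split by gender
summary(fit_gender, times = c(213, 365, 731))

# Survival curves, one line type per gender
plot(fit_gender, lty = 1:2,
     xlab = "Days since acquisition", ylab = "Proportion retained")
legend("bottomleft", legend = names(fit_gender$strata), lty = 1:2)
```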

The survival curve can be estimated with either the "kaplan-meier" or the "fleming-harrington" method. The `survfit` function of the survival package has a `type` argument, and by giving it the appropriate value one can get KM or FH estimated curves.
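The two estimators can be compared side by side as in the sketch below, which runs on simulated data with an invented attrition rate. It uses the `type` argument mentioned above; newer releases of the `survival` package also document `stype` and `ctype` arguments for the same choice.

```r
library(survival)

# Simulated durations (the rate is invented for illustration)
set.seed(2)
n <- 300
t_event <- rexp(n, rate = 1 / 3000)
d <- data.frame(
  Time2Event = pmin(ceiling(t_event), 731),  # tracking stops at 731 days
  attrition  = as.integer(t_event <= 731)    # 0 = censored
)

# Same data, two estimators
km <- survfit(Surv(Time2Event, attrition) ~ 1, data = d, type = "kaplan-meier")
fh <- survfit(Surv(Time2Event, attrition) ~ 1, data = d, type = "fleming-harrington")

# Side-by-side estimates at each observed time
comparison <- data.frame(time = km$time, KM = km$surv, FH = fh$surv)
head(comparison)
```

With a reasonably large number of customers at risk, the two curves are usually very close; the Fleming-Harrington estimate is never below the Kaplan-Meier one.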

The log-rank test can be used to test equality of survival curves across strata. In this example, we have considered Gender as the strata variable.

The `survdiff` function of the survival package performs the log-rank test for comparing two or more survival curves. It is a non-parametric test.

Null Hypothesis (H0): No difference between the two survival curves (Male and Female)

The p-value of the log-rank test is 0.0676, so at the 90% confidence level (significance level 0.10) we reject the null hypothesis of no difference between the survival curves. At the stricter 95% level the difference would not be significant.
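A minimal sketch of the log-rank test with `survdiff` follows. It runs on simulated data with invented attrition rates, so its p-value will differ from the 0.0676 reported above.

```r
library(survival)

# Simulated stand-in for the customer data (rates are invented for illustration)
set.seed(1)
n <- 500
GENDER  <- sample(c("F", "M"), n, replace = TRUE)
t_event <- rexp(n, rate = ifelse(GENDER == "M", 1 / 2500, 1 / 4000))
d <- data.frame(
  GENDER     = GENDER,
  Time2Event = pmin(ceiling(t_event), 731),
  attrition  = as.integer(t_event <= 731)
)

# Log-rank test of equal survival curves for Male and Female customers
lr <- survdiff(Surv(Time2Event, attrition) ~ GENDER, data = d)
print(lr)

# p-value from the chi-square statistic (degrees of freedom = groups - 1)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```

Comparing `p_value` with the chosen significance level (0.10 in this post) gives the reject or do-not-reject decision.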

To find the effect of multiple variables on the survival curve, we can use Cox Proportional Hazards Regression, which is a semi-parametric regression model. Cox regression modeling will be the topic of the next blog post.
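As a brief preview only, a Cox model with Age and Gender could be fitted with `coxph` from the `survival` package along the following lines. The data and the effect sizes are simulated assumptions, not results from the wealth management sample.

```r
library(survival)

# Simulated customers; the effect sizes below are invented for illustration
set.seed(7)
n <- 400
d <- data.frame(
  AGE    = round(runif(n, 25, 70)),
  GENDER = sample(c("F", "M"), n, replace = TRUE)
)
t_event <- rexp(n, rate = exp(-8 + 0.01 * d$AGE + 0.3 * (d$GENDER == "M")))
d$Time2Event <- pmin(ceiling(t_event), 731)   # tracking stops at 731 days
d$attrition  <- as.integer(t_event <= 731)    # 0 = censored

# Cox proportional hazards model with two covariates
cox <- coxph(Surv(Time2Event, attrition) ~ AGE + GENDER, data = d)

# Hazard ratios: values above 1 indicate faster attrition
exp(coef(cox))
```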

#### References

• http://www.statsoft.com/Textbook/Survival-Failure-Time-Analysis
• http://www.ats.ucla.edu/stat/sas/seminars/sas_survival/
• http://www.ats.ucla.edu/stat/r/examples/asa/asa_ch2_r.htm
• http://support.sas.com/resources/papers/proceedings12/132-2012.pdf
• http://www.ats.ucla.edu/stat/stata/seminars/stata_survival/
