Survival Modeling Tutorial using R - Part 1

Survival Modeling is a family of techniques used when the time to an event is the quantity of interest.

Survival models can be used to predict when an event will occur (for example, when a customer will take up a product) or to estimate the duration until the next event (for example, a customer's next visit to a retail store).

Some of the applications of Survival Modeling across industry verticals.

Some of the concepts related to Survival Modeling are

Survival Function: The probability of surviving beyond time *t*. It is normally written S(t).
Hazard Rate: The event rate at time *t*, given survival until *t*. This is also called the hazard or failure rate.
Censoring: When the event time for a case is only partly observed, it is called censoring. The most common form is right censoring, where the event has not yet occurred when the observation period ends, so we only know that the survival time exceeds the observed time.
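
As a small illustration of how these concepts look in R, the sketch below encodes three rows from the sample data used in this post (one customer who attrited at 671 days and two still active at 731 days) as a Surv object from the survival package. The variable names are chosen here only for illustration.

```r
library(survival)

# Days observed and event flag (1 = attrited, 0 = still active, i.e. right-censored)
days     <- c(671, 731, 731)
attrited <- c(1, 0, 0)

s <- Surv(days, attrited)
print(s)  # censored observations are printed with a trailing "+"

# Number of censored customers in this tiny sample
n_censored <- sum(s[, "status"] == 0)
```

The two customers at 731 days are not failures. They only tell us that their survival time is at least 731 days, which is exactly the information a survival model is built to use.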

More information on Survival Modeling and its concepts can be found in the references.

Survival Modeling in R

R has quite a few packages that help with Survival Modeling; this tutorial uses the survival package. These packages also ship with sample datasets for survival modeling.
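
For readers who want to explore before working with their own data, here is a minimal sketch that loads the survival package and looks at lung, one of its bundled sample datasets. The choice of lung and of the columns shown is only an example.

```r
library(survival)

# lung is one of the sample datasets shipped with the survival package
head(lung[, c("time", "status", "age", "sex")])

# Names of the datasets bundled with the package
sample_datasets <- data(package = "survival")$results[, "Item"]
print(sample_datasets)
```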


Business Scenario and Data Preparation

It is a known fact that acquiring a new customer is much more difficult than retaining an existing one. Organizations incur substantial cost in acquiring new customers and make money over the customer life cycle, so understanding the customer retention period is critical. This helps the organization acquire and promote the right customers.

Scenario: For a wealth management client, we wanted to estimate the time until a customer closes an investment relationship (attrition). At the time of acquisition, we know only basic demographic information about the customers. We tracked customers for 24 months and checked their attrition status. We want to understand whether Age and Gender impact their loyalty toward the wealth management client.

Survival analysis needs two variables: a response or event variable (whether the customer has attrited) and a time variable (the number of days until the customer attrited, or until tracking ended if they did not). In this example the attrition rate is low, so we created a biased sample (all attrited customers but only 5% of the non-attrited customers).
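
The two survival variables can be derived from account dates. The sketch below is one way to do it, assuming a raw table with a start date and a close date per customer. The column names start_date and close_date, the three rows, and the 731-day window are assumptions made for illustration; only GENDER, attrition and Time2Event are names taken from this post.

```r
# Hypothetical raw data: one row per customer, close_date is NA if still active
raw <- data.frame(
  id         = c(1, 2, 3),
  GENDER     = c("F", "M", "F"),
  start_date = as.Date(c("2009-07-12", "2009-06-24", "2009-03-08")),
  close_date = as.Date(c("2011-05-14", NA, "2012-01-10"))
)

follow_up <- 731  # assumed tracking window in days (about 24 months)

# Days between opening and closing the relationship (NA if never closed)
days_open <- as.numeric(raw$close_date - raw$start_date)

# Event flag: 1 only if the customer closed within the tracking window
raw$attrition <- as.integer(!is.na(days_open) & days_open <= follow_up)

# Time variable: days to attrition, or the full window for censored customers
raw$Time2Event <- ifelse(raw$attrition == 1, days_open, follow_up)

print(raw[, c("id", "GENDER", "attrition", "Time2Event")])
```

Customers who close after the window, or never close, are treated as censored at the end of the window rather than dropped.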

Model Build - Life Tables

One of the first steps in Survival Modeling is to analyze survival times. In this example, the survival time is the number of days a customer has been with the wealth management provider. The survival table and the survival plot are useful for this analysis.

Sample view of the data:

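library(survival)
setwd("\\Learn R\\Surv")
# "wealth" is a placeholder; the data frame name was lost from the original post
wealth <- read.csv("wealth_survival.csv",
                   stringsAsFactors = FALSE)
names(wealth)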
10259 F 1/1/1900 7/12/2009 1 671
10032 F 1/1/1900 7/5/2009 0 731
10061 F 1/1/1900 7/10/2009 0 731
10183 F 1/1/1900 7/7/2009 0 731
10617 M 1/1/1900 7/10/2009 0 731
10674 M 1/1/1900 6/24/2009 1 731
10739 F 1/1/1900 7/6/2009 0 731
10783 F 1/1/1900 3/8/2009 0 731
10886 M 1/1/1900 6/24/2009 0 731
10902 F 1/1/1900 7/5/2009 0 731
11000 F 1/1/1900 6/24/2009 0 731
11023 F 1/1/1900 3/8/2009 0 731
11032 F 1/1/1900 7/12/2009 0 731

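# "wealth" is a placeholder name for the data frame read from wealth_survival.csv
w.surv <- survfit(Surv(Time2Event, attrition) ~ 1, data = wealth, conf.type = "none")
sum.surv <- summary(w.surv)
# The original data.frame() call is truncated; the columns after Time are an assumed completion
surv.out.df <- data.frame(Time = sum.surv$time, At.Risk = sum.surv$n.risk,
                          Events = sum.surv$n.event, Survival = sum.surv$surv)
write.csv(surv.out.df, file = "surv.csv")
plot(w.surv,
     xlab = "Days since start",
     ylab = "Survival Rate",
     main = "Survival Rate at different time point")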

The survival curve gives the survival probability (or rate) at any given point in time. It starts at 100% and goes down over time.

S(t) = Number surviving beyond time t / Total sample size

This simple ratio holds as long as no customer is censored before time t, which is the case in this example, where customers who did not attrite were tracked for the full 24 months. With earlier censoring, the Kaplan-Meier estimate computed by survfit adjusts the number at risk at each event time.

For the current example, S(t) is the retention rate (1 - cumulative attrition rate) at any point in time:

S(t) = Number of customers still with the wealth provider at time t / Number of customers tracked



In the survival plot, at 213 days, 86.36% of customers have survived (that is, are still with the wealth provider). We started with 1957 customers, and 213 days after acquisition 1690 of them are still with the wealth provider, so the survival rate at 213 days is 1690/1957 = 86.36%.
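
The arithmetic can be checked against survfit on synthetic data. The sketch below builds a made-up sample with the same totals (1957 customers, 267 of whom leave by day 213) and confirms that the estimated curve matches the simple ratio. The individual event days are invented, and only the totals come from the example above.

```r
library(survival)

n_total  <- 1957
n_events <- 267   # customers who left by day 213

# Synthetic event days, all within the first 213 days; everyone else is censored at 731
time   <- c(rep(seq_len(213), length.out = n_events),
            rep(731, n_total - n_events))
status <- c(rep(1, n_events), rep(0, n_total - n_events))

fit  <- survfit(Surv(time, status) ~ 1)
s213 <- summary(fit, times = 213)$surv

round(100 * s213, 2)                 # 86.36
round(100 * 1690 / 1957, 2)          # 86.36, the simple ratio
```

Because nobody is censored before day 213, the Kaplan-Meier product collapses to 1690/1957.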

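The cumulative failure function F(t) is the cumulative distribution function of the time to event and gives the probability of failure by time t. It is sometimes loosely called the hazard function, but it is not the same as the hazard rate defined earlier, which is the event rate at time t among those who have survived until t.

F(t) = P(T <= t) = Probability of failure by time t
F(t) = 1 - S(t) = 1 - Probability of surviving beyond time t

In the current example, F(t) is the cumulative attrition at time t, counted from the starting point of the study.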

Hazard Rate


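In this example, by 213 days, 267 customers have left the wealth provider out of the 1957 we started with. So the cumulative attrition rate at 213 days is 267/1957 = 13.64%, which is 1 minus the survival rate of 86.36%. Note that this is the cumulative failure probability F(t). The hazard rate in the strict sense is the event rate at time t among the customers who are still active at t.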
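
To make the difference between cumulative attrition and the hazard rate concrete, here is a small sketch on made-up data for 10 customers. It tabulates, at each distinct time, the survival rate, the cumulative attrition 1 - S(t), and a discrete hazard computed as events divided by customers still at risk. The data and the column names of the table are illustrative assumptions.

```r
library(survival)

# Toy data (made up): status 1 = attrited, 0 = still active at 731 days
time   <- c(100, 100, 250, 400, 400, 400, 731, 731, 731, 731)
status <- c(1,   1,   1,   1,   1,   1,   0,   0,   0,   0)

fit <- survfit(Surv(time, status) ~ 1)

tab <- data.frame(
  time          = fit$time,
  at_risk       = fit$n.risk,
  events        = fit$n.event,
  survival      = fit$surv,
  cum_attrition = 1 - fit$surv,              # F(t), relative to the starting sample
  hazard        = fit$n.event / fit$n.risk   # relative to those still at risk
)
print(tab)
```

At day 250 one customer leaves: cumulative attrition rises to 3 out of 10, while the hazard at that time is 1 out of the 8 customers still at risk.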

We now have the survival rate and the cumulative attrition rate for the whole sample. Next we need to understand how they differ between Male and Female customers, which will help answer whether Gender impacts them.

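# "wealth" is a placeholder name for the data frame read from wealth_survival.csv
w.surv.gender <- survfit(Surv(Time2Event, attrition) ~ GENDER, data = wealth, conf.type = "none")
plot(w.surv.gender,
     xlab = "Days since start",
     ylab = "Survival Rate",
     main = "Survival Rate by Gender",
     col = c("red", "green"), lwd = 2)  # strata are ordered F, M, matching the legend
legend(30, .5, c("Female", "Male"), lwd = 2, col = c("red", "green"))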

The survival tables for Male and Female are as follows.


The survival curves show that Male and Female customers have different survival rates across the period.

The survival curve can be estimated with either the "kaplan-meier" or the "fleming-harrington" method. The survfit function of the survival package has a type parameter, and by giving it the appropriate value one can get KM or FH estimated curves (newer versions of the package express the same choice through the stype and ctype arguments).
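
A minimal sketch of the two estimators on made-up data is shown below. It assumes a survival package version of 3.0 or later, where the estimator is selected through the stype argument of survfit (stype = 1 for Kaplan-Meier, stype = 2 for the exponential of the negative cumulative hazard, which is the Fleming-Harrington estimate). Older releases select the same thing with type = "fleming-harrington".

```r
library(survival)

# Toy data (made up): status 1 = attrited, 0 = still active at 731 days
time   <- c(100, 100, 250, 400, 400, 400, 731, 731, 731, 731)
status <- c(1,   1,   1,   1,   1,   1,   0,   0,   0,   0)

km <- survfit(Surv(time, status) ~ 1, stype = 1, ctype = 1)  # Kaplan-Meier
fh <- survfit(Surv(time, status) ~ 1, stype = 2, ctype = 1)  # Fleming-Harrington

comparison <- data.frame(time = km$time, KM = km$surv, FH = fh$surv)
print(comparison)
```

The FH curve is never below the KM curve, and the two are usually close when the sample is large relative to the number of events at each time point.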

The log-rank test can be used to test the equality of survival curves across strata. In this example, we consider Gender as the strata variable.

The survdiff function of the survival package runs the log-rank test for comparing two or more survival curves. It is a non-parametric test.

Null Hypothesis (H0): There is no difference between the two survival curves (Male and Female).

# "wealth" is a placeholder name for the data frame read from wealth_survival.csv
gender.surv <- survdiff(Surv(Time2Event, attrition) ~ GENDER, data = wealth)

The p-value of the log-rank test is 0.0676, so at the 10% significance level (90% confidence) we reject the null hypothesis of no difference between the survival curves. At the more common 5% level, the difference would not be significant.
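
Since the wealth data is not included with this post, the sketch below runs the same kind of test on the lung dataset bundled with the survival package, with sex standing in for GENDER. It also shows one way to get the p-value from the chi-square statistic that survdiff returns. The dataset choice and the names lr, p_value and alpha are assumptions for illustration.

```r
library(survival)

# lung is a stand-in dataset; sex plays the role of GENDER in this post
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
print(lr)

# The log-rank statistic is chi-square with (number of groups - 1) degrees of freedom
df      <- length(lr$n) - 1
p_value <- 1 - pchisq(lr$chisq, df)

alpha       <- 0.10            # the 90% confidence level used in this post
reject_null <- p_value < alpha
print(c(p_value = p_value, reject_null = reject_null))
```

Comparing the p-value against a threshold chosen before looking at the data keeps the decision rule explicit.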

To find the effect of multiple variables on the survival curve, we can use Cox Proportional Hazards Regression, which is a semi-parametric regression model. Cox regression modelling will be the topic of the next blog.



17 thoughts on “Survival Modeling Tutorial using R - Part 1”

  1. Hi Ram
    Many thanks for the example and explanation, it was really useful. Would it be possible for you to forward me the dataset "wealth_survival.csv"? I would really like to run the model on my computer.


  2. Hi Ram
    U R work is appreciated well.
    I too need the dataset "wealth_survival.csv". Could you please share the file.

    Thanks & Regs

    Jayanthi Murugesan

  3. I want to predict retirement for the next 2 years. I have age and time. Could I find the baseline hazard and take the exp of the coefficients to find individual hazard rates?

  4. Hi Ram,

    I have read your blog with interest particularly your articles on survival analysis or time-to-event modeling. I would like to run your tutorial in RStudio but couldn't locate the survival_wealth.csv data on the blog. Would you mind emailing the data set? I am on the online experimentation team here at Nordstrom and am researching time-to-event analysis as a way to model merchandise returns.

    Thanks so much,

