Survival analysis entails a set of standard tasks beyond model estimation which need to be performed in most real world applications. These task include:
pammtools package provides many convenience functions to facilitate these tasks. The overall philosophy of the package is to provide functions that return the underlying data used for visualization in a tidy format (Wickham and others 2014), such that everybody can use familiar tools for further processing. In addition, some (high level) convenience functions for visualization are also included.
To illustrate the usual workflow and some of the convenience functions provided by
pammtools, we will use the
tumor in the following sections. For a more complete overview see Bender and Scheipl (2018).
Often one is interested in calculating the hazard, cumulative hazard or survival probability given specific values for some covariates (while other covariates are fixed at their mean values). For PAMMs, one usually also needs to create a data set where all intervals occur, in order to be able to calculate predicted values at different time points.
pammtools provides a flexible interface to create such data sets (
make_newdata) and the functions
to add respective predicted values (including confidence intervals) of the respective quantities. We illustrate the workflow using the
tumor data and the model given below:
In the following, we create a new data set that contains all intervals as well as mean values for all other covariates. Note that within
make_newdata you can specify the desired covariate values by using any function applicable to the data type of the respective column. The new data will contain one row for each combination of the two variables (similar to
Calling the respective functions, we can add the (predicted) hazard to the newly created data.
ped_df <- ped_df %>% add_hazard(pam) ped_df %>% select(interval, hazard, ci_lower, ci_upper) %>% head()
## interval hazard ci_lower ci_upper ## 1 (0,1] 0.0001955248 0.0001095045 0.0003491175 ## 2 (0,1] 0.0005049800 0.0003750227 0.0006799715 ## 3 (0,1] 0.0013042070 0.0008609207 0.0019757403 ## 4 (1,2] 0.0001952127 0.0001093736 0.0003484204 ## 5 (1,2] 0.0005041739 0.0003747135 0.0006783617 ## 6 (1,2] 0.0013021252 0.0008600158 0.0019715103
## interval hazard ci_lower ci_upper ## 973 (2595,2808] 6.569813e-05 2.690993e-05 0.0001603959 ## 974 (2595,2808] 1.696779e-04 7.946854e-05 0.0003622893 ## 975 (2595,2808] 4.382256e-04 1.914665e-04 0.0010030042 ## 976 (2808,3034] 5.205515e-05 1.613385e-05 0.0001679536 ## 977 (2808,3034] 1.344423e-04 4.594680e-05 0.0003933841 ## 978 (2808,3034] 3.472230e-04 1.127654e-04 0.0010691554
Which in turn can be conveniently plotted using respective functions, e.g.
ggplot(ped_df, aes(x = tend, group = age)) + geom_stephazard(aes(y = hazard, col = age)) + geom_stepribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = age), alpha = 0.2) + ylab(expression(hat(lambda)(t))) + xlab(expression(t))
The hazard ratio can be calculated by specifying the
# the code below gives hazard reatio of a 30 year old relative to a 58 year old (cp) # ped %>% # make_newdata(age = c(30)) %>% # add_hazard(pam, reference = list(age = c(58))) %>% # select(hazard, ci_lower, ci_upper) # -> 30 year old has about half the risk of a 58 year old
This might seem a little convoluted, but can be easily generalized in case of time-varying effects, time-dependent covariates, cumulative effects, etc.
Analogously, we can obtain the cumulative hazard (piece-wise linear function). Note that for the cumulative hazard and survival probability it is important to group the data accordingly, such that the cumulative hazard is cumulated for each value of the grouping variable. Otherwise it would be cumulated over the whole data set (which is usually not intended, except if the specified variable is a time-dependent covariate).
ped_df %>% group_by(age) %>% add_cumu_hazard(pam) %>% ggplot(aes(x = tend, y = cumu_hazard, ymin = cumu_lower, ymax = cumu_upper, group = age)) + geom_hazard(aes(col = age)) + geom_ribbon(aes(fill = age), alpha = 0.2) + ylab(expression(hat(Lambda)(t))) + xlab(expression(t))
ped_df %>% group_by(age) %>% add_surv_prob(pam) %>% ggplot(aes(x = tend, y = surv_prob, ymax = surv_lower, ymin = surv_upper, group = age)) + geom_line(aes(col = age)) + geom_ribbon(aes(fill = age), alpha = 0.2) + ylab(expression(hat(S)(t))) + xlab(expression(t))
This approach is very convenient, as it can be extended to arbitrary covariate specifications:
ped_df_sex <- ped %>% make_newdata(tend = unique(tend), age = seq_range(age, 3), complications = levels(complications)) %>% group_by(complications, age) %>% add_surv_prob(pam) ggplot(ped_df_sex, aes(x = tend, y = surv_prob, group = age)) + geom_ribbon(aes(ymin = surv_lower, ymax = surv_upper, fill = age), alpha = 0.3) + geom_surv(aes(col = age)) + ylab(expression(hat(S)(t))) + xlab(expression(t)) + facet_wrap(~complications, labeller = label_both)
Data sets for the visualization of the (non-linear) effect of a continuous covariates can be created in a similar fashion. For illustration, we add a non-linear age effect to the previous model (see also high level functions
mgcv::plot.gam like plots):
add_term we obtain:
term_charlson <- ped %>% make_newdata(age = seq_range(age, n = 25)) %>% add_term(pam, term = "age", reference = list(age = mean(.$age))) ggplot(term_charlson, aes(x = age, y = fit)) + geom_line() + geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2)
Similarly, we could use high level function
gg_slice to obtain the estimated effect of the age variable, relative to the mean age and age of 30, respectively:
Bender, Andreas, and Fabian Scheipl. 2018. “Pammtools: Piece-Wise Exponential Additive Mixed Modeling Tools.” arXiv:1806.01042 [Stat]. http://arxiv.org/abs/1806.01042.
Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.