vignettes/recurrent-events.Rmd
recurrent-events.Rmd
We illustrate application of PAMMs in the analysis of the effect of HIV exposure on the time to staphylococcus aureaus infection in children, with possible recurrences. The children are anonymized and random id
s were generated. The following packages are used in this analysis:
library(mgcv)
library(pammtools)
library(dplyr) #data wrangling
library(ggplot2) #visualization
theme_set(theme_bw())
The data set (staph
), is contained within the pammtools
package and includes 374 observations from 137 children (from the Drakenstein child health study) with a maximum of 6 recurrences. The staph
data is in longitudinal format reflecting the recurrences for children over different rows. Here
t.start
and t.stop
indicate the entry and exit time into the risk set for the respective recurrencesevent
indicates whether the \(k\)-th recurrence was observed (1 = yes, 0 = censored for the \(k\)-th recurrence)enum
is the event number \(k\)
HIVexposure
indicates whether the mother of the child was HIV positive (1 = yes, 0 = no)## # A tibble: 6 × 6
## id t.start t.stop event enum hiv
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 16 1 1 0
## 2 1 16 324 1 2 0
## 3 1 324 365 0 3 0
## 4 2 0 59 1 1 1
## 5 2 59 157 1 2 1
## 6 2 157 227 1 3 1
The data for the first two children are
## # A tibble: 7 × 6
## id t.start t.stop event enum hiv
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 16 1 1 0
## 2 1 16 324 1 2 0
## 3 1 324 365 0 3 0
## 4 2 0 59 1 1 1
## 5 2 59 157 1 2 1
## 6 2 157 227 1 3 1
## 7 2 227 368 0 4 1
In order to apply PAMMs to recurrent events, we first transform the data to the piece-wise exponential data (PED) format (see here for details). We use the as_ped()
function in pammtools
. For the data transformation in the recurrent events context, we first need to decide whether analysis should be performed on the calendar- or gap-time scale. This is controlled by the timescale
argument (see below). For these data, we use the gap-time.
The individual inputs for the data transformation are given as follows:
formula
: specifies the Surv
object on the left hand side which contains information about the risk set entry and exit times as well as the event indicator; and the variables that should be retained in the data set after data transformation. Note that the variables id
and enum
will be retained in the data without specification.id
: specifies the variable in the data set the indicates individual subjectsdata
: the data to be transformedtransition
: the variable that indicates transitions from one state to another (here state transitions are transitions from event number \(k-1\) to \(k\))timescale
: the time scale of the output data (defaults to gap time)max_time
: The maximum time considered. All observations with \(t>max\_time\) will be set to max_time
and their event indicator set to \(0\). Here we restrict the follow-up to 366
days, as it marks one year under observation and few children were under observation beyond that time.cut
: This argument is unspecified here, but could be used to control the time points at which the follow-up is partitioned. If unspecified, all unique event times are used as cut points.ped <- as_ped(
formula = Surv(t.start,t.stop,event) ~ hiv,
id = "id",
data = staph,
transition = "enum",
timescale = "gap",
max_time = 366)
The resulting data for the first two infants is indicated below (we show the first and last observation of each infant for each event number they were at risk):
## # A tibble: 14 × 8
## # Groups: id, enum [7]
## id tstart tend interval offset ped_status hiv enum
## * <fct> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 (0,1] 0 0 0 1
## 2 1 15 16 (15,16] 0 1 0 1
## 3 1 0 1 (0,1] 0 0 0 2
## 4 1 295 308 (295,308] 2.56 1 0 2
## 5 1 0 1 (0,1] 0 0 0 3
## 6 1 40 41 (40,41] 0 0 0 3
## 7 2 0 1 (0,1] 0 0 1 1
## 8 2 57 59 (57,59] 0.693 1 1 1
## 9 2 0 1 (0,1] 0 0 1 2
## 10 2 97 98 (97,98] 0 1 1 2
## 11 2 0 1 (0,1] 0 0 1 3
## 12 2 69 70 (69,70] 0 1 1 3
## 13 2 0 1 (0,1] 0 0 1 4
## 14 2 140 141 (140,141] 0 0 1 4
We first model the baseline hazards over time. Biologically, the infection incidence in gap time may be different for the first event compared with the recurrences. However, estimation of the baseline hazard for each of the event numbers is not useful/feasible since only a few subjects experienced 3 or more events. Therefore, we will create a new variable to indicate whether the event a child is at risk for is the first event or a recurrence. Note, however, that we do this after PED data transformation and use the full information to create the PED data. We only use enum_strata
for stratification when estimating the baseline hazard.
Using this data, we fit the following model
\[ \lambda(t,k|z_i) = \exp(\beta_{0k} + f_{0,k}(t) + z_i), \]
where \(z_i\sim N(0, \sigma_z^2)\) are Gaussian, child specific random effects that account for child-specific frailty and \(\beta_{0k}\) and \(f_{0k}\) are the constant (intercept) and non-linear parts of the log-baseline hazard of the \(k\)-th event. Given the data transformation, this is equivalent to fitting a stratified PAM (see here for details), where stratification is done w.r.t. to enum_strata
.
Details on the model specification in R
are given below. Note that we use the pamm
function to fit the model, which is a wrapper to mgcv::gam
or mgcv::bam
, but direct modeling of the data using any suitable package/function would also be possible. The other arguments are passed to the respective fitting function.
In the formula argument,
s(tend, by = enum_strata)
indicates the non-linear smooth functions, estimated via penalized splines (with thin-plate spline basis functions). By using by=enum_strata
inside s()
and including enum_strata
as a fixed effect in the model, we are modeling stratified smooth functions (i.e., stratified baseline hazards) for first and recurrent events respectively, ands(id, bs = "re")
indicates a random effect (frailty) for each child.pamm0 <- pamm(
formula = ped_status ~ enum_strata + s(tend, by = enum_strata) + s(id, bs = "re"),
data = ped,
engine = "bam",
method = "fREML",
discrete = TRUE)
As usual for mgcv
objects, the resulting model summary output shown below is separated into two parts, one for the “parametric coefficients” and one for the “smooth terms”. In case of smooth terms, the estimated degrees of freedom (edf) gives us an idea of how “wiggly” the respective smooth functions are, and the p-values test whether overall these are different from a flat line (Wood 2013). For the random effects terms, we report their estimated variances (gam.vcomp
) and p-values (summary output). From the output, we see that the random effects are statistically insignificant \((p=0.249)\).
summary(pamm0)
##
## Family: poisson
## Link function: log
##
## Formula:
## ped_status ~ enum_strata + s(tend, by = enum_strata) + s(id,
## bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9287 0.1213 -40.62 < 2e-16 ***
## enum_stratarecurrent -1.0991 0.1857 -5.92 3.23e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(tend):enum_stratafirst 1.995 2.487 35.69 < 2e-16 ***
## s(tend):enum_stratarecurrent 6.101 7.240 33.49 3.7e-05 ***
## s(id) 14.052 136.000 15.17 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.0265 Deviance explained = -0.97%
## fREML = 18453 Scale est. = 1 n = 18889
The standard deviation of the random effects is also relatively small and can be extracted using the gam.vcomp
command:
gam.vcomp(pamm0)
##
## Standard deviations and 0.95 confidence intervals:
##
## std.dev lower upper
## s(tend):enum_stratafirst 0.0005741588 8.308116e-05 0.003967907
## s(tend):enum_stratarecurrent 0.0062687807 2.563565e-03 0.015329284
## s(id) 0.2565797067 4.015591e-02 1.639438617
##
## Rank: 3/3
The results indicate that random effects are not required in this model. We therefore omit the random effects in the following.
##
## Family: poisson
## Link function: log
##
## Formula:
## ped_status ~ enum_strata + s(tend, by = enum_strata)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0624 0.1231 -41.124 < 2e-16 ***
## enum_stratarecurrent -0.8860 0.1801 -4.919 8.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(tend):enum_stratafirst 2.099 2.616 41.71 < 2e-16 ***
## s(tend):enum_stratarecurrent 6.099 7.238 33.07 3.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.0276 Deviance explained = -2.34%
## fREML = 18453 Scale est. = 1 n = 18889
To obtain and visualize any quantity based on the fitted model, it is easiest to create a suitable data set and calculate the quantity of interest. pammtools
provides the respective convenience functions.
Below we illustrate the calculation and visualization of the baseline hazard over time, where we use
make_newdata
to create a data set with one row for each unique event time used when fitting the model, stratified by event stratum. make_newdata
constructs a combination of the variables specified (here unique values of tend
for each value of enum_strata
, similar to expand.grid
, and sets all other variables to mean and modus values, if applicable). It is also aware of the specific PED structure.add_hazard
function to augment the created data set with the predicted hazard and respective confidence intervals.Once the data is created and augmented with the predicted hazard, we can use standard visualization function to create the graphics:
newdata <- ped %>%
make_newdata(
tend = unique(tend),
enum_strata = unique(enum_strata)) %>%
add_hazard(pamm0, type = "response")
## Warning in warn_about_new_time_points.pamm(object, newdata, time_var): Time
## points/intervals in new data not equivalent to time points/intervals during
## model fit. Setting intervals to values not used for original fitcan invalidate
## the PEM assumption and yield incorrect predictions.
ggplot(newdata, aes(x = tend/(365.25/12), y = hazard*365.25)) +
geom_line() +
geom_ribbon(aes(ymin = ci_lower*365.25, ymax = ci_upper*365.25), alpha = .3) +
ylab(expression(hat(h)(t))) + xlab("Time (months)") +
scale_x_continuous(limits = c(0, 12.5),breaks=seq(0,12,2),expand=c(0,0)) +
facet_wrap(~enum_strata)
We can see, that the hazard for the first event is high in the beginning and declines over time, as most children will have had the first infection after one year of life. Consequently, the hazard for recurrence is low in the beginning and increases over time before it flattens out after about two months.
Calculation of survival probabilities works equivalently, except that we use the add_surv_prob
function instead of add_hazard
. Note that calculating the survival probability involves calculating a cumulative hazard. Thus, when there is a grouping structure in the data (here the different strata), we have to group by this variable before we calculate cumulative hazards or survival probabilities. Note also, that we use geom_surv
in order to force the function to start at a survival probability of 1.
newdata <-newdata %>%
group_by(enum_strata) %>%
add_surv_prob(pamm0)
## Warning in warn_about_new_time_points.pamm(object, newdata, time_var): Time
## points/intervals in new data not equivalent to time points/intervals during
## model fit. Setting intervals to values not used for original fitcan invalidate
## the PEM assumption and yield incorrect predictions.
ggplot(newdata, aes(x = tend/(365.25/12), y = surv_prob)) +
geom_surv() +
geom_ribbon(aes(ymin = surv_lower, ymax = surv_upper), alpha = .3) +
ylab(expression(hat(S)(t))) + xlab("Time (months)") +
scale_x_continuous(limits = c(0, 12.1)) +
facet_wrap(~enum_strata)
The HIV exposure variable indicates whether children were HIV exposed and uninfected (HEU) by being born to HIV positive mothers or HIV uninfected (HU). We will fit this model in the PAMM framework to evaluate the effect of HIV exposure while assuming proportional hazards. This means that the effects of HIV shift the log-hazard by some constant over time. We fit two models. In the first model, we assume that the hazard ratio is the same for first and recurrent infections. In the second model, we allow different hazard ratios for first and recurrent infections, but we still assume both these hazard ratios are proportional over time.
pam1 <- pamm(
formula = ped_status ~ enum_strata + s(tend, by = enum_strata) + hiv,
data = ped,
engine = "bam",
method = "fREML",
discrete = TRUE)
summary(pam1)
##
## Family: poisson
## Link function: log
##
## Formula:
## ped_status ~ enum_strata + s(tend, by = enum_strata) + hiv
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.1351 0.1303 -39.416 < 2e-16 ***
## enum_stratarecurrent -0.8779 0.1801 -4.874 1.09e-06 ***
## hiv 0.2681 0.1434 1.870 0.0615 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(tend):enum_stratafirst 2.121 2.643 41.78 < 2e-16 ***
## s(tend):enum_stratarecurrent 6.084 7.224 33.22 3.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.0279 Deviance explained = -2.18%
## fREML = 18453 Scale est. = 1 n = 18889
We can see that estimated effect of HIV exposure in terms of the hazard ratio is \(HR=\exp(0.2681)=1.31\;(p=0.062)\).
pam2 <- pamm(
formula = ped_status ~ enum_strata + s(tend, by = enum_strata) + hiv:enum_strata,
data = ped,
engine = "bam",
method = "fREML",
discrete = TRUE)
summary(pam2)
##
## Family: poisson
## Link function: log
##
## Formula:
## ped_status ~ enum_strata + s(tend, by = enum_strata) + hiv:enum_strata
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.05753 0.13330 -37.940 < 2e-16 ***
## enum_stratarecurrent -1.03583 0.19753 -5.244 1.57e-07 ***
## enum_stratafirst:hiv -0.02007 0.21300 -0.094 0.92493
## enum_stratarecurrent:hiv 0.54272 0.19553 2.776 0.00551 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(tend):enum_stratafirst 2.098 2.615 41.71 < 2e-16 ***
## s(tend):enum_stratarecurrent 6.067 7.208 33.42 3.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.0274 Deviance explained = -2.01%
## fREML = 18451 Scale est. = 1 n = 18889
We can see that estimated effect of HIV exposure in terms of the hazard ratio is \(HR=\exp(-0.0201)=0.98\;(p=0.925)\) for the first infection and \(HR=\exp(0.5427)=1.72\;(p=0.006)\) for recurrent infections.
Wood, Simon N. 2013. “On P-Values for Smooth Components of an Extended Generalized Additive Model.” Biometrika 100 (1): 221–28. https://doi.org/10.1093/biomet/ass048.