This vignette provides a short reference for the estimation and interpretation of cumulative effects using pammtools
. For a more detailed overview see Bender and Scheipl (2018). In the first section, a short introduction to cumulative effects is provided. In the second section, we present a worked example on a real data set, including necessary data transformation, model estimation and visualization.
Cumulative effects can be thought of as an extension to the modeling of time-constant effects of time-dependent covariates (TDCs; see vignette on time-dependent covariates). Standard modeling of TDCs assumes that only the current (or one lagged) value of the covariate can affect the hazard at time \(t\), i.e., \[\lambda(t|z(t)) = \lambda_0(t)\exp(\beta z(t)).\]
Cumulative effects on the other hand allow the hazard at time \(t\) to depend on multiple past observations of the covariate, such that the effect at time \(t\) is the sum of all weighted effects of past observations (partial effects). to make this more concrete, consider the following quantities:
Given these definitions, one possible specification of a model with a cumulative effect was suggested by Sylvestre and Abrahamowicz (2009) and is given below:
\[ \lambda(t|\mathbf{z}) = \lambda_0(t)\exp\left(\sum_{t_z \leq t} h(t-t_z)z(t_z)\right) \]
Here, past observations are weighted by a (potentially) non-linear function \(h\) of the latency (\(t-t_z\)), i.e., the time that has passed since covariate \(z(t_z)\) was observed, relative to the time of interest \(t\). The cumulative effect is then the sum of all partial effects \(h(t-t_z)z(t_z)\).
The above model also makes some simplifying assumptions, e.g.:
A more flexible cumulative effect can be defined as follows:
\[ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)}h(t, t_z, z(t_z))\mathrm{d}t_z, \]
with
A visual representation of exemplary lag-lead windows is depicted below:
p_ll1 <- gg_laglead(0:10, 0:9, ll_fun = function(t, tz) t >= tz)
my_ll_fun <- function(t, tz, tlag = 2, tlead = 5) {
t >= tz + tlag & t < tz + tlag + tlead
}
p_ll2 <- gg_laglead(0:10, 0:9, ll_fun = function(t, tz) t == tz)
The left panel represents the minimal requirement \(\mathcal{T}(t)=\{t_{z}: t_z \leq t\}\). For example, the observation of the TDC at time \(t_z=4\), starts to contribute to the cumulative effect in interval \((4,5]\) and continues to contribute until the last interval \((9,10]\), although with potentially different weights, as, in general, \(h(t=5, t_z = 4, z(4))\) can be different to \(h(t=10, t_z=4, z(4))\).
The right panel illustrates that a “standard” model for time-to-event data with a constant effect follows as a special case of the general definition of the cumulative effect with lag-lead window \(\mathcal{T}(t)=\{t_{z}: t_z = t\}\) and \(h(t, t_z, z(t_z)) \equiv \beta z(t)\)
For an illustration of the functionality that pammtools
provides to work with cumulative effects, we use (a sample) of the data analyzed in Heyard et al. (2018) and provided in the package TBFmultinomial
(note that their analyses had a totally different goal, including modeling of competing risks; here we simply use the data for illustrative purposes).
In the following analysis, we will model the (cause-specific) hazard for the event extubation
, considering all competing events as censoring events. Covariates include gender
, type
(admission type), SAPSadmission
(SAPS score at admission) and the time-dependent covariate SOFA
score (lower scores indicate better health).
To use the data for our purposes, we first perform some preprocessing, which produces two data sets:
event_df
: Data in “standard” format (one row per subject), that only includes time-constant covariatestdc_df
: Data containing information on time-dependent covariates (here SOFA
) and exposure time \(t_z\)
data(VAP_data, package = "TBFmultinomial")
# create unique IDs
VAP_data <- VAP_data %>%
mutate(tmp_id = paste(ID, day, sep = ".")) %>%
group_by(ID) %>%
mutate(csdup = cumsum(duplicated(tmp_id))) %>%
ungroup() %>%
mutate(ID = ifelse(csdup > 0, ID + 1000, ID))
# assume constant SOFA score between updates
VAP_complete <- VAP_data %>%
group_by(ID) %>%
mutate(time = max(day)) %>%
ungroup() %>%
complete(ID, day = full_seq(day, 1)) %>%
fill(gender, type, SAPSadmission, SOFA, outcome, time, .direction = "down") %>%
filter(day <= time)
event_df <- VAP_complete %>%
select(ID, gender, type, SAPSadmission, outcome, time) %>%
group_by(ID) %>%
slice(n()) %>%
mutate(outcome = 1 * (outcome == "extubated")) %>%
ungroup()
tdc_df <- VAP_complete %>%
select(ID, day, SOFA) %>%
mutate(day = day - 1) %>% # assume that SOFA available at the beginning of the day
filter(day <= 49)
An overview of the preprocessed data is given below:
## # A tibble: 6 x 6
## ID gender type SAPSadmission outcome time
## <dbl> <fct> <fct> <int> <dbl> <int>
## 1 2 0 Medical 50 0 5
## 2 3 1 Medical 34 1 9
## 3 4 0 Medical 45 0 7
## 4 5 1 Medical 21 0 15
## 5 6 1 Medical 50 0 6
## 6 8 0 Medical 54 1 3
## # A tibble: 6 x 3
## ID day SOFA
## <dbl> <dbl> <int>
## 1 2 0 9
## 2 2 1 8
## 3 2 2 9
## 4 2 3 9
## 5 2 4 8
## 6 3 0 10
For illustration, we fit a DLNM (Gasparrini et al. 2017) with cumulative effect of the SOFA score (\(z\)) as defined below:
\[ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)}h(t-t_z, z(t_z))\mathrm{d}t_z, \]
with \(\mathcal{T}(t) = \{t_z: t_z \leq t \leq t_z + 5\}\), which means that only the SOFA scores observed within the last five days can affect the hazard at time \(t\).
Remember that the specification of the partial effect \(h(t,t_z, z(t_z))\) as well as the definition of the lag-lead window are important parts of the analysis and need to be considered at the beginning of the analysis as this information goes into the code for the data transformation:
ped <- as_ped(
list(event_df, tdc_df),
Surv(time, outcome) ~ . + cumulative(latency(day), SOFA, tz_var = "day",
ll_fun = function(t, tz) t >= tz & t <= tz + 5),
cut = 0:49, # administrative censoring at t = 49
id = "ID")
ped$day_latency <- ped$day_latency * ped$LL
Above, we
cumulative
to inform as_ped
to perform data preprocessing in order to fit cumulative effectsday
(\(t_z\)) within the latency
function to calculate \(t - t_z\)
ll_fun
argument to match the desired specification of \(\mathcal{T}(t)\)
Note that the data now contains 3 matrix columns (day_latency
, SOFA
and LL
; the latter stores the lag-lead matrix \(\mathcal{T}(t)\)):
str(ped, 1)
## Classes 'fped', 'ped' and 'data.frame': 1514 obs. of 12 variables:
## $ ID : num 2 2 2 2 2 3 3 3 3 3 ...
## $ tstart : num 0 1 2 3 4 0 1 2 3 4 ...
## $ tend : int 1 2 3 4 5 1 2 3 4 5 ...
## $ SAPSadmission: int 50 50 50 50 50 34 34 34 34 34 ...
## $ day_latency : num [1:1514, 1:50] 0 1 2 3 4 0 1 2 3 4 ...
## $ SOFA : num [1:1514, 1:50] 9 9 9 9 9 10 10 10 10 10 ...
## ..- attr(*, "dimnames")=List of 2
...
After successful data transformation with as_ped
, the model can be estimated directly using mgcv::gam
. Including matrix columns into the model specification will inform gam
to estimate cumulative effects. In our case the the following call estimates the DLNM with a cumulative effect (of the SOFA score) as specified above:
mod <- gam(ped_status ~ s(tend) + type + gender + SAPSadmission +
te(day_latency, SOFA, by = LL),
method = "REML", offset = offset, family = poisson(), data = ped)
summary(mod)
##
## Family: poisson
## Link function: log
##
## Formula:
## ped_status ~ s(tend) + type + gender + SAPSadmission + te(day_latency,
## SOFA, by = LL)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.423e+01 4.304e+00 -3.307 0.000943 ***
## typeSurgical -6.229e-01 2.751e-01 -2.264 0.023551 *
## gender1 -2.211e-01 2.354e-01 -0.939 0.347707
## SAPSadmission -3.738e-04 7.893e-03 -0.047 0.962226
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(tend) 1.000 1.000 2.269 0.132
## te(day_latency,SOFA):LL 5.395 6.046 65.433 3.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.121 Deviance explained = 22.5%
## -REML = 258.2 Scale est. = 1 n = 1514
Interpretation of the estimated cumulative effects is best performed by visualization of either the partial effects or the cumulative effect. pammtools
provides a couple of convenience functions that facilitate this process:
gg_tensor
: This visualizes 2D effect surfaces as heat maps/contour plots and is based on the output of mgcv::plot.gam
gg_partial
and gg_partial_ll
: The former visualizes the estimated partial effect, e.g., \(\hat h(t-t_z, z(t_z))\) for each combination of the (specified) input, the latter visualizes the partial effects as a heat-map within the specified lag-lead window (this makes it easier to see which partial effects actually contribute to the cumulative effect in a specific interval)gg_slice
: Plots 1D slices of 2D surfaces and can be used more generally to plot covariate effects conditional on pre-specified values of other covariates.gg_cumu_eff
: Either plots the estimated cumulative effect \(\hat g(\mathbf{z}, t)\) for a given \(\mathbf{z}\) or the (log-) hazard ratio \(\log\left(\frac{\hat g(\mathbf{z}_1, t)}{\hat g(\mathbf{z}_2, t)}\right)\)
Note that for each of these plot functions there are respective functions to retrieve the data used for plotting such that the plots can be generated and customized manually (see especially examples in make_newdata
).
The code below visualizes the estimated 2D partial effect \(\hat h(t-t_z, z(t_z))\) for each combination of latency and SOFA score (including confidence intervals). The gray areas indicate combinations that did not occur in the data.
gg_tensor(mod, ci = TRUE) + ylab("SOFA") + xlab("latency (day)") +
theme(legend.position = "bottom")
This illustrates that a low SOFA score substantially increases the log-hazard if it was observed recently (latency \(<\) 2 days), while partial effects of the SOFA score observed further in the past (latency \(>\) 3 days) go towards zero (the event of interest is “extubation”, therefore increased (log-)hazards imply increased probabilities of extubation). Note that for this graphic, the function \(\hat h(t-t_z, z(t_z))\) is evaluated at values of \(t-t_z\) (day_latency
) that are not present in the data.
Alternatively, the function gg_partial
can be used to produce similar visualizations and allows more control over the inputs. For example, the following code produces the partial effect surface plot evaluated only at latencies 0 through 5 and calculates the partial effects relative to a patient with SOFA score 10 (everything else being equal):
gg_partial(ped, mod, term = "SOFA", day_latency = 0:5,
SOFA = seq_range(SOFA, n = 20), LL = c(1), reference = list(SOFA = 10))
See also gg_partial_ll
that shows the lag-lead window and partial effects that contribute to the cumulative effect in each interval. Due to the definition of the partial effect, however, these are constant in this case, as no time-variation was specified.
p_slice_latency <- gg_slice(ped, mod, term = "SOFA",
day_latency = unique(day_latency), SOFA = c(0, 5, 10))
p_slice_sofa <- gg_slice(ped, mod, term = "SOFA",
day_latency = c(0, 1, 4), SOFA = unique(SOFA))
gridExtra::grid.arrange(p_slice_latency, p_slice_sofa, nrow = 1)
These plots contain essentially the same information as the 2D surface, but focus on isolating the effects of the individual variables. This can be especially useful for three-dimensional partial effects which are hard to visualize otherwise.
Since it can be difficult to assess how the estimated partial effects actually affect estimated hazard rates, pammtools
provides additional functions to visualize and compare estimated cumulative effects on the level of the (log-)hazard rates for given TDCs \(\mathbf{z}\).
Below, we visualize the cumulative effect in each interval of the follow-up. Since the effect depends on a TDC \(\mathbf{z}\), its values must be provided in the call to gg_cumu_eff
as z1
argument (either length 1 for time-constant values or length equal to maximum number of times \(z(t_z)\) was observed). If z2
is additionally specified, the log-hazard ratio is calculated as \(\log\left(\frac{\hat g(\mathbf{z}_1, t)}{\hat g(\mathbf{z}_2, t)}\right)\). Below, two such log-hazard ratios are visualized for two different comparisons of SOFA score profiles:
p_cumu1 <- gg_cumu_eff(ped, mod, term = "SOFA",
z1 = seq(20, 0, length.out = 50), z2 = 10) +
geom_point(size = rel(.5)) + geom_vline(xintercept = c(5, 25), lty = 2)
p_cumu2 <- gg_cumu_eff(ped, mod, term = "SOFA",
z1 = seq(0, 20, length.out = 50),
z2 = c(rep(0, 10), rep(20, 20), rep(5, 20))) +
geom_point(size = rel(.5)) + geom_vline(xintercept = c(10, 30), lty = 2)
gridExtra::grid.arrange(p_cumu1, p_cumu2, nrow = 1)
Bender, Andreas, and Fabian Scheipl. 2018. “Pammtools: Piece-Wise Exponential Additive Mixed Modeling Tools.” arXiv:1806.01042 [Stat]. http://arxiv.org/abs/1806.01042.
Gasparrini, Antonio, Fabian Scheipl, Ben Armstrong, and Michael G. Kenward. 2017. “A Penalized Framework for Distributed Lag Non-Linear Models.” Biometrics 73: 938–48. https://doi.org/10.1111/biom.12645.
Heyard, Rachel, Jean-François Timsit, Wafa Ibn Essaied, and Leonhard Held. 2018. “Dynamic Clinical Prediction Models for Discrete Time-to-Event Data with Competing Risks—A Case Study on the OUTCOMEREA Database.” Biometrical Journal 0 (0). https://doi.org/10.1002/bimj.201700259.
Sylvestre, Marie-Pierre, and Michal Abrahamowicz. 2009. “Flexible Modeling of the Cumulative Effects of Time-Dependent Exposures on the Hazard.” Statistics in Medicine 28 (27): 3437–53. https://doi.org/10.1002/sim.3701.