library(dplyr)
library(tidyr)
library(pammtools)

In this vignette we provide details on transforming data into a format suitable to fit piece-wise exponential (additive) models (PAM). Three main cases need to be distinguished

  1. Data without time-dependent covariates

  2. Data with time-dependent covariates

  3. Data with time-dependent covariates that should be modeled as cumulative effects

Data without time-dependent covariates

In this simple case the data transformation is relatively straight forward and handled by the as_ped (as piece-wise exponential data) function. This function internally calls survival::survSplit, thus all arguments available for the survSplit function are also available to the as_ped function.

Using defaults

As an example data set we first consider Veterans’ Administration lung cancer study (Kalbfleisch and Prentice 1980) available from the survival package:

data("veteran", package = "survival")
veteran <- veteran %>%
  mutate(
    id    = row_number(),
    trt   = 1L * (trt == 2),
    prior = 1L * (prior != 0)) %>%
  select(id, everything())
head(veteran)
##   id trt celltype time status karno diagtime age prior
## 1  1   0 squamous   72      1    60        7  69     0
## 2  2   0 squamous  411      1    70        5  64     1
## 3  3   0 squamous  228      1    60        3  38     0
## 4  4   0 squamous  126      1    60        9  63     1
## 5  5   0 squamous  118      1    70       11  65     1
## 6  6   0 squamous   10      1    20        5  49     0

Each row contains information on the survival time (time), an event indicator (status) and the treatment information (6-MP vs. placebo) as the only covariate.

To transform the data into piece-wise exponential data (ped) format, we need to

  • define \(J+1\) interval break points \(t_{\min} = \kappa_0 < \kappa_1 < \cdots < \kappa_J = t_{\max}\)

  • create pseudo-observations for each interval \(j = 1,\ldots, J\) in which subject \(i, i = 1,\ldots,n\) was under risk.

Using the pammtools package this is easily achieved with the as_ped function as follows:

ped <- as_ped(Surv(time, status) ~ ., data = veteran, id = "id")
ped %>% filter(id %in% c(42, 85, 112)) %>% select(-diagtime, -prior)
##     id tstart tend interval    offset ped_status trt  celltype karno age
## 1   42      0    1    (0,1] 0.0000000          0   0 smallcell    50  72
## 2   42      1    2    (1,2] 0.0000000          0   0 smallcell    50  72
## 3   42      2    3    (2,3] 0.0000000          0   0 smallcell    50  72
## 4   42      3    4    (3,4] 0.0000000          0   0 smallcell    50  72
## 5   42      4    7    (4,7] 1.0986123          1   0 smallcell    50  72
## 6   85      0    1    (0,1] 0.0000000          1   1  squamous    50  35
## 7  112      0    1    (0,1] 0.0000000          0   1     adeno    60  62
## 8  112      1    2    (1,2] 0.0000000          0   1     adeno    60  62
## 9  112      2    3    (2,3] 0.0000000          0   1     adeno    60  62
## 10 112      3    4    (3,4] 0.0000000          0   1     adeno    60  62
## 11 112      4    7    (4,7] 1.0986123          0   1     adeno    60  62
## 12 112      7    8    (7,8] 0.0000000          0   1     adeno    60  62
## 13 112      8   10   (8,10] 0.6931472          0   1     adeno    60  62
## 14 112     10   11  (10,11] 0.0000000          0   1     adeno    60  62
## 15 112     11   12  (11,12] 0.0000000          0   1     adeno    60  62
## 16 112     12   13  (12,13] 0.0000000          0   1     adeno    60  62
## 17 112     13   15  (13,15] 0.6931472          0   1     adeno    60  62
## 18 112     15   16  (15,16] 0.0000000          0   1     adeno    60  62
## 19 112     16   18  (16,18] 0.6931472          0   1     adeno    60  62
## 20 112     18   19  (18,19] 0.0000000          0   1     adeno    60  62
## 21 112     19   20  (19,20] 0.0000000          0   1     adeno    60  62
## 22 112     20   21  (20,21] 0.0000000          0   1     adeno    60  62
## 23 112     21   22  (21,22] 0.0000000          0   1     adeno    60  62
## 24 112     22   24  (22,24] 0.6931472          0   1     adeno    60  62
## 25 112     24   25  (24,25] 0.0000000          0   1     adeno    60  62
## 26 112     25   27  (25,27] 0.6931472          0   1     adeno    60  62
## 27 112     27   29  (27,29] 0.6931472          0   1     adeno    60  62
## 28 112     29   30  (29,30] 0.0000000          0   1     adeno    60  62
## 29 112     30   31  (30,31] 0.0000000          0   1     adeno    60  62
## 30 112     31   33  (31,33] 0.6931472          0   1     adeno    60  62
## 31 112     33   35  (33,35] 0.6931472          0   1     adeno    60  62
## 32 112     35   36  (35,36] 0.0000000          0   1     adeno    60  62
## 33 112     36   42  (36,42] 1.7917595          0   1     adeno    60  62
## 34 112     42   43  (42,43] 0.0000000          0   1     adeno    60  62
## 35 112     43   44  (43,44] 0.0000000          0   1     adeno    60  62
## 36 112     44   45  (44,45] 0.0000000          0   1     adeno    60  62
## 37 112     45   48  (45,48] 1.0986123          0   1     adeno    60  62
## 38 112     48   49  (48,49] 0.0000000          0   1     adeno    60  62
## 39 112     49   51  (49,51] 0.6931472          1   1     adeno    60  62

When no cut points are specified, the default is to use the unique event times. As can be seen from the above output, the function creates an id variable, indicating the subjects \(i = 1,\ldots,n\), with one row per id for each interval the subject “visited”. Thus,

  • subject 42 was under risk during 5 intervals,
  • subject 85 was under risk during one interval,
  • and subject 112 in 33 intervals.

In addition to the optional id variable the function also creates

  • tstart: the beginning of each interval
  • tend: the end of each interval
  • interval: a factor variable denoting the interval
  • offset: the log of the duration during which the subject was under risk in that interval

Additionally the time-constant covariates (here: trt, celltype, karno, etc.) are repeated \(n_i\) number of times, where \(n_i\) is the number of intervals during which subject \(i\) was in the risk set.

Custom cut points

Per default, the as_ped function uses all unique event times as cut points (which is a sensible default as for example the (cumulative) baseline hazard estimates in Cox-PH functions are also updated at event times).

In some cases, however, one might want to reduce the number of cut points, to reduce the size of the resulting data and/or faster estimation times. To do so the cut argument has to be specified. Following the above example, we can use only 4 cut points \(\kappa_0 = 0, \kappa_1 = 50, \kappa_2 = 200, \kappa_3 = 400\), resulting in \(J = 3\) intervals:

ped2 <- as_ped(Surv(time, status) ~ ., data = veteran, cut = c(0, 50, 200, 400),
  id = "id")
ped2 %>% filter(id %in% c(42, 85, 112)) %>% select(-diagtime, -prior)
##    id tstart tend interval   offset ped_status trt  celltype karno age
## 1  42      0   50   (0,50] 1.945910          1   0 smallcell    50  72
## 2  85      0   50   (0,50] 0.000000          1   1  squamous    50  35
## 3 112      0   50   (0,50] 3.912023          0   1     adeno    60  62
## 4 112     50  200 (50,200] 0.000000          1   1     adeno    60  62

Note that now subjects 42 and 85 have only one row in the data set. The fact that subject 42 was in the risk set longer than subject 85 is, however, still accounted for by the offset variable. Note that the offsets for subject 85 and subject 112 in interval (50,200] are only zero because subject 85 had an observed event time of 1, thus log(1-0)=0 and subject 112 had an event at time 51, thus log(51-50)=0:

veteran %>% slice(c(85, 112))
##    id trt celltype time status karno diagtime age prior
## 1  85   1 squamous    1      1    50        7  35     0
## 2 112   1    adeno   51      1    60        5  62     0

Data with time-dependent covariates

In case of data with time-dependent covariates (that should not be modeled as cumulative effects), the follow-up is usually split at desired cut-points and additionally at the time-points at which the TDC changes its value.

We assume that the data is provided in two data sets, one that contains time-to-event data and time-constant covariates and one data set that contains information of time-dependent covariates.

For illustration, we use the pbc data from the survival package. Before the actual data transformation we perform the data preprocessing in vignette("timedep", package="survival"):

data("pbc", package = "survival") # loads pbc and pbcseq
pbc <- pbc %>% filter(id <= 312) %>%
  mutate(status = 1L * (status == 2)) %>%
  select(id:sex, bili, protime)

We want to transform the data such that we could fit a concurrent effect of the time-dependent covariates bili and protime, therefore, the RHS of the formula passed to as_ped needs to contain an additional component, separated by | and all variables for which should be transformed to the concurrent format specified within the concurrent special. The data sets are provided within a list. In concurrent, we also have to specify the name of the variable which stores information on the time points at which the TDCs are updated.

ped_pbc <- as_ped(
  data    = list(pbc, pbcseq),
  formula = Surv(time, status) ~ . + concurrent(bili, protime, tz_var = "day"),
  id      = "id")

Multiple data sets can be provided within the list and multiple concurrent terms with different tz_var arguments can be specified with different lags. This will increase the data set, as an (additional) split point will be generated for each (lagged) time-point at which the TDC was observed:

pbc_ped <- as_ped(data = list(pbc, pbcseq),
  formula = Surv(time, status) ~ . + concurrent(bili, tz_var = "day") +
    concurrent(protime, tz_var = "day", lag = 10),
    id = "id")

If different tz_var arguments are provided, the union of both will be used as cut points in addition to usual cut points.

Warning: For time-points on the follow up that have no observation of the TDC, the last observation will be carried forward until the next update is available.

Data with time-dependent covariates (with cumulative effects)

Here we demonstrate data transformation of data with TDCs (\(\mathbf{z}=\{z(t_{z,q}),q=1,\ldots,Q\}\), that subsequently will be modeled as cumulative effects defined as

\[ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)} h(t, t_z, z(t_z))\mathrm{d}t_z \] Here

  • the three-variate function \(h(t,t_z,z(t_z))\) defines the so-called partial effects of the TDC \(z(t_z)\) observed at exposure time \(t_z\) on the hazard at time \(t\) (Bender et al. 2018). The general partial effect definition given above is only one possibility, other common choices are
     
    • \(h(t-t_z)z(t_z)\) (This is the WCE model by Sylvestre and Abrahamowicz (2009))
       
    • \(h(t-t_z, z(t_z))\) (This is the DLNM model by Gasparrini et al. (2017))
       
  • the cumulative effect \(g(\mathbf{z}, t)\) at follow-up time \(t\) is the integral (or sum) of the partial effects over exposure times \(t_z\) contained within \(\mathcal{T}(t)\)
     

  • the so called lag-lead window (or window of effectiveness) is denoted by \(\mathcal{T}(t)\). This represents the set of exposure times at which exposures can affect the hazard rate at time \(t\). The most common definition is \(\mathcal{T}(t) = \{t_{z,q}: t \geq t_{z,q}, q=1,\ldots, Q\}\), which means that all exposures that were observed prior to \(t\) or at \(t\) are eligible.

As before we use the function as_ped to transform the data and additionally use the formula special cumulative (for functional covariate) to specify the partial effect structure on the right-hand side of the formula.

Let \(t\) (time) the follow up time, \(\mathbf{z}_1\) (z1) and \(\mathbf{z}_2\) (z2) two TDCs observed at different exposure time grids \(t_z\) (tz1) and \(s_z\) (tz2) with lag-lead windows \(\mathcal{T}_1(t) = \{t_{z,q}: t \geq t_{z,q}\}\) and \(\mathcal{T}_2(t) = \{s_{z,k}: t \geq s_{z,k} + 2\}\) (defined by ll2 <- function(t, tz) { t >= tz + 2}).

The table below gives a selection of possible partial effects and the usage of cumulative to create matrices needed to estimate different types of cumulative effects of \(\mathbf{z}_1\) and \(\mathbf{z}_2\):

partial effect as_ped specification
\(\int_{\mathcal{T}_1}h(t-t_z, z_1(t_z))\) cumulative(latency(tz1), z1, tz_var= "tz1")
\(\int_{\mathcal{T}_1}h(t, t-t_z, z_1(t_z))\) cumulative(time, latency(tz1), z1, tz_var="tz1")
\(\int_{\mathcal{T}_1}h(t, t_z, z_1(t_z))\) cumulative(time, tz1, z1, tz_var="tz1")
\(\int_{\mathcal{T}_1}h(t, t_z, z_1(t_z)) + \int_{\mathcal{T}_2}h(t-s_z, z_2(s_z))\) cumulative(time, tz1, z1, tz_var="tz1") + cumulative(latency(tz2), z2, tz_var="tz2", ll_fun=ll2)

Note that

  • the variable representing follow-up time \(t\) in cumulative (here time) must match the time variable specified on the left-hand side of the formula

  • the variable representing exposure time \(t_z\) (here tz1 and tz2) must be wrapped within latency() to tell as_ped to calculate \(t-t_z\)

  • by default, \(\mathcal{T}(t)\) is defined as function(t, tz) {t >= tz}, thus for \(\mathcal{T}_1\) it is not necessary to explicitly specify the lag-lead window. In case you want to use a custom lag-lead window, provide the respective function to the ll_fun argument in cumulative (see ll2 in the examples above)

  • cumulative makes no difference between partial effects such as \(h(t-t_z, z(t_z))\) and \(h(t-t_z)z(t_z)\) as the necessary data transformation is the same in both cases. Later, however, when fitting the model, a distinction must be made

  • more than one \(z\) variable can be provided to cumulative, which can be convenient if multiple covariates should have the same time components and the same lag-lead window

  • multiple cumulative terms can be specified, having different exposure times t_z, s_z and/or different lag-lead windows for different covariates \(\mathbf{z}_1\), \(\mathbf{z}_2\)

  • To tell cumulative which of the variables specified is the exposure time \(t_z\), the tz_var argument must be specified within each cumulative term. The follow-up time component \(t\) will be automatically recognized via the left-hand side of the formula.

One time-dependent covariate

For illustration we use the ICU patients data sets patient and daily provided in the pammtools package, with follow-up time survhosp (\(t\)), exposure time Study_Day (\(t_z\)) and TDCs caloriesPercentage \(\mathbf{z}_1\) and proteinGproKG (\(\mathbf{z}_2\)):

head(patient)
##   Year CombinedicuID CombinedID Survdays PatientDied survhosp Gender Age
## 1 2007          1114       1110     30.1           0     30.1   Male  68
## 2 2007          1114       1111     30.1           0     30.1 Female  57
## 3 2007          1114       1116      9.8           1      9.8 Female  68
## 4 2007           598       1316     30.1           0     30.1   Male  47
## 5 2007           365       1410     30.1           0     30.1   Male  69
## 6 2007           365       1414     30.1           0      5.4   Male  47
##            AdmCatID ApacheIIScore      BMI           DiagID2
## 1 Surgical Elective            20 31.60321   Cardio-Vascular
## 2           Medical            22 24.34176       Respiratory
## 3 Surgical Elective            25 17.99308   Cardio-Vascular
## 4           Medical            16 33.74653 Orthopedic/Trauma
## 5 Surgical Elective            20 38.75433       Respiratory
## 6           Medical            21 45.00622       Respiratory
head(daily)
## # A tibble: 6 x 4
##   CombinedID Study_Day caloriesPercentage proteinGproKG
##        <int>     <int>              <dbl>         <dbl>
## 1       1110         1               0            0    
## 2       1110         2               0            0    
## 3       1110         3               4.05         0    
## 4       1110         4              35.1          0.259
## 5       1110         5              77.2          0.647
## 6       1110         6              17.3          0

Below we illustrate the usage of as_ped and cumulative to obtain covariate matrices for the latency matrix of follow up time and exposure time and a matrix of the TDC (caloriesPercentage). The follow up will be split in 30 intervals with interval breaks points 0:30. By default, the exposure time provided to tz_var will be used as a suffix for the names of the new matrix columns to indicate the exposure they refer to, however, you can override the suffix by specifying the suffix argument to cumulative.

ped <- as_ped(
  data    = list(patient, daily),
  formula = Surv(survhosp, PatientDied) ~ . +
    cumulative(latency(Study_Day), caloriesPercentage, tz_var = "Study_Day"),
  cut     = 0:30,
  id = "CombinedID")
str(ped, 1)
## Classes 'fped', 'ped' and 'data.frame':  37258 obs. of  18 variables:
##  $ CombinedID        : int  1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 ...
##  $ tstart            : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ tend              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ interval          : Factor w/ 30 levels "(0,1]","(1,2]",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ offset            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ped_status        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year              : Factor w/ 4 levels "2007","2008",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CombinedicuID     : Factor w/ 456 levels "21","24","25",..: 355 355 355 355 355 355 355 355 355 355 ...
##  $ Survdays          : num  30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
##  $ Gender            : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age               : int  68 68 68 68 68 68 68 68 68 68 ...
##  $ AdmCatID          : Factor w/ 3 levels "Medical","Surgical Elective",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ApacheIIScore     : int  20 20 20 20 20 20 20 20 20 20 ...
##  $ BMI               : num  31.6 31.6 31.6 31.6 31.6 ...
##  $ DiagID2           : Factor w/ 9 levels "Gastrointestinal",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Study_Day_latency : num [1:37258, 1:12] 0 0 1 2 3 4 5 6 7 8 ...
##  $ caloriesPercentage: num [1:37258, 1:12] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ LL                : num [1:37258, 1:12] 0 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "breaks")= int [1:31] 0 1 2 3 4 5 6 7 8 9 ...
##  - attr(*, "id_var")= chr "CombinedID"
##  - attr(*, "intvars")= chr [1:6] "CombinedID" "tstart" "tend" "interval" ...
##  - attr(*, "trafo_args")=List of 3
##  - attr(*, "time_var")= chr "survhosp"
##  - attr(*, "status_var")= chr "PatientDied"
##  - attr(*, "func")=List of 1
##  - attr(*, "ll_funs")=List of 1
##  - attr(*, "tz")=List of 1
##  - attr(*, "tz_vars")=List of 1
##  - attr(*, "ll_weights")=List of 1
##  - attr(*, "func_mat_names")=List of 1

As you can see, the data is transformed to the PED format as before and two additional matrix columns are added, Study_Day_latency and caloriesPercentage.
 

Two time-dependent covariates observed on the same exposure time grid

Using the same data, we show a slightly more complex example with two functional covariate terms with partial effects \(h(t, t-t_z, z_1(t_z))\) and \(h(t-t_z, z_2(t_z))\). As in the case of ICU patient data, both TDCs were observed on the same exposure time grid, we want to use the same lag-lead window and the latency term \(t-t_z\) occurs in both partial effects, we only need to specify one cumulative term.

ped <- as_ped(
  data = list(patient, daily),
  formula = Surv(survhosp, PatientDied) ~ . +
    cumulative(survhosp, latency(Study_Day), caloriesPercentage, proteinGproKG,
      tz_var = "Study_Day"),
  cut = 0:30,
  id = "CombinedID")
str(ped, 1)
## Classes 'fped', 'ped' and 'data.frame':  37258 obs. of  20 variables:
##  $ CombinedID        : int  1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 ...
##  $ tstart            : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ tend              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ interval          : Factor w/ 30 levels "(0,1]","(1,2]",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ offset            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ped_status        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year              : Factor w/ 4 levels "2007","2008",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CombinedicuID     : Factor w/ 456 levels "21","24","25",..: 355 355 355 355 355 355 355 355 355 355 ...
##  $ Survdays          : num  30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
##  $ Gender            : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age               : int  68 68 68 68 68 68 68 68 68 68 ...
##  $ AdmCatID          : Factor w/ 3 levels "Medical","Surgical Elective",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ApacheIIScore     : int  20 20 20 20 20 20 20 20 20 20 ...
##  $ BMI               : num  31.6 31.6 31.6 31.6 31.6 ...
##  $ DiagID2           : Factor w/ 9 levels "Gastrointestinal",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ survhosp_mat      : int [1:37258, 1:12] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Study_Day_latency : num [1:37258, 1:12] 0 0 1 2 3 4 5 6 7 8 ...
##  $ caloriesPercentage: num [1:37258, 1:12] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ proteinGproKG     : num [1:37258, 1:12] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ LL                : num [1:37258, 1:12] 0 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "breaks")= int [1:31] 0 1 2 3 4 5 6 7 8 9 ...
##  - attr(*, "id_var")= chr "CombinedID"
##  - attr(*, "intvars")= chr [1:6] "CombinedID" "tstart" "tend" "interval" ...
##  - attr(*, "trafo_args")=List of 3
##  - attr(*, "time_var")= chr "survhosp"
##  - attr(*, "status_var")= chr "PatientDied"
##  - attr(*, "func")=List of 1
##  - attr(*, "ll_funs")=List of 1
##  - attr(*, "tz")=List of 1
##  - attr(*, "tz_vars")=List of 1
##  - attr(*, "ll_weights")=List of 1
##  - attr(*, "func_mat_names")=List of 1

Multiple TDCs observed at different exposure time grids

To illustrate data transformation when TDCs \(\mathbf{z}_1\) and \(\mathbf{z}_2\) were observed on different exposure time grids (\(t_z\) and \(s_z\)) and are assumed to have different lag_lead windows \(\mathcal{T}_1(t)\) and \(\mathcal{T}_2(t)\), we use simulated data simdf_elra contained in pammtools (see example in sim_pexp for data generation).

simdf_elra
## # A tibble: 250 x 9
##       id   time status      x1    x2 tz1        z.tz1      tz2        z.tz2     
##  * <int>  <dbl>  <int>   <dbl> <dbl> <list>     <list>     <list>     <list>    
##  1     1  3.22       1  1.59   4.61  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  2     2 10          0 -0.530  0.178 <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  3     3  0.808      1 -2.43   3.25  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  4     4  3.04       1  0.696  1.26  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  5     5  3.36       1  2.54   3.71  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  6     6  4.07       1 -0.0231 0.607 <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  7     7  0.534      1  1.68   1.74  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  8     8  1.57       1 -1.89   2.51  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
##  9     9  2.39       1  0.126  3.38  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
## 10    10  6.61       1  1.79   1.06  <int [10]> <dbl [10]> <int [11]> <dbl [11]>
## # … with 240 more rows
simdf_elra %>% slice(1) %>% select(id, tz1) %>% unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(tz1)`
## # A tibble: 10 x 2
##       id   tz1
##    <int> <int>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     1     5
##  6     1     6
##  7     1     7
##  8     1     8
##  9     1     9
## 10     1    10
simdf_elra %>% slice(1) %>% select(id, tz2) %>% unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(tz2)`
## # A tibble: 11 x 2
##       id   tz2
##    <int> <int>
##  1     1    -5
##  2     1    -4
##  3     1    -3
##  4     1    -2
##  5     1    -1
##  6     1     0
##  7     1     1
##  8     1     2
##  9     1     3
## 10     1     4
## 11     1     5

Note that tz1 has a maximum length of 10, while tz2 has a maximum length of 11 and while tz1 was observed during the follow up, tz2 was partially observed before the beginning of the follow up.

If we wanted to estimate the following cumulative effects

  • \(g(\mathbf{z}_1, t) = \int_{\mathcal{T}_1(t)} h(t, t-t_z, z_1(s_z))\) and
  • \(g(\mathbf{z}_2, t) = \int_{\mathcal{T}_2(t)} h(t-t_z, z_2(s_z))\)

with \(\mathcal{T}_1(t) = \{t_z: t \geq t_z + 2\}\) and \(\mathcal{T}_2(t) = \{s_z: t \geq s_z\}\) the data transformation function must reflect the different exposure times as well as different lag-lead window specifications as illustrated below:

ped_sim <- as_ped(
  data = simdf_elra,
  formula = Surv(time, status) ~ . +
    cumulative(time, latency(tz1), z.tz1, tz_var = "tz1") +
    cumulative(latency(tz2), z.tz2, tz_var = "tz2"),
  cut = 0:10,
  id = "id")
str(ped_sim, 1)
## Classes 'fped', 'ped' and 'data.frame':  1004 obs. of  15 variables:
##  $ id          : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ tstart      : num  0 1 2 3 0 1 2 3 4 5 ...
##  $ tend        : int  1 2 3 4 1 2 3 4 5 6 ...
##  $ interval    : Factor w/ 10 levels "(0,1]","(1,2]",..: 1 2 3 4 1 2 3 4 5 6 ...
##  $ offset      : num  0 0 0 -1.53 0 ...
##  $ ped_status  : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ x1          : num  1.59 1.59 1.59 1.59 -0.53 ...
##  $ x2          : num  4.612 4.612 4.612 4.612 0.178 ...
##  $ time_tz1_mat: int [1:1004, 1:10] 0 1 2 3 0 1 2 3 4 5 ...
##  $ tz1_latency : num [1:1004, 1:10] 0 0 1 2 0 0 1 2 3 4 ...
##  $ z.tz1_tz1   : num [1:1004, 1:10] -2.014 -2.014 -2.014 -2.014 -0.978 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ LL_tz1      : num [1:1004, 1:10] 0 1 1 1 0 1 1 1 1 1 ...
##  $ tz2_latency : num [1:1004, 1:11] 5 6 7 8 5 6 7 8 9 10 ...
##  $ z.tz2_tz2   : num [1:1004, 1:11] -0.689 -0.689 -0.689 -0.689 0.693 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ LL_tz2      : num [1:1004, 1:11] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "breaks")= int [1:11] 0 1 2 3 4 5 6 7 8 9 ...
##  - attr(*, "id_var")= chr "id"
##  - attr(*, "intvars")= chr [1:6] "id" "tstart" "tend" "interval" ...
##  - attr(*, "trafo_args")=List of 3
##  - attr(*, "time_var")= chr "time"
##  - attr(*, "status_var")= chr "status"
##  - attr(*, "func")=List of 2
##  - attr(*, "ll_funs")=List of 2
##  - attr(*, "tz")=List of 2
##  - attr(*, "tz_vars")=List of 2
##  - attr(*, "ll_weights")=List of 2
##  - attr(*, "func_mat_names")=List of 2

Note how the matrix covariates associated with tz1 have 10 columns while matrix covariates associated with tz2 have 11 columns. Note also that the lag-lead matrices differ

gg_laglead(ped_sim)

References

Bender, Andreas, Fabian Scheipl, Wolfgang Hartl, Andrew G. Day, and Helmut Küchenhoff. 2018. “Penalized Estimation of Complex, Non-Linear Exposure-Lag-Response Associations.” Biostatistics. https://doi.org/10.1093/biostatistics/kxy003.

Gasparrini, Antonio, Fabian Scheipl, Ben Armstrong, and Michael G. Kenward. 2017. “A Penalized Framework for Distributed Lag Non-Linear Models.” Biometrics 73: 938–48. https://doi.org/10.1111/biom.12645.

Kalbfleisch, J., and R. Prentice. 1980. The Statistical Analysis of Failure Time Data. New York: Wiley.

Sylvestre, Marie-Pierre, and Michal Abrahamowicz. 2009. “Flexible Modeling of the Cumulative Effects of Time-Dependent Exposures on the Hazard.” Statistics in Medicine 28 (27): 3437–53. https://doi.org/10.1002/sim.3701.