Background

Predictive biomarkers are biometric measurements that subdivide patient populations into groups which draw differing benefits from a given treatment. These biomarkers therefore play an important role in precision medicine: as indicators of treatment effect modification, they can inform patient treatment decisions. Predictive biomarkers may also be used for drug target discovery, leading to an improved understanding of novel medications’ functioning.

The discovery of predictive biomarkers through statistical techniques has, to date, largely been a byproduct of treatment rule estimation. As implied by its name, this tasks consists of learning which treatment would most benefit a patient given their characteristics. These characteristics could include, but are not limited to, potentially predictive biomarkers. When interpretable statistical learning methods are used to estimate these rules, biomarkers found to be “important” might be declared predictive. For example, estimation methods based on regularized linear regression might define a biomarker as predictive if its treatment interaction coefficient estimate is non-zero (Tian et al. 2014).

While this approach to predictive biomarker discovery works well in traditional asymptotic settings, it is not so when the number of biomarkers is similar to or larger than the number of patients in a study. These methods offer no guarantee of false positive rate control (Tian et al. 2014; Chen et al. 2017), which can have severe consequences in practice. The inclusion of false positive biomarkers in diagnostic assays decreases their clinical utility, and already limited resources are wasted on biological validation experiments of duds for drug target discovery purposes. Patient outcomes are ultimately negatively affected.

We proposed in recent work a variable importance parameter that directly measures individual biomarkers’ roles in treatment effect modification. The methodology, dubbed uniCATE, relies on semiparametric theory to perform assumption-lean inference about this parameter. We demonstrated that our procedure controls the rate of false discoveries in high-dimensional randomized control trials, allowing for the accurate identification of predictive biomarkers. This methodology is implemented in the uniCATE package.

uniCATE in Action

This section provides a high-level overview of the statistical methodology for various outcome types, and includes brief tutorials of the package’s methods applied to simulated data.

Continuous Outcomes

Setup

Suppose we perform a randomized control trial (RCT) where each patient is represented by one of \(n\) independent and identically distributed (i.i.d.) copies of the full-data random vector \(X_i = (W_i, A_i, Y_i^{(0)}, Y_i^{(1)})\) distributed according \(P_X\), \(i = 1, \ldots, n\). Here, \(W_i = (V_i, B_i)\) is a \((q+p)\)-length random vector of \(q\) pre-treatment covariates, \(V\), like location and nationality, and \(p\) pre-treatment biomarkers, \(B\), such as gene expression data, \(A\) is a binary random variable representing treatment assignment, and \(Y^{(1)}\) and \(Y^{(0)}\) are continuous random variables corresponding to the potential outcomes of clinical interest under both treatment and control conditions, respectively. Of course, we do not have access to patients’ full-data random vector since we only get to observe the outcomes associated with their treatment assignments. These \(n\) i.i.d. random observations are represented by \(O_i = (W_i, A_i, Y_i)\) distributed according to the unknown data-generating distribution \(P_0\). \(W_i\) and \(A_i\) are defined as before, and \(Y_i = Y_i^{(0)}(1-A_i) + Y^{(1)}_iA_i\). Throughout the rest of the tutorial, we drop the index \(i\) where possible to simplify notation.

Target Parameter

Ideally, we would like to estimate the full-data parameter \(\Psi(P_x) = (\Psi_1(P_X), \ldots, \Psi_p(P_X))\) corresponding to \(p\)-length random vector whose elements are defined as \[ \Psi_j(P_X) = \frac{\mathbb{E}_{P_X}\left[\left(Y^{(1)}-Y^{(0)}\right)B_j\right]} {\mathbb{E}_{P_X}\left[B_j^2\right]} \] for \(j=1, \ldots, p\) where we assume that the biomarkers are centered at zero. Under the assumption that the mean difference in potential outcomes is a linear function of \(B_j\), \(\Psi(P_X)\) is the vector of expected simple linear regression coefficients produced by regressing the difference in potential outcomes against each biomarker. While the true relationship between the difference of potential outcomes and a predictive biomarker is almost surely nonlinear, \(\Psi(P_X)\) is a generally informative target of inference. These parameters are reasonable indicators of all but pathological relationships between biomarkers and potential outcome differences.

Under conditions that are generally satisfied in the clinical trial setting, we find that \(\Psi(P_X)\) can be estimated from realizations of the observed data-generating distribution. Further, the estimator of this parameter implemented in the unicate() function is consistent and asymptotically linear, permitting valid statistical inference in most RCT scenarios. Details are provided in the manuscript.

Example

We simulate some RCT data to demonstrate the application of unicate(). We refrain from generating a high-dimensional vector of biomarkers for computational convenience.

# simulate n trial participants
n <- 200

# simulate randomized control trial data
set.seed(514)
data <- tibble(
    B1 = rnorm(n, mean = 2, sd = 0.2),
    B2 = rnorm(n, mean = -2, sd = 0.2),
    B3 = rnorm(n, mean = 0, sd = 0.1),
    B4 = rnorm(n, mean = 0, sd = 0.1),
    V1 = rbinom(n, 1, 0.4),
    V2 = rnorm(n),
    A = rbinom(n, 1, 0.5),
    Y = A + V1 - V2 + 2 * B1 * A + B2 * A
)

# define the propensity scores
propensity_score_ls <- list("1" = 0.5, "0" = 0.5)

# glimpse of data structure
head(data)
#> # A tibble: 6 × 8
#>      B1    B2      B3      B4    V1      V2     A      Y
#>   <dbl> <dbl>   <dbl>   <dbl> <int>   <dbl> <int>  <dbl>
#> 1  2.62 -2.26  0.0881  0.0449     1 -0.0815     1  5.07 
#> 2  1.68 -1.72 -0.0746  0.0828     1  0.518      0  0.482
#> 3  2.23 -2.12 -0.0437  0.0317     1 -0.737      1  5.08 
#> 4  2.20 -2.08 -0.0942  0.0964     1  0.534      1  3.79 
#> 5  2.31 -1.71 -0.0477 -0.0700     1  0.648      0  0.352
#> 6  2.35 -1.96 -0.0888 -0.103      1  1.93       0 -0.928

This simulated trial is made up of 200 participants. Four pre-treatment biomarkers and two pre-treatment covariates are generated for each participant. Observations are then randomly assigned with equal probability to either the treatment or the control condition. From the definition of the outcome, it is clear that B1 and B2 are predictive biomarkers. Finally, we define the propensity_score_ls, a list indicating the probability that any given patient is assigned to the treatment condition, "1", or the control condition, "0". Note that the treatment condition’s entry in this list always precedes the control condition’s.

We now apply unicate() to this data to recover the variable importance measures of the biomarkers.

unicate(
  data,
  outcome = "Y",
  treatment = "A",
  covariates = c("B1", "B2", "B3", "B4", "V1", "V2"),
  biomarkers =  c("B1", "B2", "B3", "B4"),
  propensity_score_ls = propensity_score_ls
)
#> # A tibble: 4 × 7
#>   biomarker   coef     se      z   p_value p_value_bh p_value_holm
#>   <chr>      <dbl>  <dbl>  <dbl>     <dbl>      <dbl>        <dbl>
#> 1 B1         2.00  0.0650 30.9   5.30e-209  2.12e-208    2.12e-208
#> 2 B2         1.03  0.168   6.13  9.03e- 10  1.81e-  9    2.71e-  9
#> 3 B4         0.590 0.388   1.52  1.28e-  1  1.70e-  1    2.56e-  1
#> 4 B3        -0.152 0.348  -0.436 6.63e-  1  6.63e-  1    6.63e-  1

The output is a table summarizing the variable importance parameter estimates of each biomarker (coef), the estimators’ standard errors (se), the two-sided hypothesis tests’ Z-scores (z), and the tests’ (adjusted) p-values (p_value, p_value_bh, and p_value_holm).

As expected, B1 modifies the treatment effect the most, followed by B2. Both of their variable importance measures are found to be significantly different from zero (using a 5% significance threshold) at the nominal level and after adjusting for multiple testing. As patients’ B1 and B2 values marginally increase, so too does their benefit from being assigned to the treatment condition over the control condition. The remaining biomarkers’ variable importance parameters are not significantly different from zero at the same level of significance. There is insufficient evidence to categorize them as predictive.

Binary Outcomes

Setup

The assumed full-data and observed data-generating processes are identical to those of the continuous outcome save that \(Y^{(0)}, Y^{(1)}\) and \(Y\) are binary random variables.

Target Parameter

The full-data parameter of interest looks slightly different from the continuous outcome scenario to reflect the use of a binary outcome. We target \(\Psi(P_X) = (\Psi_1(P_X), \ldots, \Psi_p(P_X))\) where \[ \Psi_j(P_X) = \frac{\mathbb{E}_{P_X}\left[\left(\mathbb{P}_{P_X}(Y^{(1)} =1| W) - \mathbb{P}_{P_X}(Y^{(0)} = 1| W)\right)B_j\right]} {\mathbb{E}_{P_X}\left[B_j^2\right]} \] for centered biomarkers indexed by \(j = 1, \ldots, p\). Assuming a linear relationship between the difference of the potential outcomes’ probability of success and the covariates, this variable importance parameter consists of the simple linear regression coefficients of the difference in the conditional potential outcome success probabilities regressed on each biomarker. While the true relationship between potential outcome probabilities and covariates is again unlikely to be linear, this parameter is telling of biomarkers’ predictive capacities. Biomarkers with the largest absolute values will correspond to those that meaningfully modify the treatment effect. In truth, this is the same variable importance parameter as the for continuous outcomes, but written in a more intuitive form.

As in the continuous case, this parameter can generally be estimated from the observed RCT data under non-stringent assumptions about the data-generating process. Inference is in fact based on the same estimator used for continuous outcomes, and so we again use unicate() to learn about the biomarkers’ marginal importance in regards to binary treatment effect modification. For details, please review the manuscript. We do not provide an example since it would be nearly identical to that given in the previous subsection.

Survival Outcomes

Setup

We perform an RCT where the outcome of interest is a right-censored time-to-event outcome. We consider \(n\) i.i.d. copies of the full-data random vector \(X = (W, A, T^{(0)}, C^{(0)}, T^{(1)}, C^{(1)})\) distributed according to \(P_X\) where \(W\) and \(A\) are defined as in the continuous outcome case. \(T^{(0)}\) and \(T^{(1)}\) correspond to the event times under control and treatment conditions, and \(C^{(0)}\) and \(C^{(1)}\) correspond to the censoring times under control and treatment conditions. As before, the full-data random vectors are never observed; instead, we observe \(n\) i.i.d. copies of \(O = (W, A, \Delta, \tilde{T})\) distributed according to \(P_0\). Here, \(\Delta\) is a censoring indicator and \(\tilde{T}=(1-A)\;\text{min}(T^{(0)}, C^{(0)})+A\;\text{min}(T^{(1)}, C^{(1)})\) is the last measurement time.

Target Parameter

In this scenario, our variance important parameter is a function of the treatment effect on survival at time \(t_0\). Recall that survival at time \(t_0\) in the full-data is defined as \(S^{(A)}(t_0|W) = \mathbb{P}_{P_X}(T^{(A)} > t_0 | W)\). The target parameter is then represented by \(\Psi(P_X; t_0) = (\Psi_1(P_X; t_0), \ldots, \Psi_p(P_X; t_0))\) where \[ \Psi_j(P_X;t_0) = \frac{\mathbb{E}_{P_X}\left[\left(S^{(1)}(t_0|W)-S^{(0)}(t_0|W)\right)B_j\right]} {\mathbb{E}_{P_X}\left[B_j^2\right]} \] for centered biomarkers indexed by \(j = 1, \ldots, p\). Similarly to the binary outcome scenario, this variable importance measure assumes a linear relationship between the contrast of survival outcomes at time \(t_0\) and each biomarker, and consists of the simple linear regression coefficients of this contrast regressed on each biomarker. Again, while the true relationship between the difference in survival probabilities and biomarkers is unlikely to be linear, it provides a metric by which to evaluate biomarkers’ abilities to modify treatment effects on the absolute scale.

As with the other variable importance metrics, this parameter can often be estimated from the observed data in a RCT. The estimator of this parameter is implemented in the sunicate() function. Under the assumption that the censoring mechanism is consistently estimated by our procedure, the sunicate() estimator is consistent and asymptotically linear. Valid statistical inference about the strength of biomarkers’ treatment effect modification is therefore made possible with survival outcomes.

Example

We now simulate some RCT data with a right-censored time to event outcome. Again, we consider a low-dimensional biomarker vector for computational efficiency.

# simulate n trial participants
n <- 200

# generate a tibble of baseline characteristics
baseline_data <- tibble(
  A = rbinom(n, 1, 0.5),
  V1 = rnorm(n),
  V2 = rnorm(n),
  B1 = rnorm(n),
  B2 = rnorm(n)
)

# simulate event and censoring times
# define hazard functions
cond_surv_hazard <- function(t, a, b1) {
  (t < 9) / (1 + exp(- (-2 - 3 * a * b1))) +
    (t == 9)
}

# generate the event indicators for t = 1 to 9
event_time <- sapply(
  seq_len(n),
  function(obs) {
    event_time <- NA
    for (t in 1:9) {
      prob <- cond_surv_hazard(t, baseline_data$A[obs], baseline_data$B1[obs])
      status <- rbinom(1, 1, prob)
      if (status == 1) {
        event_time <- t
        break
      }
    }
    return(event_time)
  })

# generate the censoring events for t = 1 to 9
censor_time <- sapply(
  seq_len(n),
  function(obs) {
    censor_time <- NA
    for (t in 1:9) {
      status <- rbinom(1, 1, 0.15)
      if (status == 1) {
        censor_time <- t
        break
      }
    }
    if (is.na(censor_time)) censor_time <- 10
    return(censor_time)
  })

# assemble the event and censoring times into a tibble
status_df <- tibble(
    "event_time" = as.integer(event_time),
    "censor_time" = as.integer(censor_time)
  ) %>%
    transmute(
      time = if_else(censor_time < event_time, censor_time, event_time),
      censor = if_else(censor_time < event_time, 1, 0),
      event = if_else(censor_time >= event_time, 1, 0)
    )

# assemble the complete RCT data
# uniCATE functions can handle treatment indicators as factor variables, too
data <- baseline_data %>%
  mutate(
    A = if_else(A == 1, "treatment", "control"),
    A = factor(A, levels = c("treatment", "control"))
    ) %>%
  bind_cols(status_df)

# glimpse of data structure
head(data)
#> # A tibble: 6 × 8
#>   A              V1      V2     B1     B2  time censor event
#>   <fct>       <dbl>   <dbl>  <dbl>  <dbl> <int>  <dbl> <dbl>
#> 1 treatment  0.786   0.509   0.735  0.950     9      0     1
#> 2 control   -1.07   -1.69   -0.386  0.555     2      0     1
#> 3 control   -0.0627 -2.07    1.43  -0.169     1      1     0
#> 4 control   -0.979   0.732   0.475  0.967     1      0     1
#> 5 control    0.267  -0.0529 -0.171 -0.252     9      0     1
#> 6 control    0.212  -0.208   1.30   0.162     1      1     0

The data simulated in the previous chunk relies on an uninformative censoring mechanism and a event mechanism that is a function of the interaction between the treatment assignment \(A\) and the biomarker \(B1\). These event and censoring times occur at one of 9 discrete relative follow-up times for all 500 study participants. Take special notice of the censor and event columns that describe the relative time measurement in time: the former is a binary indicator of censoring, and the later of event. Finally, note that sunicate() (and unicate()) accept treatment indicators that are factor variables, like A.

We now assess the predictive importance of biomarkers with respect to survival at time \(t_0=6\) by setting the time_cutoff argument to 6. Note when the time_cutoff argument is not specified, \(t_0\) defaults to the median relative time in the data argument.

sunicate(
  data = data,
  event = "event",
  censor = "censor",
  relative_time = "time",
  treatment = "A",
  covariates = c("V1", "V2", "B1", "B2"),
  biomarkers = c("B1", "B2"),
  time_cutoff = 6,
  propensity_score_ls = list("treatment" = 0.5, "control" = 0.5)
)
#> # A tibble: 2 × 7
#>   biomarker    coef     se      z   p_value p_value_bh p_value_holm
#>   <chr>       <dbl>  <dbl>  <dbl>     <dbl>      <dbl>        <dbl>
#> 1 B1         0.343  0.0831  4.13  0.0000362  0.0000725    0.0000725
#> 2 B2        -0.0414 0.101  -0.411 0.681      0.681        0.681

sunicate() produces a table identical to that generated by unicate(). It provides very strong evidence that B1 is a treatment effect modifier with a non-zero variance important parameter. B2, on the other hand, does not appear to be a predictive biomarker since we fail to reject its associated test. These results capture the ground truth of our data-generating process.

Additional Features

unicate() and sunicate() rely on sensible default arguments to simplify their usage. However, advanced users may be interested changing these defaults. In particular, parallelization routines relying on the future suite might be employed when analyzing large data sets by setting the parallel argument to TRUE. The number of folds used during the K-fold cross-validation routines might be changed through the v_folds argument. Finally, the semiparametric estimators’ nuisance parameters are learned using pre-packaged sl3 Super Learners. Custom Super Learners can be employed in their place.

Session Information

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.7   uniCATE_0.4.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] nlme_3.1-155         matrixStats_0.61.0   fs_1.5.2            
#>   [4] lubridate_1.8.0      progress_1.2.2       rprojroot_2.0.2     
#>   [7] tools_4.1.2          backports_1.4.1      bslib_0.3.1         
#>  [10] utf8_1.2.2           R6_2.5.1             rpart_4.1.16        
#>  [13] DBI_1.1.2            colorspace_2.0-2     nnet_7.3-17         
#>  [16] withr_2.4.3          tidyselect_1.1.1     prettyunits_1.1.1   
#>  [19] compiler_4.1.2       glmnet_4.1-3         textshaping_0.3.6   
#>  [22] cli_3.1.1            desc_1.4.0           sass_0.4.0          
#>  [25] scales_1.1.1         checkmate_2.0.0      randomForest_4.6-14 
#>  [28] pkgdown_2.0.2        systemfonts_1.0.3    stringr_1.4.0       
#>  [31] digest_0.6.29        rmarkdown_2.11       pkgconfig_2.0.3     
#>  [34] htmltools_0.5.2      parallelly_1.30.0    fastmap_1.1.0       
#>  [37] htmlwidgets_1.5.4    rlang_1.0.1          BBmisc_1.11         
#>  [40] shape_1.4.6          visNetwork_2.1.0     jquerylib_0.1.4     
#>  [43] generics_0.1.2       jsonlite_1.7.3       ModelMetrics_1.2.2.2
#>  [46] magrittr_2.0.2       delayed_0.3.0        Matrix_1.4-0        
#>  [49] Rcpp_1.0.8           munsell_0.5.0        fansi_1.0.2         
#>  [52] abind_1.4-5          lifecycle_1.0.1      pROC_1.18.0         
#>  [55] stringi_1.7.6        yaml_2.2.2           MASS_7.3-55         
#>  [58] plyr_1.8.6           recipes_0.1.17       grid_4.1.2          
#>  [61] parallel_4.1.2       listenv_0.8.0        crayon_1.4.2        
#>  [64] lattice_0.20-45      splines_4.1.2        hms_1.1.1           
#>  [67] knitr_1.37           pillar_1.7.0         igraph_1.2.11       
#>  [70] uuid_1.0-3           origami_1.0.5        stats4_4.1.2        
#>  [73] reshape2_1.4.4       future.apply_1.8.1   codetools_0.2-18    
#>  [76] glue_1.6.1           evaluate_0.14        data.table_1.14.2   
#>  [79] vctrs_0.3.8          Rdpack_2.1.3         foreach_1.5.2       
#>  [82] gtable_0.3.0         purrr_0.3.4          rstackdeque_1.1.1   
#>  [85] future_1.23.0        assertthat_0.2.1     cachem_1.0.6        
#>  [88] ggplot2_3.3.5        xfun_0.29            sl3_1.4.4           
#>  [91] gower_0.2.2          rbibutils_2.2.7      prodlim_2019.11.13  
#>  [94] ragg_1.2.1           class_7.3-20         survival_3.2-13     
#>  [97] timeDate_3043.102    tibble_3.1.6         iterators_1.0.13    
#> [100] memoise_2.0.1        lava_1.6.10          globals_0.14.0      
#> [103] imputeMissings_0.0.3 ellipsis_0.3.2       caret_6.0-90        
#> [106] ROCR_1.0-11          ipred_0.9-12

References

Chen, Shuai, Lu Tian, Tianxi Cai, and Menggang Yu. 2017. “A General Statistical Framework for Subgroup Identification and Comparative Treatment Scoring.” Biometrics 73 (4): 1199–209. https://doi.org/https://doi.org/10.1111/biom.12676.
Tian, Lu, Ash A. Alizadeh, Andrew J. Gentles, and Robert Tibshirani. 2014. “A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates.” Journal of the American Statistical Association 109 (508): 1517–32. https://doi.org/10.1080/01621459.2014.951443.