Predictive biomarkers are biometric measurements that subdivide patient populations into groups which draw differing benefits from a given treatment. These biomarkers therefore play an important role in precision medicine: as indicators of treatment effect modification, they can inform patient treatment decisions. Predictive biomarkers may also be used for drug target discovery, leading to an improved understanding of novel medications’ functioning.
The discovery of predictive biomarkers through statistical techniques has, to date, largely been a byproduct of treatment rule estimation. As implied by its name, this tasks consists of learning which treatment would most benefit a patient given their characteristics. These characteristics could include, but are not limited to, potentially predictive biomarkers. When interpretable statistical learning methods are used to estimate these rules, biomarkers found to be “important” might be declared predictive. For example, estimation methods based on regularized linear regression might define a biomarker as predictive if its treatment interaction coefficient estimate is non-zero (Tian et al. 2014).
While this approach to predictive biomarker discovery works well in traditional asymptotic settings, it is not so when the number of biomarkers is similar to or larger than the number of patients in a study. These methods offer no guarantee of false positive rate control (Tian et al. 2014; Chen et al. 2017), which can have severe consequences in practice. The inclusion of false positive biomarkers in diagnostic assays decreases their clinical utility, and already limited resources are wasted on biological validation experiments of duds for drug target discovery purposes. Patient outcomes are ultimately negatively affected.
We proposed in recent work a variable importance parameter that
directly measures individual biomarkers’ roles in treatment effect
modification. The methodology, dubbed uniCATE, relies on
semiparametric theory to perform assumption-lean inference about this
parameter. We demonstrated that our procedure controls the rate of false
discoveries in high-dimensional randomized control trials, allowing for
the accurate identification of predictive biomarkers. This methodology
is implemented in the uniCATE
package.
uniCATE
in Action
This section provides a high-level overview of the statistical methodology for various outcome types, and includes brief tutorials of the package’s methods applied to simulated data.
Suppose we perform a randomized control trial (RCT) where each patient is represented by one of \(n\) independent and identically distributed (i.i.d.) copies of the full-data random vector \(X_i = (W_i, A_i, Y_i^{(0)}, Y_i^{(1)})\) distributed according \(P_X\), \(i = 1, \ldots, n\). Here, \(W_i = (V_i, B_i)\) is a \((q+p)\)-length random vector of \(q\) pre-treatment covariates, \(V\), like location and nationality, and \(p\) pre-treatment biomarkers, \(B\), such as gene expression data, \(A\) is a binary random variable representing treatment assignment, and \(Y^{(1)}\) and \(Y^{(0)}\) are continuous random variables corresponding to the potential outcomes of clinical interest under both treatment and control conditions, respectively. Of course, we do not have access to patients’ full-data random vector since we only get to observe the outcomes associated with their treatment assignments. These \(n\) i.i.d. random observations are represented by \(O_i = (W_i, A_i, Y_i)\) distributed according to the unknown data-generating distribution \(P_0\). \(W_i\) and \(A_i\) are defined as before, and \(Y_i = Y_i^{(0)}(1-A_i) + Y^{(1)}_iA_i\). Throughout the rest of the tutorial, we drop the index \(i\) where possible to simplify notation.
Ideally, we would like to estimate the full-data parameter \(\Psi(P_x) = (\Psi_1(P_X), \ldots, \Psi_p(P_X))\) corresponding to \(p\)-length random vector whose elements are defined as \[ \Psi_j(P_X) = \frac{\mathbb{E}_{P_X}\left[\left(Y^{(1)}-Y^{(0)}\right)B_j\right]} {\mathbb{E}_{P_X}\left[B_j^2\right]} \] for \(j=1, \ldots, p\) where we assume that the biomarkers are centered at zero. Under the assumption that the mean difference in potential outcomes is a linear function of \(B_j\), \(\Psi(P_X)\) is the vector of expected simple linear regression coefficients produced by regressing the difference in potential outcomes against each biomarker. While the true relationship between the difference of potential outcomes and a predictive biomarker is almost surely nonlinear, \(\Psi(P_X)\) is a generally informative target of inference. These parameters are reasonable indicators of all but pathological relationships between biomarkers and potential outcome differences.
Under conditions that are generally satisfied in the clinical trial
setting, we find that \(\Psi(P_X)\) can
be estimated from realizations of the observed data-generating
distribution. Further, the estimator of this parameter implemented in
the unicate()
function is consistent and asymptotically
linear, permitting valid statistical inference in most RCT scenarios.
Details are provided in the manuscript.
We simulate some RCT data to demonstrate the application of
unicate()
. We refrain from generating a high-dimensional
vector of biomarkers for computational convenience.
# simulate n trial participants
n <- 200
# simulate randomized control trial data
set.seed(514)
data <- tibble(
B1 = rnorm(n, mean = 2, sd = 0.2),
B2 = rnorm(n, mean = -2, sd = 0.2),
B3 = rnorm(n, mean = 0, sd = 0.1),
B4 = rnorm(n, mean = 0, sd = 0.1),
V1 = rbinom(n, 1, 0.4),
V2 = rnorm(n),
A = rbinom(n, 1, 0.5),
Y = A + V1 - V2 + 2 * B1 * A + B2 * A
)
# define the propensity scores
propensity_score_ls <- list("1" = 0.5, "0" = 0.5)
# glimpse of data structure
head(data)
#> # A tibble: 6 × 8
#> B1 B2 B3 B4 V1 V2 A Y
#> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
#> 1 2.62 -2.26 0.0881 0.0449 1 -0.0815 1 5.07
#> 2 1.68 -1.72 -0.0746 0.0828 1 0.518 0 0.482
#> 3 2.23 -2.12 -0.0437 0.0317 1 -0.737 1 5.08
#> 4 2.20 -2.08 -0.0942 0.0964 1 0.534 1 3.79
#> 5 2.31 -1.71 -0.0477 -0.0700 1 0.648 0 0.352
#> 6 2.35 -1.96 -0.0888 -0.103 1 1.93 0 -0.928
This simulated trial is made up of 200 participants. Four
pre-treatment biomarkers and two pre-treatment covariates are generated
for each participant. Observations are then randomly assigned with equal
probability to either the treatment or the control condition. From the
definition of the outcome, it is clear that B1
and
B2
are predictive biomarkers. Finally, we define the
propensity_score_ls
, a list indicating the probability that
any given patient is assigned to the treatment condition,
"1"
, or the control condition, "0"
. Note that
the treatment condition’s entry in this list always precedes the control
condition’s.
We now apply unicate()
to this data to recover the
variable importance measures of the biomarkers.
unicate(
data,
outcome = "Y",
treatment = "A",
covariates = c("B1", "B2", "B3", "B4", "V1", "V2"),
biomarkers = c("B1", "B2", "B3", "B4"),
propensity_score_ls = propensity_score_ls
)
#> # A tibble: 4 × 7
#> biomarker coef se z p_value p_value_bh p_value_holm
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 B1 2.00 0.0650 30.9 5.30e-209 2.12e-208 2.12e-208
#> 2 B2 1.03 0.168 6.13 9.03e- 10 1.81e- 9 2.71e- 9
#> 3 B4 0.590 0.388 1.52 1.28e- 1 1.70e- 1 2.56e- 1
#> 4 B3 -0.152 0.348 -0.436 6.63e- 1 6.63e- 1 6.63e- 1
The output is a table summarizing the variable importance parameter
estimates of each biomarker (coef
), the estimators’
standard errors (se
), the two-sided hypothesis tests’
Z-scores (z
), and the tests’ (adjusted) p-values
(p_value
, p_value_bh
, and
p_value_holm
).
As expected, B1
modifies the treatment effect the most,
followed by B2
. Both of their variable importance measures
are found to be significantly different from zero (using a 5%
significance threshold) at the nominal level and after adjusting for
multiple testing. As patients’ B1
and B2
values marginally increase, so too does their benefit from being
assigned to the treatment condition over the control condition. The
remaining biomarkers’ variable importance parameters are not
significantly different from zero at the same level of significance.
There is insufficient evidence to categorize them as predictive.
The assumed full-data and observed data-generating processes are identical to those of the continuous outcome save that \(Y^{(0)}, Y^{(1)}\) and \(Y\) are binary random variables.
The full-data parameter of interest looks slightly different from the continuous outcome scenario to reflect the use of a binary outcome. We target \(\Psi(P_X) = (\Psi_1(P_X), \ldots, \Psi_p(P_X))\) where \[ \Psi_j(P_X) = \frac{\mathbb{E}_{P_X}\left[\left(\mathbb{P}_{P_X}(Y^{(1)} =1| W) - \mathbb{P}_{P_X}(Y^{(0)} = 1| W)\right)B_j\right]} {\mathbb{E}_{P_X}\left[B_j^2\right]} \] for centered biomarkers indexed by \(j = 1, \ldots, p\). Assuming a linear relationship between the difference of the potential outcomes’ probability of success and the covariates, this variable importance parameter consists of the simple linear regression coefficients of the difference in the conditional potential outcome success probabilities regressed on each biomarker. While the true relationship between potential outcome probabilities and covariates is again unlikely to be linear, this parameter is telling of biomarkers’ predictive capacities. Biomarkers with the largest absolute values will correspond to those that meaningfully modify the treatment effect. In truth, this is the same variable importance parameter as the for continuous outcomes, but written in a more intuitive form.
As in the continuous case, this parameter can generally be estimated
from the observed RCT data under non-stringent assumptions about the
data-generating process. Inference is in fact based on the same
estimator used for continuous outcomes, and so we again use
unicate()
to learn about the biomarkers’ marginal
importance in regards to binary treatment effect modification. For
details, please review the manuscript. We do not provide an example
since it would be nearly identical to that given in the previous
subsection.
We perform an RCT where the outcome of interest is a right-censored time-to-event outcome. We consider \(n\) i.i.d. copies of the full-data random vector \(X = (W, A, T^{(0)}, C^{(0)}, T^{(1)}, C^{(1)})\) distributed according to \(P_X\) where \(W\) and \(A\) are defined as in the continuous outcome case. \(T^{(0)}\) and \(T^{(1)}\) correspond to the event times under control and treatment conditions, and \(C^{(0)}\) and \(C^{(1)}\) correspond to the censoring times under control and treatment conditions. As before, the full-data random vectors are never observed; instead, we observe \(n\) i.i.d. copies of \(O = (W, A, \Delta, \tilde{T})\) distributed according to \(P_0\). Here, \(\Delta\) is a censoring indicator and \(\tilde{T}=(1-A)\;\text{min}(T^{(0)}, C^{(0)})+A\;\text{min}(T^{(1)}, C^{(1)})\) is the last measurement time.
In this scenario, our variance important parameter is a function of the treatment effect on survival at time \(t_0\). Recall that survival at time \(t_0\) in the full-data is defined as \(S^{(A)}(t_0|W) = \mathbb{P}_{P_X}(T^{(A)} > t_0 | W)\). The target parameter is then represented by \(\Psi(P_X; t_0) = (\Psi_1(P_X; t_0), \ldots, \Psi_p(P_X; t_0))\) where \[ \Psi_j(P_X;t_0) = \frac{\mathbb{E}_{P_X}\left[\left(S^{(1)}(t_0|W)-S^{(0)}(t_0|W)\right)B_j\right]} {\mathbb{E}_{P_X}\left[B_j^2\right]} \] for centered biomarkers indexed by \(j = 1, \ldots, p\). Similarly to the binary outcome scenario, this variable importance measure assumes a linear relationship between the contrast of survival outcomes at time \(t_0\) and each biomarker, and consists of the simple linear regression coefficients of this contrast regressed on each biomarker. Again, while the true relationship between the difference in survival probabilities and biomarkers is unlikely to be linear, it provides a metric by which to evaluate biomarkers’ abilities to modify treatment effects on the absolute scale.
As with the other variable importance metrics, this parameter can
often be estimated from the observed data in a RCT. The estimator of
this parameter is implemented in the sunicate()
function.
Under the assumption that the censoring mechanism is consistently
estimated by our procedure, the sunicate()
estimator is
consistent and asymptotically linear. Valid statistical inference about
the strength of biomarkers’ treatment effect modification is therefore
made possible with survival outcomes.
We now simulate some RCT data with a right-censored time to event outcome. Again, we consider a low-dimensional biomarker vector for computational efficiency.
# simulate n trial participants
n <- 200
# generate a tibble of baseline characteristics
baseline_data <- tibble(
A = rbinom(n, 1, 0.5),
V1 = rnorm(n),
V2 = rnorm(n),
B1 = rnorm(n),
B2 = rnorm(n)
)
# simulate event and censoring times
# define hazard functions
cond_surv_hazard <- function(t, a, b1) {
(t < 9) / (1 + exp(- (-2 - 3 * a * b1))) +
(t == 9)
}
# generate the event indicators for t = 1 to 9
event_time <- sapply(
seq_len(n),
function(obs) {
event_time <- NA
for (t in 1:9) {
prob <- cond_surv_hazard(t, baseline_data$A[obs], baseline_data$B1[obs])
status <- rbinom(1, 1, prob)
if (status == 1) {
event_time <- t
break
}
}
return(event_time)
})
# generate the censoring events for t = 1 to 9
censor_time <- sapply(
seq_len(n),
function(obs) {
censor_time <- NA
for (t in 1:9) {
status <- rbinom(1, 1, 0.15)
if (status == 1) {
censor_time <- t
break
}
}
if (is.na(censor_time)) censor_time <- 10
return(censor_time)
})
# assemble the event and censoring times into a tibble
status_df <- tibble(
"event_time" = as.integer(event_time),
"censor_time" = as.integer(censor_time)
) %>%
transmute(
time = if_else(censor_time < event_time, censor_time, event_time),
censor = if_else(censor_time < event_time, 1, 0),
event = if_else(censor_time >= event_time, 1, 0)
)
# assemble the complete RCT data
# uniCATE functions can handle treatment indicators as factor variables, too
data <- baseline_data %>%
mutate(
A = if_else(A == 1, "treatment", "control"),
A = factor(A, levels = c("treatment", "control"))
) %>%
bind_cols(status_df)
# glimpse of data structure
head(data)
#> # A tibble: 6 × 8
#> A V1 V2 B1 B2 time censor event
#> <fct> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 treatment 0.786 0.509 0.735 0.950 9 0 1
#> 2 control -1.07 -1.69 -0.386 0.555 2 0 1
#> 3 control -0.0627 -2.07 1.43 -0.169 1 1 0
#> 4 control -0.979 0.732 0.475 0.967 1 0 1
#> 5 control 0.267 -0.0529 -0.171 -0.252 9 0 1
#> 6 control 0.212 -0.208 1.30 0.162 1 1 0
The data simulated in the previous chunk relies on an uninformative
censoring mechanism and a event mechanism that is a function of the
interaction between the treatment assignment \(A\) and the biomarker \(B1\). These event and censoring times occur
at one of 9 discrete relative follow-up times for all 500 study
participants. Take special notice of the censor
and
event
columns that describe the relative time measurement
in time
: the former is a binary indicator of censoring, and
the later of event. Finally, note that sunicate()
(and
unicate()
) accept treatment indicators that are factor
variables, like A
.
We now assess the predictive importance of biomarkers with respect to
survival at time \(t_0=6\) by setting
the time_cutoff
argument to 6
. Note when the
time_cutoff
argument is not specified, \(t_0\) defaults to the median relative time
in the data
argument.
sunicate(
data = data,
event = "event",
censor = "censor",
relative_time = "time",
treatment = "A",
covariates = c("V1", "V2", "B1", "B2"),
biomarkers = c("B1", "B2"),
time_cutoff = 6,
propensity_score_ls = list("treatment" = 0.5, "control" = 0.5)
)
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#> always returns an ungrouped data frame and adjust accordingly.
#> ℹ The deprecated feature was likely used in the uniCATE package.
#> Please report the issue to the authors.
#> # A tibble: 2 × 7
#> biomarker coef se z p_value p_value_bh p_value_holm
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 B1 0.343 0.0831 4.13 0.0000362 0.0000725 0.0000725
#> 2 B2 -0.0414 0.101 -0.411 0.681 0.681 0.681
sunicate()
produces a table identical to that generated
by unicate()
. It provides very strong evidence that
B1
is a treatment effect modifier with a non-zero variance
important parameter. B2
, on the other hand, does not appear
to be a predictive biomarker since we fail to reject its associated
test. These results capture the ground truth of our data-generating
process.
unicate()
and sunicate()
rely on sensible
default arguments to simplify their usage. However, advanced users may
be interested changing these defaults. In particular, parallelization
routines relying on the future
suite might be employed when analyzing large data sets by setting the
parallel
argument to TRUE
. The number of folds
used during the K-fold cross-validation routines might be changed
through the v_folds
argument. Finally, the semiparametric
estimators’ nuisance parameters are learned using pre-packaged sl3
Super Learners.
Custom Super Learners can be employed in their place.
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.0 uniCATE_0.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-162 matrixStats_0.63.0 fs_1.6.1
#> [4] lubridate_1.9.2 progress_1.2.2 rprojroot_2.0.3
#> [7] tools_4.2.2 backports_1.4.1 bslib_0.4.2
#> [10] utf8_1.2.3 R6_2.5.1 rpart_4.1.19
#> [13] colorspace_2.1-0 nnet_7.3-18 withr_2.5.0
#> [16] tidyselect_1.2.0 prettyunits_1.1.1 compiler_4.2.2
#> [19] glmnet_4.1-6 textshaping_0.3.6 cli_3.6.0
#> [22] desc_1.4.2 sass_0.4.5 scales_1.2.1
#> [25] checkmate_2.1.0 randomForest_4.7-1.1 pkgdown_2.0.7
#> [28] systemfonts_1.0.4 stringr_1.5.0 digest_0.6.31
#> [31] rmarkdown_2.20 R.utils_2.12.2 pkgconfig_2.0.3
#> [34] htmltools_0.5.4 parallelly_1.34.0 fastmap_1.1.0
#> [37] htmlwidgets_1.6.1 rlang_1.0.6 BBmisc_1.13
#> [40] shape_1.4.6 visNetwork_2.1.2 jquerylib_0.1.4
#> [43] generics_0.1.3 jsonlite_1.8.4 ModelMetrics_1.2.2.2
#> [46] R.oo_1.25.0 magrittr_2.0.3 delayed_0.4.0
#> [49] Matrix_1.5-3 Rcpp_1.0.10 munsell_0.5.0
#> [52] fansi_1.0.4 abind_1.4-5 lifecycle_1.0.3
#> [55] R.methodsS3_1.8.2 pROC_1.18.0 stringi_1.7.12
#> [58] yaml_2.3.7 MASS_7.3-58.2 plyr_1.8.8
#> [61] recipes_1.0.5 grid_4.2.2 parallel_4.2.2
#> [64] listenv_0.9.0 crayon_1.5.2 lattice_0.20-45
#> [67] splines_4.2.2 hms_1.1.2 knitr_1.42
#> [70] pillar_1.8.1 igraph_1.4.0 uuid_1.1-0
#> [73] origami_1.0.7 stats4_4.2.2 reshape2_1.4.4
#> [76] future.apply_1.10.0 codetools_0.2-19 glue_1.6.2
#> [79] evaluate_0.20 data.table_1.14.8 vctrs_0.5.2
#> [82] Rdpack_2.4 foreach_1.5.2 gtable_0.3.1
#> [85] purrr_1.0.1 rstackdeque_1.1.1 future_1.31.0
#> [88] assertthat_0.2.1 cachem_1.0.6 ggplot2_3.4.1
#> [91] xfun_0.37 sl3_1.4.4 gower_1.0.1
#> [94] prodlim_2019.11.13 rbibutils_2.2.13 ragg_1.2.5
#> [97] class_7.3-21 survival_3.5-3 timeDate_4022.108
#> [100] tibble_3.1.8 iterators_1.0.14 memoise_2.0.1
#> [103] hardhat_1.2.0 lava_1.7.1 timechange_0.2.0
#> [106] globals_0.16.2 imputeMissings_0.0.3 ellipsis_0.3.2
#> [109] caret_6.0-93 ROCR_1.0-11 ipred_0.9-13