In this vignette we are going to briefly describe the high level structure of the functions used (the main ones and the helper ones) in this package.
Helper functions
Generate_stacked_data
The generate_stacked_data
function is used to generate a
data frame with the data stacked by the subgroups.
We gather the columns of the different variables and we create two columns: one with the subgroup variable and the other one with the value of the subgroup variable. Doing this we are going to increase the number of rows of the data frame (the new number of rows is going to be the initial one times the number of subgroup variables). Finally, we sort the rows of the new data frame by the subgroup and we output this sorted data frame.
Preprocess
The preprocess
function is used to obtain from the
original data frame 3 different design matrices and a vector with the
name of the subgroups that are going to be studied.
The first of the matrices is design_ia
which is the
design matrix of a model just with the interactions between treatment
and the subgrouping covariates.
Another matrix is design_main
which is the design matrix
of a model with the main effects of treatment and the subgroup
variables.
The last matrix is design_dummy
which is the matrix of a
model just with the subgrouping covariates and using one-hot
encoding.
To obtain all these matrices we just obtain the design matrices of some specified models (for example just considering one of the covariates) and when needed we bind them to obtain the three desired matrices.
Finally we also obtain the vector subgr_names
with the
exact name of the different subgroups that we are going to study.
lor_estimation
The lor_estimation
function is used to obtain the
log-odds ratio of a subgroup in the case of binary data.
The idea is to create two matrices with the same structure as the dummy matrix that defines the subgroup data but in one of them considering that the treatment is present in all observations and in the other one considering that the treatment is not present in any observation.
Then we obtain for each one of this matrices (considering the estimated coefficients from the fitted model and the general idea of logistic regression) the probability of having an event averaging over all individuals. With this probability obtained for the case of just having treatment and the case of just having control, we can obtain the log odds ratio (which is the output of this function).
surv_prob
The surv_prob
function is used to obtain an average
survival curve considering the curves from each one of the
individuals.
The idea is that we have two matrices where one (h
) has
the values of the cumulative baseline hazard at each one of the event
times and the other one (k
) has the values -exponential of
the product between the estimated coefficients and the values of the
covariates of an individual. So we are going to have two matrices one
with as many rows as event times and the other one with as many rows as
individuals. If we want to obtain the survival curve for one of this
individuals we would just have to multiply the correspondent element of
k times the matrix h and take exponential of that value. We can repeat
this for every individual and then we can take an average of all these
values to obtain the average survival curve.
Survival_curves
The survival_curves
function is used to obtain the
average survival curve given the design matrix, the cumulative baseline
hazard and the estimated coefficients.
In this function we obtain the matrix K2
(using the
design matrix and the estimated coefficients of the fitted model). Then
we apply the surv_prob
function to this K2
and
the cumulative baseline hazard h0
in order to obtain the
average survival curve.
ahr_estimation
The ahr_estimation
function is used to estimate the
average hazard ratio of a subgroup.
We first build two matrices: one where all the individuals are in the control group and another one where all the individuals are in the treatment group. Given each one of these matrices we obtain the average survival probabilities associated to them.
Having these survival probabilities we can consider a value of
gamma
that defines a specific weight function to estimate
the average hazard ratio. For the value of gamma
equal to 1
we have that the average hazard ratio is the same as the odds of
concordance.
subgroups
The subgroups
function is used to estimate the treatment
effect of each one of the subgroups considered.
We find the design and the dummy matrix associated with each one of
the subgroups and given these matrices we call
ahr_estimation
(in the case of survival data) to obtain the
average hazard ratio of the subgroup or we call
lor_estimation
(in the case of binary data) to obtain the
log odds ratio of the subgroup.
trt_horseshoe
The trt_horseshoe
function is used to estimate the
posterior distribution of the treatment effects of the subgroups
considered.
We find the number of iteration that we need to consider in order to
obtain all the samples from the posterior distribution of the
coefficients. In the case of binary data we can directly call
subgroups
once that we know the matrix of the estimated
coefficients. In the case of survival data we also need to obtain the
estimated cumulative baseline hazard h0
. In this case, the
model fits a smoothing spline to obtain it. The posterior distribution
of the coefficients of this spline can be found in the model information
under the name of sbhaz
. We then obtain the basis matrix of
the spline considering the same knots and boundaries as in the model
specification and we obtain the values of the cumulative baseline hazard
at a sequence of equally spaced time points. Finally we call
subgroups
.
compare
The compare
function is used to obtain a
data.frame
with the treatment effects of the subgroups that
we want to compare.
The arguments of this function are objects of class
bonsaiforest
. We take the summary of these objects and we
save in a data frame the subgroup treatment effect estimates. If we have
a naivepop
object then we also save the overall treatment
effect. Finally we also add the response type of the data and we export
a list with all these elements.
Main functions
naivepop
The naivepop
function is used to estimate the subgroup
treatment effects with the naive overall population treatment
effect.
In this function we make a distinction depending on the type of data that we have. In the case of survival data we fit a Cox regression model (using coxph) with treatment as explanatory variable. Instead, if we have binary data, we fit a logistic regression model (using glm with the binomial family) with treatment as explanatory variable. This fitted model is going to be included as part of the output of our function.
naive
The naive
function is used to estimate the subgroup
treatment effects with the naive treatment effects per subgroup.
In this function we make a distinction between binary and survival
data (the main difference between them is that in binary we are going to
use glm models with the binomial family and in survival we are going to
use coxph models). With generate_stacked_data
we obtain the
stacked by subgroups data frame and we fit a model (depending on the
type of response the glm or the coxph) on each one of the data sets
corresponding to a specific subgroup. These fitted models are going to
be part of the output as well as a data set with the components of the
summary of those models that are going to be useful to obtain the
treatment effect.
elastic_net
The elastic_net
function is used to estimate the
subgroup treatment effects using penalization on the interaction terms
between treatment and the subgroup covariates.
We are going to use the function cv.glmnet to fit the model. We are
going to consider the design matrix (obtained after applying to the data
the function preprocess
) with the main effects and the
interaction effects and we are going to create a penalty_factor vector
with entries 0 if the coefficients are not going to be penalized (as for
the main effects) and 1 if they are going to be penalized (as for the
interaction effects). Depending on the kind of data that we have we are
going to consider the cox family (for survival) or the binomial family
(for binary). The value of alpha is the elastic net mixing parameter and
can take values between 0 and 1. If we put it to 1 we are going to apply
lasso penalty on the interaction effects and if we put it to 0 we are
going to apply ridge penalty on the interaction effects.
horseshoe
The horseshoe
function is used to estimate the subgroup
treatment effects using a bayesian model with a horseshoe prior on the
interaction terms between treatment and the subgroup covariates.
We are going to apply the function preprocess
to our
data in order to obtain the design matrix needed to fit the model. Then
we use the brm function to fit our model and we are going to consider a
normal N(0,5) prior on the main effects and a regularized horseshoe
prior on the interaction effects. Depending on the kind of data we are
going to choose the cox family (for survival data) or the bernoulli
family (for binary data). In the case of survival data before fitting
the model we specify what is the shape (the boundary and the internal
knots) of the spline that is going to fit the baseline hazard.
Summary functions
With the summary
function we are going to obtain the
estimated treatment effect of all the models fitted on the data.
Depending on the class of the fitted object we are going to apply
different functions:
summary.naivepop
The summary.naivepop
function obtains the overall
treatment effect of the data given an object of class
naivepop
.
If we have survival data the estimated overall hazard ratio is going to be just the exponential of the coefficient of the cox model fitted on the data. If we have binary data then the estimated overall log-odds ratio is going to be the coefficient of the fitted model associated with the treatment.
summary.naive
The summary.naive
function obtains the subgroup
treatment effects and their confidence intervals given an object of
class naive
.
In a naive
object we have an element which already has
the information of the estimated treatment effects (obtained from each
one of the models fitted on the subgroup data). To obtain the boundaries
of the confidence interval we just have to consider the standard error
of these estimated treatment effects and the needed normal quantile. In
the case of survival data we are going to use the exponential function
in order to show hazard ratios.
summary.elastic_net
The summary.elastic_net
function obtains the subgroup
treatment effects given an object of class elastic_net
.
We first have to obtain the estimated coefficients of the fitted
model given a value of lambda
(the default one is minimizes
the cross validation error). After that, if our data is binary we just
have to call subgroups
with the information in the
elastic_net
object. If we have survival data then we first
have to obtain the cumulative baseline hazard. To do so we first sort
the response time, we predict the values of the regression model in the
log hazard scale and then we compute the Breslow estimator of the
cumulative baseline hazard. In the end we are just going to consider the
values of this cumulative baseline hazard in the event time points (the
time points where an event happened) that are smaller than a value
L
(by default it is just going to be the maximum of the
event time points). Once that we have the cumulative baseline hazard
h0
we can call the function subgroups
to
obtain the treatment estimates. We can also specify the parameter
gamma
(it defines the weights to obtain the average hazard
ratio). Default is 1 (in this case the average hazard ratio obtained can
be interpreted as the odds of concordance).
summary.horseshoe
The summary.horseshoe
function obtains the estimated
posterior distribution of the treatment effects and summarizes its
information establishing the median of this distribution as the
estimated treatment effect and using the quantiles to define a credible
interval.
We first have to call trt_horseshoe
to obtain the
samples from the estimated posterior distribution of the treatment
effects and then for each one of the subgroups we obtain the median and
the correspondent quantiles of the samples to obtain the estimated
subgroup treatment effects and the bounds of the credible intervals.
Print functions
With the print
function we are going to show the
important section of the summary of a fitted object.
print.summary.naivepop
The print.summary.naivepop
function is used to print the
important information of the summary of a naivepop
object.
It prints the overall treatment effect estimate.
print.summary.naive
The print.summary.naive
function is used to print the
important information of the summary of a naive
object.
It prints the subgroups, the subgroup treatment effect estimates and the bounds of the credible intervals for them.
Plot functions
With the plot
function we are going to obtain the forest
plot of the summary of a fitted object.
plot.summary.naive
The plot.summary.naive
function is used to plot the
forest plot of a naive
object.
plot.summary.elastic_net
The plot.summary.elastic_net
function is used to plot
the forest plot of a elastic_net
object.
plot.summary.horseshoe
The plot.summary.horseshoe
function is used to plot the
forest plot of a horseshoe
object.
compareplot
The compareplot
function is used to create a forest plot
with all the methods fitted in the data to obtain the subgroup treatment
effects.
The arguments of the function have to be summaries of
bonsaiforest
objects, i.e, objects of class
summary.naivepop
, summary.naive
,
summary.elastic_net
or summary.horseshoe
.
These objects have the treatment effect estimates and the bounds of
the credible or confidence intervals (when it applies). We plot them
with the shape of a forest plot and if the summary.naivepop
object is included in the arguments we add a vertical line indicating
the overall treatment effect.