high_level_structure • bonsaiforest

library(bonsaiforest)

In this vignette we are going to briefly describe the high level structure of the functions used (the main ones and the helper ones) in this package.

Helper functions

Generate_stacked_data

The generate_stacked_data function is used to generate a data frame with the data stacked by the subgroups.

We gather the columns of the different variables and we create two columns: one with the subgroup variable and the other one with the value of the subgroup variable. Doing this we are going to increase the number of rows of the data frame (the new number of rows is going to be the initial one times the number of subgroup variables). Finally, we sort the rows of the new data frame by the subgroup and we output this sorted data frame.

Preprocess

The preprocess function is used to obtain from the original data frame 3 different design matrices and a vector with the name of the subgroups that are going to be studied.

The first of the matrices is design_ia which is the design matrix of a model just with the interactions between treatment and the subgrouping covariates.

Another matrix is design_main which is the design matrix of a model with the main effects of treatment and the subgroup variables.

The last matrix is design_dummy which is the matrix of a model just with the subgrouping covariates and using one-hot encoding.

To obtain all these matrices we just obtain the design matrices of some specified models (for example just considering one of the covariates) and when needed we bind them to obtain the three desired matrices.

Finally we also obtain the vector subgr_names with the exact name of the different subgroups that we are going to study.

lor_estimation

The lor_estimation function is used to obtain the log-odds ratio of a subgroup in the case of binary data.

The idea is to create two matrices with the same structure as the dummy matrix that defines the subgroup data but in one of them considering that the treatment is present in all observations and in the other one considering that the treatment is not present in any observation.

Then we obtain for each one of this matrices (considering the estimated coefficients from the fitted model and the general idea of logistic regression) the probability of having an event averaging over all individuals. With this probability obtained for the case of just having treatment and the case of just having control, we can obtain the log odds ratio (which is the output of this function).

surv_prob

The surv_prob function is used to obtain an average survival curve considering the curves from each one of the individuals.

The idea is that we have two matrices where one (h) has the values of the cumulative baseline hazard at each one of the event times and the other one (k) has the values -exponential of the product between the estimated coefficients and the values of the covariates of an individual. So we are going to have two matrices one with as many rows as event times and the other one with as many rows as individuals. If we want to obtain the survival curve for one of this individuals we would just have to multiply the correspondent element of k times the matrix h and take exponential of that value. We can repeat this for every individual and then we can take an average of all these values to obtain the average survival curve.

Survival_curves

The survival_curves function is used to obtain the average survival curve given the design matrix, the cumulative baseline hazard and the estimated coefficients.

In this function we obtain the matrix K2 (using the design matrix and the estimated coefficients of the fitted model). Then we apply the surv_prob function to this K2 and the cumulative baseline hazard h0 in order to obtain the average survival curve.

ahr_estimation

The ahr_estimation function is used to estimate the average hazard ratio of a subgroup.

We first build two matrices: one where all the individuals are in the control group and another one where all the individuals are in the treatment group. Given each one of these matrices we obtain the average survival probabilities associated to them.

Having these survival probabilities we can consider a value of gamma that defines a specific weight function to estimate the average hazard ratio. For the value of gamma equal to 1 we have that the average hazard ratio is the same as the odds of concordance.

subgroups

The subgroups function is used to estimate the treatment effect of each one of the subgroups considered.

We find the design and the dummy matrix associated with each one of the subgroups and given these matrices we call ahr_estimation (in the case of survival data) to obtain the average hazard ratio of the subgroup or we call lor_estimation (in the case of binary data) to obtain the log odds ratio of the subgroup.

trt_horseshoe

The trt_horseshoe function is used to estimate the posterior distribution of the treatment effects of the subgroups considered.

We find the number of iteration that we need to consider in order to obtain all the samples from the posterior distribution of the coefficients. In the case of binary data we can directly call subgroups once that we know the matrix of the estimated coefficients. In the case of survival data we also need to obtain the estimated cumulative baseline hazard h0. In this case, the model fits a smoothing spline to obtain it. The posterior distribution of the coefficients of this spline can be found in the model information under the name of sbhaz. We then obtain the basis matrix of the spline considering the same knots and boundaries as in the model specification and we obtain the values of the cumulative baseline hazard at a sequence of equally spaced time points. Finally we call subgroups.

compare

The compare function is used to obtain a data.frame with the treatment effects of the subgroups that we want to compare.

The arguments of this function are objects of class bonsaiforest. We take the summary of these objects and we save in a data frame the subgroup treatment effect estimates. If we have a naivepop object then we also save the overall treatment effect. Finally we also add the response type of the data and we export a list with all these elements.

Main functions

naivepop

The naivepop function is used to estimate the subgroup treatment effects with the naive overall population treatment effect.

In this function we make a distinction depending on the type of data that we have. In the case of survival data we fit a Cox regression model (using coxph) with treatment as explanatory variable. Instead, if we have binary data, we fit a logistic regression model (using glm with the binomial family) with treatment as explanatory variable. This fitted model is going to be included as part of the output of our function.

naive

The naivefunction is used to estimate the subgroup treatment effects with the naive treatment effects per subgroup.

In this function we make a distinction between binary and survival data (the main difference between them is that in binary we are going to use glm models with the binomial family and in survival we are going to use coxph models). With generate_stacked_data we obtain the stacked by subgroups data frame and we fit a model (depending on the type of response the glm or the coxph) on each one of the data sets corresponding to a specific subgroup. These fitted models are going to be part of the output as well as a data set with the components of the summary of those models that are going to be useful to obtain the treatment effect.

elastic_net

The elastic_net function is used to estimate the subgroup treatment effects using penalization on the interaction terms between treatment and the subgroup covariates.

We are going to use the function cv.glmnet to fit the model. We are going to consider the design matrix (obtained after applying to the data the function preprocess) with the main effects and the interaction effects and we are going to create a penalty_factor vector with entries 0 if the coefficients are not going to be penalized (as for the main effects) and 1 if they are going to be penalized (as for the interaction effects). Depending on the kind of data that we have we are going to consider the cox family (for survival) or the binomial family (for binary). The value of alpha is the elastic net mixing parameter and can take values between 0 and 1. If we put it to 1 we are going to apply lasso penalty on the interaction effects and if we put it to 0 we are going to apply ridge penalty on the interaction effects.

horseshoe

The horseshoefunction is used to estimate the subgroup treatment effects using a bayesian model with a horseshoe prior on the interaction terms between treatment and the subgroup covariates.

We are going to apply the function preprocess to our data in order to obtain the design matrix needed to fit the model. Then we use the brm function to fit our model and we are going to consider a normal N(0,5) prior on the main effects and a regularized horseshoe prior on the interaction effects. Depending on the kind of data we are going to choose the cox family (for survival data) or the bernoulli family (for binary data). In the case of survival data before fitting the model we specify what is the shape (the boundary and the internal knots) of the spline that is going to fit the baseline hazard.

Summary functions

With the summary function we are going to obtain the estimated treatment effect of all the models fitted on the data. Depending on the class of the fitted object we are going to apply different functions:

summary.naivepop

The summary.naivepop function obtains the overall treatment effect of the data given an object of class naivepop.

If we have survival data the estimated overall hazard ratio is going to be just the exponential of the coefficient of the cox model fitted on the data. If we have binary data then the estimated overall log-odds ratio is going to be the coefficient of the fitted model associated with the treatment.

summary.naive

The summary.naive function obtains the subgroup treatment effects and their confidence intervals given an object of class naive.

In a naive object we have an element which already has the information of the estimated treatment effects (obtained from each one of the models fitted on the subgroup data). To obtain the boundaries of the confidence interval we just have to consider the standard error of these estimated treatment effects and the needed normal quantile. In the case of survival data we are going to use the exponential function in order to show hazard ratios.

summary.elastic_net

The summary.elastic_net function obtains the subgroup treatment effects given an object of class elastic_net.

We first have to obtain the estimated coefficients of the fitted model given a value of lambda (the default one is minimizes the cross validation error). After that, if our data is binary we just have to call subgroups with the information in the elastic_net object. If we have survival data then we first have to obtain the cumulative baseline hazard. To do so we first sort the response time, we predict the values of the regression model in the log hazard scale and then we compute the Breslow estimator of the cumulative baseline hazard. In the end we are just going to consider the values of this cumulative baseline hazard in the event time points (the time points where an event happened) that are smaller than a value L (by default it is just going to be the maximum of the event time points). Once that we have the cumulative baseline hazard h0 we can call the function subgroups to obtain the treatment estimates. We can also specify the parameter gamma (it defines the weights to obtain the average hazard ratio). Default is 1 (in this case the average hazard ratio obtained can be interpreted as the odds of concordance).

summary.horseshoe

The summary.horseshoe function obtains the estimated posterior distribution of the treatment effects and summarizes its information establishing the median of this distribution as the estimated treatment effect and using the quantiles to define a credible interval.

We first have to call trt_horseshoe to obtain the samples from the estimated posterior distribution of the treatment effects and then for each one of the subgroups we obtain the median and the correspondent quantiles of the samples to obtain the estimated subgroup treatment effects and the bounds of the credible intervals.

Print functions

With the print function we are going to show the important section of the summary of a fitted object.

print.summary.naivepop

The print.summary.naivepop function is used to print the important information of the summary of a naivepop object.

It prints the overall treatment effect estimate.

print.summary.naive

The print.summary.naive function is used to print the important information of the summary of a naive object.

It prints the subgroups, the subgroup treatment effect estimates and the bounds of the credible intervals for them.

print.summary.elastic_net

The print.summary.elstic_net function is used to print the important information of the summary of a elastic_net object.

It prints the subgroups and the subgroup treatment effect estimates.

print.summary.horseshoe

The print.summary.horseshoe function is used to print the important information of the summary of a horseshoe object.

It prints the subgroups, the subgroup treatment effect estimates and the bounds of the credible intervals for them.

Plot functions

With the plot function we are going to obtain the forest plot of the summary of a fitted object.

plot.summary.naive

The plot.summary.naive function is used to plot the forest plot of a naive object.

plot.summary.elastic_net

The plot.summary.elastic_net function is used to plot the forest plot of a elastic_net object.

plot.summary.horseshoe

The plot.summary.horseshoe function is used to plot the forest plot of a horseshoe object.

compareplot

The compareplot function is used to create a forest plot with all the methods fitted in the data to obtain the subgroup treatment effects.

The arguments of the function have to be summaries of bonsaiforest objects, i.e, objects of class summary.naivepop, summary.naive, summary.elastic_net or summary.horseshoe.

These objects have the treatment effect estimates and the bounds of the credible or confidence intervals (when it applies). We plot them with the shape of a forest plot and if the summary.naivepop object is included in the arguments we add a vertical line indicating the overall treatment effect.