Fit the base imputation model and get parameter estimates

draws() fits the base imputation model to the observed outcome data according to the given multiple imputation methodology. Depending on the user's method specification, it returns either draws from the posterior distribution of the model parameters, as required for Bayesian multiple imputation, or frequentist parameter estimates from the original data and from bootstrapped or leave-one-out datasets, as required for conditional mean imputation.

The purpose of the imputation model is to estimate model parameters in the absence of intercurrent events (ICEs) handled using reference-based imputation methods. For this reason, any outcome data observed after ICEs for which reference-based imputation methods are specified are removed and treated as missing when estimating the imputation model, and for this purpose only.

The imputation model is a mixed model for repeated measures (MMRM) that is valid under a missing-at-random (MAR) assumption. It can be fit using maximum likelihood (ML) or restricted ML (REML) estimation, a Bayesian approach, or an approximate Bayesian approach, according to the user's method specification. The ML/REML approaches and the approximate Bayesian approach support several covariance structures, while the Bayesian approach based on MCMC sampling supports only an unstructured covariance structure. In all cases, the covariance matrix can be assumed to be the same across groups or different for each group.

Usage

draws(data, data_ice = NULL, vars, method, ncores = 1, quiet = FALSE)

# S3 method for class 'approxbayes'
draws(data, data_ice = NULL, vars, method, ncores = 1, quiet = FALSE)

# S3 method for class 'condmean'
draws(data, data_ice = NULL, vars, method, ncores = 1, quiet = FALSE)

# S3 method for class 'bmlmi'
draws(data, data_ice = NULL, vars, method, ncores = 1, quiet = FALSE)

# S3 method for class 'bayes'
draws(data, data_ice = NULL, vars, method, ncores = 1, quiet = FALSE)

Arguments

data: A data.frame containing the data to be used in the model. See details.
data_ice: A data.frame that specifies the information related to the ICEs and the imputation strategies. See details.
vars: A vars object as generated by set_vars(). See details.
method: A method object as generated by either method_bayes(), method_approxbayes(), method_condmean() or method_bmlmi(). It specifies the multiple imputation methodology to be used. See details.
ncores: A single numeric specifying the number of cores to use in creating the draws object (Default = 1). Note that this parameter is ignored for method_bayes(). Can also be a cluster object generated by make_rbmi_cluster().
quiet: Logical. If TRUE, suppresses the progress information that is otherwise printed to the console.
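As a minimal sketch of how these arguments fit together, the following builds a small simulated dataset and calls draws() with jackknife-based conditional mean imputation. The dataset, its column names (id, visit, group, outcome) and the covariate specification are assumptions made up for illustration, and the rbmi calls are guarded so that the block still runs when the package is not installed.

```r
# Toy data: 40 subjects, 3 visits, 2 groups (all names here are illustrative).
set.seed(101)
n_subj <- 40
dat <- data.frame(
  id    = factor(rep(sprintf("S%02d", seq_len(n_subj)), each = 3)),
  visit = factor(rep(c("v1", "v2", "v3"), times = n_subj)),
  group = factor(rep(rep(c("ctl", "trt"), each = 3), times = n_subj / 2))
)
# A subject-level effect induces correlation between visits of the same subject.
dat$outcome <- rep(rnorm(n_subj), each = 3) +
  0.5 * as.integer(dat$visit) + rnorm(nrow(dat))
# A few missing outcomes at the last visit; the rows themselves stay in place.
dat$outcome[dat$visit == "v3" & dat$id %in% c("S03", "S10", "S17")] <- NA

if (requireNamespace("rbmi", quietly = TRUE)) {
  vars <- rbmi::set_vars(
    subjid = "id", visit = "visit", group = "group",
    outcome = "outcome", covariates = "group*visit"
  )
  method <- rbmi::method_condmean(type = "jackknife")
  # data_ice = NULL: all data are treated as pre-ICE (MAR imputation only).
  fit <- rbmi::draws(
    data = dat, data_ice = NULL, vars = vars,
    method = method, quiet = TRUE
  )
  str(fit$samples[[1]]$beta)  # regression coefficients of the first fit
  fit$n_failures              # number of failed model fits
}
```

The elements inspected at the end (samples, beta, n_failures) are those described in the Value section.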

Value

A draws object which is a named list containing the following:

data: R6 longdata object containing all relevant input data information.
method: A method object as generated by either method_bayes(), method_approxbayes(), method_condmean() or method_bmlmi().
samples: list containing the estimated parameters of interest. Each element of samples is a named list containing the following:
- ids: vector of characters containing the ids of the subjects included in the original dataset.
- beta: numeric vector of estimated regression coefficients.
- sigma: list of estimated covariance matrices (one for each level of vars$group).
- theta: numeric vector of transformed covariances.
- failed: Logical. TRUE if the model fit failed.
- ids_samp: vector of characters containing the ids of the subjects included in the given sample.
fit: if method_bayes() is chosen, returns the MCMC Stan fit object. Otherwise NULL.
n_failures: absolute number of failures of the model fit. Relevant only for method_condmean(type = "bootstrap"), method_approxbayes() and method_bmlmi().
formula: fixed effects formula object used for the model specification.

Details

draws() performs the first step of the multiple imputation (MI) procedure: fitting the base imputation model. The goal is to estimate the parameters of interest needed for the imputation phase (i.e. the regression coefficients and the covariance matrices from an MMRM).

The function distinguishes between the following methods:

Bayesian MI based on MCMC sampling: draws returns the draws from the posterior distribution of the parameters using a Bayesian approach based on MCMC sampling. This method can be specified by using method = method_bayes().
Approximate Bayesian MI based on bootstrapping: draws returns the draws from the posterior distribution of the parameters using an approximate Bayesian approach, where the sampling from the posterior distribution is simulated by fitting the MMRM model on bootstrap samples of the original dataset. This method can be specified by using method = method_approxbayes().
Conditional mean imputation with bootstrap re-sampling: draws returns the MMRM parameter estimates from the original dataset and from n_samples bootstrap samples. This method can be specified by using method = method_condmean() with argument type = "bootstrap".
Conditional mean imputation with jackknife re-sampling: draws returns the MMRM parameter estimates from the original dataset and from each leave-one-subject-out sample. This method can be specified by using method = method_condmean() with argument type = "jackknife".
Bootstrapped Maximum Likelihood MI: draws returns the MMRM parameter estimates from a given number of bootstrap samples needed to perform random imputations of the bootstrapped samples. This method can be specified by using method = method_bmlmi().
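The method specifications listed above could be constructed roughly as follows. This is only a sketch: the argument values are arbitrary, the name method_specs is made up for illustration, and constructor signatures may differ between rbmi versions, so the installed package's help pages remain the authority. The calls are guarded so the block still runs without rbmi.

```r
method_specs <- list()

if (requireNamespace("rbmi", quietly = TRUE)) {
  method_specs <- list(
    bayes         = rbmi::method_bayes(n_samples = 150),
    approxbayes   = rbmi::method_approxbayes(n_samples = 150),
    condmean_boot = rbmi::method_condmean(type = "bootstrap", n_samples = 150),
    condmean_jack = rbmi::method_condmean(type = "jackknife"),
    bmlmi         = rbmi::method_bmlmi(B = 20, D = 2)
  )
  # The class of each object determines which draws() method is dispatched.
  print(lapply(method_specs, class))
}
```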

Bayesian MI based on MCMC sampling was proposed by Carpenter, Roger, and Kenward (2013), who first introduced reference-based imputation methods. Approximate Bayesian MI is discussed in Little and Rubin (2002). Conditional mean imputation methods are discussed in Wolbers et al (2022). Bootstrapped Maximum Likelihood MI is described in Von Hippel and Bartlett (2021).

The argument data contains the longitudinal data. It must have at least the following variables:

subjid: a factor vector containing the subject ids.
visit: a factor vector containing the visit the outcome was observed on.
group: a factor vector containing the group that the subject belongs to.
outcome: a numeric vector containing the outcome variable. It might contain missing values.

Additional baseline or time-varying covariates must be included in data.

data must have one row per visit per subject. This means that incomplete outcome data must be set to NA rather than having the corresponding row omitted. Missing values in the covariates are not allowed. If data is incomplete, the expand_locf() helper function can be used to insert any missing rows, using Last Observation Carried Forward (LOCF) imputation to fill in the covariate values. Note that LOCF is generally not a principled imputation method and should only be used when appropriate for the specific covariate.
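To illustrate the one-row-per-visit-per-subject requirement, here is a base R sketch that inserts a missing row with an NA outcome. The data are invented for illustration, and the sketch covers only the outcome column: unlike expand_locf(), it would leave any covariates in the inserted rows as NA.

```r
# Hypothetical long-format data where subject "B" has no row for visit "v2".
dat <- data.frame(
  id      = c("A", "A", "A", "B", "B"),
  visit   = c("v1", "v2", "v3", "v1", "v3"),
  outcome = c(1.2, 1.9, 2.4, 0.8, 1.5),
  stringsAsFactors = FALSE
)

# Every subject-by-visit combination that should be present.
grid <- expand.grid(
  id = unique(dat$id), visit = unique(dat$visit),
  stringsAsFactors = FALSE
)

# A left join onto the grid creates the missing row with outcome = NA.
full <- merge(grid, dat, by = c("id", "visit"), all.x = TRUE)
full <- full[order(full$id, full$visit), ]
rownames(full) <- NULL

# draws() expects factors for the subject and visit columns.
full$id    <- factor(full$id)
full$visit <- factor(full$visit, levels = c("v1", "v2", "v3"))
full
```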

Please note that there is no special provisioning for the baseline outcome values. If you do not want baseline observations to be included in the model as part of the response variable then these should be removed in advance from the outcome variable in data. At the same time if you want to include the baseline outcome as covariate in the model, then this should be included as a separate column of data (as any other covariate).

Character covariates will be explicitly cast to factors. If you use a custom analysis function that requires specific reference levels for the character covariates (for example in the computation of the least square means), then you are advised to manually cast your character covariates to factors before running draws().
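A short base R sketch of setting the reference level in advance; the covariate name sex and its values are hypothetical. With base R's default conversion the levels are sorted, so the first level in sort order becomes the reference unless the levels are given explicitly.

```r
dat <- data.frame(sex = c("M", "F", "F", "M"), stringsAsFactors = FALSE)

# Default conversion sorts the levels, so "F" would be the reference level.
default_levels <- levels(factor(dat$sex))

# Explicitly make "M" the reference level before passing the data on.
dat$sex <- factor(dat$sex, levels = c("M", "F"))
levels(dat$sex)
```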

The argument data_ice contains information about the occurrence of ICEs. It is a data.frame with 3 columns:

Subject ID: a character vector containing the ids of the subjects that experienced the ICE. This column must be named as specified in vars$subjid.
Visit: a character vector containing the first visit after the occurrence of the ICE (i.e. the first visit affected by the ICE). The visits must be equal to one of the levels of data[[vars$visit]]. If multiple ICEs happen for the same subject, then only the first non-MAR visit should be used. This column must be named as specified in vars$visit.
Strategy: a character vector specifying the imputation strategy to address the ICE for this subject. This column must be named as specified in vars$strategy. Possible imputation strategies are:
- "MAR": Missing At Random.
- "CIR": Copy Increments in Reference.
- "CR": Copy Reference.
- "JR": Jump to Reference.
- "LMCF": Last Mean Carried Forward.

For explanations of these imputation strategies, see Carpenter, Roger, and Kenward (2013), Cro et al (2020), and Wolbers et al (2022). Please note that user-defined imputation strategies can also be set.

The data_ice argument is necessary at this stage because, as explained in Wolbers et al (2022), the model is fitted after removing the observations that are incompatible with the imputation model: any data observed on or after data_ice[[vars$visit]] for which an imputation strategy other than MAR is specified are excluded from the model fit. However, such observations are not discarded from the data in the imputation phase (performed with the function impute()). To summarize, at this stage only pre-ICE data and post-ICE data following ICEs for which MAR imputation is specified are used.
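The exclusion rule can be illustrated with a base R sketch. This is not rbmi's internal implementation; the data, the column names and the helper column outcome_fit are assumptions for illustration only.

```r
# Three subjects, three visits; subject "C" has missing data but no ICE record.
dat <- data.frame(
  id      = factor(rep(c("A", "B", "C"), each = 3)),
  visit   = factor(rep(c("v1", "v2", "v3"), times = 3)),
  outcome = c(1.0, 1.4, 2.0, 0.7, 1.1, 1.6, 0.9, NA, NA)
)
dat_ice <- data.frame(
  id       = c("A", "B"),
  visit    = c("v2", "v3"),
  strategy = c("JR", "MAR"),
  stringsAsFactors = FALSE
)

# Attach each subject's ICE visit and strategy (NA where no record exists).
m <- merge(dat, dat_ice, by = "id", all.x = TRUE, suffixes = c("", "_ice"))
m <- m[order(m$id, m$visit), ]

# Flag observations on or after the ICE visit that have a non-MAR strategy.
ice_pos     <- match(m$visit_ice, levels(dat$visit))
on_or_after <- !is.na(ice_pos) & as.integer(m$visit) >= ice_pos
excluded    <- on_or_after & m$strategy != "MAR"

# Outcome as it would enter the model fit: excluded values are treated as NA.
m$outcome_fit <- ifelse(excluded, NA, m$outcome)
m[, c("id", "visit", "outcome", "strategy", "outcome_fit")]
```

In this toy example, subject A's observed values at v2 and v3 are set aside for the model fit because of the "JR" strategy, while subject B's data are kept in full because the strategy is "MAR".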

If the data_ice argument is omitted, or if a subject doesn't have a record within data_ice, then it is assumed that all of the relevant subject's data is pre-ICE and as such all missing visits will be imputed under the MAR assumption and all observed data will be used to fit the base imputation model. Please note that the ICE visit cannot be updated via the update_strategy argument in impute(); this means that subjects who didn't have a record in data_ice will always have their missing data imputed under the MAR assumption even if their strategy is updated.

The vars argument is a named list that specifies the names of key variables within data and data_ice. This list is created by set_vars() and contains the following named elements:

subjid: name of the column in data and data_ice which contains the subject ids variable.
visit: name of the column in data and data_ice which contains the visit variable.
group: name of the column in data which contains the group variable.
outcome: name of the column in data which contains the outcome variable.
covariates: vector of characters which contains the covariates to be included in the model (including interactions which are specified as "covariateName1*covariateName2"). If no covariates are provided the default model specification of outcome ~ 1 + visit + group will be used. Please note that the group*visit interaction is not included in the model by default.
strata: covariates used as stratification variables in the bootstrap sampling. By default only the vars$group is set as stratification variable. Needed only for method_condmean(type = "bootstrap") and method_approxbayes().
strategy: name of the column in data_ice which contains the subject-specific imputation strategy.

In our experience, Bayesian MI (method = method_bayes()) with a relatively low number of samples (e.g. n_samples below 100) frequently triggers Stan warnings about R-hat, such as "The largest R-hat is X.XX, indicating chains have not mixed". In many instances this warning may be spurious, i.e. standard diagnostic analyses of the MCMC samples do not indicate any issues and the results look reasonable. Increasing the number of samples, e.g. to above 150, usually removes the warning.

References

James R Carpenter, James H Roger, and Michael G Kenward. Analysis of longitudinal trials with protocol deviation: a framework for relevant, accessible assumptions, and inference via multiple imputation. Journal of Biopharmaceutical Statistics, 23(6):1352–1371, 2013.

Suzie Cro, Tim P Morris, Michael G Kenward, and James R Carpenter. Sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation: a practical guide. Statistics in Medicine, 39(21):2815–2842, 2020.

Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data, Second Edition. John Wiley & Sons, Hoboken, New Jersey, 2002. [Section 10.2.3]

Marcel Wolbers, Alessandro Noci, Paul Delmar, Craig Gower-Page, Sean Yiu, and Jonathan W Bartlett. Standard and reference-based conditional mean imputation. https://arxiv.org/abs/2109.11162, 2022.

Paul T Von Hippel and Jonathan W Bartlett. Maximum likelihood multiple imputation: Faster imputations and consistent standard errors without posterior draws. 2021.