Controlling Splitting Behavior
Gabriel Becker
2024-06-27
Source:vignettes/split_functions.Rmd
split_functions.Rmd
Controlling Facet Levels
Provided Functions
By default, split_*_by(varname, ...)
generates a facet
for each level the variable varname
takes in the
data - including unobserved ones in the factor
case. This
behavior can be customized in various ways.
The most straightforward way to customize which facets are generated
by a split is with one of the split functions or split function families
provided by rtables
.
These predefined split functions and function factories implement commonly desired customization patterns of splitting behavior (i.e., faceting behavior). They include:
-
remove_split_levels
- remove specified levels from the data for facet generation. -
keep_split_levels
- keep only specified levels in the data for facet generation (removing all others). -
drop_split_levels
- drop levels that are unobserved within the data being split, i.e., associated with the parent facet. -
reorder_split_levels
- reorder the levels (and thus the generated facets) to the specified order. -
trim_levels_in_group
- drop unobserved levels of another variable independently within the data associated with each facet generated by the current split. -
add_overall_level
,add_combo_levels
- add additional “virtual” levels which combine two or more levels of the variable being split. See the following section. -
trim_levels_to_map
- trim the levels of multiple variables to a pre-specified set of value combinations. See the following section.
The first four of these are fairly self-describing and for brevity,
we refer our readers to ?split_funcs
for details including
working examples.
Controlling Combinations of Levels Across Multiple Variables
Often with nested splitting involving multiple variables, the values of the variables in question are logically nested; meaning that certain values of the inner variable are only coherent in combination with a specific value or values of the outer variable.
As an example, suppose we have a variable vehicle_class
,
which can take the values "automobile"
, and
"boat"
, and a variable vehicle_type
, which can
take the values "car"
, "truck"
,
"suv"
,"sailboat"
, and
"cruiseliner"
. The combination ("automobile"
,
"cruiseliner"
) does not make sense and will never occur in
any (correctly cleaned) data set; nor does the combination
("boat"
, "truck"
).
We will showcase strategies to deal with this in the next sections using the following artificial data:
set.seed(0)
levs_type <- c("car", "truck", "suv", "sailboat", "cruiseliner")
vclass <- sample(c("auto", "boat"), 1000, replace = TRUE)
auto_inds <- which(vclass == "auto")
vtype <- rep(NA_character_, 1000)
vtype[auto_inds] <- sample(
c("car", "truck"), ## suv missing on purpose
length(auto_inds),
replace = TRUE
)
vtype[-auto_inds] <- sample(
c("sailboat", "cruiseliner"),
1000 - length(auto_inds),
replace = TRUE
)
vehic_data <- data.frame(
vehicle_class = factor(vclass),
vehicle_type = factor(vtype, levels = levs_type),
color = sample(
c("white", "black", "red"), 1000,
prob = c(1, 2, 1),
replace = TRUE
),
cost = ifelse(
vclass == "boat",
rnorm(1000, 100000, sd = 5000),
rnorm(1000, 40000, sd = 5000)
)
)
head(vehic_data)
## vehicle_class vehicle_type color cost
## 1 boat sailboat black 100393.81
## 2 auto car white 38150.17
## 3 boat sailboat white 98696.13
## 4 auto truck white 37677.16
## 5 auto truck black 38489.27
## 6 boat cruiseliner black 108709.72
trim_levels_in_group
The trim_levels_in_group
split function factory creates
split functions which deal with this issue empirically; any combination
which is observed in the data being tabulated will appear as
nested facets within the table, while those that do not, will not.
If we use default level-based faceting, we get several logically incoherent cells within our table:
library(rtables)
lyt <- basic_table() %>%
split_cols_by("color") %>%
split_rows_by("vehicle_class") %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt, vehic_data)
## black white red
## ————————————————————————————————————————————————
## auto
## car
## Mean 40431.92 40518.92 38713.14
## truck
## Mean 40061.70 40635.74 40024.41
## suv
## Mean NA NA NA
## sailboat
## Mean NA NA NA
## cruiseliner
## Mean NA NA NA
## boat
## car
## Mean NA NA NA
## truck
## Mean NA NA NA
## suv
## Mean NA NA NA
## sailboat
## Mean 99349.69 99996.54 101865.73
## cruiseliner
## Mean 100212.00 99340.25 100363.52
This is obviously not the table we want, as the majority of its space
is taken up by meaningless combinations. If we use
trim_levels_in_group
to trim the levels of
vehicle_type
separately within each level of
vehicle_class
, we get a table which only has meaningful
combinations:
lyt2 <- basic_table() %>%
split_cols_by("color") %>%
split_rows_by("vehicle_class", split_fun = trim_levels_in_group("vehicle_type")) %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt2, vehic_data)
## black white red
## ————————————————————————————————————————————————
## auto
## car
## Mean 40431.92 40518.92 38713.14
## truck
## Mean 40061.70 40635.74 40024.41
## boat
## sailboat
## Mean 99349.69 99996.54 101865.73
## cruiseliner
## Mean 100212.00 99340.25 100363.52
Note, however, that it does not contain all meaningful
combinations, only those that were actually observed in our data; which
happens to not include the perfectly valid "auto"
,
"suv"
combination.
To restrict level combinations to those which are valid
regardless of whether the combination was observed, we must use
trim_levels_to_map()
instead.
trim_levels_to_map
trim_levels_to_map
is similar to
trim_levels_in_group
in that its purpose is to avoid
combinatorial explosion when nesting splitting with logically nested
variables. Unlike its sibling function, however, with
trim_levels_to_map
we define the exact set of allowed
combinations a priori, and that exact set of combinations is
produced in the resulting table, regardless of whether they are observed
or not.
library(tibble)
map <- tribble(
~vehicle_class, ~vehicle_type,
"auto", "truck",
"auto", "suv",
"auto", "car",
"boat", "sailboat",
"boat", "cruiseliner"
)
lyt3 <- basic_table() %>%
split_cols_by("color") %>%
split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt3, vehic_data)
## black white red
## ————————————————————————————————————————————————
## auto
## car
## Mean 40431.92 40518.92 38713.14
## truck
## Mean 40061.70 40635.74 40024.41
## suv
## Mean NA NA NA
## boat
## sailboat
## Mean 99349.69 99996.54 101865.73
## cruiseliner
## Mean 100212.00 99340.25 100363.52
Now we see that the "auto"
, "suv"
combination is again present, even though it is populated with
NA
s (because there is no data in that category), but the
logically invalid combinations are still absent.
Combining Levels
Another very common manipulation of faceting in a table context is the introduction of combination levels that are not explicitly modeled in the data. Most often, this involves the addition of an “overall” category, but in both principle and practice it can involve any arbitrary combination of levels.
rtables
explicitly supports this via the
add_overall_level
(for the all case) and
add_combo_levels
split function factories.
add_overall_level
add_overall_level
accepts valname
which is
the name of the new level, as well as label
, and
first
(whether it should come first, if TRUE
,
or last, if FALSE
, in the ordering).
Building further on our arbitrary vehicles table, we can use this to create an “all colors” category:
lyt4 <- basic_table(show_colcounts = TRUE) %>%
split_cols_by("color", split_fun = add_overall_level("allcolors", label = "All Colors")) %>%
split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt4, vehic_data)
## All Colors black white red
## (N=1000) (N=521) (N=251) (N=228)
## —————————————————————————————————————————————————————————————
## auto
## car
## Mean 40095.49 40431.92 40518.92 38713.14
## truck
## Mean 40194.68 40061.70 40635.74 40024.41
## suv
## Mean NA NA NA NA
## boat
## sailboat
## Mean 100133.22 99349.69 99996.54 101865.73
## cruiseliner
## Mean 100036.76 100212.00 99340.25 100363.52
With the column counts turned on, we can see that the “All Colors” column encompasses the full 1000 (completely fake) vehicles in our data set.
To add more arbitrary combinations, we use
add_combo_levels
.
add_combo_levels
add_combo_levels
allows us to add one or more arbitrary
combination levels to the faceting structure of our table.
We do this by defining a combination data.frame which
describes the levels we want to add. A combination
data.frame
has the following columns and one row for each
combination to add:
-
valname
- string indicating the name of the value, which will appear in paths. -
label
- a string indicating the label which should be displayed when rendering. -
levelcombo
- character vector of the individual levels to be combined in this combination level. -
exargs
- a list (usuallylist()
) of extra arguments which should be passed to analysis and content functions when tabulated within this column or row.
Suppose we wanted combinations levels for all non-white colors, and for white and black colors. We do this like so:
combodf <- tribble(
~valname, ~label, ~levelcombo, ~exargs,
"non-white", "Non-White", c("black", "red"), list(),
"blackwhite", "Black or White", c("black", "white"), list()
)
lyt5 <- basic_table(show_colcounts = TRUE) %>%
split_cols_by("color", split_fun = add_combo_levels(combodf)) %>%
split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt5, vehic_data)
## black white red Non-White Black or White
## (N=521) (N=251) (N=228) (N=749) (N=772)
## —————————————————————————————————————————————————————————————————————————————
## auto
## car
## Mean 40431.92 40518.92 38713.14 39944.93 40460.77
## truck
## Mean 40061.70 40635.74 40024.41 40050.66 40243.57
## suv
## Mean NA NA NA NA NA
## boat
## sailboat
## Mean 99349.69 99996.54 101865.73 100179.72 99567.50
## cruiseliner
## Mean 100212.00 99340.25 100363.52 100258.56 99937.47
Fully Customizing Split (Facet) Behavior
Beyond the ability to select common splitting customizations from the
split functions and split function factories rtables
provides, we can also fully customize every aspect of splitting behavior
by creating our own split functions. While it is possible to do so by
hand, the primary way we do this is via the
make_split_fun()
function, which accepts functions
implementing different component behaviors and combines them into a
split function which can be used in a layout.
Splitting, or faceting as it is done in rtables
, can be
thought of as the combination of 3 steps:
- preprocessing - transformation of the incoming data which will be faceted
- e.g., dropping unused factor levels, etc.
- splitting - mapping the incoming data to a set of 1 or more subsets representing individual facets.
- postprocessing - operations on the facets - e.g., combining them, removing them, etc.
The make_split_fun()
function allows us to specify
custom behaviors for each of these steps independently when defining
custom splitting behavior via the pre
,
core_split
, and post
arguments, which dictate
the above steps, respectively.
The pre
argument accepts zero or more pre-processing
functions, which must accept: df
, spl
,
vals
, labels
, and can optionally accept
.spl_context
. They then manipulate df
(the
incoming data for the split) and return a modified data.frame. This
modified data.frame must contain all columns present in the
incoming data.frame, but can add columns if necessary. Although, we note
that these new columns cannot be used in the layout as split or
analysis variables, because they will not be present when validity
checking is done.
The pre-processing component is useful for things such as manipulating factor levels, e.g., to trim unobserved ones or to reorder levels based on observed counts, etc.
For a more detailed discussion on what custom split functions do, and
an example of a custom split function not implemented via
make_split_fun()
, see ?custom_split_funs
.
An Example Custom Split Function
Here we will implement an arbitrary, custom split function where we specify both pre- and post-processing instructions. It is unusual for users to need to override the core splitting logic - and, in fact, is only supported in row space currently - so we leave this off of our example here but will provide another narrow example of that usage below.
An Illustrative Example of A Custom Split Function
First, we define two aspects of ‘pre-processing step’ behavior:
- A function which reverses the order of the levels of a variable (while retaining which level is associated with which observation), and
- A function factory which creates a function that removes a level and the data associated with it.
## reverse order of levels
rev_lev <- function(df, spl, vals, labels, ...) {
## in the split_rows_by() and split_cols_by() cases,
## spl_variable() gives us the variable
var <- spl_variable(spl)
vec <- df[[var]]
levs <- if (is.character(vec)) unique(vec) else levels(vec)
df[[var]] <- factor(vec, levels = rev(levs))
df
}
rem_lev_facet <- function(torem) {
function(df, spl, vals, labels, ...) {
var <- spl_variable(spl)
vec <- df[[var]]
bad <- vec == torem
df <- df[!bad, ]
levs <- if (is.character(vec)) unique(vec) else levels(vec)
df[[var]] <- factor(as.character(vec[!bad]), levels = setdiff(levs, torem))
df
}
}
Finally we implement our post-processing function. Here we will reorder the facets based on the amount of data each of them represents.
sort_them_facets <- function(splret, spl, fulldf, ...) {
ord <- order(sapply(splret$datasplit, nrow))
make_split_result(
splret$values[ord],
splret$datasplit[ord],
splret$labels[ord]
)
}
Finally, we construct our custom split function and use it to create our table:
silly_splfun1 <- make_split_fun(
pre = list(
rev_lev,
rem_lev_facet("white")
),
post = list(sort_them_facets)
)
lyt6 <- basic_table(show_colcounts = TRUE) %>%
split_cols_by("color", split_fun = silly_splfun1) %>%
split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt6, vehic_data)
## red black
## (N=228) (N=521)
## —————————————————————————————————————
## auto
## car
## Mean 38713.14 40431.92
## truck
## Mean 40024.41 40061.70
## suv
## Mean NA NA
## boat
## sailboat
## Mean 101865.73 99349.69
## cruiseliner
## Mean 100363.52 100212.00
Overriding the Core Split Function
Currently, overriding core split behavior is only supported in functions used for row splits.
Next, we write a custom core-splitting function which divides the observations into 4 groups: the first 100, observations 101-500, observations 501-900, and the last hundred. We could claim this was to test for structural bias in the first and last observations, but really its to simply illustrate overriding the core splitting machinery and has no meaningful statistical purpose.
silly_core_split <- function(spl, df, vals, labels, .spl_context) {
make_split_result(
c("first", "lowmid", "highmid", "last"),
datasplit = list(
df[1:100, ],
df[101:500, ],
df[501:900, ],
df[901:1000, ]
),
labels = c(
"first 100",
"obs 101-500",
"obs 501-900",
"last 100"
)
)
}
We can use this to construct a splitting function. This can be combined with pre- and post-processing functions, as each of the stages is performed independently, but in this case, we won’t, because our core splitting behavior is such that pre- or post-processing do not make much sense.
even_sillier_splfun <- make_split_fun(core_split = silly_core_split)
lyt7 <- basic_table(show_colcounts = TRUE) %>%
split_cols_by("color") %>%
split_rows_by("vehicle_class", split_fun = even_sillier_splfun) %>%
split_rows_by("vehicle_type") %>%
analyze("cost")
build_table(lyt7, vehic_data)
## black white red
## (N=521) (N=251) (N=228)
## —————————————————————————————————————————————————
## first 100
## car
## Mean 40496.05 37785.41 37623.17
## truck
## Mean 41094.17 40437.29 37866.81
## suv
## Mean NA NA NA
## sailboat
## Mean 100560.80 102017.05 101185.96
## cruiseliner
## Mean 100838.12 96952.27 100610.71
## obs 101-500
## car
## Mean 39350.88 41185.98 37978.72
## truck
## Mean 40166.87 41385.32 39885.72
## suv
## Mean NA NA NA
## sailboat
## Mean 98845.47 99563.02 101462.79
## cruiseliner
## Mean 101558.62 99039.91 97335.05
## obs 501-900
## car
## Mean 40721.82 40379.48 38681.26
## truck
## Mean 39951.92 39846.89 39840.39
## suv
## Mean NA NA NA
## sailboat
## Mean 99533.20 100347.18 102732.12
## cruiseliner
## Mean 99140.43 100074.43 101994.99
## last 100
## car
## Mean 45204.44 40626.95 41214.33
## truck
## Mean 38920.70 40620.47 42899.14
## suv
## Mean NA NA NA
## sailboat
## Mean 99380.21 97644.77 101691.92
## cruiseliner
## Mean 100017.53 99581.94 100751.30
Design of Pre- and Post-Processing Functions For Use in
make_split_fun
Pre-processing and post-processing functions in the custom-splitting context are best thought of as (and implemented as) independent, atomic building blocks for the desired overall behavior. This allows them to be reused in a flexible mix-and-match way.
rtables
provides several behavior components implemented
as either functions or function factories:
- Pre-processing “behavior blocks”
-
drop_facet_levels
- drop unobserved levels in the variable being split
-
- Post-processing “behavior blocks”
-
trim_levels_in_facets
- providestrim_levels_in_group
behavior -
add_overall_facet
- add a combination facet for the full data -
add_combo_facet
- add a single combination facet (can be used more than once in a singlemake_split_fun
call)
-