Introduction to rtables
Gabriel Becker and Adrian Waddell
2023-05-19
Introduction
The rtables R package provides a framework to create, tabulate and
output tables in R. Most of the design requirements for rtables have
their origin in studying tables that are commonly used to report
analyses from clinical trials; however, we were careful to keep rtables
a general purpose toolkit.
There are a number of other table frameworks available in R, such as gt
from RStudio, xtable, tableone, and tables, to name a few. There are a
number of reasons to implement rtables (yet another tables R package):
- output tables in ASCII to text files
- table rendering (ASCII, HTML, etc.) is separate from the data model. Hence, one always has access to the non-rounded/non-formatted numbers.
- pagination in both horizontal and vertical directions to meet the health authority submission requirements
- cell, row, column, table reference system
- titles, footers, and referential footnotes
- path based access to cell content which will be useful for automated content generation
In the remainder of this vignette, we give a short introduction to
rtables and to tabulating a table. The content is based on the useR 2020
presentation from Gabriel Becker.
The packages used for this vignette are rtables and dplyr:
library(rtables)
library(dplyr)
Data
The data used in this vignette is made up using random number generators. The data content is relatively simple: one row per imaginary person and one column per measurement: study arm, country of origin, gender, handedness, age, and weight.
n <- 400
set.seed(1)
df <- tibble(
  arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
  country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
  handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
  age = rchisq(n, 30) + 10
) %>% mutate(
  weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
)
head(df)
# # A tibble: 6 × 6
#   arm   country gender handed   age weight
#   <fct> <fct>   <fct>  <fct>  <dbl>  <dbl>
# 1 Arm A USA     Female Left    31.3   139.
# 2 Arm B CAN     Female Right   50.5   116.
# 3 Arm A USA     Male   Right   32.4   186.
# 4 Arm A USA     Male   Right   34.6   169.
# 5 Arm B USA     Female Right   43.0   160.
# 6 Arm A USA     Female Right   43.2   126.
Note that we use factor variables so that the level order is
represented in the row or column order when we tabulate the information
of df below.
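To see why the level order matters, the following base R sketch compares the default (alphabetical) levels with explicitly chosen ones. It uses a small hypothetical vector x rather than the vignette's df, and table() merely stands in for the tabulation that rtables performs later.

```r
# Hypothetical toy values, not taken from df
x <- c("USA", "CAN", "USA", "CAN", "USA")

# Default: levels are sorted alphabetically
f_default <- factor(x)

# Explicit: levels follow the order we supply
f_custom <- factor(x, levels = c("USA", "CAN"))

levels(f_default)
levels(f_custom)

# Tabulations follow the level order, not the order of appearance
table(f_default)
table(f_custom)
```

Setting the levels explicitly, as done for df above, is therefore a simple way to control the order in which groups appear.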
Building a Table
The aim of this vignette is to build the following table step by step:
#                    Arm A                     Arm B
#             Female        Male        Female        Male
#             (N=96)      (N=105)       (N=92)      (N=107)
# ————————————————————————————————————————————————————————————
# CAN       45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left    32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
#     mean     38.9         40.4         40.3         37.7
#   Right   13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
#     mean     36.6         40.2         40.2         40.6
# USA       51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left    34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
#     mean     40.4         39.7         39.2         40.1
#   Right   17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
#     mean     36.9         39.8         38.5         39.0
Starting Simple
In rtables, a basic table is defined to have 0 rows and one column
representing all data. Analyzing a variable is one way of adding a row:
lyt <- basic_table() %>%
analyze("age", mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
# all obs
# ——————————————
# mean 39.4
Layout Instructions
In the code above we first described the table and assigned that
description to a variable lyt. We then built the table using the actual
data with build_table(). The description of a table is called a table
layout. basic_table() is the start of every table layout and describes a
table with a single column representing all data. The analyze()
instruction adds to the layout that the age variable should be analyzed
with the mean() analysis function and the result should be rounded to 1
decimal place.
Hence, a layout is “pre-data”, that is, it’s a description of how to build a table once we get data. We can look at the layout in isolation:
lyt
# A Pre-data Table Layout
#
# Column-Split Structure:
# ()
#
# Row-Split Structure:
# age (** analysis **)
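Because a layout does not depend on any particular data, the same layout can be applied to more than one data set. The following is a minimal sketch under stated assumptions: df_a and df_b are hypothetical data frames invented for illustration, and the base R pipe |> is used instead of %>% so that only rtables needs to be attached.

```r
library(rtables)

# Hypothetical data sets; the layout only needs an age column
df_a <- data.frame(age = c(30, 40, 50))
df_b <- data.frame(age = c(20, 25, 30, 35))

# Describe the table once, before any data is involved
lyt_age <- basic_table() |>
  analyze("age", afun = mean, format = "xx.x")

# Build one table per data set from the same description
tbl_a <- build_table(lyt_age, df_a)
tbl_b <- build_table(lyt_age, df_b)

tbl_a
tbl_b
```

Both tables share the same structure (one row, one column) and differ only in the computed cell value.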
The general layouting instructions are summarized below:
- basic_table() is a layout representing a table with zero rows and one column
- Nested splitting: split_rows_by() in row space and split_cols_by() in column space, among others
- Summarizing Groups: summarize_row_groups()
- Analyzing Variables: analyze(), analyze_colvars()
Using those functions, it is possible to create a wide variety of tables as we will show in this document.
Adding Column Structure
We will now add more structure to the columns by adding a column
split based on the factor variable arm:
lyt <- basic_table() %>%
split_cols_by("arm") %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
# Arm A Arm B
# ————————————————————
# mean 39.5 39.4
The resulting table has one column per factor level of arm. So the data
represented by the first column is df[df$arm == "Arm A", ]. Hence,
split_cols_by() partitions the data among the columns by default.
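The partitioning can be mimicked in base R with split(). This is only an illustration of the idea, not how rtables is implemented, and toy is a hypothetical data frame rather than the vignette's df.

```r
# Hypothetical toy data with the same arm levels as df
toy <- data.frame(
  arm = factor(c("Arm A", "Arm B", "Arm A", "Arm B", "Arm A"),
               levels = c("Arm A", "Arm B")),
  age = c(31, 45, 38, 52, 40)
)

# One subset per factor level, analogous to one subset per column
by_arm <- split(toy, toy$arm)

# Each row lands in exactly one subset, so the sizes add up to nrow(toy)
sapply(by_arm, nrow)
sum(sapply(by_arm, nrow)) == nrow(toy)

# The per-column analysis is then the mean within each subset
sapply(by_arm, function(d) mean(d$age))
```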
Column splitting can be done in a recursive/nested manner by adding
sequential split_cols_by() layout instructions. It’s also possible to
add a non-nested split. Here we split each arm further by gender:
lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
# Arm A Arm B
# Female Male Female Male
# ————————————————————————————————————
# mean 38.8 40.1 39.6 39.2
The first column represents the data in df where
df$arm == "Arm A" & df$gender == "Female" and the second column the
data in df where df$arm == "Arm A" & df$gender == "Male", and so on.
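The same nested subsetting can be checked in base R with tapply(), which computes one mean per combination of the two factors. A minimal sketch on a hypothetical data frame toy2 (not the vignette's df):

```r
# Hypothetical toy data with the same arm and gender levels as df
toy2 <- data.frame(
  arm = factor(c("Arm A", "Arm A", "Arm B", "Arm B", "Arm A", "Arm B"),
               levels = c("Arm A", "Arm B")),
  gender = factor(c("Female", "Male", "Female", "Male", "Female", "Male"),
                  levels = c("Female", "Male")),
  age = c(30, 40, 35, 45, 34, 55)
)

# One mean per (gender, arm) combination, i.e. per leaf column
means <- tapply(toy2$age, list(toy2$gender, toy2$arm), mean)
means

# The entry for the first leaf column equals the mean of the explicit subset
mean(toy2$age[toy2$arm == "Arm A" & toy2$gender == "Female"])
```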
Adding Row Structure
So far, we have created layouts with analysis and column splitting
instructions, i.e. analyze() and split_cols_by(), respectively. This
resulted in a table with multiple columns and one data row. We will add
more row structure by stratifying the mean analysis by country (i.e.
adding a split in the row space):
lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
# Arm A Arm B
# Female Male Female Male
# ——————————————————————————————————————
# CAN
# mean 38.2 40.3 40.3 38.9
# USA
# mean 39.2 39.7 38.9 39.6
In this table, the data used to derive the first data cell (average age
of female Canadians in Arm A) is the subset of df where
df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female".
This cell value can also be calculated manually:
mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female"])
# [1] 38.22447
Row structure can also be used to group the table into titled groups
of pages during rendering. We do this via ‘page by splits’, which are
declared via page_by = TRUE within a call to split_rows_by():
lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country", page_by = TRUE) %>%
split_rows_by("handed") %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
cat(export_as_txt(tbl, page_type = "letter",
page_break = "\n\n~~~~~~ Page Break ~~~~~~\n\n"))
# Arm A Arm B
# Female Male Female Male
# ————————————————————————————————————————
# CAN
# Left
# mean 38.9 40.4 40.3 37.7
# Right
# mean 36.6 40.2 40.2 40.6
# USA
# Left
# mean 40.4 39.7 39.2 40.1
# Right
# mean 36.9 39.8 38.5 39.0
We go into more detail on page-by splits and how to control the page-group specific titles in the Title and footer vignette.
Note that if you print or render a table without pagination, the page_by splits are currently rendered as normal row splits. This may change in future releases.
Adding Group Information
When adding row splits, we get by default label rows for each split
level, for example CAN and USA in the table above. Besides the column
space subsetting, we have now further subsetted the data for each cell.
It is often useful when defining a row split to display information
about each row group. In rtables this is referred to as content
information. For example, the mean on row 2 is a descendant of CAN
(visible via the indenting, though the table has an underlying tree
structure that is not of importance for this vignette). In order to add
content information and turn the CAN label row into a content row, the
summarize_row_groups() function is required. By default, the count
(nrow()) and the percentage of data relative to the data associated
with the column are calculated:
lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
#                  Arm A                     Arm B
#           Female        Male        Female        Male
# ——————————————————————————————————————————————————————————
# CAN     45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   mean     38.2         40.3         40.3         38.9
# USA     51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   mean     39.2         39.7         38.9         39.6
The count and relative percentage for female Canadians in Arm A are calculated as follows:
df_cell <- subset(df, df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female")
df_col_1 <- subset(df, df$arm == "Arm A" & df$gender == "Female")
c(count = nrow(df_cell), percentage = nrow(df_cell) / nrow(df_col_1))
# count percentage
# 45.00000 0.46875
Hence, the group percentages per row split sum up to 1 for each column.
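That the percentages add up to 1 within each column can be verified in base R with prop.table(). The sketch below uses a hypothetical data frame toy3 with a single column split (arm) instead of the vignette's df.

```r
# Hypothetical toy data with the same country and arm levels as df
toy3 <- data.frame(
  country = factor(c("CAN", "USA", "CAN", "CAN", "USA", "USA", "CAN"),
                   levels = c("CAN", "USA")),
  arm = factor(c("Arm A", "Arm A", "Arm B", "Arm A", "Arm B", "Arm B", "Arm B"),
               levels = c("Arm A", "Arm B"))
)

# Counts per row group (country) and column (arm)
counts <- table(toy3$country, toy3$arm)
counts

# Proportions relative to the column totals (margin = 2)
props <- prop.table(counts, margin = 2)
props

# Within each column the row-group proportions sum to 1
colSums(props)
```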
We can further split the row space by dividing each country by handedness:
lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
split_rows_by("handed") %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
#                    Arm A                     Arm B
#             Female        Male        Female        Male
# ————————————————————————————————————————————————————————————
# CAN       45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left
#     mean     38.9         40.4         40.3         37.7
#   Right
#     mean     36.6         40.2         40.2         40.6
# USA       51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left
#     mean     40.4         39.7         39.2         40.1
#   Right
#     mean     36.9         39.8         38.5         39.0
Next, we further add a count and percentage summary for handedness within each country:
lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
split_rows_by("handed") %>%
summarize_row_groups() %>%
analyze("age", afun = mean, format = "xx.x")
tbl <- build_table(lyt, df)
tbl
#                    Arm A                     Arm B
#             Female        Male        Female        Male
# ————————————————————————————————————————————————————————————
# CAN       45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left    32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
#     mean     38.9         40.4         40.3         37.7
#   Right   13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
#     mean     36.6         40.2         40.2         40.6
# USA       51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left    34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
#     mean     40.4         39.7         39.2         40.1
#   Right   17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
#     mean     36.9         39.8         38.5         39.0
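The target table shown at the start of the Building a Table section additionally displays column counts such as (N=96), which the layout above does not request. One way to obtain such a row is the add_colcounts() layout instruction. The sketch below is an assumption about how the target table could be completed, not necessarily how it was produced; it uses a small hypothetical data frame toy4 and the base R pipe |> so that only rtables needs to be attached.

```r
library(rtables)

# Hypothetical toy data with the same arm levels as df
toy4 <- data.frame(
  arm = factor(c("Arm A", "Arm A", "Arm B", "Arm B", "Arm B"),
               levels = c("Arm A", "Arm B")),
  age = c(30, 40, 35, 45, 55)
)

# add_colcounts() asks for an (N=...) line under the column labels
lyt_counts <- basic_table() |>
  split_cols_by("arm") |>
  add_colcounts() |>
  analyze("age", afun = mean, format = "xx.x")

tbl_counts <- build_table(lyt_counts, toy4)
tbl_counts
```

The same instruction can be added to the full layout for df, directly after the column splits.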
Introspecting rtables Table Objects
Once we have created a table, we can inspect its structure using a number of functions.
The table_structure() function prints a summary of a table’s row
structure at one of two levels of detail. By default, it summarizes the
structure at the subtable level.
table_structure(tbl)
# [TableTree] country
# [TableTree] CAN [cont: 1 x 4]
# [TableTree] handed
# [TableTree] Left [cont: 1 x 4]
# [ElementaryTable] age (1 x 4)
# [TableTree] Right [cont: 1 x 4]
# [ElementaryTable] age (1 x 4)
# [TableTree] USA [cont: 1 x 4]
# [TableTree] handed
# [TableTree] Left [cont: 1 x 4]
# [ElementaryTable] age (1 x 4)
# [TableTree] Right [cont: 1 x 4]
# [ElementaryTable] age (1 x 4)
When the detail argument is set to "row", however, it provides a more
detailed row-level summary, which acts as a useful alternative to how we
might normally use the str() function to interrogate compound nested
lists.
table_structure(tbl, detail = "row")
# TableTree: [country] (country)
# labelrow: [country] (country) - <not visible>
# children:
# TableTree: [CAN] (CAN)
# labelrow: [CAN] (CAN) - <not visible>
# content:
# ElementaryTable: [CAN@content] ()
# labelrow: [] () - <not visible>
# children:
# ContentRow: [CAN] (CAN)
# children:
# TableTree: [handed] (handed)
# labelrow: [handed] (handed) - <not visible>
# children:
# TableTree: [Left] (Left)
# labelrow: [Left] (Left) - <not visible>
# content:
# ElementaryTable: [Left@content] ()
# labelrow: [] () - <not visible>
# children:
# ContentRow: [Left] (Left)
# children:
# ElementaryTable: [age] (age)
# labelrow: [age] (age) - <not visible>
# children:
# DataRow: [mean] (mean)
# TableTree: [Right] (Right)
# labelrow: [Right] (Right) - <not visible>
# content:
# ElementaryTable: [Right@content] ()
# labelrow: [] () - <not visible>
# children:
# ContentRow: [Right] (Right)
# children:
# ElementaryTable: [age] (age)
# labelrow: [age] (age) - <not visible>
# children:
# DataRow: [mean] (mean)
# TableTree: [USA] (USA)
# labelrow: [USA] (USA) - <not visible>
# content:
# ElementaryTable: [USA@content] ()
# labelrow: [] () - <not visible>
# children:
# ContentRow: [USA] (USA)
# children:
# TableTree: [handed] (handed)
# labelrow: [handed] (handed) - <not visible>
# children:
# TableTree: [Left] (Left)
# labelrow: [Left] (Left) - <not visible>
# content:
# ElementaryTable: [Left@content] ()
# labelrow: [] () - <not visible>
# children:
# ContentRow: [Left] (Left)
# children:
# ElementaryTable: [age] (age)
# labelrow: [age] (age) - <not visible>
# children:
# DataRow: [mean] (mean)
# TableTree: [Right] (Right)
# labelrow: [Right] (Right) - <not visible>
# content:
# ElementaryTable: [Right@content] ()
# labelrow: [] () - <not visible>
# children:
# ContentRow: [Right] (Right)
# children:
# ElementaryTable: [age] (age)
# labelrow: [age] (age) - <not visible>
# children:
# DataRow: [mean] (mean)
The make_row_df() and make_col_df() functions each create a data.frame
with a variety of information about the table’s structure. Most useful
for introspection purposes are the label, name, abs_rownumber, path and
node_class columns (the remainder of the information in the returned
data.frame is used for pagination):
make_row_df(tbl)[,c("label", "name", "abs_rownumber", "path", "node_class")]
# label name abs_rownumber path node_class
# 1 CAN CAN 1 country,.... ContentRow
# 2 Left Left 2 country,.... ContentRow
# 3 mean mean 3 country,.... DataRow
# 4 Right Right 4 country,.... ContentRow
# 5 mean mean 5 country,.... DataRow
# 6 USA USA 6 country,.... ContentRow
# 7 Left Left 7 country,.... ContentRow
# 8 mean mean 8 country,.... DataRow
# 9 Right Right 9 country,.... ContentRow
# 10 mean mean 10 country,.... DataRow
By default make_row_df() summarizes only visible rows, but setting
visible_only to FALSE gives us a structural summary of the table that
covers the full hierarchy of subtables, including those that aren’t
represented directly by any visible rows:
make_row_df(tbl, visible_only = FALSE)[,c("label", "name", "abs_rownumber", "path", "node_class")]
# label name abs_rownumber path node_class
# 1 country NA country TableTree
# 2 CAN NA country, CAN TableTree
# 3 CAN@content NA country,.... ElementaryTable
# 4 CAN CAN 1 country,.... ContentRow
# 5 handed NA country,.... TableTree
# 6 Left NA country,.... TableTree
# 7 Left@content NA country,.... ElementaryTable
# 8 Left Left 2 country,.... ContentRow
# 9 age NA country,.... ElementaryTable
# 10 mean mean 3 country,.... DataRow
# 11 Right NA country,.... TableTree
# 12 Right@content NA country,.... ElementaryTable
# 13 Right Right 4 country,.... ContentRow
# 14 age NA country,.... ElementaryTable
# 15 mean mean 5 country,.... DataRow
# 16 USA NA country, USA TableTree
# 17 USA@content NA country,.... ElementaryTable
# 18 USA USA 6 country,.... ContentRow
# 19 handed NA country,.... TableTree
# 20 Left NA country,.... TableTree
# 21 Left@content NA country,.... ElementaryTable
# 22 Left Left 7 country,.... ContentRow
# 23 age NA country,.... ElementaryTable
# 24 mean mean 8 country,.... DataRow
# 25 Right NA country,.... TableTree
# 26 Right@content NA country,.... ElementaryTable
# 27 Right Right 9 country,.... ContentRow
# 28 age NA country,.... ElementaryTable
# 29 mean mean 10 country,.... DataRow
make_col_df() similarly accepts visible_only, though here the meaning is
slightly different, indicating whether only leaf columns should be
summarized (TRUE, the default) or whether higher level groups of
columns, analogous to subtables in row space, should be summarized as
well.
make_col_df(tbl)
# name label abs_pos path pos_in_siblings n_siblings leaf_indices
# 1 Female Female 1 arm, Arm.... 1 2 1
# 2 Male Male 2 arm, Arm.... 2 2 2
# 3 Female Female 3 arm, Arm.... 1 2 3
# 4 Male Male 4 arm, Arm.... 2 2 4
# total_span col_fnotes n_col_fnotes
# 1 1 0
# 2 1 0
# 3 1 0
# 4 1 0
make_col_df(tbl, visible_only = FALSE)
# name label abs_pos path pos_in_siblings n_siblings leaf_indices
# 1 Arm A Arm A NA arm, Arm A 1 2 1, 2
# 2 Female Female 1 arm, Arm.... 1 2 1
# 3 Male Male 2 arm, Arm.... 2 2 2
# 4 Arm B Arm B NA arm, Arm B 2 2 3, 4
# 5 Female Female 3 arm, Arm.... 1 2 3
# 6 Male Male 4 arm, Arm.... 2 2 4
# total_span col_fnotes n_col_fnotes
# 1 2 0
# 2 1 0
# 3 1 0
# 4 2 0
# 5 1 0
# 6 1 0
The row_paths_summary() and col_paths_summary() functions wrap the
respective make_*_df functions, printing the name, node_class and path
information (in the row case), or the label and path information (in
the column case), indented to illustrate table structure:
row_paths_summary(tbl)
# rowname node_class path
# ——————————————————————————————————————————————————————————————————————
# CAN ContentRow country, CAN, @content, CAN
# Left ContentRow country, CAN, handed, Left, @content, Left
# mean DataRow country, CAN, handed, Left, age, mean
# Right ContentRow country, CAN, handed, Right, @content, Right
# mean DataRow country, CAN, handed, Right, age, mean
# USA ContentRow country, USA, @content, USA
# Left ContentRow country, USA, handed, Left, @content, Left
# mean DataRow country, USA, handed, Left, age, mean
# Right ContentRow country, USA, handed, Right, @content, Right
# mean DataRow country, USA, handed, Right, age, mean
col_paths_summary(tbl)
# label path
# ——————————————————————————————————————
# Arm A arm, Arm A
# Female arm, Arm A, gender, Female
# Male arm, Arm A, gender, Male
# Arm B arm, Arm B
# Female arm, Arm B, gender, Female
# Male arm, Arm B, gender, Male
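Paths like the ones printed above can be used to retrieve individual cell values. The following is a hedged sketch that assumes the value_at() accessor is available in the installed rtables version; toy5 is a hypothetical data frame, and the base R pipe |> is used so that only rtables needs to be attached.

```r
library(rtables)

# Hypothetical toy data with the same arm and country levels as df
toy5 <- data.frame(
  arm = factor(c("Arm A", "Arm A", "Arm B", "Arm B", "Arm A", "Arm B"),
               levels = c("Arm A", "Arm B")),
  country = factor(c("CAN", "USA", "CAN", "USA", "CAN", "USA"),
                   levels = c("CAN", "USA")),
  age = c(30, 40, 35, 45, 34, 55)
)

lyt_toy <- basic_table() |>
  split_cols_by("arm") |>
  split_rows_by("country") |>
  analyze("age", afun = mean, format = "xx.x")

tbl_toy <- build_table(lyt_toy, toy5)

# Inspect the available paths first
row_paths_summary(tbl_toy)
col_paths_summary(tbl_toy)

# Unformatted mean age for Canadians in Arm A, addressed by row and column path
cell <- value_at(
  tbl_toy,
  rowpath = c("country", "CAN", "age", "mean"),
  colpath = c("arm", "Arm A")
)
cell
```

Because rendering is separate from the data model, the value retrieved this way is the non-rounded number rather than the formatted string.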
Summary
In this vignette you have learned that:
- every cell has an associated subset of data
- this means that much of tabulation has to do with splitting/subsetting data
- tables can be described pre-data using layouts
- tables are a form of visualization of data
The other vignettes in the rtables package provide more detailed
information about its features. We recommend that you continue with the
tabulation_dplyr vignette, which derives the information shown in the
table from this vignette using dplyr.