Reformatting.Rmd
Reformatting in dulin
consists in replacing
predetermined values by another in particular variables for selected
tables of a data set stored under the form of a dm
object.
This is performed in two steps:
A Reformatting Map (list
object) is created which
specifies the correspondence between the old and the new values
The reformatting itself is performed with the
dunlin::apply_reformat()
function.
The Reformatting Map is a list object with 4 hierarchic levels.
The Table Name: Typically one of the values returned by
names(db)
where db is dm
object. For instance,
in the case of an adam
data set, adsl
or
adae
etc…
The Variable Name: The name of the column in a table of a
dm
object. For instance ACTARM
.
The Key: the new value that will replace the old ones
The Values: a vector or values, that have to be replaced by their associated key.
The All
keyword, in HIGHER, lower or Mixed case, can be
used instead of a table name to indicate that a particular variable
should be changed in every table where it appears.
The values can be a string
, a character
vector or NA
. Values that are present in the original table
and not listed in the mapping are left unchanged, except that all
altered columns are turned into factor
class. If the
original variable is a factor
, the non-transformed levels
are preserved even if the corresponding value does not appear.
The order in which the keys appear determine the order of the levels in the output. Hence the function can be used to re level variables. However, the level replacing missing values will automatically be placed last.
my_map <- list(
# This is the Table Name.
airlines = list(
# This is the Variable Name.
name = list(
# `AE` is the Key, `c("American Airlines Inc.")` is the value.
"AE" = c("American Airlines Inc."),
"Alaska and Hawaiian Airlines" = c("Alaska Airlines Inc.", "Hawaiian Airlines Inc.")
)
),
planes = list(
manufacturer = list(
"Airbus" = "AIRBUS INDUSTRIE",
"New Level" = "new_level",
"<Missing>" = NA
),
model = list(
"EMB-145" = c("EMB-145XR"),
"Other 737" = c("737-824", "737-724", "737-732")
)
),
All = list(
year = list(
"Current" = "2013"
)
)
)
Once the re formatting map is defined, the reformatting can be
performed on a dm
object.
db <- dm::dm_nycflights13()
db <-
db %>%
dm_zoom_to(planes) %>%
mutate(manufacturer = ifelse(manufacturer == "EMBRAER", NA, manufacturer)) %>%
dm_update_zoomed()
res <- apply_reformat(db, format = my_map)
head(db$airlines$name)
#> [1] "Endeavor Air Inc." "American Airlines Inc."
#> [3] "Alaska Airlines Inc." "JetBlue Airways"
#> [5] "Delta Air Lines Inc." "ExpressJet Airlines Inc."
head(res$airlines$name)
#> [1] Endeavor Air Inc. AE
#> [3] Alaska and Hawaiian Airlines JetBlue Airways
#> [5] Delta Air Lines Inc. ExpressJet Airlines Inc.
#> 14 Levels: AE Alaska and Hawaiian Airlines ... Virgin America
# New levels are introduced and missing values can be replaced.
head(db$planes$manufacturer)
#> [1] NA "AIRBUS INDUSTRIE" NA "AIRBUS INDUSTRIE"
#> [5] "AIRBUS INDUSTRIE" NA
head(res$planes$manufacturer)
#> [1] <Missing> Airbus <Missing> Airbus Airbus <Missing>
#> 16 Levels: Airbus New Level AIRBUS BEECH BOEING BOMBARDIER INC ... <Missing>
# Using the `All` keyword changes several tables.
head(db$weather$year)
#> [1] 2013 2013 2013 2013 2013 2013
head(res$weather$year)
#> [1] Current Current Current Current Current Current
#> Levels: Current
head(db$flights$year)
#> [1] 2013 2013 2013 2013 2013 2013
head(res$flights$year)
#> [1] Current Current Current Current Current Current
#> Levels: Current
NULL
values
Reformatting functions tolerate NULL
values in the
reformatting maps. In the following example, the table df2
will be left unchanged as well as the logi
column of the
df1
table. In the char
column, only
a
and k
will be reformatted and the column
changed to factor
. B
wont be included in the
levels.
# This is a valid reformatting map
null_map <- list(
df1 = list(
char = list(
"A" = c("a", "k"),
"B" = NULL
),
logi = NULL
),
df2 = NULL
)
assert_reformat(null_map)