Reformatting
Introduction
Reformatting in dunlin consists of replacing predetermined values with others in particular variables of selected tables of a stored data set.
This is performed in two steps:
1. A Reformatting Map (rule object) is created, which specifies the correspondence between the old and the new values.
2. The reformatting itself is performed with the dunlin::reformat() function.
The Reformatting Map Structure
The Reformatting Map is a rule object inheriting from character. Its names are the new values to be used, and its values are the old values to be replaced.
For example, the rule rule(A = "a", B = c("c", "d")) will replace “a” with “A” and replace “c” or “d” with “B”.
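The correspondence between names (new values) and elements (old values) can be illustrated with a small base R sketch. This is only a conceptual illustration and is not how dunlin implements rule objects internally; the names mapping, x, idx and result are hypothetical.

```r
# Conceptual sketch in base R; NOT dunlin's implementation.
# Names hold the new values, elements hold the old values.
mapping <- c(A = "a", B = "c", B = "d")

x <- c("a", "c", "d", "z")

# Locate each element of x among the old values (NA when there is no match).
idx <- match(x, mapping)

# Use the matching name as the new value; keep unmatched elements as they are.
result <- ifelse(is.na(idx), x, names(mapping)[idx])
result
```

Values that do not appear in the mapping, such as "z" here, pass through unchanged.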
Calling reformat
reformat is a generic that supports reformatting of character or factor variables. Reformatting other types of variables is meaningless. reformat also preserves the attributes of the original data, e.g. the data type or labels remain unchanged.
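One way to check the attribute preservation described above is to attach an attribute to a factor before reformatting and read it back afterwards. This is a minimal sketch that assumes the dunlin package is installed; the label text and the variable names x, r and y are made up for illustration.

```r
library(dunlin)

# A factor carrying a hypothetical "label" attribute.
x <- factor(c("a", "c", "d"))
attr(x, "label") <- "Example variable"

r <- rule(A = "a", B = c("c", "d"))
y <- reformat(x, r)

# Inspect the attribute and the class after reformatting.
attr(y, "label")
class(y)
```

If the attributes are preserved as stated, the label read back from y should match the one set on x.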
An example of reformatting a character vector:
library(dunlin)
r <- rule(A = "a", B = c("c", "d"))
reformat(c("a", "c", "d", NA), r)
#> [1] A B B <NA>
#> Levels: A B
We can see that the NA values are not changed.
Now we test the factor reformatting:
r <- rule(A = "a", B = c("c", "d"))
reformat(factor(c("a", "c", "d", NA)), r)
#> [1] A B B <NA>
#> Levels: A B
The NA values are also not changed. However, if the rule includes a replacement for NA, the result is different:
r <- rule(A = "a", B = c("c", "d"), C = NA)
reformat(factor(c("a", "c", "d", NA)), r)
#> [1] A B B C
#> Levels: A B C
Please note that the level for NA is always the last one if that new level only contains NA.
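To see where the NA level ends up, the levels of the result can be inspected after declaring the NA replacement first in the rule. This is a sketch that assumes dunlin is installed; it only reuses the rule() and reformat() calls shown above, with the arguments given in a different order.

```r
library(dunlin)

# Same mapping as before, but with the NA replacement listed first.
r <- rule(C = NA, A = "a", B = c("c", "d"))
res <- reformat(factor(c("a", "c", "d", NA)), r)

# Inspect the order of the levels; the statement above suggests
# that "C" should be placed last.
levels(res)
```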
For dm objects, the format argument is actually a nested list of rule objects. The first layer indicates the table names, and the second layer indicates the variables in each table.
The All keyword (lowercase or mixed case is also accepted) can be used in the first layer instead of a table name to indicate that a particular variable should be changed in every table where it appears.
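A map using the All keyword might look like the following sketch. It assumes that dunlin, dm and nycflights13 are installed, and that the carrier variable appears in more than one table of dm::dm_nycflights13(); the map name my_map_all and the replacement value "American" are hypothetical.

```r
library(dunlin)

# "All" replaces the table name in the first layer, so the rule is meant to
# apply to the carrier variable in every table that contains it.
my_map_all <- list(
  All = list(
    carrier = rule("American" = "AA")
  )
)

db <- dm::dm_nycflights13()
db_all <- reformat(db, my_map_all)
```

Compared with repeating the same rule under each table name, this keeps the map short when a variable is shared across tables.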
Example of Reformatting Map
my_map <- list(
  # This is the Table Name.
  airlines = list(
    # This is the Variable Name.
    name = rule(
      "AE" = c("American Airlines Inc."),
      "Alaska and Hawaiian Airlines" = c("Alaska Airlines Inc.", "Hawaiian Airlines Inc.")
    )
  ),
  planes = list(
    manufacturer = rule(
      "Airbus" = "AIRBUS INDUSTRIE",
      "New Level" = "new_level",
      "<Missing>" = NA
    ),
    model = rule(
      "EMB-145" = c("EMB-145XR"),
      "Other 737" = c("737-824", "737-724", "737-732")
    )
  )
)
db <- dm::dm_nycflights13()
db_formatted <- reformat(db, my_map)
head(db_formatted$planes$model)
#> [1] EMB-145 A320-214 EMB-145LR A320-214 A320-214 EMB-145
#> 55 Levels: 150 172N 65-A90 717-200 737-3H4 737-401 737-524 ... R66