Reformatting
Introduction
Reformatting in dunlin consists of replacing predetermined values with new ones in particular variables of selected tables in a data set.
This is performed in two steps:
1. A Reformatting Map (rule object) is created, which specifies the correspondence between the old and the new values.
2. The reformatting itself is performed with the dunlin::reformat() function.
The Reformatting Map Structure
The Reformatting Map is a rule object inheriting from character. Its names are the new values to be used, and its values are the old values to be replaced.
For example, rule(A = "a", B = c("c", "d")) replaces "a" with "A" and replaces "c" or "d" with "B".
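The name/value orientation described above can be illustrated with a plain named character vector. This is only a base R sketch of the lookup idea, not dunlin's implementation; the variable names map, x, hit and out are chosen here for illustration.

```r
# Names are the new values, elements are the old values to be replaced.
map <- c(A = "a", B = "c", B = "d")

x <- c("a", "c", "d", NA, "z")

# For each element of x, find the position of the matching old value.
# NA and "z" have no match, so their positions are NA.
hit <- match(x, map)

# Replace matched elements with the corresponding name; leave the rest as is.
out <- x
out[!is.na(hit)] <- names(map)[hit[!is.na(hit)]]
out
#> [1] "A" "B" "B" NA  "z"
```

Values that do not appear in the map, including NA, pass through unchanged in this sketch.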
Calling reformat
reformat is a generic that supports reformatting of character or factor vectors. Reformatting other types of variables is not meaningful. reformat also preserves the attributes of the original data, e.g. the data type or labels are unchanged.
An example of reformatting a character vector:
r <- rule(A = "a", B = c("c", "d"))
reformat(c("a", "c", "d", NA), r)
#> [1] A B B <NA>
#> Levels: A B
We can see that the NA values are not changed.
Now we test the factor reformatting:
r <- rule(A = "a", B = c("c", "d"))
reformat(factor(c("a", "c", "d", NA)), r)
#> [1] A B B <NA>
#> Levels: A B
The NA values are again unchanged. However, if the rule includes a replacement for NA, the result is different:
r <- rule(A = "a", C = NA, B = c("c", "d"))
reformat(factor(c("a", "c", "d", NA)), r)
#> [1] A B B C
#> Levels: A B C
By default, the level replacing NA is placed last. This can be changed by setting na_last = FALSE, which keeps that level at the position it has in the rule.
r <- rule(A = "a", C = NA, B = c("c", "d"))
reformat(factor(c("a", "c", "d", NA)), r, na_last = FALSE)
#> [1] A B B C
#> Levels: A C B
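The two results above hold the same values and differ only in level order. To show what that changes downstream, the sketch below rebuilds both factors directly with base R's factor(), using the levels printed above; it does not call dunlin, and the object names are illustrative.

```r
vals <- c("A", "B", "B", "C")

# Same values, two level orders (taken from the printed outputs above).
na_level_last <- factor(vals, levels = c("A", "B", "C"))
na_level_by_rule <- factor(vals, levels = c("A", "C", "B"))

# The underlying integer codes follow the level order.
as.integer(na_level_last)
#> [1] 1 2 2 3
as.integer(na_level_by_rule)
#> [1] 1 3 3 2

# Sorting (and anything else driven by level order) differs accordingly.
as.character(sort(na_level_by_rule))
#> [1] "A" "C" "B" "B"
```

Level order matters whenever the factor is later sorted, tabulated, or used to order rows or categories in an output.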
For a list of data.frames, the format argument is a nested list of rule objects. The first layer is named after the tables, and the second layer after the variables in each table. Reformatting is only available for character or factor columns; reformatting columns of other types results in a warning.
Example
df1 <- data.frame(
  "char" = c("", "b", NA, "a", "k", "x"),
  "fact" = factor(c("f1", "f2", NA, NA, "f1", "f1"), levels = c("f2", "f1")),
  "logi" = c(NA, FALSE, TRUE, NA, FALSE, NA)
)
df2 <- data.frame(
  "char" = c("a", "b", NA, "a", "k", "x"),
  "fact" = factor(c("f1", "f2", NA, NA, "f1", "f1"))
)
db <- list(df1 = df1, df2 = df2)
attr(db$df1$char, "label") <- "my label"
rule_map <- list(
  df1 = list(
    char = rule("Empty" = "", "B" = "b", "Not Available" = NA),
    fact = rule("F1" = "f1"),
    logi = rule()
  ),
  df2 = list(
    char = rule("Empty" = "", "A" = "a", "Not Available" = NA)
  )
)
res <- reformat(db, rule_map, na_last = TRUE)
res
#> $df1
#>            char fact  logi
#> 1         Empty   F1    NA
#> 2             B   f2 FALSE
#> 3 Not Available <NA>  TRUE
#> 4             a <NA>    NA
#> 5             k   F1 FALSE
#> 6             x   F1    NA
#>
#> $df2
#>            char fact
#> 1             A   f1
#> 2             b   f2
#> 3 Not Available <NA>
#> 4             A <NA>
#> 5             k   f1
#> 6             x   f1
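Because the nested map is matched to the data by name (tables in the first layer, variables in the second), a misspelled name is easy to overlook. The sketch below is a hypothetical base R helper, check_rule_map(), that is not part of dunlin; it only compares names, so plain character vectors stand in for rule objects.

```r
# Hypothetical helper (not part of dunlin): list the entries of a nested map
# that do not correspond to a table or a column in the data.
check_rule_map <- function(db, rule_map) {
  problems <- character(0)
  for (tab in names(rule_map)) {
    if (!tab %in% names(db)) {
      problems <- c(problems, paste0("unknown table: ", tab))
      next
    }
    missing_cols <- setdiff(names(rule_map[[tab]]), names(db[[tab]]))
    if (length(missing_cols) > 0) {
      problems <- c(problems, paste0("unknown column: ", tab, "$", missing_cols))
    }
  }
  problems
}

db <- list(
  df1 = data.frame(char = c("a", "b"), fact = factor(c("f1", "f2"))),
  df2 = data.frame(char = c("a", "b"))
)

# Stand-ins for rule objects; only the names of the two layers are inspected.
rule_map <- list(
  df1 = list(char = c(A = "a"), fct = c(F1 = "f1")), # "fct" is a deliberate typo
  df3 = list(char = c(A = "a"))                      # no such table in db
)

check_rule_map(db, rule_map)
#> [1] "unknown column: df1$fct" "unknown table: df3"
```

An empty result means every entry in the map points at an existing table and column.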