Introduction

Reformatting in dulin consists in replacing predetermined values by another in particular variables for selected tables of a data set stored under the form of a dm object.

This is performed in two steps:

  1. A Reformatting Map (list object) is created which specifies the correspondence between the old and the new values

  2. The reformatting itself is performed with the dunlin::apply_reformat() function.

The Formatting Map Structure

The Reformatting Map is a list object with 4 hierarchic levels.

  1. The Table Name: Typically one of the values returned by names(db) where db is dm object. For instance, in the case of an adam data set, adsl or adae etc…

  2. The Variable Name: The name of the column in a table of a dm object. For instance ACTARM.

  3. The Key: the new value that will replace the old ones

  4. The Values: a vector or values, that have to be replaced by their associated key.

The All keyword, in HIGHER, lower or Mixed case, can be used instead of a table name to indicate that a particular variable should be changed in every table where it appears.

The values can be a string, a character vector or NA. Values that are present in the original table and not listed in the mapping are left unchanged, except that all altered columns are turned into factor class. If the original variable is a factor, the non-transformed levels are preserved even if the corresponding value does not appear.

The order in which the keys appear determine the order of the levels in the output. Hence the function can be used to re level variables. However, the level replacing missing values will automatically be placed last.

Example of Reformatting Map


my_map <- list(
  # This is the Table Name.
  airlines = list(
    # This is the Variable Name.
    name = list(
      # `AE` is the Key, `c("American Airlines Inc.")` is the value.
      "AE" = c("American Airlines Inc."),
      "Alaska and Hawaiian Airlines" = c("Alaska Airlines Inc.", "Hawaiian Airlines Inc.")
    )
  ),
  planes = list(
    manufacturer = list(
      "Airbus" = "AIRBUS INDUSTRIE",
      "New Level" = "new_level",
      "<Missing>" = NA
    ),
    model = list(
      "EMB-145" = c("EMB-145XR"),
      "Other 737" = c("737-824", "737-724", "737-732")
    )
  ),
  All = list(
    year = list(
      "Current" = "2013"
    )
  )
)

Re Formatting

Once the re formatting map is defined, the reformatting can be performed on a dm object.

Example


db <- dm::dm_nycflights13()

db <-
  db %>%
  dm_zoom_to(planes) %>%
  mutate(manufacturer = ifelse(manufacturer == "EMBRAER", NA, manufacturer)) %>%
  dm_update_zoomed()

res <- apply_reformat(db, format = my_map)
head(db$airlines$name)
#> [1] "Endeavor Air Inc."        "American Airlines Inc."  
#> [3] "Alaska Airlines Inc."     "JetBlue Airways"         
#> [5] "Delta Air Lines Inc."     "ExpressJet Airlines Inc."
head(res$airlines$name)
#> [1] Endeavor Air Inc.            AE                          
#> [3] Alaska and Hawaiian Airlines JetBlue Airways             
#> [5] Delta Air Lines Inc.         ExpressJet Airlines Inc.    
#> 14 Levels: AE Alaska and Hawaiian Airlines ... Virgin America

# New levels are introduced and missing values can be replaced.
head(db$planes$manufacturer)
#> [1] NA                 "AIRBUS INDUSTRIE" NA                 "AIRBUS INDUSTRIE"
#> [5] "AIRBUS INDUSTRIE" NA
head(res$planes$manufacturer)
#> [1] <Missing> Airbus    <Missing> Airbus    Airbus    <Missing>
#> 16 Levels: Airbus New Level AIRBUS BEECH BOEING BOMBARDIER INC ... <Missing>

# Using the `All` keyword changes several tables.
head(db$weather$year)
#> [1] 2013 2013 2013 2013 2013 2013
head(res$weather$year)
#> [1] Current Current Current Current Current Current
#> Levels: Current

head(db$flights$year)
#> [1] 2013 2013 2013 2013 2013 2013
head(res$flights$year)
#> [1] Current Current Current Current Current Current
#> Levels: Current

Reformatting Map with NULL values

Reformatting functions tolerate NULL values in the reformatting maps. In the following example, the table df2 will be left unchanged as well as the logi column of the df1 table. In the char column, only a and k will be reformatted and the column changed to factor. B wont be included in the levels.


# This is a valid reformatting map
null_map <- list(
  df1 = list(
    char = list(
      "A" = c("a", "k"),
      "B" = NULL
    ),
    logi = NULL
  ),
  df2 = NULL
)

assert_reformat(null_map)