Join Keys
NEST CoreDev
2022-07-26
join-keys.Rmd
Overview
teal.data
gives the possibility to define individual
keys per dataset and to define the relations to other datasets.
Each table can have a set of keys that inform its structure and relation to other tables by specifying:
- columns consisting the primary key
- merge keys, a concept similar to
SQL
s foreign key.
Usually, an application developer needs to specify the keys manually,
but in case of datasets named according to the ADaM
standard, teal
can assign the keys on its own. Refer to
vignette("including-adam-data-in-teal", package = "teal")
for more information.
Primary key
Using the keys
argument to the dataset
(or
for DDL
a dataset_connector
) function, we can
specify column(s) of the dataset which together uniquely identify rows
in the dataset.
library(teal.data)
library(dplyr)
iris_with_keys <- mutate(iris, id = factor(row_number()))
ds <- dataset(
"IRIS",
iris_with_keys,
code = "IRIS <- iris %>% mutate(id = factor(row_number()))",
keys = c("id")
)
ds$get_keys()
## [1] "id"
Merge keys
When passing multiple datasets to the cdisc_data
function, dataset relationship are set using join_keys
and
join_key
and these are used to merge datasets together
within teal
apps. For users familiar with SQL
database schema, these relationships are symmetric and not as strict as
SQL
foreign key relationships as teal
does not
validate whether the values inserted into foreign key columns are
present in the parent table.
For example:
library(teal.data)
data_1 <- data.frame(X = factor(1:10), Y = 21:30, Z = 1:10)
data_2 <- data.frame(W = factor(10:1), V = factor(5:14), M = rep(1:5, 2))
data_3 <- data.frame(V = factor(5:14), T = 4)
data <- teal_data(
dataset("D1", data_1, code = "D1 <- data.frame(X = factor(1:10), Y = 21:30, Z = 1:10)"),
dataset("D2", data_2, code = "D2 <- data.frame(W = factor(10:1), V = factor(5:14), M = rep(1:5, 2))"),
dataset("D3", data_3, code = "D3 <- data.frame(V = factor(5:14), T = 4)"),
join_keys = join_keys(
join_key("D1", "D2", c("X" = "W")),
join_key("D2", "D3", c("V" = "V"))
)
)
The join_keys
function is used to specify keys:
-
join_keys
is a collection of multiplejoin_key
entries -
join_key
specifies the relation between two datasets:-
dataset_1
,dataset_2
- name of two datasets -
key
- (optionally) named vector of column names
-
Note that it is assumed that join keys are symmetric,
i.e. join_key("x", "y", "x_col" = "y_col")
will enable
merge from “x” to “y” and vice versa.