Overview
The teal.data
package provides a way to define primary
keys for a dataset and establish relationships with other datasets.
Each dataset joining keys can be characterized by:
- Columns constituting the primary key;
- Foreign/merge keys, analogous to
SQL
foreign keys.
Typically, an application developer specifies these keys manually.
However, for datasets following the ADaM standard,
teal.data
can automatically assign keys using the
default_cdisc_join_keys
object. Refer to the section “Joining Keys with ADaM Datasets”
for details on using this object to select specific datasets.
Uses of join_keys
class in teal
applications
The primary function of the join_keys
class in
teal
applications is to facilitate the seamless merging
of datasets using teal.transform
.
Additionally, it plays a role in the data filtering using the Filter
Panel in a teal
application. The filters applied
to a (parent) dataset are also applied to their children.
Anatomy of join_keys
The join_keys
object contains information about the
foreign/primary keys of multiple datasets. Each key is represented by a
pair of datasets (by name reference) and a named character vector that
encodes the column name mapping between the two datasets. In addition, a
foreign key may also contain a parent-child attribute that is
used in the “Filter Panel” as we mentioned above.
A new join keys can be created as an empty object, or by defining an initial set of primary and foreign keys.
That initial object can be extended by adding/modifying/removing keys and by establishing parent-child relationships between datasets.
join_keys(...)
: Join Keys Constructor / Getter /
Setter
Convenient function that is used both as the constructor and as the getter for join_keys objects.
As the Getter it is used to retrieve the joining
keys that are contained in other objects, such as a
teal_data
object.
As the Constructor it is used to specify a collection of
multiple individual keys (via join_key
function described
below).
join_key(dataset_1, dataset_2, key, parent)
: Single
Join Key Constructor
Specifies a primary key or a relationship between two datasets.
-
dataset_1
,dataset_2
: names of the datasets (ifdataset_2
is the same asdataset_1
or is omitted, it creates a primary key); -
key
: namedcharacter
vector of column name mapping between datasets (unnamed vector assumes column names between datasets are the same); -
directed
(optional): flag that indicates whether there is a parent-child directed relationship betweendataset_2
todataset_1
(latter as a parent).
Note that join keys are assumed to be symmetric, i.e.,
join_key("ds1", "ds2", c("ds1_col" = "ds2_col"))
establishes a relationship from “x” to “y” and vice versa.
By default, the new joining key will set the dataset_1
as the parent.
Example & Output
library(teal.data)
jk <- join_keys(
join_key("ds1", keys = "col_1"), # ds1: [col_1]
join_key("ds2", keys = c("col_1", "col_2")), # ds2: [col_1, col_2]
join_key("ds3", keys = c("col_1", "col_3")), # ds3: [col_1, col_3]
join_key("ds1", "ds2", keys = "col_1"), # ds1 <-- ds2
join_key("ds1", "ds3", keys = "col_1"), # ds1 <-- ds3
join_key("ds4", "ds5", keys = c("col_4" = "col_5"), directed = FALSE) # ds4 <--> ds5
)
# The parent-child relationships are created automatically (unless 'parent' parameter is "none")
jk
Output of print(jk)
|
Output annotation |
---|---|
## A join_keys object containing foreign keys between 3 datasets: |
Title |
## ds1: [col_1] |
Primary keys <dataset name>: [<primary_keys>] |
## <-- ds2: [col_1] |
Foreign keys (arrow <-- denotes ds1 is the parent of
ds2 )
|
## <-- ds3: [col_1] |
|
## ds2: [col_1, col_2] |
|
## --> ds1: [col_1] |
arrow --> denotes ds2 is
a child of ds1
|
## --* (implicit via parent with): ds3 |
Implicit relationship between
ds2 & ds3 (given that they share common keys with same parent) |
## ds3: [col_1, col_3] |
|
## --> ds1: [col_1] |
|
## --* (implicit via parent with): ds2 |
|
## ds4: [no primary keys] |
|
## <-> ds5: [col_5] |
Foreign keys (arrow <-> denotes no parent definition between
datasets)
|
## ds5: [no primary keys] |
|
## <-> ds4: [col_4] |
Accessing and Modifying keys
The subset operator with 2 indices (x[i, j]
) is
used to retrieve the primary/foreign keys. Both indices must be a string
denoting the dataset name.
# Using the jk object defined in "Anatomy of Join Keys"
jk
## A join_keys object containing foreign keys between 5 datasets:
## ds1: [col_1]
## <-- ds2: [col_1]
## <-- ds3: [col_1]
## ds2: [col_1, col_2]
## --> ds1: [col_1]
## --* (implicit via parent with): ds3
## ds3: [col_1, col_3]
## --> ds1: [col_1]
## --* (implicit via parent with): ds2
## ds4: [no primary keys]
## <-> ds5: [col_5]
## ds5: [no primary keys]
## <-> ds4: [col_4]
# Getting primary key of "ds1"
jk["ds1", "ds1"]
## col_1
## "col_1"
# Getting foreign keys between "ds4" and "ds5"
jk["ds4", "ds5"]
## col_4
## "col_5"
Note that there is a symmetry in the keys between ds4
and ds5
relationship:
jk["ds5", "ds4"]
## col_5
## "col_4"
jk["ds5", "ds4"]
## col_5
## "col_4"
When only 1 argument is used this operator will return a
join_keys
object that is filtered accordingly.
# Using the jk object defined in "Anatomy of Join Keys"
jk
## A join_keys object containing foreign keys between 5 datasets:
## ds1: [col_1]
## <-- ds2: [col_1]
## <-- ds3: [col_1]
## ds2: [col_1, col_2]
## --> ds1: [col_1]
## --* (implicit via parent with): ds3
## ds3: [col_1, col_3]
## --> ds1: [col_1]
## --* (implicit via parent with): ds2
## ds4: [no primary keys]
## <-> ds5: [col_5]
## ds5: [no primary keys]
## <-> ds4: [col_4]
# Getting primary key of "ds1"
jk["ds1", "ds1"]
## col_1
## "col_1"
# Getting keys of "ds1" and "ds2"
jk[c("ds1", "ds2")]
## A join_keys object containing foreign keys between 2 datasets:
## ds1: [col_1]
## <-- ds2: [col_1]
## ds2: [col_1, col_2]
## --> ds1: [col_1]
Modifying or adding a key uses the same notation with the assignment
operator <-
. A symmetric relationship will be created
automatically, where the parent (by default) will be the dataset defined
in the first index. Assigning NULL
value will delete the
relationship.
# Adding a new ds5 <-- ds1 key
jk["ds1", "ds5"] <- "a_column"
# Removing an existing key
jk["ds4", "ds5"] <- NULL
Merging Join Keys
The combining of multiple joining keys is achieved using the
c()
generic function, which generates a symmetric and valid
join_keys
object. When encountering identical relationship
pairs, this operation retains the last occurrence of the pair within the
specified argument order.
For added convenience, the function also accommodates
join_key_set
objects created through the
join_key
function. These objects can be provided as the
initial argument or in any other position as needed.
jk1 <- join_keys(join_key("ds1", "ds1", "col_1"))
jk2 <- join_keys(join_key("ds2", "ds2", "col_1"), join_key("ds1", "ds2", "col_1"))
# Merging
c(jk1, jk2)
## A join_keys object containing foreign keys between 2 datasets:
## ds1: [col_1]
## <-- ds2: [col_1]
## ds2: [col_1]
## --> ds1: [col_1]
# Keeping last occurence
c(jk1, jk2, join_keys(join_key("ds2", "ds2", "col_2"), join_key("ds1", "ds2", c("col_1" = "col_2"))))
## A join_keys object containing foreign keys between 2 datasets:
## ds1: [col_1]
## <-- ds2: [col_2]
## ds2: [col_2]
## --> ds1: [col_1]
# Merges join_key and join_key_set objects (from join_key function)
c(jk1, join_key("ds3", "ds3", "col_3"))
## A join_keys object containing foreign keys between 2 datasets:
## ds1: [col_1]
## ds3: [col_3]
Join Keys Relationships
There are 2 types of relationships encoded with joining keys
that are described in the following sections. The primary and
foreign keys are created explicitly using the constructor for
individual keys (join_key
).
Additionally, the join_keys
object infers implicit
relationships when two datasets share foreign keys to a parent dataset,
but not between themselves. These implicit relationships are available
just like another foreign key and can be used to merge datasets, despite
not being defined by the user.
Primary Key with teal_data
When using the teal_data
function, the simplest method
to define the join keys is to use the join_keys
argument.
We can specify the column(s) of the dataset that (together) uniquely
identify rows in the dataset.
library(teal.data)
td_pk <- within(
teal_data(),
ds1 <- transform(iris, id = seq_len(nrow(iris)))
)
join_keys(td_pk) <- join_keys(join_key("ds1", keys = "id"))
join_keys(td_pk)
## A join_keys object containing foreign keys between 1 datasets:
## ds1: [id]
We can extend the previous example and define primary keys for multiple datasets:
td_pk <- within(
td_pk,
{
ds2 <- data.frame(W = 10:1, V = 5:14, M = rep(1:5, 2))
ds3 <- data.frame(V = 5:14, N = 4)
}
)
join_keys(td_pk)["ds2", "ds2"] <- c("V", "W")
join_keys(td_pk)["ds3", "ds3"] <- c("V", "W")
join_keys(td_pk)
## A join_keys object containing foreign keys between 3 datasets:
## ds1: [id]
## ds2: [V, W]
## ds3: [V, W]
Foreign Keys with teal_data
When passing multiple datasets to the teal_data
function, dataset relationships are set using join_keys
and
join_key
functions, which then can be used to merge
datasets together within teal
apps. For users familiar with
SQL
database schema, these relationships are symmetric and
not as strict as SQL
foreign key relationships as
teal
does not validate whether the values defined as
foreign key columns are present in the table.
For example:
library(teal.data)
td_fk <- within(
teal_data(),
{
ds1 <- data.frame(X = 1:10, Y = 21:30, Z = 1:10)
ds2 <- data.frame(W = 10:1, V = 5:14, M = rep(1:5, 2))
ds3 <- data.frame(V = 5:14, N = 4)
}
)
join_keys(td_fk) <- join_keys(
# Primary keys
join_key("ds1", keys = c("X")),
join_key("ds2", keys = c("V", "W")),
join_key("ds3", keys = c("V")),
# Foreign keys
join_key("ds1", "ds2", c("X" = "W")),
join_key("ds2", "ds3", c("V" = "V"))
)
join_keys(td_fk)
## A join_keys object containing foreign keys between 3 datasets:
## ds1: [X]
## <-- ds2: [W]
## ds2: [V, W]
## --> ds1: [X]
## <-- ds3: [V]
## ds3: [V]
## --> ds2: [V]
Implicit Relationships
Two datasets that share common foreign keys to the same parent dataset have an implicit relationship between them that is modeled and accessible in joining keys.
This is a special relationship that is inferred from existing foreign keys. It does not need to be explicitly defined but it can be accessed and overwritten just as any other foreign key.
These implicit relationships can be used to merge 2 datasets together, just as if they were defined manually.
library(teal.data)
td <- within(
teal_data(),
{
ds1 <- data.frame(X = 1:10, Y = 21:30, Z = 1:10)
ds2 <- data.frame(W = 10:1, V = 5:14, M = rep(1:5, 2))
ds3 <- data.frame(V = 5:14, N = 4)
ds4 <- data.frame(V = 5:14, R = rnorm(10))
}
)
join_keys(td) <- join_keys(
# Primary keys
join_key("ds1", keys = c("X")),
join_key("ds2", keys = c("V", "W")),
join_key("ds3", keys = c("V")),
join_key("ds4", keys = c("V")),
# Foreign keys
join_key("ds1", "ds2", c("X" = "W")),
join_key("ds2", "ds3", c("V" = "V")),
join_key("ds1", "ds4", c("X" = "B"))
)
join_keys(td)
## A join_keys object containing foreign keys between 4 datasets:
## ds1: [X]
## <-- ds2: [W]
## <-- ds4: [B]
## ds2: [V, W]
## --> ds1: [X]
## <-- ds3: [V]
## --* (implicit via parent with): ds4
## ds3: [V]
## --> ds2: [V]
## ds4: [V]
## --> ds1: [X]
## --* (implicit via parent with): ds2
join_keys(td)["ds2", "ds4"]
## W
## "B"
Note that the definition above contains no join_key
for
"ds2" <-> "ds4"
Join Keys in ADaM Datasets
teal.data
provides a set of default join keys (primary
and foreign) for datasets named according to the ADaM standard. They are
stored in default_cdisc_join_keys
.
## [1] "ADAE" "ADAETTE" "ADCM" "ADCSSRS" "ADDV" "ADEG"
## [7] "ADEQ5D5L" "ADEX" "ADHY" "ADLB" "ADMH" "ADQLQC"
## [13] "ADQS" "ADRS" "ADSAFTTE" "ADSL" "ADSUB" "ADTTE"
## [19] "ADVS"
When not all default keys are required users can select a smaller set
of datasets by subsetting (for example:
default_cdisc_join_keys[c("ADSL", "ADTTE")]
). Please note,
that for every element that is selected, its parent will also be
returned (if it has one) as well as any pair-wise relationships in the
resulting selection.
default_cdisc_join_keys
## A join_keys object containing foreign keys between 19 datasets:
## ADSL: [STUDYID, USUBJID]
## <-- ADAE: [STUDYID, USUBJID]
## <-- ADEG: [STUDYID, USUBJID]
## <-- ADTTE: [STUDYID, USUBJID]
## <-- ADAETTE: [STUDYID, USUBJID]
## <-- ADCM: [STUDYID, USUBJID]
## <-- ADEX: [STUDYID, USUBJID]
## <-- ADLB: [STUDYID, USUBJID]
## <-- ADMH: [STUDYID, USUBJID]
## <-- ADQS: [STUDYID, USUBJID]
## <-- ADRS: [STUDYID, USUBJID]
## <-- ADSAFTTE: [STUDYID, USUBJID]
## <-- ADVS: [STUDYID, USUBJID]
## <-- ADDV: [STUDYID, USUBJID]
## <-- ADSUB: [STUDYID, USUBJID]
## <-- ADHY: [STUDYID, USUBJID]
## <-- ADQLQC: [STUDYID, USUBJID]
## <-- ADCSSRS: [STUDYID, USUBJID]
## <-- ADEQ5D5L: [STUDYID, USUBJID]
## ADAE: [STUDYID, USUBJID, ASTDTM, AETERM, AESEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADEG: [STUDYID, USUBJID, PARAMCD, AVISIT]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADTTE: [STUDYID, USUBJID, PARAMCD]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADAETTE: [STUDYID, USUBJID, PARAMCD]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADCM: [STUDYID, USUBJID, ASTDTM, CMSEQ, ATC1CD, ATC2CD, ATC3CD, ATC4CD]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADEX: [STUDYID, USUBJID, PARCAT1, PARAMCD, AVISITN, ASTDTM, EXSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADLB: [STUDYID, USUBJID, PARAMCD, AVISIT]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADMH: [STUDYID, USUBJID, ASTDTM, MHSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADQS: [STUDYID, USUBJID, PARAMCD, AVISIT]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADRS: [STUDYID, USUBJID, PARAMCD, AVISIT]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADSAFTTE: [STUDYID, USUBJID, PARAMCD]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADVS: [STUDYID, USUBJID, PARAMCD, AVISIT]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADDV: [STUDYID, USUBJID, ASTDT, DVTERM, DVSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADSUB, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADSUB: [STUDYID, USUBJID, PARAMCD, AVISITN, ADTM, SRCSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADHY, ADQLQC, ADCSSRS, ADEQ5D5L
## ADHY: [STUDYID, USUBJID, PARAMCD, AVISITN, ADTM, SRCSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADQLQC, ADCSSRS, ADEQ5D5L
## ADQLQC: [STUDYID, USUBJID, PARCAT1N, PARAMCD, BASETYPE, AVISITN, ATPTN, ADTM, QSSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADCSSRS, ADEQ5D5L
## ADCSSRS: [STUDYID, USUBJID, PARAMCD, BASETYPE, AVISITN, DTYPE, ADTM]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADEQ5D5L
## ADEQ5D5L: [STUDYID, USUBJID, PARCAT1N, PARAMCD, BASETYPE, AVISITN, ATPTN, ADTM, QSSEQ]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADAE, ADEG, ADTTE, ADAETTE, ADCM, ADEX, ADLB, ADMH, ADQS, ADRS, ADSAFTTE, ADVS, ADDV, ADSUB, ADHY, ADQLQC, ADCSSRS
default_cdisc_join_keys["ADSL"]
## A join_keys object containing foreign keys between 1 datasets:
## ADSL: [STUDYID, USUBJID]
default_cdisc_join_keys["ADTTE"]
## A join_keys object containing foreign keys between 2 datasets:
## ADSL: [STUDYID, USUBJID]
## <-- ADTTE: [STUDYID, USUBJID]
## ADTTE: [STUDYID, USUBJID, PARAMCD]
## --> ADSL: [STUDYID, USUBJID]
default_cdisc_join_keys[c("ADSL", "ADTTE", "ADRS")]
## A join_keys object containing foreign keys between 3 datasets:
## ADSL: [STUDYID, USUBJID]
## <-- ADTTE: [STUDYID, USUBJID]
## <-- ADRS: [STUDYID, USUBJID]
## ADTTE: [STUDYID, USUBJID, PARAMCD]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADRS
## ADRS: [STUDYID, USUBJID, PARAMCD, AVISIT]
## --> ADSL: [STUDYID, USUBJID]
## --* (implicit via parent with): ADTTE