Relational Data

Adithi R. Upadhya

4 August, 2022

Recap

tidyverse package functions like select, group_by, summarize, mutate, filter, case_when, ggplot etc.

Prerequsities

library(tidyverse) 

Shortcut of the day

Alt + Shift + K


Shows all keyboard shortcuts!

Relational Data

  • Usually data is more than one table of data.
  • Collectively, multiple tables of data are called relational data because it is the relations that is important.
  • Relations are always defined between a pair of tables.

Relationships

Verbs

  • Mutating joins, which add new variables to one data frame from matching observations in another.

  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

  • Set operations, which treat observations as if they were set elements.

Keys

  • The variables used to connect each pair of tables are called keys.
  • A key is a variable (or set of variables) that uniquely identifies an observation.

Type of keys

  • A primary key uniquely identifies an observation in its own table.
  • A foreign key uniquely identifies an observation in another table.

Understanding Keys

  • These keys cannot have duplicates. They can be combination of keys as well.
  • Sometimes a table doesn’t have an explicit primary key, there a new row can be created as surrogate key using mutate() or row_number().


A variable can be both a primary key and a foreign key.

Relations

  • A primary key and the corresponding foreign key in another table form a relation.
  • Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights.
  • many-to-many relations are possible as well. Eg- each airline flies to many airports; each airport hosts many airlines.

Data used

band_members and band_instruments

These data sets describe band members and instruments used by the Beatles and Rolling Stones.

glimpse(band_members)
Rows: 3
Columns: 2
$ name <chr> "Mick", "John", "Paul"
$ band <chr> "Stones", "Beatles", "Beatles"
glimpse(band_instruments)
Rows: 3
Columns: 2
$ name  <chr> "John", "Paul", "Keith"
$ plays <chr> "guitar", "bass", "guitar"

Quiz 1

Imagine you wanted to find out which musician from a particular band played which instrument. What variables would you need? Identify the primary key to be used.

  1. name, plays; primary key - name
  2. name, band, plays; primary key - name, band, plays
  1. name, band, plays; primary key - name
  1. band, plays; primary key - band, plays

Concept Map

Mutating Join

  • A mutating join allows you to combine variables from two tables.
  • It first matches observations by their keys, then copies across variables from one table to the other.

Understanding Joins

  • The colored column represents the key variable: these are used to match the rows between the tables.
  • The grey column represents the value column that is carried along for the ride.
  • A join is a way of connecting each row in x to zero, one, or more rows in y.
  • The number of dots = the number of matches = the number of rows in the output.

Inner Join

  • An inner join matches pairs of observations whenever their keys are equal.

  • The output of an inner join is a new data frame that contains the key, the x values, and the y values.

Demo

inner_join(band_instruments, band_members, by = "name")
# A tibble: 2 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles

Left Join

A left join keeps all observations in x.

Demo

band_instruments %>% 
  left_join(band_members, by = "name")
# A tibble: 3 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles
3 Keith guitar <NA>   

Right Join

A right join keeps all observations in y.

Demo

band_instruments %>% 
  right_join(band_members, by = "name")
# A tibble: 3 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles
3 Mick  <NA>   Stones 

Full Join

A full join keeps all observations in x and y.

Demo

band_instruments %>% 
  full_join(band_members, by = "name")
# A tibble: 4 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles
3 Keith guitar <NA>   
4 Mick  <NA>   Stones 

Outer Joins

Quiz 2

How to combine all observations from the two tables band_members and band_instruments to get a table as shown here.

  1. left_join
  1. full_join
  1. right_join
  2. inner_join

Defining key columns

  • The default, by = NULL, uses all variables that appear in both tables, the so called natural join.
  • A character vector, by = "x". This is like a natural join, but uses only some of the common variables.
  • A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y.

Concept Map

Filtering Joins

  • Affect the observations, not the variables. Never have duplicate rows.

  • semi_join(x, y) keeps all observations in x that have a match in y.

  • anti_join(x, y) drops all observations in x that have a match in y.

  • Anti-joins are useful for diagnosing join mismatches.

Semi join and Anti join

Join Problems

  • Start by identifying the variables that form the primary key in each table.

  • Do not empirically look for a combination of variables that give a unique identifier.

  • Check that none of the variables in the primary key are missing.

  • Check that your foreign keys match primary keys in another table.

Set Operations

  • The final type of two-table verb are the set operations.
  • These expect the x and y inputs to have the same variables, and treat the observations like sets.

Set Operations types

  • intersect(x, y): return only observations in both x and y

  • union(x, y): return unique observations in x and y

  • setdiff(x, y): return observations in x, but not in y

intersect and union

Demo

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)
# A tibble: 1 × 2
      x     y
  <dbl> <dbl>
1     1     1

Quiz 3

How are joins and set operations different from each other? (Choose one or many)

  1. set operations need same column / variable names while joins do not
  1. no difference both are same, but have different syntax
  1. set operations have no keys but joins have keys

Resources