Adithi R. Upadhya
4 August, 2022
tidyverse
package functions like select
, group_by
, summarize
, mutate
, filter
, case_when
, ggplot
etc.
Alt + Shift + K
Shows all keyboard shortcuts!
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations, which treat observations as if they were set elements.
primary key
uniquely identifies an observation in its own table.foreign key
uniquely identifies an observation in another table.surrogate key
using mutate()
or row_number()
.A variable can be both a primary key and a foreign key.
band_members
and band_instruments
These data sets describe band members and instruments used by the Beatles and Rolling Stones.
Imagine you wanted to find out which musician from a particular band played which instrument. What variables would you need? Identify the primary key to be used.
key
variable: these are used to match the rows between the tables.value
column that is carried along for the ride.An inner join matches pairs of observations whenever their keys are equal.
The output of an inner join is a new data frame that contains the key, the x values, and the y values.
A left join
keeps all observations in x
.
A right join
keeps all observations in y
.
A full join
keeps all observations in x
and y
.
How to combine all observations from the two tables band_members and band_instruments to get a table as shown here.
key
columnsby = NULL
, uses all variables that appear in both tables, the so called natural join.by = "x"
. This is like a natural join, but uses only some of the common variables.by = c("a" = "b")
. This will match variable a in table x
to variable b
in table y
.Affect the observations, not the variables. Never have duplicate rows.
semi_join(x, y)
keeps all observations in x
that have a match in y
.
anti_join(x, y)
drops all observations in x
that have a match in y
.
Anti-joins are useful for diagnosing join mismatches.
Start by identifying the variables that form the primary key in each table.
Do not empirically look for a combination of variables that give a unique identifier.
Check that none of the variables in the primary key are missing.
Check that your foreign keys match primary keys in another table.
intersect(x, y)
: return only observations in both x
and y
union(x, y)
: return unique observations in x
and y
setdiff(x, y)
: return observations in x
, but not in y
intersect
and union
How are joins and set operations different from each other? (Choose one or many)