Adithi R. Upadhya
4 August, 2022
tidyverse package functions like select, group_by, summarize, mutate, filter, case_when, ggplot etc.
Alt + Shift + K
Shows all keyboard shortcuts!
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations, which treat observations as if they were set elements.
primary key uniquely identifies an observation in its own table.foreign key uniquely identifies an observation in another table.surrogate key using mutate() or row_number().A variable can be both a primary key and a foreign key.
band_members and band_instruments
These data sets describe band members and instruments used by the Beatles and Rolling Stones.
Imagine you wanted to find out which musician from a particular band played which instrument. What variables would you need? Identify the primary key to be used.

key variable: these are used to match the rows between the tables.value column that is carried along for the ride.
An inner join matches pairs of observations whenever their keys are equal.
The output of an inner join is a new data frame that contains the key, the x values, and the y values.

A left join keeps all observations in x.

A right join keeps all observations in y.

A full join keeps all observations in x and y.

How to combine all observations from the two tables band_members and band_instruments to get a table as shown here.

key columnsby = NULL, uses all variables that appear in both tables, the so called natural join.by = "x". This is like a natural join, but uses only some of the common variables.by = c("a" = "b"). This will match variable a in table x to variable b in table y.Affect the observations, not the variables. Never have duplicate rows.
semi_join(x, y) keeps all observations in x that have a match in y.
anti_join(x, y) drops all observations in x that have a match in y.
Anti-joins are useful for diagnosing join mismatches.


Start by identifying the variables that form the primary key in each table.
Do not empirically look for a combination of variables that give a unique identifier.
Check that none of the variables in the primary key are missing.
Check that your foreign keys match primary keys in another table.
intersect(x, y): return only observations in both x and y
union(x, y): return unique observations in x and y
setdiff(x, y): return observations in x, but not in y
intersect and union

How are joins and set operations different from each other? (Choose one or many)