A data analysis often involves multiple data frames that must be combined to answer the question that you are interested in.
The dplyr
library contains functions for combining or joining two data frames. This lesson will only cover mutating joins or joins that combine variables from two data frames. Filtering joins “match observations in the same way as mutating joins, but affect the observations, not the variables” (R for Data Science, 13.5).
Consider two small data frames, superheros
and publishers
which were obtained from https://www.superherodb.com. This example is based on the STAT545 Cheatsheet by Jenny Bryan.
superheros
## # A tibble: 5 x 4
## name Alignment Gender Publisher
## <chr> <chr> <chr> <chr>
## 1 Luna good Female Marvel Comics
## 2 Jessica Sanders good Female NBC - Heroes
## 3 Sage good Female Marvel Comics
## 4 Kraven II bad Male Marvel Comics
## 5 Batgirl III good Female DC Comics
publisher
## # A tibble: 4 x 2
## Publisher yr_founded
## <chr> <chr>
## 1 Marvel Comics 1939
## 2 Image Comics 1992
## 3 HarperCollins 1989
## 4 DC Comics 1934
The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation. (Ref: R for Data Science 13.3)
In this example each publisher is uniquely identified by the variable Publisher
.
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
An inner join matches pairs of observations whenever their keys are equal
(Ref: R for Data Science 13.4.2).
inner_join(x = superheros, y = publisher, by = "Publisher")
## # A tibble: 4 x 5
## name Alignment Gender Publisher yr_founded
## <chr> <chr> <chr> <chr> <chr>
## 1 Luna good Female Marvel Comics 1939
## 2 Sage good Female Marvel Comics 1939
## 3 Kraven II bad Male Marvel Comics 1939
## 4 Batgirl III good Female DC Comics 1934
Question: What happened?
We lose Jessica Sanders even though she appears in x = superheros
, her publisher, NBC - Heroes, does not appear in y = publisher
.
An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
A left join keeps all observations in x. A right join keeps all observations in y. A full join keeps all observations in x and y.
These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with
NA
.
(Ref: R for Data Science 13.4.3).
Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
left_join(x = superheros, y = publisher, by = "Publisher")
## # A tibble: 5 x 5
## name Alignment Gender Publisher yr_founded
## <chr> <chr> <chr> <chr> <chr>
## 1 Luna good Female Marvel Comics 1939
## 2 Jessica Sanders good Female NBC - Heroes <NA>
## 3 Sage good Female Marvel Comics 1939
## 4 Kraven II bad Male Marvel Comics 1939
## 5 Batgirl III good Female DC Comics 1934
Question: What happened?
Jessica Sanders publisher, NBC - Heroes, is not included in y = publisher
so a virtual observation is created for Publisher
and yr_founded
is filled in with NA
.
Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join(x = superheros, y = publisher, by = "Publisher")
## # A tibble: 6 x 5
## name Alignment Gender Publisher yr_founded
## <chr> <chr> <chr> <chr> <chr>
## 1 Luna good Female Marvel Comics 1939
## 2 Sage good Female Marvel Comics 1939
## 3 Kraven II bad Male Marvel Comics 1939
## 4 <NA> <NA> <NA> Image Comics 1992
## 5 <NA> <NA> <NA> HarperCollins 1989
## 6 Batgirl III good Female DC Comics 1934
Question: What happened?
There are no superheros in the superheros
data published by Image Comics or HarperCollins, so values for name
, Alignment
, and Gender
are filled in with NA
.
Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
full_join(x = superheros, y = publisher, by = "Publisher")
## # A tibble: 7 x 5
## name Alignment Gender Publisher yr_founded
## <chr> <chr> <chr> <chr> <chr>
## 1 Luna good Female Marvel Comics 1939
## 2 Jessica Sanders good Female NBC - Heroes <NA>
## 3 Sage good Female Marvel Comics 1939
## 4 Kraven II bad Male Marvel Comics 1939
## 5 Batgirl III good Female DC Comics 1934
## 6 <NA> <NA> <NA> Image Comics 1992
## 7 <NA> <NA> <NA> HarperCollins 1989
Question: What happened?
Jessica Sanders publisher, NBC- Heros, is not in y = publisher
so an NA
value was filled in for yr_founded
. Image Comics and HarperCollins did not publish any of the characters in x = superheros
so name
, Alignment
, and Gender
are filled in with NA
.
The data frames that are the result of a join are just like any other data frame. This means that after joining data frames to create a new data frame the resulting data frame can, for example, be summarised (e.g., count the number of superheroes from each publisher)
right_join(x = superheros, y = publisher, by = "Publisher") %>%
group_by(Publisher) %>%
summarise(n = n())
## # A tibble: 4 x 2
## Publisher n
## <chr> <int>
## 1 DC Comics 1
## 2 HarperCollins 1
## 3 Image Comics 1
## 4 Marvel Comics 3
or visualized (e.g., visualize the distribution of gender).
right_join(x = superheros, y = publisher, by = "Publisher") %>%
filter(Gender != "NA") %>%
ggplot(aes(x = Gender)) + geom_bar()