STA130 - Joining Data Frames

Joining Data Frames

A data analysis often involves multiple data frames that must be combined to answer the question that you are interested in.

The dplyr library contains functions for combining or joining two data frames. This lesson will only cover mutating joins or joins that combine variables from two data frames. Filtering joins “match observations in the same way as mutating joins, but affect the observations, not the variables” (R for Data Science, 13.5).

Consider two small data frames, superheros and publishers which were obtained from https://www.superherodb.com. This example is based on the STAT545 Cheatsheet by Jenny Bryan.

superheros

## # A tibble: 5 x 4
##   name            Alignment Gender Publisher    
##   <chr>           <chr>     <chr>  <chr>        
## 1 Luna            good      Female Marvel Comics
## 2 Jessica Sanders good      Female NBC - Heroes 
## 3 Sage            good      Female Marvel Comics
## 4 Kraven II       bad       Male   Marvel Comics
## 5 Batgirl III     good      Female DC Comics

publisher

## # A tibble: 4 x 2
##   Publisher     yr_founded
##   <chr>         <chr>     
## 1 Marvel Comics 1939      
## 2 Image Comics  1992      
## 3 HarperCollins 1989      
## 4 DC Comics     1934

Keys

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation. (Ref: R for Data Science 13.3)

In this example each publisher is uniquely identified by the variable Publisher.

Mutating joins

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

Inner joins

An inner join matches pairs of observations whenever their keys are equal

(Ref: R for Data Science 13.4.2).

inner_join(x = superheros, y = publisher, by = "Publisher")

## # A tibble: 4 x 5
##   name        Alignment Gender Publisher     yr_founded
##   <chr>       <chr>     <chr>  <chr>         <chr>     
## 1 Luna        good      Female Marvel Comics 1939      
## 2 Sage        good      Female Marvel Comics 1939      
## 3 Kraven II   bad       Male   Marvel Comics 1939      
## 4 Batgirl III good      Female DC Comics     1934

Question: What happened?

We lose Jessica Sanders even though she appears in x = superheros, her publisher, NBC - Heroes, does not appear in y = publisher.

Outer joins

An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

A left join keeps all observations in x. A right join keeps all observations in y. A full join keeps all observations in x and y.

These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.

(Ref: R for Data Science 13.4.3).

Left join

Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(x = superheros, y = publisher, by = "Publisher")

## # A tibble: 5 x 5
##   name            Alignment Gender Publisher     yr_founded
##   <chr>           <chr>     <chr>  <chr>         <chr>     
## 1 Luna            good      Female Marvel Comics 1939      
## 2 Jessica Sanders good      Female NBC - Heroes  <NA>      
## 3 Sage            good      Female Marvel Comics 1939      
## 4 Kraven II       bad       Male   Marvel Comics 1939      
## 5 Batgirl III     good      Female DC Comics     1934

Question: What happened?

Jessica Sanders publisher, NBC - Heroes, is not included in y = publisher so a virtual observation is created for Publisher and yr_founded is filled in with NA.

Right join

Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(x = superheros, y = publisher, by = "Publisher")

## # A tibble: 6 x 5
##   name        Alignment Gender Publisher     yr_founded
##   <chr>       <chr>     <chr>  <chr>         <chr>     
## 1 Luna        good      Female Marvel Comics 1939      
## 2 Sage        good      Female Marvel Comics 1939      
## 3 Kraven II   bad       Male   Marvel Comics 1939      
## 4 <NA>        <NA>      <NA>   Image Comics  1992      
## 5 <NA>        <NA>      <NA>   HarperCollins 1989      
## 6 Batgirl III good      Female DC Comics     1934

Question: What happened?

There are no superheros in the superheros data published by Image Comics or HarperCollins, so values for name, Alignment, and Gender are filled in with NA.

Full join

Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join(x = superheros, y = publisher, by = "Publisher")

## # A tibble: 7 x 5
##   name            Alignment Gender Publisher     yr_founded
##   <chr>           <chr>     <chr>  <chr>         <chr>     
## 1 Luna            good      Female Marvel Comics 1939      
## 2 Jessica Sanders good      Female NBC - Heroes  <NA>      
## 3 Sage            good      Female Marvel Comics 1939      
## 4 Kraven II       bad       Male   Marvel Comics 1939      
## 5 Batgirl III     good      Female DC Comics     1934      
## 6 <NA>            <NA>      <NA>   Image Comics  1992      
## 7 <NA>            <NA>      <NA>   HarperCollins 1989

Question: What happened?

Jessica Sanders publisher, NBC- Heros, is not in y = publisher so an NA value was filled in for yr_founded. Image Comics and HarperCollins did not publish any of the characters in x = superheros so name, Alignment, and Genderare filled in with NA.

Using Joined Data Frames

The data frames that are the result of a join are just like any other data frame. This means that after joining data frames to create a new data frame the resulting data frame can, for example, be summarised (e.g., count the number of superheroes from each publisher)

right_join(x = superheros, y = publisher, by = "Publisher") %>%
  group_by(Publisher) %>%
  summarise(n = n())

## # A tibble: 4 x 2
##   Publisher         n
##   <chr>         <int>
## 1 DC Comics         1
## 2 HarperCollins     1
## 3 Image Comics      1
## 4 Marvel Comics     3

or visualized (e.g., visualize the distribution of gender).

right_join(x = superheros, y = publisher, by = "Publisher") %>%
  filter(Gender != "NA") %>%
  ggplot(aes(x = Gender)) + geom_bar()

STA130 - Joining Data Frames

Nathan Taback

2018-10-05

Joining Data Frames

Keys

Mutating joins

Inner joins

Outer joins

Left join

Right join

Full join

Using Joined Data Frames