Instructions

What should I bring to tutorial on October 12?

  • R output (e.g., plots and explanations) for Question 1 (a)-(e). You can either bring a hardcopy or bring your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

                                        Mark
Attendance for the entire tutorial      1
Assigned homework completion            1
In-class exercises                      4
Total                                   6

These problems are based on the lesson Joining Data Frames.

Practice Problems

The file heroes_information_exer.csv contains some information on superheroes, and super_hero_powers_exer.csv contains some information on the powers of superheroes.

The following questions are based on data in heroes_information.csv and super_hero_powers.csv.

Question 1

  1. Read both data sets heroes_information.csv and super_hero_powers.csv into R using read_csv from the tidyverse library. Here is the R code. How many variables and observations are in each data frame?
library(tidyverse)
heroinfo_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/Fall2018/week5/heroes_information_exer.csv"
heropower_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/Fall2018/week5/super_hero_powers_exer.csv"

hero_info <- read_csv(heroinfo_url)
hero_power <- read_csv(heropower_url)
glimpse(hero_info)
## Observations: 487
## Variables: 4
## $ name      <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "...
## $ Alignment <chr> "good", "good", "good", "bad", "bad", "good", "good"...
## $ Weight    <dbl> 441, 65, 90, 441, 122, 88, 61, 81, 104, 108, 90, 90,...
## $ Publisher <chr> "Marvel Comics", "Dark Horse Comics", "DC Comics", "...
glimpse(hero_power)
## Observations: 667
## Variables: 4
## $ name         <chr> "3-D Man", "A-Bomb", "Abe Sapien", "Abin Sur", "A...
## $ Agility      <chr> "True", "False", "True", "False", "False", "False...
## $ Flight       <chr> "False", "False", "False", "False", "False", "Tru...
## $ Marksmanship <chr> "False", "False", "True", "False", "False", "Fals...
  1. Suggest a key to join the two data frames.

Use name as the key since it uniquely identifies observations.
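
Whether name really is unique in each data frame can be checked rather than assumed. The following is a minimal sketch of one way to do that, using small toy tibbles (info_toy and power_toy are hypothetical stand-ins with made-up rows, not the real files); the same pattern could be applied to hero_info and hero_power.

```r
library(dplyr)

# Toy stand-ins for hero_info and hero_power (hypothetical rows, not the real data)
info_toy <- tibble(name = c("A-Bomb", "Abe Sapien", "Abin Sur"),
                   Weight = c(441, 65, 90))
power_toy <- tibble(name = c("3-D Man", "A-Bomb", "Abe Sapien"),
                    Flight = c("False", "False", "False"))

# Count how often each name occurs and keep only names that occur more than once
dup_info <- info_toy %>% count(name) %>% filter(n > 1)
dup_power <- power_toy %>% count(name) %>% filter(n > 1)

# Zero rows in both results means name uniquely identifies observations
nrow(dup_info)
nrow(dup_power)
```

If either result has rows, those names are duplicated and a join on name would match one row to several.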

  1. What proportion of superheroes in heroes_information also have data in super_hero_powers?
inner_join(hero_info, hero_power, by = "name") %>% head()
inner_join(hero_info, hero_power, by = "name") %>% summarise(n = n())

The inner join keeps only superheroes that appear in both data frames and has 460 rows, so the proportion is 460/487 = 0.94.
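
The proportion can also be computed directly instead of reading the counts off the output. This is a sketch under the assumption that name is unique in both data frames; it uses toy tibbles (info_toy and power_toy are hypothetical, with made-up rows) and semi_join(), which keeps the rows of the first data frame that have a match in the second.

```r
library(dplyr)

# Toy stand-ins for hero_info and hero_power (hypothetical rows, not the real data)
info_toy <- tibble(name = c("A-Bomb", "Abe Sapien", "Abin Sur", "Abomination"),
                   Weight = c(441, 65, 90, 441))
power_toy <- tibble(name = c("3-D Man", "A-Bomb", "Abe Sapien", "Abin Sur"),
                    Flight = c("False", "False", "False", "False"))

# Rows of info_toy whose name also appears in power_toy
n_matched <- nrow(semi_join(info_toy, power_toy, by = "name"))

# Proportion of heroes in info_toy that also have power data
# (3 of the 4 toy names match; "Abomination" has no row in power_toy)
prop_matched <- n_matched / nrow(info_toy)
prop_matched
```

Replacing the toy tibbles with hero_info and hero_power would give the proportion asked for in the question.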

  1. What is the number of observations, average, median, standard deviation, and inter-quartile range of weight for superheroes for each category of marksmanship? (HINT: use the group_by() function then summarise())
left_join(hero_info, hero_power, by = "name") %>% 
  group_by(Marksmanship) %>% 
  summarise(n = n(), 
            mean_wt = mean(Weight, na.rm = TRUE), 
            sd_wt = sd(Weight, na.rm = TRUE),
            median_wt = median(Weight, na.rm = TRUE), 
            iqr_wt = IQR(Weight, na.rm = TRUE))
  1. Are superheroes with marksmanship thinner compared to those without marksmanship? Create a visualization to compare the distribution of weight between superheroes that have marksmanship and those that don’t have marksmanship. Which distribution has more variability?
left_join(hero_info, hero_power, by = "name") %>% 
  filter(!is.na(Marksmanship)) %>%
  ggplot(aes(x = Marksmanship, y = Weight)) + geom_boxplot()

Superheroes with marksmanship are thinner compared to those without marksmanship. The variability in weight is greater in those without marksmanship compared to those with marksmanship.