Instructions

What should I bring to tutorial on September 28?

  • R output (e.g., plots and explanations) for Question 1(b), 1(d), 2(b), 3(b) You can either bring a hardcopy or bring your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

Mark
Attendance for the entire tutorial 1
Assigned homework completiona 1
In-class exercises 4
Total 6

Bring your output for this question to tutorial on Friday September 28 (either a hardcopy or on your laptop).

Practice Problems

Question 1

Use the nycflights13 package to answer the following questions.

  1. Use the weather table to answer the following question: On how many days was there precipitation in the New York area in 2013? (HINT: consider using the distinct() function in dplyr as in the example below).

  2. Use the flights data frame to answer the following questions: What month had the highest proportion of cancelled flights?

  3. How many planes that flew from NYC have a missing date of manufacturer?

  4. Create a vizualization of the distribution of engine types that flew from NYC? Make sure to order the engine categories. Which engine type is most frequent in flights from NYC?

Question 2

The Respiratory Virus Detection Surveillance System collects data from select laboratories across Canada on the number of tests performed and the number of tests positive for influenza and other respiratory viruses. Data are reported on a weekly basis year-round to the Centre for Immunization and Respiratory Infectious Diseases (CIRID), Public Health Agency of Canada. These data are also summarized in the weekly FluWatch report. Visit the website.

  1. The data for the report, Week 1 ending December 30, 2017 is here - click on Table 1. Explain why this data set is not tidy?

  2. For this exercise you will need to install the library rvest. This code will “scrape” the table from the website and load it into an R data frame. Run the following code to download Table 1 directly into the a data frame called fludat.

NB: You will not be responsible for understanding how this code and how the rvest library works (i.e., there will not be a test question on this topic). But, if you are interested in scraping data from the web see.

# Uncomment next line if the rvest package is not installed
# install.packages("rvest") 
library(rvest)
library(tidyverse)

url <- "https://www.canada.ca/en/public-health/services/surveillance/respiratory-virus-detections-canada/2017-2018/respiratory-virus-detections-isolations-week-52-ending-december-30-2017.html"

# download and read table into flu_dat 
flu_dat <- url %>% 
  read_html() %>% 
  html_nodes(xpath = '/html/body/main/div[1]/div[2]/details[1]/table') %>% 
  html_table()

# clean up the file
fludat <- flu_dat[[1]]
dat <- as.data.frame(sapply(select(fludat,2:23), as.numeric))
fludat <- cbind(`Reporting Laboratory` = fludat[,1],dat)

Answer the following questions:

  1. Create a tidy version of fludat. Explain how you made the data tidy.

A tidy versiuon of fludat could mean either observations are: individual reporting labs; provinces and territories (Newfoundland, Nova Scotia, etc.); regions (Atlantic, Quebec, etc.); or country (Canada). For this example solution I’ll pick province to be the observations. So, I’ll tidy up the data so that every row corresponds to a province.

  1. Which provinces and territories have the highest positive rates for Flu A and Flu B.

  2. The odds of testing positive for, say, Flu A in a province or territory is:

\[\frac{\hat p_{\text province}}{(1-\hat p_{\text province})},\] where \(\hat p_{\text province}\) is the proportion that tested positive for Flu A. The odds ratio of testing positive for, say Flu A, in Newfoundland versus Ontario is:

\[\frac{\hat p_{\text Newfoundland}/(1-\hat p_{\text Newfoundland})}{\hat p_{\text Ontario}/(1-\hat p_{\text Ontario})} \]

  • Calculate the odds ratio for testing positive for Flu A in each province versus Ontario. Interpret odds ratio larger than one, less than one, and equal to one.

  • Use the ggplot library to plot the odds ratios. Explain why you selected this type of plot.

  • Try the same plot except take the logarithm of the odds ratio. This is called the log odds. Interpret the log odds.

Question 3

The CIACountries data set is available in the mdsr library. Countries can be categorized by gross domestic product (GDP) - “a monetary measure of the market value of all the final goods and services produced in a period of time, often yearly or quarterly” (see Wikipedia article). The gdp variable contains data on GDP per person (or per capita).

  1. Use boxplots to compare the distribution of roadways in countries with a GDP of at least $50000 compared to less than $50000? Interpret the boxplots. What conclusions can you draw from the comparison?

  2. Write a function to identify outliers using the \(1.5 \times IQR\) rule. Use the function to identify which countries are outliers among the group of countries with a GDP of at least \(50000\) and the group of countries with a GDP less than $50000.