Tutorial grades will be assigned according to the following marking scheme.
Mark | |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completiona | 1 |
In-class exercises | 4 |
Total | 6 |
Bring your output for this question to tutorial on Friday September 28 (either a hardcopy or on your laptop).
Use the nycflights13
package to answer the following questions.
distinct()
function in dplyr
as in the example below).library(nycflights13)
# example code to obtain the distinct number of days in the weather data
weather %>%
distinct(month, day) %>%
summarise(n = n())
library(nycflights13)
weather %>%
select(month, day, precip) %>%
filter(precip != 0) %>%
distinct(month, day) %>%
summarise(n = n())
There was precipitation on 141 days in the NY area in 2013.
library(nycflights13)
flights2 <- flights %>%
group_by(month) %>%
summarize(cancelled = sum(is.na(arr_delay)),
total = n(),
prop_cancelled = cancelled/total) %>%
arrange(desc(prop_cancelled))
flights2
flights2 %>% ggplot(aes(x = month, y = prop_cancelled)) +
geom_point() +
labs(title = "Proportion of Cancelled Flights Each Month",
y = "proportion cancelled")
February had the highest proportion of cancelled flights while October had the lowest. The data shows that February, December, and summer are the periods with the great- est number of cancellations, which may be because they are often also the stormiest and snowiest periods of the year, with severe weather likely to be causing cancellations.
planes %>% summarise(sum(is.na(year) == TRUE))
planes %>%
count(engine) %>%
mutate(engine = reorder(engine, -n)) %>%
ggplot(aes(engine, n)) + geom_col()
Turbo-fan is the most frequent engine type.
The Respiratory Virus Detection Surveillance System collects data from select laboratories across Canada on the number of tests performed and the number of tests positive for influenza and other respiratory viruses. Data are reported on a weekly basis year-round to the Centre for Immunization and Respiratory Infectious Diseases (CIRID), Public Health Agency of Canada. These data are also summarized in the weekly FluWatch report. Visit the website.
The data for the report, Week 1 ending December 30, 2017 is here - click on Table 1. Explain why this data set is not tidy?
For this exercise you will need to install the library rvest
. This code will “scrape” the table from the website and load it into an R data frame. Run the following code to download Table 1 directly into the a data frame called fludat
.
NB: You will not be responsible for understanding how this code and how the rvest
library works (i.e., there will not be a test question on this topic). But, if you are interested in scraping data from the web see.
# Uncomment next line if the rvest package is not installed
# install.packages("rvest")
library(rvest)
library(tidyverse)
url <- "https://www.canada.ca/en/public-health/services/surveillance/respiratory-virus-detections-canada/2017-2018/respiratory-virus-detections-isolations-week-52-ending-december-30-2017.html"
# download and read table into flu_dat
flu_dat <- url %>%
read_html() %>%
html_nodes(xpath = '/html/body/main/div[1]/div[2]/details[1]/table') %>%
html_table()
# clean up the file
fludat <- flu_dat[[1]]
dat <- as.data.frame(sapply(select(fludat,2:23), as.numeric))
fludat <- cbind(`Reporting Laboratory` = fludat[,1],dat)
Answer the following questions:
fludat
. Explain how you made the data tidy.A tidy versiuon of fludat
could mean either observations are: individual reporting labs; provinces and territories (Newfoundland, Nova Scotia, etc.); regions (Atlantic, Quebec, etc.); or country (Canada). For this example solution I’ll pick province to be the observations. So, I’ll tidy up the data so that every row corresponds to a province.
# using the match operator %in%
fludat_prov <- fludat %>% filter(row_number() < 42 & row_number() %in% c(1, 2, 3, 4, 12, 29, 30, 33, 34, 36, 37,38, 39))
# using the or operator |
#fludat_prov <- fludat %>% filter(row_number() < 42 & (row_number() == 1 | #row_number() == 2 | #row_number() == 3 | row_number() == 4 | row_number() == 12 | #row_number() == #29|row_number() == 30 | row_number() == 33 | row_number() == 34 #| row_number() == 36 | #row_number() == 37| row_number() == 38 | row_number() == #39))
glimpse(fludat_prov)
## Observations: 13
## Variables: 23
## $ `Reporting Laboratory` <fct> Newfoundland, Prince Edward Island, No...
## $ `Flu Tested` <dbl> 49, 39, 58, 242, 4417, 2001, 543, 646,...
## $ `A(H1N1)pdm09 Positive` <dbl> 0, 0, 0, 0, 0, 3, 1, 4, 6, 14, 0, 0, 0
## $ `A(H3) Positive` <dbl> 0, 12, 0, 0, 0, 117, 60, 106, 439, 36,...
## $ `A(UnS) Positive` <dbl> 2, 0, 8, 54, 735, 123, 85, 43, 70, 38,...
## $ `Total Flu A Positive` <dbl> 2, 12, 8, 54, 735, 243, 146, 153, 515,...
## $ `Total Flu B Positive` <dbl> 1, 11, 0, 5, 352, 173, 11, 42, 235, 15...
## $ `RSV Tested` <dbl> 49, 39, 58, 242, 3610, 1623, 526, 646,...
## $ `RSV Positive` <dbl> 2, 0, 11, 0, 413, 143, 10, 15, 0, 34, ...
## $ `PIV Tested` <dbl> 49, 3, 4, 87, 721, 895, 212, 646, 1488...
## $ `PIV 1 Positive` <dbl> 4, 0, 0, 4, 7, 17, 6, 26, 31, 7, NA, 0, 0
## $ `PIV 2 Positive` <dbl> 0, 0, 0, 0, 0, 1, 0, 5, 6, 3, NA, 0, 0
## $ `PIV 3 Positive` <dbl> 2, 0, 0, 0, 1, 5, 0, 2, 6, 0, NA, 0, 0
## $ `PIV 4 Positive` <dbl> 0, 0, 0, 1, 9, 4, 0, 3, 13, 5, NA, 0, 0
## $ `Other PIV Positive` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0
## $ `Adeno Tested` <dbl> 49, 3, 4, 87, 728, 866, 212, 646, 1488...
## $ `Adeno Positive` <dbl> 3, 0, 0, 3, 21, 8, 2, 8, 18, 2, NA, 0, 0
## $ `hMPV Tested` <dbl> 49, 3, 4, 87, 692, 841, 139, 646, 1488...
## $ `hMPV Positive` <dbl> 0, 0, 0, 8, 7, 22, 5, 41, 64, 8, NA, 0, 0
## $ `Entero/Rhino Tested` <dbl> 49, 3, 4, 87, NA, 230, 212, 646, 1488,...
## $ `Entero/Rhino Positive` <dbl> 4, 1, 1, 12, NA, 19, 14, 42, 100, 35, ...
## $ `Coron Tested` <dbl> NA, 3, 4, 87, 621, 207, 139, 646, 1488...
## $ `Coron Positive` <dbl> NA, 0, 0, 1, 34, 21, 4, 25, 107, 12, N...
I removed row 42 since it had some text by specifying row_number() < 42
. I had to inspect the data frame fludat
to find out it was row 42. Similarlily, I inspected the table to find out which rows correponded to provinces or territories. Then I used the %in%
operator which checks if the row number is in the list that I specified.
Now, each row corresponds to a province or territory. So, the data is in tidy format.
fludat_prov %>%
mutate(rate_A = round((`Total Flu A Positive` / `Flu Tested`), 2),
rate_B = round((`Total Flu B Positive` / `Flu Tested`), 2)) %>%
arrange(desc(rate_A)) %>%
select(`Reporting Laboratory`, rate_A)
fludat_prov %>%
mutate(rate_A = round((`Total Flu A Positive` / `Flu Tested`), 2),
rate_B = round((`Total Flu B Positive` / `Flu Tested`), 2)) %>%
arrange(desc(rate_A)) %>%
select(`Reporting Laboratory`, rate_B)
The highest rates for Flu A and Flu B is PEI.
\[\frac{\hat p_{\text province}}{(1-\hat p_{\text province})},\] where \(\hat p_{\text province}\) is the proportion that tested positive for Flu A. The odds ratio of testing positive for, say Flu A, in Newfoundland versus Ontario is:
\[\frac{\hat p_{\text Newfoundland}/(1-\hat p_{\text Newfoundland})}{\hat p_{\text Ontario}/(1-\hat p_{\text Ontario})} \]
Calculate the odds ratio for testing positive for Flu A in each province versus Ontario. Interpret odds ratio larger than one, less than one, and equal to one.
Use the ggplot
library to plot the odds ratios. Explain why you selected this type of plot.
Try the same plot except take the logarithm of the odds ratio. This is called the log odds. Interpret the log odds.
Comments:
An odds ratio greater than 1 means that a province has a greater odds of Flu A compared to Ontario.
An odds ratio less than 1 means that a province has a smaller odds of Flu A compared to Ontario.
An odds ratio equal to 1 means that a province has the same odds of Flu A as Ontario.
A bar chart is appropriate since we are plotting a categorical varaible and continuous variable.
The log odds is:
\[\ln\left(\frac{\hat p_{\text Province}/(1-\hat p_{\text Province})}{\hat p_{\text Ontario}/(1-\hat p_{\text Ontario})}\right) = \ln \left(\hat p_{\text Province}/(1-\hat p_{\text Province}) \right) - \ln \left( \hat p_{\text Ontario}/(1-\hat p_{\text Ontario})\right).\]
The difference in the log of the odds between a province and Ontario. If
\[\ln \left(\hat p_{\text Province}/(1-\hat p_{\text Province}) \right) - \ln \left( \hat p_{\text Ontario}/(1-\hat p_{\text Ontario})\right) >0,\]
then the odds of Flu A is greater in the provinces versus Ontario.
If
\[\ln \left(\hat p_{\text Province}/(1-\hat p_{\text Province}) \right) - \ln \left( \hat p_{\text Ontario}/(1-\hat p_{\text Ontario})\right) < 0,\]
then the odds of Flu A is less than in the provinces versus Ontario.
If
\[\ln \left(\hat p_{\text Province}/(1-\hat p_{\text Province}) \right) - \ln \left( \hat p_{\text Ontario}/(1-\hat p_{\text Ontario})\right) =0\]
then the odds of Flu A is the same as Ontario.
The plot of the log odds is easier to interpret. It clearly shows which provinces has greater odds of Flu A compared to Ontario.
ont_odds <- fludat %>%
filter(`Reporting Laboratory` == "Province of Ontario") %>%
mutate(rate = (`Total Flu A Positive` / `Flu Tested`), oddsR = (rate / (1 - rate))) %>%
select(oddsR)
fludat_prov %>%
mutate(rate = (`Total Flu A Positive` / `Flu Tested`),
odds = (rate / (1 - rate)), oddsR = (odds / ont_odds$oddsR)) %>%
select(`Reporting Laboratory`, oddsR) %>%
arrange(desc(oddsR))
fludat_prov %>%
mutate(rate = (`Total Flu A Positive` / `Flu Tested`),
odds = (rate / (1 - rate)), oddsR = (odds / ont_odds$oddsR)) %>%
mutate(province = reorder(`Reporting Laboratory`, -oddsR)) %>%
ggplot(aes(x = province, y = oddsR)) + geom_col(colour = "black", fill = "grey") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_hline(yintercept = 1, colour = "red")
fludat_prov %>%
mutate(rate = (`Total Flu A Positive` / `Flu Tested`),
odds = (rate / (1 - rate)), log_oddsR = log(odds / ont_odds$oddsR)) %>%
mutate(province = reorder(`Reporting Laboratory`, -log_oddsR)) %>%
ggplot(aes(x = province, y = log_oddsR)) +
geom_col(colour = "black", fill = "grey") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The CIACountries
data set is available in the mdsr
library. Countries can be categorized by gross domestic product (GDP) - “a monetary measure of the market value of all the final goods and services produced in a period of time, often yearly or quarterly” (see Wikipedia article). The gdp
variable contains data on GDP per person (or per capita).
library(mdsr)
library(tidyverse)
CIACountries %>%
mutate(gdp_cat = ifelse(gdp >= 50000, "high","med-low")) %>%
filter(is.na(gdp_cat) == F) %>%
ggplot(aes(x = gdp_cat, y= roadways)) + geom_boxplot()
library(mdsr)
library(tidyverse)
is.outlier <- function(x){
ifelse(x <= quantile(x, 0.25, na.rm = T) - 1.5*IQR(x, na.rm = T) |
x >= quantile(x, 0.75, na.rm = T) + 1.5*IQR(x, na.rm = T),
"yes", "no")
}
outl_dat <- CIACountries %>%
mutate(gdp_cat = ifelse(gdp >= 50000, "high","med-low")) %>%
group_by(gdp_cat) %>%
filter(is.na(gdp_cat) == F) %>%
mutate(outlier = is.outlier(roadways)) %>%
filter(outlier == "yes") %>%
select(country, gdp_cat, outlier) %>%
arrange(gdp_cat)
The coutries with GDP in the high category are:
outl_dat %>% filter(gdp_cat == "high")
The coutries in the medium-low category are:
outl_dat %>% filter(gdp_cat == "med-low")