Instructions

What should I bring to tutorial on September 21?

R output (e.g., plots and explanations) for Questions 1 and 2 only. You can bring either a hardcopy or your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

Component	Mark
Attendance for the entire tutorial	1
Assigned homework completion^a	1
In-class exercises	4
Total	6

Practice Problems

Question 1

Most students in undergrad at UofT are either in their first, second, third or fourth year of study. Create an object in R called year that represents year of study at university.

Insert your answer here

Use the sample() function to simulate selecting 100 students and recording their year of study.

Insert your answer here

How many students in your sample are in each year?

Insert your answer here

What are the average and standard deviation of the number of years the students in your sample have been studying?

Insert your answer here

Write a function that returns the average number of years for a given sample size.

Insert your answer here

The replicate() function in R can evaluate an expression a fixed number of times. For example, the following code will:

draw a random sample of 8 with replacement from the numbers 1 to 10,
calculate the average of the sample, and
repeat this process 5 times.

replicate(n = 5,  mean(sample(1:10, size = 8, replace = TRUE)))

## [1] 5.750 6.750 4.250 4.875 5.875
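
Because sample() draws at random, the five averages above will differ each time the code is run. As a minimal sketch (the helper name roll_mean and the seed value 130 are illustrative assumptions, not part of the course material), calling set.seed() first makes the draws reproducible, and replicate() works the same way with a user-defined function:

```r
# Hypothetical helper: average of n rolls of a fair six-sided die
roll_mean <- function(n) {
  mean(sample(1:6, size = n, replace = TRUE))
}

set.seed(130)  # arbitrary seed, chosen only for illustration
first_run <- replicate(n = 5, roll_mean(8))

set.seed(130)  # resetting the seed reproduces the same draws
second_run <- replicate(n = 5, roll_mean(8))

identical(first_run, second_run)  # TRUE
```

Without the calls to set.seed(), the two runs would almost certainly give different averages.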

Use replicate() and the function you wrote in part (e) to simulate sampling 500 students per year for 50 years. Plot the distribution of the average years of study for each year. What are the mean and standard deviation of this distribution? What is the shape of the distribution? Explain.

Insert your answer here

Question 2

The Galton data set in the mosaic library contains data from Francis Galton in the 1880s.

library(mosaic)
library(tidyverse)
glimpse(Galton)

## Observations: 898
## Variables: 6
## $ family <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5...
## $ father <dbl> 78.5, 78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5, 75.0, 7...
## $ mother <dbl> 67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 6...
## $ sex    <fct> M, F, F, F, M, M, F, F, M, F, M, M, F, F, F, M, M, M, F...
## $ height <dbl> 73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 6...
## $ nkids  <int> 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6...

Using the Galton data, use R to calculate the average and variance of children’s heights in families 1, 2, and 32. Which family has the largest variance? Explain the meaning of variance in this context.

Insert your answer here

Do tall fathers have tall kids? Do tall mothers have tall kids? Visualize the relationships between the average height of kids in a family and their father’s and mother’s heights. Explain your answer.

One way to calculate the mean height of kids in each family is to use the group_by() function in combination with the summarise() function, both from the dplyr library. We haven’t covered group_by() in class yet, but an example of how to do this is given below. Note that both group_by() and summarise() return a data frame.

Here is an example. Consider a simple data frame marks of the final marks for two (fictitious) students that each took five courses during their first year at UofT. The example below uses group_by() then summarise() to calculate the average mark for each student.

library(tidyverse)

# tibble() replaces data_frame(), which is deprecated in current versions of the tibble package
marks <- tibble(student = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                courses = c("STA130", "MAT137", "ECO100", "CSC148", "PHL100",
                            "STA130", "MAT137", "ECO100", "CSC148", "PHL100"),
                grade = c(82, 83, 77, 84, 79, 83, 74, 85, 77, 72))

marks_grouped <- group_by(marks, student)

ave_grades <- summarise(marks_grouped, ave = mean(grade))
ave_grades

## # A tibble: 2 x 2
##   student   ave
##     <dbl> <dbl>
## 1       1  81  
## 2       2  78.2

# verify calculations
mean(c(82, 83, 77, 84, 79))

## [1] 81

mean(c(83, 74, 85, 77, 72))

## [1] 78.2
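
The two steps in the example above can also be chained with the pipe operator %>%. The sketch below is one equivalent way to write it, not a required approach. It rebuilds the grades with tibble() and adds an n_courses column (a hypothetical name) using n(), purely to show that summarise() can compute several summaries at once.

```r
library(dplyr)

# Same grades as in the example above, without the course names
marks <- tibble(student = rep(c(1, 2), each = 5),
                grade = c(82, 83, 77, 84, 79, 83, 74, 85, 77, 72))

# group the rows by student, then summarise each group
ave_grades <- marks %>%
  group_by(student) %>%
  summarise(ave = mean(grade), n_courses = n())
ave_grades
```

The result has one row per student, with the same averages (81 and 78.2) as the two-step version.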

Insert your answer here

Question 3

An article from FiveThirtyEight explored where people check the weather. Use the weather_check data set in the fivethirtyeight library to answer the following questions.

Where do people go to check the weather? Do people use internet sources (phone apps, websites, etc.) more often than non-internet sources? HINT: use the count() function in the dplyr library.

library(fivethirtyeight)
glimpse(weather_check)

## Observations: 928
## Variables: 9
## $ respondent_id       <dbl> 3887201482, 3887159451, 3887152228, 388714...
## $ ck_weather          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## $ weather_source      <chr> "The default weather app on your phone", "...
## $ weather_source_site <chr> NA, NA, NA, NA, "Iphone app", "AccuWeather...
## $ ck_weather_watch    <ord> Very likely, Very likely, Very likely, Som...
## $ age                 <fct> 30 - 44, 18 - 29, 30 - 44, 30 - 44, 30 - 4...
## $ female              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ hhold_income        <ord> $50,000 to $74,999, Prefer not to answer, ...
## $ region              <chr> "South Atlantic", NA, "Middle Atlantic", N...
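
As a minimal sketch of how count() behaves, the code below uses a small made-up tibble rather than weather_check (the object toy, its column source, and its values are hypothetical). count() returns one row per distinct value, with a column n holding the number of rows that have that value.

```r
library(dplyr)

# Made-up responses, for illustration only
toy <- tibble(source = c("app", "tv", "app", "radio", "app", "tv"))

# one row per distinct source, with the number of occurrences in column n
toy_counts <- count(toy, source)
toy_counts
```

Here "app" appears three times, "tv" twice, and "radio" once, so the values in n add up to the six rows of toy.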

Insert your answer here

Do people aged 18-29 watch the Weather Channel at the same rate as people aged 45-59? Explain.

Insert your answer here

Question 4

The R function sample(x, size, replace = TRUE) applied to the vector c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) (which can also be written in R as 0:9) takes a random sample of size size from the numbers 0 - 9. If the sampling is truly random, then we would expect each digit, 0 - 9, to have an equal chance of being sampled.
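
A small sketch of the call described above, drawing only five digits (the seed 2018 and the object names digits and draw are arbitrary choices for illustration). When no prob argument is supplied, sample() gives every element of x the same probability of being drawn.

```r
digits <- 0:9
all(digits == c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))  # TRUE: both forms hold the same values

set.seed(2018)  # arbitrary seed so the draw can be reproduced
draw <- sample(digits, size = 5, replace = TRUE)
draw

# every sampled value is one of the digits 0 - 9
all(draw %in% digits)  # TRUE
```

Because replace = TRUE, the same digit can appear more than once in draw.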

Create a data frame that has a random sample of 100 numbers from 0 - 9. Compute the proportion of times that each number occurs (i.e., the proportion of the 100 numbers that are 0, the proportion of the 100 numbers that are 1, etc.). Below is some code to get you started.

dat <- sample(0:9, size = FILL IN CODE, replace = TRUE)
samp <- tibble(dat)  # tibble() replaces the deprecated data_frame()
ssamp_dat <- count(FILL IN CODE)
mutate(FILL IN CODE)

Insert your answer here

Do the same as in part (a) except take a random sample of 1000 from 0 - 9.

Insert your answer here

Do the same as in part (a) except take a random sample of 10000 from 0 - 9.

Insert your answer here

If the sampling is truly random, what proportion of the sample do you expect each number 0 - 9 to make up? Compare your expectation with the distribution of the numbers 0 - 9 that you found in parts (a) - (c). Which part is closest to what you expected?

Insert your answer here

STA130H1F – Fall 2018

Week 2 Practice Problems

N. Taback and N. Moon
