Instructions

What should I bring to tutorial on September 21?

R output (e.g., plots and explanations) for Questions 1 and 2 only. You can bring either a hard copy or your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

	Mark
Attendance for the entire tutorial	1
Assigned homework completion^a	1
In-class exercises	4
Total	6

Practice Problems

Question 1

Most students in undergrad at UofT are either in their first, second, third or fourth year of study. Create an object in R called year that represents year of study at university.

year <- 1:4

Use the sample() function to simulate selecting 100 students and recording their year of study.

set.seed(10)
sample(year, size = 100, replace = TRUE)

##   [1] 3 2 2 3 1 1 2 2 3 2 3 3 1 3 2 2 1 2 2 4 4 3 4 2 2 3 4 1 4 2 3 1 1 4 2
##  [36] 3 4 4 3 3 2 1 1 3 1 1 1 2 1 4 2 4 1 2 1 3 2 2 2 3 1 1 2 2 4 4 3 2 1 1
##  [71] 1 3 3 3 1 3 2 4 2 1 4 2 1 3 1 1 4 2 1 1 2 3 3 3 2 4 1 3 3 1

How many students in your sample are in each year?

results <- sample(year, size = 100, replace = TRUE)
sum(results == 1)

## [1] 25

sum(results == 2)

## [1] 25

sum(results == 3)

## [1] 25

sum(results == 4)

## [1] 25
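The four separate sum() calls can also be collapsed into a single tabulation. The following is a minimal sketch using base R's table(); the seed is fixed only so the run is reproducible, and the counts it produces are not claimed to match the output shown above.

```r
# Tabulate a simulated sample of 100 students by year of study.
set.seed(10)                      # assumption: any seed works; fixed for reproducibility
year <- 1:4
results <- sample(year, size = 100, replace = TRUE)

# factor() with explicit levels keeps a year in the table even if its count is 0
counts <- table(factor(results, levels = year))
counts
sum(counts)                       # always equals the sample size, 100
```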

What are the average and standard deviation of the number of years students in your sample have been studying?

library(tidyverse)
results_dat <- data.frame(results)
summarise(results_dat, ave = mean(results), sd = sd(results))

##   ave       sd
## 1 2.5 1.123666
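For comparison, the values expected under the sampling model can be worked out directly. This sketch assumes each of the four years is equally likely, which is what sample() does by default; the names pop_mean and pop_sd are introduced here for illustration only.

```r
year <- 1:4

# Mean of a value that is equally likely to be 1, 2, 3 or 4
pop_mean <- mean(year)                       # (1 + 2 + 3 + 4) / 4 = 2.5

# Population standard deviation: divide by 4, not by n - 1 as sd() does
pop_sd <- sqrt(mean((year - pop_mean)^2))    # sqrt(1.25), about 1.118

c(mean = pop_mean, sd = pop_sd)
```

Because sd() divides by n - 1, the standard deviation of a sample will usually differ slightly from sqrt(1.25).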

Write a function that returns the average number of years for a given sample size.

ave_years <- function(n){
  results <- sample(1:4, size = n, replace = TRUE)
  mean(results)
}
ave_years(200)

## [1] 2.575

The replicate() function in R can evaluate an expression a fixed number of times. For example, the following code will:

draw a random sample of 8 with replacement from the numbers 1 to 10,
calculate the average of the sample, and
repeat this process 5 times.

replicate(n = 5,  mean(sample(1:10, size = 8, replace = TRUE)))

## [1] 7.125 4.625 5.500 5.125 4.250
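To see what replicate() is doing, it can help to write the same steps as an explicit loop. This is a sketch, not part of the original solution; the seed value and the names means_rep and means_loop are arbitrary choices made here.

```r
# Version 1: replicate() evaluates the expression 5 times
set.seed(1)                                   # arbitrary seed for reproducibility
means_rep <- replicate(n = 5, mean(sample(1:10, size = 8, replace = TRUE)))

# Version 2: the same three steps written as a for loop
set.seed(1)                                   # reset so both versions draw the same samples
means_loop <- numeric(5)
for (i in 1:5) {
  draw <- sample(1:10, size = 8, replace = TRUE)  # draw 8 numbers with replacement
  means_loop[i] <- mean(draw)                     # store the average of this draw
}

# Both versions should give the same five averages
all.equal(means_rep, means_loop)
```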

Use replicate() and the function you wrote in part (e) to simulate sampling 500 students per year for 50 years. Plot the distribution of the average years of study for each year. What are the mean and standard deviation of this distribution? What is the shape of the distribution? Explain.

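# Each call to ave_years() simulates one year; replicate() repeats it for 50 years.
# This run samples 200 students per year (use ave_years(500) for 500 students).
aves <- replicate(n = 50, expr = ave_years(200))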
dat <- data.frame(aves)
ggplot(dat) + aes(x = aves) + geom_histogram(colour = "black", fill = "grey", bins = 10)

summarise(dat, mean = mean(aves), sd = sd(aves))

##     mean         sd
## 1 2.5002 0.08472019
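The question also asks about the shape of the distribution. As a rough check, assuming the four years are equally likely: averages of large samples should be approximately bell-shaped (by the central limit theorem), centred near 2.5, with a standard deviation close to the population standard deviation divided by the square root of the sample size. The solution code above uses 200 students per year while the question asks for 500, so both are shown; the variable names are illustrative.

```r
year <- 1:4

# Standard deviation of a single student's year of study (population version)
pop_sd <- sqrt(mean((year - mean(year))^2))   # sqrt(1.25)

# Approximate standard deviation of the sample average for each sample size
se_200 <- pop_sd / sqrt(200)                  # about 0.079
se_500 <- pop_sd / sqrt(500)                  # 0.05

c(n200 = se_200, n500 = se_500)
```

The simulated standard deviation of about 0.085 reported above is in line with the value for 200 students per year.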

Question 2

The Galton data set in the mosaic library contains data from Francis Galton in the 1880s.

library(mosaic)
library(tidyverse)
glimpse(Galton)

## Observations: 898
## Variables: 6
## $ family <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5...
## $ father <dbl> 78.5, 78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5, 75.0, 7...
## $ mother <dbl> 67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 6...
## $ sex    <fct> M, F, F, F, M, M, F, F, M, F, M, M, F, F, F, M, M, M, F...
## $ height <dbl> 73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 6...
## $ nkids  <int> 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6...

Using the Galton data, use R to calculate the average and variance of children’s heights in families 1, 2, and 32. Which family has the largest variance? Explain the meaning of variance in this context.

summarise(filter(Galton, family == 1), n= n(), mean = mean(height), var = (sd(height))^2)

##   n mean  var
## 1 4 70.1 4.28

summarise(filter(Galton, family == 2), n= n(), mean = mean(height), var = (sd(height))^2)

##   n  mean      var
## 1 4 69.25 18.91667

summarise(filter(Galton, family == 32), n= n(), mean = mean(height), var = (sd(height))^2)

##   n mean    var
## 1 5 69.2 16.575

Family 2 has a variance of about 18.92, the largest among the three families. This means that the kids’ heights in this family are more spread out around their mean than in the other two families. Looking at the data makes this clear: two kids in family 2 are approximately 73 in tall, while the other two are approximately 66 in tall.

filter(Galton, family == 1 | family == 2 | family == 32)

##    family father mother sex height nkids
## 1       1   78.5   67.0   M   73.2     4
## 2       1   78.5   67.0   F   69.2     4
## 3       1   78.5   67.0   F   69.0     4
## 4       1   78.5   67.0   F   69.0     4
## 5       2   75.5   66.5   M   73.5     4
## 6       2   75.5   66.5   M   72.5     4
## 7       2   75.5   66.5   F   65.5     4
## 8       2   75.5   66.5   F   65.5     4
## 9      32   72.0   62.0   M   74.0     5
## 10     32   72.0   62.0   M   72.0     5
## 11     32   72.0   62.0   M   69.0     5
## 12     32   72.0   62.0   F   67.5     5
## 13     32   72.0   62.0   F   63.5     5
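The three summaries can be double-checked in one pass from the 13 rows listed above. This is a sketch in base R that re-enters the heights by hand rather than loading the Galton data; the names fam, fam_mean and fam_var are made up for this example.

```r
# Heights (in inches) copied from the filtered rows above
fam <- data.frame(
  family = c(1, 1, 1, 1, 2, 2, 2, 2, 32, 32, 32, 32, 32),
  height = c(73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5,
             74.0, 72.0, 69.0, 67.5, 63.5)
)

# tapply() applies a function to height separately within each family
fam_mean <- tapply(fam$height, fam$family, mean)
fam_var  <- tapply(fam$height, fam$family, var)

round(rbind(mean = fam_mean, var = fam_var), 3)

# Family with the largest variance
names(fam_var)[which.max(fam_var)]
```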

Do tall fathers have tall kids? Do tall mothers have tall kids? Visualize the relationships between the average height of kids in a family and their fathers’ and mothers’ heights. Explain your answer.

One way to calculate the mean height of kids in each family is to use group_by() in combination with the summarise() function in the dplyr library. We haven’t covered group_by() in class yet, but an example of how to do this is given below. Note that both group_by() and summarise() return a data frame.

Here is an example. Consider a simple data frame marks of the final marks for two (fictitious) students that each took five courses during their first year at UofT. The example below uses group_by() then summarise() to calculate the average mark for each student.

library(tidyverse)

# tibble() replaces the deprecated data_frame()
marks <- tibble(student = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                courses = c("STA130", "MAT137", "ECO100", "CSC148", "PHL100",
                            "STA130", "MAT137", "ECO100", "CSC148", "PHL100"),
                grade = c(82, 83, 77, 84, 79, 83, 74, 85, 77, 72))

marks_grouped <- group_by(marks, student)

ave_grades <- summarise(marks_grouped, ave = mean(grade))
ave_grades

## # A tibble: 2 x 2
##   student   ave
##     <dbl> <dbl>
## 1       1  81  
## 2       2  78.2

# verify calculations
mean(c(82, 83, 77, 84, 79))

## [1] 81

mean(c(83, 74, 85, 77, 72))

## [1] 78.2

library(tidyverse)
dat <- summarise(group_by(Galton, family), 
                 mean_kids = mean(height), 
                 mean_father = mean(father))
ggplot(dat) + aes(x = mean_father, y = mean_kids) + geom_point()

dat <- summarise(group_by(Galton, family), 
                 mean_kids = mean(height), 
                 mean_mother = mean(mother))
ggplot(dat) + aes(x = mean_mother, y = mean_kids) + geom_point()

Tall fathers and mothers have taller kids on average.

Question 3

An article from https://fivethirtyeight.com explored where people check the weather. Use the weather_check data set in the fivethirtyeight library to answer the following questions.

Where do people go to check the weather? Do people use internet sources (phone apps, websites, etc.) more often than non-internet sources? HINT: use the count() function in the dplyr library.

library(fivethirtyeight)
glimpse(weather_check)

## Observations: 928
## Variables: 9
## $ respondent_id       <dbl> 3887201482, 3887159451, 3887152228, 388714...
## $ ck_weather          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## $ weather_source      <chr> "The default weather app on your phone", "...
## $ weather_source_site <chr> NA, NA, NA, NA, "Iphone app", "AccuWeather...
## $ ck_weather_watch    <ord> Very likely, Very likely, Very likely, Som...
## $ age                 <fct> 30 - 44, 18 - 29, 30 - 44, 30 - 44, 30 - 4...
## $ female              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ hhold_income        <ord> $50,000 to $74,999, Prefer not to answer, ...
## $ region              <chr> "South Atlantic", NA, "Middle Atlantic", N...

count(weather_check, weather_source, sort = TRUE)

## # A tibble: 9 x 2
##   weather_source                                            n
##   <chr>                                                 <int>
## 1 The default weather app on your phone                   213
## 2 Local TV News                                           189
## 3 A specific website or app (please provide the answer)   175
## 4 The Weather Channel                                     139
## 5 Internet search                                         130
## 6 Newspaper                                                32
## 7 Radio weather                                            31
## 8 <NA>                                                     11
## 9 Newsletter                                                8

The number of respondents who check the weather on the internet is 213 + 175 + 130 = 518, and the number who use non-internet sources is 928 - (213 + 175 + 130 + 11) = 399, where 11 is the number of respondents with a missing source. So, more people in the survey used the internet to check the weather.
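The arithmetic can be reproduced from the counts in the table above. This sketch types the counts in by hand instead of loading weather_check, and it assumes that the phone app, "specific website or app", and internet search categories are the internet sources; the object names are illustrative.

```r
# Counts copied from the count() output above (NA row handled separately)
source_counts <- c(
  "The default weather app on your phone" = 213,
  "Local TV News" = 189,
  "A specific website or app (please provide the answer)" = 175,
  "The Weather Channel" = 139,
  "Internet search" = 130,
  "Newspaper" = 32,
  "Radio weather" = 31,
  "Newsletter" = 8
)
n_missing <- 11   # respondents with NA for weather_source

# Assumption: these three categories count as internet sources
internet <- c("The default weather app on your phone",
              "A specific website or app (please provide the answer)",
              "Internet search")

n_internet <- sum(source_counts[internet])
n_non_internet <- sum(source_counts) - n_internet

c(internet = n_internet, non_internet = n_non_internet)

# Share of internet users among respondents who named a source
n_internet / (n_internet + n_non_internet)
```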

Do people ages 18-29 watch the Weather Channel at the same rate as people 45-59? Explain.

library(tidyverse)
count(weather_check, age)

## # A tibble: 5 x 2
##   age         n
##   <fct>   <int>
## 1 18 - 29   176
## 2 30 - 44   204
## 3 45 - 59   278
## 4 60+       258
## 5 <NA>       12

count(group_by(weather_check, age), weather_source)

## # A tibble: 34 x 3
## # Groups:   age [5]
##    age     weather_source                                            n
##    <fct>   <chr>                                                 <int>
##  1 18 - 29 A specific website or app (please provide the answer)    28
##  2 18 - 29 Internet search                                          35
##  3 18 - 29 Local TV News                                            14
##  4 18 - 29 Newsletter                                                2
##  5 18 - 29 Newspaper                                                 2
##  6 18 - 29 Radio weather                                             5
##  7 18 - 29 The default weather app on your phone                    64
##  8 18 - 29 The Weather Channel                                      26
##  9 30 - 44 A specific website or app (please provide the answer)    45
## 10 30 - 44 Internet search                                          26
## # ... with 24 more rows

The proportion of people aged 18-29 who watch the Weather Channel is 26/176 = 0.1477273, and the proportion of people aged 45-59 who watch it is 49/278 = 0.176259. So, in this survey, the 45-59 group watches the Weather Channel at a higher rate than the 18-29 group.
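The two rates can be computed side by side. This is a small sketch using the counts quoted in the solution; the count of 49 for the 45-59 group comes from the solution text, since that row is cut off in the printed output, and the object names are illustrative.

```r
# Weather Channel viewers and group sizes, by age group
twc   <- c("18 - 29" = 26,  "45 - 59" = 49)
n_age <- c("18 - 29" = 176, "45 - 59" = 278)

# Dividing by group size makes the two groups comparable
rate <- twc / n_age
round(rate, 4)

# Difference in rates (older group minus younger group)
unname(rate["45 - 59"] - rate["18 - 29"])
```

Comparing raw counts (49 versus 26) would overstate the gap, because the 45-59 group is also larger.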

Question 4

The R function sample(x, size, replace = TRUE) applied to the vector c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) (which can also be written in R as 0:9) takes a random sample of size size from the numbers 0 - 9. If the sampling is truly random, then we would expect each digit 0 - 9 to have an equal chance of being sampled.

Create a data frame that has a random sample of 100 numbers from 0 - 9. Compute the proportion of times that each number occurs (i.e., the proportion of the 100 numbers that are 0, the proportion of the 100 numbers that are 1, etc.). Below is some code to get you started.

dat <- sample(0:9, size = FILL IN CODE, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(FILL IN CODE)
mutate(FILL IN CODE)

dat <- sample(0:9, size = 100, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/100, 3))

## # A tibble: 10 x 3
##      dat     n     f
##    <int> <int> <dbl>
##  1     0    14  0.14
##  2     1     8  0.08
##  3     2     9  0.09
##  4     3    18  0.18
##  5     4     6  0.06
##  6     5    11  0.11
##  7     6    11  0.11
##  8     7     5  0.05
##  9     8     9  0.09
## 10     9     9  0.09

Do the same as in part (a) except take a random sample of 1000 from 0 - 9.

dat <- sample(0:9, size = 1000, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/1000, 3))

## # A tibble: 10 x 3
##      dat     n     f
##    <int> <int> <dbl>
##  1     0   101 0.101
##  2     1    90 0.09 
##  3     2   103 0.103
##  4     3   104 0.104
##  5     4   122 0.122
##  6     5    99 0.099
##  7     6    81 0.081
##  8     7    85 0.085
##  9     8   100 0.1  
## 10     9   115 0.115

Do the same as in part (a) except take a random sample of 10000 from 0 - 9.

dat <- sample(0:9, size = 10000, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/10000, 3))

## # A tibble: 10 x 3
##      dat     n     f
##    <int> <int> <dbl>
##  1     0   981 0.098
##  2     1   986 0.099
##  3     2  1003 0.1  
##  4     3  1041 0.104
##  5     4  1016 0.102
##  6     5  1010 0.101
##  7     6  1005 0.1  
##  8     7   996 0.1  
##  9     8   960 0.096
## 10     9  1002 0.1

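If the sampling is truly random, what proportion of the sample do you expect each of the digits 0 - 9 to make up? Compare your expectation with the distribution of the numbers 0 - 9 that you found in parts (a) - (c). Which part is closest to what you expected?

If the sampling is truly random, we would expect the proportion of each number to be 0.10. The proportions in part (c), the largest sample, are closest to this: they range from 0.096 to 0.104, compared with 0.081 to 0.122 in part (b) and 0.05 to 0.18 in part (a). Larger samples give observed proportions that are closer to the expected 0.10.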
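One way to summarise how close each sample comes to the expected 0.10 is the largest gap between an observed proportion and 0.10. The following is a sketch in base R; max_dev() is a helper made up for this example, the seed is arbitrary, and results will vary from run to run with a different seed.

```r
set.seed(130)   # arbitrary seed for reproducibility

# Largest absolute difference between an observed digit proportion and 0.10
max_dev <- function(n) {
  samp <- sample(0:9, size = n, replace = TRUE)
  # factor() with explicit levels keeps digits that happen not to be sampled
  props <- table(factor(samp, levels = 0:9)) / n
  max(abs(props - 0.10))
}

# One simulated sample for each of the sizes used in parts (a) - (c)
devs <- sapply(c(100, 1000, 10000), max_dev)
devs
```

The gap typically shrinks as the sample size grows, although any single run can deviate from that pattern.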

STA130H1F – Fall 2018

Week 2 Practice Problems - Sample Solutions

N. Taback and N. Moon