Tutorial grades will be assigned according to the following marking scheme.
| Component | Mark |
|---|---|
| Attendance for the entire tutorial | 1 | 
| Assigned homework completion | 1 | 
| In-class exercises | 4 | 
| Total | 6 | 
Create a vector year that represents year of study at university.
year <- 1:4

Use the sample() function to simulate selecting 100 students and recording their year of study.
set.seed(10)
sample(year, size = 100, replace = TRUE)
##   [1] 3 2 2 3 1 1 2 2 3 2 3 3 1 3 2 2 1 2 2 4 4 3 4 2 2 3 4 1 4 2 3 1 1 4 2
##  [36] 3 4 4 3 3 2 1 1 3 1 1 1 2 1 4 2 4 1 2 1 3 2 2 2 3 1 1 2 2 4 4 3 2 1 1
##  [71] 1 3 3 3 1 3 2 4 2 1 4 2 1 3 1 1 4 2 1 1 2 3 3 3 2 4 1 3 3 1
results <- sample(year, size = 100, replace = TRUE)
sum(results == 1)
## [1] 25
sum(results == 2)
## [1] 25
sum(results == 3)
## [1] 25
sum(results == 4)
## [1] 25
library(tidyverse)
results_dat <- data.frame(results)
summarise(results_dat, ave = mean(results), sd = sd(results))
##   ave       sd
## 1 2.5 1.123666
ave_years <- function(n){
  results <- sample(1:4, size = n, replace = TRUE)
  mean(results)
}
ave_years(200)
## [1] 2.575

The replicate() function in R can evaluate an expression a fixed number of times. For example, the following code draws a sample of size 8 with replacement from 1:10, computes its mean, and repeats this 5 times:
replicate(n = 5,  mean(sample(1:10, size = 8, replace = TRUE)))
## [1] 7.125 4.625 5.500 5.125 4.250

Use replicate() and the function you wrote in part (e) to simulate sampling 500 students per year for 50 years. Plot the distribution of the average years of study for each year. What are the mean and standard deviation of this distribution? What is the shape of the distribution? Explain.
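As a reference point for this question, the theoretical values can be worked out by hand. This is a sketch that assumes each year of study 1 to 4 is equally likely, which is what sample() does when no probabilities are supplied. The symbols $X$ (one student's year), $\bar{X}$ (the average of $n$ students), and $n$ are notation introduced here, not part of the tutorial.

$$
E[X] = \frac{1 + 2 + 3 + 4}{4} = 2.5, \qquad
\mathrm{Var}(X) = \frac{(1 - 2.5)^2 + (2 - 2.5)^2 + (3 - 2.5)^2 + (4 - 2.5)^2}{4} = 1.25
$$

$$
E[\bar{X}] = 2.5, \qquad \mathrm{SD}(\bar{X}) = \sqrt{\frac{1.25}{n}}
$$

For $n = 200$ this gives $\mathrm{SD}(\bar{X}) \approx 0.079$, and for $n = 500$ it gives $0.05$. Because each average is computed from many independent draws, the central limit theorem suggests the distribution of the averages should be roughly bell-shaped and centred near 2.5.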
aves <- replicate(n = 50, expr = ave_years(200))
dat <- data.frame(aves)
ggplot(dat) + aes(x = aves) + geom_histogram(colour = "black", fill = "grey", bins = 10)
summarise(dat, mean = mean(aves), sd = sd(aves))
##     mean         sd
## 1 2.5002 0.08472019

The Galton data set in the mosaic library contains data from Francis Galton in the 1880s.
library(mosaic)
library(tidyverse)
glimpse(Galton)
## Observations: 898
## Variables: 6
## $ family <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5...
## $ father <dbl> 78.5, 78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5, 75.0, 7...
## $ mother <dbl> 67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 6...
## $ sex    <fct> M, F, F, F, M, M, F, F, M, F, M, M, F, F, F, M, M, M, F...
## $ height <dbl> 73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 6...
## $ nkids  <int> 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6...

Using the Galton data, use R to calculate the average and variance of children’s heights in families 1, 2, and 32. Which family has the largest variance? Explain the meaning of variance in this context.
summarise(filter(Galton, family == 1), n = n(), mean = mean(height), var = (sd(height))^2)
##   n mean  var
## 1 4 70.1 4.28
summarise(filter(Galton, family == 2), n = n(), mean = mean(height), var = (sd(height))^2)
##   n  mean      var
## 1 4 69.25 18.91667
summarise(filter(Galton, family == 32), n = n(), mean = mean(height), var = (sd(height))^2)
##   n mean    var
## 1 5 69.2 16.575

Family 2 has the largest variance of the three families, approximately 18.92. This means that the kids’ heights in this family are more spread out around their mean than in the other two families. Looking at the data makes this clear: two kids’ heights are approximately 73 in, while the other two kids’ heights in this family are approximately 66 in.
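To make the meaning of variance concrete, the family 2 value can be reproduced by hand from the four heights listed for that family (73.5, 72.5, 65.5, 65.5). This is a worked sketch using the usual sample variance with divisor $n - 1$, which is what sd() and var() use in R. The symbols $x_i$, $\bar{x}$, and $s^2$ are notation introduced here.

$$
\bar{x} = \frac{73.5 + 72.5 + 65.5 + 65.5}{4} = 69.25
$$

$$
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
    = \frac{(4.25)^2 + (3.25)^2 + (-3.75)^2 + (-3.75)^2}{3}
    = \frac{56.75}{3} \approx 18.92
$$

Each squared term measures how far one child's height is from the family mean, so a family whose heights sit far from their mean gets a large variance.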
filter(Galton, family == 1 | family == 2 | family == 32)
##    family father mother sex height nkids
## 1       1   78.5   67.0   M   73.2     4
## 2       1   78.5   67.0   F   69.2     4
## 3       1   78.5   67.0   F   69.0     4
## 4       1   78.5   67.0   F   69.0     4
## 5       2   75.5   66.5   M   73.5     4
## 6       2   75.5   66.5   M   72.5     4
## 7       2   75.5   66.5   F   65.5     4
## 8       2   75.5   66.5   F   65.5     4
## 9      32   72.0   62.0   M   74.0     5
## 10     32   72.0   62.0   M   72.0     5
## 11     32   72.0   62.0   M   69.0     5
## 12     32   72.0   62.0   F   67.5     5
## 13     32   72.0   62.0   F   63.5     5

One way to calculate the mean height of kids in each family is to use group_by() in combination with the summarise() function in the dplyr library. We haven’t covered group_by() in class yet, but an example of how to do this is given below. Note that both group_by() and summarise() return a data frame.
Here is an example. Consider a simple data frame marks of the final marks for two (fictitious) students that each took five courses during their first year at UofT. The example below uses group_by() then summarise() to calculate the average mark for each student.
library(tidyverse)
marks <- tibble(student = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                courses = c("STA130", "MAT137", "ECO100", "CSC148", "PHL100",
                            "STA130", "MAT137", "ECO100", "CSC148", "PHL100"),
                grade = c(82, 83, 77, 84, 79, 83, 74, 85, 77, 72))
marks_grouped <- group_by(marks, student)
ave_grades <- summarise(marks_grouped, ave = mean(grade))
ave_grades
## # A tibble: 2 x 2
##   student   ave
##     <dbl> <dbl>
## 1       1  81  
## 2       2  78.2
# verify calculations
mean(c(82, 83, 77, 84, 79))
## [1] 81
mean(c(83, 74, 85, 77, 72))
## [1] 78.2
library(tidyverse)
dat <- summarise(group_by(Galton, family), 
                 mean_kids = mean(height), 
                 mean_father = mean(father))
ggplot(dat) + aes(x = mean_father, y = mean_kids) + geom_point()
dat <- summarise(group_by(Galton, family), 
                 mean_kids = mean(height), 
                 mean_mother = mean(mother))
ggplot(dat) + aes(x = mean_mother, y = mean_kids) + geom_point()

Families with taller fathers and mothers tend to have taller kids on average.
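One way to put a number on the pattern in these scatterplots is a correlation between the family means. The sketch below uses base R only and, so that it runs on its own, only the 13 rows for families 1, 2, and 32 shown earlier. The names galton_subset and family_cor() are hypothetical helpers introduced for illustration. Three families are far too few to draw a conclusion, so the result is only a demonstration of the calculation.

```r
# Rows for families 1, 2 and 32, copied from the filter() output shown earlier
galton_subset <- data.frame(
  family = c(1, 1, 1, 1, 2, 2, 2, 2, 32, 32, 32, 32, 32),
  father = c(rep(78.5, 4), rep(75.5, 4), rep(72.0, 5)),
  mother = c(rep(67.0, 4), rep(66.5, 4), rep(62.0, 5)),
  height = c(73.2, 69.2, 69.0, 69.0,
             73.5, 72.5, 65.5, 65.5,
             74.0, 72.0, 69.0, 67.5, 63.5)
)

# Hypothetical helper: per-family means, then the correlation between the
# parents' heights and the mean height of the kids
family_cor <- function(d) {
  means <- aggregate(cbind(height, father, mother) ~ family, data = d, FUN = mean)
  c(father = cor(means$father, means$height),
    mother = cor(means$mother, means$height))
}

family_cor(galton_subset)
```

With the mosaic library loaded, the same helper could presumably be applied to the full data as family_cor(Galton), since it only relies on the family, father, mother, and height columns.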
An article from https://fivethirtyeight.com explored where people check the weather. Use the weather_check data set in the fivethirtyeight library to answer the following question.
Use the count() function in the dplyr library.
library(fivethirtyeight)
glimpse(weather_check)
## Observations: 928
## Variables: 9
## $ respondent_id       <dbl> 3887201482, 3887159451, 3887152228, 388714...
## $ ck_weather          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## $ weather_source      <chr> "The default weather app on your phone", "...
## $ weather_source_site <chr> NA, NA, NA, NA, "Iphone app", "AccuWeather...
## $ ck_weather_watch    <ord> Very likely, Very likely, Very likely, Som...
## $ age                 <fct> 30 - 44, 18 - 29, 30 - 44, 30 - 44, 30 - 4...
## $ female              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ hhold_income        <ord> $50,000 to $74,999, Prefer not to answer, ...
## $ region              <chr> "South Atlantic", NA, "Middle Atlantic", N...
count(weather_check, weather_source, sort = TRUE)
## # A tibble: 9 x 2
##   weather_source                                            n
##   <chr>                                                 <int>
## 1 The default weather app on your phone                   213
## 2 Local TV News                                           189
## 3 A specific website or app (please provide the answer)   175
## 4 The Weather Channel                                     139
## 5 Internet search                                         130
## 6 Newspaper                                                32
## 7 Radio weather                                            31
## 8 <NA>                                                     11
## 9 Newsletter                                                8

The number of respondents who check the weather on the internet is 213 + 175 + 130 = 518 (the default phone app, a specific website or app, and internet search), and the number who check non-internet sources is 928 - (213 + 175 + 130 + 11) = 399, where 11 is the number of missing responses. So, more people in the survey used the internet to check the weather.
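The arithmetic above can be checked in R. The sketch below copies the counts from the count() output into a named vector so that it runs without the fivethirtyeight library. The names source_counts and internet_sources are hypothetical, and the choice of which sources count as "internet" follows the grouping used in the answer above.

```r
# Counts copied from the count() output (missing responses kept separately)
source_counts <- c(
  "The default weather app on your phone" = 213,
  "Local TV News" = 189,
  "A specific website or app (please provide the answer)" = 175,
  "The Weather Channel" = 139,
  "Internet search" = 130,
  "Newspaper" = 32,
  "Radio weather" = 31,
  "Newsletter" = 8
)
missing_count <- 11

# Assumption: these three sources are treated as internet sources
internet_sources <- c(
  "The default weather app on your phone",
  "A specific website or app (please provide the answer)",
  "Internet search"
)

internet <- sum(source_counts[internet_sources])
non_internet <- sum(source_counts[setdiff(names(source_counts), internet_sources)])

c(internet = internet, non_internet = non_internet)
```

Summing the non-internet sources directly gives the same 399 as subtracting from the 928 respondents, which is a useful check that no category was missed.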
library(tidyverse)
count(weather_check, age)
## # A tibble: 5 x 2
##   age         n
##   <fct>   <int>
## 1 18 - 29   176
## 2 30 - 44   204
## 3 45 - 59   278
## 4 60+       258
## 5 <NA>       12
count(group_by(weather_check, age), weather_source)
## # A tibble: 34 x 3
## # Groups:   age [5]
##    age     weather_source                                            n
##    <fct>   <chr>                                                 <int>
##  1 18 - 29 A specific website or app (please provide the answer)    28
##  2 18 - 29 Internet search                                          35
##  3 18 - 29 Local TV News                                            14
##  4 18 - 29 Newsletter                                                2
##  5 18 - 29 Newspaper                                                 2
##  6 18 - 29 Radio weather                                             5
##  7 18 - 29 The default weather app on your phone                    64
##  8 18 - 29 The Weather Channel                                      26
##  9 30 - 44 A specific website or app (please provide the answer)    45
## 10 30 - 44 Internet search                                          26
## # ... with 24 more rows

The proportion of people aged 18 - 29 who watch The Weather Channel is 26/176 = 0.1477273, and the proportion of people aged 45 - 59 who watch The Weather Channel is 49/278 = 0.176259. So, the 45 - 59 group watches The Weather Channel more often than the 18 - 29 group.
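The within-group proportion can also be computed for every source at once. The sketch below does this for the 18 - 29 group in base R, using the eight counts shown in the output above. The names age_18_29 and props_18_29 are hypothetical.

```r
# Counts for the 18 - 29 age group, copied from the grouped count() output
age_18_29 <- c(
  "A specific website or app (please provide the answer)" = 28,
  "Internet search" = 35,
  "Local TV News" = 14,
  "Newsletter" = 2,
  "Newspaper" = 2,
  "Radio weather" = 5,
  "The default weather app on your phone" = 64,
  "The Weather Channel" = 26
)

# The eight counts add up to the 176 respondents in this age group
sum(age_18_29)

# prop.table() divides each count by the group total
props_18_29 <- prop.table(age_18_29)
round(props_18_29, 4)
```

Dividing by the total for the age group, rather than by all 928 respondents, is what makes the comparison between age groups fair, since the groups have different sizes.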
The R function sample(x, size, replace = TRUE) applied to the vector c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) (which can also be written in R as 0:9) takes a random sample of size size from the digits 0 - 9. If the sampling is truly random, then we would expect each digit 0 - 9 to have an equal chance of being sampled.
dat <- sample(0:9, size = FILL IN CODE, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(FILL IN CODE)
mutate(FILL IN CODE)

dat <- sample(0:9, size = 100, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/100, 3))
## # A tibble: 10 x 3
##      dat     n     f
##    <int> <int> <dbl>
##  1     0    14  0.14
##  2     1     8  0.08
##  3     2     9  0.09
##  4     3    18  0.18
##  5     4     6  0.06
##  6     5    11  0.11
##  7     6    11  0.11
##  8     7     5  0.05
##  9     8     9  0.09
## 10     9     9  0.09

dat <- sample(0:9, size = 1000, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/1000, 3))
## # A tibble: 10 x 3
##      dat     n     f
##    <int> <int> <dbl>
##  1     0   101 0.101
##  2     1    90 0.09 
##  3     2   103 0.103
##  4     3   104 0.104
##  5     4   122 0.122
##  6     5    99 0.099
##  7     6    81 0.081
##  8     7    85 0.085
##  9     8   100 0.1  
## 10     9   115 0.115

dat <- sample(0:9, size = 10000, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/10000, 3))
## # A tibble: 10 x 3
##      dat     n     f
##    <int> <int> <dbl>
##  1     0   981 0.098
##  2     1   986 0.099
##  3     2  1003 0.1  
##  4     3  1041 0.104
##  5     4  1016 0.102
##  6     5  1010 0.101
##  7     6  1005 0.1  
##  8     7   996 0.1  
##  9     8   960 0.096
## 10     9  1002 0.1

If the sampling is truly random, we would expect the proportion of each digit to be 0.10. As the sample size increases from 100 to 10000, the observed proportions get closer to 0.10.
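How close the observed proportions should be to 0.10 can be estimated with the standard error of a sample proportion, sqrt(p(1 - p)/n). The sketch below assumes independent draws with p = 0.10 for each digit. The variable names p, n, and se are introduced here for illustration.

```r
# Expected proportion for each digit under truly random sampling
p <- 0.10

# The three sample sizes used above
n <- c(100, 1000, 10000)

# Standard error of a sample proportion: typical distance from p
se <- sqrt(p * (1 - p) / n)

data.frame(n = n, se = round(se, 4))
```

The standard error shrinks in proportion to 1/sqrt(n), so a sample 100 times larger gives proportions that are typically about 10 times closer to 0.10.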