Tutorial grades will be assigned according to the following marking scheme.
Item | Mark |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completion | 1 |
In-class exercises | 4 |
Total | 6 |
Create a vector `year` that represents year of study at university.
year <- 1:4
Use the `sample()` function to simulate selecting 100 students and recording their year of study.
set.seed(10)
sample(year, size = 100, replace = TRUE)
## [1] 3 2 2 3 1 1 2 2 3 2 3 3 1 3 2 2 1 2 2 4 4 3 4 2 2 3 4 1 4 2 3 1 1 4 2
## [36] 3 4 4 3 3 2 1 1 3 1 1 1 2 1 4 2 4 1 2 1 3 2 2 2 3 1 1 2 2 4 4 3 2 1 1
## [71] 1 3 3 3 1 3 2 4 2 1 4 2 1 3 1 1 4 2 1 1 2 3 3 3 2 4 1 3 3 1
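Because `set.seed()` fixes the state of the random number generator, rerunning the same seed and the same `sample()` call reproduces the same 100 values. A minimal sketch of that behaviour follows; the object names `first_draw`, `second_draw`, and `third_draw` are illustrative and not part of the tutorial.

```r
year <- 1:4

# Resetting the seed before each call makes the two samples identical.
set.seed(10)
first_draw <- sample(year, size = 100, replace = TRUE)
set.seed(10)
second_draw <- sample(year, size = 100, replace = TRUE)
identical(first_draw, second_draw)

# Without resetting the seed, the next call continues the random stream,
# so it will generally return a different sample.
third_draw <- sample(year, size = 100, replace = TRUE)
```

This is why a later call to `sample()` made without resetting the seed need not match the values printed above.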
results <- sample(year, size = 100, replace = TRUE)
sum(results == 1)
## [1] 25
sum(results == 2)
## [1] 25
sum(results == 3)
## [1] 25
sum(results == 4)
## [1] 25
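The four `sum()` calls above can likely be collapsed into a single tabulation. The sketch below uses base R's `table()`; the seed and the name `counts` are assumptions made for illustration, so the counts it prints will not match the output shown above.

```r
set.seed(130)  # illustrative seed, not the one used in the tutorial
year <- 1:4
results <- sample(year, size = 100, replace = TRUE)

# Wrapping results in factor() with explicit levels keeps a year in the
# table even if it happens never to be sampled.
counts <- table(factor(results, levels = year))
counts

# Convert the counts to proportions of the 100 students.
prop.table(counts)
```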
library(tidyverse)
results_dat <- data.frame(results)
summarise(results_dat, ave = mean(results), sd = sd(results))
## ave sd
## 1 2.5 1.123666
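For comparison, the quantities that the sample mean and standard deviation estimate can be computed directly, assuming each year of study is equally likely (which is how `sample()` draws when no `prob` argument is given). The names `p`, `mu`, and `sigma` are introduced here for illustration only.

```r
year <- 1:4
p <- rep(1 / length(year), length(year))  # equal probability for each year

# Mean of the distribution: sum of value times probability.
mu <- sum(year * p)

# Standard deviation of the distribution (divides by the number of
# outcomes, not n - 1, because this is not a sample estimate).
sigma <- sqrt(sum((year - mu)^2 * p))

c(mu = mu, sigma = sigma)
```

Here `mu` is 2.5 and `sigma` is `sqrt(1.25)`, about 1.118, which is close to the sample values reported above.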
ave_years <- function(n){
  results <- sample(1:4, size = n, replace = TRUE)
  mean(results)
}
ave_years(200)
## [1] 2.575
The `replicate()` function in R can evaluate an expression a fixed number of times. For example, the following code draws 8 numbers from 1 to 10 with replacement, computes their mean, and repeats this 5 times:
replicate(n = 5, mean(sample(1:10, size = 8, replace = TRUE)))
## [1] 7.125 4.625 5.500 5.125 4.250
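One way to see what `replicate()` does is to write the same computation as an explicit loop. This is a sketch for intuition rather than a description of how `replicate()` is implemented; the seed and the names `with_replicate` and `with_loop` are illustrative.

```r
set.seed(1)  # illustrative seed
with_replicate <- replicate(n = 5, mean(sample(1:10, size = 8, replace = TRUE)))

# The same five means, computed one at a time in a for loop.
set.seed(1)
with_loop <- numeric(5)
for (i in 1:5) {
  with_loop[i] <- mean(sample(1:10, size = 8, replace = TRUE))
}

# Both approaches consume the random stream in the same order.
identical(with_replicate, with_loop)
```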
Use `replicate()` and the function you wrote in part (e) to simulate sampling 200 students per year for 50 years. Plot the distribution of the average years of study for each year. What are the mean and standard deviation of this distribution? What is the shape of the distribution? Explain.
aves <- replicate(n = 50, expr = ave_years(200))
dat <- data.frame(aves)
ggplot(dat) + aes(x = aves) + geom_histogram(colour = "black", fill = "grey", bins = 10)
summarise(dat, mean = mean(aves), sd = sd(aves))
## mean sd
## 1 2.5002 0.08472019
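The spread of this distribution can be compared with what theory predicts. Assuming each year of study is equally likely, a single observation has standard deviation `sqrt(1.25)`, so the mean of n = 200 observations (the sample size passed to `ave_years()` above) should have a standard deviation of about `sqrt(1.25 / 200)`, roughly 0.079. By the central limit theorem, the histogram should also look roughly bell-shaped and centred near 2.5. The sketch below checks this with more replications than the 50 used above; the seed and the names `se_theory` and `many_aves` are illustrative choices.

```r
n <- 200                      # students sampled per simulated year
sigma <- sqrt(1.25)           # sd of one draw from 1:4 with equal probabilities
se_theory <- sigma / sqrt(n)  # predicted sd of the sample mean, about 0.079

set.seed(130)  # illustrative seed
many_aves <- replicate(n = 2000,
                       expr = mean(sample(1:4, size = n, replace = TRUE)))

# Compare the predicted spread with the simulated spread and centre.
c(theory = se_theory, simulated = sd(many_aves), centre = mean(many_aves))
```

With only 50 replications, as in the tutorial code, the simulated standard deviation can differ noticeably from the predicted value, which is consistent with the 0.085 reported above.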
The `Galton` data set in the `mosaic` library contains data from Francis Galton in the 1880s.
library(mosaic)
library(tidyverse)
glimpse(Galton)
## Observations: 898
## Variables: 6
## $ family <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5...
## $ father <dbl> 78.5, 78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5, 75.0, 7...
## $ mother <dbl> 67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 6...
## $ sex <fct> M, F, F, F, M, M, F, F, M, F, M, M, F, F, F, M, M, M, F...
## $ height <dbl> 73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 6...
## $ nkids <int> 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6...
Using the `Galton` data, use R to calculate the average and variance of children’s heights in families 1, 2, and 32. Which family has the largest variance? Explain the meaning of variance in this context.
summarise(filter(Galton, family == 1), n= n(), mean = mean(height), var = (sd(height))^2)
## n mean var
## 1 4 70.1 4.28
summarise(filter(Galton, family == 2), n= n(), mean = mean(height), var = (sd(height))^2)
## n mean var
## 1 4 69.25 18.91667
summarise(filter(Galton, family == 32), n= n(), mean = mean(height), var = (sd(height))^2)
## n mean var
## 1 5 69.2 16.575
Family 2 has a variance of about 18.92 square inches, the largest among the three families. This means that the kids’ heights in this family are more spread out around the family mean than in the other two families. Looking at the data makes this clear: two kids are approximately 73 in tall, while the other two kids in this family are approximately 66 in tall.
filter(Galton, family == 1 | family == 2 | family == 32)
## family father mother sex height nkids
## 1 1 78.5 67.0 M 73.2 4
## 2 1 78.5 67.0 F 69.2 4
## 3 1 78.5 67.0 F 69.0 4
## 4 1 78.5 67.0 F 69.0 4
## 5 2 75.5 66.5 M 73.5 4
## 6 2 75.5 66.5 M 72.5 4
## 7 2 75.5 66.5 F 65.5 4
## 8 2 75.5 66.5 F 65.5 4
## 9 32 72.0 62.0 M 74.0 5
## 10 32 72.0 62.0 M 72.0 5
## 11 32 72.0 62.0 M 69.0 5
## 12 32 72.0 62.0 F 67.5 5
## 13 32 72.0 62.0 F 63.5 5
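The variance reported for family 2 can be checked by hand from the four heights listed in the rows above. The sketch below applies the sample variance formula (sum of squared deviations from the mean, divided by n - 1) and compares it with base R's `var()`; the names `heights_f2`, `m`, and `v_by_hand` are illustrative.

```r
# Heights of the four kids in family 2, copied from the rows above.
heights_f2 <- c(73.5, 72.5, 65.5, 65.5)

m <- mean(heights_f2)  # family mean height

# Sample variance: squared deviations from the mean, divided by n - 1.
v_by_hand <- sum((heights_f2 - m)^2) / (length(heights_f2) - 1)

c(mean = m, by_hand = v_by_hand, built_in = var(heights_f2))
```

Both calculations give about 18.92, matching the `summarise()` output for family 2.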
One way to calculate the mean height of kids in each family is to use `group_by()` in combination with the `summarise()` function in the `dplyr` library. We haven’t covered `group_by()` in class yet, but an example of how to do this is given below. Note that both `group_by()` and `summarise()` return a data frame.
Here is an example. Consider a simple data frame `marks` of the final marks for two (fictitious) students that each took five courses during their first year at UofT. The example below uses `group_by()` then `summarise()` to calculate the average mark for each student.
library(tidyverse)
# tibble() replaces the deprecated data_frame()
marks <- tibble(student = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                courses = c("STA130", "MAT137", "ECO100", "CSC148", "PHL100",
                            "STA130", "MAT137", "ECO100", "CSC148", "PHL100"),
                grade = c(82, 83, 77, 84, 79, 83, 74, 85, 77, 72))
marks_grouped <- group_by(marks, student)
ave_grades <- summarise(marks_grouped, ave = mean(grade))
ave_grades
## # A tibble: 2 x 2
## student ave
## <dbl> <dbl>
## 1 1 81
## 2 2 78.2
# verify calculations
mean(c(82, 83, 77, 84, 79))
## [1] 81
mean(c(83, 74, 85, 77, 72))
## [1] 78.2
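As a further cross-check that does not rely on `dplyr`, the same per-student averages can be computed with base R's `tapply()`, which applies a function to a vector split by a grouping variable. This is only an alternative sketch; the vectors below repeat the values from `marks`, and the name `ave_by_student` is illustrative.

```r
# Same student identifiers and grades as in the marks data frame above.
student <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
grade <- c(82, 83, 77, 84, 79, 83, 74, 85, 77, 72)

# Split grade by student and take the mean within each group.
ave_by_student <- tapply(grade, student, mean)
ave_by_student
```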
library(tidyverse)
dat <- summarise(group_by(Galton, family),
                 mean_kids = mean(height),
                 mean_father = mean(father))
ggplot(dat) + aes(x = mean_father, y = mean_kids) + geom_point()
dat <- summarise(group_by(Galton, family),
                 mean_kids = mean(height),
                 mean_mother = mean(mother))
ggplot(dat) + aes(x = mean_mother, y = mean_kids) + geom_point()
Families with taller fathers, and families with taller mothers, tend to have taller kids on average.
An article from https://fivethirtyeight.com explored where people check the weather. Use the `weather_check` data set in the `fivethirtyeight` library to answer the following question.
Use the `count()` function in the `dplyr` library.
library(fivethirtyeight)
glimpse(weather_check)
## Observations: 928
## Variables: 9
## $ respondent_id <dbl> 3887201482, 3887159451, 3887152228, 388714...
## $ ck_weather <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## $ weather_source <chr> "The default weather app on your phone", "...
## $ weather_source_site <chr> NA, NA, NA, NA, "Iphone app", "AccuWeather...
## $ ck_weather_watch <ord> Very likely, Very likely, Very likely, Som...
## $ age <fct> 30 - 44, 18 - 29, 30 - 44, 30 - 44, 30 - 4...
## $ female <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ hhold_income <ord> $50,000 to $74,999, Prefer not to answer, ...
## $ region <chr> "South Atlantic", NA, "Middle Atlantic", N...
count(weather_check, weather_source, sort = TRUE)
## # A tibble: 9 x 2
## weather_source n
## <chr> <int>
## 1 The default weather app on your phone 213
## 2 Local TV News 189
## 3 A specific website or app (please provide the answer) 175
## 4 The Weather Channel 139
## 5 Internet search 130
## 6 Newspaper 32
## 7 Radio weather 31
## 8 <NA> 11
## 9 Newsletter 8
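The number of respondents who check the weather on the internet is 213 + 175 + 130 = 518 (the default phone app, a specific website or app, and internet search), and the number who check non-internet sources is 928 - (213 + 175 + 130 + 11) = 399, where the 11 respondents with a missing source are excluded. So, more people in the survey used the internet to check the weather.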
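The arithmetic above can be reproduced in R from the counts in the `count()` output. This is a sketch that retypes those counts by hand rather than reading them from `weather_check`; the names `source_counts`, `internet_sources`, and `n_missing` are illustrative, and the choice of which sources count as internet-based follows the answer above.

```r
# Counts copied from the count() output above.
source_counts <- c(
  "The default weather app on your phone" = 213,
  "Local TV News" = 189,
  "A specific website or app (please provide the answer)" = 175,
  "The Weather Channel" = 139,
  "Internet search" = 130,
  "Newspaper" = 32,
  "Radio weather" = 31,
  "Newsletter" = 8
)
n_missing <- 11  # respondents with no recorded source (NA)

# Sources treated as internet-based in the answer above.
internet_sources <- c(
  "The default weather app on your phone",
  "A specific website or app (please provide the answer)",
  "Internet search"
)

is_internet <- names(source_counts) %in% internet_sources
n_internet <- sum(source_counts[is_internet])
n_other <- sum(source_counts[!is_internet])
c(internet = n_internet, other = n_other, missing = n_missing)
```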
library(tidyverse)
count(weather_check, age)
## # A tibble: 5 x 2
## age n
## <fct> <int>
## 1 18 - 29 176
## 2 30 - 44 204
## 3 45 - 59 278
## 4 60+ 258
## 5 <NA> 12
count(group_by(weather_check, age), weather_source)
## # A tibble: 34 x 3
## # Groups: age [5]
## age weather_source n
## <fct> <chr> <int>
## 1 18 - 29 A specific website or app (please provide the answer) 28
## 2 18 - 29 Internet search 35
## 3 18 - 29 Local TV News 14
## 4 18 - 29 Newsletter 2
## 5 18 - 29 Newspaper 2
## 6 18 - 29 Radio weather 5
## 7 18 - 29 The default weather app on your phone 64
## 8 18 - 29 The Weather Channel 26
## 9 30 - 44 A specific website or app (please provide the answer) 45
## 10 30 - 44 Internet search 26
## # ... with 24 more rows
The proportion of people aged 18 - 29 who use the Weather Channel is 26/176 ≈ 0.148, and the proportion of people aged 45 - 59 who use it is 49/278 ≈ 0.176. So, a larger share of the 45 - 59 group than of the 18 - 29 group uses the Weather Channel as its weather source.
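The within-group proportions can be computed in R by dividing each count by its age-group total. The sketch below retypes the 18 - 29 counts from the output above, and takes the 45 - 59 figures (49 out of 278) from the answer, since that row is not among the ten printed. The object names are illustrative.

```r
# Counts for the 18 - 29 age group, copied from rows 1 to 8 of the output.
age_18_29 <- c(
  "A specific website or app (please provide the answer)" = 28,
  "Internet search" = 35,
  "Local TV News" = 14,
  "Newsletter" = 2,
  "Newspaper" = 2,
  "Radio weather" = 5,
  "The default weather app on your phone" = 64,
  "The Weather Channel" = 26
)

# Proportion of the 18 - 29 group using each source.
props_18_29 <- round(age_18_29 / sum(age_18_29), 3)
props_18_29

# Weather Channel share in the 45 - 59 group, using the figures quoted above.
prop_45_59 <- round(49 / 278, 3)
prop_45_59
```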
The R function `sample(x, size, replace = TRUE)` applied to the vector `c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)` (this can also be written in R as `0:9`) takes a random sample of size `size` from the numbers 0 - 9. If the sampling is truly random, then we would expect each digit 0 - 9 to have an equal chance of being sampled.
dat <- sample(0:9, size = FILL IN CODE, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(FILL IN CODE)
mutate(FILL IN CODE)
dat <- sample(0:9, size = 100, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/100, 3))
## # A tibble: 10 x 3
## dat n f
## <int> <int> <dbl>
## 1 0 14 0.14
## 2 1 8 0.08
## 3 2 9 0.09
## 4 3 18 0.18
## 5 4 6 0.06
## 6 5 11 0.11
## 7 6 11 0.11
## 8 7 5 0.05
## 9 8 9 0.09
## 10 9 9 0.09
dat <- sample(0:9, size = 1000, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/1000, 3))
## # A tibble: 10 x 3
## dat n f
## <int> <int> <dbl>
## 1 0 101 0.101
## 2 1 90 0.09
## 3 2 103 0.103
## 4 3 104 0.104
## 5 4 122 0.122
## 6 5 99 0.099
## 7 6 81 0.081
## 8 7 85 0.085
## 9 8 100 0.1
## 10 9 115 0.115
dat <- sample(0:9, size = 10000, replace = TRUE)
samp <- tibble(dat)
ssamp_dat <- count(samp, dat)
mutate(ssamp_dat, f = round(n/10000, 3))
## # A tibble: 10 x 3
## dat n f
## <int> <int> <dbl>
## 1 0 981 0.098
## 2 1 986 0.099
## 3 2 1003 0.1
## 4 3 1041 0.104
## 5 4 1016 0.102
## 6 5 1010 0.101
## 7 6 1005 0.1
## 8 7 996 0.1
## 9 8 960 0.096
## 10 9 1002 0.1
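If the sampling is truly random we would expect the proportion of each number to be 0.10. As the sample size grows, the observed proportions get closer to 0.10: with 100 draws they range from 0.05 to 0.18, with 1000 draws from 0.081 to 0.122, and with 10000 draws from 0.096 to 0.104.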
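The effect of sample size can be summarised with a single number per sample: the largest gap between an observed proportion and 0.10. The sketch below is one possible way to do this in base R; the helper `max_dev()`, the seed, and the extra sample size of 100000 are assumptions added for illustration, and the exact values printed depend on the seed.

```r
# Largest absolute difference between an observed digit proportion and 0.10.
max_dev <- function(n) {
  draws <- sample(0:9, size = n, replace = TRUE)
  # factor() with explicit levels keeps digits that were never drawn.
  props <- table(factor(draws, levels = 0:9)) / n
  max(abs(props - 0.10))
}

set.seed(130)  # illustrative seed
sizes <- c(100, 1000, 10000, 100000)
devs <- sapply(sizes, max_dev)
data.frame(size = sizes, max_dev = devs)
```

The gaps typically shrink as `size` increases, in line with the standard error of a proportion, `sqrt(0.1 * 0.9 / n)`.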