Tutorial grades will be assigned according to the following marking scheme.
Mark | |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completiona | 1 |
In-class exercises | 4 |
Total | 6 |
A researcher is interested in studying the association between birthweight and the mother’s smoking status. The babies
data in the openintro
package has information on 1,236 pregnancies in the San Francisco East Bay area from 1960 to 1967. Use the babies
data set to answer the following question.
library(openintro);
glimpse(babies)
## Observations: 1,236
## Variables: 8
## $ case <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ bwt <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 14...
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351...
## $ parity <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, ...
## $ height <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, ...
## $ weight <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120...
## $ smoke <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1...
\(H_0: \mu_{nonSmoker} - \mu_{smoker} = 0\)
\(H_A: \mu_{nonSmoker} - \mu_{smoker} \neq 0,\)
where \(\mu_{nonSmoker}\) is the mean birthweight (in ounces) of babies born to non-smoking mothers and \(\mu_{smoker}\) is the mean birthweight (in ounces) of babies born to smoking mothers.
total_n_obs <- nrow(babies)
n_smokers <- babies %>% filter(smoke == "smoker") %>% summarise(n())
n_nonsmokers <- babies %>% filter(smoke == "nonSmoker") %>% summarise(n())
n_missing_bwt <- babies %>% filter(is.na(smoke))
babies_clean <- babies %>% filter(!is.na(smoke) & !is.na(bwt))
n <- babies_clean %>% summarise(n())
In total, there are 1236 observations in the babies dataset. Among these, 0 of the pregnancies are for mothers who are smokers, and 0 are for mothers who are non-smokers. We can see that 1236 observations have a missing value for the ‘smoke’ variable. However, none of the values of birthweight (bwt) are missing. Thus, after removing observations with missing values for ‘smoke’, the sample size for this hypothesis test is 1226.
# calculate the test statistic
mean_bwt_smokers <- babies_clean %>%
filter(smoke==1) %>%
summarize(mean(bwt))
mean_bwt_nonsmokers <- babies_clean %>%
filter(smoke==0) %>%
summarize(mean(bwt))
test_stat <- as.numeric(mean_bwt_nonsmokers - mean_bwt_smokers)
test_stat
## [1] 8.937666
The test statistic is 8.94. In other words, in our sample the mean birthweight of babies of nonsmoking mothers is 8.94 ounces higher than the mean birthweight of babies of smoking mothers.
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
set.seed(101)
for(i in 1:repetitions){
sim <- babies_clean %>% mutate(smokingStatus = sample(smoke))
sim_test_stat <- sim %>%
group_by(smokingStatus) %>%
summarise(means = mean(bwt)) %>%
summarise(sim_test_stat = diff(means))
simulated_stats[i] <- as.numeric(sim_test_stat)
}
sim <- data_frame(mean_diff = simulated_stats)
sim %>%
filter(mean_diff >= abs(test_stat) | mean_diff <= -abs(test_stat)) %>%
summarise(p_value = n() / repetitions)
We simulated 1,000 values of what the difference in mean birthweight would be for babies of smoking vs non-smoking mothers if we assumed there was no difference (that is, the birthweight values observed were as likely to come from a smoking or non-smoking mother).
There were no simulated values with a difference as large or larger than the difference in mean birthweight that we observed, so our estimate of the p-value is 0. We conclude that we have very strong evidence that there is a difference in the mean birthweight of babies from smoking and non-smoking mothers.
Suppose you are interested in studying altruistic behaviours in men and women. A sample of 200 individuals were asked about what they would do if they received a $100 bill by mail, addressed to their neighbor, but wrongly delivered to them. Would they return it to their neighbor? Of the 69 males sampled, 56 said yes and of the 131 females sampled, 120 said yes.
Hints:
Before you start, think of which variables you want in your dataset and what your observations will be.
The rep()
function is useful, to avoid having to manually type long vectors. For example:
# vector with 50 "yes" values followed by 150 "no" values
myvec <-
c(rep("yes", times = 50), rep("no", 150))
money <-
data_frame(sex = c(rep("male", times = 69), rep("female", times = 131)),
returnMoney = c(
rep("yes", times = 56),
rep("no", times = 69 - 56),
rep("yes", times = 120),
rep("no", times = 131 - 120)
))
money %>%
group_by(sex, returnMoney) %>%
count()
\(H_0: p_{female} - p_{male} = 0\)
\(H_A: p_{female} - p_{male} \neq 0\)
where \(p_{female}\) is proportion of females who return the money and \(p_{male}\) is the proportion of males who return the money.
# calculate the test statistic
n_female <- money %>% filter(sex == "female") %>% summarize(n())
n_male <- money %>% filter(sex == "male") %>% summarize(n())
n_female_return <-
money %>% filter(sex == "female" & returnMoney == "yes") %>% count()
n_male_return <-
money %>% filter(sex == "male" & returnMoney == "yes") %>% count()
test_stat <- n_female_return / n_female - n_male_return / n_male
test_stat
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
set.seed(1)
for (i in 1:repetitions) {
sim <- money %>% mutate(sex = sample(sex)) # shuffle sex labels
yes_male <-
sim %>% filter(sex == "male" & returnMoney == "yes") %>% summarize(n())
as.numeric(yes_male)
yes_female <-
sim %>% filter(sex == "female" & returnMoney == "yes") %>% summarize(n())
as.numeric(yes_female)
sim_test_stat <-
yes_female / as.numeric(n_female) - yes_male / as.numeric(n_male)
simulated_stats[i] <- as.numeric(sim_test_stat)
}
sim <- data_frame(p_diff = simulated_stats)
sim %>% filter(p_diff >= abs(as.numeric(test_stat)) |
p_diff <= -abs(as.numeric(test_stat))) %>%
summarise(p_value = n() / repetitions)
Assuming there is no difference between men and women as to the rate of returning the money, the probability of seeing a difference in the proportions of women vs men who choose to return the money which is at least as extreme as what we observed is 0.033. Thus, we have evidence that the proportion of people who choose to return the money differs between women and men.
child_care
the number of hours parents spend taking care of children in their household under age 6 (feeding, bathing, dressing, holding, or watching them). In our data for 2006, 664 females and 358 males were surveyed for this question. We will examine whether there is a difference in hours spent on child care between females and males.Simulation was used to test if the median of the number of hours spent on child care for Chinese women is the same as the median of the number of hours spent on child care for Chinese men. A plot of the simulation is below, along with some other statistics.
library(openintro)
glimpse(china)
## Observations: 9,788
## Variables: 3
## $ gender <int> 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, ...
## $ edu <int> 1, 5, 2, 2, 3, NA, 2, 2, 2, NA, NA, 2, 2, 1, 5, 5, ...
## $ child_care <int> -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -...
new_china <- china %>% filter(child_care != -99)
glimpse(new_china)
## Observations: 1,022
## Variables: 3
## $ gender <int> 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, ...
## $ edu <int> 1, 2, 1, 1, NA, 2, 4, 3, 2, NA, NA, NA, 3, NA, 2, 1...
## $ child_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
new_china <- new_china %>% mutate(sex = recode(gender, `1`="male", `2`="female"))
new_china %>% group_by(sex) %>% summarize(medians = median(child_care))
test_stat <-
as.numeric(new_china %>% group_by(sex) %>% summarize(medians=median(child_care)) %>%
summarise(test_stat = diff(medians)))
test_stat
## [1] -11
for (i in 1:repetitions)
{
sim <-
new_china %>% mutate(sex = sample(sex)) # shuffle sleep group labels
sim_test_stat <-
sim %>% group_by(sex) %>% summarise(medians = median(child_care)) %>% summarise(sim_test_stat = diff(medians))
simulated_stats[i] <- as.numeric(sim_test_stat)
}
sim <- data_frame(median_diff = simulated_stats)
sim %>%
filter(median_diff >= abs(test_stat) | median_diff <= -1*abs(test_stat)) %>%
summarise(n() / repetitions)
sim %>% ggplot(aes(median_diff)) +
geom_histogram(colour = "black",
fill = "grey",
binwidth = 0.5)
\[H_0: \tilde \mu_{M}= \tilde \mu_{F} \mbox{ vs. } H_A: \tilde \mu_{M} \ne \tilde \mu_{F}.\]
\(\tilde \mu_{M}, \tilde \mu_{F}\) are the median number of child care hours for males and females respectively.
The test statistic used is the difference between the median number of child care hours for males minus the median number of child care hours for females. The value of the test statistic for this data set is -11. The median number of hours Females spent on child care compared to males is 11 hours.
The observed median difference is -11. The largest median difference is 6 and the smallest median difference is 6. The P-value is the proportion of simulations with a test statistic greater than or equal to 11 or less than or equal to -11. This proportion is 0/1000 = 0.
Yes. We simulated 1000 values of what the difference between females and males in the medians of number of hours spent on child care would be if we assumed that there was no difference, that is the values observed were just as likely to come from a female survey participant or a male survey participant. There were no simulated values with a difference as large or larger than the difference in medians that we observed in our data. Our estimate of the P-value is thus 0 (<1/1000 ). We conclude that we have very strong evidence that there is a difference in the median number of hours spent on child care between females and males.
library(openintro)
glimpse(china)
## Observations: 9,788
## Variables: 3
## $ gender <int> 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, ...
## $ edu <int> 1, 5, 2, 2, 3, NA, 2, 2, 2, NA, NA, 2, 2, 1, 5, 5, ...
## $ child_care <int> -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -...
new_china <- china %>% filter(child_care != -99)
glimpse(new_china)
## Observations: 1,022
## Variables: 3
## $ gender <int> 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, ...
## $ edu <int> 1, 2, 1, 1, NA, 2, 4, 3, 2, NA, NA, NA, 3, NA, 2, 1...
## $ child_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
new_china <- new_china %>% mutate(sex = recode(gender, `1`="male", `2`="female"))
new_china %>% group_by(sex) %>% summarize(sds = sd(child_care))
test_stat <-
as.numeric(new_china %>% group_by(sex) %>% summarize(sds=sd(child_care)) %>%
summarise(test_stat = diff(sds)))
test_stat
## [1] -10.17483
for (i in 1:repetitions)
{
sim <-
new_china %>% mutate(sex = sample(sex)) # shuffle sleep group labels
sim_test_stat <-
sim %>% group_by(sex) %>% summarise(sds = sd(child_care)) %>% summarise(sim_test_stat = diff(sds))
simulated_stats[i] <- as.numeric(sim_test_stat)
}
sim <- data_frame(sd_diff = simulated_stats)
sim %>% ggplot(aes(sd_diff)) +
geom_histogram(colour = "black",
fill = "grey",
binwidth = 0.5)
sim %>%
filter(sd_diff >= abs(test_stat) | sd_diff <= -1*abs(test_stat)) %>%
summarise(p_value = n() / repetitions)
In this case, \(H_0:\sigma_M =\sigma_F\) and \(H_A:\sigma_M \ne \sigma_F\), where \(\sigma_M, \sigma_F\) are the standard deviations of the numbers of hours spent on child care for males and females respectively.
The standard deviation in the female group is 30.1 and 19.93 in the male group. There is more variability in the number of hours females spend on child care compared to males. None of the simulated values of the test statistic were as extreme as the observed value of the test statistic -10.17483, so the P-value of the test is 0. Therefore, there is strong evidence that the standard deviation of child care in females and males are different.