Instructions

What should I bring to tutorial on October 19?

  • R output (e.g., plots and explanations) for Questions 2 and 3. You can bring either a hard copy or your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

Component                              Mark
Attendance for the entire tutorial     1
Assigned homework completion           1
In-class exercises                     4
Total                                  6

Practice Problems

Question 1

A researcher is interested in studying the association between birthweight and the mother’s smoking status. The babies data in the openintro package has information on 1,236 pregnancies in the San Francisco East Bay area from 1960 to 1967. Use the babies data set to answer the following question.

library(tidyverse)  # provides glimpse() and the pipe
library(openintro)
glimpse(babies)
## Observations: 1,236
## Variables: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 14...
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351...
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, ...
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, ...
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120...
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1...

  1. Construct null and alternative hypotheses for a hypothesis test of whether birthweight differs between babies of mothers who smoked and babies of non-smoking mothers. Explicitly define the statistical parameters being tested.

  2. How many observations does the data set contain? How many mothers are smokers? How many mothers are non-smokers? Are there any missing values for the birthweight variable? What will your sample size be for the hypothesis test in part 1?

  3. What is the test statistic for the hypothesis test from part 1?

  4. Use simulation (with 1,000 repetitions) to calculate the P-value of the test in part 1. Write 2-4 sentences summarizing your conclusions.
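
The simulation asked for in part 4 can be sketched as a permutation test. The example below is a minimal sketch on made-up data, not the babies data: the data frame toy, its values, the helper mean_diff(), and the seed are all assumptions for illustration. It uses base R only, so it runs without extra packages.

```r
# Hypothetical toy data (NOT the babies data): outcome y and a 0/1 group label
set.seed(130)
toy <- data.frame(
  y     = c(120, 113, 128, 123, 108, 136, 138, 132, 110, 105, 118, 101),
  group = rep(c(0, 1), each = 6)
)

# Drop rows with a missing outcome or group label before testing
toy <- toy[!is.na(toy$y) & !is.na(toy$group), ]

# Test statistic: mean of group 1 minus mean of group 0
mean_diff <- function(y, g) mean(y[g == 1]) - mean(y[g == 0])
test_stat <- mean_diff(toy$y, toy$group)

# Simulate the null distribution by shuffling the group labels
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
for (i in 1:repetitions) {
  simulated_stats[i] <- mean_diff(toy$y, sample(toy$group))
}

# Two-sided P-value: share of simulated statistics at least as extreme
# as the observed one
p_value <- sum(abs(simulated_stats) >= abs(test_stat)) / repetitions
p_value
```

Shuffling the labels breaks any link between group and outcome, which is what the null hypothesis claims, so the shuffled statistics show how large a difference could arise by chance alone.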

Question 2

Suppose you are interested in studying altruistic behaviours in men and women. A sample of 200 individuals was asked what they would do if they received a $100 bill by mail, addressed to their neighbor but wrongly delivered to them. Would they return it to their neighbor? Of the 69 males sampled, 56 said yes, and of the 131 females sampled, 120 said yes.

  1. Construct a tidy data frame for the data described above. Verify that the data frame you created is correct by calculating the following numbers and comparing them to the numbers in the paragraph above:
  • Number of males who return the money
  • Number of males who don’t return the money
  • Number of females who return the money
  • Number of females who don’t return the money

Hints:

  • Before you start, think of which variables you want in your dataset and what your observations will be.

  • The rep() function is useful for avoiding having to type long vectors manually. For example:

# vector with 50 "yes" values followed by 150 "no" values
myvec <-
  c(rep("yes", times = 50), rep("no", 150))

  2. Write the null and alternative hypotheses to test if men and women return the money at different rates. Explicitly define the statistical parameters being tested.

  3. Use simulation (with 1,000 repetitions) to calculate the P-value of the test in part 2. Write 2-4 sentences summarizing your conclusions.
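
The rep() hint can be extended to a whole data frame. The sketch below uses hypothetical counts, not the numbers from the question: the groups "A" and "B", the counts, and the name toy are assumptions chosen only to show the pattern of one row per individual and one column per variable. It uses base R only.

```r
# Hypothetical counts (NOT the numbers from the question):
# group A: 7 "yes" and 3 "no"; group B: 12 "yes" and 3 "no"
toy <- data.frame(
  group    = c(rep("A", times = 10), rep("B", times = 15)),
  response = c(rep("yes", 7), rep("no", 3), rep("yes", 12), rep("no", 3))
)

# One row per individual
nrow(toy)

# Cross-tabulate to check the four counts against the description
counts <- table(toy$group, toy$response)
counts

# A possible test statistic: difference in the proportion of "yes"
# between the two groups (B minus A)
prop_yes <- tapply(toy$response == "yes", toy$group, mean)
test_stat <- unname(prop_yes["B"] - prop_yes["A"])
test_stat
```

The order of the values inside each rep() call has to match the order of the group labels, otherwise the cross-tabulation will not reproduce the intended counts.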

Question 3

(Adapted from ISRS 4.22) The China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments. One of the variables collected on the survey is child_care, the number of hours parents spend taking care of children in their household under age 6 (feeding, bathing, dressing, holding, or watching them). In our data for 2006, 664 females and 358 males were surveyed for this question. We will examine whether there is a difference in hours spent on child care between females and males.

Simulation was used to test if the median of the number of hours spent on child care for Chinese women is the same as the median of the number of hours spent on child care for Chinese men. A plot of the simulation is below, along with some other statistics.

library(tidyverse)  # provides glimpse(), filter(), mutate(), ggplot(), and the pipe
library(openintro)
glimpse(china)
## Observations: 9,788
## Variables: 3
## $ gender     <int> 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, ...
## $ edu        <int> 1, 5, 2, 2, 3, NA, 2, 2, 2, NA, NA, 2, 2, 1, 5, 5, ...
## $ child_care <int> -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -...
new_china <- china %>% filter(child_care != -99)
glimpse(new_china)
## Observations: 1,022
## Variables: 3
## $ gender     <int> 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, ...
## $ edu        <int> 1, 2, 1, 1, NA, 2, 4, 3, 2, NA, NA, NA, 3, NA, 2, 1...
## $ child_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
repetitions <- 1000  
simulated_stats <- rep(NA, repetitions) 

new_china <- new_china %>% mutate(sex = recode(gender, `1`="male", `2`="female"))

new_china %>% group_by(sex) %>% summarize(medians = median(child_care))
## # A tibble: 2 x 2
##   sex    medians
##   <chr>    <dbl>
## 1 female      20
## 2 male         9
test_stat <- 
  as.numeric(new_china %>% group_by(sex) %>% summarize(medians=median(child_care)) %>% 
                summarise(test_stat = diff(medians)))
test_stat
## [1] -11
for (i in 1:repetitions)
{
  # shuffle the sex labels to simulate the null hypothesis
  sim <- new_china %>% mutate(sex = sample(sex))
  sim_test_stat <- sim %>%
    group_by(sex) %>%
    summarise(medians = median(child_care)) %>%
    summarise(sim_test_stat = diff(medians))
  simulated_stats[i] <- as.numeric(sim_test_stat)
}

# tibble() replaces the deprecated data_frame()
sim <- tibble(median_diff = simulated_stats)

sim %>% ggplot(aes(median_diff)) +
  geom_histogram(colour = "black",
                 fill = "grey",
                 binwidth = 0.5)

  1. What are the null and alternative hypotheses being tested?

  2. What is the test statistic used in this hypothesis test and what is its value for this data set? Interpret the test statistic.

  3. Estimate the P-value from the histogram.

  4. Is there evidence that females spend a different amount of time on child care compared to males?

  5. Change the code given at the beginning of the question to conduct a hypothesis test to investigate if the standard deviation of hours spent on child care is different for females and males. What do you conclude?
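
For part 5, the main change is the summary statistic computed within each group. The sketch below shows one way this could look, on made-up data rather than the survey data: the data frame toy, its values, the helper sd_diff(), and the seed are assumptions for illustration. It is written in base R instead of the pipeline above so that it runs without extra packages.

```r
# Hypothetical toy data (NOT the survey data): hours of child care by sex
set.seed(130)
toy <- data.frame(
  hours = c(1, 2, 2, 3, 20, 35, 4, 5, 5, 6, 6, 7),
  sex   = rep(c("female", "male"), each = 6)
)

# Test statistic: SD for males minus SD for females
# (the same order diff() gives when groups are sorted alphabetically)
sd_diff <- function(x, g) sd(x[g == "male"]) - sd(x[g == "female"])
test_stat <- sd_diff(toy$hours, toy$sex)

# Shuffle the sex labels to simulate the null distribution
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
for (i in 1:repetitions) {
  simulated_stats[i] <- sd_diff(toy$hours, sample(toy$sex))
}

# Two-sided P-value
p_value <- sum(abs(simulated_stats) >= abs(test_stat)) / repetitions
p_value
```

In the tidyverse version of the code, the analogous change would be to swap median(child_care) for sd(child_care) in both the observed and the simulated summaries.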