Instructions

What should I bring to tutorial on February 9?

Bring your answer to question 2.

First steps to answering these questions.

  • Download this R Notebook directly into RStudio by typing the following code into the RStudio console window.
file_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/week5/Week5PracticeProblems-student.Rmd"
download.file(url = file_url , destfile = "Week5PracticeProblems-student.Rmd")

Look for the file “Week5PracticeProblems-student.Rmd” under the Files tab then click on it to open.

  • Change the subtitle to “Week 5 Practice Problems Solutions” and change the author to your name and student number.

  • Type your answers below each question. Remember that R code chunks can be inserted directly into the notebook by choosing Insert R from the Insert menu (see Using R Markdown for Class Assignments). In addition this R Markdown cheatsheet, and reference are great resources as you get started with R Markdown.

The questions in this set of practice problems are adapted from Introductory Statistics with Randomization and Simulation from OpenIntro (ISRS).

Practice Problems

Question 1

In lecture this week, we looked at examples of hypothesis tests for comparing two groups in two situations: (1) comparing proportions between two groups and (2) comparing means between two groups. For the following scenarios, identify whether (1) proportions or (2) means are appropriate for comparing the variable of interest between the groups.

  1. A health survey collected body mass index (BMI, measured in kg/m\(^2\)). Researchers were interested in whether BMI differed between people who consume alcohol and people who do not consume alcohol.

  2. A health survey collected body mass index (BMI, measured in kg/m\(^2\)). Researchers were interested in whether the prevalence of obesity (BMI of 30 kg/m\(^2\) or over) is the same for people who consume alcohol and people who do not consume alcohol.

  3. A study was carried out to examine whether the sex of a baby is related to whether the baby’s mother smoked while she was pregnant.

  4. Using results from a survey of recent college graduates, we would like to compare employment rates between graduates of statistics programs of study and graduates from economics programs of study.

  5. Using results from a survey of recent college graduates, we would like to compare salary between graduates of statistics programs of study and graduates from economics programs of study.

Question 2

Bring your output for this question to tutorial on Friday February 9 (either a hardcopy or on your laptop).

(Adapted from ISRS 3.36) A 2012 Gallup poll surveyed Americans about their employment status and whether or not they have diabetes. The survey results indicate that 1.5% of the 47,774 employed (full or part time) and 2.5% of the 5,855 unemployed 18-29 year olds have diabetes.

To create a tidy data frame for this situation, note that 1.5% of 47,774 is 717 people and 2.5% of 5,855 is 146 people. The R command rep creates a vector which replicates its first argument the number of times indicated by its second argument. For example, the following command creates a vector with 5 elements, each of which is “hello”. (Try it.)

rep("hello",5)

Here’s how to create the data for this question. Create and view this data frame to verify that it is what you expect.

library(tidyverse)

databig <- data_frame(group=c(rep("employed",47774),rep("unemployed",5855)),
                        diabetes=c(rep("yes",717),rep("no",47774-717),rep("yes",146),rep("no",5855-146)))
  1. Carry out an hypothesis test to test the hypothesis that there is no difference in the proportion of people with diabetes between people who are employed and people who are unemployed.

  2. Suppose that Gallup had only observed about one-tenth the number of people, but still with 1.5% of employed people having diabetes and 2.5% of unemployed people having diabetes. We can create a data frame for this scenario with this code:

datasmall <- data_frame(group=c(rep("employed",4777),rep("unemployed",586)),
                        diabetes=c(rep("yes",72),rep("no",4777-72),rep("yes",15),rep("no",586-15)))

Repeat part a. for this smaller data set.

  1. In both parts a. and b., the difference in the proportions between the two groups was only 1%, but the number of people in the data differed. How did this affect the results of your hypothesis tests?

Question 3

(Adapted from ISRS 4.22) The China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments. One of the variables collected on the survey is the number of hours parents spend taking care of children in their household under age 6 (feeding, bathing, dressing, holding, or watching them). In our data for 2006, 664 females and 358 males were surveyed for this question. We will examine whether there is a difference in hours spent on child care between females and males.

  1. The data are available in the openintro package in the china dataframe. Load the data and look at it using
library(openintro)
glimpse(china)
    1. How many observations are in the data? Is this what you expected?
    2. What are the values of the child_care variable for the observations shown (which are the first observations in the data frame)? Is this what you expected?
  1. In this part, we’ll clean up the data.
    1. Many more people were surveyed than the 664 females and 358 males who were asked the child care question. For the people who were surveyed but not asked this question, an impossible response was entered in the data for the number of hours spent on child care. Create a new data frame new_china removing the people with this value from the data. (What dplyr function should you use? For indicating which values to remove, the operator != means not equal.)
new_china <- china %>% DPLYR_FUNCTION( CODE THAT INDICATES WHICH VALUES TO REMOVE )
    1. The gender variable is coded with an integer, which is harder to remember than a coding that makes it clear which value corresponds to males and which corresponds to females. Use the following code to create a new variable called sex in the data frame new_china. Look at the data to verify that it has accomplished what you expect.
mutate(sex = recode(gender, `1`="male", `2`="female"))
  1. Now investigate the child_care variable for males and females.
  1. Use group_by and summarize functions to find the average and median values of child_care for females and males. Within summarize, the average can by found using the function mean(VARIABLE_NAME) and the median can be found using the function median(VARIABLE_NAME). Recall that the median of child_care is the value such that 50% of people reported spending more hours taking care of children than this value, and 50% of people reported spending fewer hours taking care of children than this value.
    1. Construct boxplots and histograms of child_care for each sex. How do the distributions of child_care compare for males and females? Look at the plots so see where the values of the means and medians are located. (For code to emulate for the boxplots, see the code on page 27 of the ggplot slides from Week 1. For the histograms, try combining the code to create a histogram (page 14 of the slides) with a facet_wrap (page 41 of the slides).)
  1. For these data, why might you want to look at medians rather than means?

  2. Carry out an hypothesis test, using the sleep deprivation example from lecture as a sample code, to test the null hypothesis that the median of the number of hours spent on child care for Chinese women is the same as the median of the number of hours spent on child care for Chinese men. (Note that the sleep deprivation example compared the means between two groups, but here we want to compare the medians.)

R Markdown source