In lecture this week, we looked at examples of hypothesis tests for comparing two groups in two situations: (1) comparing proportions between two groups and (2) comparing means between two groups. For the following scenarios, identify whether (1) proportions or (2) means are appropriate for comparing the variable of interest between the groups.
Use means since BMI is a numerical variable. We can test whether the mean of BMI is the same in people who consume alcohol and people who do not consume alcohol.
Use proportions, to compare the proportion of people who are obese between the people who consume alcohol and the people who do not consume alcohol.
Use proportions. Sex is a categorical variable and we can compare the proportions of boys (or girls) in mothers who smoked and mothers who didn’t.
We can measure employment rate as the proportion of recent college graduate who are employed, so we should use proportions.
Use means since salary is a numerical variable. We can compare the mean salary between graduates from statistics and economics programs of study.
(Adapted from ISRS 3.36) A 2012 Gallup poll surveyed Americans about their employment status and whether or not they have diabetes. The survey results indicate that 1.5% of the 47,774 employed (full or part time) and 2.5% of the 5,855 unemployed 18-29 year olds have diabetes.
To create a tidy data frame for this situation, note that 1.5% of 47,774 is 717 people and 2.5% of 5,855 is 146 people. The R
command rep
creates a vector which replicates its first argument the number of times indicated by its second argument. For example, the following command creates a vector with 5 elements, each of which is “hello”. (Try it.)
rep("hello",5)
Here’s how to create the data for this question. Create and view this data frame to verify that it is what you expect.
library(tidyverse)
databig <- data_frame(group=c(rep("employed",47774),rep("unemployed",5855)),
diabetes=c(rep("yes",717),rep("no",47774-717),rep("yes",146),rep("no",5855-146)))
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
n_employed <- databig %>% filter(group=="employed") %>% summarise(n())
n_unemployed <- databig %>% filter(group=="unemployed") %>% summarise(n())
# calculate the test statistic
diab_employed <- databig %>%
filter(diabetes=="yes" & group=="employed") %>%
summarize(n())
diab_unemployed <- databig %>%
filter(diabetes=="yes" & group=="unemployed") %>%
summarize(n())
test_stat <- as.numeric(diab_employed/n_employed - diab_unemployed/n_unemployed)
test_stat
## [1] -0.009927789
for (i in 1:repetitions)
{
sim <- databig %>% mutate(group = sample(group))
diab_employed <- sim %>%
filter(diabetes=="yes" & group=="employed") %>%
summarize(n())
diab_unemployed <- sim %>%
filter(diabetes=="yes" & group=="unemployed") %>%
summarize(n())
p_diff <- diab_employed/n_employed - diab_unemployed/n_unemployed
simulated_stats[i] <- as.numeric(p_diff)
}
sim <- data_frame(p_diff=simulated_stats)
sim %>%
filter(p_diff >= abs(test_stat) | p_diff <= -1*abs(test_stat)) %>%
summarise(p_value = n() / repetitions)
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
Assuming that there is no difference in proportion of people with diabetes between people 18-29 years old who are employed or not employed, the probability of seeing a difference in the proportion of people with diabetes as large or larger that was observed in the survey is estimated to be 0 (it is \(<0.001\)). We have very strong evidence that the proportion of 18-29 year old people with diabetes differs between those who are employed and those who are unemployed.
datasmall <- data_frame(group=c(rep("employed",4777),rep("unemployed",586)),
diabetes=c(rep("yes",72),rep("no",4777-72),rep("yes",15),rep("no",586-15)))
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
n_employed <- datasmall %>% filter(group=="employed") %>% summarise(n())
n_unemployed <- datasmall %>% filter(group=="unemployed") %>% summarise(n())
diab_employed <- datasmall %>%
filter(diabetes=="yes" & group=="employed") %>%
summarize(n())
diab_unemployed <- datasmall %>%
filter(diabetes=="yes" & group=="unemployed") %>%
summarize(n())
test_stat <- as.numeric(diab_employed/n_employed - diab_unemployed/n_unemployed)
test_stat
## [1] -0.01052505
for (i in 1:repetitions)
{
sim <- datasmall %>% mutate(group = sample(group))
diab_employed <- sim %>%
filter(diabetes=="yes" & group=="employed") %>%
summarize(n())
diab_unemployed <- sim %>%
filter(diabetes=="yes" & group=="unemployed") %>%
summarize(n())
p_diff <- diab_employed/n_employed - diab_unemployed/n_unemployed
simulated_stats[i] <- as.numeric(p_diff)
}
sim <- data_frame(p_diff=simulated_stats)
sim %>%
filter(p_diff >= abs(test_stat) | p_diff <= -1*abs(test_stat)) %>%
summarise(p_value = n() / repetitions)
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.079
Assuming that there is no difference in proportion of people with diabetes between people 18-29 years old who are employed or not employed, the probability of seeing a difference in the proportion of people with diabetes as large or larger that was observed in the survey is estimated to be 0.079. We have only weak evidence that the proportion of 18-29 year old people with diabetes differs between those who are employed and those who are unemployed.
Even though the observed difference in proportions was the same in both a. and b., we got a much smaller P-value with the larger sample size. In general, with large sample sizes, hypothesis tests can detect smaller differences between two groups.
Bring your output for this question to tutorial on Friday February 9 (either a hardcopy or on your laptop).
(Adapted from ISRS 4.22) The China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments. One of the variables collected on the survey is the number of hours parents spend taking care of children in their household under age 6 (feeding, bathing, dressing, holding, or watching them). In our data for 2006, 664 females and 358 males were surveyed for this question. We will examine whether there is a difference in hours spent on child care between females and males.
openintro
package in the china
dataframe. Load the data and look at it usinglibrary(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
glimpse(china)
## Observations: 9,788
## Variables: 3
## $ gender <int> 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, ...
## $ edu <int> 1, 5, 2, 2, 3, NA, 2, 2, 2, NA, NA, 2, 2, 1, 5, 5, ...
## $ child_care <int> -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -...
There are 9,788 observations in the data, but we expected \(664+358=1022\) from the description of the survey above.
child_care
variable for the observations shown (which are the first observations in the data frame)? Is this what you expected?The first several values of child_care
are \(-99\), which doesn’t make sense as a value for the number of hours taking care of children.
new_china
removing the people with this value from the data. (What dplyr
function should you use? For indicating which values to remove, the operator !=
means not equal.)new_china <- china %>% filter(child_care != -99)
glimpse(new_china)
## Observations: 1,022
## Variables: 3
## $ gender <int> 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, ...
## $ edu <int> 1, 2, 1, 1, NA, 2, 4, 3, 2, NA, NA, NA, 3, NA, 2, 1...
## $ child_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
gender
variable is coded with an integer, which is harder to remember than a coding that makes it clear which value corresponds to males and which corresponds to females. Use the following code to create a new variable called sex
in the data frame new_china
. Look at the data to verify that it has accomplished what you expect.new_china <- new_china %>% mutate(sex = recode(gender, `1`="male", `2`="female"))
glimpse(new_china)
## Observations: 1,022
## Variables: 4
## $ gender <int> 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, ...
## $ edu <int> 1, 2, 1, 1, NA, 2, 4, 3, 2, NA, NA, NA, 3, NA, 2, 1...
## $ child_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex <chr> "female", "male", "male", "female", "female", "male...
child_care
variable for males and females.group_by
and summarize
functions to find the average and median values of child_care
for females and males. Within summarize
, the average can by found using the function mean(VARIABLE_NAME)
and the median can be found using the function median(VARIABLE_NAME)
. Recall that the median of child_care
is the value such that 50% of people reported spending more hours taking care of children than this value, and 50% of people reported spending fewer hours taking care of children than this value.new_china %>% group_by(sex) %>% summarize(means=mean(child_care), medians=median(child_care))
## # A tibble: 2 x 3
## sex means medians
## <chr> <dbl> <dbl>
## 1 female 29.28163 20
## 2 male 15.24581 9
ii. Construct boxplots and histograms of `child_care` for each sex. How do the distributions of `child_care` compare for males and females? Look at the plots so see where the values of the means and medians are located. (For code to emulate for the boxplots, see the code on page 27 of the `ggplot` slides from Week 1. For the histograms, try combining the code to create a histogram (page 14 of the slides) with a `facet_wrap` (page 41 of the slides).)
ggplot(new_china, aes(x=sex, y=child_care)) + geom_boxplot()
ggplot(new_china, aes(x=child_care)) + geom_histogram(binwidth=10) + facet_wrap(~sex)
The distributions of child_care
for both females and males is very right-skewed, with most people having a small number of hours and a few having very large values. The medians are the middle lines in the boxes of the boxplots. In this histograms, the medians are in the bin that includes the most observations (for males) or the bin just to the right of of the bin that includes the most observations (for females). The mean values are to the right of these on the histogram, in a bin where there are fewer observations.
The large values in the right tail of the distribution results in larger means than medians. In data with this type of distribution, the median typically gives a one-number summary that better captures the data than the mean.
repetitions <- 1000
simulated_stats <- rep(NA, repetitions)
test_stat <- as.numeric( new_china %>% group_by(sex) %>% summarize(medians=median(child_care)) %>% summarise(test_stat = diff(medians)))
for (i in 1:repetitions)
{
sim <- new_china %>% mutate(sex = sample(sex)) # shuffle sleep group labels
sim_test_stat <- sim %>% group_by(sex) %>% summarise(medians = median(child_care)) %>% summarise(sim_test_stat = diff(medians))
simulated_stats[i] <- as.numeric(sim_test_stat)
}
sim <- data_frame(median_diff=simulated_stats)
sim %>%
filter(median_diff >= abs(test_stat) | median_diff <= -1*abs(test_stat)) %>%
summarise(p_value = n() / repetitions)
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
We simulated 1000 values of what the difference between females and males in the medians of number of hours spent on child care would be if we assumed that there was no difference, that is the values observed were just as likely to come from a female survey participant or a male survey participant. There were no simulated values with a difference as large or larger than the difference in medians that we observed in our data. Our estimate of the P-value is thus 0 (\(<1/1000\)). We conclude that we have very strong evidence that there is a difference in the median number of hours spent on child care between females and males.