In lecture, we looked at the sampling distribution of the mean arrival delay for 2013 flights from New York to San Francisco for samples of size 25 and 100. We’ll now look at the median of the arrival delay. We’ll take as our population all flights from New York to San Francisco in the flights data.
Histograms of arr_delay for the population, for a sample of size 25, and for a sample of size 100:
library(tidyverse) # provides %>%, filter(), sample_n() and ggplot()
library(nycflights13)
SF <- flights %>% filter(dest == "SFO", !is.na(arr_delay))
ggplot(SF, aes(x=arr_delay)) + geom_histogram() + xlim(-100,1000) +
labs(title="Population of all NY to San Francisco flights")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
sample25 <- SF %>% sample_n(size = 25)
ggplot(sample25, aes(x=arr_delay)) + geom_histogram() + xlim(-100,1000) +
labs(title="Sample of size 25 from all NY to San Francisco flights")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
sample100 <- SF %>% sample_n(size = 100)
ggplot(sample100, aes(x=arr_delay)) + geom_histogram() + xlim(-100,1000) +
labs(title="Sample of size 100 from all NY to San Francisco flights")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
For ease of comparison, I put all histograms on the same horizontal scale. For the population and both samples, the distribution of arr_delay is very right-skewed, with a mode near 0. The values of arr_delay that we can observe in the histograms range from about -100 to very large values (close to 500, and possibly larger, since the vertical scale makes small counts impossible to see). Depending on the samples you got, the distribution may look less smooth for the samples, as there are fewer observations. But, overall, the distribution of the samples reflects the distribution of the population.
The sampling distributions of the mean of arr_delay have a single mode near 0, as did the histograms of the data values in a. The histograms in a. are very right-skewed, but the histogram for the sampling distribution of the mean for samples of size 25 is only slightly right-skewed and the histogram for the sampling distribution of the mean for samples of size 100 is close to symmetric. The range of values for the mean of arr_delay is much smaller than the range of values for the original data, and is smaller still for the larger sample size.
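The narrowing of the sampling distribution with larger samples can also be checked numerically. The following is a minimal sketch in base R; it uses a simulated right-skewed population as a stand-in for the arr_delay values, so the population and the helper name sample_means are assumptions for illustration and not the flights data.

```r
set.seed(2018)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population standing in for arr_delay (NOT the flights data)
population <- rexp(10000, rate = 1 / 40) - 20

# Hypothetical helper: means of `reps` random samples of size n (drawn without replacement)
sample_means <- function(n, reps = 500) {
  replicate(reps, mean(sample(population, size = n)))
}

means25  <- sample_means(25)
means100 <- sample_means(100)

# Spread of the population versus spread of each sampling distribution
sd(population)
sd(means25)
sd(means100)
```

Under these assumptions the standard deviation of the sample means should be roughly sd(population)/sqrt(n), so the means from samples of size 100 should vary about half as much as those from samples of size 25.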
sample_medians <- rep(NA, 500) # where we'll store the medians
for (i in 1:500)
{
  sample25 <- SF %>% sample_n(size = 25)
  sample_medians[i] <- as.numeric(sample25 %>% summarize(median(arr_delay)))
}
sample_medians <- tibble(median_delay=sample_medians) # tibble() replaces the deprecated data_frame()
ggplot(sample_medians, aes(x=median_delay)) + geom_histogram(binwidth=5) +
  labs(x="Medians from samples of size 25",
       title="Sample size 25")
sample_medians100 <- rep(NA, 500) # where we'll store the medians
for (i in 1:500)
{
  sample100 <- SF %>% sample_n(size = 100)
  sample_medians100[i] <- as.numeric(sample100 %>% summarize(median(arr_delay)))
}
sample_medians100 <- tibble(median_delay=sample_medians100) # tibble() replaces the deprecated data_frame()
ggplot(sample_medians100, aes(x=median_delay)) + geom_histogram(binwidth=2) +
  labs(x="Medians from samples of size 100",
       title="Sample size 100")
Both sampling distributions of the median are symmetric with one mode between -10 and 0. Comparing the horizontal scales, there is a smaller range of values for medians from samples of size 100 than for medians from samples of size 25. Note that the shape of the sampling distributions of the median differs from the shape of the original data, and the medians span a smaller range of values than the data do.
A general note: samples should look like the original data, but sampling distributions don’t have to.
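The for loops used above to build the sampling distributions can also be written with replicate(). The following is a minimal sketch in base R, assuming a simulated right-skewed population in place of the flights data (the population here is hypothetical), to compare the range of the data with the range of the sample medians.

```r
set.seed(25)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population standing in for arr_delay (NOT the flights data)
population <- rexp(10000, rate = 1 / 40) - 20

# 500 sample medians for each sample size; replicate() replaces the explicit for loop
medians25  <- replicate(500, median(sample(population, size = 25)))
medians100 <- replicate(500, median(sample(population, size = 100)))

# Range of the data versus range of each sampling distribution
range(population)
range(medians25)
range(medians100)
```

With this setup the medians should cover a much narrower range than the population values, and the medians from samples of size 100 should vary less than those from samples of size 25.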
Bring your output for this question to tutorial on Friday February 16 (either a hardcopy or on your laptop).
In this question, we’ll look at the Gestation data in the mosaicData library. First load the library:
library(mosaicData)
You can read about the data by looking at the help information for the data frame:
help(Gestation)
In this question, you will find confidence intervals for parameters related to the distribution of the mother’s age, which is the variable age. First remove the two observations which have missing values for age.
Gestation <- Gestation %>% filter(!is.na(age))
The red dot is the mean of mother’s age for one bootstrap sample. The bootstrap sample is obtained by taking a random sample, with replacement, from the original data, with the same number of observations as the original data.
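The construction of a single bootstrap sample described above can be sketched in base R as follows. The ages below are simulated and only stand in for the age variable (an assumption for illustration, not the Gestation data); sampling row indices makes it easy to see that some observations are drawn more than once and others not at all.

```r
set.seed(16)  # arbitrary seed so the sketch is reproducible

# Hypothetical ages standing in for Gestation$age (NOT the real data)
age <- round(rnorm(1000, mean = 27, sd = 6))

# One bootstrap sample: draw row indices WITH replacement, same size as the data
idx <- sample(seq_along(age), size = length(age), replace = TRUE)
boot_samp <- age[idx]

# The statistic for this one bootstrap sample (one dot in a plot of bootstrap means)
boot_mean <- mean(boot_samp)
boot_mean

# Number of distinct original observations that appear in the bootstrap sample
length(unique(idx))
```

Because the sampling is with replacement, the number of distinct observations used is smaller than the number of rows, even though the bootstrap sample has the same size as the data.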
The 90% confidence interval ranges from approximately the 5th percentile to the 95th percentile of the bootstrap means shown in the plot. This interval will be from a value a little above 27.0 to a value a little above 27.5.
Find a 99% bootstrap confidence interval for the median of mother’s age. Use 5000 bootstrap samples.
boot_medians <- rep(NA, 5000) # where we'll store the bootstrap medians
sample_size <- nrow(Gestation) # number of observations in the original data
for (i in 1:5000)
{
  boot_samp <- Gestation %>% sample_n(size = sample_size, replace=TRUE)
  boot_medians[i] <- as.numeric(boot_samp %>% summarize(median(age)))
}
quantile(boot_medians,c(.005,.995))
## 0.5% 99.5%
## 26 27
The 99% bootstrap confidence interval for the median of mother’s age is (26, 27).
In lecture this week, we used Güntürkün’s data to calculate confidence intervals for the proportion of couples who tilt their heads to the right when they kiss. Our 95% confidence interval was (0.56, 0.73).
A larger confidence level ensures that we would capture the population parameter in more samples. This would give a wider confidence interval, extending to more of the bootstrap sampling distribution.
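The effect of the confidence level on the width of a bootstrap percentile interval can be illustrated with a short base R sketch. It assumes counts of 80 right-tilting couples out of 124, which is consistent with the 64.5% quoted in this question; the counts and variable names are assumptions for illustration.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

# Assumed data: 80 of 124 couples tilt right (about 64.5%); 1 = right, 0 = left
tilts <- c(rep(1, 80), rep(0, 44))

# Bootstrap distribution of the sample proportion
boot_props <- replicate(5000, mean(sample(tilts, replace = TRUE)))

# Percentile intervals at three confidence levels
ci90 <- quantile(boot_props, c(0.05, 0.95))
ci95 <- quantile(boot_props, c(0.025, 0.975))
ci99 <- quantile(boot_props, c(0.005, 0.995))

# Interval widths, from lowest to highest confidence level
widths <- c(level90 = unname(diff(ci90)),
            level95 = unname(diff(ci95)),
            level99 = unname(diff(ci99)))
widths
```

Each higher confidence level takes in more of the bootstrap distribution, so the widths should increase from the 90% interval to the 99% interval.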
False. We know in this sample that 64.5% of kissing couples tilt their head to the right so we’re 100% confident.
True. (Someone reading this interpretation of a confidence interval would have to know what “95% confident” means!)
True. This is an accurate description of a confidence interval.
In the week 4 lecture, we carried out a hypothesis test to determine whether couples are equally likely to tilt their heads to the right or to the left when they kiss. We tested the hypotheses: \[H_0: p = 0.5\] versus \[H_A: p \ne 0.5\] where \(p\) is the proportion of couples who tilt their heads to the right when they kiss. Using Güntürkün’s data, our P-value was 0.003.
How do this hypothesis test and the confidence interval tell a similar story?
From the hypothesis test, we have strong evidence that the proportion of couples who tilt their heads to the right is not 0.5 because our P-value is small. Our 95% confidence interval for this proportion does not include 0.5, so 0.5 is not a plausible value given our data. So the hypothesis test and confidence interval both indicate that couples aren’t equally likely to tilt their heads to the left or right when they kiss.
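The agreement between the test and the interval can be seen by running both on the same data. The following base R sketch assumes counts of 80 right-tilting couples out of 124 (consistent with the 64.5% quoted earlier); the counts, the number of simulations and the variable names are assumptions for illustration and need not match the lecture code.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Assumed counts: 80 of 124 couples tilt right (about 64.5%)
n <- 124
right <- 80

# Hypothesis test by simulation: counts we would see if p = 0.5 were true
sim_counts <- rbinom(5000, size = n, prob = 0.5)
# Two-sided P-value: simulated counts at least as far from n/2 as the observed count
p_value <- mean(abs(sim_counts - n / 2) >= abs(right - n / 2))

# 95% bootstrap percentile interval for p
tilts <- c(rep(1, right), rep(0, n - right))
boot_props <- replicate(5000, mean(sample(tilts, replace = TRUE)))
ci95 <- quantile(boot_props, c(0.025, 0.975))

p_value
ci95
```

Under these assumptions the P-value should be small and the whole interval should lie above 0.5, which is the sense in which the two methods tell the same story.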