Tutorial grades will be assigned according to the following marking scheme.
Item | Mark |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completion | 1 |
In-class exercises | 4 |
Total | 6 |
In this question, we will look at the median duration of flights departing from New York in 2013. We’ll take as our population all flights departing from New York in 2013 in the flights data.
(a) Plot a histogram of air_time for the whole population and describe it.
library(tidyverse) # provides %>%, filter() and ggplot()
library(nycflights13)
flights <- flights %>% filter(!is.na(air_time)) # drop flights with no recorded air time
ggplot(flights, aes(x=air_time)) + geom_histogram() + xlim(0,700) +
labs(title="Population of all flights departing from New York")
The distribution of air_time is bimodal: it has a large peak around 120 minutes and a second peak around 320 minutes. Since the average flight time from New York to Los Angeles (LAX) is 328 minutes, it seems that the second mode represents flights to the West Coast. The distribution is slightly right skewed, mostly because of the second peak. There are also a small number of very long flights departing New York, with durations of approximately 600 minutes; examining the data reveals that these are flights to Honolulu (HNL).
(b) Plot a histogram of air_time for (i) a sample of size 25, (ii) a sample of size 100. Compare these histograms and the histogram from part (a).
set.seed(1)
flights %>% sample_n(size=25) %>% ggplot(aes(x=air_time)) + geom_histogram() + xlim(0,700) +
labs(title="Sample of size 25 of all flights departing from New York")
flights %>% sample_n(size=100) %>% ggplot(aes(x=air_time)) + geom_histogram() + xlim(0,700) +
labs(title="Sample of size 100 of all flights departing from New York")
The two modes observed in the histogram of air_time for the whole population in (a) are still apparent in the two histograms here. Flight durations which were relatively rare in the full population (e.g. ~600 minutes and ~250 minutes) may be absent entirely from the samples, particularly when the sample size is smaller. Depending on the samples you got, the distributions may also look less smooth than the population histogram, as there are fewer observations. Overall, though, the distribution of each sample reflects the distribution of the population.
(c) (i) Take 500 samples of size 25 and calculate the median of air_time in each of these samples.
sample_medians25 <- rep(NA, 500) # where we'll store the medians
for (i in 1:500)
{
sample25 <- flights %>% sample_n(size = 25)
sample_medians25[i] <- as.numeric(sample25 %>% summarize(median(air_time)))
}
sample_medians25 <- tibble(median_air_time=sample_medians25)
ggplot(sample_medians25, aes(x=median_air_time)) + geom_histogram(binwidth=4) + xlim(75,220) +
labs(x="Medians from samples of size 25",
title="Sample size 25")
(ii) Take 500 samples of size 100 and calculate the median of air_time in each of these samples.
sample_medians100 <- rep(NA, 500) # where we'll store the medians
for (i in 1:500)
{
sample100 <- flights %>% sample_n(size = 100)
sample_medians100[i] <- as.numeric(sample100 %>% summarize(median(air_time)))
}
sample_medians100 <- tibble(median_air_time=sample_medians100)
ggplot(sample_medians100, aes(x=median_air_time)) + geom_histogram(binwidth=4) + xlim(75,220) +
labs(x="Medians from samples of size 100",
title="Sample size 100")
These are histograms of the sampling distribution of the median flight duration for samples of size 25 (in part (i)) and samples of size 100 (in part (ii)) respectively.
The sampling distribution of the median for samples of size 100 is symmetric with one mode around the population median of 129 minutes. When working with samples of size 25, the sampling distribution is not quite symmetrical (although this may vary depending on your random samples); however it is still centered around the population median.
By looking at the x-axis, we can see that the range of the sampling distribution of the median is much smaller when working with samples of size 100 than when we are working with samples of size 25. This makes sense, as it is more difficult to get a good estimate of the median from a small sample.
A general note: samples should look like the original data, but sampling distributions don’t have to.
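The claim that the medians vary less for larger samples can be checked numerically. The sketch below is only an illustration: it uses a made-up bimodal `population` vector (not the flights data) and a hypothetical helper `sample_medians()`, and it measures spread with the standard deviation of the simulated medians.

```r
# Hypothetical stand-in population (NOT the flights data): a bimodal mixture
# with a large mode near 120 and a smaller mode near 320.
set.seed(1)
population <- c(rnorm(80000, mean = 120, sd = 40),
                rnorm(20000, mean = 320, sd = 25))

# Hypothetical helper: draw `reps` samples of size n (without replacement)
# and return the median of each sample.
sample_medians <- function(pop, n, reps = 500) {
  replicate(reps, median(sample(pop, size = n)))
}

medians25  <- sample_medians(population, n = 25)
medians100 <- sample_medians(population, n = 100)

# Spread of each simulated sampling distribution of the median.
spread <- c(n25 = sd(medians25), n100 = sd(medians100))
print(spread)
```

With the real data, the same comparison could be made by applying `sd()` to the `median_air_time` columns computed above.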
Bring your output for this question to tutorial on Friday February 16 (either a hardcopy or on your laptop).
In this question, we will look at the Gestation data in the mosaicData library. First load the library:
library(mosaicData)
You can read about the data by looking at the help information for the data frame:
help(Gestation)
In this question, you will find confidence intervals for parameters related to the distribution of the mother’s age, which is the variable age. First remove the two observations which have missing values for age.
Gestation <- Gestation %>% filter(!is.na(age))
The red dot is the mean of mother’s age for one bootstrap sample. The bootstrap sample is obtained by taking a random sample, with replacement, from the original data, with the same number of observations as the original data.
The 90% confidence interval ranges from approximately the 5th to the 95th data point, counting up from the smallest. This interval will be from a value a little above 27.0 to a value a little above 27.5.
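The counting rule can be compared with `quantile()` in a short sketch. It assumes the plot shows 100 bootstrap means (the text implies this but does not state it), and it uses a made-up `ages` vector rather than the Gestation data, so the numbers will not match the plot.

```r
# Hypothetical ages, for illustration only (NOT the Gestation data).
set.seed(1)
ages <- round(rnorm(1000, mean = 27, sd = 6))

# 100 bootstrap means: each resample has the same size as the data and is
# drawn with replacement.
boot_means <- replicate(100, mean(sample(ages, replace = TRUE)))

# Reading the interval off the sorted values, as one would from a dot plot.
sorted <- sort(boot_means)
by_counting <- c(lower = sorted[5], upper = sorted[95])

# The same interval from the 5th and 95th percentiles.
by_quantile <- quantile(boot_means, c(0.05, 0.95))

print(by_counting)
print(by_quantile)
```

The two answers differ slightly because `quantile()` interpolates between neighbouring sorted values by default.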
Find a 99% bootstrap confidence interval for the median of mother’s age. Use 5000 bootstrap samples.
boot_medians <- rep(NA, 5000) # where we'll store the bootstrap medians
sample_size <- as.numeric(Gestation %>% summarize(n()))
for (i in 1:5000)
{
boot_samp <- Gestation %>% sample_n(size = sample_size, replace=TRUE)
boot_medians[i] <- as.numeric(boot_samp %>% summarize(median(age)))
}
quantile(boot_medians,c(.005,.995))
## 0.5% 99.5%
## 26 27
The 99% bootstrap confidence interval for the median of mother’s age is (26, 27).
In lecture this week, we used Gunturkun’s data to calculate confidence intervals for the proportion of couples who tilt their heads to the right when they kiss. Our 95% confidence interval was (0.56, 0.73).
A larger confidence level ensures that we would capture the population parameter in more samples. This would give a wider confidence interval, extending to more of the bootstrap sampling distribution.
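The effect of the confidence level on the width can be seen by taking several intervals from one bootstrap distribution. This is a sketch under an assumption: the counts of 80 right-tilting couples out of 124 are not given in this handout and are used here only because they match the 64.5% quoted below.

```r
# Assumed counts (not stated in this handout): 80 of 124 couples tilt right.
set.seed(1)
tilt_right <- c(rep(1, 80), rep(0, 124 - 80))

# Bootstrap distribution of the sample proportion.
boot_props <- replicate(5000, mean(sample(tilt_right, replace = TRUE)))

# Percentile intervals at three confidence levels from the same distribution.
levels <- c(0.90, 0.95, 0.99)
intervals <- t(sapply(levels, function(level) {
  alpha <- 1 - level
  quantile(boot_props, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}))
colnames(intervals) <- c("lower", "upper")
rownames(intervals) <- paste0(100 * levels, "%")

# Width of each interval; higher levels reach further into the tails.
widths <- intervals[, "upper"] - intervals[, "lower"]
print(intervals)
print(widths)
```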
False. We know in this sample that 64.5% of kissing couples tilt their head to the right so we’re 100% confident.
True. (Someone reading this interpretation of a confidence interval would have to know what “95% confident” means!)
True. This is an accurate description of a confidence interval.
In the week 4 lecture, we carried out a hypothesis test to determine whether couples are equally likely to tilt their heads to the right or to the left when they kiss. We tested the hypotheses: \[H_0: p = 0.5\] versus \[H_A: p \ne 0.5\] where \(p\) is the proportion of couples who tilt their heads to the right when they kiss. Using Gunturkun’s data, our P-value was 0.003.
How do this hypothesis test and the confidence interval tell a similar story?
From the hypothesis test, we have strong evidence that the proportion of couples who tilt their head to the right is not 0.5 because our P-value is small. Our 95% confidence interval for this proportion does not include 0.5, so 0.5 is not a plausible value from our data. So the hypothesis test and confidence interval both indicate that couples aren’t equally likely to tilt their heads to the left or right when they kiss.
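Both halves of this comparison can be simulated side by side. The sketch assumes counts of 80 right-tilting couples out of 124 (consistent with the 64.5% above, but not given in this handout), and because it relies on random simulation it will not reproduce the lecture's P-value or interval exactly.

```r
# Assumed counts (not stated in this handout): 80 of 124 couples tilt right.
set.seed(1)
n <- 124
observed <- 80 / n

# Hypothesis test: simulate sample proportions under H0: p = 0.5 and find how
# often they are at least as far from 0.5 as the observed proportion.
null_props <- rbinom(10000, size = n, prob = 0.5) / n
p_value <- mean(abs(null_props - 0.5) >= abs(observed - 0.5))

# Confidence interval: 95% percentile bootstrap interval for the proportion.
tilt_right <- c(rep(1, 80), rep(0, n - 80))
boot_props <- replicate(5000, mean(sample(tilt_right, replace = TRUE)))
ci95 <- quantile(boot_props, c(0.025, 0.975))

# The two summaries agree when a small P-value goes with an interval that
# excludes the null value 0.5.
excludes_half <- ci95[[1]] > 0.5 || ci95[[2]] < 0.5
print(p_value)
print(ci95)
print(excludes_half)
```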