Lightning Round

Lightning Round Question 1

A clinical oncologist is investigating the efficacy of a new treatment on reduction in tumour size. She randomly assigns patients to the new treatment or old treatment and compares the mean of the reduction in tumour size between the two groups. She carries out a statistical test and the P-value is 0.001. How many of the following are valid interpretations of the P-value?

I. The probability of observing a difference between the treatment groups as large or larger than she observed if the new treatment has the same efficacy as the old treatment.
II. The probability that the new treatment works the same as the old treatment.
III. The probability that the new treatment, on average, reduces tumour size more than the old treatment.

A. None
B. One
C. Two
D. Three

Answer: B. One

Statement I is true. The P-value is the probability of observing a statistic that is at least as extreme as our test statistic (the observed difference between treatments) if the null hypothesis is true (the two treatments are equally effective).
Statement II is false. The P-value is a probability related to the observed data, and is not the probability of the null hypothesis.
Statement III is false. The P-value tells us how unlikely the observed data are if the null hypothesis is true. Its value cannot be interpreted as a probability about the treatments themselves.
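Statement I can be made concrete with a small simulation. The sketch below is only an illustration: the tumour-size reductions are made-up numbers (not data from any study), and the variable names are hypothetical. It shuffles the treatment labels to mimic the null hypothesis and reports the proportion of shuffled differences that are at least as extreme as the observed one.

```r
set.seed(130)

# Hypothetical reductions in tumour size (invented for illustration only)
new_trt <- c(4.1, 5.3, 6.0, 4.8, 5.9, 6.4, 5.1, 4.7)
old_trt <- c(3.2, 4.0, 3.8, 4.5, 3.1, 4.2, 3.6, 3.9)

reduction <- c(new_trt, old_trt)
group <- rep(c("new", "old"), times = c(length(new_trt), length(old_trt)))

# Observed test statistic: difference in mean reduction (new - old)
obs_diff <- mean(reduction[group == "new"]) - mean(reduction[group == "old"])

# Simulate the statistic under H0 (equal efficacy) by shuffling group labels
repetitions <- 2000
sim_diff <- rep(NA, repetitions)
for (i in 1:repetitions) {
  shuffled <- sample(group)
  sim_diff[i] <- mean(reduction[shuffled == "new"]) -
    mean(reduction[shuffled == "old"])
}

# P-value: proportion of simulated differences at least as extreme as observed
p_value <- mean(abs(sim_diff) >= abs(obs_diff))
p_value
```

The result is a statement about data simulated under the null hypothesis, which is why it cannot be read as the probability that either hypothesis is true.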

Lightning Round Question 2

Environmental scientists want to estimate the mean mercury content in ppm of fish in a lake. They collect a random sample of 50 fish from the lake, measure the mercury content of each, calculate the average mercury content for these 50 fish and use the bootstrap to find a 99% confidence interval for the mean. The confidence interval is (0.82, 1.13). How many of the following are valid interpretations of the confidence interval?

I. We are 99% certain that each fish has approximately 0.82 to 1.13 ppm of mercury.
II. We expect 99% of the fish to have between 0.82 and 1.13 ppm of mercury.
III. We would expect about 99% of all possible sample means from this population to be in between 0.82 and 1.13 ppm of mercury.
IV. We are 99% certain that the confidence interval of (0.82, 1.13) includes the true mean of the mercury content of fish in the lake.

A. None; B. One; C. Two; D. Three; E. All four

Answer: B. One

Statement I is false. This is a confidence interval about the mean mercury content of fish in this population, not each fish.
Statement II is false. As above, the confidence interval is not about the individual fish.
Statement III is false. The confidence interval is not a statement about sample means; it is about the mean of all fish in the lake. Note that different samples give different sample means, and also different confidence intervals. A sample mean will be in its corresponding confidence interval for the mean. We can’t say whether or not it will be in a confidence interval computed from another sample of data.
Statement IV is true. In repeated sampling of 50 fish from the lake, 99% of the calculated confidence intervals for the mean will include the mean mercury content for all fish in the lake.
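The repeated-sampling reading of Statement IV can be checked by simulation. This is a rough sketch under stated assumptions: the population of mercury contents is generated from a gamma distribution chosen purely for illustration, and the numbers of samples and bootstrap repetitions are kept small so that it runs quickly.

```r
set.seed(130)

# Hypothetical population of mercury contents in ppm (invented for illustration)
population <- rgamma(100000, shape = 10, rate = 10)
true_mean <- mean(population)

n_samples <- 200   # number of repeated samples of 50 fish
boot_reps <- 500   # bootstrap repetitions per sample
covers <- rep(NA, n_samples)

for (j in 1:n_samples) {
  fish <- sample(population, size = 50)

  # Bootstrap sampling distribution of the mean for this one sample
  boot_means <- rep(NA, boot_reps)
  for (i in 1:boot_reps) {
    boot_means[i] <- mean(sample(fish, replace = TRUE))
  }

  # 99% percentile interval: cut 0.5% from each tail
  ci <- quantile(boot_means, c(0.005, 0.995))
  covers[j] <- ci[1] <= true_mean & true_mean <= ci[2]
}

# Proportion of intervals that include the population mean
mean(covers)
```

The proportion should be close to 0.99, although with a sample of only 50 fish the bootstrap percentile interval may cover slightly less often than its nominal level.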

Lightning Round Question 3

Fill in the respective blanks:
Suppose we wish to test the null hypothesis that a Yoga method does not have an effect on blood pressure versus the alternative that it does have an effect. A _____________ error would be made by concluding that the Yoga method _____________ on blood pressure if in fact the Yoga method ____________ on blood pressure.

A. Type 2; does not have an effect; does have an effect
B. Type 2; does not have an effect; does not have an effect
C. Type 2; does have an effect; does not have an effect
D. Type 1; does not have an effect; does have an effect
E. Type 1; does not have an effect; does not have an effect

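Answer: A. This is the definition of a Type 2 error: failing to reject the null hypothesis when it is false. (Option C describes a Type 1 error, so it would be correct only if it were labelled Type 1.)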
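The two kinds of error can be seen by simulating many studies. The sketch below is an illustration only: the blood pressure values, group size, effect size and significance level are all hypothetical, and it uses t.test() as a convenient stand-in for the simulation-based tests used elsewhere in these notes.

```r
set.seed(130)

repetitions <- 2000
n <- 30        # hypothetical number of participants per group
alpha <- 0.05  # hypothetical significance level

reject_when_null_true <- rep(NA, repetitions)
reject_when_null_false <- rep(NA, repetitions)

for (i in 1:repetitions) {
  control <- rnorm(n, mean = 120, sd = 10)

  # World 1: Yoga has no effect (the null hypothesis is true)
  yoga_no_effect <- rnorm(n, mean = 120, sd = 10)
  reject_when_null_true[i] <- t.test(yoga_no_effect, control)$p.value < alpha

  # World 2: Yoga lowers blood pressure by 5 units (the null hypothesis is false)
  yoga_effect <- rnorm(n, mean = 115, sd = 10)
  reject_when_null_false[i] <- t.test(yoga_effect, control)$p.value < alpha
}

# Type 1 error: concluding there is an effect when there is none
type1_rate <- mean(reject_when_null_true)

# Type 2 error: concluding there is no effect when there is one
type2_rate <- mean(!reject_when_null_false)

c(type1 = type1_rate, type2 = type2_rate)
```

The Type 1 error rate should be close to the chosen significance level, while the Type 2 error rate depends on the sample size and the size of the true effect.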

Lightning Round Question 4

On the next slide are 4 histograms:

One is the histogram of the variable x in the population (which consists of 1,000,000 individuals).
One is the histogram of the variable x for a sample of size 1000 from the population.
One is the histogram for 1000 means of x, each from a sample of size 25.
One is the histogram for 1000 means of x, each from a sample of size 100.

Which is which?

Answer:

Histogram 3 is the histogram of x in the population. We can tell by the counts on the y-axis.
Histogram 1 is the sample of x of size 1000 from the population. Histograms of samples of a variable look like the histogram of the variable in the population.
Histogram 2 is the sample means of size 25 and Histogram 4 is the sample means of size 100. Sampling distributions of the mean are less variable and more symmetric than the population values and are less variable for larger sample size.

From the above plots, consider these two plots:

The histogram for 1000 means of x, each from a sample of size 25
The histogram for 1000 means of x, each from a sample of size 100

What is a name for what is plotted in these two plots? Sampling distribution
What do these plots tell you about the effect of sample size on a confidence interval for the mean?
Just as sampling distributions are less variable for larger sample sizes, bootstrap sampling distributions are less variable for larger sample sizes, resulting in narrower confidence intervals (that is, more precision in the estimate of the parameter).
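The effect of sample size on the spread of the sampling distribution can be sketched as follows. The population here is hypothetical (a right-skewed exponential distribution, and smaller than the 1,000,000 individuals in the question so that the code runs quickly).

```r
set.seed(130)

# Hypothetical right-skewed population (invented for illustration)
population <- rexp(100000, rate = 1)

# 1000 sample means for each sample size
means_25 <- replicate(1000, mean(sample(population, size = 25)))
means_100 <- replicate(1000, mean(sample(population, size = 100)))

# Spread of each sampling distribution
sd_25 <- sd(means_25)
sd_100 <- sd(means_100)
c(n25 = sd_25, n100 = sd_100)
```

Quadrupling the sample size roughly halves the standard deviation of the sample means, which is why intervals based on larger samples are narrower.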

Lightning Round Question 5

In statistical inference, we want to make conclusions about what we think about the theoretical world (a scientific model or population) based on what we’ve observed in the real world (data, typically observed on a random sample).
Do the following items exist in the theoretical world or the real world?

Statistic: real world
Parameter: theoretical world
Null hypothesis (and alternative hypothesis): theoretical world (these are competing models for what is true in the theoretical world)
Test statistic: real world (calculated from data)
Simulated values of the test statistic under the null hypothesis: theoretical world
P-value: in between. It tells us how the test statistic (from the real world data) looks in the null hypothesis world (a model for the theoretical world)
Sampling distribution: theoretical world
Bootstrap sampling distribution: real world

Lightning Round Question 6

Suppose we have a vector of data values, x

x <- c(1, 3, 4, 4, 7)

We’ve used the sample() function in various ways. What output is possible with each of the following commands?

sample(x)

An example of possible output: 3 4 1 7 4
The command shuffles the 5 values of x in random order

sample(x, replace=TRUE)

An example of possible output: 1 1 1 1 1
The output is 5 values from x, values may or may not repeat, in random order

sample(x, size=2, replace=TRUE)

An example of possible output: 1 3; could also be 7 7, etc.
The output is 2 values from x, values may or may not repeat, in random order

sample(x, size=2, prob=c(0.5, 0.5, 0, 0, 0))

Only two possibilities for the output: 1 3 or 3 1
The output is 2 values from the first two values in x, chosen with equal probability, values may not repeat (without replacement), in random order

Which one of these is a bootstrap sample?

sample(x, replace=TRUE)
sample with the same sample size, with replacement
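The descriptions of the four commands can be checked directly. This sketch only verifies the properties stated above; the object names are made up for illustration, and the actual values drawn depend on the random seed.

```r
set.seed(130)
x <- c(1, 3, 4, 4, 7)

shuffled <- sample(x)                     # same 5 values, random order
boot <- sample(x, replace = TRUE)         # bootstrap sample: size 5, repeats allowed
two_with <- sample(x, size = 2, replace = TRUE)
two_weighted <- sample(x, size = 2, prob = c(0.5, 0.5, 0, 0, 0))

# A shuffle keeps exactly the original values
identical(sort(shuffled), sort(x))

# A bootstrap sample has the original size and contains only original values
length(boot) == length(x) & all(boot %in% x)

# With zero probability on the last three values and no replacement,
# the only values that can be drawn are 1 and 3
sort(two_weighted)
```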

Case Study: American Community Survey 2012

The American Community Survey is conducted by the US Census Bureau each year on a random sample of 3.5 million households. Findings from the survey influence the allocation of more than $400 billion in federal and state funds. The dataset acs12 is a random sample from the people who completed the American Community Survey in 2012.

Here is a look at the data and some of the variables we will consider later:

glimpse(acs12)

## Observations: 2,000
## Variables: 13
## $ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, ...
## $ employment   <fct> not in labor force, not in labor force, NA, not i...
## $ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, N...
## $ race         <fct> white, white, white, white, white, other, white, ...
## $ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, ...
## $ gender       <fct> female, male, female, male, female, female, male,...
## $ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
## $ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA...
## $ lang         <fct> english, english, english, other, other, other, e...
## $ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no,...
## $ edu          <fct> college, hs or lower, hs or lower, hs or lower, h...
## $ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, n...
## $ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thr...

table(acs12$employment)

## 
## not in labor force         unemployed           employed 
##                656                106                843

table(acs12$edu)

## 
## hs or lower     college        grad 
##        1439         359         144

Case study question 1

Describe the data frames that are created by each of the following commands:

labor_force <- acs12 %>% filter(!is.na(employment)) %>% 
  filter(employment == "employed" | employment == "unemployed")

the resulting data frame has fewer observations; observations for which the variable employment is missing or has the value not in labor force are removed

employed <- labor_force %>% filter(employment == "employed")

the resulting data frame has even fewer observations; only observations for which employment has the value employed are kept

employed <- employed %>% 
  mutate(edu2 = recode(edu, "hs or lower" = "hs_or_lower", 
                       "college"="more_than_hs", "grad"="more_than_hs"))

the resulting data frame has a new variable edu2 which is the same as edu except observations with value college or grad are changed to the value more_than_hs

cat_vars <- acs12 %>% select(employment, race, gender, citizen, lang, married, edu, 
                             disability, birth_qrtr)

the resulting data frame has fewer variables, only those “selected”

Case study question 2

We’ve used these plot geometries:
geom_bar, geom_boxplot, geom_dotplot, geom_histogram, geom_line, geom_point, geom_vline

Recall this plot vocabulary:

Bar plots: modes, frequency
Histograms / boxplots: centre, spread, modes (unimodal, bimodal, multimodal, no mode), frequency, symmetric / left-skewed / right-skewed, outliers
Scatterplots: strong / weak / no relationship, linear (positive or negative) / nonlinear relationship, outliers, clusters

On the next several slides are a number of plots, each constructed from the dataset employed. For each:

What type of plot is it?
What ggplot geometry is used?
What is the purpose of the plot?
Describe the distribution(s) of the variable(s).

Plot 1

Type of plot: side-by-side boxplots
Geometry: geom_boxplot
Purpose: compare the distribution of income (a numerical variable) for different values of edu (a categorical variable)
Description: For all values of edu, the distribution of income is right-skewed. The median, first quartile, and third quartile all increase with more education.

Plot 2

Why use a log transformation?
The log transformation spreads out small values in right-skewed distributions and makes large values less extreme.

Type of plot: scatterplot
Geometry: geom_point
Purpose: examine the relationship between log of income and hrs_work (two numerical variables)
Description: Ignoring the set of 0 values for income, there is a moderate, positive, linear relationship between log of income and hrs_work.

Plot 3

Type of plot: bar plot
Geometry: geom_bar
Purpose: examine the distribution of a categorical variable, edu
Description: Most observations have high school or lower education. The number with college education is about half the number with high school or lower education, and the number with graduate school education is about one-fifth the number with high school or lower education.

Plot 4

Type of plot: histogram
Geometry: geom_histogram
Purpose: examine the distribution of a numerical variable, income
Description: The distribution has a single mode at about 25,000 and is very right-skewed.

What is the difference in these histograms? How do you get the second from the first?

In the count plot, the height of a bar is the number of observations that are in that bin. In the density plot, the area of a bar is the proportion of observations that are in that bin. The height of the bar in the density plot is the height of the bar in the count plot divided by the product of the binwidth and the total number of observations.
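The relationship between the two bar heights can be verified numerically. The sketch below uses base R's hist() rather than geom_histogram, and the incomes are simulated from a hypothetical right-skewed distribution, not taken from the employed dataset.

```r
set.seed(130)

# Hypothetical right-skewed incomes (invented for illustration)
income <- rexp(500, rate = 1 / 40000)

binwidth <- 10000
breaks <- seq(0, max(income) + binwidth, by = binwidth)
h <- hist(income, breaks = breaks, plot = FALSE)

# Height in the density plot = count / (binwidth * number of observations)
density_height <- h$counts / (binwidth * length(income))

# These match the densities computed by hist() itself
all.equal(density_height, h$density)

# The bar areas (height * binwidth) are proportions, so they sum to 1
total_area <- sum(density_height * binwidth)
total_area
```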

Case study question 3

We’ve looked at simulations to:

1. See how a statistic calculated from a population might vary from sample to sample (what is this called? sampling distribution)
2. Estimate the sampling distribution in 1 when we only have one set of data (what is this called? bootstrap sampling distribution)
3. See the distribution of possible values of a statistic under an assumption (what is this assumption called? null hypothesis). Two cases of this we considered:
   a. simulate outcomes for a proportion (how did we do this under our assumption? “flip a coin” with the number of times equal to the number of observations and with the probability of outcomes specified by the null hypothesis and calculate the proportion of the outcome of interest in these “coin flips”)
   b. simulate the difference in a statistic between groups (how did we do this under our assumption? shuffle the group labels and calculate the resulting difference in the statistic between the shuffled groups)

Below is code for three simulations. For each:

What is the purpose of the simulation (from the choices above)?
Is the simulation used for a test or for a confidence interval or neither?
For the dotplot of simulated values, where is its centre?

If the purpose of the simulation is to …

… find a confidence interval for a parameter, describe how you would estimate a 90% confidence interval from the dotplot of simulated values.
… carry out an hypothesis test, what is the null hypothesis? Estimate the P-value from the values plotted in the dotplot. What is your conclusion?

Some statistics that might be useful

labor_force  %>% group_by(employment) %>% summarise(n_group = n()) %>% 
  mutate(percent = n_group / sum(n_group))

employed %>% group_by(edu2) %>% summarise(mean(income))

Simulation 1

repetitions <- 100
x <- rep(NA, repetitions) 

n <- as.numeric(labor_force %>% summarize(n()))

for (i in 1:repetitions)
{
  sim <- sample(c("unemployed", "employed"), size=n, prob=c(0.089, 1-0.089), replace=TRUE)
  sim_stat <- sum(sim == "unemployed") / n
  x[i] <- as.numeric(sim_stat)
}

Purpose: 3a, simulate outcomes of a proportion under an assumption.
We are doing this to examine the distribution of possible values of the proportion of people unemployed for a sample this size, assuming that the true proportion of people unemployed is 0.089 (8.9% is the 2011 US value).
The simulation is for an hypothesis test with $H_0: p=0.089$ and $H_A: p \ne 0.089$ where p is the proportion of people unemployed in the population.

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Note: The grid marks are 0.005 apart on the horizontal axis.

The centre of the dotplot is 0.089 (the null hypothesis value).
The test statistic is 0.111.
For the P-value, we need the number of observations in the plot that are greater than or equal to 0.111 or less than or equal to 0.067. It appears that there may be one value (in the right tail) that satisfies this (although we can’t say for sure as the value is close to 0.11). Assuming that value is indeed greater than or equal to 0.111, our estimate of the P-value is 1/100 or 0.01.
We conclude that we have strong evidence that the unemployment rate in 2012 is different from the 2011 value of 0.089.
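Rather than counting dots by eye, the P-value for Simulation 1 can be computed from the simulated values. This sketch makes two assumptions that differ from the code above: it rebuilds the sample size and test statistic from the counts in table(acs12$employment) instead of using the labor_force data frame, and it uses rbinom() to count the "unemployed" coin flips in place of sample(). It also uses more repetitions, so the estimate will not match the dotplot exactly.

```r
set.seed(130)

repetitions <- 10000
n <- 106 + 843        # unemployed + employed, from table(acs12$employment)
null_p <- 0.089       # null hypothesis value (the 2011 rate)
test_stat <- 106 / n  # observed proportion unemployed, about 0.111

# Each simulated value is the proportion "unemployed" in n weighted coin flips
x <- rbinom(repetitions, size = n, prob = null_p) / n

# Two-sided P-value: simulated proportions at least as far from 0.089
# as the observed proportion, in either direction
p_value <- mean(abs(x - null_p) >= abs(test_stat - null_p))
p_value
```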

Simulation 2

repetitions <- 100  
x <- rep(NA, repetitions) 

for (i in 1:repetitions)
{
  sim <- employed %>% mutate(edu2 = sample(edu2))  
  sim_stat <- sim %>% group_by(edu2) %>% 
                    summarise(means = mean(income)) %>% 
                    summarise(diff(means))
  x[i] <- as.numeric(sim_stat)
}

Purpose: 3b, simulate differences in a statistic between two groups, defined by the value of edu2. To do this we shuffle the values of edu2 and compare the mean income between the resulting two education groups.
We are doing this to examine the distribution of possible values of the difference in the mean income between the two education groups, assuming that there is no difference.
The simulation is for an hypothesis test with $H_0: \mu_1=\mu_2$ and $H_A: \mu_1 \ne \mu_2$ where $\mu_1$ is the mean income of people with high school or lower education in the population and $\mu_2$ is the mean income of people with more than high school education in the population.

Note: The grid marks are 2500 apart on the horizontal axis.

The centre of the dotplot is 0 (the null hypothesis value for the difference in the means).
The test statistic is $29963.08 - 65010.21 = -35047.13$ (using statistics given above).
For the P-value, we need the number of observations in the plot that are greater than or equal to 35047.13 or less than or equal to -35047.13. These values are way off the scale of the plot so our estimate of the P-value is 0 (to at least 2 decimal places).
We conclude that we have very strong evidence that the mean income is different for people with high school or lower education and people with more than high school education.

Simulation 3

repetitions <- 100
x <- rep(NA, repetitions) 

n <- as.numeric(labor_force %>% summarize(n()))
 
for (i in 1:repetitions)
{
  sim <- labor_force %>% sample_n(size = n, replace=TRUE)
  x[i] <- as.numeric(sim %>% filter(employment == "unemployed") %>% 
                            summarize(n())) / n
}

Purpose: 2, estimate a sampling distribution of the proportion unemployed from our data, that is a bootstrap sampling distribution.
The simulation is to estimate a confidence interval for the proportion unemployed in the population.

Note: The grid marks are 0.005 apart on the horizontal axis.

The centre of the dotplot is 0.111 (the estimated proportion unemployed in the data).
For a 90% confidence interval, disregard 5% of the observations (that is, 5 observations) from each of the left and right tails of the bootstrap distribution. Our estimate of the 90% confidence interval for the proportion of people unemployed in the population is roughly (0.095, 0.13).

Below is a boxplot for Simulation 3, plotting the same data as the dotplot.
Can you use it to estimate a confidence interval? For any confidence level?

For a confidence interval, we need percentiles of the bootstrap sampling distribution. The boxplot gives us the 25th percentile (about 0.104) and the 75th percentile (about 0.12). So a 50% confidence interval for the proportion of people unemployed in the population is approximately (0.104, 0.12).
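Both intervals for Simulation 3 are percentiles of the bootstrap sampling distribution, so they can be computed with quantile(). In this sketch the employment vector is rebuilt from the counts in table(acs12$employment) rather than taken from the labor_force data frame, base R's sample() is used in place of sample_n(), and more repetitions are used than in the simulation above, so the endpoints will differ slightly from those read off the plots.

```r
set.seed(130)

# Rebuild the labour force from the table counts: 106 unemployed, 843 employed
employment <- rep(c("unemployed", "employed"), times = c(106, 843))
n <- length(employment)

repetitions <- 2000
x <- rep(NA, repetitions)
for (i in 1:repetitions) {
  boot <- sample(employment, size = n, replace = TRUE)  # bootstrap sample
  x[i] <- mean(boot == "unemployed")                    # proportion unemployed
}

# 90% interval: cut 5% from each tail
ci_90 <- quantile(x, c(0.05, 0.95))

# 50% interval: the first and third quartiles, as read from the boxplot
ci_50 <- quantile(x, c(0.25, 0.75))

ci_90
ci_50
```

A boxplot only marks the quartiles (and the median), so the 50% level is the only confidence level that can be read from it directly.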

STA130H1 – Winter 2018 – Test Review Solutions

Prof. Gibbs

February 26, 2018
