A clinical oncologist is investigating the efficacy of a new treatment on reduction in tumour size. She randomly assigns patients to the new treatment or old treatment and compares the mean of the reduction in tumour size between the two groups. She carries out a statistical test and the P-value is 0.001. How many of the following are valid interpretations of the P-value?
I. The probability of observing a difference between the treatment groups as large or larger than she observed if the new treatment has the same efficacy as the old treatment.
II. The probability that the new treatment works the same as the old treatment.
III. The probability that the new treatment, on average, reduces tumour size more than the old treatment.
A. None
B. One
C. Two
D. Three
Answer: B. One
Environmental scientists want to estimate the mean mercury content in ppm of fish in a lake. They collect a random sample of 50 fish from the lake, measure the mercury content of each, calculate the average mercury content for these 50 fish and use the bootstrap to find a 99% confidence interval for the mean. The confidence interval is (0.82, 1.13). How many of the following are valid interpretations of the confidence interval?
I. We are 99% certain that each fish has approximately 0.82 to 1.13 ppm of mercury.
II. We expect 99% of the fish to have between 0.82 and 1.13 ppm of mercury.
III. We would expect about 99% of all possible sample means from this population to be in between 0.82 and 1.13 ppm of mercury.
IV. We are 99% certain that the confidence interval of (0.82, 1.13) includes the true mean of the mercury content of fish in the lake.
A. None; B. One; C. Two; D. Three; E. All four
Answer: B. One
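A minimal sketch of how a bootstrap percentile interval like this one could be computed. The 50 mercury readings are made up for illustration (they are not the scientists' data), and the names mercury, boot_means and ci_99 are assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: 50 made-up mercury readings in ppm, NOT the real sample
mercury <- rgamma(50, shape = 4, rate = 4)

# Each bootstrap sample has the same size as the data and is drawn with replacement;
# we record the mean of each bootstrap sample
boot_means <- replicate(5000, mean(sample(mercury, replace = TRUE)))

# A 99% percentile interval leaves 0.5% of the bootstrap means in each tail
ci_99 <- quantile(boot_means, probs = c(0.005, 0.995))
ci_99
```

The interval is a statement about the population mean, which is why only interpretation IV is valid; it says nothing about individual fish or about other sample means.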
Fill in the respective blanks:
Suppose we wish to test the null hypothesis that a Yoga method does not have an effect on blood pressure versus the alternative that it does have an effect. A _____________ error would be made by concluding that the Yoga method _____________ on blood pressure if in fact the Yoga method ____________ on blood pressure.
A. Type 2; does not have an effect; does have an effect
B. Type 2; does not have an effect; does not have an effect
C. Type 2; does have an effect; does not have an effect
D. Type 1; does not have an effect; does have an effect
E. Type 1; does not have an effect; does not have an effect
Answer: A. This is the definition of a Type 2 error. (C would be correct if it was Type 1.)
On the next slide are 4 histograms:
x in the population (which consists of 1,000,000 individuals).
x for a sample of size 1000 from the population.
values of \(\bar{x}\), each from a sample of size 25.
values of \(\bar{x}\), each from a sample of size 100.
Which is which?
Answer:
x in the population. We can tell by the counts on the y-axis.
x of size 100 from the population. Histograms of samples of a variable look like the histogram of the variable in the population.
From the above plots, consider these two plots:
values of \(\bar{x}\), each from a sample of size 25
values of \(\bar{x}\), each from a sample of size 100
In statistical inference, we want to make conclusions about what we think about the theoretical world (a scientific model or population) based on what we’ve observed in the real world (data, typically observed on a random sample).
Do the following items exist in the theoretical world or the real world?
Suppose we have a vector of data values, x
x <- c(1, 3, 4, 4, 7)
We’ve used the sample() function in various ways. What output is possible with each of the following commands?
sample(x)
An example of possible output: 3 4 1 7 4
The command shuffles the 5 values of x into random order.
sample(x, replace=TRUE)
An example of possible output: 1 1 1 1 1
The output is 5 values from x in random order; values may or may not repeat.
sample(x, size=2, replace=TRUE)
An example of possible output: 1 3; could also be 7 7, etc.
The output is 2 values from x in random order; values may or may not repeat.
sample(x, size=2, prob=c(0.5, 0.5, 0, 0, 0))
Only two possibilities for the output: 1 3 or 3 1
The output is 2 values from the first two values in x, chosen with equal probability and in random order; values may not repeat (sampling is without replacement).
Which one of these is a bootstrap sample?
sample(x, replace=TRUE)
sample with the same sample size, with replacement
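The four commands above can be tried directly. This is a small sketch; the seed and the names shuffled, boot, two_repl and two_first are assumptions added for illustration, and the exact values drawn depend on the seed.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
x <- c(1, 3, 4, 4, 7)

shuffled  <- sample(x)                        # all 5 values, shuffled (without replacement)
boot      <- sample(x, replace = TRUE)        # bootstrap sample: same size, with replacement
two_repl  <- sample(x, size = 2, replace = TRUE)               # 2 values, repeats allowed
two_first <- sample(x, size = 2, prob = c(0.5, 0.5, 0, 0, 0))  # only 1 and 3 can be drawn

shuffled
boot
two_repl
two_first
```

Every value drawn must already be in x, so an output such as 1 2 is impossible for any of these commands.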
The American Community Survey is conducted by the US Census Bureau each year on a random sample of 3.5 million households. Findings from the survey influence the allocation of more than $400 billion in federal and state funds. The dataset acs12 is a random sample from the people who completed the American Community Survey in 2012.
Here is a look at the data and some of the variables we will consider later:
glimpse(acs12)
## Observations: 2,000
## Variables: 13
## $ income <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, ...
## $ employment <fct> not in labor force, not in labor force, NA, not i...
## $ hrs_work <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, N...
## $ race <fct> white, white, white, white, white, other, white, ...
## $ age <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, ...
## $ gender <fct> female, male, female, male, female, female, male,...
## $ citizen <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
## $ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA...
## $ lang <fct> english, english, english, other, other, other, e...
## $ married <fct> no, no, no, no, no, yes, no, no, no, yes, no, no,...
## $ edu <fct> college, hs or lower, hs or lower, hs or lower, h...
## $ disability <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, n...
## $ birth_qrtr <fct> jul thru sep, jan thru mar, oct thru dec, oct thr...
table(acs12$employment)
##
## not in labor force         unemployed           employed
##                656                106                843
table(acs12$edu)
##
## hs or lower     college        grad
##        1439         359         144
Describe the data frames that are created by each of the following commands:
labor_force <- acs12 %>% filter(!is.na(employment)) %>%
  filter(employment == "employed" | employment == "unemployed")
The resulting data frame has fewer observations; observations for which the variable employment is missing or has the value "not in labor force" are removed.
employed <- labor_force %>% filter(employment == "employed")
The resulting data frame has even fewer observations; only observations for which employment has the value "employed" are kept.
employed <- employed %>%
  mutate(edu2 = recode(edu, "hs or lower" = "hs_or_lower",
                       "college"="more_than_hs", "grad"="more_than_hs"))
The resulting data frame has a new variable edu2, which is the same as edu except that observations with value "college" or "grad" are changed to the value "more_than_hs" (and "hs or lower" is relabelled "hs_or_lower").
cat_vars <- acs12 %>% select(employment, race, gender, citizen, lang, married, edu,
                             disability, birth_qrtr)
the resulting data frame has fewer variables, only those “selected”
We’ve used these plot geometries:
geom_bar, geom_boxplot, geom_dotplot, geom_histogram, geom_line, geom_point, geom_vline
Recall this plot vocabulary:
On the next several slides are a number of plots, each constructed from the dataset employed. For each, what ggplot geometry is used?
Type of plot: side-by-side boxplots
Geometry: geom_boxplot
Purpose: compare the distribution of income (a numerical variable) for different values of edu (a categorical variable)
Description: For all values of edu, the distribution of income is right-skewed. The median, first quartile, and third quartile all increase with more education.
Why use a log transformation?
The log transformation spreads out small values in right-skewed distributions and makes large values less extreme.
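A small numerical sketch of that effect. The incomes below are made-up values chosen to be right-skewed, and base-10 logs are used only because they are easy to read (the slides do not say which base was used).

```r
# Hypothetical right-skewed incomes (not from acs12)
income <- c(1000, 5000, 20000, 40000, 80000, 1000000)
log_income <- log10(income)   # assumption: base 10, for readability

# Gaps between neighbouring values on each scale
diff(income)       # the last gap dwarfs all the others
diff(log_income)   # the gaps are now of comparable size

# Ratio of the largest gap to the smallest gap, before and after the log
raw_ratio <- max(diff(income)) / min(diff(income))
log_ratio <- max(diff(log_income)) / min(diff(log_income))
c(raw = raw_ratio, log = log_ratio)
```

On the raw scale the largest value is isolated far from the rest; on the log scale the small values are spread out and the large value is pulled in.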
Type of plot: scatterplot
Geometry: geom_point
Purpose: examine the relationship between log of income and hrs_work (two numerical variables)
Description: Ignoring the set of 0 values for income, there is a moderate, positive, linear relationship between log of income and hrs_work.
Type of plot: bar plot
Geometry: geom_bar
Purpose: examine the distribution of a categorical variable, edu
Description: Most observations have high school or lower education. The number with college education is about half the number with high school or lower education, and the number with graduate school education is about one-fifth the number with high school or lower education.
Type of plot: histogram
Geometry: geom_histogram
Purpose: examine the distribution of a numerical variable, income
Description: The distribution has a single mode at about 25,000 and is very right-skewed.
What is the difference in these histograms? How do you get the second from the first?
In the count plot, the height of a bar is the number of observations that are in that bin. In the density plot, the area of a bar is the proportion of observations that are in that bin. The height of the bar in the density plot is the height of the bar in the count plot divided by the product of the binwidth and the total number of observations.
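The conversion from counts to densities can be checked with base R's hist(). This is a sketch on made-up data; the variable values, the bin width of 10,000 and the other names are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed values standing in for income
values <- rexp(200, rate = 1 / 30000)
binwidth <- 10000

# Equal-width bins covering the whole range of the data
breaks <- seq(0, ceiling(max(values) / binwidth) * binwidth, by = binwidth)
h <- hist(values, breaks = breaks, plot = FALSE)

# Density height = count / (binwidth * total number of observations)
density_by_hand <- h$counts / (binwidth * length(values))

all.equal(density_by_hand, h$density)   # matches the densities hist() reports
sum(density_by_hand * binwidth)         # bar areas add up to 1
```

Because each bar's area is a proportion, the areas of all the bars in the density plot sum to 1, whatever the bin width.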
We’ve looked at simulations to:
Below is code for three simulations. For each:
If the purpose of the simulation is to …
labor_force %>% group_by(employment) %>% summarise(n_group = n()) %>%
  mutate(percent = n_group / sum(n_group))
employed %>% group_by(edu2) %>% summarise(mean(income))
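repetitions <- 100
x <- rep(NA, repetitions)   # storage for the simulated proportions
n <- as.numeric(labor_force %>% summarize(n()))   # sample size
for (i in 1:repetitions)
{
  # simulate n employment outcomes, assuming the unemployment rate is 0.089
  sim <- sample(c("unemployed", "employed"), size=n, prob=c(0.089, 1-0.089), replace=TRUE)
  # simulated proportion unemployed
  sim_stat <- sum(sim == "unemployed") / n
  x[i] <- as.numeric(sim_stat)
}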
Purpose: 3a, simulate outcomes of a proportion under an assumption.
We are doing this to examine the distribution of possible values of the proportion of people unemployed for a sample this size, assuming that the true proportion of people unemployed is 0.089 (8.9% is the 2011 US value).
The simulation is for a hypothesis test with \(H_0: p=0.089\) and \(H_A: p \ne 0.089\) where p is the proportion of people unemployed in the population.
Note: The grid marks are 0.005 apart on the horizontal axis.
The centre of the histogram is 0.089 (the null hypothesis value).
The test statistic is 0.111.
For the P-value, we need the number of observations in the plot that are greater than or equal to 0.111 or less than or equal to 0.067. It appears that there may be one value (in the right tail) that satisfies this (although we can’t say for sure as the value is close to 0.11). Assuming that value is indeed greater than or equal to 0.111, our estimate of the P-value is 1/100 or 0.01.
We conclude that we have strong evidence that the unemployment rate in 2012 is different from the 2011 value of 0.089.
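Rather than counting dots by eye, the two-sided P-value can be computed from the simulated values. The sketch below makes several assumptions: the labour-force size is taken as 106 + 843 from the employment table, rbinom() is swapped in for the sample() loop (it draws the number unemployed directly under the same null model), and 10,000 repetitions are used instead of 100 to steady the estimate.

```r
set.seed(2012)  # arbitrary seed so the sketch is reproducible

n <- 106 + 843           # assumption: labour-force size from the employment table
p_null <- 0.089          # null hypothesis value
observed <- 106 / n      # observed proportion unemployed, about 0.11
repetitions <- 10000     # assumption: more repetitions than the 100 used above

# Simulated proportions unemployed when the true proportion is p_null
sim_props <- rbinom(repetitions, size = n, prob = p_null) / n

# Two-sided P-value: simulated values at least as far from p_null as the observed one
distance <- abs(observed - p_null)
p_value <- mean(abs(sim_props - p_null) >= distance)
p_value
```

With only 100 repetitions the estimate can only be a multiple of 0.01, which is why the value read off the plot is so coarse.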
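repetitions <- 100
x <- rep(NA, repetitions)   # storage for the simulated differences in means
for (i in 1:repetitions)
{
  # shuffle the education labels, breaking any link between edu2 and income
  sim <- employed %>% mutate(edu2 = sample(edu2))
  # difference in mean income between the two shuffled groups
  sim_stat <- sim %>% group_by(edu2) %>%
    summarise(means = mean(income)) %>%
    summarise(diff(means))
  x[i] <- as.numeric(sim_stat)
}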
Purpose: 3b, simulate differences in a statistic between two groups, defined by the value of edu2. To do this we shuffle the values of edu2 and compare the mean income between the resulting two education groups.
We are doing this to examine the distribution of possible values of the difference in the mean income between the two education groups, assuming that there is no difference.
The simulation is for a hypothesis test with \(H_0: \mu_1=\mu_2\) and \(H_A: \mu_1 \ne \mu_2\) where \(\mu_1\) is the mean income of people with high school or lower education in the population and \(\mu_2\) is the mean income of people with more than high school education in the population.
Note: The grid marks are 2500 apart on the horizontal axis.
The centre of the histogram is 0 (the null hypothesis value for the difference in the means).
The test statistic is \(29963.08 - 65010.21 = -35047.13\) (using statistics given above).
For the P-value, we need the number of observations in the plot that are greater than or equal to 35047.13 or less than or equal to -35047.13. These values are way off the scale of the plot so our estimate of the P-value is 0 (to at least 2 decimal places).
We conclude that we have very strong evidence that the mean income is different for people with high school or lower education and people with more than high school education.
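The same shuffling idea, sketched in base R on a tiny made-up data set so it runs without acs12. The 12 incomes, the group labels and the helper mean_diff() are hypothetical, and 2000 repetitions is an arbitrary choice.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical incomes for two made-up education groups (6 people each)
income <- c(18000, 22000, 25000, 30000, 31000, 40000,
            45000, 52000, 60000, 75000, 90000, 120000)
edu2 <- rep(c("hs_or_lower", "more_than_hs"), each = 6)

# Hypothetical helper: difference in mean income for a given labelling
mean_diff <- function(groups) {
  mean(income[groups == "more_than_hs"]) - mean(income[groups == "hs_or_lower"])
}

observed <- mean_diff(edu2)

# Shuffle the labels many times and recompute the difference each time
shuffled <- replicate(2000, mean_diff(sample(edu2)))

# Two-sided P-value: shuffled differences at least as extreme as the observed one
p_value <- mean(abs(shuffled) >= abs(observed))
c(observed = observed, p_value = p_value)
```

Shuffling the labels is what enforces the null hypothesis: if education group and income are unrelated, any assignment of labels to incomes is equally likely.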
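repetitions <- 100
x <- rep(NA, repetitions)   # storage for the bootstrap proportions
n <- as.numeric(labor_force %>% summarize(n()))   # sample size
for (i in 1:repetitions)
{
  # bootstrap sample: n rows drawn from labor_force with replacement
  sim <- labor_force %>% sample_n(size = n, replace=TRUE)
  # proportion unemployed in the bootstrap sample
  x[i] <- as.numeric(sim %>% filter(employment == "unemployed") %>%
    summarize(n())) / n
}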
Purpose: 2, estimate a sampling distribution of the proportion unemployed from our data, that is a bootstrap sampling distribution.
The simulation is to estimate a confidence interval for the proportion unemployed in the population.
Note: The grid marks are 0.005 apart on the horizontal axis.
The centre of the histogram is 0.111 (the estimated proportion unemployed in the data).
For a 90% confidence interval, disregard 5% of the observations (that is, 5 observations) from each of the left and right tails of the bootstrap distribution. Our estimate of the 90% confidence interval for the proportion of people unemployed in the population is roughly (0.095, 0.13).
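Trimming the tails by hand is the same as taking the 5th and 95th percentiles of the bootstrap values. A sketch in base R, under the assumption that the labour force consists of the 106 unemployed and 843 employed people in the employment table; 2000 repetitions (rather than 100) is also an assumption.

```r
set.seed(90)  # arbitrary seed so the sketch is reproducible

# Assumption: rebuild the employment column from the table counts
employment <- rep(c("unemployed", "employed"), times = c(106, 843))
n <- length(employment)

# Bootstrap: resample n values with replacement, record the proportion unemployed
boot_props <- replicate(2000,
  mean(sample(employment, size = n, replace = TRUE) == "unemployed"))

# 90% percentile interval: cut 5% off each tail
ci_90 <- quantile(boot_props, probs = c(0.05, 0.95))
ci_90
```

The result should land close to the interval read off the plot, with the difference coming from the number of repetitions and the random draws.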
Below is a boxplot for Simulation 3, plotting the same data as the histogram.
Can you use it to estimate a confidence interval? For any confidence level?
For a confidence interval, we need percentiles of the bootstrap sampling distribution. The boxplot gives us the 25th percentile (about 0.104) and the 75th percentile (about 0.12). So a 50% confidence interval for the proportion of people unemployed in the population is approximately (0.104, 0.12).
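To see that the edges of the box are (approximately) the quartiles, the box edges can be compared with quantile(). This sketch again assumes the table counts of 106 unemployed and 843 employed, and uses 1000 repetitions as an arbitrary choice.

```r
set.seed(50)  # arbitrary seed so the sketch is reproducible

# Assumption: rebuild the employment column from the table counts
employment <- rep(c("unemployed", "employed"), times = c(106, 843))
boot_props <- replicate(1000, mean(sample(employment, replace = TRUE) == "unemployed"))

# Edges of the box in a boxplot (lower and upper hinges)
box_edges <- boxplot.stats(boot_props)$stats[c(2, 4)]

# 25th and 75th percentiles, i.e. a 50% percentile interval
quartiles <- quantile(boot_props, probs = c(0.25, 0.75))

box_edges
quartiles
```

The boxplot shows no other percentiles of the bulk of the distribution, so it supports only the 50% level; other confidence levels need the bootstrap values themselves.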