Instructions

What should I bring to tutorial on September 14?

  • R output (e.g., plots) for Question 2. You can either bring a hardcopy or bring your laptop with the output.

Practice Problems

Question 1

The Marriage data is in the mosaic package, which you must first load with the command library(mosaic). You can read more about the data and the variables here: https://rdrr.io/cran/mosaicData/man/Marriage.html.

  1. Choose two categorical variables and plot thier distributions. Interpret the plots.
# Construct your plots in this code chunk
library(mosaic)
library(tidyverse)

ggplot(data = Marriage) + aes(x = officialTitle) + geom_bar() + coord_flip()

ggplot(data = Marriage) + aes(x = sign) + geom_bar() + coord_flip()

The first plot shows that the majority of marriages were performed by a marriage offical, pastor, or minister. The second plot shows that the majority of people that filled out the application are Pisces, and the next most common sign is Virgo and Aries.

  1. Choose a quantitative variables and plot it’s distributions. Interpret the plot.
# Construct your plots in this code chunk
library(mosaic)
library(tidyverse)

ggplot(data = Marriage) + aes(x = age) + geom_histogram(fill = "grey", colour = "black") 

The distribution of applicant age is right skewed since the data trail off to the right. There are two prominent peaks so the distribution is bimodal. The modes appear around 20 and 40. This means that the two largest groups of marriage license applicants are in thier twenties and forties.

  1. Construct a plot that shows the relationship between two variables. What can you say about the relationship?
# Construct your plots in this code chunk
library(mosaic)
library(tidyverse)

ggplot(data = Marriage) + 
  aes(x = age) + 
  geom_histogram(fill = "grey", colour = "black") + 
  facet_wrap(~prevcount)

The distribution of age is shown by the number of previous marriages of the applicant. Applicants with at least two marriages tend to be older compared to applicants with fewer previous marriages.

Question 2

The Gestation data set is also part of the mosaic package. Use the help command ?Gestation for a description of the data.

  1. Create three histograms of the length of gestation using the number of bins defined as 2, 25, and 50. What is the relationship between the number of bins and the width of the bins? Which number of bins do you think is most appropriate to display this distribution? What is the shape of the distribution? Explain.
library(tidyverse)
library(mosaic)

ggplot(data = Gestation) + aes(x = gestation) + 
  geom_histogram(fill = "grey", colour = "black", bins = 2)

ggplot(data = Gestation) + aes(x = gestation) + 
  geom_histogram(fill = "grey", colour = "black", bins = 25)

ggplot(data = Gestation) + aes(x = gestation) + 
  geom_histogram(fill = "grey", colour = "black", bins = 50)

The bins are defined by lower and upper bounds of gestation days; 2 bins results in larger bin widths, while 25 and 50 result in smaller bin widths. When the number of bins is 25 or 50 we can see that the shape of the distribution is symmetric. If the number of bins is set to 2 then it’s difficult to discern the shape of the distribution.

  1. Do high school graduates and college graduates have different gestation distributions? Construct a data vizualization to investigate the answer to this question. Explain why you chose this vizualization.
library(tidyverse)
library(mosaic)

ggplot(data = Gestation) + aes(x = gestation) + 
  geom_histogram(fill = "grey", colour = "black", bins = 25) +
  facet_grid(~ed)

We can use faceting to compare the histograms for high school graduates (2) and college graduates (4). The shape of the histograms are similar: both are symmetric and centred around 280 days.

  1. Create a vizualization to explore the relationship between a babies birth weight and gestation length. Explain why you chose this vizualization. What can you learn from your vizualization?
library(tidyverse)
library(mosaic)

ggplot(data = Gestation) + aes(y = wt, x = gestation) + 
  geom_point()

ggplot(data = Gestation) + aes(y = wt, x = gestation, colour = smoke) + 
  geom_point()

ggplot(data = Gestation) + aes(y = wt, x = gestation) + 
  geom_point() +
  facet_grid(~smoke)

The scatterplot shows the relationship between the birth weight wt and length of gestation. number. The plot shows that as gestation time increases babies birth weight increases.

  1. Modify the vizualization that you created in part (c) to evaluate if the relationship between a babies weight and gestation time is the same for mother’s that never smoked compared to meother’s that smoke now. Explain how you modified your vizualization.
library(tidyverse)
library(mosaic)

ggplot(data = Gestation) + aes(y = wt, x = gestation) + 
  geom_point() +
  facet_grid(~smoke)

A facet plot shows scatterplots for different categories of smokers. Comparing plots with smoke = 0 and smoke = 1 doesn’t reveal a difference in the relationship between baby weight and gestation in mother’s that never smoked compared to mother’s that smoked.

Question 3

For this exercise, you will load data from an external source. You can read about the data here: http://sta220.utstat.utoronto.ca/data/the-skeleton-data/.

The data are in a plain text file with spaces between columns here: http://stats.onlinelearning.utoronto.ca/wp-content/uploaded/Data/SkeletonDatacomplete.txt. The following code will load the data into a tibble (the tidyverse version of a data frame).

  1. Read the data into R using the following code.
library(tidyverse)
data_url <- "http://stats.onlinelearning.utoronto.ca/wp-content/uploaded/Data/SkeletonDatacomplete.txt"
skeleton_data <- read_table(data_url)

Inspect the data to make sure it is read in completely. You can compare by going directly to the data_url.

  1. Construct at least four interesting graphs with the data, including: a graph of one categorical variable, a graph of one quantitative variable, a graph with at least two variables, a graph with at least three variables.

Example graph of one categorical variable:

ggplot(data = skeleton_data) + aes(BMIcat) + geom_bar()

ggplot(data= skeleton_data) + aes(Sex) + geom_bar()

Example graphs of one quantitative variable:

ggplot(data = skeleton_data, aes(BMIquant)) + geom_histogram(bins = 20, colour = "black", fill = "grey")

ggplot(data = skeleton_data, aes(DGerror)) + geom_histogram(bins = 30, colour = "black", fill = "grey")

ggplot(data = skeleton_data, aes(SBerror)) + geom_histogram(bins = 30, colour = "black", fill = "grey")

Example graphs with two variables:

ggplot(data = skeleton_data, aes(Age, DGerror)) + geom_point()

ggplot(data = skeleton_data, aes(Age, SBerror)) + geom_point()

Example graphs with three variables:

ggplot(data = skeleton_data, aes(Age, DGerror)) + geom_point() + facet_wrap(~Sex)

ggplot(data = skeleton_data, aes(Age, SBerror)) + geom_point() + facet_wrap(~Sex)

  1. Describe what you learned about the data from your graphs.

Some observations that can be made from the plots:

  • Most (more than half) of the skeletons in the data were from people of normal weight. Very few were obese.
  • There are more than twice as many skeletons in the data who are male than female.
  • The distribution of the quantitative measurement of BMI is symmetric and unimodal. The mode is between 21 and 25 kg/m\(^2\).
  • There is a strong, negative, linear relationship between age and error for both methods. This relationship is true for both sexes and is similar for both sexes. It is also true for all BMI categories.
  • The distribution of error using the Di Gangi method is slightly left-skewed. The distribution of error using the Suchey-Brooks method is closer to symmetric.