The Marriage
data is in the mosaic
package, which you must first load with the command library(mosaic)
. You can read more about the data and the variables here: https://rdrr.io/cran/mosaicData/man/Marriage.html.
# Construct your plots in this code chunk
library(mosaic)
library(tidyverse)
ggplot(data = Marriage) + aes(x = officialTitle) + geom_bar() + coord_flip()
ggplot(data = Marriage) + aes(x = sign) + geom_bar() + coord_flip()
The first plot shows that the majority of marriages were performed by a marriage offical, pastor, or minister. The second plot shows that the majority of people that filled out the application are Pisces, and the next most common sign is Virgo and Aries.
# Construct your plots in this code chunk
library(mosaic)
library(tidyverse)
ggplot(data = Marriage) + aes(x = age) + geom_histogram(fill = "grey", colour = "black")
The distribution of applicant age is right skewed since the data trail off to the right. There are two prominent peaks so the distribution is bimodal. The modes appear around 20 and 40. This means that the two largest groups of marriage license applicants are in thier twenties and forties.
# Construct your plots in this code chunk
library(mosaic)
library(tidyverse)
ggplot(data = Marriage) +
aes(x = age) +
geom_histogram(fill = "grey", colour = "black") +
facet_wrap(~prevcount)
The distribution of age is shown by the number of previous marriages of the applicant. Applicants with at least two marriages tend to be older compared to applicants with fewer previous marriages.
The Gestation
data set is also part of the mosaic
package. Use the help command ?Gestation
for a description of the data.
library(tidyverse)
library(mosaic)
ggplot(data = Gestation) + aes(x = gestation) +
geom_histogram(fill = "grey", colour = "black", bins = 2)
ggplot(data = Gestation) + aes(x = gestation) +
geom_histogram(fill = "grey", colour = "black", bins = 25)
ggplot(data = Gestation) + aes(x = gestation) +
geom_histogram(fill = "grey", colour = "black", bins = 50)
The bins are defined by lower and upper bounds of gestation days; 2 bins results in larger bin widths, while 25 and 50 result in smaller bin widths. When the number of bins is 25 or 50 we can see that the shape of the distribution is symmetric. If the number of bins is set to 2 then it’s difficult to discern the shape of the distribution.
library(tidyverse)
library(mosaic)
ggplot(data = Gestation) + aes(x = gestation) +
geom_histogram(fill = "grey", colour = "black", bins = 25) +
facet_grid(~ed)
We can use faceting to compare the histograms for high school graduates (2) and college graduates (4). The shape of the histograms are similar: both are symmetric and centred around 280 days.
library(tidyverse)
library(mosaic)
ggplot(data = Gestation) + aes(y = wt, x = gestation) +
geom_point()
ggplot(data = Gestation) + aes(y = wt, x = gestation, colour = smoke) +
geom_point()
ggplot(data = Gestation) + aes(y = wt, x = gestation) +
geom_point() +
facet_grid(~smoke)
The scatterplot shows the relationship between the birth weight wt
and length of gestation. number
. The plot shows that as gestation time increases babies birth weight increases.
library(tidyverse)
library(mosaic)
ggplot(data = Gestation) + aes(y = wt, x = gestation) +
geom_point() +
facet_grid(~smoke)
A facet plot shows scatterplots for different categories of smokers. Comparing plots with smoke = 0
and smoke = 1
doesn’t reveal a difference in the relationship between baby weight and gestation in mother’s that never smoked compared to mother’s that smoked.
For this exercise, you will load data from an external source. You can read about the data here: http://sta220.utstat.utoronto.ca/data/the-skeleton-data/.
The data are in a plain text file with spaces between columns here: http://stats.onlinelearning.utoronto.ca/wp-content/uploaded/Data/SkeletonDatacomplete.txt. The following code will load the data into a tibble (the tidyverse version of a data frame).
R
using the following code.library(tidyverse)
data_url <- "http://stats.onlinelearning.utoronto.ca/wp-content/uploaded/Data/SkeletonDatacomplete.txt"
skeleton_data <- read_table(data_url)
Inspect the data to make sure it is read in completely. You can compare by going directly to the data_url
.
Example graph of one categorical variable:
ggplot(data = skeleton_data) + aes(BMIcat) + geom_bar()
ggplot(data= skeleton_data) + aes(Sex) + geom_bar()
Example graphs of one quantitative variable:
ggplot(data = skeleton_data, aes(BMIquant)) + geom_histogram(bins = 20, colour = "black", fill = "grey")
ggplot(data = skeleton_data, aes(DGerror)) + geom_histogram(bins = 30, colour = "black", fill = "grey")
ggplot(data = skeleton_data, aes(SBerror)) + geom_histogram(bins = 30, colour = "black", fill = "grey")
Example graphs with two variables:
ggplot(data = skeleton_data, aes(Age, DGerror)) + geom_point()
ggplot(data = skeleton_data, aes(Age, SBerror)) + geom_point()
Example graphs with three variables:
ggplot(data = skeleton_data, aes(Age, DGerror)) + geom_point() + facet_wrap(~Sex)
ggplot(data = skeleton_data, aes(Age, SBerror)) + geom_point() + facet_wrap(~Sex)
Some observations that can be made from the plots: