Exercise 3.1 in the textbook uses data that come with R. The dataset is in the mosaic package, which you must first load with the command library(mosaic). The name of the data frame is Galton.
Construct the plots that you are asked to construct in Exercise 3.1.
Name three additional plots that would be interesting to examine.
Load libraries and look at data:
# load libraries
library(tidyverse)
library(mosaic)
glimpse(Galton)
## Observations: 898
## Variables: 6
## $ family <fctr> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, ...
## $ father <dbl> 78.5, 78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5, 75.0, 7...
## $ mother <dbl> 67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 6...
## $ sex <fctr> M, F, F, F, M, M, F, F, M, F, M, M, F, F, F, M, M, M, ...
## $ height <dbl> 73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 6...
## $ nkids <int> 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6...
Part 1: scatterplot of each person’s height against their father’s height
Galton %>% ggplot(aes(height,father)) + geom_point()
Parts 2 and 3: separate by facets and add regression lines to each facet
Galton %>% ggplot(aes(height,father)) + geom_point() + facet_wrap(~sex) + geom_smooth(method="lm")
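To put numbers on the regression lines drawn in each facet, one option is to fit the same line separately for each sex. The following is a minimal sketch, assuming Galton is available from the mosaicData package (which library(mosaic) also attaches); the name fits is only illustrative. The model is father ~ height because the plots above put height on the x-axis and father on the y-axis.

```r
library(mosaicData)  # provides Galton; library(mosaic) loads it as well

# Split the data by sex and fit father ~ height in each group,
# mirroring what geom_smooth(method = "lm") does within each facet.
fits <- sapply(split(Galton, Galton$sex),
               function(d) coef(lm(father ~ height, data = d)))

# Rows are the intercept and slope; columns are the levels of sex.
fits
```

Comparing the two slopes gives a rough sense of whether the fitted lines in the two facets differ much.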
Some other plots that might be interesting:
Galton %>% ggplot(aes(nkids)) + geom_bar()
Galton %>% ggplot(aes(sex, height)) + geom_boxplot()
Galton %>% ggplot(aes(height, mother)) + geom_point()
Galton %>% ggplot(aes(height, mother)) + geom_point() + facet_wrap(~sex)
Galton %>% ggplot(aes(height, mother, color=sex)) + geom_point() + facet_wrap(~nkids)
Bring your output for this question to tutorial on Friday January 12 (either a hardcopy or on your laptop). For this question, we will use the data in Exercise 3.4 in the textbook. You can read more about the data and the variables here: https://rdrr.io/cran/mosaicData/man/Marriage.html.
Look at the data:
glimpse(Marriage)
## Observations: 98
## Variables: 15
## $ bookpageID <fctr> B230p539, B230p677, B230p766, B230p892, B230p99...
## $ appdate <fctr> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96,...
## $ ceremonydate <fctr> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96,...
## $ delay <int> 11, 0, 8, 5, 5, 0, 16, 0, 28, 10, 8, 0, 4, 4, 0,...
## $ officialTitle <fctr> CIRCUIT JUDGE , MARRIAGE OFFICIAL, MARRIAGE OFF...
## $ person <fctr> Groom, Groom, Groom, Groom, Groom, Groom, Groom...
## $ dob <fctr> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/...
## $ age <dbl> 32.60274, 32.29041, 34.79178, 40.57808, 30.02192...
## $ race <fctr> White, White, Hispanic, Black, White, White, Wh...
## $ prevcount <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 3, 1, 1, 0, 0, 1, 0, ...
## $ prevconc <fctr> NA, Divorce, Divorce, Divorce, NA, NA, Divorce,...
## $ hs <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, ...
## $ college <int> 7, 0, 3, 4, 0, 0, 0, 0, 0, 6, 2, 1, 1, 0, 0, 4, ...
## $ dayOfBirth <dbl> 102.00, 219.00, 51.50, 141.00, 348.50, 52.50, 28...
## $ sign <fctr> Aries, Leo, Pisces, Gemini, Saggitarius, Pisces...
Make barplots of any of the categorical variables, such as the following:
Marriage %>% ggplot(aes(officialTitle)) + geom_bar() + coord_flip()
Marriage %>% ggplot(aes(race)) + geom_bar()
Marriage %>% ggplot(aes(prevcount)) + geom_bar() # not categorical but only a small number of values
Marriage %>% ggplot(aes(prevconc)) + geom_bar()
Marriage %>% ggplot(aes(sign)) + geom_bar()
A histogram or boxplot is appropriate for the quantitative variables. Some examples:
Marriage %>% ggplot(aes(age)) + geom_histogram(binwidth=5)
Marriage %>% ggplot(aes(x="age", y=age)) + geom_boxplot()
Marriage %>% ggplot(aes(college)) + geom_histogram(binwidth=1) # years of college are small integers, so use one bin per year
Marriage %>% ggplot(aes(x="college", y=college)) + geom_boxplot()
Marriage %>% ggplot(aes(dayOfBirth)) + geom_histogram(binwidth=30)
Marriage %>% ggplot(aes(x="dayOfBirth", y=dayOfBirth)) + geom_boxplot()
These plots could be side-by-side boxplots or scatterplots. Here are some examples and observations about the relationships from the plots:
Marriage %>% ggplot(aes(race, age)) + geom_boxplot()
From the plot above, we see that the distribution of age is approximately the same for all races. Note that there are very few observations for American Indian and Hispanic people.
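The remark about small groups can be checked by counting the records in each race category. This is a minimal sketch, assuming Marriage is available from the mosaicData package; race_counts is an illustrative name.

```r
library(dplyr)
library(mosaicData)  # provides Marriage; library(mosaic) loads it as well

# Number of records in each race category, largest group first.
race_counts <- Marriage %>% count(race, sort = TRUE)
race_counts
```

Boxplots for categories with only a handful of rows should be read with caution, since the quartiles are then based on very little data.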
Marriage %>% ggplot(aes(person, age)) + geom_boxplot()
The location of the distribution of ages for grooms is higher than for brides, indicating grooms tend to be older when they marry.
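The comparison of locations can be backed up with the numbers that the boxplot draws. The sketch below assumes Marriage comes from mosaicData; age_by_person is an illustrative name, and na.rm = TRUE is included only as a precaution in case any ages are missing.

```r
library(dplyr)
library(mosaicData)  # provides Marriage

# Quartiles of age for brides and for grooms: the values behind each box.
age_by_person <- Marriage %>%
  group_by(person) %>%
  summarise(
    n          = n(),
    q1         = unname(quantile(age, 0.25, na.rm = TRUE)),
    median_age = median(age, na.rm = TRUE),
    q3         = unname(quantile(age, 0.75, na.rm = TRUE))
  )
age_by_person
```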
Marriage %>% ggplot(aes(prevcount, age)) + geom_point()
There is a weak positive relationship between age and the number of previous marriages. The age of marriage increases, on average, with the number of previous marriages.
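One way to quantify the strength of this relationship is the correlation coefficient, together with the mean age at each number of previous marriages. This is a sketch under the assumption that Marriage comes from mosaicData; the names r_prev_age and mean_age_by_prev are illustrative.

```r
library(mosaicData)  # provides Marriage

# Pearson correlation between number of previous marriages and age at marriage.
r_prev_age <- with(Marriage, cor(prevcount, age, use = "complete.obs"))
r_prev_age

# Mean age at marriage for each value of prevcount.
mean_age_by_prev <- tapply(Marriage$age, Marriage$prevcount, mean, na.rm = TRUE)
mean_age_by_prev
```

Because prevcount takes only a few distinct values, the group means are often easier to interpret than the single correlation value.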
Marriage %>% ggplot(aes(college, age)) + geom_point()
The plot shows no clear relationship between age at marriage and the number of years of college education.
Marriage %>% ggplot(aes(dayOfBirth, age)) + geom_point()
There is no systematic relationship between age at marriage and day of birth.
Some example plots with 3 variables:
Marriage %>% ggplot(aes(prevcount, age, color=person)) + geom_point()
Marriage %>% ggplot(aes(prevcount, age)) + geom_point() + facet_wrap(~person)
Marriage %>% ggplot(aes(college, age)) + geom_point() + facet_wrap(~person)
An example plot with 4 variables:
Marriage %>% ggplot(aes(college, age, color=person)) + geom_point() + facet_wrap(~sign)
For this exercise, you will load data from an external source. You can read about the data here: http://sta220.utstat.utoronto.ca/data/the-skeleton-data/.
The data are in a plain text file with spaces between columns here: http://stats.onlinelearning.utoronto.ca/wp-content/uploaded/Data/SkeletonDatacomplete.txt. The following code will load the data into a tibble (the tidyverse version of a data frame).
library(tidyverse)
data_url <- "http://stats.onlinelearning.utoronto.ca/wp-content/uploaded/Data/SkeletonDatacomplete.txt"
skeleton_data <- read_table(data_url)
Inspect the data to make sure it is read in completely. You can compare by going directly to the data_url.
# View(skeleton_data) # useful when you're working but not for the knit document
head(skeleton_data) # see the first few rows
## # A tibble: 6 x 8
## Sex BMIcat BMIquant Age DGestimate DGerror SBestimate SBerror
## <int> <chr> <dbl> <int> <int> <int> <int> <int>
## 1 2 underweight 15.66 78 44 -34 60 -18
## 2 1 normal 23.03 44 32 -12 35 -9
## 3 1 overweight 27.92 72 32 -40 61 -11
## 4 1 overweight 27.83 59 44 -15 61 2
## 5 1 normal 21.41 60 32 -28 46 -14
## 6 1 underweight 13.65 34 25 -9 35 1
glimpse(skeleton_data) # see its structure
## Observations: 400
## Variables: 8
## $ Sex <int> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, ...
## $ BMIcat <chr> "underweight", "normal", "overweight", "overweight"...
## $ BMIquant <dbl> 15.66, 23.03, 27.92, 27.83, 21.41, 13.65, 25.86, 14...
## $ Age <int> 78, 44, 72, 59, 60, 34, 50, 73, 70, 60, 58, 61, 52,...
## $ DGestimate <int> 44, 32, 32, 44, 32, 25, 32, 50, 39, 44, 32, 32, 44,...
## $ DGerror <int> -34, -12, -40, -15, -28, -9, -18, -23, -31, -16, -2...
## $ SBestimate <int> 60, 35, 61, 61, 46, 35, 35, 61, 46, 46, 35, 61, 48,...
## $ SBerror <int> -18, -9, -11, 2, -14, 1, -15, -12, -24, -14, -23, 0...
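Besides looking at head() and glimpse(), the completeness of the import can be checked programmatically. The sketch below defines a hypothetical helper, check_read(), which is not part of readr or the tidyverse. To stay self-contained it uses two rows copied from the output above and base R's read.table() in place of downloading the full file with read_table().

```r
# Hypothetical helper: summarise how completely a data frame was read in.
check_read <- function(df, expected_cols) {
  list(
    n_rows       = nrow(df),
    missing_cols = setdiff(expected_cols, names(df)),  # expected but absent
    na_per_col   = colSums(is.na(df))                  # parsing gaps show up as NA
  )
}

# Two rows in the same layout as the skeleton file stand in for the download.
sample_text <- "Sex BMIcat BMIquant Age DGestimate DGerror SBestimate SBerror
2 underweight 15.66 78 44 -34 60 -18
1 normal 23.03 44 32 -12 35 -9"
sample_data <- read.table(text = sample_text, header = TRUE)

expected <- c("Sex", "BMIcat", "BMIquant", "Age",
              "DGestimate", "DGerror", "SBestimate", "SBerror")
result <- check_read(sample_data, expected)
result
```

Calling check_read(skeleton_data, expected) on the full tibble should report 400 rows if the file was read in completely, matching the glimpse() output.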
Example graphs of one categorical variable:
skeleton_data %>% ggplot(aes(BMIcat)) + geom_bar()
skeleton_data %>% ggplot(aes(factor(Sex))) + geom_bar() # Sex is stored as an integer code, so convert it to a factor
Example graphs of one quantitative variable:
skeleton_data %>% ggplot(aes(BMIquant)) + geom_histogram(binwidth=2)
skeleton_data %>% ggplot(aes(Age)) + geom_histogram(binwidth=5)
skeleton_data %>% ggplot(aes(DGerror)) + geom_histogram(binwidth=5)
skeleton_data %>% ggplot(aes(SBerror)) + geom_histogram(binwidth=5)
skeleton_data %>% ggplot(aes("BMIquant", BMIquant)) + geom_boxplot()
skeleton_data %>% ggplot(aes("Age", Age)) + geom_boxplot()
skeleton_data %>% ggplot(aes("DGerror", DGerror)) + geom_boxplot()
skeleton_data %>% ggplot(aes("SBerror", SBerror)) + geom_boxplot()
Example graphs with two variables:
skeleton_data %>% ggplot(aes(factor(Sex), DGerror)) + geom_boxplot()
skeleton_data %>% ggplot(aes(BMIcat, DGerror)) + geom_boxplot()
skeleton_data %>% ggplot(aes(factor(Sex), SBerror)) + geom_boxplot()
skeleton_data %>% ggplot(aes(BMIcat, SBerror)) + geom_boxplot()
skeleton_data %>% ggplot(aes(Age, DGerror)) + geom_point()
skeleton_data %>% ggplot(aes(Age, SBerror)) + geom_point()
skeleton_data %>% ggplot(aes(BMIquant, DGerror)) + geom_point() + geom_smooth()
skeleton_data %>% ggplot(aes(BMIquant, SBerror)) + geom_point() + geom_smooth()
skeleton_data %>% ggplot(aes(DGerror, SBerror)) + geom_point()
Example graphs with three variables:
skeleton_data %>% ggplot(aes(Age, DGerror)) + geom_point() + facet_wrap(~Sex)
skeleton_data %>% ggplot(aes(Age, SBerror)) + geom_point() + facet_wrap(~Sex)
skeleton_data %>% ggplot(aes(Age, DGerror)) + geom_point() + facet_wrap(~BMIcat)
skeleton_data %>% ggplot(aes(Age, SBerror)) + geom_point() + facet_wrap(~BMIcat)
Some observations that can be made from the plots: