Your answer to questions 1c, 2e, and 3d
file_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/week9/Week9PracticeProblems-student.Rmd"
download.file(url = file_url , destfile = "Week9PracticeProblems-student.Rmd")
Look for the file “Week9PracticeProblems-student.Rmd” under the Files tab, then click on it to open.
Change the subtitle to “Week 9 Practice Problems Solutions” and change the author to your name and student number.
Type your answers below each question. Remember that R code chunks can be inserted directly into the notebook by choosing Insert R from the Insert menu (see Using R Markdown for Class Assignments). In addition, the R Markdown cheatsheet and reference guide are great resources as you get started with R Markdown.
Tutorial grades will be assigned according to the following marking scheme.
Component | Mark |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completion | 1 |
In-class exercises | 4 |
Total | 6 |
In this question you will derive the least squares estimators of the intercept \(\beta_0\) and slope \(\beta_1\) in the simple linear regression model.
(a) Write down the simple linear regression model. Explain each term in the model.
(b) Calculate \(\frac{\partial L}{\partial \beta_0}\) and \(\frac{\partial L}{\partial \beta_1}\), where \(L(\beta_0,\beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\) is the sum of squared errors.
(c) Use your calculations in (b) to show that
\[ \begin{aligned} \hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x}, \\ \hat{\beta_1} &= \frac{(\sum_{i=1}^n y_ix_i) - n \bar{x}\bar{y}}{(\sum_{i=1}^n x_i^2) - n\bar{x}^2}. \end{aligned} \]
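A quick numerical check can make the formulas above more concrete. The sketch below is not part of the original assignment: it simulates a small data set (the true intercept, slope, noise level, and seed are arbitrary assumptions) and compares the closed-form estimates with the coefficients returned by `lm()`.

```r
# Simulated data; the true intercept (2) and slope (0.5) are arbitrary choices.
set.seed(130)
n <- 50
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

# Closed-form least squares estimates, following the formulas above.
x_bar <- mean(x)
y_bar <- mean(y)
beta1_hat <- (sum(y * x) - n * x_bar * y_bar) / (sum(x^2) - n * x_bar^2)
beta0_hat <- y_bar - beta1_hat * x_bar

# Estimates from R's built-in least squares fit.
fit <- lm(y ~ x)

# The two sets of estimates should agree up to floating-point error.
all.equal(unname(coef(fit)), c(beta0_hat, beta1_hat))
```

Agreement here does not replace the derivation asked for in (c); it only confirms that the stated formulas match what `lm()` computes on one example.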
The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource. The data is packaged in the R library `ggplot2movies` in the data frame `movies`. The help documentation for this data (`?ggplot2movies::movies`) describes each of the variables.
In this question you will use linear regression to build a model that predicts IMDB user ratings based on covariates in the `movies` data. More information about IMDB user ratings is available here.
library(ggplot2movies)
library(tidyverse)
glimpse(movies)
## Observations: 58,788
## Variables: 24
## $ title <chr> "$", "$1000 a Touchdown", "$21 a Day Once a Month"...
## $ year <int> 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 19...
## $ length <int> 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10...
## $ budget <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ rating <dbl> 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, ...
## $ votes <int> 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44...
## $ r1 <dbl> 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5...
## $ r2 <dbl> 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0,...
## $ r3 <dbl> 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, ...
## $ r4 <dbl> 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4....
## $ r5 <dbl> 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, ...
## $ r6 <dbl> 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0,...
## $ r7 <dbl> 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, ...
## $ r8 <dbl> 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4...
## $ r9 <dbl> 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4....
## $ r10 <dbl> 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24....
## $ mpaa <chr> "", "", "", "", "", "", "R", "", "", "", "", "", "...
## $ Action <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Animation <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Comedy <int> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,...
## $ Drama <int> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ Documentary <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Romance <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Short <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,...
(a) Create at least one appropriate graphical summary of the distribution of user ratings. Briefly explain why you chose this summary and describe the shape of the distribution. How many user ratings are in the data set? Calculate the mean and median ratings.
(b) Repeat part (a), but use the variable `budget`. Create a scatterplot of `budget` and `rating`. What is the relationship between `budget` and `rating`?
(c) Use R to fit a linear regression model of `rating` on `budget`. Interpret the regression coefficients and the coefficient of determination.
(d) Add the estimated linear regression line that you calculated in (c) to the scatterplot you generated in (b). Does the linear regression line capture the relationship between `rating` and `budget`?
(e) Does a simple linear regression model seem appropriate to predict IMDB ratings using a film’s budget? Explain.
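The fitting and plotting steps in this question could be sketched as follows. This is one possible approach, not the official solution: the object names `movies_budget` and `fit_budget` are made up for illustration, and dropping rows with a missing `budget` before fitting is an assumption about how to handle the `NA` values visible in the `glimpse()` output.

```r
library(ggplot2movies)  # provides the movies data frame
library(ggplot2)

# Keep only films with a recorded budget (an assumed way to handle NAs).
movies_budget <- subset(movies, !is.na(budget))

# Simple linear regression of rating on budget.
fit_budget <- lm(rating ~ budget, data = movies_budget)
coef(fit_budget)               # estimated intercept and slope
summary(fit_budget)$r.squared  # coefficient of determination

# Scatterplot of budget and rating with the fitted least squares line overlaid.
ggplot(movies_budget, aes(x = budget, y = rating)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Because `budget` is measured in dollars, the slope is the estimated change in rating per additional dollar and will be numerically tiny; rescaling the budget (for example, to millions of dollars) can make the coefficient easier to interpret.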
This question uses housing data for 506 census tracts of Boston from the 1970 census. The data frame `BostonHousing2` contains the corrected version of the original data by Harrison and Rubinfeld (1979).
In this question you will build a linear regression model to predict the median value of owner-occupied homes in USD 1000’s (`medv`).
library(mlbench)
data("BostonHousing2")
glimpse(BostonHousing2)
## Observations: 506
## Variables: 19
## $ town <fct> Nahant, Swampscott, Swampscott, Marblehead, Marblehead...
## $ tract <int> 2011, 2021, 2022, 2031, 2032, 2033, 2041, 2042, 2043, ...
## $ lon <dbl> -70.9550, -70.9500, -70.9360, -70.9280, -70.9220, -70....
## $ lat <dbl> 42.2550, 42.2875, 42.2830, 42.2930, 42.2980, 42.3040, ...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
## $ cmedv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 22.1, 16.5, ...
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <int> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
(a) Create a scatterplot of the median value of homes (`medv`) and the percentage of lower status of the population that lives in the census tract (`lstat`). Describe the relationship.
(b) Write out the mathematical description of a simple linear regression model of `medv` on `lstat`. Describe each of the variables in the model.
(c) Randomly select 80% of the data as a training set, and leave the remaining 20% of the data for testing. Fit a linear regression model of `medv` on `lstat` on the training set.
(d) Calculate RMSE on the training and test data using the linear regression model you fit on the training set. Is there any evidence of overfitting?
(e) Calculate the coefficient of determination.
(f) Use the training and test sets to build a multiple regression model to predict `medv`, where the following covariates (inputs) are used in addition to `lstat`: `crim`, `zn`, `rm`, `nox`, `age`, `tax`, `ptratio`. Does this model provide more accurate predictions compared to the model that you fit in part (c)? Explain.