Your answer to questions 1c, 2f, and 3d
file_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/week8/Week8PracticeProblems-student1.Rmd"
download.file(url = file_url , destfile = "Week8PracticeProblems-student.Rmd")
Look for the file “Week8PracticeProblems-student.Rmd” under the Files tab then click on it to open.
Change the subtitle to “Week 8 Practice Problems Solutions” and change the author to your name and student number.
Type your answers below each question. Remember that R code chunks can be inserted directly into the notebook by choosing Insert R from the Insert menu (see Using R Markdown for Class Assignments). In addition, the R Markdown cheat sheet and reference guide are great resources as you get started with R Markdown.
Tutorial grades will be assigned according to the following marking scheme.
 | Mark |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completion | 1 |
In-class exercises | 4 |
Total | 6 |
In this question you will derive the least squares estimators of the intercept \(\beta_0\) and slope \(\beta_1\) in the simple linear regression model.
A simple linear regression model is:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i,\]
\(i=1,\ldots,n\) where \(n\) is the number of observations.
\(y_i\) is the \(i^{th}\) movie’s average IMDB rating. \(\beta_0\) is the intercept term in the model. \(\beta_1\) is the slope term in the model. \(x_i\) is the \(i^{th}\) movie’s budget. \(\epsilon_i\) is the \(i^{th}\) error term.
\[L(\beta_0,\beta_1) = \sum_{i=1}^n \left(y_i -\beta_0 - \beta_1 x_i\right)^2.\] So,
\[ \begin{aligned} \frac{\partial L}{\partial \beta_0} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_i) = 0, \\ \frac{\partial L}{\partial \beta_1} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_i)x_{i} =0. \end{aligned} \]
\[ \begin{aligned} \hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x} \\ \hat{\beta_1} &= \frac{(\sum_{i=1}^n y_ix_i) - n \bar{x}\bar{y}}{(\sum_{i=1}^n x_i^2) - n\bar{x}^2}. \end{aligned} \]
In the equations \(\frac{\partial L}{\partial \beta_0}=0, \frac{\partial L}{\partial \beta_1}=0\), replace \(\beta_0,\beta_1\) by \(\hat \beta_0,\hat \beta_1\). This symbolically distinguishes that we are solving for the values of \(\beta_0,\beta_1\) that satisfy the equations. We will make use of the fact that \(\sum_{i=1}^n x_i/n = \bar x \Rightarrow \sum_{i=1}^n x_i = n\bar x.\)
\[\begin{aligned} \frac{\partial L}{\partial \beta_0}=0 & \Rightarrow \sum_{i=1}^n y_i -n\hat \beta_0 - \sum_{i=1}^n \hat \beta_1 x_i = 0 \\ & \Rightarrow \sum_{i=1}^n y_i - \sum_{i=1}^n \hat \beta_1 x_i = n\hat \beta_0 \\ & \Rightarrow n \bar y - \hat \beta_1 n \bar x = n\hat \beta_0 \\ & \Rightarrow \bar y - \hat \beta_1 \bar x = \hat \beta_0. \end{aligned}\]
\[\begin{aligned} \frac{\partial L}{\partial \beta_1}=0 & \Rightarrow \sum_{i=1}^n y_i x_i - \hat \beta_0 \sum_{i=1}^n x_i - \hat \beta_1 \sum_{i=1}^nx_i^2 =0 \\ & \Rightarrow \sum_{i=1}^n y_i x_i -(\bar y - \hat \beta_1 \bar x) \sum_{i=1}^n x_i - \hat \beta_1 \sum_{i=1}^nx_i^2 =0 \\ & \Rightarrow \sum_{i=1}^n y_i x_i -\bar y \sum_{i=1}^n x_i + \hat \beta_1 \bar x \sum_{i=1}^n x_i - \hat \beta_1 \sum_{i=1}^nx_i^2 =0 \\ & \Rightarrow \sum_{i=1}^n y_i x_i -n\bar y \bar x + \hat \beta_1 n {\bar x}^2 - \hat \beta_1 \sum_{i=1}^nx_i^2 =0, \mbox{ } \text{since }\sum_{i=1}^n x_i = n\bar x \\ & \Rightarrow \sum_{i=1}^n y_i x_i -n\bar y \bar x + \hat \beta_1 (n {\bar x}^2 - \sum_{i=1}^nx_i^2) =0 \\ & \Rightarrow \sum_{i=1}^n y_i x_i -n\bar y \bar x = \hat \beta_1 ( \sum_{i=1}^nx_i^2 - n {\bar x}^2 ) \\ & \Rightarrow \frac{\sum_{i=1}^n y_i x_i -n\bar y \bar x}{( \sum_{i=1}^nx_i^2 - n {\bar x}^2 )} = \hat \beta_1. \end{aligned}\]
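As a quick sanity check, the closed-form estimators derived above can be compared against R's built-in lm() on a small made-up data set (the vectors x and y below are purely illustrative, not from the movies data):

```r
# Small illustrative data set (not from the movies data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
n <- length(x)

# Closed-form least squares estimates from the derivation above
beta1_hat <- (sum(y * x) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# lm() should return the same estimates (up to floating point error)
fit <- lm(y ~ x)
c(beta0_hat, beta1_hat)
coef(fit)
```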
The internet movie database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource. The data is packaged in the R library ggplot2movies in the data frame movies. The help documentation for this data (?ggplot2movies::movies) describes each of the variables.
In this question you will use linear regression to build a model that predicts IMDB user ratings based on covariates in the movies data. More information about IMDB user ratings is available here.
library(ggplot2movies)
library(tidyverse)
glimpse(movies)
## Observations: 58,788
## Variables: 24
## $ title <chr> "$", "$1000 a Touchdown", "$21 a Day Once a Month"...
## $ year <int> 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 19...
## $ length <int> 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10...
## $ budget <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ rating <dbl> 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, ...
## $ votes <int> 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44...
## $ r1 <dbl> 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5...
## $ r2 <dbl> 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0,...
## $ r3 <dbl> 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, ...
## $ r4 <dbl> 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4....
## $ r5 <dbl> 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, ...
## $ r6 <dbl> 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0,...
## $ r7 <dbl> 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, ...
## $ r8 <dbl> 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4...
## $ r9 <dbl> 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4....
## $ r10 <dbl> 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24....
## $ mpaa <chr> "", "", "", "", "", "", "R", "", "", "", "", "", "...
## $ Action <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Animation <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Comedy <int> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,...
## $ Drama <int> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ Documentary <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Romance <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Short <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,...
movies %>% ggplot(aes(rating)) + geom_histogram(colour = "grey") + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
movies %>% ggplot(aes(budget)) + geom_histogram(colour = "grey") + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 53573 rows containing non-finite values (stat_bin).
movies %>% filter(!is.na(budget))
## # A tibble: 5,215 x 24
## title year length budget rating votes r1
## <chr> <int> <int> <int> <dbl> <int> <dbl>
## 1 'G' Men 1935 85 450000 7.2 281 0.0
## 2 'Manos' the Hands of Fate 1966 74 19000 1.6 7996 74.5
## 3 'Til There Was You 1997 113 23000000 4.8 799 4.5
## 4 .com for Murder 2002 96 5000000 3.7 271 64.5
## 5 10 Things I Hate About You 1999 97 16000000 6.7 19095 4.5
## 6 100 Mile Rule 2002 98 1100000 5.6 181 4.5
## 7 100 Proof 1997 94 140000 3.3 19 14.5
## 8 101 1989 117 200000 7.8 299 4.5
## 9 101-vy kilometer 2001 103 200000 5.8 7 0.0
## 10 102 Dalmatians 2000 100 85000000 4.7 1987 4.5
## # ... with 5,205 more rows, and 17 more variables: r2 <dbl>, r3 <dbl>,
## # r4 <dbl>, r5 <dbl>, r6 <dbl>, r7 <dbl>, r8 <dbl>, r9 <dbl>, r10 <dbl>,
## # mpaa <chr>, Action <int>, Animation <int>, Comedy <int>, Drama <int>,
## # Documentary <int>, Romance <int>, Short <int>
mod1 <- lm(rating ~ budget + votes + length + year + Action + Animation + Comedy + Drama + Documentary + Romance + Short, data = movies)
summary(mod1)
##
## Call:
## lm(formula = rating ~ budget + votes + length + year + Action +
## Animation + Comedy + Drama + Documentary + Romance + Short,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6618 -0.7247 0.1209 0.8321 4.7677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.089e+01 1.819e+00 11.485 < 2e-16 ***
## budget -7.348e-09 9.989e-10 -7.356 2.19e-13 ***
## votes 4.135e-05 1.803e-06 22.941 < 2e-16 ***
## length 1.163e-02 8.779e-04 13.253 < 2e-16 ***
## year -8.367e-03 9.170e-04 -9.124 < 2e-16 ***
## Action -2.380e-01 5.423e-02 -4.388 1.17e-05 ***
## Animation 8.759e-01 1.226e-01 7.147 1.01e-12 ***
## Comedy 2.427e-01 4.212e-02 5.762 8.80e-09 ***
## Drama 6.693e-01 4.076e-02 16.419 < 2e-16 ***
## Documentary 1.292e+00 1.219e-01 10.598 < 2e-16 ***
## Romance 1.265e-01 5.396e-02 2.344 0.0191 *
## Short 2.561e+00 9.545e-02 26.831 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.321 on 5203 degrees of freedom
## (53573 observations deleted due to missingness)
## Multiple R-squared: 0.2725, Adjusted R-squared: 0.2709
## F-statistic: 177.2 on 11 and 5203 DF, p-value: < 2.2e-16
A histogram or boxplot is appropriate since rating is a continuous variable.
library(ggplot2movies)
library(tidyverse)
movies %>% ggplot(aes(rating)) + geom_histogram(colour = "grey") + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
movies %>% ggplot(aes(x = "", y = rating)) + geom_boxplot()
movies %>% summarize(n = n(), mean_rating = mean(rating), median_rating = median(rating))
## # A tibble: 1 x 3
## n mean_rating median_rating
## <int> <dbl> <dbl>
## 1 58788 5.93285 6.1
Use an appropriate plot to describe the distribution of budget. Create a scatterplot of budget and rating. What is the relationship between budget and rating?

The relationship is non-linear. The scatterplot shows a lot of variation in ratings for both small and large budgets.
library(ggplot2movies)
library(tidyverse)
movies %>% ggplot(aes(budget)) + geom_histogram(colour = "grey") + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 53573 rows containing non-finite values (stat_bin).
movies %>% ggplot(aes(x = "", y = budget)) + geom_boxplot()
## Warning: Removed 53573 rows containing non-finite values (stat_boxplot).
movies %>% filter(!is.na(budget)) %>% summarize(n = n(), mean_budget = mean(budget), median_budget = median(budget))
## # A tibble: 1 x 3
##       n mean_budget median_budget
##   <int>       <dbl>         <int>
## 1  5215    13412513       3000000
movies %>% ggplot(aes(budget, rating)) + geom_point() + theme_minimal()
## Warning: Removed 53573 rows containing missing values (geom_point).
Fit a simple linear regression of rating on budget. Interpret the regression coefficients and the coefficient of determination.

library(ggplot2movies)
library(tidyverse)
mod1 <- lm(rating ~ budget, data = movies)
mod1_summ <- summary(mod1)
mod1_summ$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.153335e+00 2.470552e-02 249.067167 0.0000000
## budget -9.427219e-10 9.175291e-10 -1.027457 0.3042529
mod1_summ$r.squared
## [1] 0.0002024659
The estimate of the intercept is \(\hat \beta_0 =\) 6.15. This means that when a movie's budget is 0, the predicted average user rating is 6.15. The intercept in this case is not very informative since this interpretation does not make practical sense.
The estimate of the slope is \(\hat \beta_1 \approx\) 0 (\(-9.43 \times 10^{-10}\)). This means that for a one-dollar increase in budget the predicted average user rating changes by essentially zero. This is not surprising since the scatterplot in part (b) doesn't show a linear relationship between rating and budget.
The coefficient of determination is \(R^2 \approx\) 0 (0.0002). This indicates that the model explains essentially none of the variation in ratings, so there is a very poor match between the observed ratings and the ratings predicted by the linear regression model.
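As an aside, for a simple linear regression \(R^2\) is just the squared sample correlation between the predictor and the response, which can be verified directly; a minimal check using R's built-in cars data (not the movies data):

```r
# For simple linear regression, R^2 equals the squared correlation
# between the predictor and the response; checked on built-in data.
fit <- lm(dist ~ speed, data = cars)
r2_from_model <- summary(fit)$r.squared
r2_from_cor <- cor(cars$speed, cars$dist)^2
all.equal(r2_from_model, r2_from_cor)  # TRUE
```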
What does the fitted regression line tell you about the relationship between rating and budget?

movies %>% ggplot(aes(budget, rating)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + theme_minimal()
## Warning: Removed 53573 rows containing non-finite values (stat_smooth).
## Warning: Removed 53573 rows containing missing values (geom_point).
The line is horizontal and doesn’t capture the relationship.
A linear regression model is inappropriate to predict IMDB ratings using a film’s budget. The scatterplot does not show a linear relationship between the two variables, and the estimated regression coefficients and coefficient of determination reflect the lack of linearity.
Housing data for 506 census tracts of Boston from the 1970 census. The data frame BostonHousing2 contains the corrected version of the original data of Harrison and Rubinfeld (1979).

In this question you will build a linear regression model to predict medv, the median value of owner-occupied homes in USD 1000's.
library(mlbench)
data("BostonHousing2")
glimpse(BostonHousing2)
## Observations: 506
## Variables: 19
## $ town <fctr> Nahant, Swampscott, Swampscott, Marblehead, Marblehea...
## $ tract <int> 2011, 2021, 2022, 2031, 2032, 2033, 2041, 2042, 2043, ...
## $ lon <dbl> -70.9550, -70.9500, -70.9360, -70.9280, -70.9220, -70....
## $ lat <dbl> 42.2550, 42.2875, 42.2830, 42.2930, 42.2980, 42.3040, ...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
## $ cmedv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 22.1, 16.5, ...
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <fctr> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <int> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
Create a scatterplot of the median value of owner-occupied homes medv and the percentage of lower-status population in the census tract lstat. Describe the relationship.

The relationship is negative and roughly linear: as the percentage of lower-status population increases, the median home value decreases.
BostonHousing2 %>% ggplot(aes(lstat, medv)) + geom_point()
Write the simple linear regression model of medv on lstat. Describe each of the variables in the model.

\(y_i\) is the \(i^{th}\) census tract's median home value. \(\beta_0\) is the intercept term in the model. \(\beta_1\) is the slope term in the model. \(x_i\) is the \(i^{th}\) census tract's percentage of lower-status population. \(\epsilon_i\) is the \(i^{th}\) error term.
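With these definitions, the model takes the same form as the simple linear regression model in Question 1:

\[\text{medv}_i = \beta_0 + \beta_1\, \text{lstat}_i + \epsilon_i, \quad i = 1, \ldots, 506.\]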
Use 80% of the data to select a training set, and leave 20% of the data for testing. Fit a linear regression model of medv
on lstat
on the training set.
Calculate RMSE on the training and test data using the linear regression model you fit on the training set. Is there any evidence of overfitting?
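The RMSE computed below is the root mean squared error of the predictions,

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(y_i - \hat y_i\right)^2},\]

where \(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\) and \(n\) is the number of observations in the set (training or test) being evaluated.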
library(mlbench)
data("BostonHousing2")
set.seed(10)
n <- nrow(BostonHousing2)
test_idx <- sample.int(n, size = round(0.2 * n))
BH_train <- BostonHousing2[-test_idx, ]
n_train <- nrow(BH_train)
n_train
## [1] 405
BH_test <- BostonHousing2[test_idx,]
n_test <- nrow(BH_test)
n_test
## [1] 101
train_mod <- lm(medv ~ lstat, data = BH_train)
summ_train_mod <- summary(train_mod)
sr_rsq <- summ_train_mod$r.squared
sr_rsq
## [1] 0.5583837
yhat_test <- predict(train_mod, newdata = BH_test)
y_test <- BH_test$medv
sqrt(sum((y_test - yhat_test)^2) / n_test)
## [1] 6.805511
yhat_train <- predict(train_mod, newdata = BH_train)
y_train <- BH_train$medv
sr_rmse <- sqrt(sum((y_train - yhat_train)^2) / n_train)
sr_rmse
## [1] 6.048581
There is no evidence of overfitting since the RMSE on the training and test sets are very close. These values depend on which observations were included in the training and test sets, so if set.seed() were set to a different value the RMSEs from the training and test sets would differ, although on average they would probably be similar.
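To illustrate this point, the split can be repeated over a few arbitrary seeds and the resulting test RMSEs compared (a sketch, assuming the mlbench package is installed; the seed values are arbitrary choices):

```r
library(mlbench)
data("BostonHousing2")
n <- nrow(BostonHousing2)

# Test RMSE of the simple regression for a given random 80/20 split
test_rmse_for_seed <- function(seed) {
  set.seed(seed)
  test_idx <- sample.int(n, size = round(0.2 * n))
  fit <- lm(medv ~ lstat, data = BostonHousing2[-test_idx, ])
  yhat <- predict(fit, newdata = BostonHousing2[test_idx, ])
  sqrt(mean((BostonHousing2$medv[test_idx] - yhat)^2))
}

# Different seeds give similar, but not identical, test RMSEs
sapply(c(10, 20, 30), test_rmse_for_seed)
```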
\(R^2=\) 0.5583837. This shows a modest amount of agreement between the observed and predicted values.
Fit a multiple linear regression model to predict medv where the following covariates (inputs) are used in addition to lstat: crim, zn, rm, nox, age, tax, ptratio. Does this model provide more accurate predictions compared to the model that you fit in part (c)? Explain.

library(mlbench)
data("BostonHousing2")
set.seed(10)
n <- nrow(BostonHousing2)
test_idx <- sample.int(n, size = round(0.2 * n))
BH_train <- BostonHousing2[-test_idx, ]
n_train <- nrow(BH_train)
n_train
## [1] 405
BH_test <- BostonHousing2[test_idx,]
n_test <- nrow(BH_test)
n_test
## [1] 101
train_mod <- lm(medv ~ lstat + crim + zn + rm + nox + age + tax + ptratio, data = BH_train)
summ_train_mod <- summary(train_mod)
mr_rsq <- summ_train_mod$r.squared
yhat_test <- predict(train_mod, newdata = BH_test)
y_test <- BH_test$medv
sqrt(sum((y_test - yhat_test)^2) / n_test)
## [1] 5.833651
yhat_train <- predict(train_mod, newdata = BH_train)
y_train <- BH_train$medv
mr_rmse <- sqrt(sum((y_train - yhat_train)^2) / n_train)
mr_rmse
## [1] 4.971456
The training RMSE has decreased from 6.0485807 to 4.9714564 (and the test RMSE from 6.805511 to 5.833651), and \(R^2\) has increased from 0.5583837 to 0.7016642. Therefore the multiple linear regression model has better prediction accuracy than the simple linear regression model.