Instructions

What should I bring to tutorial on November 23?

  • R output (e.g., plots and explanations) for Questions 1d, 2e, 2f, 3c, and 3d. You can either bring a hardcopy or bring your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

                                     Mark
Attendance for the entire tutorial      1
Assigned homework completion            1
In-class exercises                      4
Total                                   6

Practice Problems

Question 1

In this question, you will derive the least squares estimators of the intercept \(\beta_0\) and two regression coefficients \(\beta_1\) and \(\beta_2\) for a linear regression model with two covariates, \(x_1\) and \(x_2\).

  a. Write down the linear regression model. Explain each term in the model.

The linear regression model described in this question is \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \] where \(i=1,2,\dots,n\) and \(n\) is the number of observations. Here \(y_i\) is the response for the \(i^{th}\) observation, \(x_{i1}\) and \(x_{i2}\) are the values of the two covariates for that observation, \(\beta_0\) is the intercept, \(\beta_1\) and \(\beta_2\) are the regression coefficients, and \(\epsilon_i\) is the random error term.

  b. Write an expression for the sum of squared errors, \(L(\beta_0, \beta_1, \beta_2)\).

\[ L(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2\]

  c. Calculate the partial derivatives of \(L\) with respect to \(\beta_0\), \(\beta_1\), and \(\beta_2\).

\[ \begin{aligned} \frac{\partial L}{\partial \beta_0} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}), \\ \frac{\partial L}{\partial \beta_1} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})x_{i1}, \\ \frac{\partial L}{\partial \beta_2} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})x_{i2} \end{aligned} \]

  d. Write one sentence explaining how you could find the least squares estimates \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\) (you do not need to find the estimates).

I would find the least squares estimates by setting the partial derivatives from (c) equal to zero and solving the resulting system of three equations.
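As an optional sanity check, setting the three partial derivatives to zero yields the normal equations \(X^\top X\beta = X^\top y\), which can be solved numerically. The sketch below uses simulated data (the variable names and simulated coefficient values are illustrative, not part of the question) and compares the solution of that system with the estimates returned by lm().

# simulate a small dataset with two covariates (illustrative values only)
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 - 1 * x2 + rnorm(n)

# setting the partial derivatives to zero gives X'X beta = X'y;
# solve that 3x3 system directly
X <- cbind(1, x1, x2)
solve(crossprod(X), crossprod(X, y))

# lm() solves the same least squares problem
coef(lm(y ~ x1 + x2))

The two sets of estimates should agree, since lm() is solving exactly the system of equations described above.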

Question 2

In this question, you’ll use the mtcars dataset (available in the datasets package), which consists of 11 variables measured on 32 car models from the 1974 Motor Trend US magazine. Your goal in this question is to investigate the effect of various factors on gas mileage (mpg).

  a. Produce a scatterplot of gas mileage (mpg) vs horsepower (hp). Write one sentence commenting on the association between these two variables.
library(tidyverse)  # provides ggplot2 and the %>% pipe
mtcars %>% ggplot(aes(x = hp, y = mpg)) + geom_point()

There appears to be a negative association between horsepower and gas mileage: cars with higher horsepower can travel fewer miles per gallon of gas used.

  b. Produce a scatterplot of gas mileage (mpg) vs the rear axle ratio (drat). Write one sentence commenting on the association between these two variables.
mtcars %>% ggplot(aes(x=drat, y=mpg)) + geom_point()

There appears to be a positive association between rear axle ratio and gas mileage: cars with a higher rear axle ratio can travel a longer distance per gallon of gas used.

  c. Based on the plots in (a) and (b), which factor do you think is more useful for predicting mpg? Explain in 1-2 sentences.

From the scatterplots in (a) and (b), we see that horsepower and rear axle ratio are both associated with gas mileage. However, the association between horsepower and gas mileage appears to be stronger than the association between rear axle ratio and gas mileage, since the points in (a) are more tightly concentrated around a straight line than those in (b).
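One way to back up this visual impression (not required by the question) is to compute the sample correlations between mpg and each covariate:

cor(mtcars$mpg, mtcars$hp)    # negative and fairly strong
cor(mtcars$mpg, mtcars$drat)  # positive but somewhat weaker

The correlation between mpg and hp is larger in absolute value than the correlation between mpg and drat, which agrees with what the scatterplots suggest.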

  d. Fit two simple linear regression models to predict gas mileage: one with horsepower as a predictor and the other with rear axle ratio as a predictor. Compare the coefficient of determination in these two models and interpret these values in 1-2 sentences.
lm_hp <- lm(mpg ~ hp, data = mtcars)
summary(lm_hp)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
lm_drat <- lm(mpg ~ drat, data = mtcars)
summary(lm_drat)
## 
## Call:
## lm(formula = mpg ~ drat, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0775 -2.6803 -0.2095  2.2976  9.0225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.525      5.477  -1.374     0.18    
## drat           7.678      1.507   5.096 1.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.485 on 30 degrees of freedom
## Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
## F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

When we fit a simple linear regression model for gas mileage with only horsepower as a predictor, we get \(R^2 = 0.6024\), while the model with only rear axle ratio as a predictor has \(R^2 = 0.464\). In other words, regressing on horsepower explains 60% of the variation in gas mileage, but regressing on rear axle ratio only explains 46% of this variation.
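Rather than reading these values off the printed summaries, the \(R^2\) values can also be extracted directly from the fitted model objects (using lm_hp and lm_drat from above):

summary(lm_hp)$r.squared    # about 0.60
summary(lm_drat)$r.squared  # about 0.46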

  e. Fit a linear regression model with both horsepower and rear axle ratio as predictors for gas mileage. Does this model explain more of the variability in gas mileage than the models from (d)?
lm_mult <- lm(mpg ~ hp + drat, data = mtcars)
summary(lm_mult)
## 
## Call:
## lm(formula = mpg ~ hp + drat, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0369 -2.3487 -0.6034  1.1897  7.7500 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.789861   5.077752   2.125 0.042238 *  
## hp          -0.051787   0.009293  -5.573 5.17e-06 ***
## drat         4.698158   1.191633   3.943 0.000467 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.17 on 29 degrees of freedom
## Multiple R-squared:  0.7412, Adjusted R-squared:  0.7233 
## F-statistic: 41.52 on 2 and 29 DF,  p-value: 3.081e-09

The linear regression model for gas mileage with two predictors (horsepower and rear axle ratio) has a coefficient of determination \(R^2=0.7412\), which means that it explains 74% of the variability in gas mileage. This is a large improvement over both models fit in part (d), which had \(R^2\) values of 0.6024 and 0.464 respectively.

  f. Based on the model you fit in part (e), what would be the predicted gas mileage for a car with a horsepower of 25 and a rear axle ratio of 6? Do you think this prediction is reliable? Write 1-2 sentences explaining why or why not. Hint: look at the range of values for mpg, hp and drat in the dataset.

In part (e), we found that \(\hat\beta_0 = 10.79\), \(\hat\beta_{hp} = -0.05\), and \(\hat\beta_{drat} = 4.70\). Thus, the predicted value for a car with \(x_{hp}=25\) and \(x_{drat}=6\) would be \[ \hat y = 10.79 - 0.05 \times 25 + 4.70 \times 6 = 37.74. \] The values of horsepower and rear axle ratio that we are asked to consider in this problem are outside the range of values in our dataset: the values of hp in mtcars range from 52 to 335, and the values of drat range from 2.76 to 4.93. Since we don’t know whether the linear regression model is appropriate outside of this range, we should treat this predicted value with caution, as it may not be a reliable prediction.
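The same prediction can be obtained with predict() applied to the lm_mult object from part (e); the data frame of new covariate values below just describes the hypothetical car in the question. Note that predict() uses the unrounded coefficient estimates, so its value will differ slightly from the hand calculation with rounded coefficients.

new_car <- data.frame(hp = 25, drat = 6)
predict(lm_mult, newdata = new_car)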

Question 3

This question uses housing data for 506 census tracts of Boston from the 1970 census. The dataframe BostonHousing2 contains a corrected version of the original data from Harrison and Rubinfeld (1979).

In this question you will build a linear regression model to predict medv, the median value of owner-occupied homes in USD 1000s.

library(mlbench)
data("BostonHousing2")
glimpse(BostonHousing2)
## Observations: 506
## Variables: 19
## $ town    <fct> Nahant, Swampscott, Swampscott, Marblehead, Marblehead...
## $ tract   <int> 2011, 2021, 2022, 2031, 2032, 2033, 2041, 2042, 2043, ...
## $ lon     <dbl> -70.9550, -70.9500, -70.9360, -70.9280, -70.9220, -70....
## $ lat     <dbl> 42.2550, 42.2875, 42.2830, 42.2930, 42.2980, 42.3040, ...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
## $ cmedv   <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 22.1, 16.5, ...
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <int> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...

  a. Create a scatterplot of the median value of homes (medv) and the percentage of lower-status population that lives in the census tract (lstat). Describe the relationship.

The relationship is negative and roughly linear: as the percentage of lower-status population in a tract increases, the median home value decreases.

BostonHousing2 %>% ggplot(aes(lstat, medv)) + geom_point()

  b. Write out the mathematical description of a simple linear regression model of medv on lstat. Describe each of the variables in the model.

The model is \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, 506, \] where \(y_i\) is the \(i^{th}\) census tract’s median home value, \(x_i\) is the \(i^{th}\) census tract’s percentage of lower-status population, \(\beta_0\) is the intercept term, \(\beta_1\) is the slope term, and \(\epsilon_i\) is the \(i^{th}\) error term.

  c. Use 80% of the data to select a training set, and leave 20% of the data for testing. Fit a linear regression model of medv on lstat on the training set.

  d. Calculate RMSE on the training and test data using the linear regression model you fit on the training set. Is there any evidence of overfitting?

  e. Calculate the coefficient of determination.
library(mlbench)
data("BostonHousing2")
set.seed(10)
n <- nrow(BostonHousing2)
# randomly sample 20% of the rows for the test set; the rest form the training set
test_idx <- sample.int(n, size = round(0.2 * n))
BH_train <- BostonHousing2[-test_idx, ]
n_train <- nrow(BH_train)
n_train
## [1] 405
BH_test <- BostonHousing2[test_idx, ]
n_test <- nrow(BH_test)
n_test
## [1] 101
# fit the simple linear regression on the training set only
train_mod <- lm(medv ~ lstat, data = BH_train)
summ_train_mod <- summary(train_mod)
sr_rsq <- summ_train_mod$r.squared
sr_rsq
## [1] 0.5583837
# test RMSE: predictions for the held-out 20%
yhat_test <- predict(train_mod, newdata = BH_test)
y_test <- BH_test$medv

sqrt(sum((y_test - yhat_test)^2) / n_test)
## [1] 6.805511
# training RMSE: predictions for the 80% used to fit the model
yhat_train <- predict(train_mod, newdata = BH_train)
y_train <- BH_train$medv

sr_rmse <- sqrt(sum((y_train - yhat_train)^2) / n_train)
sr_rmse
## [1] 6.048581

There is no evidence of overfitting, since the RMSE values on the training and test sets are very close. These values depend on which observations were included in the training and test sets, so if set.seed() were given a different value, the training and test RMSEs would change, although on average they would be similar.
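Since the same RMSE formula is used for both the training and test sets here (and again in part (f)), it could be wrapped in a small helper function; rmse below is an illustrative name, not something defined elsewhere in this tutorial.

# root mean squared error: square root of the average squared prediction error
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

rmse(y_train, yhat_train)  # training RMSE, same as sr_rmse above
rmse(y_test, yhat_test)    # test RMSE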

\(R^2=\) 0.5583837, so lstat explains about 56% of the variation in medv on the training set. This shows a modest amount of agreement between the observed and predicted values.

  f. Use the training and test sets to build a multiple regression model to predict medv where the following covariates (inputs) are used in addition to lstat: crim, zn, rm, nox, age, tax, ptratio. Does this model provide more accurate predictions compared to the model that you fit in part (c)? Explain.
# re-create the same train/test split as in part (c)
library(mlbench)
data("BostonHousing2")
set.seed(10)
n <- nrow(BostonHousing2)
test_idx <- sample.int(n, size = round(0.2 * n))
BH_train <- BostonHousing2[-test_idx, ] 
n_train <- nrow(BH_train)
n_train
## [1] 405
BH_test <- BostonHousing2[test_idx,]
n_test <- nrow(BH_test)
n_test
## [1] 101
# fit the multiple regression model on the training set only
train_mod <- lm(medv ~ lstat + crim + zn + rm + nox + age + tax + ptratio, data = BH_train)
summ_train_mod <- summary(train_mod)
mr_rsq <- summ_train_mod$r.squared


# test RMSE for the multiple regression model
yhat_test <- predict(train_mod, newdata = BH_test)
y_test <- BH_test$medv

sqrt(sum((y_test - yhat_test)^2) / n_test)
## [1] 5.833651
# training RMSE for the multiple regression model
yhat_train <- predict(train_mod, newdata = BH_train)
y_train <- BH_train$medv

mr_rmse <- sqrt(sum((y_train - yhat_train)^2) / n_train)
mr_rmse
## [1] 4.971456

The training RMSE has decreased from 6.0485807 to 4.9714564, the test RMSE has decreased from 6.805511 to 5.833651, and \(R^2\) has increased from 0.5583837 to 0.7016642. Therefore the multiple linear regression model provides more accurate predictions than the simple linear regression model.
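To collect the comparison in one place, the training RMSE and \(R^2\) for both models could be printed side by side, assuming the sr_rmse, sr_rsq, mr_rmse, and mr_rsq objects computed in the chunks above:

# summary table comparing the simple and multiple regression fits
data.frame(
  model      = c("medv ~ lstat", "medv ~ lstat + 7 other covariates"),
  train_rmse = c(sr_rmse, mr_rmse),
  r_squared  = c(sr_rsq, mr_rsq)
)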