Instructions

What should I bring to tutorial on November 23?

  • R output (e.g., plots and explanations) for Questions 1d, 2e, 2f, 3c, and 3d. You can bring either a hard copy or your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

                                      Mark
Attendance for the entire tutorial    1
Assigned homework completion          1
In-class exercises                    4
Total                                 6

Practice Problems

Question 1

In this question, you will derive the least squares estimators of the intercept \(\beta_0\) and two regression coefficients \(\beta_1\) and \(\beta_2\) for a linear regression model with two covariates, \(x_1\) and \(x_2\).

  (a) Write down the linear regression model. Explain each term in the model.

  (b) Write an expression for the sum of squared errors, \(L(\beta_0, \beta_1, \beta_2)\).

  (c) Calculate the partial derivatives of \(L\) with respect to \(\beta_0\), \(\beta_1\), and \(\beta_2\).

  (d) Write one sentence explaining how you could find the least squares estimates \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\) (you do not need to find the estimates).
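
A hand derivation for this question can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are simulated, the "true" coefficients (2, 0.5, -1.5) are arbitrary, and the names `sse`, `fit_optim`, and `fit_lm` are made up for this example. It minimizes the sum of squared errors with a general-purpose optimizer (`optim`) rather than by solving the derivative equations, and compares the result with `lm()`.

```r
# Simulated data (an assumption for illustration, not course data)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(n, sd = 0.3)

# Sum of squared errors as a function of beta = (beta0, beta1, beta2)
sse <- function(beta) {
  fitted <- beta[1] + beta[2] * x1 + beta[3] * x2
  sum((y - fitted)^2)
}

# Minimize the SSE numerically, starting from (0, 0, 0)
fit_optim <- optim(c(0, 0, 0), sse, method = "BFGS")

# Least squares fit computed by lm() for comparison
fit_lm <- lm(y ~ x1 + x2)

# The two rows should agree closely
rbind(optim = fit_optim$par, lm = coef(fit_lm))
```

If the two rows agree, the numerical minimizer of \(L\) matches the least squares estimates reported by `lm()`.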

Question 2

In this question, you’ll use the mtcars dataset (available in the datasets library), which contains 11 variables for a sample of 32 car models from the 1974 Motor Trend US magazine. Your goal in this question is to investigate the effect of various factors on gas mileage (mpg).

  (a) Produce a scatterplot of gas mileage (mpg) vs horsepower (hp). Write one sentence commenting on the association between these two variables.

  (b) Produce a scatterplot of gas mileage (mpg) vs the rear axle ratio (drat). Write one sentence commenting on the association between these two variables.

  (c) Based on the plots in (a) and (b), which factor do you think is more useful for predicting mpg? Explain in 1-2 sentences.

  (d) Fit two simple linear regression models to predict gas mileage: one with weight as a predictor and the other with rear axle ratio as a predictor. Compare the coefficient of determination in these two models and interpret these values in 1-2 sentences.

  (e) Fit a linear regression model with both horsepower and rear axle ratio as predictors for gas mileage. Does this model explain more of the variability in gas mileage than the models from (d)?

  (f) Based on the model you fit in part (e), what would be the predicted gas mileage for a car with a horsepower of 25 and a rear axle ratio of 6? Do you think this prediction is reliable? Write 1-2 sentences explaining why or why not. Hint: look at the range of values for mpg, hp and drat in the dataset.
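
The R functions needed for this question might be combined roughly as follows. This is a minimal sketch, not a model answer: the object names (`fit_drat`, `fit_both`, `new_car`, `pred`) are illustrative, and only one of the simple models is shown.

```r
data(mtcars)

# Scatterplot of gas mileage against horsepower
plot(mpg ~ hp, data = mtcars,
     xlab = "Horsepower (hp)", ylab = "Gas mileage (mpg)")

# A simple model and a two-predictor model
fit_drat <- lm(mpg ~ drat, data = mtcars)
fit_both <- lm(mpg ~ hp + drat, data = mtcars)

# Coefficient of determination (R^2) for each model
r2_drat <- summary(fit_drat)$r.squared
r2_both <- summary(fit_both)$r.squared
c(drat = r2_drat, hp_and_drat = r2_both)

# Prediction for a hypothetical new car
new_car <- data.frame(hp = 25, drat = 6)
pred <- predict(fit_both, newdata = new_car)

# Observed ranges, useful for judging whether new_car is an extrapolation
sapply(mtcars[c("mpg", "hp", "drat")], range)
```

Comparing the values in `new_car` with the ranges in the last line shows whether the prediction falls inside or outside the region covered by the data.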

Question 3

This question uses housing data for 506 census tracts of Boston from the 1970 census. The data frame BostonHousing2 contains the corrected version of the original data by Harrison and Rubinfeld (1979).

In this question, you will build a linear regression model to predict the median value of owner-occupied homes in thousands of USD (medv).

library(mlbench)
library(dplyr)  # provides glimpse()
data("BostonHousing2")
glimpse(BostonHousing2)
## Observations: 506
## Variables: 19
## $ town    <fct> Nahant, Swampscott, Swampscott, Marblehead, Marblehead...
## $ tract   <int> 2011, 2021, 2022, 2031, 2032, 2033, 2041, 2042, 2043, ...
## $ lon     <dbl> -70.9550, -70.9500, -70.9360, -70.9280, -70.9220, -70....
## $ lat     <dbl> 42.2550, 42.2875, 42.2830, 42.2930, 42.2980, 42.3040, ...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
## $ cmedv   <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 22.1, 16.5, ...
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <int> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...

  (a) Create a scatterplot of the median value of homes (medv) against the percentage of lower-status population in the census tract (lstat). Describe the relationship.

  (b) Write out the mathematical description of a simple linear regression model of medv on lstat. Describe each of the variables in the model.

  (c) Randomly select 80% of the data as a training set, and leave the remaining 20% for testing. Fit a linear regression model of medv on lstat on the training set.

  (d) Calculate the RMSE on the training and test data using the linear regression model you fit on the training set. Is there any evidence of overfitting?

  (e) Calculate the coefficient of determination.

  (f) Use the training and test sets to build a multiple regression model to predict medv in which the following covariates (inputs) are used in addition to lstat: crim, zn, rm, nox, age, tax, ptratio. Does this model provide more accurate predictions than the model that you fit in part (c)? Explain.
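
The train/test workflow in this question can be sketched with two small helpers. Everything named here (`rmse`, `split_train_test`, `sim`, `parts`) is hypothetical and introduced only for illustration, and the data frame `sim` is simulated stand-in data, not the Boston data, so the numbers it produces say nothing about the actual answers.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Randomly assign rows to a training set (default 80%) and a test set
split_train_test <- function(data, prop = 0.8, seed = 1) {
  set.seed(seed)
  train_rows <- sample(nrow(data), size = floor(prop * nrow(data)))
  list(train = data[train_rows, , drop = FALSE],
       test  = data[-train_rows, , drop = FALSE])
}

# Simulated stand-in data (NOT the Boston data): y depends linearly on x
set.seed(42)
sim <- data.frame(x = runif(200, 0, 40))
sim$y <- 35 - 0.9 * sim$x + rnorm(200, sd = 5)

# Fit on the training rows only
parts <- split_train_test(sim)
fit <- lm(y ~ x, data = parts$train)

# Evaluate on both sets; a test RMSE far above the training RMSE
# would suggest overfitting
rmse_train <- rmse(parts$train$y, predict(fit, newdata = parts$train))
rmse_test  <- rmse(parts$test$y,  predict(fit, newdata = parts$test))
c(train = rmse_train, test = rmse_test)

# Coefficient of determination on the training set
summary(fit)$r.squared
```

With BostonHousing2 loaded, the same helpers could be applied by calling `split_train_test(BostonHousing2)` and fitting `lm(medv ~ lstat, data = parts$train)`. Additional covariates can be added to the right-hand side of the formula for the multiple regression model.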