class: center, middle, inverse, title-slide

# STA130H1F
## Class #10
### Prof. Nathan Taback
### 2018-11-19

---
## This Class

- Relationships between two variables
- Linear Relationships: The equation of a straight line
- Linear regression models
- Estimating the coefficients: Least Squares
- Interpreting the slope with a continuous explanatory variable
- Prediction/Supervised learning using a linear regression model
- `\(R^2\)` - Coefficient of Determination
- Introduction to Multiple Regression

---
# Relationships between two variables

---
## Advertising Example

- Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product.

- The `Advertising` data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.


```r
glimpse(Advertising)
```

```
## Observations: 200
## Variables: 4
## $ TV        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8...
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1,...
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0,...
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 1...
```

---
## Advertising Example

- It is not possible for our client to directly increase sales of the product, but they can control the advertising expenditure in each of the three media.

- Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales.

---
## Increasing sales through advertising

What is the relationship between `sales` and `TV` budget?


```r
Advertising %>%
  ggplot(aes(x = TV, y = sales)) +
  geom_point() +
  theme_minimal()
```

![](sta130_class10_nt_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---
## Increasing sales through advertising

- In general, as the budget for `TV` increases, `sales` increases.

- Sometimes, though, increasing the `TV` budget didn't increase `sales`.

- The relationship between these two variables is approximately linear.

---
## Linear Relationships

A perfect linear relationship between an independent variable `\(x\)` and dependent variable `\(y\)` has the mathematical form:

`$$y = \beta_0+\beta_1x.$$`

`\(\beta_0\)` is called the `\(y\)`-intercept and `\(\beta_1\)` is called the slope.

---
# Linear Relationships: The equation of a straight line

---
## Linear Relationships: The equation of a straight line

If the relationship between `\(y\)` and `\(x\)` is perfectly linear then the scatter plot could look like:

![](sta130_class10_nt_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---
## Linear Relationships: The equation of a straight line

What is the equation of the straight line that fits these points?

![](sta130_class10_nt_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

First four observations:

```
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     0    0 
## 2     2  133.
## 3     4  267.
## 4     6  400 
```

---
## Fitting a straight line to data

Use analytic geometry to find the equation of the straight line: pick any two points `\((x^{(1)},y^{(1)})\)` and `\((x^{(2)},y^{(2)})\)` on the line.

The slope is:

`$$m = \frac{y^{(1)} - y^{(2)}}{x^{(1)} - x^{(2)}}.$$`

So the equation of the line with slope `\(m\)` passing through `\((x^{(1)},y^{(1)})\)` is

`$$y - y^{(1)} = m(x - x^{(1)}) \Rightarrow y =mx +b,$$`

where `\(b=y^{(1)}-mx^{(1)}.\)`
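As a quick check, we can compute `\(m\)` and `\(b\)` in R; a minimal sketch, with the two points hard-coded from the table on the previous slide:


```r
x1 <- 0; y1 <- 0      # first point from the table
x2 <- 6; y2 <- 400    # fourth point from the table
m <- (y1 - y2) / (x1 - x2)  # slope: approximately 66.67
b <- y1 - m * x1            # intercept: 0
c(m = m, b = b)
```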
---
## Linear Relationships: The equation of a straight line

What is the equation of the 'best' straight line that fits these points?

![](sta130_class10_nt_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

```
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1    -4    16
## 2    -2     4
## 3     0     0
## 4     2     4
```

---
# Relationships between two variables

---
## Relationships between two variables

- Sometimes the relationship between two variables is non-linear.

- If the relationship is non-linear then fitting a straight line to the data is not useful in describing the relationship.

---
## Example of Non-linear relationships

- Let `\(y\)` be life expectancy of a component, and `\(x\)` the age of the component.

- There is a relationship between `\(y\)` and `\(x\)`, but it is not linear.


```r
p <- data_frame(x = age, y = life_exp) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()
p
```

![](sta130_class10_nt_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

```r
p + geom_smooth(method = "lm", se = F)
```

![](sta130_class10_nt_files/figure-html/unnamed-chunk-9-2.png)<!-- -->

---
## Tidy the Advertising Data

- Each market is an observation, but each column gives the amount spent on TV, radio, and newspaper advertising.

```
## # A tibble: 3 x 4
##      TV radio newspaper sales
##   <dbl> <dbl>     <dbl> <dbl>
## 1  230.  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
```

- The data are not tidy since each column corresponds to the values of advertising budget for different media.

---
## Tidy the Advertising Data

- Tidy the data by creating a column for advertising budget and another column for type of advertising.

- We can use the `gather` function in the `tidyr` library (part of the `tidyverse` library) to tidy the data.


```r
Advertising_long <- Advertising %>%
  select(TV, radio, newspaper, sales) %>%
  gather(key = adtype, value = amount, TV, radio, newspaper)
head(Advertising_long)
```

```
## # A tibble: 6 x 3
##   sales adtype amount
##   <dbl> <chr>   <dbl>
## 1  22.1 TV      230. 
## 2  10.4 TV       44.5
## 3   9.3 TV       17.2
## 4  18.5 TV      152. 
## 5  12.9 TV      181. 
## 6   7.2 TV        8.7
```

---
## Advertising Data


```r
Advertising_long %>%
  ggplot(aes(amount, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(. ~ adtype)
```

![](sta130_class10_nt_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

- The advertising budgets (`newspaper`, `radio`, `TV`) are the inputs/independent variables/covariates and the dependent variable is `sales`.

---
# Linear Regression Models

---
## Simple Linear Regression

The simple linear regression model can describe the relationship between sales and the amount spent on radio advertising through the model

`$$y_i =\beta_0 + \beta_1 x_i + \epsilon_i,$$`

where `\(i=1,\ldots,n\)` and `\(n\)` is the number of observations.


```r
Advertising_long %>%
  filter(adtype == "radio") %>%
  ggplot(aes(amount, sales)) +
  geom_point()
```

![](sta130_class10_nt_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---
## Simple Linear Regression

The equation:

`$$y_i =\beta_0 + \beta_1 x_i + \epsilon_i$$`

is called a __regression model__ and, since we have only one independent variable, it is called a _simple regression model_.

- `\(y_i\)` is called the dependent or target variable.
- `\(\beta_0\)` is the intercept parameter.
- `\(x_i\)` is the independent variable, covariate, feature, or input.
- `\(\beta_1\)` is called the slope parameter.
- `\(\epsilon_i\)` is called the error term.
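The roles of these quantities are easy to see in a simulation. Below is a minimal sketch, with made-up values `\(\beta_0 = 9\)` and `\(\beta_1 = 0.2\)` (chosen for illustration only, not estimated from the `Advertising` data), that generates observations from a simple linear regression model:


```r
set.seed(130)
n <- 100
x <- runif(n, min = 0, max = 50)       # independent variable
beta0 <- 9                             # made-up intercept
beta1 <- 0.2                           # made-up slope
epsilon <- rnorm(n, mean = 0, sd = 2)  # random errors
y <- beta0 + beta1 * x + epsilon       # dependent variable
```

Each `\(y_i\)` is the straight-line value `\(\beta_0 + \beta_1 x_i\)` plus a random error `\(\epsilon_i\)`.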
---
## Multiple Linear Regression

In general, models of the form

`$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_{i},$$`

where `\(i=1,\ldots,n\)`, with `\(k>1\)` independent variables are called _multiple regression models_.

- The `\(\beta_j\)`'s are called parameters and the `\(\epsilon_i\)`'s errors.
- The values of neither the `\(\beta_j\)`'s nor the `\(\epsilon_i\)`'s can ever be known, but they can be estimated.
- The "linear" in Linear Regression means that the equation is linear in the parameters `\(\beta_j\)`.
    - This is a linear regression model: `\(y_i = \beta_0 + \beta_1 \sqrt{x_{i1}} + \beta_2 x_{i2}^2 + \epsilon_{i}\)`
    - This is not a linear regression model: `\(y_i = \beta_0 + \sin(\beta_1) x_{i1} + \beta_2 x_{i2} + \epsilon_{i}\)`. This is called a nonlinear regression model.

---
# Least Squares

---
## Fitting a straight line to Sales and Radio Advertising

![](sta130_class10_nt_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

```
## # A tibble: 6 x 2
##   sales amount
##   <dbl>  <dbl>
## 1  22.1   37.8
## 2  10.4   39.3
## 3   9.3   45.9
## 4  18.5   41.3
## 5  12.9   10.8
## 6   7.2   48.9
```

---
## Fitting a straight line to Sales and Radio Advertising


```r
head(Advertising_long %>% filter(adtype == "radio")) %>%
  select(sales, amount)
```

```
## # A tibble: 6 x 2
##   sales amount
##   <dbl>  <dbl>
## 1  22.1   37.8
## 2  10.4   39.3
## 3   9.3   45.9
## 4  18.5   41.3
## 5  12.9   10.8
## 6   7.2   48.9
```

`\(m = \frac{22.1-10.4}{37.8-39.8}=\)` -5.85, `\(b = 22.1-\frac{22.1-10.4}{37.8-39.8}\times37.8=\)` 243.23.

So, the equation of the straight line is:

`$$y=243.23-5.85x.$$`

---
## Fitting a straight line to Sales and Radio Advertising

The equation `\(y=243.23-5.85x\)` is shown on the scatter plot.

![](sta130_class10_nt_files/figure-html/unnamed-chunk-16-1.png)<!-- -->

---
## Fitting a straight line to Sales and Radio Advertising

- For a fixed value of `amount` spent on radio ads the corresponding `sales` values vary; they are neither strictly increasing nor strictly decreasing.

- But the overall pattern displayed in the scatterplot shows that _on average_ `sales` increase as the `amount` spent on radio ads increases.

---
## Least Squares

The Least Squares approach is to find the `\(y\)`-intercept `\(\beta_0\)` and slope `\(\beta_1\)` of the straight line that is closest to as many of the points as possible.

![](sta130_class10_nt_files/figure-html/unnamed-chunk-17-1.png)<!-- -->

---
## Estimating the coefficients: Least Squares

To find the values of the intercept `\(\beta_0\)` and slope `\(\beta_1\)` that fit the data best we can minimize the sum of `\(\epsilon_i^2=\left(y_i -\beta_0 - \beta_1 x_i\right)^2\)` (squared errors):

`$$\sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n \left(y_i -\beta_0 - \beta_1 x_i\right)^2.$$`

So, we want to minimize a function of `\(\beta_0, \beta_1\)`

`$$L(\beta_0,\beta_1) = \sum_{i=1}^n \left(y_i -\beta_0 - \beta_1 x_i\right)^2,$$`

where the `\(x_i\)`'s are numbers and therefore constants.

---
## Estimating the coefficients: Least Squares

- The derivative of `\(L(\beta_0,\beta_1)\)` with respect to `\(\beta_0\)` treats `\(\beta_1\)` as a constant. This is also called the partial derivative and is denoted as `\(\frac{\partial L}{\partial \beta_0}.\)`

- To find the values of `\(\beta_0\)` and `\(\beta_1\)` that minimize `\(L(\beta_0,\beta_1)\)` we set the partial derivatives to zero and solve:

$$
`\begin{aligned}
\frac{\partial L}{\partial \beta_0} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_i) = 0, \\
\frac{\partial L}{\partial \beta_1} &= -2 \sum_{i=1}^n (y_i -\beta_0 - \beta_1 x_i)x_{i} =0.
\end{aligned}`
$$

The values of `\(\beta_0\)` and `\(\beta_1\)` that solve the above equations are denoted `\(\hat \beta_0\)` and `\(\hat \beta_1\)` respectively.

---
## Estimating the coefficients: Least Squares

It can be shown that

$$
`\begin{aligned}
\hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x} \\
\hat{\beta_1} &= \frac{(\sum_{i=1}^n y_ix_i) - n \bar{x}\bar{y}}{(\sum_{i=1}^n x_i^2) - n\bar{x}^2},
\end{aligned}`
$$

where `\(\bar{y} = \sum_{i=1}^n{y_i}/n\)` and `\(\bar{x} = \sum_{i=1}^n{x_i}/n.\)`

`\(\hat \beta_0\)` and `\(\hat \beta_1\)` are called the least squares estimators of `\(\beta_0\)` and `\(\beta_1\)`.
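These formulas can be checked numerically. Here is a minimal sketch, assuming the `Advertising` data frame loaded earlier, that computes `\(\hat \beta_0\)` and `\(\hat \beta_1\)` for the regression of `sales` on `radio` directly from the formulas and compares them with the estimates from `lm()`:


```r
x <- Advertising$radio
y <- Advertising$sales
n <- length(y)

# least squares estimators from the closed-form formulas
beta1_hat <- (sum(y * x) - n * mean(x) * mean(y)) /
  (sum(x^2) - n * mean(x)^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(sales ~ radio, data = Advertising))  # should match
```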
---
## Estimating the Coefficients Using R - Formula syntax in R

The R syntax for defining relationships between inputs such as the amount spent on `newspaper` advertising and outputs such as `sales` is:


```r
sales ~ newspaper
```

The tilde `~` is used to define what the output variable (or outcome, on the left-hand side) is and what the input variables (or predictors, on the right-hand side) are.

A formula that has three inputs can be written as


```r
sales ~ newspaper + TV + radio
```

---
## Estimating the Coefficients Using `lm()`


```r
mod_paper <- lm(sales ~ newspaper, data = Advertising)
mod_paper_summary <- summary(mod_paper)
mod_paper_summary$coefficients
```

```
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 12.3514071 0.62142019 19.876096 4.713507e-49
## newspaper    0.0546931 0.01657572  3.299591 1.148196e-03
```

- `(Intercept)` is `\(\hat \beta_0\)`, the estimate of `\(\beta_0\)`.
- `newspaper` is `\(\hat \beta_1\)`, the estimate of `\(\beta_1\)`.

---
## Estimating the Coefficients Using R

.pull-left[
- The blue line is the estimated regression line with intercept 12.35 and slope 0.05.

- `geom_smooth(method = "lm", se = FALSE)` adds the linear regression line to the scatterplot without a confidence interval for the line (this is set via `se = FALSE`).
]

.pull-right[
.midi[

```r
Advertising_long %>%
  filter(adtype == "newspaper") %>%
  ggplot(aes(amount, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
```
]

![](sta130_class10_nt_files/figure-html/plot-label-out-1.png)<!-- -->
]

---
## Interpreting the Slope and Intercept with a Continuous Explanatory Variable

The estimated linear regression of `sales` on `newspaper` is:

`$$\hat y_i = 12.35 + 0.05x_i,$$`

where `\(y_i\)` is sales in the `\(i^{th}\)` market and `\(x_i\)` is the dollar amount spent on newspaper advertising in the `\(i^{th}\)` market.

- The __slope__ `\(\hat \beta_1\)` is the amount of change in `\(y\)` for a unit change in `\(x\)`.
    - Sales increase by 0.05 for each dollar spent on newspaper advertising.
- The __intercept__ `\(\hat \beta_0\)` is the average value of `\(y\)` when `\(x=0\)`.
    - The average sales is 12.35 when the amount spent on newspaper advertising is zero.
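A quick way to extract just the fitted coefficients from the model object is `coef()` (values as reported in the `summary()` output above):


```r
coef(mod_paper)
```

```
## (Intercept)   newspaper 
##  12.3514071   0.0546931
```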
---
## Prediction using a Linear Regression Model

After a linear regression model is estimated from data, it can be used to calculate predicted values using the regression equation

`$$\hat y_{i} = \hat \beta_0 + \hat \beta_1 x_{i}.$$`

`\(\hat y_{i}\)` is the predicted value of the `\(i^{th}\)` response `\(y_i\)`.

The `\(i^{th}\)` residual is

`$$e_i = y_i - \hat y_i.$$`

---
## Prediction using a Linear Regression Model

The amount spent on newspaper advertising in the first market is:


```r
Advertising %>% filter(row_number() == 1)
```

```
## # A tibble: 1 x 4
##      TV radio newspaper sales
##   <dbl> <dbl>     <dbl> <dbl>
## 1  230.  37.8      69.2  22.1
```

- The predicted sales using the regression model is: `\(12.3514 + 0.0547\times 69.2 =\)` 16.14.
- The observed sales for this market is 22.1.
- The __error__ or __residual__ is `\(y_1-\hat {y_1}=\)` 5.96.

---
## Prediction using a Linear Regression Model

The predicted and residual values from a regression model can be obtained using the `predict()` and `residuals()` functions.


```r
mod_paper <- lm(sales ~ newspaper, data = Advertising)
sales_pred <- predict(mod_paper)
head(sales_pred)
```

```
##        1        2        3        4        5        6 
## 16.13617 14.81807 16.14164 15.55095 15.54548 16.45339
```

```r
sales_resid <- residuals(mod_paper)
head(sales_resid)
```

```
##         1         2         3         4         5         6 
##  5.963831 -4.418066 -6.841639  2.949047 -2.645484 -9.253389
```

---
## Measure of Fit for Simple Regression

- The regression model is a good fit when the residuals are small.
- Thus, we can measure the quality of fit by the sum of squares of the residuals `\(\sum_{i=1}^n (y_{i}- \hat y_{i})^2\)`.
- This quantity depends on the units in which the `\(y_i\)`'s are measured. A measure of fit that does not depend on the units is:

`$$R^2 = 1- \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (y_i-\bar y)^2}.$$`

- `\(R^2\)` is often called the coefficient of determination.
- `\(0 \le R^2 \le 1,\)` where 1 indicates a perfect match between the observed and predicted values and 0 indicates a poor match.

---
## Measure of Fit for Simple Regression

The `summary()` method calculates `\(R^2\)`


```r
mod_paper <- lm(sales ~ newspaper, data = Advertising)
mod_paper_summ <- summary(mod_paper)
mod_paper_summ$r.squared
```

```
## [1] 0.05212045
```

- `\(R^2=\)` 0.052. This indicates a poor fit.

---
## Using Linear Regression as a Machine Learning/Supervised Learning Tool

The `diamonds` data set contains the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

```
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
```

Question: Predict the price of diamonds based on carat size.

---
## Predicting the Price of Diamonds

Let's select training and test sets.


```r
set.seed(2)
diamonds_train <- diamonds %>%
  mutate(id = row_number()) %>%
  sample_frac(size = 0.8)

diamonds_test <- diamonds %>%
  mutate(id = row_number()) %>%
  # return all rows from diamonds where there are not
  # matching values in diamonds_train, keeping just
  # columns from diamonds.
  anti_join(diamonds_train, by = 'id')
```
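A quick sanity check on the split, assuming the code above has run; `sample_frac(size = 0.8)` keeps exactly 80% of the 53,940 rows:


```r
nrow(diamonds_train)  # 43152 rows: 80% of 53940
nrow(diamonds_test)   # 10788 rows: the remaining 20%
```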
---
## Predicting the Price of Diamonds

- Now fit a regression model on `diamonds_train`.


```r
mod_train <- lm(price ~ carat, data = diamonds_train)
mod_train_summ <- summary(mod_train)
mod_train_summ$r.squared
```

```
## [1] 0.8488647
```

- Evaluate the prediction error of the model trained on `diamonds_train` by computing the root mean squared error (RMSE) on `diamonds_test`.

`$$\text {RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2}$$`

- Dividing by `\(n\)` puts data sets of different sizes on an equal footing, and the square root ensures that RMSE is on the same scale as `\(y\)`.

---
## Predicting the Price of Diamonds using Simple Linear Regression

- Calculate RMSE using the test and training data.


```r
y_test <- diamonds_test$price
yhat_test <- predict(mod_train, newdata = diamonds_test)
n_test <- length(diamonds_test$price)
# test RMSE
rmse <- sqrt(sum((y_test - yhat_test)^2) / n_test)
rmse
```

```
## [1] 1553.295
```

```r
y_train <- diamonds_train$price
yhat_train <- predict(mod_train, newdata = diamonds_train)
n_train <- length(diamonds_train$price)
# train RMSE
sqrt(sum((y_train - yhat_train)^2) / n_train)
```

```
## [1] 1547.391
```

---
## Predicting the Price of Diamonds using Multiple Linear Regression

We will add other variables to the regression model to investigate whether we can decrease the prediction error.

.pull-left[.small[

```r
mrmod_train <- lm(price ~ carat + cut + color + clarity,
                  data = diamonds_train)
mrmod_train_summ <- summary(mrmod_train)
mrmod_train_summ$r.squared
```

```
## [1] 0.9152898
```
]]

.pull-right[.small[

```r
y_test <- diamonds_test$price
yhat_test <- predict(mrmod_train, newdata = diamonds_test)
n_test <- length(diamonds_test$price)
mr_rmse <- sqrt(sum((y_test - yhat_test)^2) / n_test)
mr_rmse
```

```
## [1] 1149.881
```
]]

- The simple linear regression model had `\(R^2=\)` 0.849 and RMSE = 1553.3, so adding `cut`, `color`, and `clarity` improved both measures.
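The RMSE formula is typed out several times above; a small helper function avoids the repetition. This is a sketch, and the name `rmse_fn` is our own; note that `mean((y - yhat)^2)` is the same computation as `sum((y - yhat)^2) / n`:


```r
# root mean squared error: same computation as above
rmse_fn <- function(y, yhat) sqrt(mean((y - yhat)^2))

rmse_fn(diamonds_test$price, predict(mod_train, newdata = diamonds_test))    # 1553.295
rmse_fn(diamonds_test$price, predict(mrmod_train, newdata = diamonds_test))  # 1149.881
```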