STAT 306 - Lecture 07
So far, we’ve only considered the simple linear regression with one covariate. In practice, we often have more than one covariate of interest.
When we have more than one covariate, we call the model a multiple linear regression.
The theory underlying the simple linear model naturally generalizes to multiple linear regression;
However, in multiple linear regression, it’s more convenient to express our linear model in terms of vectors and matrices.
What are the least squares estimators and their properties?
What changes in the interpretation of such models or when making inference/predictions with them?
Confidence intervals and hypothesis tests for the parameters.
Prediction intervals for future observations.
Model diagnostics.
\[ Y_i = \beta_0 + \beta_1x_i + \varepsilon_i \;, \qquad i=1,\dotsc,n \;, \]
\[ \begin{align} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} &= \begin{bmatrix} \beta_0 + \beta_1x_1 + \varepsilon_1 \\ \beta_0 + \beta_1x_2 + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1x_n + \varepsilon_n \end{bmatrix}\\ \ \\ &= \begin{bmatrix} \beta_0 + \beta_1x_1\\ \beta_0 + \beta_1x_2\\ \vdots \\ \beta_0 + \beta_1x_n\end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}\\ \ \\ &=\begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \end{align} \]
If \(\mathbf{X}^\top\mathbf{X}\) is invertible, then by the normal equations \[ \begin{align} \mathbf{X}^\top\mathbf{X}\hat\beta &= \mathbf{X}^\top\mathbf{y} \Leftrightarrow\\ (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\hat\beta &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \Leftrightarrow\\ \hat\beta &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \end{align} \]
In other words, our least squares estimate is given by: \[\hat\beta =(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
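As a quick sanity check, the formula can be verified in R against lm(). The data below are made-up toy values, used only for illustration:

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Design matrix: a column of ones for the intercept, then the covariate
X <- cbind(1, x)

# Least squares estimate: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# Should match the coefficients reported by lm()
coef(lm(y ~ x))
```

Both computations return the same intercept and slope, as the derivation above guarantees.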
For the simple linear model, \(\mathbf{X}^\top\mathbf{X}\) is not invertible when \(x_1=x_2=\dotsc=x_n\).
Intuitively, this makes sense: if all the covariate values in our data are the same, then it’s not possible to observe a trend between the covariate and the response.
In general, the matrix \(\mathbf{X}^\top\mathbf{X}\) is invertible if and only if the design matrix \(\mathbf{X}_{n\times (p+1)}\) has \(rank(\mathbf{X})=p+1\), i.e., the columns of \(\mathbf{X}\) are linearly independent.
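This is easy to see in R (with toy values assumed): when every covariate value is identical, the two columns of \(\mathbf{X}\) are linearly dependent and \(\mathbf{X}^\top\mathbf{X}\) is singular.

```r
x <- rep(3, 5)        # all covariate values identical
X <- cbind(1, x)      # second column is 3 times the first: rank 1, not 2
XtX <- t(X) %*% X

det(XtX)              # determinant is 0, so XtX is not invertible
# solve(XtX) would fail here with an "exactly singular" error
```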
iClicker 7.3
Suppose we want to explain \(Y\) using \(p\) covariates \(X_1, X_2, \ldots, X_p\).
The model generalizes very naturally to: \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \cdots + \beta_pX_{ip} + \varepsilon_i \;, \qquad i=1,\dotsc,n \;, \]
\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1} \;, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}_{n \times (p+1)} \;, \\ \pmb{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}_{(p+1) \times 1} \;, \qquad \pmb{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}_{n \times 1} \;. \]
Note that if we have more than one covariate (i.e., \(p>1\)), then we are working in a \((p+1)\)-dimensional space.
We don’t have a line anymore, but a hyperplane.
The residuals of the model with intercept \(b_0\) and slope \(b_1\) are given by \[ \begin{align} \mathbf{e} &= \begin{bmatrix} Y_1 - b_0 - b_1x_1 \\ Y_2 - b_0 - b_1x_2 \\ \vdots \\ Y_n - b_0 - b_1x_n \end{bmatrix} \ \\ &= \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} - \begin{bmatrix} b_0 + b_1x_1 \\ b_0 + b_1x_2 \\ \vdots \\ b_0 + b_1x_n \end{bmatrix}\\ \ \\ &= \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} - \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \ \\ &= \mathbf{Y} - \mathbf{X}\mathbf{b} \;. \end{align} \]
iClicker 7.5
For SLR, since we had only two coefficients, we used the estimator of the variance of the errors \[ \hat{\sigma}^2 = \frac{RSS}{n-2} \]
For MLR, since we have \(p+1\) coefficients, we use \[ \hat{\sigma}^2 = \frac{RSS}{n-p-1} \]
\[ \begin{align} RSS(\mathbf{b}) &= \mathbf{e^\top e}\\ &= (\mathbf{Y} - \mathbf{X}\mathbf{b})^\top(\mathbf{Y} - \mathbf{X}\mathbf{b}) \\ &= \mathbf{b^\top X^\top X b} - \mathbf{b^\top X^\top Y} - \mathbf{Y^\top X b} + \mathbf{Y^\top Y} \\ &= \mathbf{b^\top X^\top X b} - 2\mathbf{b^\top X^\top Y} + \mathbf{Y^\top Y} \end{align} \]
\[ \frac{\partial RSS(\mathbf{b})}{\partial \mathbf{b}} = 2\mathbf{X^\top X b} - 2\mathbf{X^\top Y} \]
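Setting this gradient to zero yields the normal equations used earlier to derive \(\hat\beta\):

\[ 2\mathbf{X^\top X b} - 2\mathbf{X^\top Y} = \mathbf{0} \quad\Longleftrightarrow\quad \mathbf{X^\top X}\hat\beta = \mathbf{X^\top Y} \;. \]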
Goal: Manually fit a linear model in the case with more than two covariates; explore measures of model fit and their properties.
You may work on this activity alone or with others (recommended).
If you or your group needs help, please raise your hand.
When you are done with one question, move on to the next. iClicker questions will be asked at certain points to monitor progress.
Suppose we want to explain penguins’ body mass not only by their flipper length but also by their bill depth. Our linear model is then
\[ Y_i = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} +\varepsilon_i \;, \qquad \varepsilon_i\sim N(0,\sigma^2) \;. \]
Construct the design matrix \(\mathbf{X}\) by completing the partial R code below. Which of the following R functions will you need to use to do so?
c()
rbind()
cbind()
data.frame()
Some commands you’ll need to compute the least squares estimate \(\hat\beta\):
The function t(A) computes the matrix transpose \(\mathbf{A}^\top\).
The operator A %*% B computes the matrix product \(\mathbf{AB}\).
The function solve(A,b) computes the vector c such that \(\mathbf{Ac=b}\).
Given the above, compute the least squares estimate
\[ \hat\beta = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \;. \]
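Putting the three commands together gives a sketch like the one below. The vectors x1, x2, and y are hypothetical stand-ins for the flipper length, bill depth, and body mass columns of the penguin data, not the actual dataset:

```r
# Hypothetical stand-ins for flipper length, bill depth, and body mass
x1 <- c(181, 186, 195, 193, 190, 181)
x2 <- c(18.7, 17.4, 18.0, 19.3, 20.6, 17.8)
y  <- c(3750, 3800, 3250, 3450, 3650, 3625)

X <- cbind(1, x1, x2)                      # design matrix, n x (p+1)

# solve(A, b) solves A c = b, so this solves (X'X) beta = X'y directly
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat
```

Using solve(A, b) on the normal equations avoids explicitly forming the inverse \((\mathbf{X}^\top\mathbf{X})^{-1}\), which is both faster and numerically more stable.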
\[ \begin{align} RSS_{p=2} &:= \mathbf{e^\top e} \\ &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\ &= \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2})^2 \;. \end{align} \]
\[ RSS_{p=3} := \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2} - \hat\beta_3x_{i3})^2 \; . \]
Solution
Since the least squares estimates minimize the sum of squares over all candidate coefficient vectors, \[ \begin{align} RSS_{p=3} &= \min_{b_0,b_1,b_2,b_3} \sum_{i=1}^n(y_i - b_0 - b_1x_{i1} - b_2x_{i2} - b_3x_{i3})^2 \\ &\le \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2} - 0\cdot x_{i3})^2 \\ &= RSS_{p=2} \;, \end{align} \] where \(\hat\beta_0, \hat\beta_1, \hat\beta_2\) are the least squares estimates from the \(p=2\) model. Setting \(b_3=0\) exactly reproduces the \(p=2\) fit, so the minimizing coefficients of the larger model can only do as well or better: \(RSS_{p=3}\le RSS_{p=2}\).
In other words, adding covariates to our linear model never increases the minimum value achievable for the residual sum of squares (which is a measure of the “error” made by the model on our data).
So, adding new covariates to our model never increases the “total error” made by our model on the observed data (and may well decrease it).
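This can be checked empirically in R by fitting nested models and comparing their residual sums of squares; the data below are simulated, with x3 constructed to be pure noise:

```r
set.seed(306)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # x3 plays no role in generating y

rss2 <- sum(resid(lm(y ~ x1 + x2))^2)        # smaller model
rss3 <- sum(resid(lm(y ~ x1 + x2 + x3))^2)   # larger model

rss3 <= rss2   # TRUE: adding x3 cannot increase the RSS
```

Even though x3 is unrelated to y, the larger model's RSS is (weakly) smaller, exactly as the algebra above predicts.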
Does this mean we should always add as many variables as we can to our model?
Let \(k = p+1\) denote the number of \(\beta\) parameters in the model.
Our estimator of \(\sigma^2\), the residual mean square \(\hat{\sigma}^2\), is defined as \[ \hat{\sigma}^2 = \frac{\mathbf{e^\top e}}{n-k} = \frac{RSS}{n-k} \;. \]
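In R, \(\hat\sigma^2\) can be computed from the residual vector and checked against the residual standard error that summary() reports (simulated toy data assumed):

```r
set.seed(1)
n  <- 25
x1 <- runif(n); x2 <- runif(n)
y  <- 3 + x1 + 2 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k   <- length(coef(fit))       # number of beta parameters, k = p + 1

rss        <- sum(resid(fit)^2)
sigma2_hat <- rss / (n - k)    # residual mean square: RSS / (n - k)

sigma2_hat
summary(fit)$sigma^2           # summary() reports sigma-hat, so square it
```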
Suppose that \(0<R^2<1\) and we add an additional covariate to the model.
iClicker 7.8
Solution
About \(R^2\): \[ R^2 = 1 - \frac{RSS}{\mathbf{y^\top y}-n\bar{y}^2} \]
The denominator \(\mathbf{y^\top y}-n\bar{y}^2\) in the definition of \(R^2\) is fixed given data, and as we saw in Question 3, adding covariates never increases \(RSS\). Hence, \(R^2\) will never decrease with additional covariates.
To account for the number of predictors in the model, we can use the adjusted \(R^2\), which is \(R^2\) “corrected” for degrees of freedom and defined as \[ \mathrm{adj}\: R^2 = 1 - \frac{\left(\frac{RSS}{n-k}\right)}{\left(\frac{\mathbf{y^\top y}-n\bar{y}^2}{n-1}\right)} = 1 - \frac{(n-1)\hat\sigma^2}{\mathbf{y^\top y}-n\bar{y}^2} \;. \]
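Both quantities can be computed by hand from these formulas and checked against summary() (simulated data again, with x2 deliberately unrelated to y):

```r
set.seed(42)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + x1 + rnorm(n)            # x2 does not appear here

fit <- lm(y ~ x1 + x2)
k   <- length(coef(fit))           # k = p + 1
rss <- sum(resid(fit)^2)
tss <- sum(y^2) - n * mean(y)^2    # y'y - n * ybar^2

r2     <- 1 - rss / tss
adj_r2 <- 1 - (rss / (n - k)) / (tss / (n - 1))

c(r2, adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # same values
```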
Unlike \(R^2\), adjusted \(R^2\) can either increase or decrease when covariates are added. Note also that, unlike \(R^2\), it is not bounded below by zero: it can even be negative.
Both \(R^2\) and adjusted \(R^2\) are provided in the output of summary().
© Rodolfo Lourenzutti - Adapted from Kenny Chiu’s material – licensed under CC By 4.0