Linear models: vector and matrix formulation

STAT 306 - Lecture 07

General announcements

  1. Webwork 3 due today
  2. No webwork next week.
  3. Midterm on Nov 7th (in class)

Learning objectives

  1. Express a linear model in matrix form, defining all vectors and matrices involved.
  2. Provide the dimensions of all vectors and matrices relevant to a linear model.
  3. Recall that least squares estimation in a linear model is not possible if \(\mathbf{X}^\top\mathbf{X}\) is singular, where \(\mathbf{X}\) is the design matrix of the model.
  4. Define key statistics computed in the fit of a linear model, including the residual sum of squares, the residual mean square, \(R^{2}\), and adjusted \(R^2\), in terms of vectors and matrices.
  5. Explain how the residual sum of squares, the residual mean square, \(R^{2}\), and adjusted \(R^{2}\) can vary with the addition of explanatory variables to the fitted linear model.

Linear models in terms of vectors and matrices

  • So far, we’ve only considered the simple linear regression with one covariate. In practice, we often have more than one covariate of interest.

  • When we have more than one covariate, we call the model a multiple linear regression.

  • The theory underlying the simple linear model naturally generalizes to multiple linear regression;

  • However, in multiple linear regression, it’s more convenient to express our linear model in terms of vectors and matrices.

Multiple Linear Regression: our goals for the next lectures

  • What are the least squares estimators and their properties?

  • What changes in the interpretation of such models, and when making inferences or predictions with them?

  • Confidence intervals and hypothesis tests for the parameters.

  • Prediction intervals for future observations.

  • Model diagnostics.

Simple Linear Model: vector and matrix formulation

  • We have formulated the simple linear model equation as:

\[ Y_i = \beta_0 + \beta_1x_i + \varepsilon_i \;, \qquad i=1,\dotsc,n \;, \]

  • How can we write this in terms of vectors and matrices?

\[ \begin{align} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} &= \begin{bmatrix} \beta_0 + \beta_1x_1 + \varepsilon_1 \\ \beta_0 + \beta_1x_2 + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1x_n + \varepsilon_n \end{bmatrix}\\ \ \\ &= \begin{bmatrix} \beta_0 + \beta_1x_1\\ \beta_0 + \beta_1x_2\\ \vdots \\ \beta_0 + \beta_1x_n\end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}\\ \ \\ &=\begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \end{align} \]

  • Therefore, we can write the simple linear model as \[ \mathbf{Y} = \mathbf{X}\pmb{\beta} + \pmb{\varepsilon} \;, \] where: \[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1} \;, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}_{n \times 2} \;, \qquad \pmb{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}_{2 \times 1} \;, \qquad \pmb{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}_{n \times 1} \;. \]
  • Note that the first column of \(\mathbf{X}\) is a column of \(1\)’s.
    • This column corresponds to the intercept parameter \(\beta_0\).
  • The matrix \(\mathbf{X}\) is called the design matrix.
    • An inaccurately restrictive name that stuck for historical reasons.
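In R, the design matrix of a simple linear model can be built by binding a column of \(1\)'s to the covariate vector (the data below are made up for illustration):

```r
# Toy covariate values, made up for illustration
x <- c(1.2, 2.5, 3.1, 4.8, 5.0)
n <- length(x)

# Design matrix: a column of 1's for the intercept, then the covariate
X <- cbind(1, x)
dim(X)  # n rows and 2 columns
```

The single `1` passed to `cbind()` is recycled into a full column of \(1\)'s of length \(n\).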

Matrix formulation of the normal equations

  • Recall the normal equations from Lecture 2 that we solved to find the least squares estimates \(\hat\beta^\top=\begin{bmatrix}\hat\beta_0 & \hat\beta_1\end{bmatrix}\): \[ \begin{align} n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^nx_i &= \sum_{i=1}^ny_i \;, \\ \hat{\beta}_0\sum_{i=1}^nx_i+\hat{\beta}_1\sum_{i=1}^nx_i^2 &= \sum_{i=1}^nx_iy_i \;. \end{align} \]

The normal equations of SLR

  • The normal equations can be written as \[ \begin{align} \begin{bmatrix} n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^nx_i\\ \hat{\beta}_0\sum_{i=1}^nx_i+\hat{\beta}_1\sum_{i=1}^nx_i^2 \end{bmatrix} &= \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_iy_i \end{bmatrix} \\ \ \\ \begin{bmatrix} n & \sum_{i=1}^nx_i \\ \sum_{i=1}^nx_i & \sum_{i=1}^nx_i^2 \end{bmatrix} \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} &= \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}\\ \ \\ \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} &= \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}\\ \mathbf{X}^\top\mathbf{X}\hat\beta &= \mathbf{X}^\top\mathbf{y} \;. \end{align} \]

The normal equations of SLR

  • If \(\mathbf{X}^\top\mathbf{X}\) is invertible, then by the normal equations \[ \begin{align} \mathbf{X}^\top\mathbf{X}\hat\beta &= \mathbf{X}^\top\mathbf{y} \Leftrightarrow\\ (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\hat\beta &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \Leftrightarrow\\ \hat\beta &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \end{align} \]

  • In other words, our least squares estimate is given by: \[\hat\beta =(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

  • Notice that this requires taking the inverse of \(\mathbf{X}^\top\mathbf{X}\), and so \(\mathbf{X}^\top\mathbf{X}\) must be invertible.
    • iClicker 7.2

Conditions for Invertibility in SLR

  • The matrix \((\mathbf{X}^\top\mathbf{X})^{-1}\) does not exist when \[ \begin{align} \mathrm{det}(\mathbf{X}^\top\mathbf{X}) = n\sum_{i=1}^nx_i^2 - \left(\sum_{i=1}^nx_i\right)^2 &= 0 \\ \Rightarrow n\sum_{i=1}^nx_i^2 - n^2\bar{x}^2 &= 0 \\ \Rightarrow \sum_{i=1}^n\left(x_i-\bar{x}\right)^2 &= 0 \;. \end{align} \]
  • In other words, \(\mathbf{X}^\top\mathbf{X}\) is not invertible when \(x_1=x_2=\dotsc=x_n\).

  • Intuitively, this makes sense: if all the covariate values in our data are the same, then it’s not possible to observe a trend between the covariate and the response.

  • In general, the matrix \(\mathbf{X}^\top\mathbf{X}\) is invertible if and only if the design matrix \(\mathbf{X}_{n\times (p+1)}\) has \(\mathrm{rank}(\mathbf{X})=p+1\), i.e.,

    • the columns of \(\mathbf{X}\) are linearly independent
    • the number of rows \(n\) is at least as large as the number of columns \(p+1\) (i.e., \(n\geq p+1\))
  • iClicker 7.3
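This failure mode can be seen directly in R: with a constant covariate, the two columns of \(\mathbf{X}\) are linearly dependent, so \(\mathbf{X}^\top\mathbf{X}\) is singular (toy data made up for illustration):

```r
# A constant covariate makes the columns of X linearly dependent
x <- rep(3, 10)
X <- cbind(1, x)

XtX <- t(X) %*% X
qr(X)$rank   # rank 1 < 2, so X'X is not invertible
det(XtX)     # 0

# solve(XtX) would fail here with a "system is exactly singular" error
```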

Multiple Linear Regression (MLR)

The MLR model

  • Suppose we want to explain \(Y\) using \(p\) covariates \(X_1, X_2, \ldots, X_p\).

  • The model generalizes very naturally to: \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \cdots + \beta_pX_{ip} + \varepsilon_i \;, \qquad i=1,\dotsc,n \;, \]

  • In matrix form: \[ \mathbf{Y} = \mathbf{X}\pmb{\beta} + \pmb{\varepsilon} \;, \] where:

\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1} \;, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}_{n \times (p+1)} \;, \\ \pmb{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}_{(p+1) \times 1} \;, \qquad \pmb{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}_{n \times 1} \;. \]

  • A linear model is any model that can be written in the form \[ \mathbf{Y} = \mathbf{X}\beta + \varepsilon \;. \]
    • iClicker 7.4

Hyperplane

  • Note that if we have more than one covariate (i.e., \(p>1\)), then we are working in a \(p+1\)-dimensional space.

  • We don’t have a line anymore, but a hyperplane.

The residuals in matrix form

  • The residuals of the model with intercept \(b_0\) and slope \(b_1\) are given by \[ \begin{align} \mathbf{e} &= \begin{bmatrix} Y_1 - b_0 - b_1x_1 \\ Y_2 - b_0 - b_1x_2 \\ \vdots \\ Y_n - b_0 - b_1x_n \end{bmatrix} \ \\ &= \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} - \begin{bmatrix} b_0 + b_1x_1 \\ b_0 + b_1x_2 \\ \vdots \\ b_0 + b_1x_n \end{bmatrix}\\ \ \\ &= \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} - \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \ \\ &= \mathbf{Y} - \mathbf{X}\mathbf{b} \;. \end{align} \]

  • iClicker 7.5

RSS in matrix form

  • To fit the model \[ \mathbf{Y} = \mathbf{X}\pmb{\beta} + \pmb{\varepsilon} \;, \] we used the least squares criterion, which minimizes the residual sum of squares \[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \begin{bmatrix} e_1 & e_2 & \ldots & e_n \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} = \mathbf{e^\top e} \] where \(\mathbf{e} = \mathbf{Y} - \mathbf{X}\mathbf{b}\) is the vector of residuals for the slope \(b_1\) and intercept \(b_0\).
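The residual vector and the RSS can be computed in matrix form for any candidate coefficients (toy data and the candidate values of \(b_0, b_1\) are made up for illustration):

```r
# Toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
X <- cbind(1, x)
b <- c(0.5, 1.8)       # candidate intercept and slope

e   <- y - X %*% b     # vector of residuals, e = Y - Xb
RSS <- t(e) %*% e      # 1x1 matrix: the residual sum of squares
drop(RSS)              # same value as sum(e^2)
```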

Estimator of the variance of the errors

  • For SLR, since we had only two coefficients, we used the estimator of the variance of the errors \[ \hat{\sigma}^2 = \frac{RSS}{n-2} \]

  • For MLR, since we have \(p+1\) coefficients, we use \[ \hat{\sigma}^2 = \frac{RSS}{n-p-1} \]
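The residual mean square can be computed directly from a fitted model and checked against what `lm()` reports; `summary(fit)$sigma` is \(\hat\sigma\), the square root of this quantity (the data below are simulated for illustration):

```r
# Simulated data for illustration
set.seed(1)
n <- 50; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
RSS <- sum(resid(fit)^2)
sigma2_hat <- RSS / (n - p - 1)   # residual mean square

# Matches the (squared) residual standard error from summary()
all.equal(sigma2_hat, summary(fit)$sigma^2)
```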

Fitting the MLR model

  • We want to find the coefficients \(\hat\beta\) that minimize the RSS

\[ \begin{align} RSS(\mathbf{b}) &= \mathbf{e^\top e}\\ &= (\mathbf{Y} - \mathbf{X}\mathbf{b})^\top(\mathbf{Y} - \mathbf{X}\mathbf{b}) \\ &= \mathbf{b^\top X^\top X b} - \mathbf{b^\top X^\top Y} - \mathbf{Y^\top X b} + \mathbf{Y^\top Y} \\ &= \mathbf{b^\top X^\top X b} - 2\mathbf{b^\top X^\top Y} + \mathbf{Y^\top Y} \end{align} \]

  • Then, we can take the derivative of \(RSS(\mathbf{b})\) with respect to \(\mathbf{b}\) and set it equal to zero to find the minimum.

\[ \frac{\partial RSS(\mathbf{b})}{\partial \mathbf{b}} = 2\mathbf{X^\top X b} - 2\mathbf{X^\top Y} \]

  • Setting it to zero, we get \[ 2\mathbf{X^\top X \widehat{\pmb{\beta}}} - 2\mathbf{X^\top Y} = 0 \Rightarrow \mathbf{X^\top X \widehat{\pmb{\beta}}} = \mathbf{X^\top Y} \] which are exactly the normal equations we obtained for simple linear regression.
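The normal equations can be solved numerically and checked against `lm()`; note that `solve(A, b)` solves the linear system directly, which is numerically preferable to forming \((\mathbf{X}^\top\mathbf{X})^{-1}\) explicitly (data simulated for illustration):

```r
# Simulated data for illustration
set.seed(306)
n <- 40
x <- runif(n)
y <- 3 + 1.5 * x + rnorm(n, sd = 0.5)
X <- cbind(1, x)

# Solve X'X beta-hat = X'y for beta-hat
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() solves the same least squares problem
drop(beta_hat)
coef(lm(y ~ x))
```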

Activity

Goal: Manually fit a linear model in the case with more than two covariates; explore measures of model fit and their properties.

  • You may work on this activity alone or with others (recommended).

  • If you or your group needs help, please raise your hand.

  • When you are done with one question, move on to the next. iClicker questions will be asked at certain points to monitor progress.

Preamble

Suppose we want to explain penguins’ body mass not only by their flipper length but also by their bill depth. Our linear model is then

\[ Y_i = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} +\varepsilon_i \;, \qquad \varepsilon_i\sim N(0,\sigma^2) \;. \]

Question 1

Construct the design matrix \(\mathbf{X}\) by completing the partial R code below. Which of the following R functions will you need to use to do so?

  1. c()
  2. rbind()
  3. cbind()
  4. data.frame()
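A sketch of one way the design matrix could be assembled; the vectors `flipper_length` and `bill_depth` are hypothetical stand-ins (the partial code from the activity is not reproduced here, and in the activity these would come from the penguins data):

```r
# Hypothetical covariate vectors for illustration
flipper_length <- c(181, 186, 195, 193, 190)
bill_depth     <- c(18.7, 17.4, 18.0, 17.8, 19.3)

# cbind() binds the intercept column of 1's with the two covariates
X <- cbind(1, flipper_length, bill_depth)
dim(X)  # n rows and p + 1 = 3 columns
```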



Question 2

Some commands you’ll need to compute the least squares estimate \(\hat\beta\):

  • The function t(A) computes the matrix transpose \(\mathbf{A}^\top\).

  • The operator A %*% B computes the matrix product \(\mathbf{AB}\).

  • The function solve(A,b) computes the vector \(\mathbf{c}\) such that \(\mathbf{Ac=b}\).

Given the above, compute the least squares estimate

\[ \hat\beta = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \;. \]

  • iClicker 7.6



Question 3

  • Recall that the least squares solution \(\hat\beta\) minimizes the residual sum of squares

\[ \begin{align} RSS_{p=2} &:= \mathbf{e^\top e} \\ &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\ &= \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2})^2 \;. \end{align} \]

  • Suppose we add another covariate \(x_3\). Consider the new residual sum of squares

\[ RSS_{p=3} := \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2} - \hat\beta_3x_{i3})^2 \; . \]

  • Is there a relationship between these two objectives?
    • iClicker 7.7

Solution

  • Notice that by setting \(\hat\beta_3=0\), we get

\[ \begin{align} RSS_{p=3} &= \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2} - \hat\beta_3x_{i3})^2 \\ &= \sum_{i=1}^n(y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2})^2 \\ &= RSS_{p=2} \;. \end{align} \]

  • In other words, adding covariates to our linear model never increases the minimum value achievable for the residual sum of squares (which is a measure of the “error” made by the model on our data).

  • So, adding new covariates to our model never increases the “total error” made by our model (and may potentially decrease the error instead).

  • Does this mean we should always add as many variables as we can to our model?

    • NO!
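This can be checked empirically: adding even a pure-noise covariate never increases the fitted RSS (data simulated for illustration):

```r
# Simulated data for illustration
set.seed(42)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)
junk <- rnorm(n)               # noise covariate, unrelated to y

rss <- function(fit) sum(resid(fit)^2)
rss_small <- rss(lm(y ~ x1 + x2))
rss_big   <- rss(lm(y ~ x1 + x2 + junk))

rss_big <= rss_small           # TRUE: RSS never increases
```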

Question 4

  • Let \(k = p+1\) denote the number of \(\beta\) parameters in the model.

  • Our estimator of \(\sigma^2\), the residual mean square \(\hat{\sigma}^2\), is defined as \[ \hat{\sigma}^2 = \frac{\mathbf{e^\top e}}{n-k} = \frac{RSS}{n-k} \;. \]

    • This definition generalizes the definition from the \(p=1\) case that we’ve seen.
  • Suppose that \(0<R^2<1\) and we add an additional covariate to the model.

    1. How could the multiple correlation coefficient \(R^2\) change?
    2. How could the residual mean square \(\hat{\sigma}^2\) change?
  • iClicker 7.8

Solution

  • About \(R^2\): \[ R^2 = 1 - \frac{RSS}{\mathbf{y^\top y}-n\bar{y}^2} \]

  • The denominator \(\mathbf{y^\top y}-n\bar{y}^2\) in the definition of \(R^2\) is fixed given data, and as we saw in Question 3, adding covariates never increases \(RSS\). Hence, \(R^2\) will never decrease with additional covariates.

    • This is a big deal because it means that \(R^2\) is not a good measure to compare models with different numbers of covariates.
  • About \(\hat{\sigma}^2\): \[ \hat{\sigma}^2 = \frac{RSS}{n-k} \]
  • With \(\hat\sigma^2\), the numerator \(RSS\) never increases with additional covariates. However, the denominator \(n-k\) decreases with additional covariates, and so it is possible that \(\hat\sigma^2\) may increase.
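The same comparison in code: after adding a noise covariate, \(R^2\) cannot go down, while \(\hat\sigma^2\) may move in either direction because the denominator \(n-k\) shrinks (data simulated for illustration):

```r
# Simulated data for illustration
set.seed(7)
n  <- 30
x1 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
junk <- rnorm(n)   # noise covariate

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + junk)

# R^2 never decreases when a covariate is added
summary(fit2)$r.squared >= summary(fit1)$r.squared  # TRUE

# sigma-hat^2 may increase or decrease; compare the two fits
c(summary(fit1)$sigma^2, summary(fit2)$sigma^2)
```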

Adjusted \(R^2\)

  • To account for the number of predictors in the model, we can use the adjusted \(R^2\), which is \(R^2\) “corrected” for degrees of freedom and defined as \[ \mathrm{adj}\: R^2 = 1 - \frac{\left(\frac{\mathrm{SS}(\text{Res})}{n-k}\right)}{\left(\frac{\mathbf{y^\top y}-n\bar{y}^2}{n-1}\right)} = 1 - \frac{(n-1)\hat\sigma^2}{\mathbf{y^\top y}-n\bar{y}^2} \;. \]

  • Unlike \(R^2\), adjusted \(R^2\) can increase or decrease with additional covariates. Also unlike \(R^2\), it is not bounded below by zero: adjusted \(R^2\) can be negative.

  • Both \(R^2\) and adjusted-\(R^2\) are provided as output of summary().
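Adjusted \(R^2\) can be computed from the definition above and checked against the value `summary()` reports (data simulated for illustration):

```r
# Simulated data for illustration
set.seed(306)
n <- 60; k <- 3                   # k = p + 1, with p = 2 covariates
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - 2 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
RSS <- sum(resid(fit)^2)
TSS <- sum(y^2) - n * mean(y)^2   # y'y - n * ybar^2

adj_r2 <- 1 - (RSS / (n - k)) / (TSS / (n - 1))
all.equal(adj_r2, summary(fit)$adj.r.squared)
```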