Simple Linear Regression

STAT 306 - Lecture 02

Specification

  • We have two variables:
    • Response variable (\(Y\)): what we are trying to predict/explain;
    • Explanatory Variable (\(x\)): the variable used to predict the response;
  • There is a probability distribution of \(Y\) for each value of \(x\);

  • The means of these distributions are linearly related to \(x\);

  • The explanatory variable \(x\) is assumed to be fixed for each individual/sample;

    • For this reason, I’ll use lower case for \(x\) and upper case for \(Y\), although the usual notation uses upper case for both.

The model

  • The model relating \(X\) and \(Y\) is given by the equation of a line: \[ Y_i = \beta_0 + \beta_1 x_i+ \varepsilon_i \]
  • \(x_i\) is the value of the explanatory variable for the \(i\)-th sample (constant!);
  • \(\beta_0\) and \(\beta_1\) are parameters (constants!);
  • \(\varepsilon_i\) is a random error term (random variable!).
  • \(Y_i\) the response variable for the \(i\)-th sample (random variable)
    • function of a random variable, so also a random variable;

The model: components

  • The model relating \(X\) and \(Y\) is given by the equation of a line: \[ Y_i = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} x_i + \underbrace{\varepsilon_i}_{\text{error term}} \]
  • Intercept: tells us the \(Y\) value when \(X=0\) (i.e., the value of \(Y\) when the line crosses the \(Y\)-axis).
  • Slope: tells us how much change in \(Y\) to expect for a unit increase in \(X\).
  • Error: captures the variability of the response not explained by the model.

The random errors

  • The error component, \(\varepsilon_i\), captures everything that our model does not.

  • We treat \(\varepsilon_i\) as a random variable.

    • It has a distribution:
      • We will assume it to be Normal;
    • It has a mean:
      • safely assumed to be 0 (i.e., \(E[\varepsilon_i] = 0\));
    • It has a variance:
      • unknown and denoted by \(\sigma^2\) (i.e., \(Var(\varepsilon_i) = \sigma^2\));
    • \(Cov(\varepsilon_i, \varepsilon_j) = 0\) for \(i \neq j\) (i.e., the errors are uncorrelated).
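These assumptions can be illustrated with a short simulation. The sketch below uses Python (the course itself uses R), and the parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(306)

# Hypothetical parameter values, for illustration only
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 100)                 # fixed (non-random) covariate values
eps = rng.normal(0.0, sigma, size=x.size)   # iid Normal errors: mean 0, variance sigma^2
Y = beta0 + beta1 * x + eps                 # Y is random because eps is

print(round(eps.mean(), 2))                 # sample mean of the errors, near 0
```

Each run draws new errors, so the simulated responses scatter around the line \(2 + 0.5x\).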

The random errors and the response


  • Imagine a linear model relating Height (m) and Weight (kg):

\[ \text{Weight} = -166+140\times\text{Height}+\varepsilon \]

  • For a \(1.63m\) tall person, we have:

\[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]

  • We would expect this person to weigh around \(62.2kg\);
    • But the weight is affected by other factors as well;
    • so we cannot say precisely the value of the weight;
  • We have a probability distribution of possible weights for a \(1.63m\) tall person:
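The arithmetic above is easy to verify (Python is used here just as a calculator; the coefficients are the slide’s hypothetical values):

```python
# Hypothetical model from the slide: Weight = -166 + 140 * Height
beta0, beta1 = -166, 140
height = 1.63

mean_weight = beta0 + beta1 * height
print(round(mean_weight, 1))   # 62.2
```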



Modelling the average

  • For a \(1.63m\) tall person, we have:

\[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]

  • We would expect this person to weigh around \(62.2kg\);

    • Some people will weigh more, and some will weigh less.
  • On average, \(1.63m\) tall people weigh \(62.2kg\);

  • The line \(-166+140\times\text{Height}\) gives the mean \(\text{Weight}\) of people of a given \(\text{Height}\);

The model as conditional expectation

  • In general: \[ E[Y|X = x] = \beta_0 + \beta_1 x \]
    • This just means that the regression line is the conditional average of \(Y\) for a given value of \(X=x\).
    • Why can we say that?
  • Note the difference: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
    • This equation is for a given point, which is off the line (note the presence of the error term).

The model as conditional expectation

\[ \overbrace{\color{red}{\underbrace{\beta_0+\beta_1 x_i}_{\substack{\text{Regression Line:}\\ E[Y|X=x_i]}}} + \varepsilon_i}^{\text{Point: } Y_i} \]


Fitting is estimation

  • \(E[Y|X=x] = \beta_0 + \beta_1 x\) is the population’s conditional mean for a given value of \(X=x\).

  • But instead of estimating the mean for each value of \(X\) separately, which is not feasible, we assume a linear structure between the conditional mean of \(Y\) and the value of \(X\).

  • Therefore, estimating the means for the values of \(X\) (in a certain range) reduces to estimating \(\beta_0\) and \(\beta_1\).

  • We denote the estimators by \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Fitting the model: Notation

  • We have \(n\) observations/samples.
    • \(X_i\) and \(Y_i\) denote the values of \(X\) and \(Y\) for observation \(i\).
    • You can think of it as \(n\) pairs: \((X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)\).
  • For example, for the penguins, \(i\) refers to each penguin and can vary from 1 to 344.
  i   flipper_length_mm (X)   body_mass_g (Y)
  1   181                     3750
  2   186                     3800
  3   195                     3250
  4   NA                      NA
  5   193                     3450
  …   …                       …
344   198                     3775

(344 rows in total; NA indicates a missing measurement.)

Fitting the model


  • Let \((X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)\) be a sample of size \(n\) of the variables \(X\) and \(Y\);

  • These are just points in a scatter plot; e.g.,

  • Naturally, we could use many different lines to fit these points;
    • which line to use?

Fitting the model: the residuals

  • We want a line that is “close” to the points;

  • The difference between the \(i\)-th observed point and the line is called residual and denoted by \(e_i\), \[ e_i = Y_i - (b_0 + b_1 X_i) \]

    • \(b_0\) is the intercept of the line;
    • \(b_1\) is the slope of the line.

Fitting the model: Residuals Sum of Squares

  • A line close to the points means small residuals;

  • But how do we measure “small”?

  • One common way is to use the Residual Sum of Squares (RSS): \[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)^2 \]

  • So we want to find the values of \(b_0\) and \(b_1\) that minimize the RSS.
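To make the criterion concrete, here is a small Python sketch (toy data, made up for illustration) computing the RSS of two candidate lines:

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def rss(b0, b1):
    """Residual sum of squares of the line b0 + b1*x."""
    e = Y - (b0 + b1 * x)
    return float(np.sum(e ** 2))

# A line close to the points has a much smaller RSS than a poor one
print(rss(0.1, 2.0), rss(0.0, 1.0))
```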

Fitting the model: minimizing RSS


  • What values to use for \(b_0\) and \(b_1\)?

Fitting the model: minimizing RSS


  • We want the values of \(b_0\) and \(b_1\) that minimize the RSS;

Fitting the model: normal equations


  • We can use calculus to find the values of \(b_0\) and \(b_1\) that minimize the RSS:

\[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)^2 \]

  1. Take the partial derivatives of \(RSS(b_0, b_1)\) with respect to \(b_0\) and \(b_1\); \[ \frac{\partial RSS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right) \] \[ \frac{\partial RSS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)x_i \]

  2. Set the partial derivatives to zero: \[ \frac{\partial RSS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0 \] \[ \frac{\partial RSS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)x_i = 0 \]

  • The solutions of these equations are the estimators of \(\beta_0\) and \(\beta_1\) - that’s why we used \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
  3. We have a system of two linear equations. Let’s organize it: \[ \hat{\beta}_0 + \bar{x}\hat{\beta}_1 = \bar{Y} \] \[ \hat{\beta}_0\sum_{i=1}^n x_i + \hat{\beta}_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n Y_ix_i \]
  • These are the so-called normal equations.
  4. Solve the system of equations to get the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\): \[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x} \] \[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r_{XY}\frac{S_Y}{S_X} \]
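The closed-form solution can be checked numerically. Below is a Python sketch with toy data, cross-checked against numpy’s degree-1 polynomial fit:

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, Ybar = x.mean(), Y.mean()

# Least-squares estimates from the normal equations
beta1_hat = np.sum((x - xbar) * (Y - Ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = Ybar - beta1_hat * xbar

# Cross-check with numpy's built-in least-squares line fit
b1_np, b0_np = np.polyfit(x, Y, 1)

print(round(beta0_hat, 3), round(beta1_hat, 3))
```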

Example: Fitting a model

  • Let’s fit a linear model relating Flipper Length and Body Mass of penguins.
  • To fit a linear model in R, we use the lm function.

Interpretation \(\beta_0\) and \(\beta_1\)

  • Slope: an increase of 1 unit of \(X\) is associated with an expected increase of \(\beta_1\) units in \(Y\).
    • It is associated with, not the cause of!
  • Intercept: The average value of \(Y\) when \(X = 0\) is \(\beta_0\).
    • Usually, we don’t care as much about this parameter.

Important: Association is not causality.

In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.

Interpretation \(\beta_0\) and \(\beta_1\)

  • This means that an increase of 1mm in flipper length is associated with an expected increase of 50.15g in body mass.

  • Not ok to say: “An increase of 1mm in flipper length increases body mass by 50.15g.”

Predicting with the model

  • To predict the value of \(Y\) for a given \(x\), we use the estimated regression line: \[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

  • \(\hat{Y}\) is called the predicted value.
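As a sketch (the coefficient values below are hypothetical, not from any model fitted in these slides):

```python
# Hypothetical estimated coefficients
beta0_hat, beta1_hat = 0.14, 1.96

def predict(x):
    """Predicted value Y-hat = beta0_hat + beta1_hat * x."""
    return beta0_hat + beta1_hat * x

print(round(predict(3.0), 2))   # prediction at x = 3
```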

The range problem

  • The linear model assumes that the relationship between \(X\) and \(E[Y|X]\) is linear, which may or may not be true;

  • Sometimes, there’s a linear association only in part of the data range.

    • The linear model could still be useful when restricted to that specific range;
  • We need to exercise caution when using the model outside the range of the data, as the relationship between \(X\) and \(Y\) may differ significantly.

The Range problem

  • According to our model, what is the % of fat after week 36?

The Range problem

  • According to our model, what is the % of fat before week 11?

The Range problem

  • Let’s look at the actual data before week 10:

The Range problem: Take-away

  • There’s no way for us to know whether the relationship is still linear outside the range of the data;

  • You should be careful when predicting outside the range of the data;

Regression vs Correlation analysis

  • Correlation analysis: we’re interested in the strength of linear association between two variables;
    • no distinction between the two variables (no response and no covariate);
    • both variables are assumed to be stochastic;
  • Linear Regression: we’re interested in estimating the conditional average of the response given the value of the covariate.
    • covariate is assumed to be non-stochastic;
    • one of the variables is treated as a response and the other as a covariate;

Categorical Covariate?

  • Note that we can also have a categorical covariate.
    • For now let’s assume that the categorical variable has only two categories.
  • We can try to explain the body mass of Penguins based on the sex of penguins: male or female. \[ \text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i \]
  • But wait! How can we have categories in an equation?

Dummy variables (2 categories)

  • We can encode the categories into multiple variables.

  • For example, the variable sex could be defined as:

\[\begin{equation} \text{sex}_i = \begin{cases} 0 & \text{if penguin $i$ is female}\\ 1 & \text{if penguin $i$ is male} \end{cases} \end{equation}\]

  • A variable patient_status could be defined:

\[\begin{equation} \text{patient_status}_i = \begin{cases} 0 & \text{if patient $i$ is healthy}\\ 1 & \text{if patient $i$ is sick} \end{cases} \end{equation}\]

  • If we have a variable with two categories, we need only 1 dummy variable to represent it.
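The encoding itself is simple to implement. A Python sketch (mirroring the slide’s coding: female = 0, male = 1; the records are made up):

```python
# Example records (made up)
sexes = ["female", "male", "male", "female"]

# One 0/1 dummy variable suffices for a two-category variable
dummy = [1 if s == "male" else 0 for s in sexes]
print(dummy)   # [0, 1, 1, 0]
```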

Back to the model

  • We can try to explain the body mass of Penguins based on the sex of penguins: male or female. \[ \text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i \]

Where is the line?

  • Note that there is no line in this case.

  • Sex cannot be 0.1 or 0.5. It can only be 0 or 1.

  • So, what is going on?

\[\begin{equation} \text{body_mass_g}_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i& \text{if penguin $i$ is male}\\ \beta_0 + \varepsilon_i & \text{if penguin $i$ is female} \end{cases} \end{equation}\]

We are just comparing means

  • Remember that in regression, we model the mean given the value of a covariate.

  • So, in this case, we are modelling the mean of female penguins and the mean of male penguins;

  • Note that:

    • \(\beta_0\) is the average body_mass_g of female penguins.
    • \(\beta_1\) is the difference in means.
    • \(\beta_0+\beta_1\) is the average body_mass_g of male penguins.
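This can be verified numerically: a least-squares fit on a 0/1 covariate reproduces the two group means. A Python sketch with made-up body masses:

```python
import numpy as np

# Made-up body masses (g): sex = 0 for female, 1 for male
sex = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
mass = np.array([3400.0, 3600.0, 3500.0, 4800.0, 5000.0, 5200.0])

# Least-squares fit of mass on the dummy variable
beta1_hat, beta0_hat = np.polyfit(sex, mass, 1)

female_mean = mass[sex == 0].mean()
male_mean = mass[sex == 1].mean()

# beta0_hat equals the female mean; beta1_hat equals the difference in means
print(round(beta0_hat), round(beta1_hat))
```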

R for us

  • R is pretty good at dealing with categorical variables.

  • All we need to do is to use factors, e.g.,

library(dplyr)    # provides %>% and mutate()
library(forcats)  # provides as_factor()

penguins_clean <- penguins_clean %>%
  mutate(sex = as_factor(sex))
  • The lm function will create the dummy variables and tell us which level of the factor is associated with each coefficient.

Population vs Sample Regression

  • Population Regression: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

  • Sample Regression (estimated from the sample): \[ Y_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i + e_i \]

  • Note that, in general, \(e_i\neq\varepsilon_i\), because \(\widehat{\beta}_0\neq \beta_0\) and \(\widehat{\beta}_1\neq \beta_1\).

The parameters \(\beta_0\) and \(\beta_1\)

  • Since \(\beta_0\) and \(\beta_1\) are parameters, we estimate them based on a sample;

  • \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are the estimators of \(\beta_0\) and \(\beta_1\).

  • As statistics, \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) depend on the sample; therefore, we will need their sampling distributions.
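To see why a sampling distribution is needed, one can simulate many samples from the same population model and re-estimate the slope each time. A Python sketch (made-up true parameters):

```python
import numpy as np

rng = np.random.default_rng(306)

# Made-up true parameters and fixed covariate values
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0, 10, 50)

# Re-estimate the slope over many simulated samples
slopes = []
for _ in range(2000):
    Y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1, b0 = np.polyfit(x, Y, 1)
    slopes.append(b1)

slopes = np.array(slopes)
print(round(slopes.mean(), 2))   # centred near the true slope 0.5
```

The estimates vary from sample to sample, and their spread is exactly what the sampling distribution describes.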