Linear Regression

STAT 200 - Chapter 7 & 8

Introduction

  • We learned that the correlation measures the strength and direction of the linear relationship between two variables.

  • However, we often need more than this, such as quantifying how much the variables vary together and even predicting one variable based on another.

Questions Beyond Correlation

  • The plot shows a positive correlation between flipper length and body mass.
  • How much do you expect the body weight to increase for a 1mm increase in the flipper’s length?
  • For a penguin with a 200mm flipper, what do you expect its body mass to be?

Correlation is not enough

  • The correlation coefficient tells us the strength of the linear relationship between two variables;

  • But it does not directly give us the answers to these questions;

  • Linear models provide a mathematical formula that describes the relationship between flipper length and body mass based on the data.

Simple Linear Regression

Specification


  • We have two variables:
    • Response variable (\(Y\)): what we are trying to predict/explain;
    • Explanatory Variable (\(X\)): the variable used to predict the response;
  • We have \(n\) observations/samples.
    • \(Y_i\) and \(X_i\) denote the values of \(X\) and \(Y\) for observation \(i\).
    • You can think of it as \(n\) pairs: \((X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)\).
  • For example, for the penguins, \(i\) refers to each penguin and can vary from 1 to 344.
  i    flipper_length_mm (X)    body_mass_g (Y)
  1    181                      3750
  2    186                      3800
  3    195                      3250
  4    NA                       NA
  5    193                      3450
  6    190                      3650
  ⋮    ⋮                        ⋮
  344  198                      3775

(344 observations in total; missing values are recorded as NA.)

The model

  • The model relating \(X\) and \(Y\) is given by the equation of a line: \[ Y_i = b_0 + b_1 X_i+ \epsilon_i \]
  • Let’s discuss each of these components in more detail;

The model: components


  • The model relating \(X\) and \(Y\) is given by the equation of a line: \[ Y_i = \underbrace{b_0}_{\text{intercept}} + \underbrace{b_1}_{\text{slope}} X_i + \underbrace{\epsilon_i}_{\text{error term}} \]
  • Intercept: tells us the \(Y\) value when \(X=0\) (i.e., the value of \(Y\) when the line crosses the \(Y\)-axis).
  • Slope: tells us how much change in \(Y\) to expect for a unit increase in \(X\).
  • Error: captures the variability of the response not explained by the model.

The model: predicted values

  • Once the line is fitted, the model provides a predicted value for each observation: \[ \widehat{Y}_i = b_0 + b_1 X_i \]

  • The hat in \(\widehat{Y}\) indicates the value is predicted from the model.
    • Generally, the predicted value will be different from the actual value.


Fitting

  • We want to find the line that best fits the data;

  • The best line is the one that minimizes the sum of the squared residuals;

  • That is, we square each residual, add them up, and look for the line that minimizes this sum;

  • Let’s try to fit a line ourselves!

Fitting


  • What values to use for \(b_0\) and \(b_1\)?

Fitting


  • We want the \(b_0\) and \(b_1\) that minimize the Sum of Squared Errors;

Simple Linear Regression: Fitting

  • As it turns out, the best line is given by:
    • Slope: \[b_1 = r\frac{s_Y}{s_X}\] where \(r\) is the correlation coefficient;
    • Intercept: \[b_0 = \bar{y} - b_1\bar{x}\]
  • Note that the line will always pass through \((\bar{x}, \bar{y})\);
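As a quick sanity check (a sketch, not part of the lecture code), these formulas can be applied directly. The snippet below uses a handful of (flipper length, body mass) pairs taken from the table above and compares the result with NumPy’s least-squares fit:

```python
# Computing the least-squares line with b1 = r * sY/sX and
# b0 = ybar - b1 * xbar, on a few penguins from the table.
import numpy as np

x = np.array([181.0, 198.0, 230.0, 208.0, 174.0, 221.0])  # flipper length (mm)
y = np.array([3750.0, 4400.0, 5700.0, 4300.0, 3400.0, 6300.0])  # body mass (g)

r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope
b0 = y.mean() - b1 * x.mean()            # intercept

# Cross-check against NumPy's built-in least-squares fit:
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(b1, b0)
```

The two computations agree exactly: the formula \(b_1 = r\,s_Y/s_X\) is just another way of writing the least-squares slope.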

Interpretation of \(b_0\) and \(b_1\)

  • Slope: an increase of 1 unit in \(X\) is associated with an expected increase of \(b_1\) units in \(Y\).
    • Associated with, not caused by!
  • Intercept: the average value of \(Y\) when \(X = 0\) is \(b_0\).

Important: Association is not causality.

In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.

Example


  • Let’s fit a linear regression to explain the penguins’ flipper length based on the penguins’ body weight.

Summary Quantities:

  • \(\bar{X}: 4207\)
  • \(\bar{Y}: 201\)
  • \(S_X: 805\)
  • \(S_Y: 14\)
  • \(r = 0.87\)

Regression:

  • \(b_1 = r\frac{S_Y}{S_X} = 0.87 \times \frac{14}{805} \approx 0.0151\)
  • \(b_0 = \bar{Y} - b_1\bar{X} = 201 - 0.0151 \times 4207 \approx 137.3\)
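Plugging the summary quantities above into the formulas (a minimal sketch; the inputs are the rounded values given on the slide):

```python
# Slope and intercept from the slide's summary quantities
# (X = body mass in g, Y = flipper length in mm).
r, s_x, s_y = 0.87, 805, 14
x_bar, y_bar = 4207, 201

b1 = r * s_y / s_x        # slope: ~0.0151 mm per gram
b0 = y_bar - b1 * x_bar   # intercept: ~137.3 mm
print(round(b1, 4), round(b0, 1))
```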

The residuals

  • The residual is defined as:

\[ r_i = \underbrace{Y_i}_{\text{(from data)}} - \underbrace{\widehat{Y}_i}_{(\text{from model})} \]

  • We fitted the model by minimizing the sum of squared residuals;
    • When fitted this way, the sum of residuals is equal to zero;
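A small illustration of this property, reusing a few pairs from the penguin table: after a least-squares fit, the residuals sum to (numerically) zero.

```python
# Fit a least-squares line and verify that the residuals sum to zero.
import numpy as np

x = np.array([181.0, 198.0, 230.0, 208.0, 174.0, 221.0])  # flipper length (mm)
y = np.array([3750.0, 4400.0, 5700.0, 4300.0, 3400.0, 6300.0])  # body mass (g)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x     # predicted values from the model
resid = y - y_hat       # residuals: data minus prediction
print(resid.sum())      # essentially 0, up to floating-point error
```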

Goodness of fit

  • Remember that the residuals contain everything the model couldn’t capture.

  • So, the residuals are helpful to check the goodness of fit of our model.

  • A commonly used diagnostic is the scatterplot of the residuals vs. the explanatory variable.

    • If the model is appropriate, this plot should show no pattern.

Residual Plot

Residual Plot: Residuals vs Covariate

  • What are your thoughts here?

Outliers and Influential Points

  • Outliers might greatly affect the fitted linear model.

  • Data points whose omission results in a very different fitted regression model are influential points.

    • Outliers are not necessarily influential points (they might be!).
  • In these cases, one should fit separate regression lines to the data with and without the outliers, and compare the results.

    • The outliers should not be omitted without justification.
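A quick sketch with made-up numbers (not the penguin data): fitting the same data with and without a single extreme point shows how much one influential observation can move the slope.

```python
# Compare the fitted slope with and without one influential point.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

x_out = np.append(x, 20.0)                 # far outside the x-range...
y_out = np.append(y, 5.0)                  # ...and far off the trend

slope_without, _ = np.polyfit(x, y, deg=1)
slope_with, _ = np.polyfit(x_out, y_out, deg=1)
print(slope_without, slope_with)           # the outlier drags the slope toward 0
```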

Outliers and Influential Points


The Range problem

  • According to our model, what is the % of fat after week 36?

The Range problem

  • According to our model, what is the % of fat before week 11?

The Range problem

  • Let’s look at the actual data before week 10:

The Range problem: Take-away

  • There’s no way for us to know whether the relationship is still linear outside the range of the data;

  • You should not predict outside the range of the data;

Simpson’s Paradox


  • Let’s fit a model to explain penguins’ body mass based on the bill depth.

  • This model suggests that an increase in the bill depth is associated with a decrease in body mass!

Simpson’s Paradox

Figure 1: Bill depth shown on a lighter and a heavier penguin.

Simpson’s Paradox

  • But if we fit one linear regression per species, we get…

  • By controlling for species, the correlation between bill depth and body mass becomes positive;
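A synthetic illustration of the paradox (made-up numbers, not the penguin data): each group has a positive within-group correlation, yet pooling the groups flips the sign.

```python
# Simpson's paradox: positive correlation within each group,
# negative correlation when the groups are pooled.
import numpy as np

# Group A: small x, high y; Group B: large x, low y.
x_a = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
y_a = 0.5 * x_a + 20.0
x_b = np.array([18.0, 19.0, 20.0, 21.0, 22.0])
y_b = 0.5 * x_b + 5.0

x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

r_a = np.corrcoef(x_a, y_a)[0, 1]          # positive within group A
r_b = np.corrcoef(x_b, y_b)[0, 1]          # positive within group B
r_pooled = np.corrcoef(x_all, y_all)[0, 1] # negative when pooled
print(r_a, r_b, r_pooled)
```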

References


Horst, Allison Marie, Alison Presmanes Hill, and Kristen B. Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.