STAT 200 - Chapter 7 & 8
We learned that the correlation measures the strength and direction of the linear relationship between two variables.
However, we often need more than this, such as quantifying how much the variables vary together and even predicting one variable based on another.
import { d3_createScatterPlot, d3_createScatterPlotWithLine, sumOfSquaredResiduals } from "./scripts/d3plots.js"
penguins_array = {
  // Convert the column-oriented penguins_df into an array of row objects
  const maxLength = d3.max(Object.values(penguins_df).map(d => d.length));
  const keys = Object.keys(penguins_df);
  return d3.range(maxLength).map(i =>
    // Missing (falsy) entries are stored as null
    Object.fromEntries(keys.map(key => [key, penguins_df[key][i] || null]))
  );
}
penguins_sample_array = {
  // Same conversion as penguins_array, applied to the sampled data frame
  const maxLength = d3.max(Object.values(penguins_sample_df).map(d => d.length));
  const keys = Object.keys(penguins_sample_df);
  return d3.range(maxLength).map(i =>
    // Missing (falsy) entries are stored as null
    Object.fromEntries(keys.map(key => [key, penguins_sample_df[key][i] || null]))
  );
}
The correlation coefficient tells us the strength of the linear relationship between two variables;
But it does not directly give us the answer to these questions;
Linear models provide a mathematical formula to describe the relationship between flipper length and body mass based on the data.
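As a sketch of what such a formula looks like for a straight line, the fitted value for penguin \(i\) can be written as below. The symbols \(b_0\) (intercept) and \(b_1\) (slope) are assumed here to match the slider labels used later in these slides, with \(X_i\) the flipper length and \(\widehat{Y}_i\) the predicted body mass.

\[ \widehat{Y}_i = b_0 + b_1 X_i \]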
i | flipper_length_mm (X) | body_mass_g (Y) |
---|---|---|
1 | 181 | 3750 |
2 | 186 | 3800 |
3 | 195 | 3250 |
4 | NA | NA |
5 | 193 | 3450 |
6 | 190 | 3650 |
7 | 181 | 3625 |
8 | 195 | 4675 |
9 | 193 | 3475 |
10 | 190 | 4250 |
11 | 186 | 3300 |
12 | 180 | 3700 |
13 | 182 | 3200 |
14 | 191 | 3800 |
15 | 198 | 4400 |
16 | 185 | 3700 |
17 | 195 | 3450 |
18 | 197 | 4500 |
19 | 184 | 3325 |
20 | 194 | 4200 |
21 | 174 | 3400 |
22 | 180 | 3600 |
23 | 189 | 3800 |
24 | 185 | 3950 |
25 | 180 | 3800 |
26 | 187 | 3800 |
27 | 183 | 3550 |
28 | 187 | 3200 |
29 | 172 | 3150 |
30 | 180 | 3950 |
31 | 178 | 3250 |
32 | 178 | 3900 |
33 | 188 | 3300 |
34 | 184 | 3900 |
35 | 195 | 3325 |
36 | 196 | 4150 |
37 | 190 | 3950 |
38 | 180 | 3550 |
39 | 181 | 3300 |
40 | 184 | 4650 |
41 | 182 | 3150 |
42 | 195 | 3900 |
43 | 186 | 3100 |
44 | 196 | 4400 |
45 | 185 | 3000 |
46 | 190 | 4600 |
47 | 182 | 3425 |
48 | 179 | 2975 |
49 | 190 | 3450 |
50 | 191 | 4150 |
51 | 186 | 3500 |
52 | 188 | 4300 |
53 | 190 | 3450 |
54 | 200 | 4050 |
55 | 187 | 2900 |
56 | 191 | 3700 |
57 | 186 | 3550 |
58 | 193 | 3800 |
59 | 181 | 2850 |
60 | 194 | 3750 |
61 | 185 | 3150 |
62 | 195 | 4400 |
63 | 185 | 3600 |
64 | 192 | 4050 |
65 | 184 | 2850 |
66 | 192 | 3950 |
67 | 195 | 3350 |
68 | 188 | 4100 |
69 | 190 | 3050 |
70 | 198 | 4450 |
71 | 190 | 3600 |
72 | 190 | 3900 |
73 | 196 | 3550 |
74 | 197 | 4150 |
75 | 190 | 3700 |
76 | 195 | 4250 |
77 | 191 | 3700 |
78 | 184 | 3900 |
79 | 187 | 3550 |
80 | 195 | 4000 |
81 | 189 | 3200 |
82 | 196 | 4700 |
83 | 187 | 3800 |
84 | 193 | 4200 |
85 | 191 | 3350 |
86 | 194 | 3550 |
87 | 190 | 3800 |
88 | 189 | 3500 |
89 | 189 | 3950 |
90 | 190 | 3600 |
91 | 202 | 3550 |
92 | 205 | 4300 |
93 | 185 | 3400 |
94 | 186 | 4450 |
95 | 187 | 3300 |
96 | 208 | 4300 |
97 | 190 | 3700 |
98 | 196 | 4350 |
99 | 178 | 2900 |
100 | 192 | 4100 |
101 | 192 | 3725 |
102 | 203 | 4725 |
103 | 183 | 3075 |
104 | 190 | 4250 |
105 | 193 | 2925 |
106 | 184 | 3550 |
107 | 199 | 3750 |
108 | 190 | 3900 |
109 | 181 | 3175 |
110 | 197 | 4775 |
111 | 198 | 3825 |
112 | 191 | 4600 |
113 | 193 | 3200 |
114 | 197 | 4275 |
115 | 191 | 3900 |
116 | 196 | 4075 |
117 | 188 | 2900 |
118 | 199 | 3775 |
119 | 189 | 3350 |
120 | 189 | 3325 |
121 | 187 | 3150 |
122 | 198 | 3500 |
123 | 176 | 3450 |
124 | 202 | 3875 |
125 | 186 | 3050 |
126 | 199 | 4000 |
127 | 191 | 3275 |
128 | 195 | 4300 |
129 | 191 | 3050 |
130 | 210 | 4000 |
131 | 190 | 3325 |
132 | 197 | 3500 |
133 | 193 | 3500 |
134 | 199 | 4475 |
135 | 187 | 3425 |
136 | 190 | 3900 |
137 | 191 | 3175 |
138 | 200 | 3975 |
139 | 185 | 3400 |
140 | 193 | 4250 |
141 | 193 | 3400 |
142 | 187 | 3475 |
143 | 188 | 3050 |
144 | 190 | 3725 |
145 | 192 | 3000 |
146 | 185 | 3650 |
147 | 190 | 4250 |
148 | 184 | 3475 |
149 | 195 | 3450 |
150 | 193 | 3750 |
151 | 187 | 3700 |
152 | 201 | 4000 |
153 | 211 | 4500 |
154 | 230 | 5700 |
155 | 210 | 4450 |
156 | 218 | 5700 |
157 | 215 | 5400 |
158 | 210 | 4550 |
159 | 211 | 4800 |
160 | 219 | 5200 |
161 | 209 | 4400 |
162 | 215 | 5150 |
163 | 214 | 4650 |
164 | 216 | 5550 |
165 | 214 | 4650 |
166 | 213 | 5850 |
167 | 210 | 4200 |
168 | 217 | 5850 |
169 | 210 | 4150 |
170 | 221 | 6300 |
171 | 209 | 4800 |
172 | 222 | 5350 |
173 | 218 | 5700 |
174 | 215 | 5000 |
175 | 213 | 4400 |
176 | 215 | 5050 |
177 | 215 | 5000 |
178 | 215 | 5100 |
179 | 216 | 4100 |
180 | 215 | 5650 |
181 | 210 | 4600 |
182 | 220 | 5550 |
183 | 222 | 5250 |
184 | 209 | 4700 |
185 | 207 | 5050 |
186 | 230 | 6050 |
187 | 220 | 5150 |
188 | 220 | 5400 |
189 | 213 | 4950 |
190 | 219 | 5250 |
191 | 208 | 4350 |
192 | 208 | 5350 |
193 | 208 | 3950 |
194 | 225 | 5700 |
195 | 210 | 4300 |
196 | 216 | 4750 |
197 | 222 | 5550 |
198 | 217 | 4900 |
199 | 210 | 4200 |
200 | 225 | 5400 |
201 | 213 | 5100 |
202 | 215 | 5300 |
203 | 210 | 4850 |
204 | 220 | 5300 |
205 | 210 | 4400 |
206 | 225 | 5000 |
207 | 217 | 4900 |
208 | 220 | 5050 |
209 | 208 | 4300 |
210 | 220 | 5000 |
211 | 208 | 4450 |
212 | 224 | 5550 |
213 | 208 | 4200 |
214 | 221 | 5300 |
215 | 214 | 4400 |
216 | 231 | 5650 |
217 | 219 | 4700 |
218 | 230 | 5700 |
219 | 214 | 4650 |
220 | 229 | 5800 |
221 | 220 | 4700 |
222 | 223 | 5550 |
223 | 216 | 4750 |
224 | 221 | 5000 |
225 | 221 | 5100 |
226 | 217 | 5200 |
227 | 216 | 4700 |
228 | 230 | 5800 |
229 | 209 | 4600 |
230 | 220 | 6000 |
231 | 215 | 4750 |
232 | 223 | 5950 |
233 | 212 | 4625 |
234 | 221 | 5450 |
235 | 212 | 4725 |
236 | 224 | 5350 |
237 | 212 | 4750 |
238 | 228 | 5600 |
239 | 218 | 4600 |
240 | 218 | 5300 |
241 | 212 | 4875 |
242 | 230 | 5550 |
243 | 218 | 4950 |
244 | 228 | 5400 |
245 | 212 | 4750 |
246 | 224 | 5650 |
247 | 214 | 4850 |
248 | 226 | 5200 |
249 | 216 | 4925 |
250 | 222 | 4875 |
251 | 203 | 4625 |
252 | 225 | 5250 |
253 | 219 | 4850 |
254 | 228 | 5600 |
255 | 215 | 4975 |
256 | 228 | 5500 |
257 | 216 | 4725 |
258 | 215 | 5500 |
259 | 210 | 4700 |
260 | 219 | 5500 |
261 | 208 | 4575 |
262 | 209 | 5500 |
263 | 216 | 5000 |
264 | 229 | 5950 |
265 | 213 | 4650 |
266 | 230 | 5500 |
267 | 217 | 4375 |
268 | 230 | 5850 |
269 | 217 | 4875 |
270 | 222 | 6000 |
271 | 214 | 4925 |
272 | NA | NA |
273 | 215 | 4850 |
274 | 222 | 5750 |
275 | 212 | 5200 |
276 | 213 | 5400 |
277 | 192 | 3500 |
278 | 196 | 3900 |
279 | 193 | 3650 |
280 | 188 | 3525 |
281 | 197 | 3725 |
282 | 198 | 3950 |
283 | 178 | 3250 |
284 | 197 | 3750 |
285 | 195 | 4150 |
286 | 198 | 3700 |
287 | 193 | 3800 |
288 | 194 | 3775 |
289 | 185 | 3700 |
290 | 201 | 4050 |
291 | 190 | 3575 |
292 | 201 | 4050 |
293 | 197 | 3300 |
294 | 181 | 3700 |
295 | 190 | 3450 |
296 | 195 | 4400 |
297 | 181 | 3600 |
298 | 191 | 3400 |
299 | 187 | 2900 |
300 | 193 | 3800 |
301 | 195 | 3300 |
302 | 197 | 4150 |
303 | 200 | 3400 |
304 | 200 | 3800 |
305 | 191 | 3700 |
306 | 205 | 4550 |
307 | 187 | 3200 |
308 | 201 | 4300 |
309 | 187 | 3350 |
310 | 203 | 4100 |
311 | 195 | 3600 |
312 | 199 | 3900 |
313 | 195 | 3850 |
314 | 210 | 4800 |
315 | 192 | 2700 |
316 | 205 | 4500 |
317 | 210 | 3950 |
318 | 187 | 3650 |
319 | 196 | 3550 |
320 | 196 | 3500 |
321 | 196 | 3675 |
322 | 201 | 4450 |
323 | 190 | 3400 |
324 | 212 | 4300 |
325 | 187 | 3250 |
326 | 198 | 3675 |
327 | 199 | 3325 |
328 | 201 | 3950 |
329 | 193 | 3600 |
330 | 203 | 4050 |
331 | 187 | 3350 |
332 | 197 | 3450 |
333 | 191 | 3250 |
334 | 203 | 4050 |
335 | 202 | 3800 |
336 | 194 | 3525 |
337 | 206 | 3950 |
338 | 189 | 3650 |
339 | 195 | 3650 |
340 | 207 | 4000 |
341 | 202 | 3400 |
342 | 193 | 3775 |
343 | 210 | 4100 |
344 | 198 | 3775 |
We want to find the line that best fits the data;
The best line is the one that makes the squared residuals as small as possible overall;
We sum the squared residuals and look for the line that minimizes this sum;
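To make the criterion concrete, here is a minimal sketch in plain JavaScript that computes the sum of squared residuals for a candidate line. The helper name `residualSumOfSquares` and the toy data are made up for illustration; this is not the `sumOfSquaredResiduals` function imported from `d3plots.js`, whose implementation is not shown here.

```javascript
// Hypothetical helper: residual sum of squares for the line y = b0 + b1 * x.
// `data` is an array of { x, y } objects; all names here are illustrative.
function residualSumOfSquares(data, b0, b1) {
  return data.reduce((total, point) => {
    const predicted = b0 + b1 * point.x;  // value on the candidate line
    const residual = point.y - predicted; // observed minus predicted
    return total + residual * residual;
  }, 0);
}

// Toy data, made up for this illustration only.
const toyPoints = [
  { x: 1, y: 2 },
  { x: 2, y: 4 },
  { x: 3, y: 5 },
];

const rssLineA = residualSumOfSquares(toyPoints, 1, 1); // line y = 1 + x
const rssLineB = residualSumOfSquares(toyPoints, 0, 2); // line y = 2x
console.log(rssLineA, rssLineB); // 2 1
```

Of these two candidates, the line y = 2x has the smaller sum, so it fits the toy data better under this criterion. It is not necessarily the best possible line.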
You might need to refresh this page to show the plot
viewof intercept = {
  const input = Inputs.range([-1, 13], {
    value: 8,
    step: 0.01,
    label: "Intercept: ",
    width: 300
  });
  // Replace the label text with b0 rendered with a subscript
  d3.select(input).select("label").html("b<sub>0</sub>: ");
  return input;
}
viewof slope = {
  const input = Inputs.range([-5, 5], {
    value: 0,
    step: 0.01,
    label: "Slope: ",
    width: 300
  });
  // Replace the label text with b1 rendered with a subscript
  d3.select(input).select("label").html("b<sub>1</sub>:");
  return input;
}
data_test = [
{'x': 2.83, 'y': 10.90},
{'x': 4.37, 'y': 11.48},
{'x': 3.29, 'y': 11.22},
{'x': 2.45, 'y': 8.11},
{'x': -0.50, 'y': 2.92},
{'x': 3.53, 'y': 13.97},
{'x': 3.32, 'y': 5.21},
{'x': 2.15, 'y': 7.53},
{'x': 3.90, 'y': 9.63},
{'x': 0.12, 'y': 5.98},
{'x': 4.20, 'y': 12.88},
{'x': 2.73, 'y': 10.05},
{'x': 4.64, 'y': 13.26},
{'x': 2.12, 'y': 4.94},
{'x': 0.95, 'y': 9.18}];
{
const rss_data_test = sumOfSquaredResiduals({slope: slope, intercept: intercept, data: data_test, xName: 'x', yName: 'y'});
//const title = "Residual Sum of Squares" + rss;
d3_createScatterPlotWithLine({
elementId: 'which-beta',
//xName: 'flipper_length_mm',
//yName: 'body_mass_g',
xName: 'x',
yName: 'y',
data: data_test,
slope: slope,
intercept: intercept,
drawErrorLines: true,
title: `Residual Sum of Squares: ${rss_data_test.toFixed(3)}` ,
xlab: 'Explanatory Variable',
ylab: "Response Variable",
titleFontSize: "24px",
labelFontSize: "20px",
tickFontSize: '16px',
pointSize: 3,
pointColor: 'steelblue',
margin: {top: 80, right: 20, bottom: 50, left: 80}
//lineCallback,
//styles = {}
});
}
You might need to refresh this page to show the plot
d3_createScatterPlotWithLine({
elementId: 'scatterplot-penguins-3',
//xName: 'flipper_length_mm',
//yName: 'body_mass_g',
xName: 'x',
yName: 'y',
data: data_test,
slope: 1.6307,
intercept: 4.7911,
drawErrorLines: true,
title: `Residual Sum of Squares: ${sumOfSquaredResiduals({slope: 1.6307, intercept: 4.7911, data: data_test, xName: 'x', yName: 'y'}).toFixed(3)}` ,
xlab: 'Explanatory Variable',
ylab: "Response Variable",
titleFontSize: "24px",
labelFontSize: "18px",
tickFontSize: '16px',
pointSize: 3,
pointColor: 'steelblue',
margin: {top: 80, right: 40, bottom: 100, left: 80}
//lineCallback,
//styles = {}
});
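The slope (1.6307) and intercept (4.7911) hard-coded in the cell above look like the least squares estimates for `data_test`. The sketch below checks this with the standard closed-form formulas, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The helper `fitLeastSquares` is a hypothetical name introduced here, not part of `d3plots.js`, and the data are copied from `data_test` so the block runs on its own.

```javascript
// Hypothetical helper: closed-form least squares estimates for y = b0 + b1 * x.
function fitLeastSquares(data) {
  const n = data.length;
  const meanX = data.reduce((sum, d) => sum + d.x, 0) / n;
  const meanY = data.reduce((sum, d) => sum + d.y, 0) / n;
  let sxy = 0; // sum of (x - meanX) * (y - meanY)
  let sxx = 0; // sum of (x - meanX)^2
  for (const d of data) {
    sxy += (d.x - meanX) * (d.y - meanY);
    sxx += (d.x - meanX) ** 2;
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  return { slope, intercept };
}

// Copy of data_test from the slides.
const dataTestCopy = [
  { x: 2.83, y: 10.90 }, { x: 4.37, y: 11.48 }, { x: 3.29, y: 11.22 },
  { x: 2.45, y: 8.11 },  { x: -0.50, y: 2.92 }, { x: 3.53, y: 13.97 },
  { x: 3.32, y: 5.21 },  { x: 2.15, y: 7.53 },  { x: 3.90, y: 9.63 },
  { x: 0.12, y: 5.98 },  { x: 4.20, y: 12.88 }, { x: 2.73, y: 10.05 },
  { x: 4.64, y: 13.26 }, { x: 2.12, y: 4.94 },  { x: 0.95, y: 9.18 },
];

const leastSquaresFit = fitLeastSquares(dataTestCopy);
// Rounded to four decimals, these agree with the values used in the plot.
console.log(leastSquaresFit.slope.toFixed(4), leastSquaresFit.intercept.toFixed(4));
```

No other choice of slope and intercept on the sliders gives a smaller residual sum of squares for these points.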
Important: Association is not causality.
In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.
Summary Quantities:
Regression:
\[ r_i = \underbrace{Y_i}_{\text{(from data)}} - \underbrace{\widehat{Y}_i}_{(\text{from model})} \]
Remember that the residuals contain everything the model couldn’t capture.
So, the residuals are helpful to check the goodness of fit of our model.
A commonly used plot is the response vs. the explanatory variable scatterplot.
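As a minimal sketch of the residual formula above, the following plain JavaScript computes \(r_i = Y_i - \widehat{Y}_i\) for each observation and keeps the explanatory value alongside it, so the result could feed a diagnostic plot. The helper name `computeResiduals` and the three toy points are made up for illustration; the line y = 2.5x is the least squares line for that toy data.

```javascript
// Hypothetical helper: residual (observed minus fitted) for every point,
// paired with x so the output can be plotted against the explanatory variable.
function computeResiduals(data, b0, b1) {
  return data.map((d) => ({ x: d.x, residual: d.y - (b0 + b1 * d.x) }));
}

// Toy data, made up for this illustration only.
const toyObservations = [
  { x: 1, y: 3 },
  { x: 2, y: 4 },
  { x: 3, y: 8 },
];

// Least squares line for the toy data: intercept 0, slope 2.5.
const toyResiduals = computeResiduals(toyObservations, 0, 2.5);
console.log(toyResiduals.map((r) => r.residual)); // [ 0.5, -1, 0.5 ]

// For a least squares line with an intercept, the residuals sum to zero.
const residualTotal = toyResiduals.reduce((sum, r) => sum + r.residual, 0);
console.log(residualTotal); // 0
```

A clear pattern in the residuals, such as a curve or a funnel shape, would suggest that the straight line is missing part of the relationship.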
Outliers might greatly affect the fitted linear model.
Data points whose omission results in a very different fitted regression model are influential points.
In these cases, one should fit separate regression lines to the data with and without the outliers, and compare the results.
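The comparison can be sketched as follows, assuming a simple closed-form least squares fit. The helper `fitLine` and the data are made up for illustration: four points lying exactly on y = 2x, plus one deliberately extreme point.

```javascript
// Hypothetical helper: closed-form least squares fit of y = b0 + b1 * x.
function fitLine(points) {
  const n = points.length;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of points) {
    sxy += (p.x - meanX) * (p.y - meanY);
    sxx += (p.x - meanX) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Made-up data: four points exactly on y = 2x, plus one outlier at (5, 0).
const cleanPoints = [
  { x: 1, y: 2 }, { x: 2, y: 4 }, { x: 3, y: 6 }, { x: 4, y: 8 },
];
const outlierPoint = { x: 5, y: 0 };

const fitWithoutOutlier = fitLine(cleanPoints);
const fitWithOutlier = fitLine([...cleanPoints, outlierPoint]);
console.log(fitWithoutOutlier); // slope 2, intercept 0
console.log(fitWithOutlier);    // slope 0, intercept 4
```

In this toy case a single point changes the slope from 2 to 0, so it would count as an influential point.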
There’s no way for us to know whether the relationship is still linear outside the range of the data;
You should not predict outside the range of the data;
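One way to enforce this in code is to store the observed range next to the fitted coefficients and refuse to predict outside it. This is a sketch under assumptions: `predictWithinRange` is a hypothetical helper, the coefficients are the ones used for the fitted line on `data_test`, and `xMin` and `xMax` are the smallest and largest x values in `data_test`.

```javascript
// Hypothetical guard: only predict inside the observed range of x.
function predictWithinRange(x, model) {
  const { intercept, slope, xMin, xMax } = model;
  if (x < xMin || x > xMax) {
    return null; // refuse to extrapolate
  }
  return intercept + slope * x;
}

// Coefficients from the fitted line on data_test; range taken from its x values.
const fittedModel = { intercept: 4.7911, slope: 1.6307, xMin: -0.5, xMax: 4.64 };

console.log(predictWithinRange(2, fittedModel));  // about 8.05
console.log(predictWithinRange(10, fittedModel)); // null
```

Returning `null` is just one design choice; throwing an error or attaching a warning to the prediction would serve the same purpose.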
… body mass based on the bill depth.
… species, the correlation between bill depth and body mass becomes positive;
© 2023 Rodolfo Lourenzutti & Eugenia Yu – Material Licensed under CC BY-SA 4.0