STAT 306 - Lecture 02
import { d3_createScatterPlot, d3_createScatterPlotWithLine, sumOfSquaredResiduals, d3_createNormalDensityPlot } from "./scripts/d3plots.js"
There is a probability distribution of \(Y\) for each value of \(x\);
The means of these distributions are linearly related to \(x\);
The explanatory variable \(x\) is assumed to be fixed for each individual/sample;
The error component, \(\varepsilon_i\), captures everything that our model does not account for.
We treat \(\varepsilon_i\) as a random variable.
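To make this concrete, here is a minimal simulation sketch of the model above, using the height–weight line from this lecture (\(\beta_0 = -166\), \(\beta_1 = 140\)) and an illustrative normal error with \(\sigma = 3\) (the \(\sigma\) value is an assumption for the demo, matching the slider default below):

```python
import random

# Simulate Y = beta0 + beta1 * x + eps for a FIXED x, with eps ~ N(0, sigma).
# beta0 and beta1 come from the lecture's height-weight example;
# sigma = 3 is an illustrative choice.
random.seed(42)

beta0, beta1, sigma = -166, 140, 3
x = 1.63  # height in meters, treated as fixed

samples = [beta0 + beta1 * x + random.gauss(0, sigma) for _ in range(10_000)]
mean_weight = sum(samples) / len(samples)

print(round(mean_weight, 1))  # close to E[Y | x = 1.63] = 62.2
```

The simulated average hovers around \(62.2\) kg, the conditional mean the regression line gives for a \(1.63\) m tall person.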
Scroll down - you might need to refresh this page to show the plot
\[ \text{Weight} = -166+140\times\text{Height}+\varepsilon \]
\[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]
viewof height = Inputs.range([1.4, 2.1],
{
value: 1.63,
step: .01,
label: "Height: ",
width: 400
});
viewof sigma = Inputs.range([.1, 8],
{
value: 3,
step: .1,
label: "σ: ",
width: 400
});

{
const mean = -166+140*height;
const stdDev = sigma;
const elementId = 'normal-density';
  const title = `Weight Distribution for a ${height} m tall person`;
const xlab = 'Weight (kg)';
const ylab = 'Density';
const titleFontSize = '22px';
const labelFontSize = '18px';
const tickFontSize = '16px';
const margin = { top: 40, right: 20, bottom: 50, left: 70 };
d3_createNormalDensityPlot({
elementId,
mean,
stdDev,
title,
xlab,
ylab,
titleFontSize,
labelFontSize,
tickFontSize,
margin
});
}
\[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]
We would expect this person to weigh around \(62.2\) kg;
On average, \(1.63\) m tall people weigh \(62.2\) kg;
The line \(−166+140\times\text{Height}\) gives the mean \(\text{Weight}\) of people of a given \(\text{Height}\);
\[ \overbrace{\color{red}{\underbrace{\beta_0+\beta_1 x_i}_{\text{Regression Line}:\\\quad E[Y|X_i=x_i]}} + \varepsilon_i}^{\text{Point: } Y_i} \]
\(E[Y|X=x] = \beta_0 + \beta_1 x\) is the population’s conditional mean for a given value of \(X=x\).
But instead of estimating the mean for each value of \(X\) separately, which is not feasible, we are assuming a linear structure between the mean of the population and the value of \(X\).
Therefore, estimating the means for the values of \(X\) (in a certain range) reduces to estimating \(\beta_0\) and \(\beta_1\).
We denote the estimators by \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
| i | flipper_length_mm (X) | body_mass_g (Y) |
|---|---|---|
| 1 | 181 | 3750 |
| 2 | 186 | 3800 |
| 3 | 195 | 3250 |
| 4 | NA | NA |
| 5 | 193 | 3450 |
| 6 | 190 | 3650 |
| 7 | 181 | 3625 |
| 8 | 195 | 4675 |
| 9 | 193 | 3475 |
| 10 | 190 | 4250 |
| 11 | 186 | 3300 |
| 12 | 180 | 3700 |
| 13 | 182 | 3200 |
| 14 | 191 | 3800 |
| 15 | 198 | 4400 |
| 16 | 185 | 3700 |
| 17 | 195 | 3450 |
| 18 | 197 | 4500 |
| 19 | 184 | 3325 |
| 20 | 194 | 4200 |
| 21 | 174 | 3400 |
| 22 | 180 | 3600 |
| 23 | 189 | 3800 |
| 24 | 185 | 3950 |
| 25 | 180 | 3800 |
| 26 | 187 | 3800 |
| 27 | 183 | 3550 |
| 28 | 187 | 3200 |
| 29 | 172 | 3150 |
| 30 | 180 | 3950 |
| 31 | 178 | 3250 |
| 32 | 178 | 3900 |
| 33 | 188 | 3300 |
| 34 | 184 | 3900 |
| 35 | 195 | 3325 |
| 36 | 196 | 4150 |
| 37 | 190 | 3950 |
| 38 | 180 | 3550 |
| 39 | 181 | 3300 |
| 40 | 184 | 4650 |
| 41 | 182 | 3150 |
| 42 | 195 | 3900 |
| 43 | 186 | 3100 |
| 44 | 196 | 4400 |
| 45 | 185 | 3000 |
| 46 | 190 | 4600 |
| 47 | 182 | 3425 |
| 48 | 179 | 2975 |
| 49 | 190 | 3450 |
| 50 | 191 | 4150 |
| 51 | 186 | 3500 |
| 52 | 188 | 4300 |
| 53 | 190 | 3450 |
| 54 | 200 | 4050 |
| 55 | 187 | 2900 |
| 56 | 191 | 3700 |
| 57 | 186 | 3550 |
| 58 | 193 | 3800 |
| 59 | 181 | 2850 |
| 60 | 194 | 3750 |
| 61 | 185 | 3150 |
| 62 | 195 | 4400 |
| 63 | 185 | 3600 |
| 64 | 192 | 4050 |
| 65 | 184 | 2850 |
| 66 | 192 | 3950 |
| 67 | 195 | 3350 |
| 68 | 188 | 4100 |
| 69 | 190 | 3050 |
| 70 | 198 | 4450 |
| 71 | 190 | 3600 |
| 72 | 190 | 3900 |
| 73 | 196 | 3550 |
| 74 | 197 | 4150 |
| 75 | 190 | 3700 |
| 76 | 195 | 4250 |
| 77 | 191 | 3700 |
| 78 | 184 | 3900 |
| 79 | 187 | 3550 |
| 80 | 195 | 4000 |
| 81 | 189 | 3200 |
| 82 | 196 | 4700 |
| 83 | 187 | 3800 |
| 84 | 193 | 4200 |
| 85 | 191 | 3350 |
| 86 | 194 | 3550 |
| 87 | 190 | 3800 |
| 88 | 189 | 3500 |
| 89 | 189 | 3950 |
| 90 | 190 | 3600 |
| 91 | 202 | 3550 |
| 92 | 205 | 4300 |
| 93 | 185 | 3400 |
| 94 | 186 | 4450 |
| 95 | 187 | 3300 |
| 96 | 208 | 4300 |
| 97 | 190 | 3700 |
| 98 | 196 | 4350 |
| 99 | 178 | 2900 |
| 100 | 192 | 4100 |
| 101 | 192 | 3725 |
| 102 | 203 | 4725 |
| 103 | 183 | 3075 |
| 104 | 190 | 4250 |
| 105 | 193 | 2925 |
| 106 | 184 | 3550 |
| 107 | 199 | 3750 |
| 108 | 190 | 3900 |
| 109 | 181 | 3175 |
| 110 | 197 | 4775 |
| 111 | 198 | 3825 |
| 112 | 191 | 4600 |
| 113 | 193 | 3200 |
| 114 | 197 | 4275 |
| 115 | 191 | 3900 |
| 116 | 196 | 4075 |
| 117 | 188 | 2900 |
| 118 | 199 | 3775 |
| 119 | 189 | 3350 |
| 120 | 189 | 3325 |
| 121 | 187 | 3150 |
| 122 | 198 | 3500 |
| 123 | 176 | 3450 |
| 124 | 202 | 3875 |
| 125 | 186 | 3050 |
| 126 | 199 | 4000 |
| 127 | 191 | 3275 |
| 128 | 195 | 4300 |
| 129 | 191 | 3050 |
| 130 | 210 | 4000 |
| 131 | 190 | 3325 |
| 132 | 197 | 3500 |
| 133 | 193 | 3500 |
| 134 | 199 | 4475 |
| 135 | 187 | 3425 |
| 136 | 190 | 3900 |
| 137 | 191 | 3175 |
| 138 | 200 | 3975 |
| 139 | 185 | 3400 |
| 140 | 193 | 4250 |
| 141 | 193 | 3400 |
| 142 | 187 | 3475 |
| 143 | 188 | 3050 |
| 144 | 190 | 3725 |
| 145 | 192 | 3000 |
| 146 | 185 | 3650 |
| 147 | 190 | 4250 |
| 148 | 184 | 3475 |
| 149 | 195 | 3450 |
| 150 | 193 | 3750 |
| 151 | 187 | 3700 |
| 152 | 201 | 4000 |
| 153 | 211 | 4500 |
| 154 | 230 | 5700 |
| 155 | 210 | 4450 |
| 156 | 218 | 5700 |
| 157 | 215 | 5400 |
| 158 | 210 | 4550 |
| 159 | 211 | 4800 |
| 160 | 219 | 5200 |
| 161 | 209 | 4400 |
| 162 | 215 | 5150 |
| 163 | 214 | 4650 |
| 164 | 216 | 5550 |
| 165 | 214 | 4650 |
| 166 | 213 | 5850 |
| 167 | 210 | 4200 |
| 168 | 217 | 5850 |
| 169 | 210 | 4150 |
| 170 | 221 | 6300 |
| 171 | 209 | 4800 |
| 172 | 222 | 5350 |
| 173 | 218 | 5700 |
| 174 | 215 | 5000 |
| 175 | 213 | 4400 |
| 176 | 215 | 5050 |
| 177 | 215 | 5000 |
| 178 | 215 | 5100 |
| 179 | 216 | 4100 |
| 180 | 215 | 5650 |
| 181 | 210 | 4600 |
| 182 | 220 | 5550 |
| 183 | 222 | 5250 |
| 184 | 209 | 4700 |
| 185 | 207 | 5050 |
| 186 | 230 | 6050 |
| 187 | 220 | 5150 |
| 188 | 220 | 5400 |
| 189 | 213 | 4950 |
| 190 | 219 | 5250 |
| 191 | 208 | 4350 |
| 192 | 208 | 5350 |
| 193 | 208 | 3950 |
| 194 | 225 | 5700 |
| 195 | 210 | 4300 |
| 196 | 216 | 4750 |
| 197 | 222 | 5550 |
| 198 | 217 | 4900 |
| 199 | 210 | 4200 |
| 200 | 225 | 5400 |
| 201 | 213 | 5100 |
| 202 | 215 | 5300 |
| 203 | 210 | 4850 |
| 204 | 220 | 5300 |
| 205 | 210 | 4400 |
| 206 | 225 | 5000 |
| 207 | 217 | 4900 |
| 208 | 220 | 5050 |
| 209 | 208 | 4300 |
| 210 | 220 | 5000 |
| 211 | 208 | 4450 |
| 212 | 224 | 5550 |
| 213 | 208 | 4200 |
| 214 | 221 | 5300 |
| 215 | 214 | 4400 |
| 216 | 231 | 5650 |
| 217 | 219 | 4700 |
| 218 | 230 | 5700 |
| 219 | 214 | 4650 |
| 220 | 229 | 5800 |
| 221 | 220 | 4700 |
| 222 | 223 | 5550 |
| 223 | 216 | 4750 |
| 224 | 221 | 5000 |
| 225 | 221 | 5100 |
| 226 | 217 | 5200 |
| 227 | 216 | 4700 |
| 228 | 230 | 5800 |
| 229 | 209 | 4600 |
| 230 | 220 | 6000 |
| 231 | 215 | 4750 |
| 232 | 223 | 5950 |
| 233 | 212 | 4625 |
| 234 | 221 | 5450 |
| 235 | 212 | 4725 |
| 236 | 224 | 5350 |
| 237 | 212 | 4750 |
| 238 | 228 | 5600 |
| 239 | 218 | 4600 |
| 240 | 218 | 5300 |
| 241 | 212 | 4875 |
| 242 | 230 | 5550 |
| 243 | 218 | 4950 |
| 244 | 228 | 5400 |
| 245 | 212 | 4750 |
| 246 | 224 | 5650 |
| 247 | 214 | 4850 |
| 248 | 226 | 5200 |
| 249 | 216 | 4925 |
| 250 | 222 | 4875 |
| 251 | 203 | 4625 |
| 252 | 225 | 5250 |
| 253 | 219 | 4850 |
| 254 | 228 | 5600 |
| 255 | 215 | 4975 |
| 256 | 228 | 5500 |
| 257 | 216 | 4725 |
| 258 | 215 | 5500 |
| 259 | 210 | 4700 |
| 260 | 219 | 5500 |
| 261 | 208 | 4575 |
| 262 | 209 | 5500 |
| 263 | 216 | 5000 |
| 264 | 229 | 5950 |
| 265 | 213 | 4650 |
| 266 | 230 | 5500 |
| 267 | 217 | 4375 |
| 268 | 230 | 5850 |
| 269 | 217 | 4875 |
| 270 | 222 | 6000 |
| 271 | 214 | 4925 |
| 272 | NA | NA |
| 273 | 215 | 4850 |
| 274 | 222 | 5750 |
| 275 | 212 | 5200 |
| 276 | 213 | 5400 |
| 277 | 192 | 3500 |
| 278 | 196 | 3900 |
| 279 | 193 | 3650 |
| 280 | 188 | 3525 |
| 281 | 197 | 3725 |
| 282 | 198 | 3950 |
| 283 | 178 | 3250 |
| 284 | 197 | 3750 |
| 285 | 195 | 4150 |
| 286 | 198 | 3700 |
| 287 | 193 | 3800 |
| 288 | 194 | 3775 |
| 289 | 185 | 3700 |
| 290 | 201 | 4050 |
| 291 | 190 | 3575 |
| 292 | 201 | 4050 |
| 293 | 197 | 3300 |
| 294 | 181 | 3700 |
| 295 | 190 | 3450 |
| 296 | 195 | 4400 |
| 297 | 181 | 3600 |
| 298 | 191 | 3400 |
| 299 | 187 | 2900 |
| 300 | 193 | 3800 |
| 301 | 195 | 3300 |
| 302 | 197 | 4150 |
| 303 | 200 | 3400 |
| 304 | 200 | 3800 |
| 305 | 191 | 3700 |
| 306 | 205 | 4550 |
| 307 | 187 | 3200 |
| 308 | 201 | 4300 |
| 309 | 187 | 3350 |
| 310 | 203 | 4100 |
| 311 | 195 | 3600 |
| 312 | 199 | 3900 |
| 313 | 195 | 3850 |
| 314 | 210 | 4800 |
| 315 | 192 | 2700 |
| 316 | 205 | 4500 |
| 317 | 210 | 3950 |
| 318 | 187 | 3650 |
| 319 | 196 | 3550 |
| 320 | 196 | 3500 |
| 321 | 196 | 3675 |
| 322 | 201 | 4450 |
| 323 | 190 | 3400 |
| 324 | 212 | 4300 |
| 325 | 187 | 3250 |
| 326 | 198 | 3675 |
| 327 | 199 | 3325 |
| 328 | 201 | 3950 |
| 329 | 193 | 3600 |
| 330 | 203 | 4050 |
| 331 | 187 | 3350 |
| 332 | 197 | 3450 |
| 333 | 191 | 3250 |
| 334 | 203 | 4050 |
| 335 | 202 | 3800 |
| 336 | 194 | 3525 |
| 337 | 206 | 3950 |
| 338 | 189 | 3650 |
| 339 | 195 | 3650 |
| 340 | 207 | 4000 |
| 341 | 202 | 3400 |
| 342 | 193 | 3775 |
| 343 | 210 | 4100 |
| 344 | 198 | 3775 |
Let \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\) be a sample of size \(n\) of the variables \(X\) and \(Y\);
These are just points in a scatter plot; e.g.,
d3_createScatterPlot({
elementId: 'scatterplot-example-fitting',
xName: 'x',
yName: 'y',
data: data_test,
title: `Scatterplot of the sample` ,
xlab: 'Explanatory Variable',
ylab: "Response Variable",
titleFontSize: "24px",
labelFontSize: "18px",
tickFontSize: '16px',
pointSize: 3,
pointColor: 'steelblue',
margin: {top: 80, right: 40, bottom: 100, left: 80}
});
We want a line that is “close” to the points;
The difference between the \(i\)-th observed point and the line is called the residual and is denoted by \(e_i\): \[ e_i = Y_i - (b_0 + b_1 x_i) \]
A line close to the points means small residuals;
But how do we measure “small”?
One common way is to use the Residual Sum of Squares (RSS): \[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)^2 \]
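The RSS is just a sum of squared vertical distances, which is easy to compute directly. A minimal sketch (mirroring what the page's `sumOfSquaredResiduals` helper does in JavaScript), with a tiny made-up dataset:

```python
def rss(b0, b1, xs, ys):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying exactly on y = 1 + 2x: that line has RSS 0,
# and any other line has a larger RSS.
xs, ys = [0, 1, 2], [1, 3, 5]
print(rss(1, 2, xs, ys))  # 0 (perfect fit)
print(rss(0, 2, xs, ys))  # 3 (each of the 3 residuals is 1)
```

Smaller RSS means the line sits closer to the points overall, which is exactly the criterion minimized in the interactive plot below.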
viewof intercept = {
let input = Inputs.range([-1, 13],
{
value: 8,
step: .01,
label: "Intercept: ",
width: 300
});
d3.select(input).select("label").node().innerHTML = 'b<sub>0</sub>: ';
return input
}
viewof slope = {
let input = Inputs.range([-5, 5],
{value: 0,
step: .01,
label: "Slope: ", width: 300});
d3.select(input).select("label").node().innerHTML = 'b<sub>1</sub>: ';
return input
}
data_test = [
{'x': 2.83, 'y': 10.90},
{'x': 4.37, 'y': 11.48},
{'x': 3.29, 'y': 11.22},
{'x': 2.45, 'y': 8.11},
{'x': -0.50, 'y': 2.92},
{'x': 3.53, 'y': 13.97},
{'x': 3.32, 'y': 5.21},
{'x': 2.15, 'y': 7.53},
{'x': 3.90, 'y': 9.63},
{'x': 0.12, 'y': 5.98},
{'x': 4.20, 'y': 12.88},
{'x': 2.73, 'y': 10.05},
{'x': 4.64, 'y': 13.26},
{'x': 2.12, 'y': 4.94},
{'x': 0.95, 'y': 9.18}];
{
const rss_data_test = sumOfSquaredResiduals({slope: slope, intercept: intercept, data: data_test, xName: 'x', yName: 'y'});
//const title = "Residual Sum of Squares" + rss;
d3_createScatterPlotWithLine({
elementId: 'which-beta',
//xName: 'flipper_length_mm',
//yName: 'body_mass_g',
xName: 'x',
yName: 'y',
data: data_test,
slope: slope,
intercept: intercept,
drawErrorLines: true,
title: `Residual Sum of Squares: ${rss_data_test.toFixed(3)}` ,
xlab: 'Explanatory Variable',
ylab: "Response Variable",
titleFontSize: "24px",
labelFontSize: "20px",
tickFontSize: '16px',
pointSize: 3,
pointColor: 'steelblue',
margin: {top: 80, right: 20, bottom: 50, left: 80}
//lineCallback,
//styles = {}
});
}
d3_createScatterPlotWithLine({
elementId: 'scatterplot-penguins-3',
//xName: 'flipper_length_mm',
//yName: 'body_mass_g',
xName: 'x',
yName: 'y',
data: data_test,
slope: 1.6307,
intercept: 4.7911,
drawErrorLines: true,
title: `Residual Sum of Squares: ${sumOfSquaredResiduals({slope: 1.6307, intercept: 4.7911, data: data_test, xName: 'x', yName: 'y'}).toFixed(3)}` ,
xlab: 'Explanatory Variable',
ylab: "Response Variable",
titleFontSize: "24px",
labelFontSize: "18px",
tickFontSize: '16px',
pointSize: 3,
pointColor: 'steelblue',
margin: {top: 80, right: 40, bottom: 100, left: 80}
//lineCallback,
//styles = {}
});
\[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)^2 \]
Take the partial derivatives of \(RSS(b_0, b_1)\) with respect to \(b_0\) and \(b_1\); \[ \frac{\partial RSS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right) \] \[ \frac{\partial RSS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)x_i \]
Set the partial derivatives to zero: \[ \frac{\partial RSS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0 \] \[ \frac{\partial RSS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)x_i = 0 \]
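Solving these two normal equations gives the familiar closed form \(\hat{\beta}_1 = \sum_i (x_i - \bar{x})(Y_i - \bar{Y}) / \sum_i (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}\). A sketch applying it to the `data_test` points from the interactive plot:

```python
# Closed-form least-squares solution from the normal equations:
#   b1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
#   b0_hat = y_bar - b1_hat * x_bar
xs = [2.83, 4.37, 3.29, 2.45, -0.50, 3.53, 3.32, 2.15, 3.90, 0.12,
      4.20, 2.73, 4.64, 2.12, 0.95]
ys = [10.90, 11.48, 11.22, 8.11, 2.92, 13.97, 5.21, 7.53, 9.63, 5.98,
      12.88, 10.05, 13.26, 4.94, 9.18]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

b1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
b0_hat = y_bar - b1_hat * x_bar

print(b1_hat, b0_hat)  # approximately 1.6307 and 4.7911
```

The results match the slope and intercept used in the fitted-line plot above, the line that minimizes the RSS for this sample.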
Example: Flipper Length and Body Mass of penguins. In R, we can fit this model with the lm function.
Important: Association is not causality.
In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.
This means that an increase of 1mm in flipper length is associated with an expected increase of 50.15g in body mass.
Not ok to say: “An increase of 1mm in flipper length increases body mass by 50.15g.”
To predict the value of \(Y\) for a given \(x\), we use the estimated regression line: \[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
\(\hat{Y}\) is called the predicted value.
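As a sketch, here is prediction with the line fitted to `data_test` above (using the estimates \(\hat{\beta}_0 = 4.7911\), \(\hat{\beta}_1 = 1.6307\) from that plot):

```python
# Prediction with the estimated regression line: y_hat = b0_hat + b1_hat * x.
# Coefficients taken from the data_test fit shown earlier.
b0_hat, b1_hat = 4.7911, 1.6307

def predict(x):
    """Predicted (fitted) value of Y at a given x."""
    return b0_hat + b1_hat * x

print(round(predict(3.0), 4))  # 9.6832
```

Note that \(\hat{Y}\) estimates the conditional mean \(E[Y|X=x]\); individual observations will scatter around it.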
The linear model assumes that the relationship between \(X\) and \(E[Y|X]\) is linear, which may or may not be true;
Sometimes, there’s a linear association only in part of the data range.
We need to exercise caution when using the model outside the range of the data, as the relationship between \(X\) and \(Y\) may differ significantly.
There’s no way for us to know whether the relationship is still linear outside the range of the data;
You should be careful when predicting outside the range of the data;
Consider a categorical explanatory variable; e.g., sex: male or female.
\[
\text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i
\]
We can encode the categories into numerical variables.
For example, the variable sex could be defined as:
\[\begin{equation} \text{sex}_i = \begin{cases} 0 & \text{if penguin $i$ is female}\\ 1 & \text{if penguin $i$ is male} \end{cases} \end{equation}\]
Similarly, a variable patient_status could be defined as: \[\begin{equation} \text{patient_status}_i = \begin{cases} 0 & \text{if patient $i$ is healthy}\\ 1 & \text{if patient $i$ is sick} \end{cases} \end{equation}\]
Back to our example, sex: male or female.
\[
\text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i
\]
Note that there is no line in this case.
Sex cannot be 0.1 or 0.5. It can only be 0 or 1.
So, what is going on?
\[\begin{equation} \text{body_mass_g}_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i& \text{if penguin $i$ is male}\\ \beta_0 + \varepsilon_i & \text{if penguin $i$ is female} \end{cases} \end{equation}\]
Remember that in regression, we model the mean given the value of a covariate.
So, in this case, we are modelling the mean of female penguins and the mean of male penguins;
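A sketch with tiny hypothetical data (four made-up penguins) shows that with the 0/1 coding, least squares recovers exactly the two group means:

```python
# Hypothetical data: two female (sex = 0) and two male (sex = 1) penguins.
# With this coding, b0_hat equals the female mean and b0_hat + b1_hat
# equals the male mean.
sex  = [0, 0, 1, 1]              # 0 = female, 1 = male
mass = [3400, 3600, 4200, 4400]  # body_mass_g (made-up values)

n = len(sex)
x_bar = sum(sex) / n
y_bar = sum(mass) / n

b1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(sex, mass)) \
         / sum((x - x_bar) ** 2 for x in sex)
b0_hat = y_bar - b1_hat * x_bar

print(b0_hat, b0_hat + b1_hat)  # 3500.0 4300.0 -- the female and male means
```

Here \(\hat{\beta}_1\) is the difference between the two group means, which is why the "slope" of a dummy variable is read as a between-group comparison.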
Note that:
\(\beta_0\) is the mean body_mass_g of female penguins.
\(\beta_0 + \beta_1\) is the mean body_mass_g of male penguins.
R is pretty good at dealing with categorical variables.
All we need to do is to use factors, e.g.,
lm will create the dummy variables and tell us the levels of the factor associated with the coefficient.
Population Regression: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
Sample Regression (estimated from the sample): \[ Y_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i + e_i \]
Since \(\beta_0\) and \(\beta_1\) are parameters, we estimate them based on a sample;
\(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are the estimators of \(\beta_0\) and \(\beta_1\).
Since \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are statistics, they depend on the sample; therefore, we will need their sampling distributions.
© 2024 Rodolfo Lourenzutti – Material Licensed under CC By-SA 4.0