Simple Linear Regression

STAT 306 - Lecture 02

Specification

  • We have two variables:
    • Response variable (\(Y\)): what we are trying to predict/explain;
    • Explanatory Variable (\(x\)): the variable used to predict the response;
  • There is a probability distribution of \(Y\) for each value of \(x\);

  • The means of these distributions are linearly related to \(x\);

  • The explanatory variable \(x\) is assumed to be fixed for each individual/sample;

    • For this reason, I’ll use lower case for \(x\) and upper case for \(Y\), although the usual notation uses upper case for both.

The model

  • The model relating \(X\) and \(Y\) is given by the equation of a line: \[ Y_i = \beta_0 + \beta_1 x_i+ \varepsilon_i \]
  • \(x_i\) is the value of the explanatory variable for the \(i\)-th sample (constant!);
  • \(\beta_0\) and \(\beta_1\) are parameters (constants!);
  • \(\varepsilon_i\) is a random error term (random variable!).
  • \(Y_i\) the response variable for the \(i\)-th sample (random variable)
    • function of a random variable, so also a random variable;

The model: components

  • The model relating \(X\) and \(Y\) is given by the equation of a line: \[ Y_i = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} x_i + \underbrace{\varepsilon_i}_{\text{error term}} \]
  • Intercept: tells us the \(Y\) value when \(X=0\) (i.e., the value of \(Y\) when the line crosses the \(Y\)-axis).
  • Slope: tells us how much change in \(Y\) to expect for a unit increase in \(X\).
  • Error: captures the variability of the response not explained by the model.

The random errors

  • The error component, \(\varepsilon_i\), captures everything that our model does not.

  • We treat \(\varepsilon_i\) as a random variable.

    • It has a distribution:
      • We will assume it to be Normal;
    • It has a mean:
      • safely assumed to be 0 (i.e., \(E[\varepsilon_i] = 0\));
    • It has a variance:
      • unknown and denoted by \(\sigma^2\) (i.e., \(Var(\varepsilon_i) = \sigma^2\));
    • \(Cov(\varepsilon_i, \varepsilon_j) = 0\) for \(i \neq j\) (i.e., the errors are uncorrelated).
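These assumptions can be illustrated with a short simulation. The sketch below uses Python (the course itself uses R), and the parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(306)

# Hypothetical parameter values, for illustration only
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 100)                 # fixed (non-random) covariate values
eps = rng.normal(0.0, sigma, size=x.size)   # iid Normal errors: mean 0, variance sigma^2
Y = beta0 + beta1 * x + eps                 # Y is random because eps is

print(round(eps.mean(), 2))                 # sample mean of the errors, near 0
```

Each run draws new errors, so the simulated responses scatter around the line \(2 + 0.5x\).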

The random errors and the response


  • Imagine a linear model relating Height (m) and Weight (kg):

\[ \text{Weight} = -166+140\times\text{Height}+\varepsilon \]

  • For a \(1.63m\) tall person, we have:

\[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]

  • We would expect this person to weigh around \(62.2kg\);
    • But the weight is affected by other factors as well;
    • so we cannot say precisely the value of the weight;
  • We have a probability distribution of possible weights for a \(1.63m\) tall person:
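The arithmetic above is easy to verify (Python is used here just as a calculator; the coefficients are the slide’s hypothetical values):

```python
# Hypothetical model from the slide: Weight = -166 + 140 * Height
beta0, beta1 = -166, 140
height = 1.63

mean_weight = beta0 + beta1 * height
print(round(mean_weight, 1))   # 62.2
```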



Modelling the average

  • For a \(1.63m\) tall person, we have:

\[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]

  • We would expect this person to weigh around \(62.2kg\);

    • Some people will weigh more, and some will weigh less.
  • On average, \(1.63m\) tall people weigh \(62.2kg\);

  • The line \(-166+140\times\text{Height}\) gives the mean \(\text{Weight}\) of people of a given \(\text{Height}\);

The model as conditional expectation

  • In general: \[ E[Y|X = x] = \beta_0 + \beta_1 x \]
    • This just means that the regression line is the conditional average of \(Y\) for a given value of \(X=x\).
    • Why can we say that?
  • Note the difference: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
    • This equation is for a given point, which is off the line (note the presence of the error term).

The model as conditional expectation

\[ \overbrace{\color{red}{\underbrace{\beta_0+\beta_1 x_i}_{\substack{\text{Regression Line:}\\ E[Y|X=x_i]}}} + \varepsilon_i}^{\text{Point: } Y_i} \]


Fitting is estimation

  • \(E[Y|X=x] = \beta_0 + \beta_1 x\) is the population’s conditional mean for a given value of \(X=x\).

  • But instead of estimating the mean for each value of \(X\) separately, which is not feasible, we assume a linear structure between the conditional mean of \(Y\) and the value of \(X\).

  • Therefore, estimating the means for the values of \(X\) (in a certain range) reduces to estimating \(\beta_0\) and \(\beta_1\).

  • We denote the estimators by \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Fitting the model: Notation

  • We have \(n\) observations/samples.
    • \(X_i\) and \(Y_i\) denote the values of \(X\) and \(Y\) for observation \(i\).
    • You can think of it as \(n\) pairs: \((X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)\).
  • For example, for the penguins, \(i\) refers to each penguin and can vary from 1 to 344.
  i   flipper_length_mm (X)   body_mass_g (Y)
  1   181                     3750
  2   186                     3800
  3   195                     3250
  4   NA                      NA
  5   193                     3450
  …   …                       …
344   198                     3775

(344 rows in total; NA indicates a missing measurement.)

Fitting the model


  • Let \((X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)\) be a sample of size \(n\) of the variables \(X\) and \(Y\);

  • These are just points in a scatter plot; e.g.,

  • Naturally, we could use many different lines to fit these points;
    • which line to use?

Fitting the model: the residuals

  • We want a line that is “close” to the points;

  • The difference between the \(i\)-th observed point and the line is called residual and denoted by \(e_i\), \[ e_i = Y_i - (b_0 + b_1 X_i) \]

    • \(b_0\) is the intercept of the line;
    • \(b_1\) is the slope of the line.

Fitting the model: Residuals Sum of Squares

  • A line close to the points means small residuals;

  • But how do we measure “small”?

  • One common way is to use the Residual Sum of Squares (RSS): \[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)^2 \]

  • So we want to find the values of \(b_0\) and \(b_1\) that minimize the RSS.
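To make the criterion concrete, here is a small Python sketch (toy data, made up for illustration) computing the RSS of two candidate lines:

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def rss(b0, b1):
    """Residual sum of squares of the line b0 + b1*x."""
    e = Y - (b0 + b1 * x)
    return float(np.sum(e ** 2))

# A line close to the points has a much smaller RSS than a poor one
print(rss(0.1, 2.0), rss(0.0, 1.0))
```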

Fitting the model: minimizing RSS


  • What values to use for \(b_0\) and \(b_1\)?

Fitting the model: minimizing RSS


  • We want the values of \(b_0\) and \(b_1\) that minimize the RSS;

Fitting the model: normal equations


  • We can use calculus to find the values of \(b_0\) and \(b_1\) that minimize the RSS:

\[ RSS(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)^2 \]

  1. Take the partial derivatives of \(RSS(b_0, b_1)\) with respect to \(b_0\) and \(b_1\); \[ \frac{\partial RSS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right) \] \[ \frac{\partial RSS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^n \left( Y_i - (b_0 + b_1 x_i) \right)x_i \]

  2. Set the partial derivatives to zero: \[ \frac{\partial RSS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0 \] \[ \frac{\partial RSS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)x_i = 0 \]

  • The solutions of these equations are the estimators of \(\beta_0\) and \(\beta_1\) - that’s why we used \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
  3. We have a system of two linear equations. Let’s organize it: \[ \hat{\beta}_0 + \bar{x}\hat{\beta}_1 = \bar{Y} \] \[ \hat{\beta}_0\sum_{i=1}^n x_i + \hat{\beta}_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n Y_ix_i \]
  • These are the so-called normal equations.
  4. Solve the system of equations to get the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\): \[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x} \] \[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r_{XY}\frac{S_Y}{S_X} \]
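The closed-form solution can be checked numerically. Below is a Python sketch with toy data, cross-checked against numpy’s degree-1 polynomial fit:

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, Ybar = x.mean(), Y.mean()

# Least-squares estimates from the normal equations
beta1_hat = np.sum((x - xbar) * (Y - Ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = Ybar - beta1_hat * xbar

# Cross-check with numpy's built-in least-squares line fit
b1_np, b0_np = np.polyfit(x, Y, 1)

print(round(beta0_hat, 3), round(beta1_hat, 3))
```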

Example: Fitting a model

  • Let’s fit a linear model relating Flipper Length and Body Mass of penguins.
  • To fit a linear model in R, we use the lm function.

Interpretation \(\beta_0\) and \(\beta_1\)

  • Slope: an increase of 1 unit of \(X\) is associated with an expected increase of \(\beta_1\) units in \(Y\).
    • It is associated with, not the cause of!
  • Intercept: The average value of \(Y\) when \(X = 0\) is \(\beta_0\).
    • Usually, we don’t care as much about this parameter.

Important: Association is not causality.

In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.

Interpretation \(\beta_0\) and \(\beta_1\)

  • This means that an increase of 1mm in flipper length is associated with an expected increase of 50.15g in body mass.

  • Not ok to say: “An increase of 1mm in flipper length increases body mass by 50.15g.”

Predicting with the model

  • To predict the value of \(Y\) for a given \(x\), we use the estimated regression line: \[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

  • \(\hat{Y}\) is called the predicted value.
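As a sketch (the coefficient values below are hypothetical, not from any model fitted in these slides):

```python
# Hypothetical estimated coefficients
beta0_hat, beta1_hat = 0.14, 1.96

def predict(x):
    """Predicted value Y-hat = beta0_hat + beta1_hat * x."""
    return beta0_hat + beta1_hat * x

print(round(predict(3.0), 2))   # prediction at x = 3
```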

The range problem

  • The linear model assumes that the relationship between \(X\) and \(E[Y|X]\) is linear, which may or may not be true;

  • Sometimes, there’s a linear association only in part of the data range.

    • The linear model could still be useful when restricted to that specific range;
  • We need to exercise caution when using the model outside the range of the data, as the relationship between \(X\) and \(Y\) may differ significantly.

The Range problem

  • According to our model, what is the % of fat after week 36?

The Range problem

  • According to our model, what is the % of fat before week 11?

The Range problem

  • Let’s look at the actual data before week 10:

The Range problem: Take-away

  • There’s no way for us to know whether the relationship is still linear outside the range of the data;

  • You should be careful when predicting outside the range of the data;

Regression vs Correlation analysis

  • Correlation analysis: we’re interested in the strength of linear association between two variables;
    • no distinction between the two variables (no response and no covariate);
    • both variables are assumed to be stochastic;
  • Linear Regression: we’re interested in estimating the conditional average of the response given the value of the covariate.
    • covariate is assumed to be non-stochastic;
    • one of the variables is treated as a response and the other as a covariate;

Categorical Covariate?

  • Note that we can also have a categorical covariate.
    • For now let’s assume that the categorical variable has only two categories.
  • We can try to explain the body mass of Penguins based on the sex of penguins: male or female. \[ \text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i \]
  • But wait! How can we have categories in an equation?

Dummy variables (2 categories)

  • We can encode the categories into multiple variables.

  • For example, the variable sex could be defined as:

\[\begin{equation} \text{sex}_i = \begin{cases} 0 & \text{if penguin $i$ is female}\\ 1 & \text{if penguin $i$ is male} \end{cases} \end{equation}\]

  • A variable patient_status could be defined:

\[\begin{equation} \text{patient_status}_i = \begin{cases} 0 & \text{if patient $i$ is healthy}\\ 1 & \text{if patient $i$ is sick} \end{cases} \end{equation}\]

  • If we have a variable with two categories, we need only 1 dummy variable to represent it.
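The encoding itself is simple to implement. A Python sketch (mirroring the slide’s coding: female = 0, male = 1; the records are made up):

```python
# Example records (made up)
sexes = ["female", "male", "male", "female"]

# One 0/1 dummy variable suffices for a two-category variable
dummy = [1 if s == "male" else 0 for s in sexes]
print(dummy)   # [0, 1, 1, 0]
```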

Back to the model

  • We can try to explain the body mass of Penguins based on the sex of penguins: male or female. \[ \text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i \]

Where is the line?

  • Note that there is no line in this case.

  • Sex cannot be 0.1 or 0.5. It can only be 0 or 1.

  • So, what is going on?

\[\begin{equation} \text{body_mass_g}_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i& \text{if penguin $i$ is male}\\ \beta_0 + \varepsilon_i & \text{if penguin $i$ is female} \end{cases} \end{equation}\]

We are just comparing means

  • Remember that in regression, we model the mean given the value of a covariate.

  • So, in this case, we are modelling the mean of female penguins and the mean of male penguins;

  • Note that:

    • \(\beta_0\) is the average body_mass_g of female penguins.
    • \(\beta_1\) is the difference in means.
    • \(\beta_0+\beta_1\) is the average body_mass_g of male penguins.
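This can be verified numerically: a least-squares fit on a 0/1 covariate reproduces the two group means. A Python sketch with made-up body masses:

```python
import numpy as np

# Made-up body masses (g): sex = 0 for female, 1 for male
sex = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
mass = np.array([3400.0, 3600.0, 3500.0, 4800.0, 5000.0, 5200.0])

# Least-squares fit of mass on the dummy variable
beta1_hat, beta0_hat = np.polyfit(sex, mass, 1)

female_mean = mass[sex == 0].mean()
male_mean = mass[sex == 1].mean()

# beta0_hat equals the female mean; beta1_hat equals the difference in means
print(round(beta0_hat), round(beta1_hat))
```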

R for us

  • R is pretty good at dealing with categorical variables.

  • All we need to do is to use factors, e.g.,

library(dplyr)    # provides %>% and mutate()
library(forcats)  # provides as_factor()

penguins_clean <- penguins_clean %>%
  mutate(sex = as_factor(sex))
  • The lm function will create the dummy variables and tell us which level of the factor is associated with each coefficient.

Population vs Sample Regression

  • Population Regression: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

  • Sample Regression (estimated from the sample): \[ Y_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i + e_i \]

  • Note that, in general, \(e_i\neq\varepsilon_i\), because \(\widehat{\beta}_0\neq \beta_0\) and \(\widehat{\beta}_1\neq \beta_1\).

The parameters \(\beta_0\) and \(\beta_1\)

  • Since \(\beta_0\) and \(\beta_1\) are parameters, we estimate them based on a sample;

  • \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are the estimators of \(\beta_0\) and \(\beta_1\).

  • As statistics, \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) depend on the sample; therefore, we will need their sampling distributions.
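To see why a sampling distribution is needed, one can simulate many samples from the same population model and re-estimate the slope each time. A Python sketch (made-up true parameters):

```python
import numpy as np

rng = np.random.default_rng(306)

# Made-up true parameters and fixed covariate values
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0, 10, 50)

# Re-estimate the slope over many simulated samples
slopes = []
for _ in range(2000):
    Y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1, b0 = np.polyfit(x, Y, 1)
    slopes.append(b1)

slopes = np.array(slopes)
print(round(slopes.mean(), 2))   # centred near the true slope 0.5
```

The estimates vary from sample to sample, and their spread is exactly what the sampling distribution describes.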