In most of STAT 201, we dealt with estimation and hypothesis testing for a single parameter.
In this course, you will study the association between variables.
We will explore the world of explanatory and predictive modelling.
You’ll learn about a variety of models, how to interpret their results, and how to evaluate model performance.
Let’s explore our Canvas page
Relations between variables
Relations
In most of STAT 201, we focused on the estimation of one parameter of a population:
the average income of a group within a population;
the median diameter of certain trees;
Often, we are interested in how a variable relates to other variables:
What do we expect to happen with the youth unemployment rate if there’s an increase in minimum wage?
Example 1: Laffer curve
The Laffer curve relates the tax rate and the government revenues:
Example 2: Price vs demand
A well-established fact in economics is that the more expensive a product is, the smaller the demand for that product.
Deterministic Relations
In a deterministic relationship, the value of one variable is entirely determined by the value of another variable.
There is no uncertainty!
Deterministic Relations: Example 1
Einstein’s mass-energy relation: \(\quad E = m\times c^2\)
\(c\) is just a constant (speed of light)
Deterministic Relations: Example 2
A circle’s area-radius relation: \(A=\pi r^2\).
\(\pi\) is just a constant.
Stochastic Relations
Stochastic relations refer to situations in which the outcome of a given input cannot be precisely predicted.
There’s uncertainty!
The uncertainty might be due to other variables not being considered or even noise.
Stochastic Relations: Example 1
Height and weight of individuals: taller people tend to weigh more, but there is considerable variability.
Other factors like age, gender, and lifestyle can influence weight.
It is impossible to accurately determine a person’s weight based solely on height.
Stochastic Relations: Example 2
Home price and square footage: larger homes tend to cost more, but there’s considerable variability.
Other factors such as location, age of the home, and market conditions can influence the price.
Stochastic Relations: the error term
A stochastic relation typically includes a random error term (\(\varepsilon\)) to account for the variability in the response. For instance, \[\text{Weight} = \beta_0 + \beta_1 \times \text{Height} + \varepsilon\]
Now, the model equation allows for two people with the same height to have different weight values;
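The effect of the error term can be sketched with a small simulation. All numbers here (the values of `beta0`, `beta1`, the common height of 170 cm and the error standard deviation) are made up for illustration and are not estimates from real data.

```r
# Simulate Weight = beta0 + beta1 * Height + epsilon.
# Every numeric value below is hypothetical, chosen only for illustration.
set.seed(1)                       # make the random draws reproducible
beta0 <- -100                     # hypothetical intercept (kg)
beta1 <- 1                        # hypothetical slope (kg per cm)
height <- rep(170, 5)             # five people with the same height (cm)

epsilon <- rnorm(5, mean = 0, sd = 8)        # random error term
weight  <- beta0 + beta1 * height + epsilon  # stochastic relation
print(round(weight, 1))           # same height, five different weights

# Without the error term the relation would be deterministic:
weight_det <- beta0 + beta1 * height
print(weight_det)                 # the same value for every person
```

With the error term, the five simulated people share one height but not one weight; without it, they would all get exactly the same weight.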
Shape of relation
We often categorize a relation based on the form of the model equation:
Multiple variables were measured about each penguin, such as: island, species, bill depth, bill length, body mass, sex, among others.
Palmer Penguins Dataset
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
| Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
| Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | NA | 2007 |
| Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | NA | 2007 |
| Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | NA | 2007 |
| Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | NA | 2007 |
| Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
| Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male | 2007 |
| Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male | 2007 |
| Adelie | Torgersen | 36.6 | 17.8 | 185 | 3700 | female | 2007 |
| Adelie | Torgersen | 38.7 | 19.0 | 195 | 3450 | female | 2007 |
| Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male | 2007 |
| Adelie | Torgersen | 34.4 | 18.4 | 184 | 3325 | female | 2007 |
| Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male | 2007 |
| Adelie | Biscoe | 37.8 | 18.3 | 174 | 3400 | female | 2007 |
| Adelie | Biscoe | 37.7 | 18.7 | 180 | 3600 | male | 2007 |
| Adelie | Biscoe | 35.9 | 19.2 | 189 | 3800 | female | 2007 |
| Adelie | Biscoe | 38.2 | 18.1 | 185 | 3950 | male | 2007 |
| Adelie | Biscoe | 38.8 | 17.2 | 180 | 3800 | male | 2007 |
| Adelie | Biscoe | 35.3 | 18.9 | 187 | 3800 | female | 2007 |
| Adelie | Biscoe | 40.6 | 18.6 | 183 | 3550 | male | 2007 |
| Adelie | Biscoe | 40.5 | 17.9 | 187 | 3200 | female | 2007 |
| Adelie | Biscoe | 37.9 | 18.6 | 172 | 3150 | female | 2007 |
| Adelie | Biscoe | 40.5 | 18.9 | 180 | 3950 | male | 2007 |
| Adelie | Dream | 39.5 | 16.7 | 178 | 3250 | female | 2007 |
| Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | male | 2007 |
| Adelie | Dream | 39.5 | 17.8 | 188 | 3300 | female | 2007 |
| Adelie | Dream | 40.9 | 18.9 | 184 | 3900 | male | 2007 |
| Adelie | Dream | 36.4 | 17.0 | 195 | 3325 | female | 2007 |
| Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male | 2007 |
| Adelie | Dream | 38.8 | 20.0 | 190 | 3950 | male | 2007 |
| Adelie | Dream | 42.2 | 18.5 | 180 | 3550 | female | 2007 |
| Adelie | Dream | 37.6 | 19.3 | 181 | 3300 | female | 2007 |
| Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male | 2007 |
| Adelie | Dream | 36.5 | 18.0 | 182 | 3150 | female | 2007 |
| Adelie | Dream | 40.8 | 18.4 | 195 | 3900 | male | 2007 |
| Adelie | Dream | 36.0 | 18.5 | 186 | 3100 | female | 2007 |
| Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male | 2007 |
| Adelie | Dream | 37.0 | 16.9 | 185 | 3000 | female | 2007 |
| Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male | 2007 |
| Adelie | Dream | 41.1 | 19.0 | 182 | 3425 | male | 2007 |
| Adelie | Dream | 37.5 | 18.9 | 179 | 2975 | NA | 2007 |
| Adelie | Dream | 36.0 | 17.9 | 190 | 3450 | female | 2007 |
| Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male | 2007 |
| Adelie | Biscoe | 39.6 | 17.7 | 186 | 3500 | female | 2008 |
| Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male | 2008 |
| Adelie | Biscoe | 35.0 | 17.9 | 190 | 3450 | female | 2008 |
| Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male | 2008 |
| Adelie | Biscoe | 34.5 | 18.1 | 187 | 2900 | female | 2008 |
| Adelie | Biscoe | 41.4 | 18.6 | 191 | 3700 | male | 2008 |
| Adelie | Biscoe | 39.0 | 17.5 | 186 | 3550 | female | 2008 |
| Adelie | Biscoe | 40.6 | 18.8 | 193 | 3800 | male | 2008 |
| Adelie | Biscoe | 36.5 | 16.6 | 181 | 2850 | female | 2008 |
| Adelie | Biscoe | 37.6 | 19.1 | 194 | 3750 | male | 2008 |
| Adelie | Biscoe | 35.7 | 16.9 | 185 | 3150 | female | 2008 |
| Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male | 2008 |
| Adelie | Biscoe | 37.6 | 17.0 | 185 | 3600 | female | 2008 |
| Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male | 2008 |
| Adelie | Biscoe | 36.4 | 17.1 | 184 | 2850 | female | 2008 |
| Adelie | Biscoe | 41.6 | 18.0 | 192 | 3950 | male | 2008 |
| Adelie | Biscoe | 35.5 | 16.2 | 195 | 3350 | female | 2008 |
| Adelie | Biscoe | 41.1 | 19.1 | 188 | 4100 | male | 2008 |
| Adelie | Torgersen | 35.9 | 16.6 | 190 | 3050 | female | 2008 |
| Adelie | Torgersen | 41.8 | 19.4 | 198 | 4450 | male | 2008 |
| Adelie | Torgersen | 33.5 | 19.0 | 190 | 3600 | female | 2008 |
| Adelie | Torgersen | 39.7 | 18.4 | 190 | 3900 | male | 2008 |
| Adelie | Torgersen | 39.6 | 17.2 | 196 | 3550 | female | 2008 |
| Adelie | Torgersen | 45.8 | 18.9 | 197 | 4150 | male | 2008 |
| Adelie | Torgersen | 35.5 | 17.5 | 190 | 3700 | female | 2008 |
| Adelie | Torgersen | 42.8 | 18.5 | 195 | 4250 | male | 2008 |
| Adelie | Torgersen | 40.9 | 16.8 | 191 | 3700 | female | 2008 |
| Adelie | Torgersen | 37.2 | 19.4 | 184 | 3900 | male | 2008 |
| Adelie | Torgersen | 36.2 | 16.1 | 187 | 3550 | female | 2008 |
| Adelie | Torgersen | 42.1 | 19.1 | 195 | 4000 | male | 2008 |
| Adelie | Torgersen | 34.6 | 17.2 | 189 | 3200 | female | 2008 |
| Adelie | Torgersen | 42.9 | 17.6 | 196 | 4700 | male | 2008 |
| Adelie | Torgersen | 36.7 | 18.8 | 187 | 3800 | female | 2008 |
| Adelie | Torgersen | 35.1 | 19.4 | 193 | 4200 | male | 2008 |
| Adelie | Dream | 37.3 | 17.8 | 191 | 3350 | female | 2008 |
| Adelie | Dream | 41.3 | 20.3 | 194 | 3550 | male | 2008 |
| Adelie | Dream | 36.3 | 19.5 | 190 | 3800 | male | 2008 |
| Adelie | Dream | 36.9 | 18.6 | 189 | 3500 | female | 2008 |
| Adelie | Dream | 38.3 | 19.2 | 189 | 3950 | male | 2008 |
| Adelie | Dream | 38.9 | 18.8 | 190 | 3600 | female | 2008 |
| Adelie | Dream | 35.7 | 18.0 | 202 | 3550 | female | 2008 |
| Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male | 2008 |
| Adelie | Dream | 34.0 | 17.1 | 185 | 3400 | female | 2008 |
| Adelie | Dream | 39.6 | 18.1 | 186 | 4450 | male | 2008 |
| Adelie | Dream | 36.2 | 17.3 | 187 | 3300 | female | 2008 |
| Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male | 2008 |
| Adelie | Dream | 38.1 | 18.6 | 190 | 3700 | female | 2008 |
| Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male | 2008 |
| Adelie | Dream | 33.1 | 16.1 | 178 | 2900 | female | 2008 |
| Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male | 2008 |
| Adelie | Biscoe | 35.0 | 17.9 | 192 | 3725 | female | 2009 |
| Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male | 2009 |
| Adelie | Biscoe | 37.7 | 16.0 | 183 | 3075 | female | 2009 |
| Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male | 2009 |
| Adelie | Biscoe | 37.9 | 18.6 | 193 | 2925 | female | 2009 |
| Adelie | Biscoe | 39.7 | 18.9 | 184 | 3550 | male | 2009 |
| Adelie | Biscoe | 38.6 | 17.2 | 199 | 3750 | female | 2009 |
| Adelie | Biscoe | 38.2 | 20.0 | 190 | 3900 | male | 2009 |
| Adelie | Biscoe | 38.1 | 17.0 | 181 | 3175 | female | 2009 |
| Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male | 2009 |
| Adelie | Biscoe | 38.1 | 16.5 | 198 | 3825 | female | 2009 |
| Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male | 2009 |
| Adelie | Biscoe | 39.7 | 17.7 | 193 | 3200 | female | 2009 |
| Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male | 2009 |
| Adelie | Biscoe | 39.6 | 20.7 | 191 | 3900 | female | 2009 |
| Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male | 2009 |
| Adelie | Torgersen | 38.6 | 17.0 | 188 | 2900 | female | 2009 |
| Adelie | Torgersen | 37.3 | 20.5 | 199 | 3775 | male | 2009 |
| Adelie | Torgersen | 35.7 | 17.0 | 189 | 3350 | female | 2009 |
| Adelie | Torgersen | 41.1 | 18.6 | 189 | 3325 | male | 2009 |
| Adelie | Torgersen | 36.2 | 17.2 | 187 | 3150 | female | 2009 |
| Adelie | Torgersen | 37.7 | 19.8 | 198 | 3500 | male | 2009 |
| Adelie | Torgersen | 40.2 | 17.0 | 176 | 3450 | female | 2009 |
| Adelie | Torgersen | 41.4 | 18.5 | 202 | 3875 | male | 2009 |
| Adelie | Torgersen | 35.2 | 15.9 | 186 | 3050 | female | 2009 |
| Adelie | Torgersen | 40.6 | 19.0 | 199 | 4000 | male | 2009 |
| Adelie | Torgersen | 38.8 | 17.6 | 191 | 3275 | female | 2009 |
| Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male | 2009 |
| Adelie | Torgersen | 39.0 | 17.1 | 191 | 3050 | female | 2009 |
| Adelie | Torgersen | 44.1 | 18.0 | 210 | 4000 | male | 2009 |
| Adelie | Torgersen | 38.5 | 17.9 | 190 | 3325 | female | 2009 |
| Adelie | Torgersen | 43.1 | 19.2 | 197 | 3500 | male | 2009 |
| Adelie | Dream | 36.8 | 18.5 | 193 | 3500 | female | 2009 |
| Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male | 2009 |
| Adelie | Dream | 38.1 | 17.6 | 187 | 3425 | female | 2009 |
| Adelie | Dream | 41.1 | 17.5 | 190 | 3900 | male | 2009 |
| Adelie | Dream | 35.6 | 17.5 | 191 | 3175 | female | 2009 |
| Adelie | Dream | 40.2 | 20.1 | 200 | 3975 | male | 2009 |
| Adelie | Dream | 37.0 | 16.5 | 185 | 3400 | female | 2009 |
| Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male | 2009 |
| Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female | 2009 |
| Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male | 2009 |
| Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female | 2009 |
| Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male | 2009 |
| Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female | 2009 |
| Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male | 2009 |
| Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male | 2009 |
| Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female | 2009 |
| Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female | 2009 |
| Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male | 2009 |
| Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female | 2009 |
| Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male | 2009 |
| Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female | 2007 |
| Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male | 2007 |
| Gentoo | Biscoe | 48.7 | 14.1 | 210 | 4450 | female | 2007 |
| Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male | 2007 |
| Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male | 2007 |
| Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female | 2007 |
| Gentoo | Biscoe | 45.4 | 14.6 | 211 | 4800 | female | 2007 |
| Gentoo | Biscoe | 46.7 | 15.3 | 219 | 5200 | male | 2007 |
| Gentoo | Biscoe | 43.3 | 13.4 | 209 | 4400 | female | 2007 |
| Gentoo | Biscoe | 46.8 | 15.4 | 215 | 5150 | male | 2007 |
| Gentoo | Biscoe | 40.9 | 13.7 | 214 | 4650 | female | 2007 |
| Gentoo | Biscoe | 49.0 | 16.1 | 216 | 5550 | male | 2007 |
| Gentoo | Biscoe | 45.5 | 13.7 | 214 | 4650 | female | 2007 |
| Gentoo | Biscoe | 48.4 | 14.6 | 213 | 5850 | male | 2007 |
| Gentoo | Biscoe | 45.8 | 14.6 | 210 | 4200 | female | 2007 |
| Gentoo | Biscoe | 49.3 | 15.7 | 217 | 5850 | male | 2007 |
| Gentoo | Biscoe | 42.0 | 13.5 | 210 | 4150 | female | 2007 |
| Gentoo | Biscoe | 49.2 | 15.2 | 221 | 6300 | male | 2007 |
| Gentoo | Biscoe | 46.2 | 14.5 | 209 | 4800 | female | 2007 |
| Gentoo | Biscoe | 48.7 | 15.1 | 222 | 5350 | male | 2007 |
| Gentoo | Biscoe | 50.2 | 14.3 | 218 | 5700 | male | 2007 |
| Gentoo | Biscoe | 45.1 | 14.5 | 215 | 5000 | female | 2007 |
| Gentoo | Biscoe | 46.5 | 14.5 | 213 | 4400 | female | 2007 |
| Gentoo | Biscoe | 46.3 | 15.8 | 215 | 5050 | male | 2007 |
| Gentoo | Biscoe | 42.9 | 13.1 | 215 | 5000 | female | 2007 |
| Gentoo | Biscoe | 46.1 | 15.1 | 215 | 5100 | male | 2007 |
| Gentoo | Biscoe | 44.5 | 14.3 | 216 | 4100 | NA | 2007 |
| Gentoo | Biscoe | 47.8 | 15.0 | 215 | 5650 | male | 2007 |
| Gentoo | Biscoe | 48.2 | 14.3 | 210 | 4600 | female | 2007 |
| Gentoo | Biscoe | 50.0 | 15.3 | 220 | 5550 | male | 2007 |
| Gentoo | Biscoe | 47.3 | 15.3 | 222 | 5250 | male | 2007 |
| Gentoo | Biscoe | 42.8 | 14.2 | 209 | 4700 | female | 2007 |
| Gentoo | Biscoe | 45.1 | 14.5 | 207 | 5050 | female | 2007 |
| Gentoo | Biscoe | 59.6 | 17.0 | 230 | 6050 | male | 2007 |
| Gentoo | Biscoe | 49.1 | 14.8 | 220 | 5150 | female | 2008 |
| Gentoo | Biscoe | 48.4 | 16.3 | 220 | 5400 | male | 2008 |
| Gentoo | Biscoe | 42.6 | 13.7 | 213 | 4950 | female | 2008 |
| Gentoo | Biscoe | 44.4 | 17.3 | 219 | 5250 | male | 2008 |
| Gentoo | Biscoe | 44.0 | 13.6 | 208 | 4350 | female | 2008 |
| Gentoo | Biscoe | 48.7 | 15.7 | 208 | 5350 | male | 2008 |
| Gentoo | Biscoe | 42.7 | 13.7 | 208 | 3950 | female | 2008 |
| Gentoo | Biscoe | 49.6 | 16.0 | 225 | 5700 | male | 2008 |
| Gentoo | Biscoe | 45.3 | 13.7 | 210 | 4300 | female | 2008 |
| Gentoo | Biscoe | 49.6 | 15.0 | 216 | 4750 | male | 2008 |
| Gentoo | Biscoe | 50.5 | 15.9 | 222 | 5550 | male | 2008 |
| Gentoo | Biscoe | 43.6 | 13.9 | 217 | 4900 | female | 2008 |
| Gentoo | Biscoe | 45.5 | 13.9 | 210 | 4200 | female | 2008 |
| Gentoo | Biscoe | 50.5 | 15.9 | 225 | 5400 | male | 2008 |
| Gentoo | Biscoe | 44.9 | 13.3 | 213 | 5100 | female | 2008 |
| Gentoo | Biscoe | 45.2 | 15.8 | 215 | 5300 | male | 2008 |
| Gentoo | Biscoe | 46.6 | 14.2 | 210 | 4850 | female | 2008 |
| Gentoo | Biscoe | 48.5 | 14.1 | 220 | 5300 | male | 2008 |
| Gentoo | Biscoe | 45.1 | 14.4 | 210 | 4400 | female | 2008 |
| Gentoo | Biscoe | 50.1 | 15.0 | 225 | 5000 | male | 2008 |
| Gentoo | Biscoe | 46.5 | 14.4 | 217 | 4900 | female | 2008 |
| Gentoo | Biscoe | 45.0 | 15.4 | 220 | 5050 | male | 2008 |
| Gentoo | Biscoe | 43.8 | 13.9 | 208 | 4300 | female | 2008 |
| Gentoo | Biscoe | 45.5 | 15.0 | 220 | 5000 | male | 2008 |
| Gentoo | Biscoe | 43.2 | 14.5 | 208 | 4450 | female | 2008 |
| Gentoo | Biscoe | 50.4 | 15.3 | 224 | 5550 | male | 2008 |
| Gentoo | Biscoe | 45.3 | 13.8 | 208 | 4200 | female | 2008 |
| Gentoo | Biscoe | 46.2 | 14.9 | 221 | 5300 | male | 2008 |
| Gentoo | Biscoe | 45.7 | 13.9 | 214 | 4400 | female | 2008 |
| Gentoo | Biscoe | 54.3 | 15.7 | 231 | 5650 | male | 2008 |
| Gentoo | Biscoe | 45.8 | 14.2 | 219 | 4700 | female | 2008 |
| Gentoo | Biscoe | 49.8 | 16.8 | 230 | 5700 | male | 2008 |
| Gentoo | Biscoe | 46.2 | 14.4 | 214 | 4650 | NA | 2008 |
| Gentoo | Biscoe | 49.5 | 16.2 | 229 | 5800 | male | 2008 |
| Gentoo | Biscoe | 43.5 | 14.2 | 220 | 4700 | female | 2008 |
| Gentoo | Biscoe | 50.7 | 15.0 | 223 | 5550 | male | 2008 |
| Gentoo | Biscoe | 47.7 | 15.0 | 216 | 4750 | female | 2008 |
| Gentoo | Biscoe | 46.4 | 15.6 | 221 | 5000 | male | 2008 |
| Gentoo | Biscoe | 48.2 | 15.6 | 221 | 5100 | male | 2008 |
| Gentoo | Biscoe | 46.5 | 14.8 | 217 | 5200 | female | 2008 |
| Gentoo | Biscoe | 46.4 | 15.0 | 216 | 4700 | female | 2008 |
| Gentoo | Biscoe | 48.6 | 16.0 | 230 | 5800 | male | 2008 |
| Gentoo | Biscoe | 47.5 | 14.2 | 209 | 4600 | female | 2008 |
| Gentoo | Biscoe | 51.1 | 16.3 | 220 | 6000 | male | 2008 |
| Gentoo | Biscoe | 45.2 | 13.8 | 215 | 4750 | female | 2008 |
| Gentoo | Biscoe | 45.2 | 16.4 | 223 | 5950 | male | 2008 |
| Gentoo | Biscoe | 49.1 | 14.5 | 212 | 4625 | female | 2009 |
| Gentoo | Biscoe | 52.5 | 15.6 | 221 | 5450 | male | 2009 |
| Gentoo | Biscoe | 47.4 | 14.6 | 212 | 4725 | female | 2009 |
| Gentoo | Biscoe | 50.0 | 15.9 | 224 | 5350 | male | 2009 |
| Gentoo | Biscoe | 44.9 | 13.8 | 212 | 4750 | female | 2009 |
| Gentoo | Biscoe | 50.8 | 17.3 | 228 | 5600 | male | 2009 |
| Gentoo | Biscoe | 43.4 | 14.4 | 218 | 4600 | female | 2009 |
| Gentoo | Biscoe | 51.3 | 14.2 | 218 | 5300 | male | 2009 |
| Gentoo | Biscoe | 47.5 | 14.0 | 212 | 4875 | female | 2009 |
| Gentoo | Biscoe | 52.1 | 17.0 | 230 | 5550 | male | 2009 |
| Gentoo | Biscoe | 47.5 | 15.0 | 218 | 4950 | female | 2009 |
| Gentoo | Biscoe | 52.2 | 17.1 | 228 | 5400 | male | 2009 |
| Gentoo | Biscoe | 45.5 | 14.5 | 212 | 4750 | female | 2009 |
| Gentoo | Biscoe | 49.5 | 16.1 | 224 | 5650 | male | 2009 |
| Gentoo | Biscoe | 44.5 | 14.7 | 214 | 4850 | female | 2009 |
| Gentoo | Biscoe | 50.8 | 15.7 | 226 | 5200 | male | 2009 |
| Gentoo | Biscoe | 49.4 | 15.8 | 216 | 4925 | male | 2009 |
| Gentoo | Biscoe | 46.9 | 14.6 | 222 | 4875 | female | 2009 |
| Gentoo | Biscoe | 48.4 | 14.4 | 203 | 4625 | female | 2009 |
| Gentoo | Biscoe | 51.1 | 16.5 | 225 | 5250 | male | 2009 |
| Gentoo | Biscoe | 48.5 | 15.0 | 219 | 4850 | female | 2009 |
| Gentoo | Biscoe | 55.9 | 17.0 | 228 | 5600 | male | 2009 |
| Gentoo | Biscoe | 47.2 | 15.5 | 215 | 4975 | female | 2009 |
| Gentoo | Biscoe | 49.1 | 15.0 | 228 | 5500 | male | 2009 |
| Gentoo | Biscoe | 47.3 | 13.8 | 216 | 4725 | NA | 2009 |
| Gentoo | Biscoe | 46.8 | 16.1 | 215 | 5500 | male | 2009 |
| Gentoo | Biscoe | 41.7 | 14.7 | 210 | 4700 | female | 2009 |
| Gentoo | Biscoe | 53.4 | 15.8 | 219 | 5500 | male | 2009 |
| Gentoo | Biscoe | 43.3 | 14.0 | 208 | 4575 | female | 2009 |
| Gentoo | Biscoe | 48.1 | 15.1 | 209 | 5500 | male | 2009 |
| Gentoo | Biscoe | 50.5 | 15.2 | 216 | 5000 | female | 2009 |
| Gentoo | Biscoe | 49.8 | 15.9 | 229 | 5950 | male | 2009 |
| Gentoo | Biscoe | 43.5 | 15.2 | 213 | 4650 | female | 2009 |
| Gentoo | Biscoe | 51.5 | 16.3 | 230 | 5500 | male | 2009 |
| Gentoo | Biscoe | 46.2 | 14.1 | 217 | 4375 | female | 2009 |
| Gentoo | Biscoe | 55.1 | 16.0 | 230 | 5850 | male | 2009 |
| Gentoo | Biscoe | 44.5 | 15.7 | 217 | 4875 | NA | 2009 |
| Gentoo | Biscoe | 48.8 | 16.2 | 222 | 6000 | male | 2009 |
| Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female | 2009 |
| Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |
| Gentoo | Biscoe | 46.8 | 14.3 | 215 | 4850 | female | 2009 |
| Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male | 2009 |
| Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female | 2009 |
| Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male | 2009 |
| Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
| Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
| Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
| Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
| Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
| Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
| Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
| Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
| Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
| Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
| Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
| Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
| Chinstrap | Dream | 52.0 | 18.1 | 201 | 4050 | male | 2007 |
| Chinstrap | Dream | 45.9 | 17.1 | 190 | 3575 | female | 2007 |
| Chinstrap | Dream | 50.5 | 19.6 | 201 | 4050 | male | 2007 |
| Chinstrap | Dream | 50.3 | 20.0 | 197 | 3300 | male | 2007 |
| Chinstrap | Dream | 58.0 | 17.8 | 181 | 3700 | female | 2007 |
| Chinstrap | Dream | 46.4 | 18.6 | 190 | 3450 | female | 2007 |
| Chinstrap | Dream | 49.2 | 18.2 | 195 | 4400 | male | 2007 |
| Chinstrap | Dream | 42.4 | 17.3 | 181 | 3600 | female | 2007 |
| Chinstrap | Dream | 48.5 | 17.5 | 191 | 3400 | male | 2007 |
| Chinstrap | Dream | 43.2 | 16.6 | 187 | 2900 | female | 2007 |
| Chinstrap | Dream | 50.6 | 19.4 | 193 | 3800 | male | 2007 |
| Chinstrap | Dream | 46.7 | 17.9 | 195 | 3300 | female | 2007 |
| Chinstrap | Dream | 52.0 | 19.0 | 197 | 4150 | male | 2007 |
| Chinstrap | Dream | 50.5 | 18.4 | 200 | 3400 | female | 2008 |
| Chinstrap | Dream | 49.5 | 19.0 | 200 | 3800 | male | 2008 |
| Chinstrap | Dream | 46.4 | 17.8 | 191 | 3700 | female | 2008 |
| Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male | 2008 |
| Chinstrap | Dream | 40.9 | 16.6 | 187 | 3200 | female | 2008 |
| Chinstrap | Dream | 54.2 | 20.8 | 201 | 4300 | male | 2008 |
| Chinstrap | Dream | 42.5 | 16.7 | 187 | 3350 | female | 2008 |
| Chinstrap | Dream | 51.0 | 18.8 | 203 | 4100 | male | 2008 |
| Chinstrap | Dream | 49.7 | 18.6 | 195 | 3600 | male | 2008 |
| Chinstrap | Dream | 47.5 | 16.8 | 199 | 3900 | female | 2008 |
| Chinstrap | Dream | 47.6 | 18.3 | 195 | 3850 | female | 2008 |
| Chinstrap | Dream | 52.0 | 20.7 | 210 | 4800 | male | 2008 |
| Chinstrap | Dream | 46.9 | 16.6 | 192 | 2700 | female | 2008 |
| Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male | 2008 |
| Chinstrap | Dream | 49.0 | 19.5 | 210 | 3950 | male | 2008 |
| Chinstrap | Dream | 46.2 | 17.5 | 187 | 3650 | female | 2008 |
| Chinstrap | Dream | 50.9 | 19.1 | 196 | 3550 | male | 2008 |
| Chinstrap | Dream | 45.5 | 17.0 | 196 | 3500 | female | 2008 |
| Chinstrap | Dream | 50.9 | 17.9 | 196 | 3675 | female | 2009 |
| Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male | 2009 |
| Chinstrap | Dream | 50.1 | 17.9 | 190 | 3400 | female | 2009 |
| Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male | 2009 |
| Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male | 2009 |
| Chinstrap | Dream | 49.8 | 17.3 | 198 | 3675 | female | 2009 |
| Chinstrap | Dream | 48.1 | 16.4 | 199 | 3325 | female | 2009 |
| Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male | 2009 |
| Chinstrap | Dream | 45.7 | 17.3 | 193 | 3600 | female | 2009 |
| Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male | 2009 |
| Chinstrap | Dream | 42.5 | 17.3 | 187 | 3350 | female | 2009 |
| Chinstrap | Dream | 52.2 | 18.8 | 197 | 3450 | male | 2009 |
| Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
| Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Palmer Penguins in R
The data is available in the package palmerpenguins.
You can install it using install.packages('palmerpenguins')
Accessing the data:
Load the library with library(palmerpenguins).
The data will be stored in the variable penguins.
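The steps above can be sketched as follows. This is a minimal sketch that assumes the palmerpenguins package can be installed from CRAN; the `has_pkg` guard is an addition for illustration so that the snippet does not fail when the package is missing.

```r
# Install once if the package is missing (uncomment to run):
# install.packages("palmerpenguins")

# Check whether the package is available before loading it
has_pkg <- requireNamespace("palmerpenguins", quietly = TRUE)

if (has_pkg) {
  library(palmerpenguins)          # makes the data frame `penguins` available
  print(dim(penguins))             # number of rows and columns
  print(head(penguins))            # first six rows
  print(colSums(is.na(penguins)))  # number of missing values per column
} else {
  message("palmerpenguins is not installed; see install.packages() above.")
}
```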
Flipper Length vs Body Mass
Suppose we want to use Penguins’ Flipper Length to explain their Body Mass;
This is a stochastic relation;
Many penguins have a flipper length of \(190\) mm, but their weights range from \(3050\) g to \(4600\) g.
Other aspects affect body mass besides flipper length.
The relation is “almost” linear;
A straight-line model could be a reasonable approximation for this relation.
This model is called Simple Linear Regression.
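To make the variability concrete, here is a small sketch using six penguins copied from the data table in these slides (rows 6, 10, 37, 46, 49 and 69), all with a flipper length of 190 mm. The vector name `mass_190` is just an illustrative choice.

```r
# Body mass (g) of six penguins that all have a 190 mm flipper
# (rows 6, 10, 37, 46, 49 and 69 of the data table).
mass_190 <- c(3650, 4250, 3950, 4600, 3450, 3050)

print(range(mass_190))        # smallest and largest body mass in this group
print(diff(range(mass_190)))  # spread between them, in grams
print(mean(mass_190))         # average body mass at this flipper length
```

The same value of \(X\) comes with many different values of \(Y\), which is why a single deterministic equation cannot describe these data.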
Simple Linear Regression
Specification
We have two variables:
Response variable (\(Y\)): what we are trying to predict/explain;
Explanatory Variable (\(X\)): the variable used to predict the response;
We have \(n\) observations/samples.
\(Y_i\) and \(X_i\) denote the values of \(Y\) and \(X\) for observation \(i\).
You can think of it as \(n\) pairs: \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\).
For example, for the penguins, \(i\) refers to each penguin and can vary from 1 to 344.
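| i | flipper_length_mm (X) | body_mass_g (Y) |
|---|---|---|
| 1 | 181 | 3750 |
| 2 | 186 | 3800 |
| 3 | 195 | 3250 |
| 4 | NA | NA |
| 5 | 193 | 3450 |
| 6 | 190 | 3650 |
| 7 | 181 | 3625 |
| 8 | 195 | 4675 |
| 9 | 193 | 3475 |
| 10 | 190 | 4250 |
| 11 | 186 | 3300 |
| 12 | 180 | 3700 |
| 13 | 182 | 3200 |
| 14 | 191 | 3800 |
| 15 | 198 | 4400 |
| 16 | 185 | 3700 |
| 17 | 195 | 3450 |
| 18 | 197 | 4500 |
| 19 | 184 | 3325 |
| 20 | 194 | 4200 |
| 21 | 174 | 3400 |
| 22 | 180 | 3600 |
| 23 | 189 | 3800 |
| 24 | 185 | 3950 |
| 25 | 180 | 3800 |
| 26 | 187 | 3800 |
| 27 | 183 | 3550 |
| 28 | 187 | 3200 |
| 29 | 172 | 3150 |
| 30 | 180 | 3950 |
| 31 | 178 | 3250 |
| 32 | 178 | 3900 |
| 33 | 188 | 3300 |
| 34 | 184 | 3900 |
| 35 | 195 | 3325 |
| 36 | 196 | 4150 |
| 37 | 190 | 3950 |
| 38 | 180 | 3550 |
| 39 | 181 | 3300 |
| 40 | 184 | 4650 |
| 41 | 182 | 3150 |
| 42 | 195 | 3900 |
| 43 | 186 | 3100 |
| 44 | 196 | 4400 |
| 45 | 185 | 3000 |
| 46 | 190 | 4600 |
| 47 | 182 | 3425 |
| 48 | 179 | 2975 |
| 49 | 190 | 3450 |
| 50 | 191 | 4150 |
| 51 | 186 | 3500 |
| 52 | 188 | 4300 |
| 53 | 190 | 3450 |
| 54 | 200 | 4050 |
| 55 | 187 | 2900 |
| 56 | 191 | 3700 |
| 57 | 186 | 3550 |
| 58 | 193 | 3800 |
| 59 | 181 | 2850 |
| 60 | 194 | 3750 |
| 61 | 185 | 3150 |
| 62 | 195 | 4400 |
| 63 | 185 | 3600 |
| 64 | 192 | 4050 |
| 65 | 184 | 2850 |
| 66 | 192 | 3950 |
| 67 | 195 | 3350 |
| 68 | 188 | 4100 |
| 69 | 190 | 3050 |
| 70 | 198 | 4450 |
| 71 | 190 | 3600 |
| 72 | 190 | 3900 |
| 73 | 196 | 3550 |
| 74 | 197 | 4150 |
| 75 | 190 | 3700 |
| 76 | 195 | 4250 |
| 77 | 191 | 3700 |
| 78 | 184 | 3900 |
| 79 | 187 | 3550 |
| 80 | 195 | 4000 |
| 81 | 189 | 3200 |
| 82 | 196 | 4700 |
| 83 | 187 | 3800 |
| 84 | 193 | 4200 |
| 85 | 191 | 3350 |
| 86 | 194 | 3550 |
| 87 | 190 | 3800 |
| 88 | 189 | 3500 |
| 89 | 189 | 3950 |
| 90 | 190 | 3600 |
| 91 | 202 | 3550 |
| 92 | 205 | 4300 |
| 93 | 185 | 3400 |
| 94 | 186 | 4450 |
| 95 | 187 | 3300 |
| 96 | 208 | 4300 |
| 97 | 190 | 3700 |
| 98 | 196 | 4350 |
| 99 | 178 | 2900 |
| 100 | 192 | 4100 |
| 101 | 192 | 3725 |
| 102 | 203 | 4725 |
| 103 | 183 | 3075 |
| 104 | 190 | 4250 |
| 105 | 193 | 2925 |
| 106 | 184 | 3550 |
| 107 | 199 | 3750 |
| 108 | 190 | 3900 |
| 109 | 181 | 3175 |
| 110 | 197 | 4775 |
| 111 | 198 | 3825 |
| 112 | 191 | 4600 |
| 113 | 193 | 3200 |
| 114 | 197 | 4275 |
| 115 | 191 | 3900 |
| 116 | 196 | 4075 |
| 117 | 188 | 2900 |
| 118 | 199 | 3775 |
| 119 | 189 | 3350 |
| 120 | 189 | 3325 |
| 121 | 187 | 3150 |
| 122 | 198 | 3500 |
| 123 | 176 | 3450 |
| 124 | 202 | 3875 |
| 125 | 186 | 3050 |
| 126 | 199 | 4000 |
| 127 | 191 | 3275 |
| 128 | 195 | 4300 |
| 129 | 191 | 3050 |
| 130 | 210 | 4000 |
| 131 | 190 | 3325 |
| 132 | 197 | 3500 |
| 133 | 193 | 3500 |
| 134 | 199 | 4475 |
| 135 | 187 | 3425 |
| 136 | 190 | 3900 |
| 137 | 191 | 3175 |
| 138 | 200 | 3975 |
| 139 | 185 | 3400 |
| 140 | 193 | 4250 |
| 141 | 193 | 3400 |
| 142 | 187 | 3475 |
| 143 | 188 | 3050 |
| 144 | 190 | 3725 |
| 145 | 192 | 3000 |
| 146 | 185 | 3650 |
| 147 | 190 | 4250 |
| 148 | 184 | 3475 |
| 149 | 195 | 3450 |
| 150 | 193 | 3750 |
| 151 | 187 | 3700 |
| 152 | 201 | 4000 |
| 153 | 211 | 4500 |
| 154 | 230 | 5700 |
| 155 | 210 | 4450 |
| 156 | 218 | 5700 |
| 157 | 215 | 5400 |
| 158 | 210 | 4550 |
| 159 | 211 | 4800 |
| 160 | 219 | 5200 |
| 161 | 209 | 4400 |
| 162 | 215 | 5150 |
| 163 | 214 | 4650 |
| 164 | 216 | 5550 |
| 165 | 214 | 4650 |
| 166 | 213 | 5850 |
| 167 | 210 | 4200 |
| 168 | 217 | 5850 |
| 169 | 210 | 4150 |
| 170 | 221 | 6300 |
| 171 | 209 | 4800 |
| 172 | 222 | 5350 |
| 173 | 218 | 5700 |
| 174 | 215 | 5000 |
| 175 | 213 | 4400 |
| 176 | 215 | 5050 |
| 177 | 215 | 5000 |
| 178 | 215 | 5100 |
| 179 | 216 | 4100 |
| 180 | 215 | 5650 |
| 181 | 210 | 4600 |
| 182 | 220 | 5550 |
| 183 | 222 | 5250 |
| 184 | 209 | 4700 |
| 185 | 207 | 5050 |
| 186 | 230 | 6050 |
| 187 | 220 | 5150 |
| 188 | 220 | 5400 |
| 189 | 213 | 4950 |
| 190 | 219 | 5250 |
| 191 | 208 | 4350 |
| 192 | 208 | 5350 |
| 193 | 208 | 3950 |
| 194 | 225 | 5700 |
| 195 | 210 | 4300 |
| 196 | 216 | 4750 |
| 197 | 222 | 5550 |
| 198 | 217 | 4900 |
| 199 | 210 | 4200 |
| 200 | 225 | 5400 |
| 201 | 213 | 5100 |
| 202 | 215 | 5300 |
| 203 | 210 | 4850 |
| 204 | 220 | 5300 |
| 205 | 210 | 4400 |
| 206 | 225 | 5000 |
| 207 | 217 | 4900 |
| 208 | 220 | 5050 |
| 209 | 208 | 4300 |
| 210 | 220 | 5000 |
| 211 | 208 | 4450 |
| 212 | 224 | 5550 |
| 213 | 208 | 4200 |
| 214 | 221 | 5300 |
| 215 | 214 | 4400 |
| 216 | 231 | 5650 |
| 217 | 219 | 4700 |
| 218 | 230 | 5700 |
| 219 | 214 | 4650 |
| 220 | 229 | 5800 |
| 221 | 220 | 4700 |
| 222 | 223 | 5550 |
| 223 | 216 | 4750 |
| 224 | 221 | 5000 |
| 225 | 221 | 5100 |
| 226 | 217 | 5200 |
| 227 | 216 | 4700 |
| 228 | 230 | 5800 |
| 229 | 209 | 4600 |
| 230 | 220 | 6000 |
| 231 | 215 | 4750 |
| 232 | 223 | 5950 |
| 233 | 212 | 4625 |
| 234 | 221 | 5450 |
| 235 | 212 | 4725 |
| 236 | 224 | 5350 |
| 237 | 212 | 4750 |
| 238 | 228 | 5600 |
| 239 | 218 | 4600 |
| 240 | 218 | 5300 |
| 241 | 212 | 4875 |
| 242 | 230 | 5550 |
| 243 | 218 | 4950 |
| 244 | 228 | 5400 |
| 245 | 212 | 4750 |
| 246 | 224 | 5650 |
| 247 | 214 | 4850 |
| 248 | 226 | 5200 |
| 249 | 216 | 4925 |
| 250 | 222 | 4875 |
| 251 | 203 | 4625 |
| 252 | 225 | 5250 |
| 253 | 219 | 4850 |
| 254 | 228 | 5600 |
| 255 | 215 | 4975 |
| 256 | 228 | 5500 |
| 257 | 216 | 4725 |
| 258 | 215 | 5500 |
| 259 | 210 | 4700 |
| 260 | 219 | 5500 |
| 261 | 208 | 4575 |
| 262 | 209 | 5500 |
| 263 | 216 | 5000 |
| 264 | 229 | 5950 |
| 265 | 213 | 4650 |
| 266 | 230 | 5500 |
| 267 | 217 | 4375 |
| 268 | 230 | 5850 |
| 269 | 217 | 4875 |
| 270 | 222 | 6000 |
| 271 | 214 | 4925 |
| 272 | NA | NA |
| 273 | 215 | 4850 |
| 274 | 222 | 5750 |
| 275 | 212 | 5200 |
| 276 | 213 | 5400 |
| 277 | 192 | 3500 |
| 278 | 196 | 3900 |
| 279 | 193 | 3650 |
| 280 | 188 | 3525 |
| 281 | 197 | 3725 |
| 282 | 198 | 3950 |
| 283 | 178 | 3250 |
| 284 | 197 | 3750 |
| 285 | 195 | 4150 |
| 286 | 198 | 3700 |
| 287 | 193 | 3800 |
| 288 | 194 | 3775 |
| 289 | 185 | 3700 |
| 290 | 201 | 4050 |
| 291 | 190 | 3575 |
| 292 | 201 | 4050 |
| 293 | 197 | 3300 |
| 294 | 181 | 3700 |
| 295 | 190 | 3450 |
| 296 | 195 | 4400 |
| 297 | 181 | 3600 |
| 298 | 191 | 3400 |
| 299 | 187 | 2900 |
| 300 | 193 | 3800 |
| 301 | 195 | 3300 |
| 302 | 197 | 4150 |
| 303 | 200 | 3400 |
| 304 | 200 | 3800 |
| 305 | 191 | 3700 |
| 306 | 205 | 4550 |
| 307 | 187 | 3200 |
| 308 | 201 | 4300 |
| 309 | 187 | 3350 |
| 310 | 203 | 4100 |
| 311 | 195 | 3600 |
| 312 | 199 | 3900 |
| 313 | 195 | 3850 |
| 314 | 210 | 4800 |
| 315 | 192 | 2700 |
| 316 | 205 | 4500 |
| 317 | 210 | 3950 |
| 318 | 187 | 3650 |
| 319 | 196 | 3550 |
| 320 | 196 | 3500 |
| 321 | 196 | 3675 |
| 322 | 201 | 4450 |
| 323 | 190 | 3400 |
| 324 | 212 | 4300 |
| 325 | 187 | 3250 |
| 326 | 198 | 3675 |
| 327 | 199 | 3325 |
| 328 | 201 | 3950 |
| 329 | 193 | 3600 |
| 330 | 203 | 4050 |
| 331 | 187 | 3350 |
| 332 | 197 | 3450 |
| 333 | 191 | 3250 |
| 334 | 203 | 4050 |
| 335 | 202 | 3800 |
| 336 | 194 | 3525 |
| 337 | 206 | 3950 |
| 338 | 189 | 3650 |
| 339 | 195 | 3650 |
| 340 | 207 | 4000 |
| 341 | 202 | 3400 |
| 342 | 193 | 3775 |
| 343 | 210 | 4100 |
| 344 | 198 | 3775 |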
The explanatory variable \(X\) is assumed to be fixed for each individual/sample;
The model
The model relating \(X\) and \(Y\) is given by the equation of a line: \[
Y_i = \beta_0 + \beta_1 X_i+ \varepsilon_i
\]
Let’s discuss each of these components in more detail;
The model: components
The model relating \(X\) and \(Y\) is given by the equation of a line: \[
Y_i = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X_i + \underbrace{\varepsilon_i}_{\text{error term}}
\]
Intercept: tells us the \(Y\) value when \(X=0\) (i.e., the value of \(Y\) when the line crosses the \(Y\)-axis).
Slope: tells us how much change in \(Y\) to expect for a unit increase in \(X\).
Error: captures the variability of the response not explained by the model.
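As a short worked step, the slope interpretation can be derived by comparing the expected response at \(X = x\) and at \(X = x + 1\). This sketch assumes the error term has mean zero, \(E[\varepsilon_i] = 0\), so that \(E[Y \mid X = x] = \beta_0 + \beta_1 x\): \[
E[Y \mid X = x + 1] - E[Y \mid X = x] = \bigl(\beta_0 + \beta_1 (x + 1)\bigr) - \bigl(\beta_0 + \beta_1 x\bigr) = \beta_1
\]
Setting \(x = 0\) in the same expression gives \(E[Y \mid X = 0] = \beta_0\), which is the intercept.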
Fitting
What values to use for \(\beta_0\) and \(\beta_1\)?
Slope: an increase of 1 unit in \(X\) is associated with an expected increase of \(\beta_1\) units in \(Y\).
It is associated with, not the cause of!
Intercept: the average value of \(Y\) when \(X = 0\) is \(\beta_0\).
Usually, we don’t care as much about this parameter.
Important: Association is not causality.
In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.
Example: Fitting a model
Let’s fit a linear model relating Flipper Length and Body Mass of penguins.
To fit a linear model in R, we use the lm function.
Let’s explore this function in a code demo!
Code demo - Part I!
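A minimal sketch of what such a demo might look like. To stay self-contained it uses only twelve penguins copied from the data table in these slides (rows 1 to 3 and 5 to 8, which are Adelie, and rows 153 to 157, which are Gentoo), so the estimates will differ from those obtained with the full dataset. The name `peng_small` is a hypothetical choice.

```r
# Twelve penguins copied from the data table in these slides:
# rows 1-3 and 5-8 (Adelie) and rows 153-157 (Gentoo).
peng_small <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 190, 181, 195, 211, 230, 210, 218, 215),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 4500, 5700, 4450, 5700, 5400)
)

# Formula interface: response ~ explanatory variable
fit <- lm(body_mass_g ~ flipper_length_mm, data = peng_small)

print(coef(fit))     # estimated intercept and slope
print(summary(fit))  # standard errors, t values, p-values, R-squared
```

The formula `body_mass_g ~ flipper_length_mm` reads as "body mass explained by flipper length"; `lm` adds the intercept automatically.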
Fitting is estimation
\(E[Y|X] = \beta_0 + \beta_1 X\) is the population’s conditional mean for a given value of \(X\).
But instead of estimating the mean for each value of \(X\) separately, which is not feasible, we are assuming a linear structure between the mean of the population and the value of \(X\).
Therefore, estimating the mean for every value of \(X\) (in a certain range) reduces to estimating \(\beta_0\) and \(\beta_1\).
Population vs Sample Regression
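Population Regression: \[
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
\]
Sample Regression (estimated from the sample): \[
Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i, \qquad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
\]
Here \(\hat{Y}_i\) is the fitted value and \(e_i = Y_i - \hat{Y}_i\) is the residual.
Note that, in general, \(e_i\neq\varepsilon_i\) because \(\hat{\beta}_0\neq \beta_0\) and \(\hat{\beta}_1\neq \beta_1\).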
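The difference between the error term and the residual can be sketched with simulated data, where the "true" parameters are known. The values of `beta0`, `beta1`, the error standard deviation and the grid of `x` values are all made up for illustration.

```r
set.seed(42)                         # reproducible random draws
n     <- 50
x     <- seq(170, 230, length.out = n)  # fixed covariate values
beta0 <- -5800                       # hypothetical "true" intercept
beta1 <- 50                          # hypothetical "true" slope

eps <- rnorm(n, mean = 0, sd = 400)  # true errors (never observed in practice)
y   <- beta0 + beta1 * x + eps       # population regression

fit <- lm(y ~ x)                     # sample regression
e   <- residuals(fit)                # residuals: y minus fitted values

print(coef(fit))                     # estimates, not equal to (beta0, beta1)
print(head(cbind(eps, e)))           # residuals differ from the true errors
print(sum(e))                        # numerically zero for a fit with intercept
```

In real data only the residuals `e` are available; the errors `eps` can be inspected here only because the data were simulated.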
The parameters \(\beta_0\) and \(\beta_1\)
Since \(\beta_0\) and \(\beta_1\) are parameters, we estimate them based on a sample;
\(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimators of \(\beta_0\) and \(\beta_1\).
As statistics, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) depend on the sample; therefore, we will need their sampling distributions.
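The idea of a sampling distribution can be sketched by simulation: draw many samples from the same (hypothetical) population model, refit the line each time, and look at how the estimated slope varies. All parameter values are made up, and the formula used for `se_theory`, \(\sigma / \sqrt{\sum (X_i - \bar{X})^2}\), is the standard result for this model rather than something stated in these slides.

```r
set.seed(123)                       # reproducible random draws
x     <- seq(170, 230, length.out = 40)  # fixed covariate values
beta0 <- -5800                      # hypothetical "true" intercept
beta1 <- 50                         # hypothetical "true" slope
sigma <- 400                        # hypothetical error standard deviation

# Draw 1000 samples from the same model and keep each estimated slope
slopes <- replicate(1000, {
  y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
  coef(lm(y ~ x))[["x"]]
})

# Standard error of the slope implied by the model
se_theory <- sigma / sqrt(sum((x - mean(x))^2))

print(mean(slopes))   # centred near the true slope
print(sd(slopes))     # spread of the estimates across samples
print(se_theory)      # close to the simulated spread
```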
Inference for \(\hat{\beta}_0\)
You do not need to memorize these formulae.
Estimator of \(\beta_0\): \[\widehat{\beta}_0 = \bar{Y}-\widehat{\beta}_1\bar{X}\]
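As a check, the estimator above can be computed by hand and compared with `lm`. The sketch assumes the usual least-squares slope, \(\widehat{\beta}_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2\), which is not shown in this part of the slides. The data are twelve penguins copied from the data table (rows 1 to 3, 5 to 8 and 153 to 157).

```r
# Flipper length (mm) and body mass (g) for twelve penguins from the table
x <- c(181, 186, 195, 193, 190, 181, 195, 211, 230, 210, 218, 215)
y <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 4500, 5700, 4450, 5700, 5400)

# Least-squares slope, then the intercept from the formula above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)            # the same estimates computed by lm

print(c(b0 = b0, b1 = b1))  # hand-computed estimates
print(coef(fit))            # estimates from lm
```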
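# Load the packages the code relies on
library(palmerpenguins)  # provides the penguins data
library(tidyverse)       # provides %>% and drop_na()

# Fit the model
penguins_lm <- lm(body_mass_g ~ flipper_length_mm, data = penguins %>% drop_na())

# Extract the CI
confint(penguins_lm, level = 0.95)

| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -6482.47 | -5261.71 |
| flipper_length_mm | 47.12 | 53.18 |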
With 95% confidence, each additional 1 mm of flipper length is associated with an expected increase in body mass of between 47.12 and 53.18 grams.
We can also test whether \(Y\) and \(X\) are linearly associated. This is equivalent to testing \[
H_0: \beta_1 = 0\quad \text{vs} \quad H_1: \beta_1\neq 0
\]
We can use a one-sided alternative instead, \(H_1: \beta_1 > 0\) or \(H_1: \beta_1 < 0\), when \(\beta_1\) cannot be negative or cannot be positive, respectively.
Hypothesis Test: Null Model
Test statistic and its distribution under \(H_0\) (the null model): \[
T = \frac{\hat{\beta}_1-0}{\widehat{SE}\left(\hat{\beta}_1\right)} \sim t_{n-2}
\]
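For the penguins, the 95% confidence interval for \(\beta_1\), from 47.12 to 53.18, does not contain 0. Therefore, at the 5% significance level, we reject the null hypothesis that \(\beta_1 = 0\), in favour of \(H_1: \beta_1\neq 0\).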
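The test can be sketched in R by computing the statistic by hand and comparing it with the output of `summary`. The data are twelve penguins copied from the data table (rows 1 to 3, 5 to 8 and 153 to 157), so the numbers will differ from those for the full dataset.

```r
# Flipper length (mm) and body mass (g) for twelve penguins from the table
x <- c(181, 186, 195, 193, 190, 181, 195, 211, 230, 210, 218, 215)
y <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 4500, 5700, 4450, 5700, 5400)

fit <- lm(y ~ x)
est <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)

b1  <- est["x", "Estimate"]     # estimated slope
se1 <- est["x", "Std. Error"]   # estimated standard error of the slope
n   <- length(y)

t_stat <- (b1 - 0) / se1        # test statistic under H0: beta1 = 0
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # two-sided p-value

print(c(t = t_stat, p = p_val))            # computed by hand
print(est["x", c("t value", "Pr(>|t|)")])  # reported by summary()
```

The p-value uses the \(t_{n-2}\) distribution from the null model above; for a one-sided alternative, only one tail would be used.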
Example: range problem
Code demo - Part 4!
The range problem
The linear model assumes that the relationship between \(X\) and \(E[Y|X]\) is linear, which may or may not be true;
Sometimes, there’s a linear association only in part of the data range.
The linear model could still be useful when restricted to that specific range;
We need to exercise caution when using the model outside the range of the data, as the relationship between \(X\) and \(Y\) may differ significantly.
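A small sketch of why extrapolation is risky, again using twelve penguins copied from the data table (rows 1 to 3, 5 to 8 and 153 to 157). The flipper lengths 100 mm and 0 mm are deliberately far outside the observed range and are chosen only for illustration.

```r
# Twelve penguins copied from the data table in these slides
peng_small <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 190, 181, 195, 211, 230, 210, 218, 215),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 4500, 5700, 4450, 5700, 5400)
)
fit <- lm(body_mass_g ~ flipper_length_mm, data = peng_small)

print(range(peng_small$flipper_length_mm))  # observed range of X

# One value inside the observed range, two far outside it
new_x <- data.frame(flipper_length_mm = c(200, 100, 0))
pred  <- predict(fit, newdata = new_x)

print(cbind(new_x, predicted_body_mass_g = pred))
```

Inside the observed range the prediction is plausible, while the two extrapolated predictions are negative body masses, which is impossible.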
Regression vs Correlation analysis
Correlation analysis: we’re interested in the strength of linear association between two variables;
no distinction between the two variables (no response and no covariate);
both variables are assumed to be stochastic;
Linear Regression: we’re interested in estimating the conditional average of the response given the value of the covariate.
covariate is assumed to be non-stochastic;
one of the variables is treated as a response and the other as a covariate;
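The contrast can be sketched in R: the correlation is the same whichever variable comes first, whereas the regression slope changes when the roles of response and covariate are swapped. The data are twelve penguins copied from the data table (rows 1 to 3, 5 to 8 and 153 to 157). The identity slope \(= r \cdot s_Y / s_X\) is a standard fact that is not stated in these slides.

```r
# Flipper length (mm) and body mass (g) for twelve penguins from the table
x <- c(181, 186, 195, 193, 190, 181, 195, 211, 230, 210, 218, 215)
y <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 4500, 5700, 4450, 5700, 5400)

r_xy <- cor(x, y)   # correlation: no response, no covariate
r_yx <- cor(y, x)   # same value with the variables swapped

b_yx <- coef(lm(y ~ x))[["x"]]   # slope with body mass as the response
b_xy <- coef(lm(x ~ y))[["y"]]   # slope with flipper length as the response

print(c(r_xy = r_xy, r_yx = r_yx))   # identical
print(c(b_yx = b_yx, b_xy = b_xy))   # different
print(r_xy * sd(y) / sd(x))          # equals b_yx
```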
Disclaimer
Warning
All the models in this course are approximations of reality. None of these models will be true or correct, but hopefully they will still be useful.
There’s a famous quote from George E. P. Box:
“Essentially, all models are wrong, but some are useful.”
Categorical Covariate?
Note that we can also have a categorical covariate.
We can try to explain the body mass of Penguins based on the sex of penguins: male or female.\[
\text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i
\]
But wait! How can we have categories in an equation?
Dummy variables (2 categories)
We can encode the categories as numeric variables, called dummy variables.
For example, the variable sex could be defined as:
\[
\text{sex}_i =
\begin{cases}
0 & \text{if penguin $i$ is female}\\
1 & \text{if penguin $i$ is male}
\end{cases}
\]
A variable patient_status could be defined as:
\[
\text{patient_status}_i =
\begin{cases}
0 & \text{if patient $i$ is healthy}\\
1 & \text{if patient $i$ is sick}
\end{cases}
\]
If we have a variable with two categories, we need only 1 dummy variable to represent it.
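The encoding can be sketched in R using the sex of the first six penguins in the data table (the fourth is missing). The names `sex_dummy` and `sex_f` are illustrative. The second half shows that R builds the same 0/1 column automatically when the variable is a factor.

```r
# Sex of the first six penguins in the data table (row 4 is missing)
sex <- c("male", "female", "female", NA, "female", "male")

# Manual encoding: 0 = female, 1 = male (a missing value stays missing)
sex_dummy <- ifelse(sex == "male", 1, 0)
print(sex_dummy)

# R creates the same coding from a factor; the first level is the baseline (0)
sex_f <- factor(sex, levels = c("female", "male"))
mm <- model.matrix(~ sex_f)   # rows with missing values are dropped
print(mm)
```

The column `sex_fmale` in the design matrix is the single dummy variable; the column `(Intercept)` corresponds to \(\beta_0\).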
Back to the model
We can try to explain the body mass of Penguins based on the sex of penguins: male or female.\[
\text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i
\]
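A minimal sketch of fitting this model, using eleven penguins with known sex copied from the data table (rows 1 to 3, 5 to 8 and 13 to 16). With a single 0/1 covariate, the fitted intercept equals the sample mean of the group coded 0 and the fitted slope equals the difference between the two group means; this is a standard property of least squares that is not stated in this part of the slides.

```r
# Eleven penguins with known sex, copied from the data table
peng_sex <- data.frame(
  body_mass_g = c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800, 4400, 3700),
  sex = c("male", "female", "female", "female", "male", "female",
          "male", "female", "male", "male", "female")
)

# Dummy variable: 0 = female, 1 = male
peng_sex$sex_dummy <- as.integer(peng_sex$sex == "male")

fit <- lm(body_mass_g ~ sex_dummy, data = peng_sex)
b   <- coef(fit)

# Sample mean body mass within each sex
group_means <- tapply(peng_sex$body_mass_g, peng_sex$sex, mean)

print(b)             # intercept and coefficient of the dummy
print(group_means)   # mean body mass for female and male penguins
print(b[[1]] + b[[2]])  # intercept + slope, equal to the male mean
```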