Simple Linear Regression

STAT 301 - Lecture 01

Welcome to STAT 301

STAT 301 - Intro

  • In most of STAT 201, we dealt with estimation and hypothesis testing for a single parameter.

  • In this course, you will study the association between variables.

  • We will explore the world of explanatory and predictive modelling.

  • You’ll learn about a variety of different models, how to interpret the results and evaluate model performance.

Let’s explore our Canvas page

Relations between variables

Relations

  • In most of STAT 201, we focused on the estimation of one parameter of a population:
    • the average income of a group of a population;
    • the median diameter of certain trees;
  • Often, we are interested in how a variable relates to other variables:
    • What do we expect to happen with the youth unemployment rate if there’s an increase in minimum wage?

Example 1: Laffer curve

  • The Laffer curve relates the tax rate and the government revenues:

Example 2: Price vs demand

  • A well-established fact in Economics is that the more expensive a product is, the smaller the demand for that product.

Deterministic Relations

  • In a deterministic relationship, the value of one variable is entirely determined by the value of another variable.

  • There is no uncertainty!

Deterministic Relations: Example 1

  • Einstein’s mass-energy relation: \(\quad E = m\times c^2\)
    • \(c\) is just a constant (speed of light)

Deterministic Relations: Example 2

  • A circle’s area-radius relation: \(A=\pi r^2\).
    • \(\pi\) is just a constant.

Stochastic Relations

  • Stochastic relations refer to situations in which the outcome of a given input cannot be precisely predicted.

  • There’s uncertainty!

  • The uncertainty might be due to other variables not being considered or even noise.

Stochastic Relations: Example 1

  • Height and weight of individuals: taller people tend to weigh more, but there is considerable variability.  

  • Other factors like age, gender, and lifestyle can influence weight.

  • It is impossible to accurately determine a person’s weight based solely on height.

Stochastic Relations: Example 2

  • Home price and square footage: larger homes tend to cost more, but there’s considerable variability.

  • Other factors such as location, age of the home, and market conditions can influence the price.

Stochastic Relations: the error term

  • A stochastic relation typically includes a random error term (\(\varepsilon\)) to account for the variability in the response. For instance, \[\text{Weight} = \beta_0 + \beta_1 \times \text{Height} + \varepsilon\]
  • The error term allows two people with the same height to have different weights.

Shape of relation

  • We often categorize a relation based on the form of the model equation:
    • Linear Relation: \(\text{Weight} = \beta_0 + \beta_1 \times \text{Height} + \varepsilon\)
    • Quadratic Relation: \(A = \pi\times r^2\)
    • Exponential Relation: \(N = N_0e^{-\lambda t}\varepsilon\)
  • Our focus will be exclusively on stochastic relations.

Terminology

  • We will denote our response variable by \(Y\), also known as:
    • output variable
    • dependent variable (avoid!)
  • We will denote our predictor by \(X\), also known as:
    • feature
    • input variable
    • covariate
    • regressor
    • independent variable (avoid!)

Our go-to dataset

  • In this course, we’ll use data to illustrate concepts.

  • Whenever possible, we will begin with the same dataset to minimize cognitive load.
    • Don’t worry; we will explore different datasets as well!

  • Let’s familiarize ourselves with our main dataset.

Palmer Penguins Dataset

Artwork by @allison_horst

Dr. Kristen Gorman has collected data on 344 penguins from three islands in the Palmer Archipelago, Antarctica.

Several variables were recorded for each penguin, including island, species, bill depth, bill length, body mass, and sex.

Palmer Penguins Dataset

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Torgersen 39.2 19.6 195 4675 male 2007
Adelie Torgersen 34.1 18.1 193 3475 NA 2007
Adelie Torgersen 42.0 20.2 190 4250 NA 2007
Adelie Torgersen 37.8 17.1 186 3300 NA 2007
Adelie Torgersen 37.8 17.3 180 3700 NA 2007
Adelie Torgersen 41.1 17.6 182 3200 female 2007
Adelie Torgersen 38.6 21.2 191 3800 male 2007
Adelie Torgersen 34.6 21.1 198 4400 male 2007
Adelie Torgersen 36.6 17.8 185 3700 female 2007
Adelie Torgersen 38.7 19.0 195 3450 female 2007
Adelie Torgersen 42.5 20.7 197 4500 male 2007
Adelie Torgersen 34.4 18.4 184 3325 female 2007
Adelie Torgersen 46.0 21.5 194 4200 male 2007
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 37.7 18.7 180 3600 male 2007
Adelie Biscoe 35.9 19.2 189 3800 female 2007
Adelie Biscoe 38.2 18.1 185 3950 male 2007
Adelie Biscoe 38.8 17.2 180 3800 male 2007
Adelie Biscoe 35.3 18.9 187 3800 female 2007
Adelie Biscoe 40.6 18.6 183 3550 male 2007
Adelie Biscoe 40.5 17.9 187 3200 female 2007
Adelie Biscoe 37.9 18.6 172 3150 female 2007
Adelie Biscoe 40.5 18.9 180 3950 male 2007
Adelie Dream 39.5 16.7 178 3250 female 2007
Adelie Dream 37.2 18.1 178 3900 male 2007
Adelie Dream 39.5 17.8 188 3300 female 2007
Adelie Dream 40.9 18.9 184 3900 male 2007
Adelie Dream 36.4 17.0 195 3325 female 2007
Adelie Dream 39.2 21.1 196 4150 male 2007
Adelie Dream 38.8 20.0 190 3950 male 2007
Adelie Dream 42.2 18.5 180 3550 female 2007
Adelie Dream 37.6 19.3 181 3300 female 2007
Adelie Dream 39.8 19.1 184 4650 male 2007
Adelie Dream 36.5 18.0 182 3150 female 2007
Adelie Dream 40.8 18.4 195 3900 male 2007
Adelie Dream 36.0 18.5 186 3100 female 2007
Adelie Dream 44.1 19.7 196 4400 male 2007
Adelie Dream 37.0 16.9 185 3000 female 2007
Adelie Dream 39.6 18.8 190 4600 male 2007
Adelie Dream 41.1 19.0 182 3425 male 2007
Adelie Dream 37.5 18.9 179 2975 NA 2007
Adelie Dream 36.0 17.9 190 3450 female 2007
Adelie Dream 42.3 21.2 191 4150 male 2007
Adelie Biscoe 39.6 17.7 186 3500 female 2008
Adelie Biscoe 40.1 18.9 188 4300 male 2008
Adelie Biscoe 35.0 17.9 190 3450 female 2008
Adelie Biscoe 42.0 19.5 200 4050 male 2008
Adelie Biscoe 34.5 18.1 187 2900 female 2008
Adelie Biscoe 41.4 18.6 191 3700 male 2008
Adelie Biscoe 39.0 17.5 186 3550 female 2008
Adelie Biscoe 40.6 18.8 193 3800 male 2008
Adelie Biscoe 36.5 16.6 181 2850 female 2008
Adelie Biscoe 37.6 19.1 194 3750 male 2008
Adelie Biscoe 35.7 16.9 185 3150 female 2008
Adelie Biscoe 41.3 21.1 195 4400 male 2008
Adelie Biscoe 37.6 17.0 185 3600 female 2008
Adelie Biscoe 41.1 18.2 192 4050 male 2008
Adelie Biscoe 36.4 17.1 184 2850 female 2008
Adelie Biscoe 41.6 18.0 192 3950 male 2008
Adelie Biscoe 35.5 16.2 195 3350 female 2008
Adelie Biscoe 41.1 19.1 188 4100 male 2008
Adelie Torgersen 35.9 16.6 190 3050 female 2008
Adelie Torgersen 41.8 19.4 198 4450 male 2008
Adelie Torgersen 33.5 19.0 190 3600 female 2008
Adelie Torgersen 39.7 18.4 190 3900 male 2008
Adelie Torgersen 39.6 17.2 196 3550 female 2008
Adelie Torgersen 45.8 18.9 197 4150 male 2008
Adelie Torgersen 35.5 17.5 190 3700 female 2008
Adelie Torgersen 42.8 18.5 195 4250 male 2008
Adelie Torgersen 40.9 16.8 191 3700 female 2008
Adelie Torgersen 37.2 19.4 184 3900 male 2008
Adelie Torgersen 36.2 16.1 187 3550 female 2008
Adelie Torgersen 42.1 19.1 195 4000 male 2008
Adelie Torgersen 34.6 17.2 189 3200 female 2008
Adelie Torgersen 42.9 17.6 196 4700 male 2008
Adelie Torgersen 36.7 18.8 187 3800 female 2008
Adelie Torgersen 35.1 19.4 193 4200 male 2008
Adelie Dream 37.3 17.8 191 3350 female 2008
Adelie Dream 41.3 20.3 194 3550 male 2008
Adelie Dream 36.3 19.5 190 3800 male 2008
Adelie Dream 36.9 18.6 189 3500 female 2008
Adelie Dream 38.3 19.2 189 3950 male 2008
Adelie Dream 38.9 18.8 190 3600 female 2008
Adelie Dream 35.7 18.0 202 3550 female 2008
Adelie Dream 41.1 18.1 205 4300 male 2008
Adelie Dream 34.0 17.1 185 3400 female 2008
Adelie Dream 39.6 18.1 186 4450 male 2008
Adelie Dream 36.2 17.3 187 3300 female 2008
Adelie Dream 40.8 18.9 208 4300 male 2008
Adelie Dream 38.1 18.6 190 3700 female 2008
Adelie Dream 40.3 18.5 196 4350 male 2008
Adelie Dream 33.1 16.1 178 2900 female 2008
Adelie Dream 43.2 18.5 192 4100 male 2008
Adelie Biscoe 35.0 17.9 192 3725 female 2009
Adelie Biscoe 41.0 20.0 203 4725 male 2009
Adelie Biscoe 37.7 16.0 183 3075 female 2009
Adelie Biscoe 37.8 20.0 190 4250 male 2009
Adelie Biscoe 37.9 18.6 193 2925 female 2009
Adelie Biscoe 39.7 18.9 184 3550 male 2009
Adelie Biscoe 38.6 17.2 199 3750 female 2009
Adelie Biscoe 38.2 20.0 190 3900 male 2009
Adelie Biscoe 38.1 17.0 181 3175 female 2009
Adelie Biscoe 43.2 19.0 197 4775 male 2009
Adelie Biscoe 38.1 16.5 198 3825 female 2009
Adelie Biscoe 45.6 20.3 191 4600 male 2009
Adelie Biscoe 39.7 17.7 193 3200 female 2009
Adelie Biscoe 42.2 19.5 197 4275 male 2009
Adelie Biscoe 39.6 20.7 191 3900 female 2009
Adelie Biscoe 42.7 18.3 196 4075 male 2009
Adelie Torgersen 38.6 17.0 188 2900 female 2009
Adelie Torgersen 37.3 20.5 199 3775 male 2009
Adelie Torgersen 35.7 17.0 189 3350 female 2009
Adelie Torgersen 41.1 18.6 189 3325 male 2009
Adelie Torgersen 36.2 17.2 187 3150 female 2009
Adelie Torgersen 37.7 19.8 198 3500 male 2009
Adelie Torgersen 40.2 17.0 176 3450 female 2009
Adelie Torgersen 41.4 18.5 202 3875 male 2009
Adelie Torgersen 35.2 15.9 186 3050 female 2009
Adelie Torgersen 40.6 19.0 199 4000 male 2009
Adelie Torgersen 38.8 17.6 191 3275 female 2009
Adelie Torgersen 41.5 18.3 195 4300 male 2009
Adelie Torgersen 39.0 17.1 191 3050 female 2009
Adelie Torgersen 44.1 18.0 210 4000 male 2009
Adelie Torgersen 38.5 17.9 190 3325 female 2009
Adelie Torgersen 43.1 19.2 197 3500 male 2009
Adelie Dream 36.8 18.5 193 3500 female 2009
Adelie Dream 37.5 18.5 199 4475 male 2009
Adelie Dream 38.1 17.6 187 3425 female 2009
Adelie Dream 41.1 17.5 190 3900 male 2009
Adelie Dream 35.6 17.5 191 3175 female 2009
Adelie Dream 40.2 20.1 200 3975 male 2009
Adelie Dream 37.0 16.5 185 3400 female 2009
Adelie Dream 39.7 17.9 193 4250 male 2009
Adelie Dream 40.2 17.1 193 3400 female 2009
Adelie Dream 40.6 17.2 187 3475 male 2009
Adelie Dream 32.1 15.5 188 3050 female 2009
Adelie Dream 40.7 17.0 190 3725 male 2009
Adelie Dream 37.3 16.8 192 3000 female 2009
Adelie Dream 39.0 18.7 185 3650 male 2009
Adelie Dream 39.2 18.6 190 4250 male 2009
Adelie Dream 36.6 18.4 184 3475 female 2009
Adelie Dream 36.0 17.8 195 3450 female 2009
Adelie Dream 37.8 18.1 193 3750 male 2009
Adelie Dream 36.0 17.1 187 3700 female 2009
Adelie Dream 41.5 18.5 201 4000 male 2009
Gentoo Biscoe 46.1 13.2 211 4500 female 2007
Gentoo Biscoe 50.0 16.3 230 5700 male 2007
Gentoo Biscoe 48.7 14.1 210 4450 female 2007
Gentoo Biscoe 50.0 15.2 218 5700 male 2007
Gentoo Biscoe 47.6 14.5 215 5400 male 2007
Gentoo Biscoe 46.5 13.5 210 4550 female 2007
Gentoo Biscoe 45.4 14.6 211 4800 female 2007
Gentoo Biscoe 46.7 15.3 219 5200 male 2007
Gentoo Biscoe 43.3 13.4 209 4400 female 2007
Gentoo Biscoe 46.8 15.4 215 5150 male 2007
Gentoo Biscoe 40.9 13.7 214 4650 female 2007
Gentoo Biscoe 49.0 16.1 216 5550 male 2007
Gentoo Biscoe 45.5 13.7 214 4650 female 2007
Gentoo Biscoe 48.4 14.6 213 5850 male 2007
Gentoo Biscoe 45.8 14.6 210 4200 female 2007
Gentoo Biscoe 49.3 15.7 217 5850 male 2007
Gentoo Biscoe 42.0 13.5 210 4150 female 2007
Gentoo Biscoe 49.2 15.2 221 6300 male 2007
Gentoo Biscoe 46.2 14.5 209 4800 female 2007
Gentoo Biscoe 48.7 15.1 222 5350 male 2007
Gentoo Biscoe 50.2 14.3 218 5700 male 2007
Gentoo Biscoe 45.1 14.5 215 5000 female 2007
Gentoo Biscoe 46.5 14.5 213 4400 female 2007
Gentoo Biscoe 46.3 15.8 215 5050 male 2007
Gentoo Biscoe 42.9 13.1 215 5000 female 2007
Gentoo Biscoe 46.1 15.1 215 5100 male 2007
Gentoo Biscoe 44.5 14.3 216 4100 NA 2007
Gentoo Biscoe 47.8 15.0 215 5650 male 2007
Gentoo Biscoe 48.2 14.3 210 4600 female 2007
Gentoo Biscoe 50.0 15.3 220 5550 male 2007
Gentoo Biscoe 47.3 15.3 222 5250 male 2007
Gentoo Biscoe 42.8 14.2 209 4700 female 2007
Gentoo Biscoe 45.1 14.5 207 5050 female 2007
Gentoo Biscoe 59.6 17.0 230 6050 male 2007
Gentoo Biscoe 49.1 14.8 220 5150 female 2008
Gentoo Biscoe 48.4 16.3 220 5400 male 2008
Gentoo Biscoe 42.6 13.7 213 4950 female 2008
Gentoo Biscoe 44.4 17.3 219 5250 male 2008
Gentoo Biscoe 44.0 13.6 208 4350 female 2008
Gentoo Biscoe 48.7 15.7 208 5350 male 2008
Gentoo Biscoe 42.7 13.7 208 3950 female 2008
Gentoo Biscoe 49.6 16.0 225 5700 male 2008
Gentoo Biscoe 45.3 13.7 210 4300 female 2008
Gentoo Biscoe 49.6 15.0 216 4750 male 2008
Gentoo Biscoe 50.5 15.9 222 5550 male 2008
Gentoo Biscoe 43.6 13.9 217 4900 female 2008
Gentoo Biscoe 45.5 13.9 210 4200 female 2008
Gentoo Biscoe 50.5 15.9 225 5400 male 2008
Gentoo Biscoe 44.9 13.3 213 5100 female 2008
Gentoo Biscoe 45.2 15.8 215 5300 male 2008
Gentoo Biscoe 46.6 14.2 210 4850 female 2008
Gentoo Biscoe 48.5 14.1 220 5300 male 2008
Gentoo Biscoe 45.1 14.4 210 4400 female 2008
Gentoo Biscoe 50.1 15.0 225 5000 male 2008
Gentoo Biscoe 46.5 14.4 217 4900 female 2008
Gentoo Biscoe 45.0 15.4 220 5050 male 2008
Gentoo Biscoe 43.8 13.9 208 4300 female 2008
Gentoo Biscoe 45.5 15.0 220 5000 male 2008
Gentoo Biscoe 43.2 14.5 208 4450 female 2008
Gentoo Biscoe 50.4 15.3 224 5550 male 2008
Gentoo Biscoe 45.3 13.8 208 4200 female 2008
Gentoo Biscoe 46.2 14.9 221 5300 male 2008
Gentoo Biscoe 45.7 13.9 214 4400 female 2008
Gentoo Biscoe 54.3 15.7 231 5650 male 2008
Gentoo Biscoe 45.8 14.2 219 4700 female 2008
Gentoo Biscoe 49.8 16.8 230 5700 male 2008
Gentoo Biscoe 46.2 14.4 214 4650 NA 2008
Gentoo Biscoe 49.5 16.2 229 5800 male 2008
Gentoo Biscoe 43.5 14.2 220 4700 female 2008
Gentoo Biscoe 50.7 15.0 223 5550 male 2008
Gentoo Biscoe 47.7 15.0 216 4750 female 2008
Gentoo Biscoe 46.4 15.6 221 5000 male 2008
Gentoo Biscoe 48.2 15.6 221 5100 male 2008
Gentoo Biscoe 46.5 14.8 217 5200 female 2008
Gentoo Biscoe 46.4 15.0 216 4700 female 2008
Gentoo Biscoe 48.6 16.0 230 5800 male 2008
Gentoo Biscoe 47.5 14.2 209 4600 female 2008
Gentoo Biscoe 51.1 16.3 220 6000 male 2008
Gentoo Biscoe 45.2 13.8 215 4750 female 2008
Gentoo Biscoe 45.2 16.4 223 5950 male 2008
Gentoo Biscoe 49.1 14.5 212 4625 female 2009
Gentoo Biscoe 52.5 15.6 221 5450 male 2009
Gentoo Biscoe 47.4 14.6 212 4725 female 2009
Gentoo Biscoe 50.0 15.9 224 5350 male 2009
Gentoo Biscoe 44.9 13.8 212 4750 female 2009
Gentoo Biscoe 50.8 17.3 228 5600 male 2009
Gentoo Biscoe 43.4 14.4 218 4600 female 2009
Gentoo Biscoe 51.3 14.2 218 5300 male 2009
Gentoo Biscoe 47.5 14.0 212 4875 female 2009
Gentoo Biscoe 52.1 17.0 230 5550 male 2009
Gentoo Biscoe 47.5 15.0 218 4950 female 2009
Gentoo Biscoe 52.2 17.1 228 5400 male 2009
Gentoo Biscoe 45.5 14.5 212 4750 female 2009
Gentoo Biscoe 49.5 16.1 224 5650 male 2009
Gentoo Biscoe 44.5 14.7 214 4850 female 2009
Gentoo Biscoe 50.8 15.7 226 5200 male 2009
Gentoo Biscoe 49.4 15.8 216 4925 male 2009
Gentoo Biscoe 46.9 14.6 222 4875 female 2009
Gentoo Biscoe 48.4 14.4 203 4625 female 2009
Gentoo Biscoe 51.1 16.5 225 5250 male 2009
Gentoo Biscoe 48.5 15.0 219 4850 female 2009
Gentoo Biscoe 55.9 17.0 228 5600 male 2009
Gentoo Biscoe 47.2 15.5 215 4975 female 2009
Gentoo Biscoe 49.1 15.0 228 5500 male 2009
Gentoo Biscoe 47.3 13.8 216 4725 NA 2009
Gentoo Biscoe 46.8 16.1 215 5500 male 2009
Gentoo Biscoe 41.7 14.7 210 4700 female 2009
Gentoo Biscoe 53.4 15.8 219 5500 male 2009
Gentoo Biscoe 43.3 14.0 208 4575 female 2009
Gentoo Biscoe 48.1 15.1 209 5500 male 2009
Gentoo Biscoe 50.5 15.2 216 5000 female 2009
Gentoo Biscoe 49.8 15.9 229 5950 male 2009
Gentoo Biscoe 43.5 15.2 213 4650 female 2009
Gentoo Biscoe 51.5 16.3 230 5500 male 2009
Gentoo Biscoe 46.2 14.1 217 4375 female 2009
Gentoo Biscoe 55.1 16.0 230 5850 male 2009
Gentoo Biscoe 44.5 15.7 217 4875 NA 2009
Gentoo Biscoe 48.8 16.2 222 6000 male 2009
Gentoo Biscoe 47.2 13.7 214 4925 female 2009
Gentoo Biscoe NA NA NA NA NA 2009
Gentoo Biscoe 46.8 14.3 215 4850 female 2009
Gentoo Biscoe 50.4 15.7 222 5750 male 2009
Gentoo Biscoe 45.2 14.8 212 5200 female 2009
Gentoo Biscoe 49.9 16.1 213 5400 male 2009
Chinstrap Dream 46.5 17.9 192 3500 female 2007
Chinstrap Dream 50.0 19.5 196 3900 male 2007
Chinstrap Dream 51.3 19.2 193 3650 male 2007
Chinstrap Dream 45.4 18.7 188 3525 female 2007
Chinstrap Dream 52.7 19.8 197 3725 male 2007
Chinstrap Dream 45.2 17.8 198 3950 female 2007
Chinstrap Dream 46.1 18.2 178 3250 female 2007
Chinstrap Dream 51.3 18.2 197 3750 male 2007
Chinstrap Dream 46.0 18.9 195 4150 female 2007
Chinstrap Dream 51.3 19.9 198 3700 male 2007
Chinstrap Dream 46.6 17.8 193 3800 female 2007
Chinstrap Dream 51.7 20.3 194 3775 male 2007
Chinstrap Dream 47.0 17.3 185 3700 female 2007
Chinstrap Dream 52.0 18.1 201 4050 male 2007
Chinstrap Dream 45.9 17.1 190 3575 female 2007
Chinstrap Dream 50.5 19.6 201 4050 male 2007
Chinstrap Dream 50.3 20.0 197 3300 male 2007
Chinstrap Dream 58.0 17.8 181 3700 female 2007
Chinstrap Dream 46.4 18.6 190 3450 female 2007
Chinstrap Dream 49.2 18.2 195 4400 male 2007
Chinstrap Dream 42.4 17.3 181 3600 female 2007
Chinstrap Dream 48.5 17.5 191 3400 male 2007
Chinstrap Dream 43.2 16.6 187 2900 female 2007
Chinstrap Dream 50.6 19.4 193 3800 male 2007
Chinstrap Dream 46.7 17.9 195 3300 female 2007
Chinstrap Dream 52.0 19.0 197 4150 male 2007
Chinstrap Dream 50.5 18.4 200 3400 female 2008
Chinstrap Dream 49.5 19.0 200 3800 male 2008
Chinstrap Dream 46.4 17.8 191 3700 female 2008
Chinstrap Dream 52.8 20.0 205 4550 male 2008
Chinstrap Dream 40.9 16.6 187 3200 female 2008
Chinstrap Dream 54.2 20.8 201 4300 male 2008
Chinstrap Dream 42.5 16.7 187 3350 female 2008
Chinstrap Dream 51.0 18.8 203 4100 male 2008
Chinstrap Dream 49.7 18.6 195 3600 male 2008
Chinstrap Dream 47.5 16.8 199 3900 female 2008
Chinstrap Dream 47.6 18.3 195 3850 female 2008
Chinstrap Dream 52.0 20.7 210 4800 male 2008
Chinstrap Dream 46.9 16.6 192 2700 female 2008
Chinstrap Dream 53.5 19.9 205 4500 male 2008
Chinstrap Dream 49.0 19.5 210 3950 male 2008
Chinstrap Dream 46.2 17.5 187 3650 female 2008
Chinstrap Dream 50.9 19.1 196 3550 male 2008
Chinstrap Dream 45.5 17.0 196 3500 female 2008
Chinstrap Dream 50.9 17.9 196 3675 female 2009
Chinstrap Dream 50.8 18.5 201 4450 male 2009
Chinstrap Dream 50.1 17.9 190 3400 female 2009
Chinstrap Dream 49.0 19.6 212 4300 male 2009
Chinstrap Dream 51.5 18.7 187 3250 male 2009
Chinstrap Dream 49.8 17.3 198 3675 female 2009
Chinstrap Dream 48.1 16.4 199 3325 female 2009
Chinstrap Dream 51.4 19.0 201 3950 male 2009
Chinstrap Dream 45.7 17.3 193 3600 female 2009
Chinstrap Dream 50.7 19.7 203 4050 male 2009
Chinstrap Dream 42.5 17.3 187 3350 female 2009
Chinstrap Dream 52.2 18.8 197 3450 male 2009
Chinstrap Dream 45.2 16.6 191 3250 female 2009
Chinstrap Dream 49.3 19.9 203 4050 male 2009
Chinstrap Dream 50.2 18.8 202 3800 male 2009
Chinstrap Dream 45.6 19.4 194 3525 female 2009
Chinstrap Dream 51.9 19.5 206 3950 male 2009
Chinstrap Dream 46.8 16.5 189 3650 female 2009
Chinstrap Dream 45.7 17.0 195 3650 female 2009
Chinstrap Dream 55.8 19.8 207 4000 male 2009
Chinstrap Dream 43.5 18.1 202 3400 female 2009
Chinstrap Dream 49.6 18.2 193 3775 male 2009
Chinstrap Dream 50.8 19.0 210 4100 male 2009
Chinstrap Dream 50.2 18.7 198 3775 female 2009

Palmer Penguins in R

  • The data is available in the package palmerpenguins.
    • You can install it using install.packages('palmerpenguins')
  • Accessing the data:
    1. Load the package with library(palmerpenguins).
    2. The data will be stored in the variable penguins.
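
The steps above can be sketched as follows. This is only a minimal illustration, assuming the palmerpenguins package is already installed; the requireNamespace() guard is an addition so that the snippet does not fail when the package is missing.

```r
# Check whether the package is available before using it
# (install it once with install.packages("palmerpenguins")).
has_pkg <- requireNamespace("palmerpenguins", quietly = TRUE)

if (has_pkg) {
  library(palmerpenguins)  # makes the data frame `penguins` available
  print(dim(penguins))     # number of rows (penguins) and columns (variables)
  print(head(penguins))    # first few penguins
} else {
  message("palmerpenguins is not installed; run install.packages('palmerpenguins')")
}
```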

Flipper Length vs Body Mass

  • Suppose we want to use penguins’ flipper length to explain their body mass.

Flipper Length vs Body Mass

  • This is a stochastic relation;

    • Many penguins have a flipper length of \(190mm\), but their weights range from \(3050g\) to \(4600g\).
    • Other aspects affect body mass besides flipper length.
  • The relation is “almost” linear;

  • A straight-line model could be a reasonable approximation for this relation.

    • This model is called Simple Linear Regression.

Simple Linear Regression

Specification

  • We have two variables:
    • Response variable (\(Y\)): what we are trying to predict/explain;
    • Explanatory Variable (\(X\)): the variable used to predict the response;
  • We have \(n\) observations/samples.
    • \(X_i\) and \(Y_i\) denote the values of \(X\) and \(Y\) for observation \(i\).
    • You can think of it as \(n\) pairs: \((X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)\).
  • For example, for the penguins, \(i\) indexes the individual penguins and runs from 1 to 344.
i flipper_length_mm (X) body_mass_g (Y)
1 181 3750
2 186 3800
3 195 3250
4 NA NA
5 193 3450
6 190 3650
7 181 3625
8 195 4675
9 193 3475
10 190 4250
11 186 3300
12 180 3700
13 182 3200
14 191 3800
15 198 4400
16 185 3700
17 195 3450
18 197 4500
19 184 3325
20 194 4200
21 174 3400
22 180 3600
23 189 3800
24 185 3950
25 180 3800
26 187 3800
27 183 3550
28 187 3200
29 172 3150
30 180 3950
31 178 3250
32 178 3900
33 188 3300
34 184 3900
35 195 3325
36 196 4150
37 190 3950
38 180 3550
39 181 3300
40 184 4650
41 182 3150
42 195 3900
43 186 3100
44 196 4400
45 185 3000
46 190 4600
47 182 3425
48 179 2975
49 190 3450
50 191 4150
51 186 3500
52 188 4300
53 190 3450
54 200 4050
55 187 2900
56 191 3700
57 186 3550
58 193 3800
59 181 2850
60 194 3750
61 185 3150
62 195 4400
63 185 3600
64 192 4050
65 184 2850
66 192 3950
67 195 3350
68 188 4100
69 190 3050
70 198 4450
71 190 3600
72 190 3900
73 196 3550
74 197 4150
75 190 3700
76 195 4250
77 191 3700
78 184 3900
79 187 3550
80 195 4000
81 189 3200
82 196 4700
83 187 3800
84 193 4200
85 191 3350
86 194 3550
87 190 3800
88 189 3500
89 189 3950
90 190 3600
91 202 3550
92 205 4300
93 185 3400
94 186 4450
95 187 3300
96 208 4300
97 190 3700
98 196 4350
99 178 2900
100 192 4100
101 192 3725
102 203 4725
103 183 3075
104 190 4250
105 193 2925
106 184 3550
107 199 3750
108 190 3900
109 181 3175
110 197 4775
111 198 3825
112 191 4600
113 193 3200
114 197 4275
115 191 3900
116 196 4075
117 188 2900
118 199 3775
119 189 3350
120 189 3325
121 187 3150
122 198 3500
123 176 3450
124 202 3875
125 186 3050
126 199 4000
127 191 3275
128 195 4300
129 191 3050
130 210 4000
131 190 3325
132 197 3500
133 193 3500
134 199 4475
135 187 3425
136 190 3900
137 191 3175
138 200 3975
139 185 3400
140 193 4250
141 193 3400
142 187 3475
143 188 3050
144 190 3725
145 192 3000
146 185 3650
147 190 4250
148 184 3475
149 195 3450
150 193 3750
151 187 3700
152 201 4000
153 211 4500
154 230 5700
155 210 4450
156 218 5700
157 215 5400
158 210 4550
159 211 4800
160 219 5200
161 209 4400
162 215 5150
163 214 4650
164 216 5550
165 214 4650
166 213 5850
167 210 4200
168 217 5850
169 210 4150
170 221 6300
171 209 4800
172 222 5350
173 218 5700
174 215 5000
175 213 4400
176 215 5050
177 215 5000
178 215 5100
179 216 4100
180 215 5650
181 210 4600
182 220 5550
183 222 5250
184 209 4700
185 207 5050
186 230 6050
187 220 5150
188 220 5400
189 213 4950
190 219 5250
191 208 4350
192 208 5350
193 208 3950
194 225 5700
195 210 4300
196 216 4750
197 222 5550
198 217 4900
199 210 4200
200 225 5400
201 213 5100
202 215 5300
203 210 4850
204 220 5300
205 210 4400
206 225 5000
207 217 4900
208 220 5050
209 208 4300
210 220 5000
211 208 4450
212 224 5550
213 208 4200
214 221 5300
215 214 4400
216 231 5650
217 219 4700
218 230 5700
219 214 4650
220 229 5800
221 220 4700
222 223 5550
223 216 4750
224 221 5000
225 221 5100
226 217 5200
227 216 4700
228 230 5800
229 209 4600
230 220 6000
231 215 4750
232 223 5950
233 212 4625
234 221 5450
235 212 4725
236 224 5350
237 212 4750
238 228 5600
239 218 4600
240 218 5300
241 212 4875
242 230 5550
243 218 4950
244 228 5400
245 212 4750
246 224 5650
247 214 4850
248 226 5200
249 216 4925
250 222 4875
251 203 4625
252 225 5250
253 219 4850
254 228 5600
255 215 4975
256 228 5500
257 216 4725
258 215 5500
259 210 4700
260 219 5500
261 208 4575
262 209 5500
263 216 5000
264 229 5950
265 213 4650
266 230 5500
267 217 4375
268 230 5850
269 217 4875
270 222 6000
271 214 4925
272 NA NA
273 215 4850
274 222 5750
275 212 5200
276 213 5400
277 192 3500
278 196 3900
279 193 3650
280 188 3525
281 197 3725
282 198 3950
283 178 3250
284 197 3750
285 195 4150
286 198 3700
287 193 3800
288 194 3775
289 185 3700
290 201 4050
291 190 3575
292 201 4050
293 197 3300
294 181 3700
295 190 3450
296 195 4400
297 181 3600
298 191 3400
299 187 2900
300 193 3800
301 195 3300
302 197 4150
303 200 3400
304 200 3800
305 191 3700
306 205 4550
307 187 3200
308 201 4300
309 187 3350
310 203 4100
311 195 3600
312 199 3900
313 195 3850
314 210 4800
315 192 2700
316 205 4500
317 210 3950
318 187 3650
319 196 3550
320 196 3500
321 196 3675
322 201 4450
323 190 3400
324 212 4300
325 187 3250
326 198 3675
327 199 3325
328 201 3950
329 193 3600
330 203 4050
331 187 3350
332 197 3450
333 191 3250
334 203 4050
335 202 3800
336 194 3525
337 206 3950
338 189 3650
339 195 3650
340 207 4000
341 202 3400
342 193 3775
343 210 4100
344 198 3775
  • The explanatory variable \(X\) is assumed to be fixed for each individual/sample;

The model

  • The model relating \(X\) and \(Y\) is given by the equation of a line plus an error term: \[ Y_i = \beta_0 + \beta_1 X_i+ \varepsilon_i \]
  • Let’s discuss each of these components in more detail.

The model: components

  • The model relating \(X\) and \(Y\) is given by the equation of a line plus an error term: \[ Y_i = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X_i + \underbrace{\varepsilon_i}_{\text{error term}} \]
  • Intercept: the value of the line when \(X=0\) (i.e., where the line crosses the \(Y\)-axis).
  • Slope: how much the line changes for a one-unit increase in \(X\).
  • Error term: captures the variability of the response not explained by the line.

Fitting

  • What values to use for \(\beta_0\) and \(\beta_1\)?

  • We want the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared errors, \(\sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 X_i\right)^2\).
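
To make the criterion concrete, here is a small sketch that compares the sum of squared errors of an arbitrary line with that of the least-squares line. The twelve penguins are a hand-picked subset of the table above (so the numbers are for illustration only), and the helper name sse and the candidate values are hypothetical.

```r
# Twelve penguins copied from the table above (hand-picked subset, illustration only)
peng <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 211, 230, 210, 218, 192, 196, 193, 188),
  body_mass_g       = c(3750, 3800, 3250, 3450, 4500, 5700, 4450, 5700, 3500, 3900, 3650, 3525)
)

# Sum of squared errors of the line b0 + b1 * x (hypothetical helper)
sse <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# An arbitrary candidate line
sse_guess <- sse(b0 = -5000, b1 = 45,
                 x = peng$flipper_length_mm, y = peng$body_mass_g)

# The least-squares line found by lm()
fit   <- lm(body_mass_g ~ flipper_length_mm, data = peng)
b_hat <- unname(coef(fit))
sse_ls <- sse(b_hat[1], b_hat[2], peng$flipper_length_mm, peng$body_mass_g)

# No other line can have a smaller SSE than the least-squares line
c(guess = sse_guess, least_squares = sse_ls)
```

Trying other values for b0 and b1 in sse() is a quick way to see that the line returned by lm() is the one with the smallest sum of squared errors.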

The random errors

  • The error component, \(\varepsilon_i\), captures everything that our model does not.

  • We treat \(\varepsilon_i\) as a random variable.

    • It has a distribution:
      • We will assume it to be Normal;
    • It has a mean:
      • safely assumed to be 0;
    • It has a variance:
      • unknown and denoted by \(\sigma^2\);
  • But what does this mean?

The random errors and the response

  • Imagine a linear model relating Height (m) and Weight (kg): \[ \text{Weight} = -166+140\times\text{Height}+\varepsilon \]
  • For a \(1.63m\) tall person, we have: \[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]
  • We would expect this person to weigh around \(62.2kg\);
    • But the weight is affected by other factors as well;
    • so we cannot say precisely the value of the weight;
  • Instead, the model gives a probability distribution of possible weights for a \(1.63m\) tall person, centred at \(62.2kg\).



Modelling the average

  • For a \(1.63m\) tall person, we have: \[ \text{Weight} = -166+140\times1.63+\varepsilon = 62.2 + \varepsilon \]
  • We would expect this person to weigh around \(62.2kg\);

    • Some people will weigh more, and some will weigh less.
  • On average, \(1.63m\) tall people weigh \(62.2kg\);

  • The line \(-166+140\times\text{Height}\) gives the mean \(\text{Weight}\) of people of a given \(\text{Height}\);
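
A short simulation can illustrate this idea. It is only a sketch: the error standard deviation of 8 kg is an assumption made for the example (the lecture does not give a value), and the errors are drawn from a Normal distribution as assumed earlier.

```r
set.seed(301)  # arbitrary seed, for reproducibility

height  <- 1.63
mean_wt <- -166 + 140 * height   # point on the line for a 1.63 m tall person

# ASSUMPTION: error standard deviation of 8 kg (not given in the lecture)
sigma <- 8

# Simulate 10,000 people who are all 1.63 m tall
eps    <- rnorm(10000, mean = 0, sd = sigma)
weight <- mean_wt + eps

range(weight)  # individual weights differ from person to person
mean(weight)   # the average is close to the value on the line
```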

The model as conditional expectation

  • In general: \[ E[Y|X] = \beta_0 + \beta_1 X \]
    • This just means that the regression line is the conditional average of \(Y\) for a given value of \(X\).
  • Note the difference: \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
    • This equation describes an individual observation, which generally lies off the line (note the presence of the error term).

The model as conditional expectation

\[ \overbrace{\color{red}{\underbrace{\beta_0+\beta_1 X_i}_{\text{Regression line: } E[Y|X_i]}} + \varepsilon_i}^{\text{Point: } Y_i} \]

Interpretation of \(\beta_0\) and \(\beta_1\)

  • Slope: an increase of 1 unit of \(X\) is associated with an expected increase of \(\beta_1\) units in \(Y\).
    • It is associated with, not the cause of!

  • Intercept: The average value of \(Y\) when \(X = 0\) is \(\beta_0\).
    • Usually, we don’t care as much about this parameter.

Important: Association is not causality.

In general, we cannot conclude that changes in \(X\) cause a change in \(Y\). The conclusion of causality requires more than a good model.

Example: Fitting a model

  • Let’s fit a linear model relating Flipper Length and Body Mass of penguins.

  • To fit a linear model in R, we use the lm function.

    • Let’s explore this function in a code demo!
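
As a preview of the demo, here is a minimal sketch of an lm call. It uses twelve penguins hand-picked from the table shown earlier so that it runs without any packages; the object names peng and penguins_fit and the 200 mm flipper used for prediction are made up for this example, and the estimates will differ from those based on the full dataset.

```r
# Twelve penguins copied from the table shown earlier (hand-picked subset, illustration only)
peng <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 211, 230, 210, 218, 192, 196, 193, 188),
  body_mass_g       = c(3750, 3800, 3250, 3450, 4500, 5700, 4450, 5700, 3500, 3900, 3650, 3525)
)

# Formula syntax: response ~ predictor
penguins_fit <- lm(body_mass_g ~ flipper_length_mm, data = peng)

coef(penguins_fit)     # estimated intercept and slope
summary(penguins_fit)  # estimates, standard errors, t statistics and p-values

# Estimated mean body mass for a hypothetical flipper length of 200 mm
predict(penguins_fit, newdata = data.frame(flipper_length_mm = 200))
```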

Code demo - Part 1!

Fitting is estimation

  • \(E[Y|X] = \beta_0 + \beta_1 X\) is the population’s conditional mean for a given value of \(X\).

  • But instead of estimating the mean for each value of \(X\) separately, which is not feasible, we assume that the population mean of \(Y\) is a linear function of \(X\).

  • Therefore, estimating the mean of \(Y\) at every value of \(X\) (in a certain range) reduces to estimating \(\beta_0\) and \(\beta_1\).

Population vs Sample Regression

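  • Population Regression: \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

  • Sample Regression (estimated from the sample): \[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i, \quad\text{where}\quad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]

  • Note that, in general, \(e_i\neq\varepsilon_i\) because \(\hat{\beta}_0\neq \beta_0\) and \(\hat{\beta}_1\neq \beta_1\). The residual \(e_i = Y_i - \hat{Y}_i\) is observed; the error \(\varepsilon_i\) is not.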
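
In a simulation we know the population line, so we can see the errors and the residuals side by side. The sketch below reuses the height and weight line from earlier; the sample size, the range of heights, and the error standard deviation are assumptions made only for this example.

```r
set.seed(301)  # arbitrary seed, for reproducibility

n      <- 50
beta0  <- -166                  # population intercept (height/weight example)
beta1  <- 140                   # population slope (height/weight example)
sigma  <- 8                     # ASSUMPTION: error standard deviation in kg
height <- runif(n, 1.5, 1.9)    # ASSUMPTION: heights between 1.5 m and 1.9 m

eps    <- rnorm(n, mean = 0, sd = sigma)   # true errors (never observed in practice)
weight <- beta0 + beta1 * height + eps

fit <- lm(weight ~ height)
e   <- resid(fit)               # residuals, computed from the estimated line

coef(fit)                       # estimates, not exactly -166 and 140
head(cbind(error = eps, residual = e))   # similar, but not identical

# Each observation is its fitted value plus its residual
all.equal(unname(fitted(fit) + e), weight)
```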

The parameters \(\beta_0\) and \(\beta_1\)

  • Since \(\beta_0\) and \(\beta_1\) are unknown population parameters, we estimate them based on a sample;

  • \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimators of \(\beta_0\) and \(\beta_1\).

  • As statistics, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) vary from sample to sample, so we need their sampling distributions to do inference.

Inference for \(\beta_0\)

You do not need to memorize these formulae.

  • Estimator of \(\beta_0\): \[\widehat{\beta}_0 = \bar{Y}-\widehat{\beta}_1\bar{X}\]
  • Std. Error of \(\hat{\beta}_0\):
    \[ SE\left(\hat{\beta}_0\right) = \sqrt{\sigma^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^n (X_i-\bar{X})^2}\right]} \]

  • This assumes that all errors, \(\varepsilon_i\), are uncorrelated and have equal variance \(\sigma^2\).

  • Sampling distribution of \(\hat{\beta}_0\) \[\hat{\beta}_0\sim N\left(\beta_0, \left[SE\left(\hat{\beta}_0\right)\right]^2\right)\]

  • Since \(\sigma^2\) is unknown, we can estimate using: \[\hat{\sigma}^2 = S^2 = \frac{\sum_{i=1}^n e_i^2}{n-2}\]

  • Finally, we can use: \[ \frac{\hat{\beta}_0 - \beta_0}{\widehat{SE}\left(\hat{\beta}_0\right)}\sim t_{n-2} \]

Inference for \(\beta_1\)

You do not need to memorize these formulae.

  • Estimator of \(\beta_1\): \[\widehat{\beta}_1 = \frac{\sum_{i=1}^n \left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sum_{i=1}^n \left(X_i-\bar{X}\right)^2}\]

  • Std. Error of \(\hat{\beta}_1\):
    \[ SE\left(\hat{\beta}_1\right) = \sqrt{\frac{\sigma^2}{\sum_{i=1}^n (X_i-\bar{X})^2}} \]

  • This assumes that the errors, \(\varepsilon_i\):

    • have constant variance equal to \(\sigma^2\);
    • are uncorrelated;
  • Sampling distribution of \(\hat{\beta}_1\) \[\hat{\beta}_1\sim N\left(\beta_1, \left[SE\left(\hat{\beta}_1\right)\right]^2\right)\]

  • Since \(\sigma^2\) is unknown, we can estimate using: \[\hat{\sigma}^2 = S^2 = \frac{\sum_{i=1}^n e_i^2}{n-2}\]

  • Finally, we can use: \[ \frac{\hat{\beta}_1 - \beta_1}{\widehat{SE}\left(\hat{\beta}_1\right)}\sim t_{n-2} \]
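The formulae above can be checked numerically. The sketch below is only an illustration: it uses simulated data (the variables `x` and `y` and the parameter values are assumptions, not the penguins data), computes the estimates and their estimated standard errors by hand, and compares them with what `lm` reports.

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
n <- 50
x <- runif(n, 170, 230)
y <- -5800 + 50 * x + rnorm(n, sd = 400)

# Point estimates from the closed-form expressions
sxx <- sum((x - mean(x))^2)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sxx
b0 <- mean(y) - b1 * mean(x)

# Residuals and the estimate of sigma^2 with n - 2 degrees of freedom
e <- y - (b0 + b1 * x)
s2 <- sum(e^2) / (n - 2)

# Estimated standard errors (sigma^2 replaced by s2)
se_b0 <- sqrt(s2 * (1 / n + mean(x)^2 / sxx))
se_b1 <- sqrt(s2 / sxx)

# Compare the manual results with lm()
fit <- lm(y ~ x)
manual <- cbind(estimate = c(b0, b1), std.error = c(se_b0, se_b1))
from_lm <- unname(summary(fit)$coefficients[, 1:2])
print(manual)
print(from_lm)
```

The two printed matrices should agree up to numerical precision, which is why there is no need to memorize the formulae: `lm` applies them for us.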

Confidence Intervals for \(\beta_0\) and \(\beta_1\)

You do not need to memorize these formulae.


\[CI(\beta_0, 1-\alpha) = \hat{\beta}_0 \pm t^*_{1-\alpha/2}\widehat{SE}\left(\hat{\beta}_0\right)\]


\[CI(\beta_1, 1-\alpha) = \hat{\beta}_1 \pm t^*_{1-\alpha/2}\widehat{SE}\left(\hat{\beta}_1\right)\]

  • Or we can use bootstrap!! Yes, like in STAT 201.
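Both approaches can be sketched in base R. This is a minimal illustration on simulated data (the data, the 2000 bootstrap replicates, and all variable names are assumptions made for this example): the t-based intervals are built by hand and compared with `confint`, and a percentile bootstrap interval is computed for the slope.

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
n <- 50
x <- runif(n, 170, 230)
y <- -5800 + 50 * x + rnorm(n, sd = 400)
fit <- lm(y ~ x)

# t-based 95% intervals: estimate +/- t quantile * estimated SE
alpha <- 0.05
est <- coef(fit)
se <- summary(fit)$coefficients[, "Std. Error"]
t_star <- qt(1 - alpha / 2, df = n - 2)
ci_manual <- cbind(lower = est - t_star * se, upper = est + t_star * se)
print(ci_manual)
print(confint(fit, level = 1 - alpha))  # should agree with ci_manual

# Percentile bootstrap for the slope: resample (x, y) pairs with replacement
boot_slopes <- replicate(2000, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[2]
})
ci_boot <- quantile(boot_slopes, c(alpha / 2, 1 - alpha / 2))
print(ci_boot)
```

The bootstrap interval will usually be close to the t-based one, but it need not match it exactly, and it changes slightly with the seed.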

Example: CI - Penguins data

Code demo - Part 2!

Example: CI - Penguins data

# Fit the model
penguins_lm <- lm(body_mass_g ~ flipper_length_mm, data = penguins %>% drop_na())

# Extract the CI
confint(penguins_lm, level = 0.95)
                      2.5 %    97.5 %
(Intercept)        -6482.47  -5261.71
flipper_length_mm     47.12     53.18
  • With 95% confidence, we expect the average body mass of penguins to increase by between 47.12 and 53.18 grams for every 1 mm increase in flipper length.

Example: CI - Penguins data

  • Via Bootstrap (slope):
# Infer package framework
ci_slope <-
  penguins_clean %>% 
  specify(formula = body_mass_g ~ flipper_length_mm) %>%
  generate(reps = 15000, type = "bootstrap") %>% 
  calculate(stat = "slope") %>% 
  get_ci(type = "percentile", level = 0.95)
lower_ci upper_ci
47.27984 53.10233

Hypothesis Test

  • We can also test the hypothesis that \(Y\) and \(X\) are linearly related. This is equivalent to testing \[ H_0: \beta_1 = 0\quad vs \quad H_1: \beta_1\neq 0 \]

    • We can use a one-sided alternative if we know that \(\beta_1\) cannot be negative (\(H_1: \beta_1 > 0\)) or cannot be positive (\(H_1: \beta_1 < 0\)).

Hypothesis Test: Null Model

  • Test Statistic and Null model: under \(H_0\), \[ T = \frac{\hat{\beta}_1-0}{\widehat{SE}\left(\hat{\beta}_1\right)} \sim t_{n-2} \]


  • Reject if p-value \(< \alpha\).
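As a sketch of how the test statistic and the two-sided p-value come together, the code below computes both by hand and compares them with the values `lm` reports. The data are simulated (an assumption for illustration only).

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
n <- 50
x <- runif(n, 170, 230)
y <- -5800 + 50 * x + rnorm(n, sd = 400)
fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# Test statistic for H0: beta_1 = 0, using the standard error of the slope
t_stat <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(t_stat = t_stat, p_value = p_value))
print(coefs["x", c("t value", "Pr(>|t|)")])  # values reported by lm
```

For a one-sided alternative, only one tail would be used, e.g. `pt(t_stat, df = n - 2, lower.tail = FALSE)` for \(H_1: \beta_1 > 0\).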

Example: Penguins data

Code demo - Part 3!

Example: Penguins data

broom::tidy(penguins_lm, conf.int = TRUE, conf.level = 0.9)
term               estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)        -5872.09     310.29     -18.92        0  -6383.90   -5360.29
flipper_length_mm     50.15       1.54      32.56        0     47.61      52.69
  • The p-value for flipper_length_mm is essentially zero (it rounds to 0), so we reject the null hypothesis that \(\beta_1 = 0\), in favour of \(H_1: \beta_1\neq 0\), at any usual significance level.

Example: range problem

Code demo - Part 4!

The range problem

  • The linear model assumes that the relationship between \(X\) and \(E[Y|X]\) is linear, which may or may not be true;

  • Sometimes, there’s a linear association only in part of the data range.

    • The linear model could still be useful when restricted to that specific range;
  • We need to exercise caution when using the model outside the range of the data, as the relationship between \(X\) and \(Y\) may differ significantly.
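A small simulation can illustrate the danger of extrapolating. Everything here is hypothetical: the mean function `true_mean` is assumed to be linear up to \(x = 10\) and flat afterwards, and we only observe data with \(x\) between 0 and 10.

```r
set.seed(301)

# Hypothetical mean function: linear up to x = 10, flat afterwards
true_mean <- function(x) ifelse(x <= 10, 2 * x, 20)

# We only observe data with x between 0 and 10
x <- runif(100, 0, 10)
y <- true_mean(x) + rnorm(100, sd = 1)
fit <- lm(y ~ x)

# Prediction inside the observed range (x = 5) vs far outside it (x = 20)
new_x <- data.frame(x = c(5, 20))
pred <- unname(predict(fit, newdata = new_x))
print(data.frame(x = new_x$x, predicted = pred, true_mean = true_mean(new_x$x)))
```

Inside the observed range, the prediction should be close to the true mean; at \(x = 20\), the fitted line keeps increasing while the true mean has flattened, so the prediction is far off.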

Regression vs Correlation analysis

  • Correlation analysis: we’re interested in the strength of linear association between two variables;
    • no distinction between the two variables (no response and no covariate);
    • both variables are assumed to be stochastic;
  • Linear Regression: we’re interested in estimating the conditional average of the response given the value of the covariate.
    • covariate is assumed to be non-stochastic;
    • one of the variables is treated as a response and the other as a covariate;
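The asymmetry between the two analyses can be seen in a short sketch on simulated data (an assumption for illustration): the correlation does not change when the variables are swapped, while the regression slope does. The two are linked by the standard identity \(\hat{\beta}_1 = r \, s_Y / s_X\).

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
x <- rnorm(100, mean = 200, sd = 14)
y <- -5800 + 50 * x + rnorm(100, sd = 400)

# Correlation is symmetric: swapping the variables changes nothing
r <- cor(x, y)
print(c(cor_xy = r, cor_yx = cor(y, x)))

# Regression is not: the slope depends on which variable is the response
slope_y_on_x <- unname(coef(lm(y ~ x))[2])
slope_x_on_y <- unname(coef(lm(x ~ y))[2])
print(c(y_on_x = slope_y_on_x, x_on_y = slope_x_on_y))

# Link between the two analyses: slope = r * sd(y) / sd(x)
print(r * sd(y) / sd(x))
```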

Disclaimer

Warning

All the models in this course are approximations of reality. None of these models will be true or correct, but hopefully they will still be useful.

  • There’s a famous quote from George E. P. Box:

“Essentially, all models are wrong, but some are useful.”

Categorical Covariate?

  • Note that we can also have a categorical covariate.

  • We can try to explain the body mass of Penguins based on the sex of penguins: male or female. \[ \text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i \]

  • But wait! How can we have categories in an equation?

Dummy variables (2 categories)

  • We can encode the categories as one or more numeric 0/1 variables, called dummy variables.

  • For example, the variable sex could be defined as:

\[\begin{equation} \text{sex}_i = \begin{cases} 0 & \text{if penguin $i$ is female}\\ 1 & \text{if penguin $i$ is male} \end{cases} \end{equation}\]

  • A variable patient_status could be defined:

\[\begin{equation} \text{patient_status}_i = \begin{cases} 0 & \text{if patient $i$ is healthy}\\ 1 & \text{if patient $i$ is sick} \end{cases} \end{equation}\]

  • If we have a variable with two categories, we need only 1 dummy variable to represent it.
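To see the encoding concretely, here is a minimal sketch with a toy vector (hypothetical values, not the penguins data). It builds the 0/1 variable by hand and then shows the design matrix R creates with `model.matrix`.

```r
# Hypothetical toy vector (not the penguins data)
sex <- factor(c("female", "male", "male", "female", "male"))

# Manual 0/1 encoding: 1 if male, 0 if female
sex_dummy <- as.numeric(sex == "male")
print(sex_dummy)  # 0 1 1 0 1

# Design matrix R builds for a model with an intercept and `sex`
X <- model.matrix(~ sex)
print(X)
```

The column `sexmale` of the design matrix matches the manual encoding: one dummy variable is enough for two categories, with `female` absorbed into the intercept.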

Back to the model

  • We can try to explain the body mass of Penguins based on the sex of penguins: male or female. \[ \text{body_mass_g}_i = \beta_0 + \beta_1\text{sex}_i + \varepsilon_i \]

Where is the line?

  • Note that there is no line in this case.

  • Sex cannot be 0.1 or 0.5. It can only be 0 or 1.

  • So, what is going on?

\[\begin{equation} \text{body_mass_g}_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i& \text{if penguin $i$ is male}\\ \beta_0 + \varepsilon_i & \text{if penguin $i$ is female} \end{cases} \end{equation}\]

We are just comparing means

  • Remember that in regression, we model the mean given the value of a covariate.

  • So, in this case, we are modelling the mean of female penguins and the mean of male penguins;

  • Note that:

    • \(\beta_0\) is the average body_mass_g of female penguins.
    • \(\beta_1\) is the difference in means (male minus female).
    • \(\beta_0+\beta_1\) is the average body_mass_g of male penguins.
  • To test if the means are equal, we need to test \(H_0: \beta_1 = 0\) vs \(H_1: \beta_1 \neq 0\).
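The interpretation of the coefficients as group means can be verified on simulated data. In this sketch, the group sizes, means, and the variable name `body_mass` are assumptions made for illustration.

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
sex <- factor(rep(c("female", "male"), times = c(40, 45)))
body_mass <- ifelse(sex == "male", 4500, 3900) + rnorm(85, sd = 500)

# Fit the model with a single dummy variable and compute the sample means
fit <- lm(body_mass ~ sex)
b <- unname(coef(fit))
group_means <- tapply(body_mass, sex, mean)
mean_f <- unname(group_means["female"])
mean_m <- unname(group_means["male"])

# Intercept = female mean; intercept + slope = male mean; slope = difference
print(c(intercept = b[1], female_mean = mean_f))
print(c(intercept_plus_slope = b[1] + b[2], male_mean = mean_m))
print(c(slope = b[2], difference = mean_m - mean_f))
```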

Testing for equality

penguin_sex_lm <- lm(body_mass_g ~ sex, data = penguins_clean)
tidy(penguin_sex_lm)
term         estimate  std.error  statistic  p.value
(Intercept)   3862.27      56.83      67.96        0
sexmale        683.41      80.01       8.54        0
  • The term sexmale is the estimate of \(\beta_1\) in our model.
    • The female sex is the reference level, and its estimated average is given by the intercept: \(3862.27\).
    • The estimated difference in average body_mass_g between male and female penguins is \(683.41\).
    • The estimated average for male penguins is \(3862.27+683.41 = 4545.68\).

Equivalent to t-test

t_test_average <- t.test(body_mass_g ~ sex, 
                         var.equal = TRUE, data = penguins_clean)

tidy(t_test_average) %>%
  select(estimate, estimate1, estimate2, statistic, p.value)
estimate  estimate1  estimate2  statistic  p.value
 -683.41    3862.27    4545.68      -8.54        0
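The equivalence can also be checked with base R alone. The sketch below uses simulated data (an assumption for illustration) and compares the slope's t statistic and p-value from `lm` with those of the equal-variance two-sample t-test.

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
sex <- factor(rep(c("female", "male"), times = c(40, 45)))
body_mass <- ifelse(sex == "male", 4500, 3900) + rnorm(85, sd = 500)

lm_coefs <- summary(lm(body_mass ~ sex))$coefficients
tt <- t.test(body_mass ~ sex, var.equal = TRUE)

# t.test() reports female minus male, lm() reports male minus female,
# so the statistics have opposite signs but the same p-value
print(c(lm_t = lm_coefs["sexmale", "t value"], t_test_t = unname(tt$statistic)))
print(c(lm_p = lm_coefs["sexmale", "Pr(>|t|)"], t_test_p = tt$p.value))
```

The equivalence relies on `var.equal = TRUE`; the default Welch test does not assume equal variances and generally gives a slightly different statistic.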

R for us

  • R is pretty good at dealing with categorical variables.

  • All we need to do is to use factors, e.g.,

penguins_clean %>%
  mutate(sex = as_factor(sex))
  • The lm function will create the dummy variables and label each coefficient with the factor level it corresponds to.
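The reference level is simply the first level of the factor (alphabetical by default). A minimal sketch, using simulated data and base R's `relevel` (the data frame `dat` and its values are assumptions for illustration), shows how changing the reference level changes the coefficient's name and sign.

```r
# Simulated data (hypothetical values, not the penguins data set)
set.seed(301)
dat <- data.frame(sex = factor(rep(c("female", "male"), times = c(40, 45))))
dat$body_mass <- ifelse(dat$sex == "male", 4500, 3900) + rnorm(85, sd = 500)

# Default: levels are alphabetical, so "female" is the reference level
print(levels(dat$sex))
fit_default <- lm(body_mass ~ sex, data = dat)
print(coef(fit_default))

# Make "male" the reference level; the coefficient is now named `sexfemale`
dat_releveled <- transform(dat, sex = relevel(sex, ref = "male"))
fit_releveled <- lm(body_mass ~ sex, data = dat_releveled)
print(coef(fit_releveled))
```

The two fits describe the same group means; only the parameterization changes, so the difference in means flips sign.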