Exploratory Data Analysis:
Quantitative Data

STAT 200 - Lecture 3

Exploratory Data Analysis (EDA)

  • One of the most critical steps when studying a new problem and data set.

  • EDA helps us to:

    • better understand the data;
    • find new and frequently unexpected relationships between variables;
    • raise interesting questions about the study;

Disclaimer

  • You are in the driver’s seat of your EDA.
    • There’s no recipe;
    • Little thought yields small findings.
  • In general, EDA findings are preliminary and require further investigation.

Quantitative Variables

Penguin Heights

Location

Measures of Centrality

  • Measures of centrality tell us where our data are located on the real line.

  • Two important measures of centrality are: mean and median.

  • You can think of the mean and median as “typical” values of your data set.

Mean (\(\bar{y}\))

  • The mean is calculated as \[ \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i = \frac{y_1+\cdots+y_n}{n} \]

Notation:

  • \(y_i\) represents the \(i\)th observation (or data point), and
  • \(\sum_{i=1}^ny_i\) means to sum \(y_1\), \(y_2\), up to \(y_n\), i.e., to sum all the observations.
  • \(n\) is commonly used to denote the number of data points.

Example 1: Mean (\(\bar{y}\))

Compute the average of the following 9 length measures of penguins’ flippers (in cm):
\[18.1,\ 18.6,\ 19.5,\ 19.3,\ 19.0,\ 18.1,\ 19.5,\ 19.3,\ 19.0\]

The average is

\[\begin{align} \bar{y} & = \frac{18.1+18.6+19.5+19.3+19.0+18.1+19.5+19.3+19.0}{9} \\ & \approx 18.93\text{ cm} \end{align}\]

  • You can always use R to calculate the mean:
# Stores the values in a variable called "flippers_length"
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3, 19.0)

# Calculates the mean of the values stored in "flippers_length"
mean(flippers_length)
[1] 18.93333
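
To tie the R result back to the formula for \(\bar{y}\), the sketch below computes the sum and \(n\) separately and then divides them. The names `n` and `manual_mean` are introduced here only for illustration.

```r
# The 9 flipper lengths (in cm) from Example 1
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3, 19.0)

# n is the number of observations
n <- length(flippers_length)

# Sum all observations and divide by n, as in the formula for the mean
manual_mean <- sum(flippers_length) / n

# TRUE: agrees with the built-in mean() up to floating-point rounding
all.equal(manual_mean, mean(flippers_length))
```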

Median (\(Q_2\))

  • The median is the value that splits the data set into two equal parts:
    • at least half of the observations are lower than or equal to the median;
    • at least half of the observations are higher than or equal to the median;

  • To calculate the median (denoted by \(Q_2\)), we must first arrange the data in ascending order.

  • Then, if the number of observations is:

    • odd: the median (\(Q_2\)) is the middle observation (i.e., the observation in position (n+1)/2);
    • even: the median (\(Q_2\)) is the average of the two central observations (i.e., the observations in positions (n/2) and (n/2+1));
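
The odd/even rule above can be sketched as a small R function. This is only an illustration: `median_by_hand` is a hypothetical helper name, and the two test vectors are toy values, not data from the lecture. In practice R's built-in `median()` does the same job.

```r
# Hypothetical helper that follows the odd/even rule for the median
median_by_hand <- function(y) {
  y_sorted <- sort(y)  # arrange the data in ascending order
  n <- length(y_sorted)
  if (n %% 2 == 1) {
    # n odd: the middle observation, at position (n + 1) / 2
    y_sorted[(n + 1) / 2]
  } else {
    # n even: average of the observations at positions n / 2 and n / 2 + 1
    (y_sorted[n / 2] + y_sorted[n / 2 + 1]) / 2
  }
}

median_by_hand(c(3, 1, 2))     # odd n: returns 2
median_by_hand(c(4, 1, 3, 2))  # even n: returns 2.5
```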

Example 2: Median (\(Q_2\))

Compute the median of the following 9 length measures of penguins’ flippers (in cm):
\[18.1,\ 18.6,\ 19.5,\ 19.3,\ 19.0,\ 18.1,\ 19.5,\ 19.3,\ 19.0\]
  • First, we need to arrange the observations in ascending order:

\[18.1,\ 18.1,\ 18.6,\ 19.0,\ \color{red}{\underline{19.0}},\ 19.3,\ 19.3,\ 19.5,\ 19.5 \]

  • Then, we select the central observation as the median.
    • Since we have n = 9, which is odd, we select the observation in position (9+1)/2 = 5.
    • \(Q_2=19.0\)

Example 3: Median (\(Q_2\))

Compute the median of the following 8 length measures of penguins’ flippers (in cm):
\[18.1,\ 18.6,\ 19.5,\ 19.3,\ 19.0,\ 18.1,\ 19.5,\ 19.3\]
  • First, we need to arrange the observations in ascending order:

\[18.1,\ 18.1,\ 18.6,\ \color{red}{\underline{19.0}},\ \color{red}{\underline{19.3}},\ 19.3,\ 19.5,\ 19.5 \]

  • Then, we calculate the average of the two central observations.
    • Since we have n = 8, which is even, we select the 8/2 = 4th and 8/2 + 1 = 5th observations.
    • \(Q_2 = (19.0 + 19.3) / 2 = 19.15\)

  • You can always use R to calculate the median:
# Stores the values in a variable called "flippers_length"
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

# Calculates the median of the values stored in "flippers_length"
median(flippers_length)
[1] 19.15

Mean vs Median

  • Both the mean and the median are very useful measures of centrality.

  • But they have different behaviors and interpretations.

Mean vs Median - Interpretation

Suppose you want to know how much you will spend on monthly groceries in a given year.

  • Mean: the mean gives you an idea of how much you spend per month;
    • some months you will spend a bit more, some months a bit less;
    • but the mean also gives you an idea of the total amount you will spend in a year; just multiply it by 12.

Mean vs Median - Interpretation

Suppose you want to know how much you will spend on monthly groceries in a given year.

  • Median: the median also gives you an idea of how much you spend per month;
    • in around 50% of the months you will spend more than the median, and in around 50% you will spend less;
      • Note the extra precision: for the mean, we could only say “some months”.
    • the median is not based on the total, so you can’t just multiply it by 12 to know how much you will spend in a year.
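
A small sketch of the difference, using made-up monthly grocery amounts (the vector `monthly_spend` is hypothetical and chosen only so that one month is unusually expensive):

```r
# Hypothetical grocery spending for 12 months (in dollars)
monthly_spend <- c(410, 395, 430, 405, 420, 400, 415, 390, 425, 410, 400, 640)

yearly_total <- sum(monthly_spend)  # 5140

# The mean times 12 recovers the yearly total
mean_times_12 <- mean(monthly_spend) * 12

# The median times 12 does not: it gives 4920 here
median_times_12 <- median(monthly_spend) * 12

all.equal(mean_times_12, yearly_total)  # TRUE
median_times_12 == yearly_total         # FALSE
```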

Mean vs Median - Outlier

Let’s return to the 8 length measures of penguins’ flippers (in cm). But this time, we have an additional measurement from a baby penguin:
\[\color{green}{\underline{2.3}}, 18.1,\ 18.6,\ 19.5,\ 19.3,\ 19.0,\ 18.1,\ 19.5,\ 19.3\]
  • Mean: \(\bar{y} = \frac{2.3 + 18.1 + ... + 19.3}{9} = 17.0778\) (it was \(18.925\));

  • Median: \(Q_2 = 19.0\) (it was \(19.15\));

  • In general, outliers affect the mean much more than they affect the median.
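
The baby penguin comparison can be reproduced in R. The sketch below reuses the flipper measurements from the example; the names `with_baby`, `mean_shift`, and `median_shift` are introduced here only for illustration.

```r
# The 8 original flipper lengths (in cm)
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

# The same data with the baby penguin's measurement added
with_baby <- c(2.3, flippers_length)

# How far each measure moves when the outlier is added
mean_shift   <- mean(flippers_length) - mean(with_baby)      # about 1.85
median_shift <- median(flippers_length) - median(with_baby)  # about 0.15

mean_shift > median_shift  # TRUE: the mean moved much more
```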

Quantiles

  • The median is a value such that:
    1. at least 50% of the observations are at or below the median;
    2. at least 50% of the observations are at or above the median;
  • But why 50%?
    • Can we use a different percentage?
  • Yes! We could use any percentage that we want!
    • These quantities are called quantiles.

Quartiles

  • The median is the \(0.5\)-quantile (0.5 is equivalent to 50%).

  • You could use any percentage, for example

    • \(0.124\)-quantile; or \(0.0007\)-quantile; or \(0.99\)-quantile;
  • There are 3 quantiles that are commonly of interest:

    • \(0.25\)-quantile; \(0.50\)-quantile; and \(0.75\)-quantile;
    • They are called quartiles!
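
As a brief sketch, R's `quantile()` accepts any percentages, including several at once. The variable name `quartiles` is an illustrative choice. Note that `quantile()` interpolates between observations by default, so its values need not match a hand calculation exactly.

```r
# The 8 flipper lengths (in cm) used in the median example
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

# Any percentage can be requested
quantile(flippers_length, c(0.124, 0.99))

# The three quartiles in a single call
quartiles <- quantile(flippers_length, c(0.25, 0.50, 0.75))
quartiles

# TRUE: the 0.50-quantile is the median
all.equal(unname(quartiles[2]), median(flippers_length))
```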

Calculating the First Quartile (\(Q_1\))

  • The first quartile is the median of the first half of the data set:
    1. Arrange the observations in ascending order;
    2. Drop the upper half of the ordered data:
      • \(n\) even: keep the first \(n/2\) observations;
      • \(n\) odd: keep the first \((n+1)/2\) observations;
    3. Calculate the median of the remaining observations.

Third Quartile (\(Q_3\))

  • The third quartile is the median of the second half of the data set:
    1. Arrange the observations in ascending order;
    2. Drop the lower half of the ordered data:
      • \(n\) even: keep the last \(n/2\) observations;
      • \(n\) odd: keep the last \((n+1)/2\) observations;
    3. Calculate the median of the remaining observations.
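
The two procedures above can be sketched in R as follows. The function names `first_quartile_by_hand` and `third_quartile_by_hand` are hypothetical helpers written for this illustration; they are not built-in R functions.

```r
# Hypothetical helper: median of the lower half of the ordered data
first_quartile_by_hand <- function(y) {
  y_sorted <- sort(y)
  n <- length(y_sorted)
  # n even: keep n / 2 observations; n odd: keep (n + 1) / 2
  keep <- if (n %% 2 == 0) n / 2 else (n + 1) / 2
  median(y_sorted[1:keep])
}

# Hypothetical helper: median of the upper half of the ordered data
third_quartile_by_hand <- function(y) {
  y_sorted <- sort(y)
  n <- length(y_sorted)
  keep <- if (n %% 2 == 0) n / 2 else (n + 1) / 2
  median(y_sorted[(n - keep + 1):n])
}

# The 8 flipper lengths (in cm), and the same data with the baby penguin
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)
with_baby <- c(2.3, flippers_length)

first_quartile_by_hand(flippers_length)  # n = 8 (even): 18.35
third_quartile_by_hand(with_baby)        # n = 9 (odd): 19.3
```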

First Quartile (\(Q_1\)): \(n\) is even

Compute the first quartile of the following 8 length measures of penguins’ flippers (in cm):
\[18.1,\ 18.6,\ 19.5,\ 19.3,\ 19.0,\ 18.1,\ 19.5,\ 19.3\]
  • First, we need to arrange the observations in ascending order:
    \[18.1,\ \color{red}{18.1},\ \color{red}{18.6}, 19.0\color{white}{,\ 19.3,\ 19.3,\ 19.5,\ 19.5} \]

  • Since \(n=8\) is even, we keep the first \(n/2 = 4\) observations

  • Calculate the median of the remaining values: \(Q_1 = \frac{18.1+18.6}{2} = 18.35\)

Third Quartile (\(Q_3\)): \(n\) is odd

Let’s bring back the baby penguin. Now, compute the third quartile of the following 9 length measures of penguins’ flippers (in cm):
\[\color{green}{\underline{2.3}}, 18.1,\ 18.6,\ 19.5,\ 19.3,\ 19.0,\ 18.1,\ 19.5,\ 19.3\]
  • First, we need to arrange the observations in ascending order:
    \[\color{white}{2.3, 18.1,\ 18.1,\ 18.6,} 19.0,\ 19.3,\ \color{red}{19.3},\ 19.5,\ 19.5 \]

  • Since \(n=9\) is odd, we keep the last \((n+1)/2 = 5\) observations

  • Calculate the median of the remaining values: \(Q_3 = 19.3\)

Quantiles using R

Using R:

# Stores the values in a variable called "flippers_length"
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

# Calculates the quantiles of the values stored in "flippers_length"
quantile(flippers_length, 0.25) # First quartile
   25% 
18.475 
quantile(flippers_length, 0.50) # Second quartile
  50% 
19.15 
quantile(flippers_length, 0.75) # Third quartile
  75% 
19.35 

Warning

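By default, R’s quantile() interpolates between neighbouring observations, so its results can differ slightly from the ones obtained with the approach above. For these 8 observations, for example, R returns a first quartile of 18.475, whereas the median-of-halves approach gives 18.35.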
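
To see the difference side by side, the sketch below compares `quantile()` with R's built-in `fivenum()`, which returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). For these data the hinges agree with the median-of-halves quartiles. The names `r_quartiles` and `hand_quartiles` are illustrative.

```r
# The 8 flipper lengths (in cm)
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

# Default quantile() method: interpolates between neighbouring observations
r_quartiles <- quantile(flippers_length, c(0.25, 0.75))

# Tukey's five-number summary; positions 2 and 4 are the hinges
hand_quartiles <- fivenum(flippers_length)[c(2, 4)]

r_quartiles     # 18.475 and 19.35
hand_quartiles  # 18.35 and 19.4
```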

Exercise

The final exam for STAT 200 was scheduled at a different time of day than the lecture. You want to learn how long it takes to get to UBC at that time of day so you know when to leave home. You asked your usual bus driver. A passionate hobbyist statistician, the bus driver asked which measure of centrality you want to know:

  1. mean commute time;
  2. median commute time;
  3. another commute time quantile; which one?
  4. I have no idea!

Explain your answer!

Scale

Variability measures

  • The measures of centrality are very helpful: they tell us where the data are centred.

  • However, they don’t tell us how much the data vary.

  • There are two very important variability measures: standard deviation and interquartile range.

Variance

  • Variance is the “arithmetic average” of the squared deviations from the mean:

\[ S^2 = \frac{\sum_{i=1}^n(y_i-\bar{y})^2}{n-1} = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n-1} \]

Caution

Note that we divide by \(n-1\) and not by \(n\).

Variance

Calculate the variance of penguins’ heights. The observed data (the same values stored in the R code for this example) are:
\[50,\ 100,\ 75,\ 88,\ 65\]

Variance - Step 1

Variance - Step 2

Variance - Step 3
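
The worked steps are not reproduced in this text. As a sketch of what they might look like, assume the data are the five heights \(50, 100, 75, 88, 65\) stored in the R code for this example, and assume the three steps are: (1) compute the mean, (2) compute the squared deviations from the mean, and (3) sum them and divide by \(n-1\). Under those assumptions:

\[\begin{align} \text{Step 1:}\quad \bar{y} & = \frac{50+100+75+88+65}{5} = \frac{378}{5} = 75.6 \\ \text{Step 2:}\quad (50-75.6)^2 & = 655.36,\quad (100-75.6)^2 = 595.36,\quad (75-75.6)^2 = 0.36, \\ (88-75.6)^2 & = 153.76,\quad (65-75.6)^2 = 112.36 \\ \text{Step 3:}\quad S^2 & = \frac{655.36+595.36+0.36+153.76+112.36}{5-1} = \frac{1517.2}{4} = 379.3 \end{align}\]

This agrees with the value that R returns for the same data.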

Variance

You can also use R:


# Stores the values in a variable called "penguins_height"
penguins_height <- c(50, 100, 75, 88, 65)

# Calculates the variance of the values stored in "penguins_height"
var(penguins_height)
[1] 379.3
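
As a check on the formula, the sketch below computes the variance directly from the squared deviations and shows what changes if we divide by \(n\) instead of \(n-1\). The variable names are illustrative.

```r
# The penguin heights used above
penguins_height <- c(50, 100, 75, 88, 65)
n <- length(penguins_height)

# Squared deviations from the mean
squared_deviations <- (penguins_height - mean(penguins_height))^2

# Dividing by n - 1 matches var(): 379.3
variance_n_minus_1 <- sum(squared_deviations) / (n - 1)

# Dividing by n gives a smaller value: 303.44
variance_n <- sum(squared_deviations) / n

all.equal(variance_n_minus_1, var(penguins_height))  # TRUE
```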

Standard Deviation

  • The problem with the variance is that it uses the square of the deviations;
    • This affects the unit of measurement, and our interpretation;
  • To fix that, we can take the square root of the variance: \[ S = \sqrt{S^2} \] \(S\) is called the standard deviation.

Properties of Standard Deviation

  • Std. Deviation is always non-negative (\(\geq 0\)).

  • If you add a constant \(c\) to all observations, the std. deviation does not change.

  • If you multiply all observations by a constant \(c\), then the std. deviation is multiplied by \(|c|\).
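
These properties can be checked numerically with R's `sd()`. In the sketch below, `shift` and `multiplier` are hypothetical constants chosen for illustration. The multiplier is negative on purpose: the std. deviation is then scaled by the absolute value of the constant, since it cannot be negative.

```r
# The penguin heights used in the variance example
penguins_height <- c(50, 100, 75, 88, 65)

shift <- 10       # hypothetical constant added to every observation
multiplier <- -2  # hypothetical constant multiplying every observation

original_sd <- sd(penguins_height)
shifted_sd  <- sd(penguins_height + shift)
scaled_sd   <- sd(multiplier * penguins_height)

original_sd >= 0                                      # TRUE
all.equal(shifted_sd, original_sd)                    # TRUE: unchanged
all.equal(scaled_sd, abs(multiplier) * original_sd)   # TRUE: scaled by |c|
```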

Interquartile range (IQR)

  • It is the width of the range that encloses the middle 50% of the observations: \[ IQR = Q_3 - Q_1 \]

You can use the IQR function in R to compute the IQR:

# Stores the values in a variable called "penguins_height"
penguins_height <- c(50, 100, 75, 88, 65)

# Calculating the IQR using quantiles
quantile(penguins_height, 0.75) - quantile(penguins_height, 0.25)
75% 
 23 
# Or you can use the IQR function
IQR(penguins_height)
[1] 23

Visualization of quantitative variables

Histogram

  • Since we are dealing with quantitative variables, we don’t have categories to count, so we should not use a bar chart;

  • Instead, we use a histogram: we split the range of the data into bins and then count how many observations fall in each bin.

  • Histograms follow some specific conventions:

    • There should be no space between bins.
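
A minimal sketch of the binning step, using R's `hist()` on the five penguin heights from the variance example. The bin edges in `bin_edges` are an arbitrary choice made for illustration, and `plot = FALSE` asks R to return the counts without drawing anything.

```r
# The penguin heights used in the variance example
penguins_height <- c(50, 100, 75, 88, 65)

# Hypothetical bin edges: three bins of width 20, from 40 to 100
bin_edges <- seq(40, 100, by = 20)

# Count the observations in each bin without drawing the plot
h <- hist(penguins_height, breaks = bin_edges, plot = FALSE)

h$breaks  # 40 60 80 100
h$counts  # 1 2 2
```

Calling `plot(h)`, or `hist()` without `plot = FALSE`, draws the bars side by side with no space between bins.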

Histogram

Histogram of penguin heights

Boxplot

Boxplot of penguin heights

The shape of distributions

  • How many peaks does the distribution have?

  • Is the distribution skewed or symmetric?

    • skewed to the right: long right-hand tail
    • skewed to the left: long left-hand tail

Symmetry and Skewness

Modality
