STAT 201 - Lecture 02

Population Concepts

Population

  • Population: the group containing all elements you want to study.
    • The population is fixed;
    • You don’t have access to all elements of the population;

Population Distribution

  • The population distribution is obtained by measuring all the elements in the population.

  • The population distribution is unknown!

    • remember: we don’t have access to all elements in the population, so we can never get the population distribution.

Parameters

  • Parameters: quantities that summarize the population.
    • Parameters are fixed but unknown;
    • We want to estimate them because they give us useful information about the population;

Sample Concepts

Random Sample

  • Random Sample: a subset (part) of the population selected at random
    • A random sample is random: it changes every time you draw a new sample;
    • You do have access to all elements of the sample;

Sample Distribution

  • The sample distribution is obtained by measuring all the elements in the sample.

  • The sample distribution is known!

  • We hope that the sample distribution resembles the population distribution;

    • remember: we don’t know the population distribution, so we will never know for sure whether it does.

Statistics

  • Statistics: quantities that summarize the sample.
    • Samples are random, so statistics are also random;
    • We use statistics to estimate unknown population parameters;
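To make the parameter/statistic distinction concrete, here is a small simulation. The population of incomes, the seed, and the sizes are all invented for illustration; the point is only that the parameter is fixed while the statistic varies from sample to sample:

```python
import random
import statistics

random.seed(1)

# Toy population of 10,000 incomes (invented; in practice it is inaccessible).
population = [random.gauss(50_000, 8_000) for _ in range(10_000)]

# Parameter: a fixed summary of the population (unknown in practice).
mu = statistics.mean(population)

# Statistic: computed from one random sample; changes with every new sample.
sample = random.sample(population, 100)
x_bar = statistics.mean(sample)
```

Re-running the last two lines with a different seed gives a different `x_bar`, while `mu` never moves.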

Example: BC’s Health System

  • Suppose we want to know the average income of all workers who work in BC’s hospitals.

Example: BC’s Health System

  • The first thing is to properly define our population;
    • part-time workers?
    • temporary workers?
    • casual workers?

Example: BC’s Health System

  • Second, the parameter(s) of interest.

  • What population quantities are you interested in?

    • population mean income (\(\mu\))?
    • population median income (\(Q_2\))?
    • population Std. Dev. (\(\sigma\))?
  • Finally, draw a random sample.

Random Sample

[Interactive plot: Population (\(\mu = ?\)) and a Sample drawn from it]

Exercise

  • Use the previous slides to investigate the following questions:
    1. What happens to the statistics when a new sample is taken?
    2. What happens to the parameters when a new sample is taken?
    3. Contrast the Sample Distribution with the Population Distribution for small and large sample sizes. What do you notice?

A note on Sampling and statistical inference

Independence

  • The inferential methods we will be discussing make a “strong” assumption that our sample is independent.

  • Independent sample: the selection of one element does not influence the selection of another.

  • When taking a sample we can do it with or without replacement.

Sampling with Replacement

  • Sampling with replacement: this approach allows repeated elements in our sample.
    1. select one element from the population.
    2. put the element back in the population.
    3. do Steps 1 and 2 \(n\) times.
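The three steps above can be sketched in Python (the four names are reused from the example later in this lecture; `random.choices` performs the whole procedure in one call):

```python
import random

random.seed(0)
population = ["Varada", "Mike", "John", "Hayley"]
n = 2

# Steps 1-3: select one element, put it back, repeat n times.
sample_with = [random.choice(population) for _ in range(n)]

# random.choices does the same in a single call; repeats are possible.
sample_with2 = random.choices(population, k=n)
```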

Sampling without Replacement

  • Sampling without replacement: this approach does not allow repeated elements in our sample.
    1. select one element from the population.
    2. remove the element from the population.
    3. do Steps 1 and 2 \(n\) times.
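The same steps without replacement can be sketched as follows (again with illustrative names; `random.sample` is the one-call equivalent):

```python
import random

random.seed(0)
population = ["Varada", "Mike", "John", "Hayley"]
n = 2

# Steps 1-3: select one element, remove it from the pool, repeat n times.
pool = list(population)
sample_without = []
for _ in range(n):
    sample_without.append(pool.pop(random.randrange(len(pool))))

# random.sample does the same in a single call; no repeats are possible.
sample_without2 = random.sample(population, n)
```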

Example

  • Suppose we have a group of 4 people: Varada, Mike, John, and Hayley.
  • We want to take a sample of size 2.

Sampling without Replacement

Possible Samples:

Sampling with Replacement

Possible Samples:
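Both enumerations of possible samples can be generated with Python's `itertools`. This sketch treats samples as unordered (an assumption; counting ordered draws would give 12 and 16 instead of 6 and 10):

```python
from itertools import combinations, combinations_with_replacement

people = ["Varada", "Mike", "John", "Hayley"]

# Unordered samples of size 2.
without = list(combinations(people, 2))                     # 6 possible samples
with_repl = list(combinations_with_replacement(people, 2))  # 10 possible samples

print(len(without), len(with_repl))  # 6 10
```

Note that pairs like (Varada, Varada) appear only in the with-replacement list.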

With vs Without Replacement - Part I

  • If we select the same element twice, we select repeated information and learn nothing new.
    • Sampling without replacement is more informative, meaning that our parameter estimates will be more precise.
  • You might be asking yourself why we would ever want to use simple random sampling (SRS) with replacement.
    • Answer: independence!

With vs Without Replacement - Part II

  • Unfortunately, sampling without replacement does not yield independent sampling.
    • The first elements you pick will affect the chances of the elements you will pick later.

Example: Violation of Independence

  • Imagine you have a box with six balls, three reds and three blacks.

  • Say you will take a sample of size 3.


  • The chances for the third ball depend on which balls were drawn before: the draws are not independent!
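The dependence can be computed exactly for the box above (3 red, 3 black, drawing without replacement); if the draws were independent, all three probabilities below would equal \(1/2\):

```python
from fractions import Fraction

red, black = 3, 3

# Without replacement, the make-up of the box changes after each draw.
p_red_after_two_reds = Fraction(red - 2, red + black - 2)  # 1 red of 4 left: 1/4
p_red_after_two_blacks = Fraction(red, red + black - 2)    # 3 red of 4 left: 3/4
p_red_first_draw = Fraction(red, red + black)              # 1/2

print(p_red_after_two_reds, p_red_after_two_blacks, p_red_first_draw)  # 1/4 3/4 1/2
```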

Large populations and small samples

  • Luckily, if the population is very large compared to the sample size, the independence violation is minimal;

  • Imagine if the box had five thousand red balls and five thousand black balls.

  • Say you will take a sample of size 3.

  • It is still not independent, but it is “almost independent” (meaning the violation is very tiny).
    • In these cases, the assumption of independence is reasonable, and we are in the game.
  • Rule of thumb: the sample size should be at most 10% of the population size.
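The "almost independent" claim can be checked directly with the two boxes from the examples (a quick sketch; box sizes as stated above):

```python
from fractions import Fraction

# P(2nd ball red | 1st ball red), versus the marginal P(red) = 1/2.
small_box = Fraction(3 - 1, 6 - 1)        # 3 red, 3 black: 2/5 = 0.4
big_box = Fraction(5000 - 1, 10_000 - 1)  # 5,000 red, 5,000 black: ~0.49995

# The small box deviates badly from 1/2; the big box barely deviates at all.
print(float(small_box), float(big_box))
```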

Pros, cons, and use

Sampling with replacement

  • Pros:
    • Independent sample
      • selection of an element doesn’t influence the selection of other elements
    • Variability even when the sample is the same size as the population
  • Cons:
    • Less informative (repeated information)
  • Use: bootstrap samples

Sampling without replacement

  • Pros:
    • More informative (less repeated information)
      • more precise parameter estimate
  • Cons:
    • Dependence
      • elements picked affect the chance of the elements you will pick later
      • less problematic when the sample is small compared to the population
    • No variability if the sample is the same size as the population
  • Use: sampling the population

Comments on Sampling distribution

Review: Parameter Estimation


Review: Sampling Distribution


Important

The sampling distribution shows us:

  1. What point estimates are possible (even more: their probabilities of occurring)

  2. Where the true parameter is (e.g. for means it lies at the mean of the sampling distribution)

Quantifying the uncertainty with the standard error


What is the standard error?

  • Standard error (SE) of a statistic: the standard deviation of its sampling distribution

  • Standard deviation (\(\sigma\) or \(s\)): the square root of the variance

    • measure of the amount of variation of the values of a variable about its mean
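A minimal numerical check of these definitions, using invented toy data and the standard library:

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # toy data (hypothetical values); mean is 5

# sigma: treat x as a whole population; s: treat x as a sample (n - 1 divisor).
sigma = statistics.pstdev(x)  # sqrt(population variance) = sqrt(4.0) = 2.0
s = statistics.stdev(x)       # sqrt(sample variance) = sqrt(32 / 7)

# Both are the square root of the corresponding variance.
assert math.isclose(sigma ** 2, statistics.pvariance(x))
assert math.isclose(s ** 2, statistics.variance(x))
```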

Sampling distribution - Part I

  • Sampling distribution is the distribution of a statistic across all possible samples;

  • Things that affect the sampling distribution:

    1. Population
    2. Sample Size
    3. Statistic
  • Once you have all three things set, the sampling distribution is fixed but unknown;

Sampling distribution - Part II

  • Technically, when you know the population, you could potentially obtain the exact sampling distribution;
    • Calculate the statistic across all possible samples (like we did for the aquarium example in Lecture 1)
    • But this is only manageable for very tiny problems.
  • For example, for a population of size \(200\) and samples of size \(20\), we need to consider \(\binom{200}{20} = 1613587787967350602876321792\) (about \(1.6 \times 10^{27}\)) possible samples.
    • this is still a very small population and sample!!
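This count is just a binomial coefficient, so it can be verified with the standard library (nothing assumed beyond `math.comb`):

```python
import math

# Number of distinct size-20 samples (without replacement, unordered)
# from a population of size 200:
n_samples = math.comb(200, 20)
print(f"{n_samples:.3e}")  # on the order of 1.6e+27
```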

Sampling distribution - Part III

  • Since we cannot evaluate all possible samples, we take many samples from the population to approximate the sampling distribution;
    • this approximation (not the sampling distribution) depends on the samples we draw;
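A sketch of this approximation, with an invented population (when the population is known, as here, we can draw many samples and look at the distribution of the statistic):

```python
import random
import statistics

random.seed(2)

# Toy population (invented); in practice the population is unknown.
population = [random.expovariate(1 / 50) for _ in range(100_000)]

# Fix the three ingredients: population, sample size (30), statistic (mean).
# Then approximate the sampling distribution by drawing many samples.
means = [statistics.mean(random.sample(population, 30)) for _ in range(2_000)]

# The spread of this approximation estimates the standard error of the mean.
se_approx = statistics.stdev(means)
```

Re-running with a different seed changes `means` slightly: the approximation, not the sampling distribution itself, depends on the samples drawn.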

Attention

  • You NEVER know the population in practice!!!!
    • If you do, you don’t need statistics.
  • You NEVER take multiple samples – you take one sample as large as you can
    • Why?
  • So how do we estimate the sampling distribution?

Approximating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Estimating the sampling distribution with bootstrapping

Important: Bootstrapping

  • Bootstrapping samples must be:
    • drawn with replacement;
    • of the same size as the original sample;
  • The bootstrap distribution:
    • is an approximation of the sampling distribution (has similar spread and shape);
    • is centered around the sample statistic (not the parameter);
    • used to estimate the standard error of a statistic;
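A minimal sketch of these bootstrapping rules, assuming a toy Gaussian sample (the data, seed, and sizes are invented):

```python
import random
import statistics

random.seed(3)

# The one sample we actually observe (toy data, invented for illustration).
sample = [random.gauss(50, 10) for _ in range(40)]
n = len(sample)

# Bootstrap: resample WITH replacement, same size as the original sample.
boot_means = [statistics.mean(random.choices(sample, k=n)) for _ in range(2_000)]

# The bootstrap distribution centers near the sample statistic...
center_gap = abs(statistics.mean(boot_means) - statistics.mean(sample))

# ...and its spread estimates the standard error of the statistic.
se_boot = statistics.stdev(boot_means)
```

Note that `random.choices` (with replacement) rather than `random.sample` (without) is what makes this a bootstrap.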

Take home

  • Concepts:
    • The bootstrap distribution estimates the sampling distribution;
    • The sampling distribution centers around the population parameter;
    • The bootstrap distribution centers around the sample statistic;
    • Even though the centers of the bootstrap distribution and the sampling distribution differ, the bootstrap standard error is a good estimate of the standard error of the sampling distribution;
  • Use:
    • Bootstrapping can be used with many sample statistics (means, proportions, median, percentile)
    • If the sample is not representative, the bootstrap distribution will be biased
    • Does not work well when the original sample size is small

Today’s worksheet

  • Introduce bootstrapping
  • Compare the bootstrap distribution with the sample and sampling distributions