Module 3: Mathematical approximation of the sampling distribution

Learning objectives

By the end of this module, you will be able to:

  • Describe the Law of Large Numbers

  • Describe a normal distribution

  • Explain the Central Limit Theorem and other general asymptotic results

  • List the properties of the sampling distribution

  • Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty

Estimating the sampling distribution via Bootstrapping

Estimating the sampling distribution via CLT

Today’s goal

  • Our goal: an alternative method for estimating the sampling distribution

  • Use mathematical results to approximate the sampling distribution without resampling (that is, without bootstrapping)

    • More computationally efficient
  • But we need to understand:

    • The normal distribution

    • Central limit theorem

    • The law of large numbers

The Normal Distribution

Introduction

  • Surprisingly, many unrelated variables from different studies have a unimodal distribution that is (roughly) symmetric around the mean. For example:
    • Birthweight;
    • Housefly wing length;
    • Pulse rate per minute of adults;

Introduction

  • We might be interested in questions like:
    • What is the proportion of newborns with weight between 2.5kg and 5kg?
    • What is the proportion of adults with a pulse rate above 100 beats/min?
    • What is the birthweight such that 95% of newborns weigh less than it? (quantile)
    • What is the pulse rate such that 95% of adults have a pulse rate above it? (quantile)

Normal Model

  • A specific probabilistic model, the Normal (or Gaussian) distribution, can often model these variables quite well.
  • But why use models?
    • We can use the model, instead of the data itself, to answer questions (such as the ones on the previous slide);
    • Models can help us to describe the relation between variables;

Normal Model

  • Properties:
    • Bell-shaped and Unimodal;

    • Fully specified by two parameters, \(\mu\) and \(\sigma\):

      • \(\mu\) determines the location;

      • \(\sigma\) determines the spread;

    • Symmetric about the mean \(\mu\);

Areas under the Normal Model

  • The area under the Normal model tells us the probability that the corresponding variable is in a specified region.

  • We need a computer to obtain the area under the Normal model, because the area has no closed-form expression.

  • But, there’s a rule that can help us do a quick check of our calculations.

The 68-95-99.7% Rule

Whatever the values of \(\mu\) and \(\sigma\), the following rule holds:

Interval                        | % of data within the interval
within \(1\sigma\) of \(\mu\)   | about \(68\%\)
within \(2\sigma\) of \(\mu\)   | about \(95\%\)
within \(3\sigma\) of \(\mu\)   | about \(99.7\%\)


  • This is a useful approximation for sanity checks!
    • For actual solutions use R.
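
The rule can be checked numerically with pnorm. The sketch below is illustrative only: the helper name within_k_sd and the values mu = 10 and sigma = sqrt(3) are assumptions chosen for the example, and any other values should give the same three probabilities.

```r
# Probability that a Normal variable falls within k standard deviations
# of its mean. `within_k_sd` is a hypothetical helper written for this sketch.
within_k_sd <- function(k, mu = 0, sigma = 1) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
}

# Evaluate for k = 1, 2, 3 with arbitrary (assumed) mu and sigma
rule_check <- sapply(1:3, within_k_sd, mu = 10, sigma = sqrt(3))
round(rule_check, 4)  # approximately 0.6827 0.9545 0.9973
```

The exact areas are slightly different from 68%, 95% and 99.7%, which is why the rule is stated with "about".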

R’s pnorm and qnorm functions

Probability:

  • To obtain the area under the curve, we use the pnorm function.

  • For example, suppose we have a \(N( \mu = 10, \sigma^2 = 3)\) and want the area below 11.5:

  • We can use the following code:
pnorm(11.5, mean = 10, sd = sqrt(3))
[1] 0.8067619

Quantile:

  • To obtain the quantile of a Normal, we use the qnorm function.

  • For example, suppose we have a \(N( \mu = 10, \sigma^2 = 3)\) and want the 0.69-quantile:

  • We can use the following code:
qnorm(0.69, mean = 10, sd = sqrt(3))
[1] 10.85884
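
The same two functions also cover "between", "above", and "upper quantile" questions like the ones posed in the Introduction. The sketch below reuses the \(N(\mu = 10, \sigma^2 = 3)\) example; the cut-off 9 and the variable names are assumptions made for illustration.

```r
mu    <- 10
sigma <- sqrt(3)

# P(9 < X < 11.5): area between two values is a difference of lower-tail areas
p_between <- pnorm(11.5, mean = mu, sd = sigma) - pnorm(9, mean = mu, sd = sigma)

# P(X > 11.5): upper-tail area, i.e. 1 - 0.8067619 = 0.1932381
p_above <- pnorm(11.5, mean = mu, sd = sigma, lower.tail = FALSE)

# qnorm is the inverse of pnorm: this recovers 11.5
q_back <- qnorm(pnorm(11.5, mean = mu, sd = sigma), mean = mu, sd = sigma)

# The value with 95% of the distribution ABOVE it is the 0.05-quantile
q_upper95 <- qnorm(0.05, mean = mu, sd = sigma)
```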

Standard Normal

  • The Normal distribution with \(\mu=0\) and \(\sigma^2=1\) is called the Standard Normal distribution, i.e., \(N(0, 1)\).

  • There are multiple ways to check for adequacy of the Normal model. A simple (and subjective) way is to check if the relative frequency histogram looks like a Normal curve.
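
A minimal sketch of the histogram check described above, using simulated data. The sample obs, its size, and its mean and standard deviation are assumptions for illustration, not the housefly or birthweight data.

```r
set.seed(1)

# Simulated sample standing in for real measurements (assumed values)
obs <- rnorm(200, mean = 45, sd = 4)

m <- mean(obs)  # sample mean, used as the Normal model's mu
s <- sd(obs)    # sample standard deviation, used as the Normal model's sigma

# Relative-frequency (density) histogram: the bar areas sum to 1
h <- hist(obs, freq = FALSE, main = "Simulated sample", xlab = "Measurement")

# Overlay the Normal curve with the fitted mean and standard deviation
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)
```

If the bars follow the curve reasonably well, the Normal model is plausible; clear skewness or heavy tails suggest otherwise.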

Example 1: Housefly Wing Lengths

  • Sokal and Hunter (1955) studied the wing lengths of houseflies.

Example 2: Birthweight

In this case, we have a heavier left tail, which might compromise the Normal approximation.

The Central Limit Theorem (CLT)

Central Limit Theorem (CLT)

  • The Central Limit Theorem helps us to approximate the sampling distribution of certain statistics.

  • Loosely speaking, the CLT states that, whatever the population distribution (provided it has finite variance), the sampling distribution of certain statistics, such as the sample mean and the sample proportion, is approximately Normal for large sample sizes.

CLT for the Sample Mean

  • For large sample sizes, the sampling distribution of the sample mean is approximately \[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] regardless of the population distribution.
  • Note the mean of the sampling distribution is the population mean \(\mu\);

  • The standard error is given by: \[SE(\bar{X})=\frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the standard deviation of the population.

Warning

If the population distribution is Normal, then \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) is an exact result for any sample size. We don’t need the CLT in this case.
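
The CLT for the sample mean can be illustrated with a short simulation. This is a sketch under assumed settings: an Exponential population with mean 2 and standard deviation 2 (clearly not Normal), sample size n = 50, and 5000 replicates.

```r
set.seed(123)

n    <- 50     # sample size (arbitrary choice)
reps <- 5000   # number of simulated samples (arbitrary choice)
rate <- 1 / 2  # Exponential population: mean = 2, sd = 2

# Each replicate draws one sample of size n and records its mean
sample_means <- replicate(reps, mean(rexp(n, rate = rate)))

mean(sample_means)  # should be close to mu = 2
sd(sample_means)    # should be close to sigma / sqrt(n)
se_theory <- 2 / sqrt(n)
se_theory           # theoretical standard error

# A density histogram of the sample means should look roughly bell-shaped
hist(sample_means, freq = FALSE, main = "Sample means, n = 50")
curve(dnorm(x, mean = 2, sd = se_theory), add = TRUE, lwd = 2)
```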

CLT for the sample proportion

  • For large sample sizes, we can approximate the sampling distribution of \(\hat{p}\) by \[{\hat{p}} \sim N\left(p, \frac{p(1-p)}{n}\right)\]
  • Note the mean of the sampling distribution is the population proportion \(p\);

  • The standard error is given by: \[SE(\hat{p})=\sqrt{\frac{p(1-p)}{n}}\]
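
A similar sketch for the sample proportion, assuming a population proportion p = 0.3, sample size n = 200, and 5000 replicates (all chosen only for illustration).

```r
set.seed(123)

n    <- 200   # sample size (arbitrary choice)
p    <- 0.3   # population proportion (assumed known for the simulation)
reps <- 5000  # number of simulated samples

# rbinom counts successes among n independent draws; dividing by n gives p-hat
p_hats <- rbinom(reps, size = n, prob = p) / n

mean(p_hats)                         # should be close to p
sd(p_hats)                           # should be close to the theoretical SE
se_theory <- sqrt(p * (1 - p) / n)
se_theory
```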

Sample size effect on the sampling distribution (Normal)

Sample size effect on the sampling distribution (Not-Normal)

Assumptions & conditions

  • Sample is randomly drawn from the population
  • Sample values are independent
    • Generally, when sampling without replacement, a sample larger than 10% of the population size violates independence to a degree that matters.
  • Sample size must be large enough.
    • For means:
      • no universal guideline for how big \(n\) should be
      • but usually samples of size \(n > 30\) are big enough to get a reasonable approximation (not guaranteed!)
    • For proportions:
      • check \(n\times p \ge 10\) and \(n\times(1-p) \ge 10\)
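
The condition for proportions is easy to automate. The helper below, check_success_failure, is a hypothetical function written for this sketch; in practice \(p\) is unknown, so the sample proportion would be plugged in.

```r
# Returns TRUE when both expected counts, n * p and n * (1 - p),
# are at least `threshold` (10 by default, as in the condition above).
check_success_failure <- function(n, p, threshold = 10) {
  (n * p >= threshold) && (n * (1 - p) >= threshold)
}

check_success_failure(n = 200, p = 0.3)  # TRUE: expected counts are 60 and 140
check_success_failure(n = 50,  p = 0.1)  # FALSE: n * p is only 5
```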

Standard Error (SE)

  • Standard error: the standard deviation of a point estimator's sampling distribution.

  • Mean: \[SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the population standard deviation.

  • Proportion: \[SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\] where \(p\) is the population proportion.

  • In reality, we don’t know the population values, so we use sample estimates instead.

  • Mean: \[\widehat{SE}(\bar{X}) = \frac{s}{\sqrt{n}}\] where \(s\) is the sample standard deviation.

  • Proportion: \[\widehat{SE}(\hat{p}) = \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}}\] where \(\hat{p}\) is the sample proportion.
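
The estimated standard errors can be computed directly from a sample. The data below are simulated stand-ins (assumed sample sizes and parameters), used only to show the calculation.

```r
set.seed(42)

# Numeric sample: estimated SE of the sample mean is s / sqrt(n)
x <- rnorm(40, mean = 10, sd = sqrt(3))  # simulated data, for illustration
se_mean_hat <- sd(x) / sqrt(length(x))

# Binary sample (1 = success, 0 = failure): estimated SE of the sample proportion
y <- rbinom(150, size = 1, prob = 0.3)   # simulated data, for illustration
p_hat <- mean(y)                         # sample proportion
se_prop_hat <- sqrt(p_hat * (1 - p_hat) / length(y))

se_mean_hat
se_prop_hat
```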

The law of large numbers (LLN)

The law of large numbers (LLN)

  • The law of large numbers states that as the sample size increases, the sample mean converges to the population mean.

  • In other words, with a sufficiently large number of observations, the sample mean will be close to the population mean (guaranteed!).

    • But again, what is large?
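
One way to see the law of large numbers is to track the running mean as observations accumulate. The sketch assumes an Exponential population with mean 2 and 10,000 draws; both choices are arbitrary.

```r
set.seed(2024)

n_max <- 10000
draws <- rexp(n_max, rate = 1 / 2)  # population mean is 2

# Running mean after 1, 2, ..., n_max observations
running_mean <- cumsum(draws) / seq_len(n_max)

# Sample mean at a few sample sizes
running_mean[c(10, 100, 1000, 10000)]

# The running mean settles around the population mean (dashed line)
plot(running_mean, type = "l", log = "x",
     xlab = "Sample size", ylab = "Sample mean")
abline(h = 2, lty = 2)
```

How quickly the running mean settles depends on the population's spread, which is one reason there is no single answer to "what is large?".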

The law of large numbers

  • The law of large numbers is actually intuitive given what we have seen so far.

The law of large numbers

To Take Home

Take home: CLT

  • CLT only works for certain statistics (e.g., sample mean, sample proportion);

  • As the sample size increases, the sampling distribution of the sample mean and proportion becomes narrower, more symmetrical, and more bell-shaped.

Take home: Std. Errors

  • The standard errors:
    • \(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\)
    • \(SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\)
  • These formulae do not depend on the CLT. They are valid for all sample sizes.
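
The last point can be checked by simulation with a very small sample. The sketch assumes independent draws from an Exponential population with standard deviation 2 and uses n = 3: the sampling distribution is still clearly skewed, yet its standard deviation matches \(\sigma/\sqrt{n}\).

```r
set.seed(7)

n    <- 3      # deliberately tiny sample size
reps <- 20000  # number of simulated samples

# Sample means from a skewed (Exponential, mean = 2, sd = 2) population
small_n_means <- replicate(reps, mean(rexp(n, rate = 1 / 2)))

sd(small_n_means)  # close to the formula value below
se_formula <- 2 / sqrt(n)
se_formula

# Sample skewness: well above 0, so the distribution is not yet Normal
skew <- mean((small_n_means - mean(small_n_means))^3) / sd(small_n_means)^3
skew
```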

Today’s worksheet

  • Investigate the law of large numbers and the central limit theorem
  • See that the sampling distributions for the sample mean/proportion can be well approximated by the Normal distribution when the sample size is large, regardless of the distribution of the population

References

Sokal, Robert R., and Preston E. Hunter. 1955. “A Morphometric Analysis of DDT-Resistant and Non-Resistant House Fly Strains.” Annals of the Entomological Society of America 48 (6): 499–507. https://doi.org/10.1093/aesa/48.6.499.