Module 3: Mathematical approx. of the sampling distribution

Learning objectives

By the end of this module, you will be able to:

Describe the Law of Large Numbers
Describe a normal distribution
Explain the Central Limit Theorem and other general asymptotic results
List the properties of the sampling distribution
Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty

Estimating the sampling distribution via Bootstrapping

Estimating the sampling distribution via CLT

Today’s goal

Our goal: an alternative method to estimate the sampling distribution.
Use mathematical results to approximate the sampling distribution without resampling (that is, without bootstrapping).
- More computationally efficient
But first we need to understand:
- The normal distribution
- Central limit theorem
- The law of large numbers

The Normal Distribution

Introduction

Surprisingly, many unrelated variables from different studies have a unimodal distribution that is (roughly) symmetric around the mean. For example:
- Birthweight;
- Housefly wing length;
- Pulse rate per minute of adults;

Introduction

We might be interested in questions like:
- What is the proportion of newborns with weight between 2.5kg and 5kg?
- What is the proportion of adults with a pulse rate above 100 beats/min?
- What is the birthweight such that 95% of newborns weigh less than it? (quantile)
- What is the pulse rate such that 95% of adults have a pulse rate above it? (quantile)

Normal Model

A specific probabilistic model, named the Normal (or Gaussian) distribution, can often model these variables quite well.

But why use models?
- We can use the model to answer questions (such as the ones in the previous slide), instead of the data itself;
- Models can help us to describe the relation between variables;

Normal Model

viewof mu = {

  let input = Inputs.range([-5, 5], {value: 0, step: 0.1, label: "Mean"});
  
  d3.select(input)
  .select('input[name="number"]')
  .style("width", "75px");

  d3.select(input)
   .select('input[name="range"]')
   .style("width", "230px");

  return input;
}

viewof sigma = {
  let input = Inputs.range([0.3, 3], {value: 1, step: 0.1, label: "Standard Deviation"})
 
 d3.select(input)
   .select('input[name="number"]')
   .style("width", "50px");
  
  d3.select(input)
   .select('input[name="range"]')
   .style("width", "150px");
  
  return input;
}

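// Evenly spaced values from lower (inclusive) up to upper (exclusive).
range = (lower, upper, step) => {
  // Round so floating-point error in the division cannot produce an invalid array length.
  const size = Math.round((upper - lower) / step);

  return Array.from({length: size}, (_, index) => lower + step * index);
};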


normal_pdf = (x, mu, sigma) => {
  return x.map(value => {
    return Math.exp(-0.5 * ( (value - mu) **2 ) / (sigma**2))/(Math.sqrt(2 * Math.PI * sigma ** 2 ))
  });
};

garbage_variable = {


  d3.select("#normal-density")
    .selectAll("*")
    .remove()

  var margin = {top: 40, right: 30, bottom: 50, left: 100},
      width = 530 - margin.left - margin.right,
      height = 400 - margin.top - margin.bottom;

  // append the svg object to the body of the page
  var svg = d3.select("#normal-density")
    .append("svg")
      .attr("width", width + margin.left + margin.right)
      .attr("height", height + margin.top + margin.bottom)
    .append("g")
      .attr("transform",
            "translate(" + margin.left + "," + margin.top + ")");
  
  const data_x = range(-10, 10, 0.01);
  const data_y_std = normal_pdf(data_x, 0, 1);
  const data_y = normal_pdf(data_x, mu, sigma);
  const data = data_x.map((value, index) => {
    return {'x': value, 'y_std': data_y_std[index], 'y': data_y[index]};
    });

    // Add the X axis (a linear scale spanning the plotted x values)
    var x = d3.scaleLinear()
      .domain([d3.min(data, function(d) { return +d.x; }), d3.max(data, function(d) { return +d.x; })])
      .range([ 0, width ]);
    svg.append("g")
      .attr("transform", "translate(0," + height + ")")
      .call(d3.axisBottom(x));

    // Add Y axis
    var y = d3.scaleLinear()
      .domain([0, d3.max(data, function(d) { return +Math.max(d.y, d.y_std); })])
      .range([ height, 0 ]);
    svg.append("g")
      .call(d3.axisLeft(y));

    // Add the line
    svg.append("path")
      .datum(data)
      .attr("fill", "none")
      .attr("stroke", "steelblue")
      .attr("stroke-width", 1.5)
      .attr("d", d3.line()
        .x(function(d) { return x(d.x) })
        .y(function(d) { return y(d.y_std) })
        );
    
    svg.append("path")
    .datum(data)
    .attr("fill", "none")
    .attr("stroke", "red")
    .attr("stroke-width", 1.5)
    .attr("d", d3.line()
      .x(function(d) { return x(d.x) })
      .y(function(d) { return y(d.y) })
      );

  svg.selectAll('text')    
     .style('font-size', '14px');

  svg.append("text")
      .attr("class", "y label")
      .attr("text-anchor", "middle")
      .attr("y", x(-14))
      .attr("x", -y(0)/2)
      .attr("dy", ".75em")
      .attr("transform", "rotate(-90)")
      .style('font-size', '24px')
      .text("Density");

  svg.append("text")
    .attr("class", "x label")
    .attr("text-anchor", "middle")
    .attr("x", x(0))
    .attr("y", height + 45)
    .style('font-size', '24px')
    .text("x");

   svg.append("text")
        .attr("x", width / 2)
        .attr("y", 0)
        .attr("text-anchor", "middle")
        .text("Normal Curve")
        .attr("dy", "-15px")
        .style('font-size', '32px')
        .attr("class", "plot-title");

   // Legend: one colored line and one label per curve

  svg.append("rect")
    .attr("x", x(6))
    .attr("y", 40)
    .attr("width", 20)
    .attr("height", 2)
    .style("fill", "steelblue")
  svg.append("text")
    .attr("x", x(7.5))
    .attr("y", 45)
    .text("N(0, 1)")
    .style("font-size", "15px")
    .attr("alignment-baseline","middle")

  svg.append("rect")
    .attr("x", x(6))
    .attr("y", 60)
    .attr("width", 20)
    .attr("height", 2)
    .style("fill", "red")
  svg.append("text")
    .attr("x", x(7.5))
    .attr("y", 65)
    .text('N(' + mu + ', '+ (sigma**2).toPrecision(2) +')')
    .style("font-size", "15px")
    .attr("alignment-baseline","middle")


};

Properties:
- Bell-shaped and Unimodal;
- Fully specified by two parameters, \(\mu\) and \(\sigma\):
  - \(\mu\) determines the location;
  - \(\sigma\) determines the spread;
- Symmetric about the mean \(\mu\);

Areas under the Normal Model

The area under the Normal curve over a region gives the probability that the corresponding variable falls in that region.
We need a computer to obtain areas under the Normal curve (the integral has no closed-form solution).
However, there is a rule that helps us do a quick check of our calculations.

The 68-95-99.7% Rule

Whatever the values of \(\mu\) and \(\sigma\), the following rule holds:

Interval	% of data within the interval
within \(1\sigma\) of \(\mu\)	about \(68\%\)
within \(2\sigma\) of \(\mu\)	about \(95\%\)
within \(3\sigma\) of \(\mu\)	about \(99.7\%\)

This is a useful approximation for sanity checks!
- For actual solutions use R.
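
As a quick check of the rule, the sketch below uses `pnorm` to compute the exact areas within 1, 2, and 3 standard deviations of the mean. The values of `mu` and `sigma` are arbitrary illustrative choices (not from the slides); changing them should not change the result.

```r
# Area within k standard deviations of the mean for a Normal distribution.
# mu and sigma are illustrative assumptions; any valid values give the same areas.
mu <- 10
sigma <- sqrt(3)
k <- 1:3

within_k <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

round(within_k, 4)  # 0.6827 0.9545 0.9973
```

The exact areas are slightly different from 68%, 95%, and 99.7%, which is why the rule is only a rough check.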

R’s `pnorm` and `qnorm` functions

Probability:

To obtain the area under the curve, we use the pnorm function.
For example, suppose we have a \(N( \mu = 10, \sigma^2 = 3)\) and want the area below 11.5:

We can use the following code

pnorm(11.5, mean = 10, sd = sqrt(3))

[1] 0.8067619

Quantile:

To obtain the quantile of a Normal, we use the qnorm function.
For example, suppose we have a \(N( \mu = 10, \sigma^2 = 3)\) and want the 0.69-quantile:

We can use the following code

qnorm(0.69, mean = 10, sd = sqrt(3))

[1] 10.85884
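
The same two functions cover the other kinds of questions raised earlier (area between two values, area above a value). The sketch below reuses the \(N(\mu = 10, \sigma^2 = 3)\) example; the lower cut-off of 9 is an arbitrary illustrative value.

```r
mu <- 10
sigma <- sqrt(3)

# Area between two values: subtract the two lower-tail areas.
area_between <- pnorm(11.5, mean = mu, sd = sigma) - pnorm(9, mean = mu, sd = sigma)

# Area above a value: ask for the upper tail directly.
area_above <- pnorm(11.5, mean = mu, sd = sigma, lower.tail = FALSE)

# qnorm is the inverse of pnorm: feeding it an area returns the cut-off.
cutoff <- qnorm(pnorm(11.5, mean = mu, sd = sigma), mean = mu, sd = sigma)

area_between
area_above   # 1 - 0.8067619 = 0.1932381
cutoff       # 11.5, up to floating-point error
```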

Standard Normal

The Normal distribution with \(\mu=0\) and \(\sigma^2=1\) is called the Standard Normal distribution, i.e., \(N(0, 1)\).
There are multiple ways to check for adequacy of the Normal model. A simple (and subjective) way is to check if the relative frequency histogram looks like a Normal curve.

Example 1: Housefly Wing Lengths

Sokal and Hunter (1955) studied the wing lengths of houseflies.

Example 2: Birthweight

In this case, we have a heavier left tail, which might compromise the Normal approximation.

The Central Limit Theorem (CLT)

Central Limit Theorem (CLT)

The Central Limit Theorem helps us to approximate the sampling distribution of certain statistics.
Loosely speaking, the CLT states that, no matter what the population distribution is (provided its variance is finite), the sampling distribution of certain statistics, such as the sample mean and the sample proportion, is approximately Normal for large sample sizes.

CLT for the Sample Mean

For large sample sizes, the sampling distribution of the sample mean is approximately \[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] regardless of the population distribution.

Note the mean of the sampling distribution is the population mean \(\mu\);
The Std. Error is given by: \[SE(\bar{X})=\frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the std. dev. of the population.
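
A small simulation can make this concrete. The sketch below assumes a right-skewed Exponential(1) population (mean 1, standard deviation 1), a sample size of 50, and 10,000 repetitions; all of these are illustrative choices, not values from the slides.

```r
set.seed(2024)

n <- 50        # sample size (illustrative)
reps <- 10000  # number of simulated samples (illustrative)
mu <- 1        # mean of an Exponential(rate = 1) population
sigma <- 1     # std. dev. of an Exponential(rate = 1) population

# Draw many samples from the skewed population and keep each sample mean.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)  # should be close to mu
sd(sample_means)    # should be close to sigma / sqrt(n)
sigma / sqrt(n)

# Compare the simulated sampling distribution with the CLT approximation.
hist(sample_means, breaks = 40, freq = FALSE,
     main = "Sample means, Exponential(1) population", xlab = "Sample mean")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)), add = TRUE, col = "red")
```

Repeating this with a smaller `n` (say 5) should show a visibly skewed histogram, which is the sense in which the approximation needs a large sample.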

Warning

If the population distribution is Normal, then \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) is an exact result for any sample size. We don’t need CLT in this case.

CLT for the sample proportion

For large sample sizes, we can approximate the sampling distribution of \(\hat{p}\) by \[{\hat{p}} \sim N\left(p, \frac{p(1-p)}{n}\right)\]

Note the mean of the sampling distribution is the population proportion \(p\);
The Std. Error is given by: \[SE(\hat{p})=\sqrt{\frac{p(1-p)}{n}}\]
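
To see how good the approximation can be, the sketch below compares the Normal approximation with the exact Binomial probability for one case. The values \(p = 0.3\), \(n = 200\), and the cut-off of 70 successes are hypothetical, chosen only for illustration.

```r
p <- 0.3   # population proportion (hypothetical)
n <- 200   # sample size (hypothetical)
k <- 70    # cut-off in number of successes, i.e. p_hat <= 70 / 200 = 0.35

se <- sqrt(p * (1 - p) / n)

# Normal (CLT) approximation of P(p_hat <= k / n).
approx_prob <- pnorm(k / n, mean = p, sd = se)

# Exact probability: the number of successes is Binomial(n, p).
exact_prob <- pbinom(k, size = n, prob = p)

c(approx = approx_prob, exact = exact_prob)
```

The two probabilities should agree to within about a percentage point here; the gap tends to widen when \(n\) is small or \(p\) is close to 0 or 1.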

Sample size effect on the sampling distribution (Normal)

Sample size effect on the sampling distribution (Not-Normal)

Assumptions & conditions

Sample is randomly drawn from the population
Sample values are independent
- When sampling without replacement, a sample larger than about 10% of the population generally violates independence.
Sample size must be large enough.
- For means:
  - no universal guideline for how big \(n\) should be
  - but samples of size \(n > 30\) are usually big enough to get a reasonable approximation (not guaranteed!)
- For proportions:
  - check \(n\times p \ge 10\) and \(n\times(1-p) \ge 10\)
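
The proportion check is easy to automate. The helper below, `check_proportion_conditions`, is a hypothetical name introduced here for illustration; it is not part of R or of any package. It assumes a value of \(p\) is available (in practice one would typically plug in \(\hat{p}\)).

```r
# Hypothetical helper: TRUE/FALSE for each of the two proportion conditions.
check_proportion_conditions <- function(n, p, threshold = 10) {
  c(successes = n * p >= threshold,
    failures  = n * (1 - p) >= threshold)
}

check_proportion_conditions(n = 200, p = 0.3)  # both TRUE (60 and 140 expected)
check_proportion_conditions(n = 50, p = 0.1)   # successes is FALSE (only 5 expected)
```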

Standard Error (SE)

Standard error: the standard deviation of a point estimator’s sampling distribution.
Mean: \[SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the standard deviation of the population.
Proportion: \[SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\] where \(p\) is the population proportion.

Reality: we don’t know the population values, instead we use sample estimates.
Mean: \[\widehat{SE}(\bar{X}) = \frac{s}{\sqrt{n}}\] where \(s\) is the sample standard deviation.
Proportion: \[\widehat{SE}(\hat{p}) = \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}}\] where \(\hat{p}\) is the sample proportion.
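
As a sketch of the plug-in estimates, the code below computes both estimated standard errors. The data are stand-ins: a simulated sample of 40 values for the mean, and a hypothetical count of 27 successes out of 60 for the proportion.

```r
set.seed(1)

# Estimated SE of the sample mean, from a simulated stand-in sample.
x <- rnorm(40, mean = 10, sd = sqrt(3))
se_mean <- sd(x) / sqrt(length(x))

# Estimated SE of the sample proportion, from hypothetical counts.
successes <- 27
n <- 60
p_hat <- successes / n
se_prop <- sqrt(p_hat * (1 - p_hat) / n)

se_mean
se_prop  # sqrt(0.45 * 0.55 / 60), about 0.064
```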

The law of large numbers (LLN)

The law of large numbers states that as the sample size increases, the sample mean converges to the population mean.
In other words, with a sufficiently large number of observations, the sample mean will be close to the population mean (guaranteed!).
- But again, what is large?
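
One way to see the law of large numbers is to track the running mean of a growing sample. The sketch below again assumes an Exponential(1) population (population mean 1) and 10,000 observations; both are illustrative choices.

```r
set.seed(42)

# One long sample from a population whose mean is 1.
x <- rexp(10000, rate = 1)

# Sample mean after 1, 2, 3, ... observations.
running_mean <- cumsum(x) / seq_along(x)

running_mean[c(10, 100, 1000, 10000)]

plot(running_mean, type = "l", log = "x",
     xlab = "Number of observations", ylab = "Sample mean")
abline(h = 1, lty = 2)  # population mean
```

The early values typically wander noticeably, while the later ones settle near the dashed line.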

The law of large numbers

The law of large numbers is actually intuitive given what we have seen so far.

The law of large numbers

To Take Home

Take home: CLT

The CLT only applies to certain statistics (e.g., sample mean, sample proportion);
As the sample size increases, the sampling distribution of the sample mean and of the sample proportion becomes narrower, more symmetric, and more bell-shaped.

Take home: Std. Errors

The standard errors:
- \(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\)
- \(SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\)
These formulae do not depend on the CLT. They are valid for all sample sizes.

Today’s worksheet

Investigate the law of large numbers and the central limit theorem
See that the sampling distributions for the sample mean/proportion can be well approximated by the Normal distribution when the sample size is large, regardless of the distribution of the population

References

Sokal, Robert R., and Preston E. Hunter. 1955. “A Morphometric Analysis of DDT-Resistant and Non-Resistant House Fly Strains.” Annals of the Entomological Society of America 48 (6): 499–507. https://doi.org/10.1093/aesa/48.6.499.
