The Normal Distribution

STAT 200 - Chapter 5 Part II

Introduction

Surprisingly, many unrelated variables from different studies have a unimodal distribution that is (roughly) symmetric around the mean. For example:
- Birthweight;
- Housefly wing length;
- Pulse rate per minute of adults.

Introduction

We might be interested in questions like:
- What is the proportion of newborns with weight between 2.5 kg and 5 kg?
- What is the proportion of adults with a pulse rate above 100 beats/min?
- What is the birthweight such that 95% of newborns weigh less than it? (quantile)
- What is the pulse rate such that 95% of adults have a higher rate? (quantile)

Normal Model

A specific probabilistic model, called the Normal (or Gaussian) distribution, often models these variables quite well.

But why use models?
- We can use the model, instead of the data itself, to answer questions (such as the ones on the previous slide);
- Models can help us describe the relation between variables.

Normal Model

viewof mu = {

  let input = Inputs.range([-5, 5], {value: 0, step: 0.1, label: "Mean"});
  
  d3.select(input)
  .select('input[name="number"]')
  .style("width", "75px");

  d3.select(input)
   .select('input[name="range"]')
   .style("width", "230px");

  return input;
}

viewof sigma = {
  let input = Inputs.range([0.3, 3], {value: 1, step: 0.1, label: "Standard Deviation"});

  d3.select(input)
    .select('input[name="number"]')
    .style("width", "50px");

  d3.select(input)
    .select('input[name="range"]')
    .style("width", "150px");

  return input;
}

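// Evenly spaced values from lower (inclusive) up to, but not including, upper.
range = (lower, upper, step) => {
  // Round so that floating-point error never produces a non-integer array length.
  const size = Math.round((upper - lower) / step);

  return Array.from({length: size}, (_, index) => lower + step * index);
};


// Normal density with mean mu and standard deviation sigma,
// evaluated at every value of the array x.
normal_pdf = (x, mu, sigma) => {
  return x.map(value => {
    return Math.exp(-0.5 * ( (value - mu) **2 ) / (sigma**2))/(Math.sqrt(2 * Math.PI * sigma ** 2 ))
  });
};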

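// This cell is evaluated only for its side effect: it redraws the plot
// whenever mu or sigma changes. Its value is never used.
garbage_variable = {
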
  d3.select("#normal-density")
    .selectAll("*")
    .remove()

  var margin = {top: 40, right: 30, bottom: 50, left: 100},
      width = 530 - margin.left - margin.right,
      height = 400 - margin.top - margin.bottom;

  // Append the svg object to the #normal-density container
  var svg = d3.select("#normal-density")
    .append("svg")
      .attr("width", width + margin.left + margin.right)
      .attr("height", height + margin.top + margin.bottom)
    .append("g")
      .attr("transform",
            "translate(" + margin.left + "," + margin.top + ")");
  
  // Evaluate both curves on one shared grid: the N(0, 1) reference
  // and the N(mu, sigma^2) curve chosen with the sliders.
  const data_x = range(-10, 10, 0.01);
  const data_y_std = normal_pdf(data_x, 0, 1);
  const data_y = normal_pdf(data_x, mu, sigma);
  const data = data_x.map((value, index) => {
    return {'x': value, 'y_std': data_y_std[index], 'y': data_y[index]};
  });

    // Add X axis (linear scale spanning the x grid)
    var x = d3.scaleLinear()
      .domain([d3.min(data, function(d) { return +d.x; }), d3.max(data, function(d) { return +d.x; })])
      .range([ 0, width ]);
    svg.append("g")
      .attr("transform", "translate(0," + height + ")")
      .call(d3.axisBottom(x));

    // Add Y axis
    var y = d3.scaleLinear()
      .domain([0, d3.max(data, function(d) { return +Math.max(d.y, d.y_std); })])
      .range([ height, 0 ]);
    svg.append("g")
      .call(d3.axisLeft(y));

    // Add the standard Normal curve, N(0, 1), in blue
    svg.append("path")
      .datum(data)
      .attr("fill", "none")
      .attr("stroke", "steelblue")
      .attr("stroke-width", 1.5)
      .attr("d", d3.line()
        .x(function(d) { return x(d.x) })
        .y(function(d) { return y(d.y_std) })
        );

    // Add the N(mu, sigma^2) curve selected with the sliders, in red
    svg.append("path")
      .datum(data)
      .attr("fill", "none")
      .attr("stroke", "red")
      .attr("stroke-width", 1.5)
      .attr("d", d3.line()
        .x(function(d) { return x(d.x) })
        .y(function(d) { return y(d.y) })
        );

  svg.selectAll('text')    
     .style('font-size', '14px');

  svg.append("text")
      .attr("class", "y label")
      .attr("text-anchor", "middle")
      .attr("y", x(-14))
      .attr("x", -y(0)/2)
      .attr("dy", ".75em")
      .attr("transform", "rotate(-90)")
      .style('font-size', '24px')
      .text("Density");

  svg.append("text")
    .attr("class", "x label")
    .attr("text-anchor", "middle")
    .attr("x", x(0))
    .attr("y", height + 45)
    .style('font-size', '24px')
    .text("x");

   svg.append("text")
        .attr("x", width / 2)
        .attr("y", 0)
        .attr("text-anchor", "middle")
        .text("Normal Curve")
        .attr("dy", "-15px")
        .style('font-size', '32px')
        .attr("class", "plot-title");

  // Legend: a colored swatch and a label for each curve

  svg.append("rect")
    .attr("x", x(6))
    .attr("y", 40)
    .attr("width", 20)
    .attr("height", 2)
    .style("fill", "steelblue")
  svg.append("text")
    .attr("x", x(7.5))
    .attr("y", 45)
    .text("N(0, 1)")
    .style("font-size", "15px")
    .attr("alignment-baseline","middle")

  svg.append("rect")
    .attr("x", x(6))
    .attr("y", 60)
    .attr("width", 20)
    .attr("height", 2)
    .style("fill", "red")
  svg.append("text")
    .attr("x", x(7.5))
    .attr("y", 65)
    .text('N(' + mu + ', '+ (sigma**2).toPrecision(2) +')')
    .style("font-size", "15px")
    .attr("alignment-baseline","middle")


};

Properties:
- Bell-shaped and unimodal;
- Fully specified by two parameters, \(\mu\) and \(\sigma\):
  - \(\mu\) determines the location;
  - \(\sigma\) determines the spread;
- Symmetric about the mean \(\mu\).
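
The curve drawn in the interactive plot is the Normal density,

\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\]

which is the expression the `normal_pdf` helper evaluates. As a minimal sketch, the properties listed here can be checked numerically in R. The helper name `normal_density` and the values of `mu` and `sigma` are illustrative choices, not part of the course material.

```r
# Hand-written Normal density (hypothetical helper, for illustration only)
normal_density <- function(x, mu, sigma) {
  exp(-0.5 * (x - mu)^2 / sigma^2) / sqrt(2 * pi * sigma^2)
}

mu <- 1.5     # illustrative location
sigma <- 2    # illustrative spread
x <- seq(-5, 8, by = 0.5)

# The hand-written formula agrees with R's built-in dnorm
all.equal(normal_density(x, mu, sigma), dnorm(x, mean = mu, sd = sigma))

# Symmetry about mu: points equally far from mu have the same density
all.equal(dnorm(mu - 1.2, mean = mu, sd = sigma),
          dnorm(mu + 1.2, mean = mu, sd = sigma))

# A larger sigma gives a lower (and therefore wider) curve
dnorm(mu, mean = mu, sd = sigma) > dnorm(mu, mean = mu, sd = 2 * sigma)
```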

Areas under the Normal Model

The area under the Normal curve over a region tells us the probability that the corresponding variable falls in that region.
We need a computer to obtain areas under the Normal curve, because the integral has no closed-form solution.
However, there is a rule that helps us do a quick check of our calculations.
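
To make "area under the curve" concrete, the sketch below computes one area twice: by numerically integrating the density with `integrate`, and with R's `pnorm` function, which returns the area to the left of a value. The parameters and the interval from 8 to 12 are illustrative assumptions.

```r
mu <- 10
sigma <- sqrt(3)   # illustrative values

# Area under the N(mu, sigma^2) density between 8 and 12, by numerical integration
area_numeric <- integrate(dnorm, lower = 8, upper = 12, mean = mu, sd = sigma)$value

# The same area as a difference of two left-tail areas
area_pnorm <- pnorm(12, mean = mu, sd = sigma) - pnorm(8, mean = mu, sd = sigma)

# The two approaches agree up to numerical error
c(numeric = area_numeric, pnorm = area_pnorm)
```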

The 68-95-99.7% Rule

Whatever the values of \(\mu\) and \(\sigma\), the following rule holds:

| Interval | % of data within the interval |
|---|---|
| within \(1\sigma\) of \(\mu\) | about \(68\%\) |
| within \(2\sigma\) of \(\mu\) | about \(95\%\) |
| within \(3\sigma\) of \(\mu\) | about \(99.7\%\) |

This is a useful approximation for sanity checks!
- For exact answers, use R (or a table if you don’t have access to R).
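
The rule can be checked directly in R. The sketch below computes the area within 1, 2, and 3 standard deviations of the mean, first for the standard Normal and then for an arbitrary choice of \(\mu\) and \(\sigma\) (the values 110 and 25 are only an illustration).

```r
k <- 1:3

# Area within k standard deviations of the mean for the standard Normal
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)
# [1] 0.6827 0.9545 0.9973

# The same areas for an illustrative choice of mu and sigma
mu <- 110
sigma <- 25
within_k_general <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

# The areas do not depend on mu and sigma
all.equal(within_k, within_k_general)
```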

R’s `pnorm` and `qnorm` functions

Probability:

To obtain the area under the curve to the left of a value, we use the `pnorm` function.
For example, suppose a variable follows \(N( \mu = 10, \sigma^2 = 3)\) and we want the area below 11.5.

We can use the following code (note that `sd` takes the standard deviation, so we pass `sqrt(3)`):

pnorm(11.5, mean = 10, sd = sqrt(3))

[1] 0.8067619
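
Since `pnorm` returns the area to the left of a value, areas above a value or between two values follow by subtraction. A minimal sketch for the same \(N(\mu = 10, \sigma^2 = 3)\); the lower endpoint 9 is an arbitrary illustration.

```r
mu <- 10
sigma <- sqrt(3)   # pnorm expects the standard deviation, not the variance

# Area below 11.5 (the same call as in the example)
below <- pnorm(11.5, mean = mu, sd = sigma)

# Area above 11.5: ask for the upper tail directly ...
above <- pnorm(11.5, mean = mu, sd = sigma, lower.tail = FALSE)

# ... which is the complement of the area below
all.equal(above, 1 - below)

# Area between 9 and 11.5: difference of two left-tail areas
between <- pnorm(11.5, mean = mu, sd = sigma) - pnorm(9, mean = mu, sd = sigma)
between
```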

Quantile:

To obtain a quantile of a Normal distribution, we use the `qnorm` function.
For example, suppose a variable follows \(N( \mu = 10, \sigma^2 = 3)\) and we want the 0.69-quantile, i.e., the value with 69% of the area below it.

We can use the following code:

qnorm(0.69, mean = 10, sd = sqrt(3))

[1] 10.85884
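
`qnorm` is the inverse of `pnorm`: one maps a value to the area below it, the other maps an area back to the value. The sketch below illustrates this round trip and shows two equivalent ways to ask for a value with a given area above it (the 5% figure is just an example).

```r
mu <- 10
sigma <- sqrt(3)

# qnorm returns the value with the requested area to its left ...
q <- qnorm(0.69, mean = mu, sd = sigma)

# ... so feeding it back into pnorm recovers the probability
pnorm(q, mean = mu, sd = sigma)
# [1] 0.69

# Value with 5% of the area ABOVE it: both calls give the same answer
upper_a <- qnorm(0.95, mean = mu, sd = sigma)
upper_b <- qnorm(0.05, mean = mu, sd = sigma, lower.tail = FALSE)
all.equal(upper_a, upper_b)
```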

Standard Normal

The \(Z\)-score of a variable coming from \(N(\mu, \sigma^2)\) follows the Standard Normal distribution, i.e., \(N(0, 1)\).
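
Recall that the \(Z\)-score of a value \(x\) is \(z = (x - \mu)/\sigma\). As a small sketch, standardizing first and then using \(N(0, 1)\) gives the same area and the same quantile as working with \(N(\mu, \sigma^2)\) directly. The numbers reuse the earlier \(N(\mu = 10, \sigma^2 = 3)\) example.

```r
mu <- 10
sigma <- sqrt(3)
x <- 11.5

# Standardize: how many standard deviations x lies above the mean
z <- (x - mu) / sigma

# Area below x under N(mu, sigma^2) equals the area below z under N(0, 1)
all.equal(pnorm(x, mean = mu, sd = sigma), pnorm(z))

# Quantiles transform back the same way: x_p = mu + sigma * z_p
all.equal(qnorm(0.69, mean = mu, sd = sigma), mu + sigma * qnorm(0.69))
```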
There are multiple ways to check the adequacy of the Normal model. A simple (and subjective) way is to check whether the relative frequency histogram looks like a Normal curve.
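
A rough sketch of this visual check in R is given below. It uses simulated measurements as a stand-in for real data (an assumption of this example; the sample size, mean, and standard deviation are arbitrary), draws the histogram on the density scale, and overlays the Normal curve with the sample mean and standard deviation.

```r
set.seed(200)  # arbitrary seed, for reproducibility

# Simulated measurements standing in for real data (assumption of this sketch)
obs <- rnorm(500, mean = 45, sd = 4)

# Estimate the two parameters of the Normal model from the sample
m <- mean(obs)
s <- sd(obs)

# Histogram on the density scale, so bar areas are relative frequencies
hist(obs, freq = FALSE, breaks = 20,
     main = "Histogram with fitted Normal curve", xlab = "Measurement")

# Overlay the Normal curve N(m, s^2); here x is the plotting grid used by curve()
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)
```

If the bars follow the red curve closely, the Normal model is plausible; clear skewness or heavy tails suggest otherwise.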

Example 1: Housefly Wing Lengths

Sokal and Hunter (1955) studied the wing lengths of houseflies.

Example 2: Birthweight

In this case, we have a heavier left tail, which might compromise the Normal approximation.

Exercise 1

Scores on a standard IQ test for the 20 to 34 age group follow approximately the Normal model with mean \(\mu=110\) and standard deviation \(\sigma=25\).

What percentage of people aged 20 to 34 have IQ scores below 160?
What percentage have scores between 90 and 120?
What IQ score is exceeded by only 0.15% of the group?
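
After working the exercise with the 68-95-99.7% rule (note that 160 is \(\mu + 2\sigma\)), one possible way to check the answers in R is sketched below. The variable names are illustrative; the exact values differ slightly from the rule-based approximations.

```r
mu <- 110
sigma <- 25

# First question: proportion with IQ below 160
p_below_160 <- pnorm(160, mean = mu, sd = sigma)

# Second question: proportion with IQ between 90 and 120
p_between <- pnorm(120, mean = mu, sd = sigma) - pnorm(90, mean = mu, sd = sigma)

# Third question: IQ with only 0.15% of the group above it
iq_cutoff <- qnorm(0.0015, mean = mu, sd = sigma, lower.tail = FALSE)

c(below_160 = p_below_160, between_90_120 = p_between, cutoff = iq_cutoff)
```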

Exercise 2

A machine used to regulate the amount of dye dispensed for mixing shades of paint can be set so that it discharges an average of \(\mu\) milliliters of dye per can of paint. The amount of dye discharged is known to follow the Normal model with a standard deviation of 0.4 milliliters. If more than 6 milliliters of dye are discharged when making a certain shade of blue paint, the shade is unacceptable. Determine the setting for the mean \(\mu\) such that only 2% of the cans of paint will be unacceptable.
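
One way to set this up: we need \(P(X > 6) = 0.02\), so the \(Z\)-score of 6 must equal the value \(z\) with 2% of the standard Normal above it, giving \((6 - \mu)/0.4 = z\) and hence \(\mu = 6 - 0.4z\). A minimal sketch of the computation in R, with illustrative variable names:

```r
sigma <- 0.4
limit <- 6               # more than 6 mL makes the shade unacceptable
p_unacceptable <- 0.02

# z-value with 2% of the standard Normal above it
z <- qnorm(p_unacceptable, lower.tail = FALSE)

# Solve (limit - mu) / sigma = z for mu
mu <- limit - z * sigma
mu

# Check: with this setting, the area above the limit is 2%
pnorm(limit, mean = mu, sd = sigma, lower.tail = FALSE)
```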

References

Image Attributions

Fly Image Attribution: See page for author, CC BY 4.0, via Wikimedia Commons.

Data Attributions

The housefly data were obtained from Seattle Central.
The Birthweight of American Babies data were made available by the National Bureau of Economic Research.

Other references

Sokal, Robert R., and Preston E. Hunter. 1955. “A Morphometric Analysis of DDT-Resistant and Non-Resistant House Fly Strains.” Annals of the Entomological Society of America 48 (6): 499–507. https://doi.org/10.1093/aesa/48.6.499.
