[Interactive plot "Normal Curve": density of the standard Normal \(N(0, 1)\) shown alongside \(N(\mu, \sigma^2)\); x-axis: x, y-axis: Density.]
Properties:
Bell-shaped and Unimodal;
Fully specified by two parameters, \(\mu\) and \(\sigma\):
\(\mu\) determines the location;
\(\sigma\) determines the spread;
Symmetric about the mean \(\mu\);
Areas under the Normal Model
The area under the Normal model tells us the probability that the corresponding variable is in a specified region.
We need to use computers to obtain the area under the normal model (there’s no analytical solution).
But, there’s a rule that can help us do a quick check of our calculations.
The 68-95-99.7% Rule
No matter the values of \(\mu\) and \(\sigma\), we have the following rule:
Interval and approximate % of data within it:
within \(1\sigma\) of \(\mu\): about \(68\%\)
within \(2\sigma\) of \(\mu\): about \(95\%\)
within \(3\sigma\) of \(\mu\): about \(99.7\%\)
This is a useful approximation for a quick sanity check!
For exact calculations, use R.
R’s pnorm and qnorm functions
Probability:
To obtain the area under the curve, we use the pnorm function.
For example, suppose we have a \(N( \mu = 10, \sigma^2 = 3)\) and want the area below 11.5:
We can use the following code
pnorm(11.5, mean = 10, sd = sqrt(3))
[1] 0.8067619
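A couple of variations on the same idea, as a sketch using the same \(N(\mu = 10, \sigma^2 = 3)\) model as above:
# area above 11.5: complement of the area below
1 - pnorm(11.5, mean = 10, sd = sqrt(3))
# area between 8 and 11.5: difference of two lower-tail areas
pnorm(11.5, mean = 10, sd = sqrt(3)) - pnorm(8, mean = 10, sd = sqrt(3))
# quick check of the 68% part of the 68-95-99.7% rule (within 1 sd of the mean)
pnorm(10 + sqrt(3), mean = 10, sd = sqrt(3)) - pnorm(10 - sqrt(3), mean = 10, sd = sqrt(3))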
Quantile:
To obtain the quantile of a Normal, we use the qnorm function.
For example, suppose we have a \(N( \mu = 10, \sigma^2 = 3)\) and want the 0.69-quantile:
We can use the following code
qnorm(0.69, mean = 10, sd = sqrt(3))
[1] 10.85884
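Since qnorm inverts pnorm, we can sanity-check the result by plugging the quantile back in:
pnorm(10.85884, mean = 10, sd = sqrt(3))   # recovers (approximately) 0.69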
Standard Normal
The Normal distribution with \(\mu=0\) and \(\sigma^2=1\) is called the Standard Normal distribution, i.e., \(N(0, 1)\).
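In R, pnorm and qnorm default to the standard Normal (mean = 0, sd = 1), so a quick sketch looks like this:
pnorm(1.96)     # area below 1.96 under N(0, 1), about 0.975
qnorm(0.975)    # 0.975-quantile of N(0, 1), about 1.96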
There are multiple ways to check for adequacy of the Normal model. A simple (and subjective) way is to check if the relative frequency histogram looks like a Normal curve.
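A minimal R sketch of this visual check, using simulated data as a stand-in for a real sample:
dat <- rnorm(200, mean = 10, sd = 2)    # hypothetical data; replace with your sample
hist(dat, freq = FALSE, main = "Does the histogram look Normal?")
curve(dnorm(x, mean = mean(dat), sd = sd(dat)), add = TRUE, col = "red")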
Example 1: Housefly Wing Lengths
Sokal and Hunter (1955) studied the wing lengths of houseflies.
Example 2: Birthweight
In this case, we have a heavier left tail, which might compromise the Normal approximation.
The Central Limit Theorem (CLT)
Central Limit Theorem (CLT)
The Central Limit Theorem helps us to approximate the sampling distribution of certain statistics.
In loose terms, the CLT states that, no matter what the population distribution is, the sampling distribution of certain statistics, such as the sample mean and the sample proportion, is approximately Normal for large sample sizes.
CLT for the Sample Mean
For large sample sizes, the sampling distribution of the sample mean is approximately \[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] regardless of the population distribution.
Note the mean of the sampling distribution is the population mean \(\mu\);
The Std. Error is given by: \[SE(\bar{X})=\frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the std. dev. of the population.
Warning
If the population distribution is Normal, then \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) is an exact result for any sample size. We don’t need CLT in this case.
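A minimal simulation sketch of the CLT for the sample mean, using an Exponential(1) population (so \(\mu = 1\) and \(\sigma = 1\)) and \(n = 50\); these values are chosen only for illustration:
set.seed(1)
n <- 50
xbar <- replicate(10000, mean(rexp(n, rate = 1)))
mean(xbar)   # close to the population mean, 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(50)
hist(xbar, freq = FALSE, main = "Sampling distribution of the sample mean")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, col = "red")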
CLT for the sample proportion
For large sample sizes, we can approximate the sampling distribution of \(\hat{p}\) by \[\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)\]
Note the mean of the sampling distribution is the population proportion \(p\);
The Std. Error is given by: \[SE(\hat{p})=\sqrt{\frac{p(1-p)}{n}}\]
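A matching sketch for the sample proportion, with \(p = 0.3\) and \(n = 100\) chosen only for illustration:
set.seed(1)
n <- 100
p <- 0.3
phat <- replicate(10000, rbinom(1, size = n, prob = p) / n)
mean(phat)   # close to the population proportion p
sd(phat)     # close to sqrt(p * (1 - p) / n)
hist(phat, freq = FALSE, main = "Sampling distribution of the sample proportion")
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)), add = TRUE, col = "red")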
Sample size effect on the sampling distribution (Normal)
Sample size effect on the sampling distribution (Not-Normal)
Assumptions & conditions
Sample is randomly drawn from the population
Sample values are independent
Generally, if your sample size is greater than 10% of the population size, there will be a violation of independence.
Sample size must be large enough.
For means:
no universal guideline for how big \(n\) should be
but usually samples of size \(n > 30\) are big enough to get a reasonable approximation (not guaranteed!)
For proportions:
check \(n\times p \ge 10\) and \(n\times(1-p) \ge 10\)
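For instance, checking this condition in R with the hypothetical values \(n = 100\) and \(p = 0.3\):
n <- 100
p <- 0.3
(n * p >= 10) & (n * (1 - p) >= 10)   # TRUE: both parts of the condition hold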
Standard Error (SE)
Standard error: standard deviation of point estimates.
Mean:\[SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\] where \(\sigma\) is the standard deviation of population.
Proportion:\[SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\] where \(p\) is the population proportion.
Reality: we don’t know the population values; instead, we use sample estimates.
Mean:\[\widehat{SE}(\bar{X}) = \frac{s}{\sqrt{n}}\] where \(s\) is the sample standard deviation.
Proportion:\[\widehat{SE}(\hat{p}) = \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}}\] where \(\hat{p}\) is the sample proportion.
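A small sketch of these estimated standard errors in R, using hypothetical sample values:
x <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5)   # hypothetical numeric sample
sd(x) / sqrt(length(x))                # estimated SE of the sample mean
phat <- 0.42                           # hypothetical sample proportion
n <- 200                               # hypothetical sample size
sqrt(phat * (1 - phat) / n)            # estimated SE of the sample proportion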
The law of large numbers (LLN)
The law of large numbers (LLN)
The law of large numbers states that as the sample size increases, the sample mean converges to the population mean.
In other words, with a sufficiently large number of observations, the sample mean will be close to the population mean (guaranteed!).
But again, what is large?
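A minimal R sketch of the law of large numbers, tracking the running sample mean for an Exponential(1) population (population mean 1); the values are chosen only for illustration:
set.seed(1)
x <- rexp(10000, rate = 1)
running_mean <- cumsum(x) / seq_along(x)
plot(running_mean, type = "l", xlab = "Number of observations", ylab = "Running sample mean")
abline(h = 1, col = "red")   # population mean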
The law of large numbers
The law of large numbers is actually intuitive given what we have seen so far.
The law of large numbers
To Take Home
Take home: CLT
CLT only works for certain statistics (e.g., sample mean, sample proportion);
As the sample size increases, the sampling distributions of the sample mean and the sample proportion become narrower, more symmetric, and more bell-shaped.
Take home: Std. Errors
The standard errors:
\(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\)
\(SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\)
These formulae do not depend on the CLT. They are valid for all sample sizes.
Today’s worksheet
Investigate the law of large numbers and the central limit theorem
See that the sampling distributions for the sample mean/proportion can be well approximated by the Normal distribution when the sample size is large, regardless of the distribution of the population
References
Sokal, Robert R., and Preston E. Hunter. 1955. “A Morphometric Analysis of DDT-Resistant and Non-Resistant House Fly Strains.” Annals of the Entomological Society of America 48 (6): 499–507. https://doi.org/10.1093/aesa/48.6.499.