Module 5: Conf. Intervals Based on Distributional Assumptions

Learning objectives

Explain the role of the Central Limit Theorem in constructing confidence intervals.
Describe the \(t\)-distribution family and its relationship with the Normal distribution.
Write a computer script to calculate confidence intervals based on distributional assumptions.
Calculate z-scores.
Discuss the potential limitations of these methods.
Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty.

Review: bootstrap vs CLT approximation of the sampling distribution

Today’s goal

Our goal: calculate confidence intervals using distributional assumptions

Using the bootstrap to calculate a confidence interval

Technical: 95% of samples of size \(n\) will produce a 95% CI that contains the true population parameter value
Simpler: we are 95% confident that the true population parameter value lies in our interval

Assumption of normality and central limit theorem

Quantiles for the conf. interval

Previously, we took quantiles from the bootstrap distribution.
This time, we take quantiles from the sampling distribution as approximated by the CLT.

Quantiles for the conf. interval

2.5\(^{th}\) percentile: \(P(X \leq x) = 0.025\)
- qnorm(0.025, mu, sigma)
97.5\(^{th}\) percentile: \(P(X \leq x) = 0.975\)
- qnorm(0.975, mu, sigma)
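
The two calls above can be tried out as a minimal sketch. The values of `mu` and `sigma` below are made up for illustration; in practice they would be the centre and standard deviation of the CLT-approximated sampling distribution.

```r
# Hypothetical centre and spread (illustrative values, not from the slides)
mu <- 10
sigma <- 2

lower <- qnorm(0.025, mean = mu, sd = sigma)  # 2.5th percentile
upper <- qnorm(0.975, mean = mu, sd = sigma)  # 97.5th percentile

# The two quantiles sit symmetrically around mu
c(lower = lower, upper = upper)

# pnorm() inverts qnorm(): the area between the two quantiles is 0.95
pnorm(upper, mean = mu, sd = sigma) - pnorm(lower, mean = mu, sd = sigma)
```

Because the middle 95% of the distribution lies between these two quantiles, they serve as the boundaries of a 95% interval.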

Standard Normal - z scores

Standard normal distribution
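
A z-score re-expresses a value as the number of standard deviations it lies from the mean, \(z = (x - \mu)/\sigma\), so that probabilities can be read from the standard Normal \(N(0, 1)\). A minimal sketch, assuming made-up values for `mu`, `sigma`, and `x`:

```r
# Illustrative values only
mu <- 10
sigma <- 2
x <- 13

z <- (x - mu) / sigma          # z-score of x: 1.5 standard deviations above mu

# The same probability on the original scale and on the z scale
p_original <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)
all.equal(p_original, p_standard)

# Going the other way: a standard Normal quantile mapped back to the x scale
x_975 <- mu + qnorm(0.975) * sigma
all.equal(x_975, qnorm(0.975, mean = mu, sd = sigma))
```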

Example: CI for Proportion via CLT

To estimate the proportion of UBC students who own cars, a random sample of 200 students was taken, and it was found that 74 of them owned cars.

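CLT? Check conditions:
- The sample is random, and UBC has more than 2000 students, so \(n = 200\) is less than 10% of the population (independence holds);
- \(n\widehat{p} = 74 \geq 10\) and \(n(1-\widehat{p}) = 126 \geq 10\);
- ✓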

Example: CI for Proportion via CLT

The CLT says that, for large samples, \(\widehat{p}\) is approximately \(N\left(p, \frac{p(1-p)}{n}\right)\), where the second argument is the variance;

Since we don’t know \(p\), we approximate with \(\widehat{p}\): \[N\left(\widehat{p}, \frac{\widehat{p}(1-\widehat{p})}{n}\right)\equiv N\left(0.37, \frac{0.37(1-0.37)}{200}\right)\]

\(95\%\) CI boundaries:
- ci_lower <- qnorm(0.025, phat, sqrt(phat*(1-phat)/n))
- ci_upper <- qnorm(0.975, phat, sqrt(phat*(1-phat)/n))
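
Putting the pieces of this example together, here is a minimal sketch using the counts from the slide (74 car owners out of 200 students). The variable names `x` and `se_hat` are introduced here for illustration.

```r
# Data from the example: 74 of 200 sampled students own a car
x <- 74
n <- 200
phat <- x / n                          # sample proportion, 0.37

# Success/failure condition (x = n * phat, n - x = n * (1 - phat))
stopifnot(x >= 10, n - x >= 10)

se_hat <- sqrt(phat * (1 - phat) / n)  # estimated standard error

# Quantiles of N(phat, se_hat^2); note that qnorm() takes the standard
# deviation, not the variance
ci_lower <- qnorm(0.025, mean = phat, sd = se_hat)
ci_upper <- qnorm(0.975, mean = phat, sd = se_hat)

round(c(lower = ci_lower, upper = ci_upper), 3)
```

Under these assumptions the interval comes out at roughly 0.30 to 0.44.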

The usual format: proportion

CI for proportion is given by:

\[ \text{CI}\left(p, 1-\alpha\right) = \widehat{p}\pm z^*_{(1+C)/2}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]

\(C = 1-\alpha\) is the confidence level, so \((1+C)/2 = 1-\alpha/2\);
- e.g., for a \(95\%\) CI we use \(\alpha = 0.05\);
\(z^*_{(1+C)/2}\) is the \((1+C)/2\) quantile of the standard Normal distribution;
- e.g., for a \(95\%\) CI, \(z^*_{1-0.05/2}=z^*_{0.975}=\)qnorm(0.975)
\(\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\) is the estimated std. error;
\(z^*_{(1+C)/2}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\) is called the margin of error;
- it is half the width of the confidence interval;

Example: Proportion

Let’s compute the \(95\%\) confidence interval for the proportion of students who own a car using the data stored in sample_students.

The usual format: General

Usually, we write the CI for a parameter \(\theta\) as:

\[ \text{CI}\left(\theta, 1-\alpha\right) = \widehat{\theta}\pm q^*_{(1+C)/2}\widehat{SE}(\widehat{\theta}) \]

\(\theta\) is a generic parameter (e.g., proportion, mean, difference in prop, difference in means);
\(q^*_{(1+C)/2}\) is the \((1+C)/2\) quantile of the standardized sampling distribution of \(\widehat{\theta}\), where \(C = 1-\alpha\) is the confidence level;
- e.g., for the proportion this distribution was the standard Normal (usually denoted by \(Z\))
\(\widehat{SE}(\widehat{\theta})\) is the estimated std. error;
- e.g., for the proportion \(\widehat{SE}(\widehat{p})=\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\);
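
The general format can be sketched as one small helper. The function `ci_generic()` and its argument names are hypothetical, written for this illustration rather than taken from any package; the quantile function is passed in so the same code works with the Normal or the \(t\)-distribution.

```r
# estimate +/- quantile * standard error
# `ci_generic` is an illustrative helper, not a library function.
ci_generic <- function(estimate, se, C = 0.95, quantile_fun = qnorm, ...) {
  crit <- quantile_fun((1 + C) / 2, ...)  # q*_{(1+C)/2}
  moe <- crit * se                        # margin of error
  c(lower = estimate - moe, upper = estimate + moe)
}

# Proportion: standard Normal quantile (values from the car-ownership example)
phat <- 0.37
n <- 200
ci_prop <- ci_generic(phat, se = sqrt(phat * (1 - phat) / n), C = 0.95)
ci_prop

# Same estimate and standard error (made-up values), two different quantiles:
# the t quantile with 24 degrees of freedom gives a wider interval
ci_z <- ci_generic(estimate = 50, se = 2, C = 0.95)
ci_t <- ci_generic(estimate = 50, se = 2, C = 0.95, quantile_fun = qt, df = 24)
rbind(normal = ci_z, t_24 = ci_t)
```

Passing the quantile function as an argument keeps the "estimate, quantile, standard error" structure visible while leaving the choice of distribution to the caller.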

\(t\)-distribution

bell-shaped, unimodal, and symmetric about 0
has thicker tails and is more spread out than the standard Normal distribution
probabilities depend on the degrees of freedom, \(\text{df} = n-1\) (denoted as \(t_{\text{df}}\));
As \(\text{df} \rightarrow \infty\), \(t_{\text{df}} \rightarrow N (0, 1)\)
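
The convergence to the standard Normal can be seen numerically. A minimal sketch comparing the 97.5th percentile of \(t_{\text{df}}\) with that of \(N(0, 1)\); the particular degrees of freedom are chosen only for illustration.

```r
dfs <- c(2, 5, 10, 30, 100, 1000)   # illustrative degrees of freedom

t_q <- qt(0.975, df = dfs)          # 97.5th percentile of each t-distribution
z_q <- qnorm(0.975)                 # 97.5th percentile of N(0, 1)

# t quantiles are larger (thicker tails) and shrink toward z_q as df grows
data.frame(df = dfs,
           t_quantile = round(t_q, 3),
           z_quantile = round(z_q, 3))
```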

Code for quantiles e.g., \(P(X \le x) = 0.25\)

Normal

qnorm(0.25, mu, sigma)

\(t\)-distribution; df = degrees of freedom


qt(0.25, df)

image source: Visual guide to pnorm, dnorm, qnorm, and rnorm functions in R

CI: One mean

\[ \text{CI}\left(\mu, 1-\alpha\right) = \bar{X}\pm t^*_{n-1, (1+C)/2}\frac{S}{\sqrt{n}} \]

\(t^*_{n-1, (1+C)/2}\) is the \((1+C)/2\) quantile of a \(t\)-distribution with \(n-1\) degrees of freedom, where \(C = 1-\alpha\) is the confidence level;
- You calculate this in R using qt((1+C)/2, n - 1).
\(\frac{S}{\sqrt{n}}\) is the estimated std. error (\(S\) is the sample std. dev.);
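
A minimal sketch of this formula in R. The data are simulated here (the vector `body_mass` is a made-up stand-in, not the penguins data), and the result is cross-checked against `t.test()`, which computes the same one-sample interval.

```r
set.seed(1)
# Simulated measurements, for illustration only
body_mass <- rnorm(40, mean = 3700, sd = 450)

C <- 0.99                              # confidence level
n <- length(body_mass)
xbar <- mean(body_mass)                # sample mean
s <- sd(body_mass)                     # sample standard deviation

t_star <- qt((1 + C) / 2, df = n - 1)  # t quantile with n - 1 df
ci <- xbar + c(-1, 1) * t_star * s / sqrt(n)
ci

# Cross-check with the built-in one-sample t interval
ci_check <- t.test(body_mass, conf.level = C)$conf.int
all.equal(ci, as.numeric(ci_check))
```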

Example: Mean

Let’s calculate the \(99\%\) confidence interval for the mean body weight of Adelie penguins using the data stored in penguins_clean.

CI: difference in proportions

Let \(\color{red}{x_1, x_2, ..., x_{n_1}}\) be a random sample from a population with proportion \(\color{red}{p_1}\).
Let \(\color{blue}{y_1, y_2, ..., y_{n_2}}\) be a random sample from a population with proportion \(\color{blue}{p_2}\).
Assuming CLT conditions are satisfied, CI can be estimated:

\[\color{red}{\widehat{p}_1} - \color{blue}{\widehat{p}_2} \pm z^*_{(1+C)/2} \times \sqrt{ \color{red}{\frac{\widehat{p}_1(1-\widehat{p}_1)}{n_1}} + \color{blue}{\frac{\widehat{p}_2(1-\widehat{p}_2)}{n_2}}}\]
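
A minimal sketch of this interval, assuming made-up counts (`x1`, `n1`, `x2`, `n2` are hypothetical and not taken from the penguins data). As a cross-check, `prop.test()` with the continuity correction switched off should report the same unpooled interval.

```r
# Hypothetical counts, for illustration only
x1 <- 45; n1 <- 150   # group 1: number of successes, sample size
x2 <- 20; n2 <- 68    # group 2: number of successes, sample size
C <- 0.90             # confidence level

p1_hat <- x1 / n1
p2_hat <- x2 / n2

# Success/failure condition in both groups
stopifnot(min(x1, n1 - x1, x2, n2 - x2) >= 10)

# Standard error of the difference: the two variances add
se_diff <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z_star <- qnorm((1 + C) / 2)

diff_hat <- p1_hat - p2_hat
ci <- diff_hat + c(-1, 1) * z_star * se_diff
ci

# Cross-check against prop.test() without continuity correction
ci_check <- prop.test(c(x1, x2), c(n1, n2),
                      conf.level = C, correct = FALSE)$conf.int
all.equal(ci, as.numeric(ci_check))
```

If the interval contains 0, the data are compatible with the two proportions being equal at that confidence level.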

CI: Example difference in proportions

Is the proportion of Adelie penguins weighing more than 4000 g the same as the proportion of Chinstrap penguins weighing more than 4000 g?
Let’s calculate the 90% CI for the difference in proportions (Adelie - Chinstrap)!

CI: difference in means for independent groups

Let \(\color{red}{x_1, x_2, ..., x_{n_1}}\) be a random sample from a population with mean \(\color{red}{\mu_1}\) and standard deviation \(\color{red}{\sigma_1}\)
Let \(\color{blue}{y_1, y_2, ..., y_{n_2}}\) be a random sample from a population with mean \(\color{blue}{\mu_2}\) and standard deviation \(\color{blue}{\sigma_2}\)
Assuming CLT conditions are satisfied, CI can be estimated:

\[\color{red}{\bar{X}} - \color{blue}{\bar{Y}} \pm t^*_{k, (1+C)/2} \sqrt{\color{red}{\frac{S_X^2}{n_1}} + \color{blue}{\frac{S_Y^2}{n_2}}}\]

where the degrees of freedom \(k\) is given by:

\[ k = \frac{ \left(\color{red}{\frac{S^2_X}{n_1}} + \color{blue}{\frac{S^2_Y}{n_2}}\right)^2 }{ \color{red}{\frac{S_X^4}{n_1^2(n_1-1)}} + \color{blue}{\frac{S_Y^4}{n_2^2(n_2-1)}} } \]
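
A minimal sketch of this interval and of the degrees of freedom \(k\). Both samples are simulated (the vectors `x` and `y` are made-up stand-ins for the two groups), and the result is cross-checked against `t.test()`, whose default for two samples is this unequal-variance (Welch) interval.

```r
set.seed(2)
# Simulated independent samples, for illustration only
x <- rnorm(50, mean = 3700, sd = 460)
y <- rnorm(30, mean = 3730, sd = 380)
C <- 0.95

n1 <- length(x)
n2 <- length(y)
v1 <- var(x) / n1   # S_X^2 / n_1
v2 <- var(y) / n2   # S_Y^2 / n_2

# Degrees of freedom k; note S_X^4 / (n_1^2 (n_1 - 1)) = v1^2 / (n1 - 1)
k <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

t_star <- qt((1 + C) / 2, df = k)
ci <- (mean(x) - mean(y)) + c(-1, 1) * t_star * sqrt(v1 + v2)
ci

# Cross-check: t.test() uses var.equal = FALSE by default
welch <- t.test(x, y, conf.level = C)
all.equal(ci, as.numeric(welch$conf.int))
all.equal(k, unname(welch$parameter))
```

The value of \(k\) is generally not an integer, which `qt()` accepts.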

CI: Example difference in means

Is the mean body_mass_g of Adelie penguins the same as that of Chinstrap penguins?
Let’s calculate the 95% CI for the difference in means (Adelie - Chinstrap)!
- note that our samples are independent!

Take home

Can calculate CIs for population parameters as:

statistic \(\pm\) margin of error
- Margin of Error: extent of the interval on each side of the point estimate
statistic \(\pm\) critical value \(\times\) standard error
- critical value: value such that the upper tail area under the distribution \(= \frac{1 - C}{2}\), where \(C\) is the confidence level
  - proportions: standard Normal
  - means: \(t\)-distribution
- standard error is the standard deviation of point estimates:
  - \(\sigma_{\widehat{p}} = \sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion
  - \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the standard deviation of the population
  - In reality, we don’t know the values of the population parameters, so we estimate them:
  - \(\widehat{\sigma}_{\widehat{p}} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\), where \(\widehat{p}\) is the sample proportion
  - \(\widehat{\sigma}_{\bar{X}} = \frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation

Assumptions & conditions

The sample is randomly drawn from the population
The sample values are independent. As a rule of thumb, when sampling without replacement, a sample larger than 10% of the population severely violates independence
The sample size must be large enough
- For proportions: Check \(n\times \widehat{p} \ge 10\) and \(n\times(1-\widehat{p}) \ge 10\)
- For means: Usually, \(n > 30\) is enough to get a reasonable approximation (but this is not guaranteed)
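
The checkable conditions for a proportion can be sketched as a small helper. The function `check_prop_conditions()` and the population size `N = 50000` are hypothetical, chosen for illustration; random sampling itself cannot be verified from the data and has to come from the study design.

```r
# Illustrative helper (not from any package): returns one TRUE/FALSE per condition
# x = number of successes, n = sample size, N = assumed population size
check_prop_conditions <- function(x, n, N) {
  c(
    ten_percent_rule = n / N <= 0.10,  # sample is at most 10% of the population
    enough_successes = x >= 10,        # n * phat >= 10
    enough_failures  = n - x >= 10     # n * (1 - phat) >= 10
  )
}

# Car-ownership counts with an assumed population size
conditions <- check_prop_conditions(x = 74, n = 200, N = 50000)
conditions
all(conditions)

# A small sample with few successes fails the success condition
check_prop_conditions(x = 3, n = 25, N = 50000)
```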

Today’s worksheet

Explore the relationship between the Normal and \(t\)-distribution
Calculate confidence intervals using the Normal and \(t\)-distributions

Now it’s your turn!

Navigate to Canvas and open worksheet_05

We are here to help!