Module 5: Confidence Intervals Based on Distributional Assumptions

Learning objectives

  • Explain the role of the Central Limit Theorem in constructing confidence intervals.
  • Describe the \(t\)-distribution family and its relationship with the Normal distribution.
  • Write a computer script to calculate confidence intervals based on distributional assumptions.
  • Calculate z-scores.
  • Discuss the potential limitations of these methods.
  • Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty.

Review: bootstrap vs CLT approximation of the sampling distribution

Today’s goal

  • Our goal: calculate confidence intervals using distributional assumptions

Using the bootstrap to calculate a confidence interval

  • Technical: 95% of samples of size \(n\) will produce a 95% CI that contains the true population parameter value

  • Simpler: we are 95% confident that the true population parameter value lies in our interval

Assumption of normality and central limit theorem

Quantiles for the conf. interval

  • Before we took quantiles from the Bootstrap Distribution.

  • This time we will take quantiles from the Sampling Distribution estimated via CLT.

Quantiles for the conf. interval

  • 2.5\(^{th}\) percentile: \(P(X \leq x) = 0.025\)
    • qnorm(0.025, mu, sigma)
  • 97.5\(^{th}\) percentile: \(P(X \leq x) = 0.975\)
    • qnorm(0.975, mu, sigma)
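
The two qnorm calls above can be tried out with concrete numbers. The following is a minimal sketch in which the values of mu and sigma are made up purely for illustration; it also shows how the same quantiles can be obtained from a standard Normal z-score.

```r
# Illustrative (assumed) centre and spread of the sampling distribution
mu <- 10
sigma <- 2

# 2.5th and 97.5th percentiles of a Normal with mean mu and sd sigma
lower <- qnorm(0.025, mean = mu, sd = sigma)
upper <- qnorm(0.975, mean = mu, sd = sigma)
c(lower, upper)

# Sanity check: the area between the two quantiles is 0.95
pnorm(upper, mu, sigma) - pnorm(lower, mu, sigma)

# Same quantiles via a z-score: quantile = mu + z * sigma
z_star <- qnorm(0.975)
mu + c(-1, 1) * z_star * sigma
```

Note that qnorm takes the standard deviation, not the variance, as its third argument.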

Standard Normal - z scores

Standard normal distribution

Example: CI for Proportion via CLT

  • To estimate the proportion of UBC students who own cars, a random sample of 200 students was taken, and it was found that 74 of them owned cars.
  • CLT? Check conditions:
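    • UBC has more than 2000 students, so the sample of 200 is less than 10% of the population (independence approximately holds);
    • \(n\times \widehat{p} = 74 \geq 10\) and \(n(1-\widehat{p}) = 126 \geq 10\), using \(\widehat{p}\) because \(p\) is unknown;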

Example: CI for Proportion via CLT

  • The CLT says that, for large samples, \(\widehat{p}\) is approximately \(N\left(p, \frac{p(1-p)}{n}\right)\);
  • Since we don’t know \(p\), we approximate with \(\widehat{p}\): \[N\left(\widehat{p}, \frac{\widehat{p}(1-\widehat{p})}{n}\right) = N\left(0.37, \frac{0.37(1-0.37)}{200}\right)\]
  • \(95\%\) CI boundaries:
    • ci_lower <- qnorm(0.025, phat, sqrt(phat*(1-phat)/n))
    • ci_upper <- qnorm(0.975, phat, sqrt(phat*(1-phat)/n))
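
The two lines above assume that phat and n already exist. A minimal self-contained sketch for the car-ownership example (74 owners out of 200 students) could look like this; the variable names x and se_hat are introduced here only for illustration.

```r
# Data from the example: 74 of 200 sampled students own a car
n <- 200
x <- 74
phat <- x / n                             # sample proportion, 0.37

# Estimated standard error of phat (the sd of the approximating Normal)
se_hat <- sqrt(phat * (1 - phat) / n)

# 95% CI boundaries: 2.5th and 97.5th percentiles of N(phat, se_hat^2)
ci_lower <- qnorm(0.025, phat, se_hat)
ci_upper <- qnorm(0.975, phat, se_hat)
c(ci_lower, ci_upper)                     # roughly 0.303 and 0.437
```

The second parameter of \(N(\cdot,\cdot)\) in the slides is a variance, which is why the code passes its square root to qnorm.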

The usual format: proportion

  • CI for proportion is given by:

\[ \text{CI}\left(p, 1-\alpha\right) = \widehat{p}\pm z^*_{(1+C)/2}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]

  • \(1-\alpha\) represents the confidence level, also written \(C\), so \((1+C)/2 = 1-\alpha/2\);

    • e.g., for a \(95\%\) CI we use \(\alpha = 0.05\) and \(C = 0.95\);
  • \(z^*_{(1+C)/2}\) is the \((1+C)/2\) quantile of the standard Normal distribution;

    • e.g., for a \(95\%\) CI, \(z^*_{1-0.05/2}=z^*_{0.975}=\)qnorm(0.975)
  • \(\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\) is the estimated std. error;

  • \(z^*_{(1+C)/2}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\) is called the Margin of Error;

    • it is half the width of the confidence interval;
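
The "estimate plus or minus margin of error" format can be wrapped in a small helper. This is a sketch only: prop_ci is a hypothetical function name, not part of base R or of the course materials, and it is applied to the car-ownership counts from the earlier example.

```r
# Hypothetical helper: CLT-based CI for one proportion
# x = number of successes, n = sample size, conf = confidence level C
prop_ci <- function(x, n, conf = 0.95) {
  phat <- x / n
  z_star <- qnorm((1 + conf) / 2)          # critical value z*_{(1+C)/2}
  se_hat <- sqrt(phat * (1 - phat) / n)    # estimated standard error
  moe <- z_star * se_hat                   # margin of error
  c(estimate = phat,
    lower = phat - moe,
    upper = phat + moe,
    margin_of_error = moe)
}

# Car-ownership example: 74 of 200 students
res <- prop_ci(74, 200)
res
```

This gives the same boundaries as taking the 0.025 and 0.975 quantiles of the approximating Normal directly, because a Normal quantile equals the mean plus the z-score times the standard deviation.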

Example: Proportion

  • Let’s compute the \(95\%\) confidence interval for the proportion of students who own a car using the data stored in sample_students.

The usual format: General

  • Usually, we write the CI for a parameter \(\theta\) by:

\[ \text{CI}\left(\theta, 1-\alpha\right) = \widehat{\theta}\pm q^*_{(1+C)/2}\widehat{SE}(\widehat{\theta}) \]

  • \(\theta\) is a generic parameter (e.g., proportion, mean, difference in prop, difference in means);

  • \(q^*_{(1+C)/2}\) is the \((1+C)/2\) quantile of the standardized sampling distribution of \(\hat{\theta}\), where \(C = 1-\alpha\) is the confidence level;

    • e.g., for the proportion this distribution was the standard Normal (usually denoted by \(Z\))
  • \(\widehat{SE}(\widehat{\theta})\) is the estimated std. error;

    • e.g., for the proportion \(\widehat{SE}(\widehat{p})=\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\);

\(t\)-distribution

  • bell-shaped, unimodal, and symmetric about 0
  • has thicker tails and is more spread out than the standard Normal distribution
  • probabilities depend on the degrees of freedom, \(\text{df}\) (the distribution is denoted \(t_{\text{df}}\)); for a CI for one mean, \(\text{df} = n-1\)
  • As \(\text{df} \rightarrow \infty\), \(t_{\text{df}} \rightarrow N (0, 1)\)
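
The convergence of the \(t\)-distribution to the standard Normal can be seen numerically. The sketch below compares the 0.975 quantile of several \(t\)-distributions with that of the standard Normal; the particular degrees of freedom are chosen arbitrarily for illustration.

```r
p <- 0.975
dfs <- c(2, 5, 10, 30, 100, 1000)   # arbitrary degrees of freedom

# 0.975 quantile of each t-distribution (qt is vectorised over df)
t_quantiles <- qt(p, df = dfs)

# Side-by-side comparison with the standard Normal quantile
data.frame(df = dfs,
           t_quantile = t_quantiles,
           z_quantile = qnorm(p))
```

The \(t\) quantiles are always larger than the Normal one (thicker tails) and shrink towards it as the degrees of freedom grow.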

Code for quantiles e.g., \(P(X \le x) = 0.25\)

  • Normal
    • qnorm(0.25, mu, sigma)
  • \(t\)-distribution; df = degrees of freedom
    • qt(0.25, df)

image source: Visual guide to pnorm, dnorm, qnorm, and rnorm functions in R

CI: One mean

\[ \text{CI}\left(\mu, 1-\alpha\right) = \bar{X}\pm t^*_{n-1, (1+C)/2}\frac{S}{\sqrt{n}} \]

  • \(C = 1-\alpha\) is the confidence level;
  • \(t^*_{n-1, (1+C)/2}\) is the \((1+C)/2\) quantile of a \(t\)-distribution with \(n-1\) degrees of freedom;
    • You calculate this in R using qt((1+C)/2, n - 1).
  • \(\frac{S}{\sqrt{n}}\) is the estimated std. error (\(S\) is the sample std. dev.);
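
The formula for one mean translates almost line by line into R. The sketch below uses a hypothetical helper mean_ci and simulated weights (not real penguin data), and compares the result with the interval reported by base R's t.test.

```r
# Hypothetical helper: t-based CI for one mean
mean_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]                           # drop missing values
  n <- length(x)
  t_star <- qt((1 + conf) / 2, df = n - 1)    # critical value
  se_hat <- sd(x) / sqrt(n)                   # estimated standard error S / sqrt(n)
  mean(x) + c(lower = -1, upper = 1) * t_star * se_hat
}

# Simulated data for illustration only (assumed mean and sd)
set.seed(1)
weights <- rnorm(40, mean = 3700, sd = 450)

mean_ci(weights, conf = 0.99)

# Cross-check against the interval computed by t.test
t.test(weights, conf.level = 0.99)$conf.int
```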

Example: Mean

  • Let’s calculate the \(99\%\) confidence interval for the mean body weight of Adelie penguins using the data stored in penguins_clean.

CI: difference in proportions

  • Let \(\color{red}{x_1, x_2, ..., x_{n_1}}\) be a random sample from a population with proportion \(\color{red}{p_1}\).
  • Let \(\color{blue}{y_1, y_2, ..., y_{n_2}}\) be a random sample from a population with proportion \(\color{blue}{p_2}\).
  • Assuming CLT conditions are satisfied, the CI for \(p_1 - p_2\) is:

\[\color{red}{\widehat{p}_1} - \color{blue}{\widehat{p}_2} \pm z^*_{(1+C)/2} \times \sqrt{ \color{red}{\frac{\widehat{p}_1(1-\widehat{p}_1)}{n_1}} + \color{blue}{\frac{\widehat{p}_2(1-\widehat{p}_2)}{n_2}}}\]
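
A minimal sketch of this formula in R follows. The function name diff_prop_ci is hypothetical, and the counts in the usage line are made up for illustration; they are not taken from the penguins data.

```r
# Hypothetical helper: CLT-based CI for a difference in proportions
# x1, x2 = numbers of successes; n1, n2 = sample sizes
diff_prop_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  z_star <- qnorm((1 + conf) / 2)
  # Standard error: the two groups' variances add because the samples are independent
  se_hat <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  (p1 - p2) + c(lower = -1, upper = 1) * z_star * se_hat
}

# Made-up counts, 90% confidence level
ci_diff <- diff_prop_ci(x1 = 45, n1 = 150, x2 = 20, n2 = 70, conf = 0.90)
ci_diff
```

If the resulting interval contains 0, the data are compatible with the two proportions being equal at that confidence level.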

CI: Example difference in proportions

  • Is the proportion of Adelie penguins weighing more than 4000 g the same as that of Chinstrap penguins?

  • Let’s calculate the 90% CI for the difference in proportions (Adelie - Chinstrap)!

CI: difference in means for independent groups

  • Let \(\color{red}{x_1, x_2, ..., x_{n_1}}\) be a random sample from a population with mean \(\color{red}{\mu_1}\) and standard deviation \(\color{red}{\sigma_1}\)
  • Let \(\color{blue}{y_1, y_2, ..., y_{n_2}}\) be a random sample from a population with mean \(\color{blue}{\mu_2}\) and standard deviation \(\color{blue}{\sigma_2}\)
  • Assuming CLT conditions are satisfied, CI can be estimated:

\[\color{red}{\bar{X}} - \color{blue}{\bar{Y}} \pm t^*_{k, (1+C)/2} \sqrt{\color{red}{\frac{S_X^2}{n_1}} + \color{blue}{\frac{S_Y^2}{n_2}}}\]

  • where the degrees of freedom \(k\) is given by:

\[ k = \frac{ \left(\color{red}{\frac{S^2_X}{n_1}} + \color{blue}{\frac{S^2_Y}{n_2}}\right)^2 }{ \color{red}{\frac{S_X^4}{n_1^2(n_1-1)}} + \color{blue}{\frac{S_Y^4}{n_2^2(n_2-1)}} } \]
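
The degrees-of-freedom formula is easy to get wrong by hand, so a short sketch may help. The helper welch_ci is a hypothetical name, and the two samples are simulated (not the penguin data); the result is cross-checked against base R's t.test, which uses the same unequal-variance approximation when var.equal = FALSE.

```r
# Hypothetical helper: CI for a difference in means of independent groups
welch_ci <- function(x, y, conf = 0.95) {
  n1 <- length(x)
  n2 <- length(y)
  v1 <- var(x) / n1                    # S_X^2 / n1
  v2 <- var(y) / n2                    # S_Y^2 / n2
  # Degrees of freedom k from the formula above
  k <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  t_star <- qt((1 + conf) / 2, df = k)
  ci <- (mean(x) - mean(y)) + c(-1, 1) * t_star * sqrt(v1 + v2)
  list(df = k, ci = ci)
}

# Simulated samples for illustration only
set.seed(2)
x <- rnorm(50, mean = 3700, sd = 450)
y <- rnorm(35, mean = 3730, sd = 380)

res <- welch_ci(x, y)
res

# Cross-check with t.test (unequal variances)
tt <- t.test(x, y, var.equal = FALSE)
tt$parameter   # degrees of freedom
tt$conf.int
```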

CI: Example difference in means

  • Is the mean body_mass_g of Adelie penguins the same as that of Chinstrap penguins?

  • Let’s calculate the 95% CI for the difference in means (Adelie - Chinstrap)!

    • note that our samples are independent!

Take home

We can calculate CIs for population parameters as:

  • statistic \(\pm\) margin of error
    • Margin of Error: extent of the interval on each side of the point estimate
  • statistic \(\pm\) critical value \(\times\) standard error
    • critical value: value such that the upper tail area under the distribution \(= \frac{1 - C}{2}\), where \(C\) is the confidence level
      • proportions: standard Normal
      • means: \(t\)-distribution
    • standard error is the standard deviation of the point estimator's sampling distribution:
      • \(\sigma_{\widehat{p}} = \sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion
      • \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the standard deviation of the population
      • In reality: We don’t know the value of the population parameters. Thus,
      • \(\widehat{\sigma}_{\widehat{p}} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\), where \(\widehat{p}\) is the sample proportion
      • \(\widehat{\sigma}_{\bar{X}} = \frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation

Assumptions & conditions

  • The sample is randomly drawn from the population
  • The sample values are independent. In general, when sampling without replacement, a sample size greater than 10% of the population size leads to a severe violation of independence
  • The sample size must be large enough
    • For proportions: Check \(n\times \widehat{p} \ge 10\) and \(n\times(1-\widehat{p}) \ge 10\)
    • For means: Usually, \(n > 30\) is enough to get a reasonable approximation (but this is not guaranteed)
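
The conditions for a proportion can be checked mechanically before computing the interval. The sketch below uses a hypothetical helper check_prop_conditions; the population size N = 50000 in the usage line is an assumed value for illustration (the slides only state that UBC has more than 2000 students).

```r
# Hypothetical helper: rule-of-thumb checks for a CLT-based proportion CI
# x = successes, n = sample size, N = population size (assumed known)
check_prop_conditions <- function(x, n, N) {
  c(enough_successes = x >= 10,          # n * phat >= 10
    enough_failures  = (n - x) >= 10,    # n * (1 - phat) >= 10
    under_10_percent = n <= 0.10 * N)    # sample is at most 10% of the population
}

# Car-ownership example with an assumed population size
checks <- check_prop_conditions(x = 74, n = 200, N = 50000)
checks
all(checks)
```

These are rules of thumb only; passing them does not guarantee that the Normal approximation is accurate.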

Today’s worksheet

  • Explore the relationship between the Normal and \(t\)-distribution
  • Calculate confidence intervals using the Normal and \(t\)-distributions

Now it’s your turn!

  • navigate to Canvas, open worksheet_05

We are here to help!