Module 7: Hypothesis testing based on distributional assumptions

Learning objectives

Use results from the assumption of normality or the Central Limit Theorem to perform hypothesis testing.
Compare and contrast how estimation and hypothesis testing differ between simulation- and resampling-based approaches and approaches based on the assumption of normality or the Central Limit Theorem.
Write a computer script to perform hypothesis testing based on results from the assumption of normality or the Central Limit Theorem.
Discuss the potential limitations of these methods.

Review: inferential goals

Today’s goal

Our goal: use the assumption of normality or the Central Limit Theorem to perform hypothesis testing

General procedures for null hypothesis testing

Define null and alternative hypotheses
Set significance level
Choose a test statistic
Create the distribution of test statistic under null
Calculate observed test statistic and associated \(p\)-value
Draw a conclusion

Assumption of normality and central limit theorem

Normal population

Any population

Population (binary)

Testing one proportion

Example: Coin Flip

Your friend claims that they have psychic abilities and can predict the outcome of coin flips before they happen. You decide to test them by flipping a coin 100 times, and count the number of times they guessed right.

Step 1: Define null and alternative hypotheses

\(H_0: p = 0.5\) vs. \(H_A: p > 0.5\)

Step 2: Specify the significance level

Let’s set the significance level at 10%

Step 3: Choose a test statistic

Since we are testing proportion, we can use \(\hat{p}\).

Step 4: Find the null model

If we assume \(H_0\) to be true, we have that \(p = p_0\). In that case, based on the CLT:

\[\hat{p} \sim N \left(p_0, \frac{p_0 (1-p_0)}{n} \right)\]

Check:
- \(n\times p_0 \ge 10\) and \(n\times(1 - p_0) \ge 10\)
- necessary conditions (random, independent)
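
The sample-size part of this check is easy to script. A minimal sketch, where `check_success_failure` and its `threshold` argument are illustrative names and not part of any R package:

```r
# Success-failure check for the Normal approximation to a sample proportion.
# `check_success_failure` is a hypothetical helper, not a base R function.
check_success_failure <- function(n, p0, threshold = 10) {
  expected_successes <- n * p0
  expected_failures  <- n * (1 - p0)
  # TRUE only when both expected counts reach the threshold
  expected_successes >= threshold && expected_failures >= threshold
}

check_success_failure(n = 100, p0 = 0.5)  # coin-flip example: 50 and 50
check_success_failure(n = 15, p0 = 0.5)   # 7.5 and 7.5: too small
```

The random-sampling and independence conditions cannot be verified by code; they depend on how the data were collected.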

In this example, \(H_0: p = 0.5\), therefore \(p_0 = 0.5\), consequently under \(H_0\) we have:

\[\hat{p} \sim N \left(0.5, \frac{0.5\times0.5}{100} \right)\]

Step 5: Compute the p-value

Your friend correctly predicts 55 out of 100 flips. How unusual is \(\hat{p} = 0.55\) under our null model?

(p_value <- pnorm(0.55, 0.5, sqrt((0.5*0.5)/100), lower.tail = FALSE))

[1] 0.1586553

Step 6: Make Decision

Since the \(p\)-value > 0.10, we do not reject \(H_0\) and conclude that:

There is not enough evidence, at the \(10\%\) significance level, to suggest that your friend’s guesses were better than random guessing.

Review: p-value

\(p\)-value:
- summarizes the evidence
- describes how unusual the data would be if \(H_0\) were true
- defined as the probability of observing a result as extreme or more extreme towards the alternative hypothesis than what we observed given that \(H_0\) is true
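
To make the definition concrete, the coin-flip p-value can also be approximated by simulating many experiments in which \(H_0\) is true. This is only a sketch: the seed and the number of simulations are arbitrary choices, and the exact binomial tail is included purely as a cross-check.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

n <- 100        # number of flips
p0 <- 0.5       # probability of a correct guess under H0
observed <- 55  # correct guesses in the coin-flip example

# Simulate the number of correct guesses in many experiments where H0 is true
n_sims <- 100000
simulated_counts <- rbinom(n_sims, size = n, prob = p0)

# Share of simulated experiments at least as extreme as the observed one
p_value_sim <- mean(simulated_counts >= observed)

# Exact binomial tail probability P(X >= 55), for comparison
p_value_exact <- pbinom(observed - 1, size = n, prob = p0, lower.tail = FALSE)

c(simulated = p_value_sim, exact = p_value_exact)
```

Both values should agree closely with each other. They come out somewhat larger than the CLT-based p-value from the coin-flip example, because the Normal curve is continuous while the number of correct guesses is discrete.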

CLT and Standardization

When relying on the CLT, we usually use a standardized version of the test statistic.
For example, instead of using \(\hat{p} \sim N\left(p_0, \frac{p_0(1-p_0)}{n}\right)\), we use the z-score: \[Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}\sim N(0,1)\]
The advantage of doing this is that we can use the same process for a proportion, a mean, a difference in proportions, and a difference in means.
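
A short sketch of the standardized test for the coin-flip numbers follows. The helper name `z_test_one_prop` is made up for this illustration, and it only covers the upper-tailed alternative \(H_A: p > p_0\).

```r
# Upper-tailed z-test for a single proportion.
# `z_test_one_prop` is a hypothetical helper written for this example.
z_test_one_prop <- function(successes, n, p0) {
  p_hat <- successes / n
  se0 <- sqrt(p0 * (1 - p0) / n)  # standard error under H0
  z <- (p_hat - p0) / se0
  list(z = z, p_value = pnorm(z, lower.tail = FALSE))
}

result <- z_test_one_prop(successes = 55, n = 100, p0 = 0.5)

# Same tail area computed on the original scale of p-hat
p_unstandardized <- pnorm(0.55, mean = 0.5, sd = sqrt(0.5 * 0.5 / 100),
                          lower.tail = FALSE)

result$z
all.equal(result$p_value, p_unstandardized)
```

Standardizing only relabels the horizontal axis, so the tail area, and therefore the p-value, does not change.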

Example - Coin Flip - Revisited

Step 1: Define null and alternative hypotheses

\(H_0: p = 0.5\) vs. \(H_A: p > 0.5\)

Step 2: Specify the significance level

Let’s set the significance level at 10%

Step 3: Choose a test statistic

Since we are testing proportion, we will use

\[Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}\]

Step 4: Find the null model

If we assume \(H_0\) to be true, we have that \(p = p_0\). In that case, based on the CLT:

\[Z \sim N \left(0, 1\right)\]

Check:
- \(n\times p_0 \ge 10\) and \(n\times(1 - p_0) \ge 10\)
- necessary conditions (random, independent)

Step 5: Compute the p-value

Your friend correctly predicts 55 out of 100 flips. How unusual is \(\hat{p} = 0.55\) under our null model? \[Z = \frac{0.55 - 0.5}{0.05} = 1\]

(p_value <- pnorm(1, lower.tail = FALSE))

[1] 0.1586553

Note that we get exactly the same p-value:
- the two tests are equivalent.

Step 6: Make Decision

Since the \(p\)-value > 0.10, we do not reject \(H_0\) and conclude that:

There is not enough evidence, at the \(10\%\) significance level, to suggest that your friend’s guesses were better than random guessing.

Testing one mean

Example: Coffee Shop

Alice, the coffee shop owner, wants to see if her new marketing campaign increased her average daily latte sales, which were 50 before the campaign. After 25 days, the average sales increased to 55 lattes per day, with a sample standard deviation of 8.

Step 1: Define null and alternative hypotheses

\(H_0: \mu = 50\) vs. \(H_A: \mu > 50\)

Step 2: Specify the significance level

Let’s set the significance level at \(5\%\).

Step 3: Choose a test statistic

We will use the \(t\)-statistic: \[T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}\]

Step 4: Find the null model

If we assume \(H_0\) to be true, we have that \(\mu = \mu_0\). In that case, if the population is Normal (or, approximately, if \(n\) is large enough for the CLT to apply):

\[T \sim t_{n-1}\]

Assumptions and conditions:

Sample is randomly drawn from the population.
Sample values are independent.
- If the sample is drawn without replacement and its size is greater than 10% of the population size, the independence assumption is seriously violated.
Normality:
- When the underlying distribution of \(x\) is non-Normal or unknown, sample size must be large enough
- When the underlying distribution of \(x\) is exactly or nearly Normal, using the t-model is justified with small sample sizes

Step 5: Compute the p-value

Alice’s sample had \(\bar{X} = 55\) and \(S = 8\). The observed test statistic is:

\[T = \frac{55 - 50}{\frac{8}{\sqrt{25}}} = 3.125\]

(p_value <- pt(3.125, 24, lower.tail = FALSE))

[1] 0.002301319

Step 6: Make Decision

Since the \(p\)-value \(< 5\%\), we reject \(H_0\) and conclude that:

There is sufficient evidence, at the \(5\%\) significance level, to conclude that the average daily latte sales have increased after the new marketing campaign.
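
The whole coffee-shop calculation can be wrapped in a few lines. This is a minimal sketch from summary statistics only; `t_test_from_summary` is a hypothetical helper, and it handles just the upper-tailed alternative used here.

```r
# One-sample, upper-tailed t-test computed from summary statistics.
# `t_test_from_summary` is a hypothetical helper, not a base R function.
t_test_from_summary <- function(x_bar, s, n, mu0) {
  se <- s / sqrt(n)              # estimated standard error of the mean
  t_stat <- (x_bar - mu0) / se
  df <- n - 1
  list(t = t_stat, df = df, p_value = pt(t_stat, df, lower.tail = FALSE))
}

coffee <- t_test_from_summary(x_bar = 55, s = 8, n = 25, mu0 = 50)

coffee$t               # observed test statistic
coffee$p_value         # upper-tail p-value
coffee$p_value < 0.05  # TRUE means: reject H0 at the 5% level
```

With the raw daily sales in a vector `x` (not available here), `t.test(x, mu = 50, alternative = "greater")` would carry out the same test.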

Test the difference in proportions

Example: Cigarette

Example from: https://online.stat.psu.edu/stat415/lesson/9/9.4 Pennsylvania State University

Via a telephone poll, Time magazine asked 800 adult Americans:

“Should the federal tax on cigarettes be raised to pay for health care reform?”

The results of the survey were:

351 out of 605 non-smokers said “yes”
41 out of 195 smokers said “yes”

Is there sufficient evidence at the \(\alpha = 0.05\) significance level to conclude that the two populations differ significantly with respect to their opinions?

Step 1: Define null and alternative hypotheses

The hypotheses are: \[H_0: p_1 - p_2 = 0\quad \text{vs.} \quad H_A: p_1 - p_2 \neq 0\]

\(p_1\): proportion of the non-smoker population who reply “yes”
\(p_2\): proportion of the smoker population who reply “yes”

Step 2: Specify the significance level

The significance level was specified as \(\alpha = 0.05\).

Step 3: Choose a test statistic

The test statistic we will use is: \[Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p} \left(1 -\hat{p}\right)\left( \frac{1}{n_1} + \frac{1}{n_2}\right)}}\]

where \(\hat{p}\) is the overall sample proportion, i.e., \(\hat{p} = \frac{\# Successes}{\# Total}\) (considering both groups).

Step 4: Find the null model

If we assume \(H_0\) to be true, for large samples we have that:

\[Z \sim N \left(0, 1\right)\]

Step 5: Compute the p-value

Observed proportions:

Non-smokers: \(\hat{p}_1 = \frac{351}{605} = 0.58\)
Smokers: \(\hat{p}_2 = \frac{41}{195} = 0.21\)
Overall sample proportion: \(\hat{p} = \frac{351 + 41}{605 + 195} = 0.49\)

Observed test statistic:

\[ Z = \frac{0.58 - 0.21}{\sqrt{0.49\times 0.51\left( \frac{1}{605} + \frac{1}{195}\right)}} = 8.99\]

observed_test_statistic <- 8.99

(p_value <- 2*pnorm(abs(observed_test_statistic), lower.tail = FALSE))

[1] 2.472304e-19

Step 6: Make Decision

Since the \(p\)-value \(\leq 0.05\), we reject \(H_0\) and conclude that:

There is sufficient evidence at the \(5\%\) significance level to conclude that the two populations differ with respect to their opinions concerning imposing a federal tax to help pay for health care reform.
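
The same test can be computed directly from the survey counts instead of from rounded proportions. The sketch below uses the pooled standard error from Step 3; the variable names are illustrative. As a cross-check it calls `prop.test()` without continuity correction, whose X-squared statistic is the square of this \(Z\).

```r
yes <- c(non_smokers = 351, smokers = 41)   # "yes" answers per group
n   <- c(non_smokers = 605, smokers = 195)  # group sizes

p_hat <- yes / n                 # sample proportions
p_pooled <- sum(yes) / sum(n)    # overall sample proportion

# Standard error under H0 (pooled) and the z statistic
se0 <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z <- unname((p_hat[1] - p_hat[2]) / se0)
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Cross-check: without continuity correction, X-squared equals z^2
check <- prop.test(x = yes, n = n, correct = FALSE)
c(z = z, z_from_prop_test = sqrt(unname(check$statistic)))
p_value
```

Because no intermediate values are rounded, \(Z\) and the p-value differ slightly from the hand calculation, but the conclusion is the same.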

Hypothesis testing for difference in means

Comparing mean of two groups

To discuss the difference in means, we need to consider the relationship between the two groups.
If the two groups are independent, we use Welch’s two-sample t-test.
If the two groups are dependent, we use the paired t-test.

Independent groups

We want to see if people tend to marry later in life in the US compared to Canada.
We want to compare the red blood cell count in healthy people and people with leukemia.
We want to compare how much money Apple users are willing to spend on a new phone compared to Samsung users.

In all these cases, one group has nothing to do with the other group.

Dependent groups

We want to see if a new drug is effective in reducing blood pressure. We measure the blood pressure before the treatment and after the treatment.
We want to see if married people have similar IQ levels.
We want to compare the weight of twins at birth.

In all these cases, the elements in one group are related to the elements in the other group.

The case of Independent groups

Example: Independent Groups

A researcher wants to investigate whether there’s a difference in the average daily screen time between teenagers in urban and rural areas.
The researcher randomly samples 20 teenagers from urban areas and 25 teenagers from rural areas. They ask each teenager to report their average daily screen time (in hours) over the past week.
The results of the survey were:

Group	Sample size	Sample mean	Std Dev
Urban	20	6.2 hours	1.5 hours
Rural	25	5.5 hours	1.2 hours

Is there sufficient evidence at the \(\alpha = 10\%\) significance level to conclude that the average daily screen time differs between teenagers in urban and rural areas?

Step 1: Define null and alternative hypotheses

The hypotheses are: \[H_0: \mu_1 - \mu_2 = 0\quad \text{vs.} \quad H_A: \mu_1 - \mu_2 \neq 0\]

\(\mu_1\): average screen time for teenagers in urban areas
\(\mu_2\): average screen time for teenagers in rural areas

Step 2: Specify the significance level

The significance level was specified as \(\alpha = 10\%\).

Step 3: Choose a test statistic

The test statistic we will use is: \[T = \frac{\left(\bar{X}_1 - \bar{X}_2\right) - \Delta_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]

Step 4: Find the null model

If we assume \(H_0\) to be true, then \(\mu_1-\mu_2 = \Delta_0\) (here \(\Delta_0 = 0\)). For large samples, we have approximately:

\[T \sim t_k\] where k is \[k = \frac{ \left(\color{red}{\frac{S^2_1}{n_1}} + \color{blue}{\frac{S^2_2}{n_2}}\right)^2 }{ \color{red}{\frac{S_1^4}{n_1^2(n_1-1)}} + \color{blue}{\frac{S_2^4}{n_2^2(n_2-1)}} } \]

Assumptions and conditions for validity of using the t-model:

The two samples are randomly drawn from their respective populations.
Sampled individuals within the same sample are independent of each other. Just check that the two sample sizes are no greater than 10% of their respective population sizes.
Sample size:

If both \(x_1\) and \(x_2\) follow the Normal model, there is no restriction on the sample sizes \(n_1\) and \(n_2\).
If \(x_1\) and \(x_2\) are non-Normal or follow an unknown distribution, we need reasonably large sample sizes to validate the Normal approximation by the CLT as well as the use of the t-model.

The two samples must be independent of each other.

Step 5: Compute the p-value

Observed test statistic:

\[T = \frac{\left(6.2 - 5.5\right) - 0}{\sqrt{\frac{1.5^2}{20} + \frac{1.2^2}{25}}} \approx 1.6973\]

Degrees of freedom (\(k\)): \[k = 35.97\]

observed_test_statistic <- 1.6973

(p_value <- 2 * pt(abs(observed_test_statistic), 35.97, lower.tail = FALSE))

[1] 0.09827767

Step 6: Make Decision

Since the \(p\)-value \(\leq 10\%\), we reject \(H_0\) and conclude that:

There is sufficient evidence, at the \(10\%\) significance level, to conclude that teenagers in urban and rural areas have different average daily screen times.
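
The test statistic, the degrees of freedom \(k\), and the p-value can all be computed in one place from the summary table. A minimal sketch, assuming a two-sided alternative; `welch_from_summary` is a hypothetical helper name.

```r
# Welch's two-sample t-test (two-sided) from summary statistics.
# `welch_from_summary` is a hypothetical helper, not a base R function.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2, delta0 = 0) {
  v1 <- s1^2 / n1  # estimated variance of the first sample mean
  v2 <- s2^2 / n2  # estimated variance of the second sample mean
  t_stat <- ((m1 - m2) - delta0) / sqrt(v1 + v2)
  # Degrees of freedom k from the formula in Step 4
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  list(t = t_stat, df = df, p_value = p_value)
}

screen_time <- welch_from_summary(m1 = 6.2, s1 = 1.5, n1 = 20,
                                  m2 = 5.5, s2 = 1.2, n2 = 25)
screen_time$t
screen_time$df
screen_time$p_value
```

With the raw observations in two vectors, `t.test(x1, x2)` performs Welch's test by default (`var.equal = FALSE`). The p-value here sits just below \(10\%\), so the decision is sensitive to the chosen significance level.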

The case of Paired groups

Paired data

The trick is to take the difference within each pair and then run a one-mean test on those differences.
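
This equivalence can be checked in R. The measurements below are made up purely for illustration and are not the data from the example that follows.

```r
# Hypothetical before/after measurements for six subjects (illustrative only)
before <- c(72, 80, 65, 77, 70, 74)
after  <- c(70, 79, 66, 74, 68, 73)

# Within-pair differences
differences <- before - after

# Paired t-test on the two columns ...
paired_result <- t.test(before, after, paired = TRUE, alternative = "greater")

# ... and a one-sample t-test on the differences
one_sample_result <- t.test(differences, mu = 0, alternative = "greater")

c(paired = unname(paired_result$statistic),
  one_sample = unname(one_sample_result$statistic))
all.equal(paired_result$p.value, one_sample_result$p.value)
```

Both calls give the same test statistic, degrees of freedom, and p-value.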

Example: Paired Groups

A fitness instructor wants to evaluate the effectiveness of a new 8-week training program designed to improve participants’ resting heart rate (RHR). They believe the program will lower RHR.
The instructor recruits 12 participants and measures their RHR (in beats per minute) before starting the program and again after completing the 8-week program.
The data is collected as follows:

Statistic	Before	After	Difference
Mean	73.42	70.08	0.8
Std Dev	5.16	5.23	1.07
n	12	12	12

Step 1: Define null and alternative hypotheses

The hypotheses are: \[H_0: \mu_1 - \mu_2 = 0\quad \text{vs.} \quad H_A: \mu_1 - \mu_2 > 0\]

\(\mu_1\): average RHR before the program
\(\mu_2\): average RHR after the program

Step 2: Specify the significance level

Let’s set the significance level at \(\alpha = 1\%\).

Step 3: Choose a test statistic

The test statistic we will use is: \[T = \frac{\bar{d} - \Delta_0}{\frac{s_d}{\sqrt{n}}}\]
where:
- \(\bar{d}\): mean of the within pair differences
- \(s_d\): standard deviation of the within pair differences
- \(n\): number of pairs

Step 4: Find the null model

If we assume \(H_0\) to be true, then \(\mu_1-\mu_2 = \Delta_0\) (here \(\Delta_0 = 0\)). For large samples, we have approximately:

\[T \sim t_{n-1}\]

Assumptions and conditions for validity of using the t-model:

The n pairs are randomly drawn from the population.
The n pairs are independent of each other, i.e. any two distinct pairs are independent of each other. Just check that the number of pairs \(n\) is no greater than 10% of the population size.
Sample size

If the within-pair differences (\(d_i\)) follow a Normal or nearly Normal distribution model, you can use the model even if the sample size \(n\) is small.
If \(d_i\)’s follow an unknown or a non-Normal distribution, we need a reasonably large sample size \(n\) to validate the Normal approximation of \(\bar{d}\) by the CLT as well as the use of the t-model.

Step 5: Compute the p-value

Observed test statistic:

\[T = \frac{0.8 - 0}{\frac{1.07}{\sqrt{12}}} \approx 2.59\]

observed_test_statistic <- 0.8 / (1.07 / sqrt(12))

(p_value <- round(pt(observed_test_statistic, 11, lower.tail = FALSE), 4))

[1] 0.0126

Step 6: Make Decision

Since the \(p\)-value \(> 1\%\), we fail to reject \(H_0\) and conclude that:

There is not enough evidence, at \(\alpha = 1\%\), to conclude that the 8-week training program decreases participants’ average resting heart rate.

Take home - Bootstrapping vs. theory based approaches

Traditional theory based approach

Makes assumptions about the distribution
- results may not be valid if the assumptions are not met
Uses theory to tell what the sampling distribution should look like. We use equations to approximate the sampling distribution of specific sample statistics.

Simulation approach

Does not make assumptions about the distribution
- If the sample is representative of the population, bootstrapping/permutation will work well
- For small sample sizes, the sample may not do a good job of representing the population
Bootstrap is useful for cases where no formula for the sampling distribution of the statistic exists
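
The two approaches can be compared side by side on one data set. The sketch below uses simulated data (not real sales figures) and one common way of bootstrapping under \(H_0\), namely shifting the sample so that its mean equals \(\mu_0\). The seed, sample, and number of resamples are arbitrary assumptions.

```r
set.seed(301)  # arbitrary seed for reproducibility

# Hypothetical sample: 25 simulated daily sales counts (not real data)
sales <- round(rnorm(25, mean = 55, sd = 8))
mu0 <- 50

# Theory-based: one-sample, upper-tailed t-test
p_theory <- t.test(sales, mu = mu0, alternative = "greater")$p.value

# Simulation-based: resample from the sample shifted to satisfy H0
shifted <- sales - mean(sales) + mu0
boot_means <- replicate(10000, mean(sample(shifted, replace = TRUE)))

# Share of bootstrap means at least as large as the observed mean
p_boot <- mean(boot_means >= mean(sales))

c(theory = p_theory, bootstrap = p_boot)
```

When the assumptions behind the t-model are reasonable, the two p-values are usually of similar size. Larger disagreements tend to appear with small or strongly skewed samples.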

Today’s worksheet

Perform a range of hypothesis tests based on distributional assumptions and/or central limit theorem