I believe “data-driven analysis” is highly relevant for our students, yet it is often overlooked.
I have been working out how (and what) to teach about this topic in a way that is relatable and informative for students.
How sequential decisions can invalidate your inference
By the end of this lecture, you will be able to:
Recognize how “data-driven” decisions (like pre-testing assumptions or selecting variables based on data) can invalidate standard p-values.
Demonstrate through simulation how “Post-Selection Inference” inflates Type I error rates.
Implement data splitting as a practical solution to ensure valid inference.
Letting the data decide which method to run adds “methodological randomness” that standard inferential methods do not account for, and it is often overlooked.
Unfortunately, this can lead to “invalid” inference by inflating error rates.
Do it or not do it? That is the question.
Suppose we want to evaluate two different bottling machines (A and B). We need to ensure they fill bottles to the same average volume (\(\mu_A = \mu_B\)).
We collect a sample of filled bottles from each machine:
If the variances are equal, which test should we use? More importantly, why is that? What is the advantage of the chosen test?
Why is it “dangerous” to use Student’s \(t\)-test if the variances are unequal?
How would you decide which test to use? Would you check the variances first? If so, how?
flowchart TD
A[Data] -->|Check Variance| B{F-test}
B -->|F-test p > 0.05| C[Student's t-test]
B -->|F-test p <= 0.05| D[Welch's t-test]
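For reference, the two branches of the flowchart differ only in how the standard error and the degrees of freedom are computed. The notation below (\(\bar{x}_A, \bar{x}_B\) for sample means, \(s_A^2, s_B^2\) for sample variances, \(n_A, n_B\) for sample sizes) is assumed here and is not taken from the slides.

\[
F = \frac{s_A^2}{s_B^2}, \qquad \text{df} = (n_A - 1,\; n_B - 1)
\]

\[
t_{\text{Student}} = \frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}}, \qquad
s_p^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}, \qquad \text{df} = n_A + n_B - 2
\]

\[
t_{\text{Welch}} = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}, \qquad
\text{df} \approx \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}}
\]

Student's version pools the two variances, which is only appropriate when they are equal; Welch's version keeps them separate and pays for it with a smaller, data-dependent number of degrees of freedom.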
Suppose that both machines have the exact same average fill volume. We want to test \(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\).
Which hypothesis is true?
Important: the truth vs the decision
It is important to distinguish between the underlying true state (the true hypothesis) and the decision you make based on the data.
Suppose that both machines have the exact same average fill volume. We want to test \(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\) at \(\alpha = 0.05\) significance level.
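A minimal sketch of how such a simulation could be set up, using only the Python standard library. The sample sizes, standard deviations, common mean of 500, number of replications, and seed are all illustrative assumptions, not values from the lecture. Here `results` is a list of dictionaries standing in for the data frame, and the p-values are obtained by numerically integrating the \(t\) and Beta densities rather than calling a statistics library.

```python
import math
import random
import statistics


def simpson(f, a, b, n=200):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def t_pvalue(t, df):
    """Two-sided p-value: 1 - 2 * (integral of the t density over [0, |t|])."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    return min(1.0, max(0.0, 1.0 - 2.0 * simpson(pdf, 0.0, abs(t))))


def f_pvalue(f_stat, d1, d2):
    """Two-sided p-value for a variance ratio, assuming d1 > 2 and d2 > 2.

    If F ~ F(d1, d2), then d1*F / (d1*F + d2) ~ Beta(d1/2, d2/2).
    """
    a, b = d1 / 2, d2 / 2
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(x):
        if x <= 0.0:
            return 0.0
        return math.exp(log_c + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x))

    cdf = simpson(pdf, 0.0, d1 * f_stat / (d1 * f_stat + d2))
    return min(1.0, max(0.0, 2.0 * min(cdf, 1.0 - cdf)))


def simulate_pretest(n_sims=2000, n_a=10, n_b=30, sd_a=2.0, sd_b=1.0, alpha=0.05, seed=1):
    """Simulate under H0 (equal means) and record the p-value of each approach."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        a = [rng.gauss(500.0, sd_a) for _ in range(n_a)]  # machine A (assumed values)
        b = [rng.gauss(500.0, sd_b) for _ in range(n_b)]  # machine B, same true mean
        diff = statistics.fmean(a) - statistics.fmean(b)
        va, vb = statistics.variance(a), statistics.variance(b)

        # Student's t-test: pooled variance, df = n_a + n_b - 2.
        df_s = n_a + n_b - 2
        sp2 = ((n_a - 1) * va + (n_b - 1) * vb) / df_s
        p_student = t_pvalue(diff / math.sqrt(sp2 * (1 / n_a + 1 / n_b)), df_s)

        # Welch's t-test: separate variances, Welch-Satterthwaite df.
        se2 = va / n_a + vb / n_b
        df_w = se2 ** 2 / ((va / n_a) ** 2 / (n_a - 1) + (vb / n_b) ** 2 / (n_b - 1))
        p_welch = t_pvalue(diff / math.sqrt(se2), df_w)

        # Data-driven approach: the F-test decides which t-test gets reported.
        p_f = f_pvalue(va / vb, n_a - 1, n_b - 1)
        p_data_driven = p_student if p_f > alpha else p_welch

        results.append({"p_f": p_f, "p_student": p_student,
                        "p_welch": p_welch, "p_data_driven": p_data_driven})
    return results


results = simulate_pretest()
print(len(results), "simulated experiments")
```

Each row holds the F-test p-value and the three candidate p-values for one simulated experiment, so a rejection rate is the proportion of rows with a p-value below 0.05 in the relevant column. With a statistics library available, the hand-written p-value helpers would normally be replaced by library calls.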
Exercise 1: Using the data frame results, calculate the proportion of times we rejected \(H_0\) (i.e., p-value < 0.05) for each of the three testing approaches: (1) Student’s \(t\)-test, (2) Welch’s \(t\)-test, and (3) the “Data-Driven” conditional approach.
Exercise 2: Compute the Type I error rate of the “Data-Driven” approach stratified by the F-test decision. Calculate the rejection rate when (a) the F-test failed to reject (\(p > 0.05\)) and (b) the F-test rejected (\(p \leq 0.05\)).
Why? We want to see which branch of our decision tree is causing the inflated overall error rate.
While different values of the parameters (sample sizes, variances, etc.) can lead to different error rates, the main point stands: running an F-test before a t-test is not good practice.
The solution here is simple: just default to Welch’s t-test, which is robust to unequal variances and loses very little power when the variances are actually equal.
Questions?
A harder problem
We have learned many ways to compare models: (1) \(C_p\), (2) AIC, (3) BIC, (4) F-test, and (5) cross-validation MSE.
We have also seen different techniques to search for a good model:
The problem with using the same data for model selection and inference is that we become overconfident.
The sampling distribution is not what we assume it is, which leads to:
Generate 100 observations of each variable from a normal distribution.
Apply only the first step of forward selection. In other words, we will add only one variable among ten potential candidates.
Then, test whether the selected covariate is significant at the 5% level.
Finally, replicate this study 2,000 times and count how many times we reject \(H_0\).
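The steps above can be sketched as follows, using only the Python standard library. Several details are assumptions made for illustration: the response and all ten covariates are independent standard normal draws (so \(H_0\) is true for every covariate), the first forward-selection step picks the covariate with the largest \(R^2\), and the seed is arbitrary. The slope test uses the identity \(t = r\sqrt{(n-2)/(1-r^2)}\) for simple linear regression, compared against the hard-coded 0.975 quantile of the \(t\) distribution with 98 degrees of freedom.

```python
import math
import random

N = 100               # observations per variable
N_COVARIATES = 10     # candidate covariates
T_CRIT_98 = 1.9845    # 0.975 quantile of t with N - 2 = 98 degrees of freedom


def centered(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def abs_t_of_selected(rng):
    """Simulate pure-noise data, keep the covariate with the largest R^2,
    and return the absolute t statistic of its slope."""
    y = centered([rng.gauss(0, 1) for _ in range(N)])
    syy = sum(v * v for v in y)
    best_r2 = 0.0
    for _ in range(N_COVARIATES):
        x = centered([rng.gauss(0, 1) for _ in range(N)])
        sxy = sum(a * b for a, b in zip(x, y))
        r2 = sxy * sxy / (sum(a * a for a in x) * syy)
        best_r2 = max(best_r2, r2)  # first forward-selection step
    return math.sqrt(best_r2 * (N - 2) / (1 - best_r2))


def type_one_error(n_sims=2000, seed=2026):
    """Proportion of replications in which the selected covariate is 'significant'."""
    rng = random.Random(seed)
    rejections = sum(abs_t_of_selected(rng) > T_CRIT_98 for _ in range(n_sims))
    return rejections / n_sims


rate = type_one_error()
print(f"Type I error rate after selection: {rate:.3f}")
```

Selection and testing use the same 100 observations here, which is exactly the practice under scrutiny. The exact rate depends on the seed and on the assumed setup.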
What is the correct decision for our hypothesis?
A. Reject \(H_0\).
B. Do not reject \(H_0\). \(\leftarrow\) Correct!
C. Depends on the data.
This simulation will help us estimate the …
A. Power of the test.
B. Type I Error Rate. \(\leftarrow\) Correct!
C. Probability of making an error.
D. None of the above.
What should be the probability of rejecting \(H_0\)?
A. \(5\%\). \(\leftarrow\) Correct!
B. \(95\%\).
C. Depends on the true value of \(\beta\), which is unknown.
D. None of the above.
[1] "Out of the 2000 simulations, we rejected H0 763 times."
[1] "Type I Error Rate: 0.3815"
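A rough back-of-the-envelope check of this number: the selected covariate is the one with the smallest p-value, so we reject whenever at least one of the ten candidate tests rejects. Treating the ten tests as independent (an assumption, reasonable when the covariates are generated independently of each other and of the response):

\[
P(\text{reject } H_0 \text{ after selection}) = 1 - (1 - 0.05)^{10} = 1 - 0.95^{10} \approx 0.40
\]

This is close to the simulated rate of 0.3815 and eight times the nominal 5%.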
In this case, we knew that \(H_0\) was true. But in practice, we do not know.
So how can we trust that we are rejecting \(H_0\) because the variable is truly important rather than because our Type I error probability is inflated?
Intuition: if a variable is truly important, it will be important in both splits.
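A minimal sketch of data splitting in the same pure-noise setting, using only the Python standard library. The 50/50 split, the independent standard normal data, the largest-\(R^2\) selection rule, and the seed are assumptions made for illustration. Because the rows are independent and identically distributed, taking the first half for selection and the second half for inference is equivalent to a random split.

```python
import math
import random

N = 100               # total observations
N_COVARIATES = 10     # candidate covariates
HALF = N // 2         # size of the selection half (assumed 50/50 split)
T_CRIT_48 = 2.0106    # 0.975 quantile of t with (N - HALF) - 2 = 48 degrees of freedom


def r_squared(x, y):
    """Squared sample correlation between x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def split_and_test(rng):
    """Select on one half, test on the other; return True if H0 is rejected.

    All variables are pure noise, so every rejection is a Type I error.
    """
    y = [rng.gauss(0, 1) for _ in range(N)]
    xs = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(N_COVARIATES)]

    # Selection half: first forward-selection step (largest R^2).
    best = max(range(N_COVARIATES),
               key=lambda j: r_squared(xs[j][:HALF], y[:HALF]))

    # Inference half: slope t-test for the selected covariate only.
    r2 = r_squared(xs[best][HALF:], y[HALF:])
    m = N - HALF
    t_abs = math.sqrt(r2 * (m - 2) / (1 - r2))
    return t_abs > T_CRIT_48


def type_one_error(n_sims=2000, seed=2026):
    """Proportion of replications that reject H0 on the inference half."""
    rng = random.Random(seed)
    return sum(split_and_test(rng) for _ in range(n_sims)) / n_sims


rate = type_one_error()
print(f"Type I error rate with data splitting: {rate:.3f}")
```

The inference half never influenced which covariate was chosen, so the test on that half behaves like an ordinary pre-specified test and its rejection rate should sit near the nominal 5%.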
Trade-off: We sacrifice a large chunk of our data for search, which increases the standard error in the inference step:
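As a rough illustration for a simple linear regression slope fitted on \(m\) observations (the notation \(\hat{\sigma}\) for the residual standard deviation and \(s_x\) for the covariate's sample standard deviation is assumed here):

\[
\mathrm{SE}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}} \approx \frac{\hat{\sigma}}{s_x \sqrt{m}},
\qquad
\frac{\mathrm{SE} \text{ with } m = n/2}{\mathrm{SE} \text{ with } m = n} \approx \sqrt{2} \approx 1.41
\]

So, assuming a 50/50 split, the standard error in the inference step is about 41% larger than it would be with the full sample.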
Data-driven decisions can lead to invalid inference by inflating Type I error rates.
In the case of pre-testing for variances, we can just default to Welch’s t-test, which is robust to unequal variances.
In the case of model selection, data splitting is a practical approach to mitigate the issue, but it comes with trade-offs:
Thank you!
Questions?
© 2026 Rodolfo Lourenzutti CC-BY-SA-NC 4.0