Confidence intervals and hypothesis tests

The two questions of inference

Statistical inference addresses two fundamental questions. First, which values of a population parameter are plausible given the data? This is the domain of estimation. Second, is the observed data consistent with a specific hypothesized value for that parameter? This is the domain of hypothesis testing. While point estimation provides a single “best guess,” intervals and tests quantify the uncertainty surrounding that guess and support decisions under uncertainty. The two approaches are closely related: a 100(1-\alpha)\% confidence interval can be viewed as the set of all hypothesized parameter values that the corresponding two-sided test would not reject at level \alpha.

Confidence intervals

A confidence interval (CI) is a range of values, computed from the sample, produced by a procedure that covers the population parameter with a stated long-run frequency. It is a fundamental tool for expressing the precision of an estimate.

Construction

For a parameter \theta, a 100(1-\alpha)\% confidence interval is defined by two statistics, L(X_1, \dots, X_n) and U(X_1, \dots, X_n), such that the probability of the interval covering the true parameter is 1 - \alpha:

P(L(X) \leq \theta \leq U(X)) = 1 - \alpha

The value 1-\alpha is the confidence level. Common choices are 0.95 and 0.99, reflecting a desire for high certainty. The construction typically involves a point estimator \hat{\theta}, its standard error SE(\hat{\theta}), and a critical value from a relevant probability distribution. For many estimators that are asymptotically normal, the interval takes the form:

\hat{\theta} \pm (\text{Critical Value}) \times SE(\hat{\theta})
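As a minimal sketch of the generic form above, the following Python snippet builds an interval from an estimate and its standard error using a standard normal critical value. The function name normal_interval and the input numbers are illustrative assumptions, not part of the original text; only the standard library is used.

```python
from statistics import NormalDist

def normal_interval(estimate, std_error, confidence=0.95):
    """Return (lower, upper) for estimate +/- z * SE with a normal critical value."""
    alpha = 1.0 - confidence
    # Upper alpha/2 critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * std_error
    return estimate - margin, estimate + margin

# Hypothetical inputs: an estimate of 10.0 with standard error 0.5
lower, upper = normal_interval(10.0, 0.5)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Raising the confidence level widens the interval, because the critical value grows while the standard error stays the same.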

Correct interpretation

Interpretation of confidence intervals requires extreme precision to avoid common fallacies. A 95% confidence interval does not mean there is a 95% probability that the true parameter lies within the calculated interval for a specific sample. In the frequentist framework, the parameter \theta is a fixed, albeit unknown, constant. The interval [L, U] is the random entity because it depends on the stochastic nature of the sample.

The correct interpretation is a procedural one. If we were to repeat the sampling process many times under identical conditions and calculate a new interval for each sample, approximately 95% of those generated intervals would cover the true, fixed parameter. For any single calculated interval, it either contains the parameter or it does not; we simply do not know which is the case, but we trust the 95% success rate of the procedure.
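The procedural interpretation can be checked by simulation. The sketch below assumes a normal population with known \sigma, so the z-interval applies exactly. It draws many samples, builds one interval per sample, and counts how often the fixed true mean is covered. All parameter values (mu, sigma, n, reps, seed) are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist, fmean

def coverage_rate(mu=50.0, sigma=10.0, n=25, reps=10_000, confidence=0.95, seed=1):
    """Fraction of simulated z-intervals that cover the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / n ** 0.5  # known-sigma interval half-width
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        # The parameter mu is fixed; the interval is what varies across samples
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"Empirical coverage: {rate:.3f}")
```

The printed proportion should land close to 0.95. Each individual interval either covers mu or misses it.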

Classical confidence intervals

Different parameters and distributional assumptions lead to specific interval formulas.

Mean of a normal distribution

If the population variance \sigma^2 is known, we use the standard normal distribution (z):

CI = \bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}

In practice, \sigma^2 is rarely known. When we must estimate it using the sample variance s^2, we use the Student’s t-distribution with n-1 degrees of freedom:

CI = \bar{X} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}

The t-distribution has heavier tails than the normal distribution, which accounts for the additional uncertainty introduced by estimating the variance. As the sample size increases, the t-distribution converges to the normal distribution.

Proportions

For a sample proportion \hat{p} from a large sample, the Wald interval is frequently used due to its simplicity:

CI = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

However, the Wald interval is notorious for poor performance when p is near 0 or 1, or when the sample size is small. In these cases, the actual coverage probability can be much lower than the nominal level. The Wilson score interval or the Agresti-Coull “plus four” interval are often preferred. When coverage of at least the nominal level must be guaranteed, exact Clopper-Pearson intervals, based on the binomial distribution, are used. The price is that they tend to be conservative (wider than necessary).
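To make the contrast concrete, here is a small sketch comparing the Wald and Wilson score intervals on a hypothetical sample with few successes. The function names and the data (1 success in 20 trials) are assumptions chosen for illustration. The Wilson formula used is the standard score interval.

```python
from math import sqrt
from statistics import NormalDist

def _z(confidence):
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

def wald_interval(successes, n, confidence=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = _z(confidence)
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, obtained by inverting the score test."""
    z = _z(confidence)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    margin = (z / denom) * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return center - margin, center + margin

# Hypothetical data: 1 success in 20 trials
wald = wald_interval(1, 20)
wilson = wilson_interval(1, 20)
print("Wald:  ", wald)
print("Wilson:", wilson)
```

With these inputs the Wald lower bound falls below zero, which is impossible for a proportion, while the Wilson interval stays inside (0, 1).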

Variance

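For the variance \sigma^2 of a normal population, the interval is based on the chi-square distribution:

\left( \frac{(n-1)s^2}{\chi^2_{\alpha/2, n-1}}, \frac{(n-1)s^2}{\chi^2_{1-\alpha/2, n-1}} \right)

Here, as with z_{\alpha/2} and t_{\alpha/2, n-1}, the subscript denotes the upper-tail probability, so \chi^2_{\alpha/2, n-1} is the larger of the two critical values and yields the lower bound. Unlike the interval for the mean, the chi-square interval is not symmetric around the point estimate s^2, reflecting the skewness of the chi-square distribution.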
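A short worked example may help show the asymmetry. The numbers are hypothetical: n = 30 and s = 8.0, with 95% critical values for 29 degrees of freedom taken from standard chi-square tables (upper-tail convention).

```latex
% Hypothetical inputs: n = 30, s = 8.0, so (n-1)s^2 = 29 \times 64 = 1856
% Tabulated critical values: \chi^2_{0.025, 29} \approx 45.722, \chi^2_{0.975, 29} \approx 16.047
\left( \frac{1856}{45.722}, \frac{1856}{16.047} \right) \approx (40.59, 115.66)
```

The point estimate s^2 = 64 sits about 23.4 above the lower bound but about 51.7 below the upper bound.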

Regression coefficients

In the context of linear regression, the interval for a specific coefficient \beta_j is:

CI = \hat{\beta}_j \pm t_{\alpha/2, n-k-1} SE(\hat{\beta}_j)

where n is the number of observations and k+1 is the number of estimated parameters (including the intercept).

Hypothesis testing framework

Hypothesis testing is a formal, decision-theoretic procedure for evaluating claims about a population based on sample evidence.

Components

  1. Null Hypothesis (H_0): This represents the default state, typically a claim of “no effect” or “no difference.” We assume H_0 is true unless the data provide sufficiently strong evidence to the contrary.
  2. Alternative Hypothesis (H_a or H_1): This is the claim we seek to find evidence for. It can be one-sided (e.g., \mu > 0) or two-sided (e.g., \mu \neq 0).
  3. Test Statistic: A summary of the sample data (e.g., t, z, F, \chi^2) that follows a known probability distribution if the null hypothesis is true.
  4. Sampling Distribution: The probability distribution of the test statistic under the assumption that H_0 is true. This distribution allows us to determine how “unlikely” our observed result is.
  5. Rejection Region: The set of values for the test statistic that are so extreme that we decide to reject H_0. The size of this region is determined by the significance level \alpha.
  6. p-value: The probability, calculated under the assumption that H_0 is true, of obtaining a test statistic value at least as extreme as the one actually observed.
  7. Decision: If the p-value \leq \alpha, we reject the null hypothesis. Otherwise, we fail to reject it. It is important to note that we never “accept” the null hypothesis; we only fail to find sufficient evidence to reject it.
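The components above can be sketched end to end for the simplest case, a two-sided z-test of a mean with known \sigma. The function name and the numeric inputs are hypothetical and only meant to show how the test statistic, rejection region, p-value, and decision fit together.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H_0: mu = mu0 with known population sigma."""
    std_normal = NormalDist()
    # Test statistic: standardized distance between the sample mean and mu0
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # p-value: probability under H_0 of a statistic at least this extreme
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    # Rejection region: |z| beyond the upper alpha/2 critical value
    critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return {"z": z, "p_value": p_value, "critical": critical,
            "reject": p_value <= alpha}

# Hypothetical inputs: sample mean 103 from n = 100, testing mu0 = 100 with sigma = 15
result = two_sided_z_test(103.0, 100.0, 15.0, 100)
print(result)
```

Comparing the p-value with \alpha and comparing |z| with the critical value are two views of the same decision rule.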

Type I and Type II errors

The decision-making process in hypothesis testing is inherently prone to two types of errors.

  • Type I Error (\alpha): Rejecting the null hypothesis when it is actually true. This is often called a “false positive.” We control this error rate by choosing a small value for \alpha, such as 0.05.
  • Type II Error (\beta): Failing to reject the null hypothesis when it is actually false. This is a “false negative.”

The power of a statistical test is defined as 1-\beta, which is the probability of correctly rejecting a false null hypothesis. A high-power test is one that is very likely to detect an effect if it truly exists. Power depends on several factors:

  1. Effect Size: Larger effects are easier to detect.
  2. Sample Size: Larger samples provide more information and increase power.
  3. Significance Level (\alpha): A more lenient \alpha (e.g., 0.10) increases power but also increases the risk of a Type I error.
  4. Population Variability: Lower noise in the data makes the signal easier to see.
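How these factors interact can be illustrated with a sketch of the power of a two-sided z-test. It assumes a normal population with known \sigma, which keeps the formula in closed form. The function name z_test_power and all numeric inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided z-test when the true mean differs from mu0 by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect / (sigma / sqrt(n))  # true effect measured in standard errors
    # Probability that the statistic lands in either tail of the rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical scenario: true effect of 5 units, sigma = 15
for n in (20, 71, 150):
    print(n, round(z_test_power(5.0, 15.0, n), 3))
```

When the effect is zero, the function returns \alpha, since every rejection is then a Type I error. Power rises with the sample size, the effect size, and \alpha, and falls as \sigma grows.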

The big tests

z-test and t-test

The z-test is used for comparing means when the population variance is known or when the sample size is large enough for the Central Limit Theorem to make the sampling distribution of the mean approximately normal. The t-test is the standard for comparing means when the population variance must be estimated from the sample. It includes the one-sample t-test, the two-sample (independent) t-test, and the paired t-test.

Chi-square tests

The chi-square goodness-of-fit test checks if a sample distribution matches a theoretical one. The chi-square test for independence evaluates whether two categorical variables are related in a contingency table by comparing observed frequencies to expected frequencies under the assumption of independence.

F-tests

The F-test is primarily used to compare variances or to evaluate multiple linear restrictions in a regression model. In Analysis of Variance (ANOVA), the F-test determines if there are any statistically significant differences between the means of three or more independent groups.

Multiple comparisons

In the modern era of “big data,” we often perform many hypothesis tests simultaneously. However, if we conduct 100 independent tests each at a 5% significance level, we expect to reject the null hypothesis 5 times purely by chance, even if every null hypothesis is true. This is the problem of multiple comparisons.
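The arithmetic behind this can be sketched in a few lines, under the same assumption of independent tests with every null hypothesis true. The function name is illustrative.

```python
def prob_at_least_one_false_positive(m, alpha=0.05):
    """P(at least one rejection) across m independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** m

m, alpha = 100, 0.05
expected_false_positives = m * alpha
fwer = prob_at_least_one_false_positive(m, alpha)
print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {fwer:.4f}")
```

With 100 tests, at least one false positive is nearly certain, which is why an unadjusted 5% threshold is not meaningful for the family of tests as a whole.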

Adjustments

To maintain a controlled overall error rate, we must adjust our p-values or significance thresholds.

  • Bonferroni Correction: This is the simplest and most conservative method. We divide \alpha by the number of tests m and reject only those hypotheses whose p-value is at most \alpha/m. While it strictly controls the Family-Wise Error Rate (FWER), it can lead to very low power when m is large.
  • Benjamini-Hochberg (BH) Procedure: This method controls the False Discovery Rate (FDR), which is the expected proportion of false positives among all rejected null hypotheses. It is typically more powerful than Bonferroni and is widely used in genomics and large-scale screening.
  • Romano-Wolf Procedure: A more sophisticated resampling-based approach that accounts for the potential dependence between different tests, providing more accurate control of the FWER in complex settings.
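The first two adjustments are simple enough to sketch directly. The snippet below is an illustrative implementation, not a replacement for a vetted library routine. The function names and the list of p-values are made up for the example. The BH step-up rule rejects the k smallest p-values, where k is the largest rank with p_(k) \leq (k/m) q.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_0i when p_i <= alpha / m (controls the FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """BH step-up procedure: reject the k smallest p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject

# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni rejections:", sum(bonf))
print("BH rejections:        ", sum(bh))
```

On this input Bonferroni rejects fewer hypotheses than BH, reflecting the trade-off between strict FWER control and power.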

The trouble with p-values

The p-value is perhaps the most misunderstood and misused concept in statistics. In 2016, the American Statistical Association (ASA) felt compelled to issue a formal statement on p-values to clarify their meaning and limitations.

First, a p-value is not the probability that the null hypothesis is true. It is a conditional probability about the data, not the hypothesis. Second, “statistically significant” does not mean “practically important.” With a large enough sample, even a trivial effect will produce a tiny p-value. Third, a p-value greater than 0.05 does not prove that there is no effect; it may simply mean the study had insufficient power to detect it.

To improve scientific transparency, many journals now encourage or require the reporting of effect sizes and confidence intervals alongside p-values. Pre-registration of study designs and analysis plans is another powerful remedy against “p-hacking,” the practice of trying multiple statistical models until a significant result emerges.

Worked example: One-sample t-test

Suppose a researcher wants to test whether a new teaching method changes test scores. The historical average score is 70. A sample of 30 students using the new method yields a mean score of 74.5 with a sample standard deviation of 8.0. Although the researcher hopes for an improvement, a two-sided alternative is used as the conservative default.

H_0: \mu = 70
H_a: \mu \neq 70

The test statistic is:

t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{74.5 - 70}{8.0/\sqrt{30}} \approx \frac{4.5}{1.46} \approx 3.08

With df = 29 and a two-sided \alpha = 0.05, the critical t-values are approximately \pm 2.045. Since our calculated t = 3.08 is greater than 2.045, we reject the null hypothesis. The p-value is approximately 0.0045, suggesting that a result this extreme is very unlikely to occur by chance if the new method had no effect.
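The arithmetic can be checked from the summary statistics alone. The sketch below uses only the Python standard library, so the critical value 2.045 is hard-coded from the text rather than computed. The variable names are illustrative. It also computes the matching 95% confidence interval.

```python
from math import sqrt

# Summary statistics from the worked example
n, sample_mean, sample_sd, mu0 = 30, 74.5, 8.0, 70.0
t_crit = 2.045  # two-sided 5% critical value for df = 29, taken from the text

std_error = sample_sd / sqrt(n)
t_stat = (sample_mean - mu0) / std_error
ci = (sample_mean - t_crit * std_error, sample_mean + t_crit * std_error)
reject = abs(t_stat) > t_crit

print(f"SE = {std_error:.3f}, t = {t_stat:.3f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print("Reject H_0:", reject)
```

The interval excludes the null value of 70, which agrees with the rejection of H_0 and illustrates the duality between two-sided tests and confidence intervals.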

Excel

A1:A30: [Data of test scores, one per cell]
B1: =AVERAGE(A1:A30)
B2: =STDEV.S(A1:A30)
B3: =(B1 - 70) / (B2 / SQRT(30))  (Calculated t-stat)
B4: =T.DIST.2T(ABS(B3), 29)      (Two-sided p-value)
B5: =B1 - T.INV.2T(0.05, 29)*(B2/SQRT(30)) (Lower Bound of 95% CI)
B6: =B1 + T.INV.2T(0.05, 29)*(B2/SQRT(30)) (Upper Bound of 95% CI)

Stata

Stata’s ttest command is the standard way to perform this analysis.

* Test if the mean of score is 70
ttest score == 70

* Display the confidence interval explicitly
mean score
ci means score

R

The t.test function in R is versatile and provides both the test result and the confidence interval.

# One-sample t-test; `scores` is assumed to be a numeric vector of the 30 test scores
res <- t.test(scores, mu = 70)
print(res)

# Accessing specific parts of the output
res$p.value
res$conf.int

Julia

In Julia, the HypothesisTests.jl package provides a clean interface in which each test is represented by its own type.

using HypothesisTests

# Example data (illustrative; not the sample summarized in the worked example)
scores = [72, 75, 68, 80, 74, 71, 73, 76, 70, 78,
          74, 72, 75, 69, 81, 73, 70, 77, 74, 76,
          72, 75, 68, 80, 74, 71, 73, 76, 70, 78]

# Perform one-sample t-test against null mean of 70
test_results = OneSampleTTest(scores, 70)

# Display full results
println(test_results)

# Extract confidence interval
println("95% CI: ", confint(test_results))

# Extract p-value
println("p-value: ", pvalue(test_results))

Common traps

A major trap is interpreting a p-value as the probability that the null hypothesis is true. A statement of that kind is a posterior probability, which requires a Bayesian analysis with a prior; the p-value does not provide it. In the frequentist world, hypotheses do not have probabilities; they are either true or false.

Another mistake is using two-sided tests when a one-sided test is theoretically required, or vice versa. While two-sided tests are the conservative default, one-sided tests are appropriate when only one direction of effect is physically possible or theoretically interesting. However, switching from two-sided to one-sided after seeing the data to achieve “significance” is a serious breach of statistical ethics.

Researchers often fail to correct for multiple comparisons, leading to a high rate of false discoveries in exploratory analyses. Finally, using the t-test on small samples that are heavily skewed or have extremely heavy tails can be misleading. While the t-test is relatively robust, extreme violations of the normality assumption in small samples can distort the results. In such cases, non-parametric alternatives should be considered, such as the Wilcoxon signed-rank test (which still assumes a symmetric distribution) or the sign test.

Reporting checklist

  • State the null and alternative hypotheses explicitly in the context of the research question.
  • Report the value of the test statistic and the associated degrees of freedom.
  • Provide the exact p-value rather than just a threshold like p < 0.05.
  • Include a 95% confidence interval for the effect size to show the range of plausible values.
  • Mention any adjustments made for multiple comparisons (e.g., “p-values were adjusted using the Benjamini-Hochberg procedure”).
  • Discuss the practical significance of the findings. Is the observed difference large enough to matter in the real world, regardless of its statistical significance?
  • State the assumptions made (e.g., normality, independence) and describe how they were verified.

References

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury Press.
  • Lehmann, E. L., & Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129-133.
  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.