Nonparametrics and the bootstrap

When parametric assumptions fail

A parametric model commits to a finite-dimensional family of distributions, then estimates which member of that family generated the data. This works beautifully when the assumption is right. It can mislead badly when it is wrong. Heavy tails violate normality. Multimodal distributions violate unimodality. The standard error of a complicated estimator may have no closed form even when the data are well-behaved.

Nonparametric statistics is the alternative. We avoid restricting the distribution to a finite-parameter family and let the data speak more directly. The price is a slower rate of convergence (typically n^{-2/5} rather than n^{-1/2}) and weaker conclusions about the parameter of interest. The payoff is robustness to assumptions that would otherwise be load-bearing.

Three families of techniques cover most applied needs. The bootstrap handles complicated estimators by resampling. Kernel density estimation gives a smooth picture of a distribution without assuming its shape. Rank-based tests replace numerical values with their ranks, removing scale assumptions altogether. We cover all three here, plus a brief introduction to splines as a bridge to the smoothing methods of later chapters.

The empirical distribution function

The empirical distribution function (ECDF) is the foundation of nonparametric inference. Given a sample X_1, \ldots, X_n, define \hat F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \leq x\} This is the proportion of observations no larger than x. It is a step function with jumps of size 1/n at each observed value.

For any fixed x, \hat F_n(x) is an average of independent Bernoulli random variables, so its expectation is the true F(x) and its variance is F(x)[1 - F(x)] / n. By the law of large numbers, \hat F_n(x) \to F(x) in probability. The Glivenko-Cantelli theorem strengthens this dramatically: convergence is uniform over x, \sup_{x \in \mathbb{R}} \big| \hat F_n(x) - F(x) \big| \xrightarrow{\text{a.s.}} 0 The Dvoretzky-Kiefer-Wolfowitz inequality even gives an explicit nonasymptotic bound: P\left( \sup_x \big| \hat F_n(x) - F(x) \big| > \varepsilon \right) \leq 2 e^{-2 n \varepsilon^2} The practical implication is that we can treat \hat F_n as a good stand-in for the unknown F in modest sample sizes. The bootstrap exploits exactly this fact.

The bootstrap

Bradley Efron’s 1979 paper introduced the bootstrap as a way to estimate the sampling distribution of nearly any statistic without making distributional assumptions. The idea is simple. If we knew the true F, we could simulate from it B times, compute the statistic of interest on each simulation, and use the distribution of those B values to summarize sampling variation. We do not know F. We do know \hat F_n, and Glivenko-Cantelli says it is close. So we sample from \hat F_n instead.

Sampling from \hat F_n means drawing n values with replacement from the observed sample. Each draw is a “bootstrap sample.” For each bootstrap sample X_1^*, \ldots, X_n^*, compute the statistic \hat\theta^* = T(X_1^*, \ldots, X_n^*). Repeat B times to get \hat\theta^{(1)}, \ldots, \hat\theta^{(B)}. The empirical distribution of these B values approximates the sampling distribution of \hat\theta.

Three flavors of bootstrap confidence interval

Percentile interval. Take the \alpha/2 and 1 - \alpha/2 quantiles of \{\hat\theta^{(1)}, \ldots, \hat\theta^{(B)}\} directly. Simple, intuitive, and correct in the limit. Can be biased in finite samples when the statistic is skewed.

Basic (pivotal) interval. Use the distribution of \hat\theta^* - \hat\theta as an approximation to the distribution of \hat\theta - \theta: \left[ 2 \hat\theta - q_{1 - \alpha/2}, \ 2 \hat\theta - q_{\alpha/2} \right] where q_p is the p-quantile of the bootstrap distribution. Often performs better than the percentile interval when the statistic has a noticeable bias.

BCa (bias-corrected and accelerated) interval. Adjusts the percentile method using two corrections: a bias correction \hat z_0 estimated from the bootstrap distribution and an acceleration \hat a estimated by jackknife. The resulting interval has better coverage than the percentile or basic interval for skewed statistics, at the cost of more computation. BCa is the default in many software packages (R’s boot::boot.ci, for instance).

For most well-behaved problems with n \geq 50 and B \geq 1000, all three intervals agree to two significant figures. Differences emerge with small samples, skewed statistics, or both.

Wild and cluster bootstrap

The nonparametric bootstrap described above assumes the observations are independent and identically distributed. Two extensions cover common departures.

Wild bootstrap preserves the regressor matrix and resamples residuals with random sign flips. For a fitted regression \hat y_i = x_i^\top \hat\beta with residuals \hat\varepsilon_i, a wild bootstrap sample is y_i^* = x_i^\top \hat\beta + v_i \hat\varepsilon_i where v_i is a random variable with mean 0 and variance 1, often the two-point Rademacher distribution (v_i = \pm 1 with equal probability) or the Mammen two-point distribution. This handles heteroskedasticity without making strong assumptions about its form.

Cluster bootstrap resamples whole clusters rather than individual observations, preserving within-cluster dependence. If you have G clusters (firms, schools, countries), draw G clusters with replacement and include all observations within each drawn cluster. This is the right tool when the unit of observation and the unit of independence differ.

When the number of clusters is small (say G < 30), a wild cluster bootstrap is often preferred over the conventional cluster bootstrap. The wild cluster bootstrap flips the signs of residuals at the cluster level rather than resampling clusters, and it has better small-G coverage. Cameron, Gelbach, and Miller (2008) is the standard reference.

Permutation tests

Permutation tests are conceptually older than the bootstrap (Fisher, 1935) and answer a slightly different question. Where the bootstrap asks “how variable is my estimator across replicate samples from the same population,” a permutation test asks “how likely is this association under the null hypothesis of no relationship?”

The setup is simplest for two-sample comparisons. Combine the two samples, randomly reassign the group labels, and compute the test statistic on the reshuffled data. Repeat B times. The p-value is the fraction of permutations whose statistic is at least as extreme as the observed one.

Permutation tests are exact in finite samples when the null hypothesis is exchangeability of the labels. They make no distributional assumption beyond that. The cost is computational and conceptual: the null must be tight enough that exchangeability holds (a difference in means with equal variances, for example, but not a difference in means with unequal variances).

Kernel density estimation

The histogram is the simplest nonparametric density estimator. It is also unsatisfying: it depends on the choice of bin origin, it is not smooth, and its variance and bias are hard to tune separately. Kernel density estimation (KDE) addresses all three problems.

A KDE places a smooth kernel K at each observation and averages them: \hat f_h(x) = \frac{1}{n h} \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) K is a symmetric probability density (Gaussian, Epanechnikov, triangular). h > 0 is the bandwidth. The choice of kernel matters little. The choice of bandwidth matters a great deal.

Bandwidth selection

A small h produces a wiggly estimate with low bias and high variance: we see every bump in the data, including noise. A large h produces a smooth estimate with low variance and high bias: we miss real features. The optimal bandwidth balances these.

Rule of thumb (Silverman). For Gaussian-shaped data, h^* = 0.9 \min(s, \text{IQR}/1.34) \cdot n^{-1/5} where s is the sample standard deviation. Fast and reasonable as a starting point. Tends to oversmooth multimodal distributions.

Cross-validation. Choose h to minimize the leave-one-out integrated squared error \text{ISE}(h) = \int [\hat f_h(x) - f(x)]^2 dx estimated from the data. Better behavior for nonstandard distributions but computationally more expensive.

Plug-in (Sheather-Jones). Estimate the unknown functionals in the asymptotic optimal bandwidth formula from the data itself. The current best practice for general use. Available in R as bw.SJ().

The optimal bandwidth has rate n^{-1/5}, which is slower than the parametric n^{-1/2}. This is the cost of nonparametric flexibility.

Rank-based tests

Sometimes the entire numerical scale of the data is suspect. Survey responses on a 5-point scale, pain ratings, ordinal grades: the numbers convey order but not interval distance. Rank-based tests use only the ordinal information and are exact under fairly weak conditions.

Wilcoxon signed-rank test. A one-sample or paired-sample test of H_0: \text{median} = \theta_0. Replace each observation (or paired difference) with its signed rank, sum the positive ranks, and compare to the null distribution of that sum.

Wilcoxon-Mann-Whitney (rank-sum) test. A two-sample test of stochastic equality between two distributions. Rank all n_1 + n_2 observations together, sum the ranks in group 1, and compare to the null distribution. The statistic is equivalent to the Mann-Whitney U statistic, which counts the number of pairs (X_i, Y_j) with X_i < Y_j.

Kruskal-Wallis test. The generalization to more than two groups. A rank-based one-way ANOVA. Test statistic is a function of the average ranks within each group, asymptotically \chi^2_{k-1} under the null.

Rank tests typically have very high efficiency relative to their parametric counterparts: the asymptotic relative efficiency of the Mann-Whitney to the two-sample t-test is 3/\pi \approx 0.955 under normality, and can exceed 1 (sometimes by a lot) for heavy-tailed distributions. Using a rank test when the data are roughly normal costs almost nothing. Using a parametric test when the data are heavy-tailed can be catastrophic.

A caveat: rank tests assume continuous distributions, which guarantees no ties. Software handles ties through midranks, which works in practice but slightly changes the small-sample distribution of the test statistic.

Splines

When the relationship between X and Y is smooth but not linear, splines provide a flexible nonparametric regression tool. The idea is to fit piecewise polynomials joined at knots, with continuity constraints at the knots.

Regression splines. Choose K knots \xi_1 < \cdots < \xi_K in the range of X. Fit a polynomial of degree d (typically d = 3, “cubic”) in each segment, requiring continuity of the function and its first d - 1 derivatives at the knots. The resulting basis has K + d + 1 terms. With cubic splines and four knots, that is eight basis functions.

Smoothing splines. Rather than choosing K, place a knot at every distinct x_i and penalize the integrated squared second derivative: \min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2 dx The penalty \lambda controls the trade-off between fit and smoothness. The solution is a natural cubic spline.

Natural cubic splines. Cubic splines with the additional constraint that the function is linear beyond the boundary knots. This reduces the effective number of parameters and improves behavior in the tails. The default choice for smoothing splines.

We return to splines at length in the regression-diagnostics chapter, where they show up as flexible alternatives to polynomial expansions for nonlinear relationships. Here we mention them so the bridge to the next chapter is in place.

Worked example: bootstrap CI for the median

We have a sample of n = 100 observations from a heavy-tailed distribution (say, a t distribution with 3 degrees of freedom, multiplied by 10 to make the numbers easier to read). The sample median is \hat\theta = 0.42, and we want a 95 percent confidence interval that does not assume normality.

The sample median’s asymptotic standard error involves the density at the true median, f(\theta), which is unknown and hard to estimate well in small samples. The bootstrap sidesteps this entirely.

Excel

Excel handles bootstrap-style resampling with care. Use INDEX(data, RANDBETWEEN(1, n)) to draw a single observation with replacement, fill a column of n such draws to form one bootstrap sample, and MEDIAN() that column. Repeat by manually copying the structure or, better, by laying out a table where each row is one bootstrap replicate.

A1:A100   the original sample x_1, ..., x_100

C1:C100   bootstrap sample 1
  C1 =INDEX($A$1:$A$100, RANDBETWEEN(1, 100))
  C1 filled down to C100
C101 =MEDIAN(C1:C100)

D1:D100, E1:E100, ... up to ALL1:ALL100  (1000 columns is unwieldy)

Better: lay out replicates in rows.
F1:FF1   replicate 1, 100 cells
F2:FF2   replicate 2, ... 1000 rows

In a separate column, compute =MEDIAN(F1:FF1) for each row.

Then in one cell:
=PERCENTILE.INC(median_col, 0.025)  lower CI
=PERCENTILE.INC(median_col, 0.975)  upper CI

A practical Excel trick is to enable iterative calculation, draw 1000 bootstrap medians once, and copy-paste-values to freeze them before computing the percentiles. RAND-based formulas recalculate on every keystroke otherwise. This works but does not scale well past B = 2000 or so. For serious work, do the bootstrap in R, Stata, or Julia.

Stata

* simple percentile bootstrap CI for the median, B = 1000
bootstrap r(p50), reps(1000) seed(42): summarize x, detail

* equivalent using bsample
program define boot_median, rclass
    bsample
    summarize x, detail
    return scalar med = r(p50)
end
simulate med = r(med), reps(1000) seed(42): boot_median

* BCa interval
bootstrap r(p50), reps(1000) seed(42) bca: summarize x, detail
estat bootstrap, all

Stata’s built-in bootstrap prefix is convenient but inflexible for complex estimators. For anything beyond r() and e() returns, write a small program and call simulate or bsample.

R

library(boot)

# define statistic as a function of (data, indices)
med_fn <- function(x, i) median(x[i])

set.seed(42)
boot_obj <- boot(data = x, statistic = med_fn, R = 1000)

# percentile, basic, and BCa intervals
boot.ci(boot_obj, conf = 0.95, type = c("perc", "basic", "bca"))

# manual percentile interval, for understanding
boot_meds <- replicate(1000, median(sample(x, replace = TRUE)))
quantile(boot_meds, c(0.025, 0.975))

The boot package is the workhorse for nonparametric, parametric, and stratified bootstraps in R. For regression bootstraps with complex residual structures, car::Boot and rsample::bootstraps are friendlier interfaces.

Julia

using Bootstrap, Statistics, Random

Random.seed!(42)
bs = bootstrap(median, x, BasicSampling(1000))

# percentile CI
ci_perc = confint(bs, PercentileConfInt(0.95))

# BCa CI
ci_bca = confint(bs, BCaConfInt(0.95))

Bootstrap.jl offers basic, percentile, normal, and BCa intervals out of the box. For cluster or wild bootstrap, the package has dedicated sampling schemes (MaximumEntropySampling, etc.) and you can also roll your own using sample from StatsBase.

Common traps

The bootstrap fails for non-smooth functionals. The sample maximum is the classic counterexample. If X_1, \ldots, X_n are uniform on [0, \theta], the maximum likelihood estimator of \theta is the sample max. Its bootstrap distribution puts probability 1 - (1 - 1/n)^n \approx 1 - 1/e \approx 0.63 on exactly \hat\theta itself, because that is the chance that the original maximum appears in the resample. The bootstrap distribution is degenerate in a way the true sampling distribution is not. More generally, the bootstrap is reliable for smooth functions of sample moments and unreliable at boundary parameters and for extreme order statistics.

Confusing percentile and basic intervals. They are not the same. The percentile interval is symmetric around the median of the bootstrap distribution; the basic interval is symmetric around the original estimate. When the bootstrap distribution is skewed, the two can differ noticeably. Read the documentation of whatever package you are using and know which one it gives by default.

Treating bootstrap standard errors as a free upgrade. If you bootstrap a parameter that has a simple closed-form variance, the bootstrap standard error will be similar but not identical to the analytic one. Reporting the bootstrap version without acknowledging this can confuse readers. Bootstrap when the analytic answer is hard, not for ornamentation.

Bandwidth choice in KDE. Different software defaults can produce visibly different density estimates from the same data. The Silverman rule oversmooths bimodal distributions in particular. When the conclusion of a paper depends on the visual appearance of a density estimate, report the bandwidth and the rule used to pick it, and show robustness to a few alternative bandwidths.

Ties in rank tests. Software handles ties by averaging the ranks of tied observations (midranks). This is reasonable but it means the test statistic no longer has its exact null distribution. For data with many ties, an asymptotic approximation is unavoidable, and the small-sample p-value is approximate rather than exact. If exactness matters and ties are heavy, fall back to a permutation test.

Heavy-tailed distributions and the bootstrap. When tails are extremely heavy (Cauchy, t with one or two degrees of freedom), even the bootstrap can struggle. The variance of the sample mean is infinite under Cauchy, and bootstrapping the mean of a Cauchy sample gives an interval that does not have its nominal coverage. For such cases, target a quantile (the median, say) rather than the mean.

Reporting checklist

State the resampling unit. Individual observations? Residuals? Clusters?
Report B, the number of bootstrap replicates. For confidence intervals, B = 1000 is a minimum; for p-values close to a threshold, B = 10{,}000 or more.
State the interval method (percentile, basic, BCa, studentized). The choice can change the answer.
Set and report a random seed for reproducibility.
For KDE, report the kernel and bandwidth selection rule. Show robustness to alternative bandwidths if the result is visually borderline.
For rank tests, state how ties were handled and whether the p-value is exact or asymptotic.
For permutation tests, define the exchangeability assumption being tested and state the number of permutations.

References

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC. The standard reference.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. More mathematically rigorous, with extensive applied examples.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer. Compact, formal coverage of nonparametric methods including density estimation, regression, and the bootstrap.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7(1), 1 to 26.
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414 to 427.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.