Sampling, LLN, and the Central Limit Theorem

Why the CLT matters for applied work

The Central Limit Theorem explains why normal approximations appear throughout applied statistics. Sample means, regression coefficients, differences in means, maximum likelihood estimators, and many other estimators are functions of many observations. Under regularity conditions, their sampling distributions become close to normal after centering and scaling. This is why a normal curve can support inference even when the raw data are binary, skewed, or discrete.

The theorem does not say that the data are normal. It says that averages of independent, similarly distributed observations often behave approximately normally when the sample is large enough and the variance is finite. That difference matters. Household income is not normal. Hospital stays are not normal. Counts of rare events are not normal. Still, averages and many estimators built from them can have sampling distributions that are close enough to normal for practical inference.

The Central Limit Theorem also gives a clean separation between population variability and sampling variability. If X_1,\ldots,X_n are independent with common mean \mu and variance \sigma^2, the sample mean has variance \sigma^2/n. More data reduce sampling variation at the rate 1/n in variance or 1/\sqrt{n} in standard error. That slow square-root rate is one of the hard facts of empirical work. To cut a standard error in half, the sample size must be quadrupled.
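
The square-root rate can be checked with a few lines of arithmetic. The Python sketch below is only an illustration; the function name standard_error and the value sigma = 10 are assumptions chosen for the example.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent draws with standard deviation sigma."""
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed population standard deviation, for illustration only
for n in (100, 400, 1600):
    print(n, standard_error(sigma, n))
# Each fourfold increase in n halves the standard error: 1.0, 0.5, 0.25
```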

The theorem is powerful, but it is not magic. Small samples, dependence, severe skewness, infinite variance, and sampling designs with unequal weights can all make the normal approximation poor. Applied use of the theorem always requires judgment about the data-generating setting.

Random samples and i.i.d.

A random sample is usually written

X_1,\ldots,X_n.

The standard starting assumption is that the observations are independent and identically distributed, abbreviated i.i.d. Independence means that knowing one observation does not change the probability distribution of another. Identically distributed means each observation has the same distribution.

The i.i.d. assumption is a useful benchmark, but it is often an approximation. Students within the same classroom, patients within the same hospital, and repeated measurements from the same person are usually dependent. A sample drawn with unequal selection probabilities is not identically distributed unless the sampling design is built into the analysis. Time series often violate independence because today’s value is related to yesterday’s value.

For i.i.d. observations with common mean \mu and variance \sigma^2, the sample mean is

\bar{X}_n=\frac{1}{n}\sum_{i=1}^{n}X_i.

Linearity of expectation gives

E[\bar{X}_n] = \frac{1}{n}\sum_{i=1}^{n}E[X_i] = \mu.

Independence gives

\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{\sigma^2}{n}.

If observations are correlated, covariance terms enter:

\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2} \left[ \sum_{i=1}^{n}\operatorname{Var}(X_i) +2\sum_{i<j}\operatorname{Cov}(X_i,X_j) \right].

Positive dependence reduces the amount of independent information in the sample. This is why clustered and serially correlated data need methods that account for dependence.
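
One way to see the cost of positive dependence is the special case in which every pair of observations has the same correlation \rho. The covariance sum then contains n(n-1)/2 equal terms, and the formula above reduces to \operatorname{Var}(\bar{X}_n) = (\sigma^2/n)[1+(n-1)\rho]. The Python sketch below evaluates that special case. The equal-correlation structure, the function names, and the numbers are assumptions made for illustration.

```python
def var_mean_equicorrelated(sigma2: float, n: int, rho: float) -> float:
    """Variance of the sample mean when all pairs share correlation rho."""
    return sigma2 / n * (1.0 + (n - 1) * rho)


def equivalent_independent_n(n: int, rho: float) -> float:
    """Number of independent observations that would give the same variance."""
    return n / (1.0 + (n - 1) * rho)


sigma2, n = 4.0, 25  # hypothetical variance and sample size
print(var_mean_equicorrelated(sigma2, n, 0.0))  # independent case: sigma2 / n
print(var_mean_equicorrelated(sigma2, n, 0.2))  # positive correlation inflates the variance
print(equivalent_independent_n(n, 0.2))
```

With 25 observations and a pairwise correlation of 0.2, the mean is about as precise as a mean of roughly four independent observations.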

Sampling distribution of the sample mean

The sampling distribution of \bar{X}_n is the distribution that would arise if we repeatedly drew samples of size n from the same population and computed a sample mean each time. We rarely get to see this distribution directly, but it is the object behind standard errors and confidence intervals.

For i.i.d. data with finite mean and variance,

E[\bar{X}_n]=\mu, \qquad \operatorname{Var}(\bar{X}_n)=\frac{\sigma^2}{n}.

The shape of the sampling distribution depends on both the population distribution and the sample size. If the population distribution is normal, then \bar{X}_n is exactly normal for every n:

\bar{X}_n \sim N\left(\mu,\frac{\sigma^2}{n}\right).

If the population distribution is not normal, the exact distribution of \bar{X}_n may be complicated. For a binary variable, n\bar{X}_n is binomial. For an exponential variable, sums have gamma distributions. For strongly skewed variables, small-sample means may remain skewed.

The standard error of the sample mean is

\operatorname{SE}(\bar{X}_n)=\frac{\sigma}{\sqrt{n}}.

When \sigma is unknown, it is usually estimated by the sample standard deviation s:

\widehat{\operatorname{SE}}(\bar{X}_n)=\frac{s}{\sqrt{n}}.

This replacement is harmless asymptotically under regularity conditions, but it is the reason small-sample normal inference becomes t inference when sampling from a normal population with unknown variance.
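
As a small illustration of the estimated standard error, the Python sketch below computes s/\sqrt{n} from a single sample. The data values and the function name estimated_se are hypothetical.

```python
import math
import statistics


def estimated_se(sample: list) -> float:
    """Estimated standard error of the mean: s / sqrt(n), with s using the n - 1 divisor."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample of eight measurements
sample = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.6]

xbar = statistics.mean(sample)
se_hat = estimated_se(sample)
print(f"n = {len(sample)}, mean = {xbar:.3f}, estimated SE = {se_hat:.3f}")
```

With only eight observations, a t reference distribution with n-1 degrees of freedom would be the usual choice if the data are plausibly normal.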

Law of Large Numbers

The Law of Large Numbers says that sample averages settle near their population mean as sample size grows.

The Weak Law of Large Numbers states that if X_1,X_2,\ldots are i.i.d. with E[X_i]=\mu and finite variance, then for every \epsilon>0,

P(|\bar{X}_n-\mu|>\epsilon) \to 0 \qquad \text{as } n \to \infty.

This is convergence in probability, written

\bar{X}_n \xrightarrow{p} \mu.

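It means that the probability of being more than any fixed distance \epsilon from \mu goes to zero. With finite variance the argument is short: Chebyshev's inequality gives P(|\bar{X}_n-\mu|>\epsilon) \le \sigma^2/(n\epsilon^2), which tends to zero as n grows. A finite mean alone is enough for the conclusion, but the proof is longer.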

The Strong Law of Large Numbers states that if X_1,X_2,\ldots are i.i.d. with finite mean E[X_i]=\mu, then

P\left(\lim_{n \to \infty}\bar{X}_n=\mu\right)=1.

This is almost sure convergence, written

\bar{X}_n \xrightarrow{a.s.} \mu.

Almost sure convergence is stronger than convergence in probability. A useful image is to imagine one infinite sequence of observations. The strong law says that for almost every such sequence, the running average eventually settles to the mean. The weak law says that for any large fixed n, most samples of size n have sample means close to the mean.

The law does not say that every sample mean gets close quickly. It says what happens in the limit. Heavy tails and strong skewness can make finite-sample averages noisy for a long time.
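
A running average makes the law concrete. The Python sketch below follows one sequence of exponential draws and records the running mean at a few checkpoints. The rate, the seed, and the checkpoint values are assumptions chosen for illustration.

```python
import random

random.seed(20260512)

lam = 2.0            # rate, so the population mean is 1 / lam = 0.5
mu = 1.0 / lam
checkpoints = (10, 100, 1_000, 10_000, 100_000)

total = 0.0
running_means = {}
for i in range(1, checkpoints[-1] + 1):
    total += random.expovariate(lam)   # one Exponential(rate = lam) draw
    if i in checkpoints:
        running_means[i] = total / i

for n, m in running_means.items():
    print(f"n = {n:>6}: running mean = {m:.4f}, error = {m - mu:+.4f}")
```

The errors at small checkpoints can be sizable and need not shrink monotonically. The law only describes the long-run behavior.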

The Central Limit Theorem

The classical i.i.d. Central Limit Theorem states that if X_1,\ldots,X_n are i.i.d. with mean \mu and finite, positive variance \sigma^2, then

\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma} \xrightarrow{d} N(0,1).

Equivalently,

\bar{X}_n \approx N\left(\mu,\frac{\sigma^2}{n}\right)

for large n.

The notation \xrightarrow{d} means convergence in distribution. It says that the CDFs of the standardized statistics approach the standard normal CDF at continuity points.

A sketch of the logic uses moment generating functions or characteristic functions. The sample mean is a scaled sum. After centering, the sum has mean zero. After dividing by \sqrt{n}, the variance stays fixed rather than growing with n. For small values around zero, the log of the characteristic function is controlled by the first two moments. Higher-order terms shrink as n grows. What remains is the characteristic function of a normal distribution.

Finite variance is not a decorative condition. If X_i has no finite variance, the usual \sqrt{n} scaling may fail and the limiting distribution may not be normal. Cauchy data are the clearest warning: the average of i.i.d. Cauchy variables has the same Cauchy distribution as each observation.
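
The Cauchy warning is easy to see in simulation. The Python sketch below draws standard Cauchy values by the inverse CDF method and measures the spread of simulated sample means with the interquartile range, since the variance does not exist. The function names, seed, and replication count are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(20260512)


def cauchy_draw() -> float:
    """One standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))


def iqr_of_means(n: int, reps: int = 2000) -> float:
    """Interquartile range of `reps` simulated sample means, each from n draws."""
    means = [sum(cauchy_draw() for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (1, 10, 100):
    print(n, round(iqr_of_means(n), 2))
```

The standard Cauchy distribution has quartiles at -1 and 1, so its interquartile range is 2. For finite-variance data the spread of the mean would shrink by a factor of 10 between n=1 and n=100. Here it should stay near 2 at every n.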

The theorem also requires enough independence or weak dependence. Versions of the CLT exist for many dependent settings, but they add conditions that control the dependence. Applied analysts should not cite the i.i.d. CLT for clustered, networked, or serial data without adjustment.

Berry-Esseen rate

The Berry-Esseen theorem gives a finite-sample bound on how far the standardized sample mean can be from the standard normal CDF. One common i.i.d. version says that if E|X_i-\mu|^3<\infty, then

\sup_x \left| P\left(\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\le x\right)-\Phi(x) \right| \le \frac{C E|X_i-\mu|^3}{\sigma^3\sqrt{n}},

where C is a universal constant. The rate is 1/\sqrt{n}, but the constant depends on the standardized third absolute moment. Skewed or heavy-tailed data can need much larger samples before the normal approximation is accurate.
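
To get a feel for the bound, the Python sketch below evaluates it for Bernoulli(p) data, where the third absolute central moment has a closed form. The value C = 0.4748 is an assumption brought in from the published literature on the constant, not something stated in this text, and the function name is hypothetical.

```python
import math

C_ASSUMED = 0.4748  # assumed upper bound on the universal constant; not from this text


def berry_esseen_bound_bernoulli(p: float, n: int, c: float = C_ASSUMED) -> float:
    """Berry-Esseen bound for the standardized mean of n Bernoulli(p) draws."""
    sigma = math.sqrt(p * (1 - p))
    # E|X - p|^3 = p (1 - p)^3 + (1 - p) p^3 = p (1 - p) [(1 - p)^2 + p^2]
    third_abs_moment = p * (1 - p) * ((1 - p) ** 2 + p ** 2)
    return c * third_abs_moment / (sigma ** 3 * math.sqrt(n))


for p in (0.5, 0.05):
    for n in (30, 300, 3000):
        print(f"p = {p}, n = {n}: bound = {berry_esseen_bound_bernoulli(p, n):.4f}")
```

The bound is a worst-case guarantee and is often loose, but the comparison is informative: a rare-event proportion such as p = 0.05 needs a much larger n than p = 0.5 to reach the same guaranteed accuracy.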

The delta method

The delta method extends asymptotic normality through smooth transformations. Suppose

\sqrt{n}(T_n-\theta) \xrightarrow{d} N(0,\sigma^2),

and g is differentiable at \theta. Then

\sqrt{n}(g(T_n)-g(\theta)) \xrightarrow{d} N(0, [g'(\theta)]^2\sigma^2).

The intuition is a first-order Taylor approximation:

g(T_n) \approx g(\theta) + g'(\theta)(T_n-\theta).

The transformation rescales the standard error by |g'(\theta)|. For example, if \hat{p} is a sample proportion and g(p)=\log(p/(1-p)) is the log odds, then

g'(p)=\frac{1}{p(1-p)}.

The delta method gives an approximate standard error for log odds from the standard error of \hat{p}, provided p is not too close to 0 or 1.
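
The log-odds calculation can be sketched in Python as follows. The sketch assumes the usual binomial standard error \sqrt{p(1-p)/n} for \hat{p} and plugs in \hat{p} for p. The function names and the numbers are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log odds, g(p) = log(p / (1 - p))."""
    return math.log(p / (1 - p))


def delta_se_logit(p_hat: float, n: int) -> float:
    """Delta-method standard error of logit(p_hat) for a sample proportion."""
    se_p = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
    g_prime = 1.0 / (p_hat * (1 - p_hat))       # derivative of the log odds
    return g_prime * se_p                        # |g'(p)| * SE(p_hat)


p_hat, n = 0.30, 200   # hypothetical sample proportion and sample size
est = logit(p_hat)
se = delta_se_logit(p_hat, n)
print(f"log odds = {est:.3f}, delta-method SE = {se:.3f}")
print(f"approximate 95% interval: ({est - 1.96 * se:.3f}, {est + 1.96 * se:.3f})")
```

The product simplifies to 1/\sqrt{n p (1-p)}, which grows without bound as p approaches 0 or 1. That is the algebraic form of the warning about proportions near the boundary.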

Slutsky’s theorem

Slutsky’s theorem lets us replace unknown constants with consistent estimates. If

X_n \xrightarrow{d} X

and

Y_n \xrightarrow{p} c,

then

X_n + Y_n \xrightarrow{d} X+c,

X_nY_n \xrightarrow{d} cX,

and, if c \ne 0,

\frac{X_n}{Y_n} \xrightarrow{d} \frac{X}{c}.

This result justifies the usual z statistic with an estimated standard error:

\frac{\sqrt{n}(\bar{X}_n-\mu)}{s} \xrightarrow{d} N(0,1),

when s \xrightarrow{p} \sigma.
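
A simulation can show how well the studentized statistic behaves at different sample sizes. The Python sketch below draws exponential samples, computes \sqrt{n}(\bar{X}_n-\mu)/s, and reports how often its absolute value is at most 1.96. The distribution, the seed, the replication count, and the function names are assumptions for illustration.

```python
import math
import random

random.seed(20260512)


def studentized_mean(sample: list, mu: float) -> float:
    """sqrt(n) * (xbar - mu) / s, with s the sample standard deviation (n - 1 divisor)."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return math.sqrt(n) * (xbar - mu) / math.sqrt(s2)


def coverage(n: int, reps: int = 2000, lam: float = 2.0) -> float:
    """Share of replications with |z| <= 1.96 for Exponential(rate = lam) data."""
    mu = 1.0 / lam
    hits = 0
    for _ in range(reps):
        sample = [random.expovariate(lam) for _ in range(n)]
        hits += abs(studentized_mean(sample, mu)) <= 1.96
    return hits / reps


for n in (10, 50, 200):
    print(n, coverage(n))
```

If the normal limit is a good guide, the reported shares should approach 0.95 as n grows. Shortfalls at small n reflect both the skewness of the data and the noise in s.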

Worked example: simulating the CLT

We will simulate sample means from an exponential distribution. The exponential distribution is strongly right-skewed and has finite variance, so it is a good classroom test of the CLT. It is not heavy-tailed in the strict mathematical sense. A t distribution with 3 degrees of freedom is heavier-tailed and also has finite variance, so it is a useful extension once the simulation structure is clear.

Let

X \sim \operatorname{Exponential}(\lambda).

Using the rate parameterization,

E[X]=\frac{1}{\lambda}, \qquad \operatorname{Var}(X)=\frac{1}{\lambda^2}.

For sample size n,

E[\bar{X}_n]=\frac{1}{\lambda}, \qquad \operatorname{Var}(\bar{X}_n)=\frac{1}{n\lambda^2}.

The CLT says

\frac{\sqrt{n}(\bar{X}_n-1/\lambda)}{1/\lambda} \approx N(0,1)

for large n.

Set \lambda=2. Then E[X]=0.5 and \operatorname{SD}(X)=0.5. For n=30, the standard error of the sample mean is

\frac{0.5}{\sqrt{30}}\approx 0.0913.

If we repeat the experiment 1,000 times, each time drawing 30 observations and computing a sample mean, the histogram of the 1,000 means should look much more normal than the original exponential data.
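
Because the sum of n i.i.d. exponential variables has a gamma distribution, this example also allows an exact check on the normal approximation. The Python sketch below compares the exact and approximate values of P(\bar{X}_n \le 0.65) for \lambda=2 and n=30, using the same threshold 0.65 that the software examples in this section use. The function name erlang_cdf is hypothetical, and the closed-form sum assumes an integer shape parameter.

```python
import math
from statistics import NormalDist


def erlang_cdf(x: float, shape: int, rate: float) -> float:
    """CDF of a Gamma(shape, rate) variable with integer shape (Erlang)."""
    if x <= 0:
        return 0.0
    t = rate * x
    term, partial = 1.0, 1.0           # k = 0 term of sum_{k < shape} t^k / k!
    for k in range(1, shape):
        term *= t / k
        partial += term
    return 1.0 - math.exp(-t) * partial


lam, n, threshold = 2.0, 30, 0.65

# P(Xbar <= c) = P(sum <= n * c), and the sum of n exponentials is Gamma(n, rate = lam)
exact = erlang_cdf(n * threshold, shape=n, rate=lam)
normal = NormalDist(mu=1 / lam, sigma=(1 / lam) / math.sqrt(n)).cdf(threshold)

print(f"exact gamma probability = {exact:.4f}")
print(f"normal approximation   = {normal:.4f}")
```

The exact probability comes out somewhat below the normal value, which is what remaining right skewness in the sampling distribution at n=30 would suggest.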

Excel

Excel can generate exponential draws with the inverse CDF method. If U \sim \operatorname{Uniform}(0,1), then

X = -\frac{\log(1-U)}{\lambda}

has an exponential distribution with rate \lambda.

Assume B1 contains 2 for \lambda, B2 contains 30 for sample size, and B3 contains 1000 for replications. One practical layout is to use 30 rows per replication.

B1: 2
B2: 30
B3: 1000

# In A6, create an observation index within replication, then fill down 30000 rows
=MOD(ROW(A1)-1, $B$2)+1

# In B6, create a replication id, then fill down 30000 rows
=INT((ROW(A1)-1)/$B$2)+1

# In C6, generate one exponential draw, then fill down
=-LN(1-RAND())/$B$1

# In E6:E1005, list replication ids 1 through 1000
1
2
3

# In F6, compute the mean for replication E6, then fill down
=AVERAGEIF($B$6:$B$30005, E6, $C$6:$C$30005)

# Theoretical mean and standard error for n = 30
=1/$B$1
=(1/$B$1)/SQRT($B$2)

# Empirical mean and standard deviation of the simulated sample means
=AVERAGE(F6:F1005)
=STDEV.S(F6:F1005)

# Standardize the first simulated sample mean
=(F6-(1/$B$1))/((1/$B$1)/SQRT($B$2))

# Histogram bins in H6:H25, then frequency counts in I6:I25
=FREQUENCY(F6:F1005, H6:H25)

# Normal approximation for Pr(sample mean <= 0.65)
=NORM.DIST(0.65, 1/$B$1, (1/$B$1)/SQRT($B$2), TRUE)

Use the frequency counts to make a histogram. The raw exponential draws in column C will be right-skewed. The sample means in column F should be far more symmetric.

Stata

clear all
set seed 20260512

local reps = 1000
local n = 30
local lambda = 2

set obs `=`reps' * `n''
gen rep = ceil(_n / `n')
gen x = -ln(1 - runiform()) / `lambda'

collapse (mean) xbar = x, by(rep)

summarize xbar

gen z = sqrt(`n') * (xbar - 1/`lambda') / (1/`lambda')
summarize z

histogram xbar, normal name(hist_xbar, replace)
histogram z, normal name(hist_z, replace)

display "Theoretical mean = " 1/`lambda'
display "Theoretical SE = " (1/`lambda')/sqrt(`n')
display "Normal approx Pr(xbar <= .65) = " normal((.65 - 1/`lambda') / ((1/`lambda')/sqrt(`n')))

The collapse command reduces the simulated raw data to one mean per replication. The standardized variable z should have mean near zero and standard deviation near one.

R

set.seed(20260512)

reps <- 1000
n <- 30
lambda <- 2

xbar <- replicate(reps, mean(rexp(n, rate = lambda)))

mean(xbar)
sd(xbar)

theory_mean <- 1 / lambda
theory_se <- (1 / lambda) / sqrt(n)

theory_mean
theory_se

z <- sqrt(n) * (xbar - theory_mean) / (1 / lambda)
mean(z)
sd(z)

hist(xbar, breaks = 30, probability = TRUE,
     xlab = "sample mean", main = "Sampling distribution of the mean")
curve(dnorm(x, mean = theory_mean, sd = theory_se),
      add = TRUE, col = "blue", lwd = 2)

pnorm(0.65, mean = theory_mean, sd = theory_se)

# A heavier-tailed variant with finite variance
xbar_t3 <- replicate(reps, mean(rt(n, df = 3)))
hist(xbar_t3, breaks = 30, probability = TRUE,
     xlab = "sample mean", main = "Means from t with 3 df")

# Clean plotting with ggplot2
# library(ggplot2)
# data.frame(xbar = xbar) |>
#   ggplot(aes(xbar)) +
#   geom_histogram(aes(y = after_stat(density)), bins = 30,
#                  fill = "grey80", color = "white") +
#   stat_function(fun = dnorm,
#                 args = list(mean = theory_mean, sd = theory_se),
#                 color = "blue", linewidth = 1)

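The exponential simulation shows skewed data producing a much more symmetric sampling distribution for the mean. The t_3 distribution has variance 3, so the matching normal curve for its sample means has standard deviation \sqrt{3/n}. Its third absolute moment is infinite, so the Berry-Esseen bound does not apply, and the normal approximation can be expected to improve more slowly, especially in the tails.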

Julia

using Random
using Distributions
using Statistics
using Plots

Random.seed!(20260512)

reps = 1000
n = 30
lambda = 2

dist = Exponential(1 / lambda)  # Distributions.jl uses scale for Exponential
xbar = [mean(rand(dist, n)) for _ in 1:reps]

mean(xbar)
std(xbar)

theory_mean = 1 / lambda
theory_se = (1 / lambda) / sqrt(n)

theory_mean
theory_se

z = sqrt(n) .* (xbar .- theory_mean) ./ (1 / lambda)
mean(z)
std(z)

histogram(xbar, normalize = true, bins = 30,
          xlabel = "sample mean", ylabel = "density",
          label = "simulated means")

xs = range(minimum(xbar), maximum(xbar), length = 200)  # named xs to avoid clashing with Plots.grid
plot!(xs, pdf.(Normal(theory_mean, theory_se), xs),
      linewidth = 2, label = "normal approximation")

cdf(Normal(theory_mean, theory_se), 0.65)

# A heavier-tailed finite-variance variant
t3 = TDist(3)
xbar_t3 = [mean(rand(t3, n)) for _ in 1:reps]
histogram(xbar_t3, normalize = true, bins = 30,
          xlabel = "sample mean", ylabel = "density",
          label = "t3 means")

Distributions.jl parameterizes Exponential by scale, so Exponential(1 / lambda) has mean 1/\lambda. The standardized z values should resemble draws from a standard normal distribution more closely as n grows.

Common traps

The first trap is applying the CLT too casually with n=10 and a skewed or heavy-tailed population. The theorem is asymptotic. It does not promise that the approximation is useful at a particular sample size.

A second trap is forgetting the finite variance condition. If the population has infinite variance, the usual CLT may fail. Cauchy data are the standard example, but some real datasets have tails heavy enough that normal approximations need careful checking.

A third trap is using the sample mean when the median is more appropriate. The mean targets the population average, which is sensitive to tails. If the applied question concerns a typical case, the median or another quantile may answer the question better.

Dependence is another common problem. The i.i.d. CLT does not justify ordinary standard errors for clustered classrooms, repeated patient visits, or serially correlated monthly outcomes. There are CLTs for dependent data, but the variance formula changes.

Weighted samples need care too. Survey weights, inverse probability weights, and post-stratification can make the estimator behave differently from a simple average. The right standard error should match the design or estimating equation.

Simulation has its own traps. Results are hard to reproduce if the random seed is not fixed, and they can mislead if the number of replications is too small or if the histogram is compared to the wrong normal curve. The normal approximation for \bar{X}_n uses standard deviation \sigma/\sqrt{n}, not \sigma.

Reporting checklist

State the sampling unit and the target population. A sample of visits is not the same as a sample of patients.

State whether observations are assumed independent. If they are clustered or repeated, describe the dependence structure.

Report the sample size used for the estimator. If units have missing values, report the number of observations actually used in each analysis.

Report the estimator, its standard error, and the distributional approximation used for inference.

When invoking the Law of Large Numbers, identify the quantity that converges and the target it converges to.

When invoking the Central Limit Theorem, state the mean, variance, and scaling. The usual scaling is \sqrt{n}.

Check whether the finite variance assumption is plausible. Heavy tails do not automatically break the CLT, but infinite variance does.

For small samples or skewed data, use simulation, bootstrap checks, exact methods, or transformations when appropriate.

If using the delta method, report the transformation g, the derivative g', and the value where the derivative is evaluated.

If replacing unknown parameters with estimates, name the consistency argument or cite Slutsky’s theorem when the step is central to the analysis.

For simulation examples, set the seed, report the number of replications, and compare empirical standard errors to theoretical standard errors.
