Distributions: discrete and continuous
Why distributions matter
A distribution tells us how probability is spread across possible values of a random variable. It answers questions such as: Which values can occur? Which are common? How much variation should we expect? How much probability sits in the tails?
Every estimator and test used in applied statistics relies on a distributional claim. Sometimes the claim is a model for the data, as when a count is modeled with a Poisson distribution or a binary outcome with a Bernoulli distribution. Sometimes the claim is about an estimator, as when a sample mean has an approximate normal sampling distribution. Sometimes the claim is only a working approximation, as with a t statistic in a small regression. The quality of the applied conclusion depends on how well the distributional claim matches the problem.
The same dataset can suggest different analyses depending on its distributional shape. A symmetric continuous outcome may work well with means and standard deviations. A right-skewed income variable may need a log scale. A count with many zeros may need a model that allows more zeros than a simple Poisson would predict. Heavy tails may make the sample mean unstable enough that the median or a trimmed mean deserves attention.
Distributions also keep us honest about extrapolation. If a model assumes a normal distribution, then it assigns positive density to every real number, and so positive probability to ranges of values that are impossible for a strictly positive outcome. That may not matter near the center of the data, but it can matter in the tails or when predictions are reported on the original scale.
The discrete family
A discrete distribution places probability mass on countable values. Its probability mass function, or PMF, is p(x)=P(X=x).
Bernoulli
A Bernoulli random variable records one trial with two outcomes, usually coded 1 for success and 0 for failure. If P(X=1)=p, then
P(X=x)=p^x(1-p)^{1-x}, \qquad x \in \{0,1\}.
The mean and variance are
E[X]=p, \qquad \operatorname{Var}(X)=p(1-p).
Bernoulli variables appear whenever an outcome is binary: voted or did not vote, admitted or not admitted, defaulted or did not default.
Binomial
A binomial random variable counts successes in n independent Bernoulli trials with common success probability p:
X \sim \operatorname{Binomial}(n,p).
The PMF is
P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}, \qquad x=0,1,\ldots,n.
The mean and variance are
E[X]=np, \qquad \operatorname{Var}(X)=np(1-p).
The binomial distribution is natural for survey counts, defect counts from fixed lots, and the number of treated units with an outcome when each unit has the same probability under the model.
Geometric
The geometric distribution models the number of trials until the first success. With support x=1,2,\ldots,
P(X=x)=(1-p)^{x-1}p.
The mean and variance are
E[X]=\frac{1}{p}, \qquad \operatorname{Var}(X)=\frac{1-p}{p^2}.
Some software uses the alternative support x=0,1,2,\ldots, where X is the number of failures before the first success. Then the mean is (1-p)/p. Always check the convention.
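The two conventions can be compared numerically. The sketch below is illustrative only: the success probability p = 0.25, the truncation point, and the function names are assumptions, not values from the text.

```python
# Two conventions for the geometric distribution, with an illustrative p.
p = 0.25  # assumed success probability, chosen only for illustration


def pmf_trials(x, p):
    """P(X = x) when X counts trials up to and including the first success."""
    return (1 - p) ** (x - 1) * p


def pmf_failures(y, p):
    """P(Y = y) when Y counts failures before the first success."""
    return (1 - p) ** y * p


# Truncate the infinite sums; the neglected tail is negligible for this p.
cutoff = 2000
mean_trials = sum(x * pmf_trials(x, p) for x in range(1, cutoff))
mean_failures = sum(y * pmf_failures(y, p) for y in range(0, cutoff))

print(mean_trials)    # close to 1/p
print(mean_failures)  # close to (1 - p)/p
```

The two means differ by exactly one, because every sequence of trials contains one success in addition to its failures.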
Negative binomial
The negative binomial generalizes the geometric distribution. With support x=r,r+1,\ldots, let X be the number of trials needed to observe r successes:
P(X=x)=\binom{x-1}{r-1}p^r(1-p)^{x-r}.
The mean and variance are
E[X]=\frac{r}{p}, \qquad \operatorname{Var}(X)=\frac{r(1-p)}{p^2}.
In regression modeling, “negative binomial” often refers to a related count distribution used for overdispersed counts, where \operatorname{Var}(X)>E[X]. The parameterization there differs across software.
Poisson
A Poisson random variable counts events over a fixed exposure period when events occur independently at a constant rate. With \lambda>0 denoting the expected number of events in that period:
P(X=x)=e^{-\lambda}\frac{\lambda^x}{x!}, \qquad x=0,1,2,\ldots.
The mean and variance are both
E[X]=\lambda, \qquad \operatorname{Var}(X)=\lambda.
The equality of mean and variance is a strong restriction. Real count data often show overdispersion, which means the variance exceeds the mean.
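A quick screen for overdispersion is to compare the sample variance with the sample mean. The counts below are hypothetical and invented for illustration; the ratio is a rough diagnostic under that assumption, not a formal test.

```python
from statistics import mean, variance

# Hypothetical event counts, invented purely for illustration.
counts = [0, 0, 1, 0, 3, 0, 7, 0, 2, 0, 0, 12]

sample_mean = mean(counts)
sample_var = variance(counts)  # uses the n - 1 divisor

# Under a Poisson model this ratio should be near 1.
dispersion_ratio = sample_var / sample_mean

print(f"mean = {sample_mean:.3f}, variance = {sample_var:.3f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

A ratio well above one, as in this made-up sample, suggests that a plain Poisson model would understate the variability.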
Hypergeometric
The hypergeometric distribution counts successes when sampling without replacement from a finite population. Suppose the population has size N, contains K success states, and we draw n units. Then
P(X=x) = \frac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}},
where x must be feasible:
\max(0,n-(N-K)) \le x \le \min(n,K).
The mean and variance are
E[X]=n\frac{K}{N},
\operatorname{Var}(X) = n\frac{K}{N}\left(1-\frac{K}{N}\right) \frac{N-n}{N-1}.
The final factor is the finite population correction. Sampling without replacement creates dependence between draws.
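The formulas can be checked by enumerating the PMF over its feasible support. This is a minimal sketch; the values N = 50, K = 12, n = 10 and the function name are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pmf(x, N, K, n):
    """P(X = x) for n draws without replacement from N units, K of them successes."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)


# Illustrative values (assumptions, not from the text).
N, K, n = 50, 12, 10

# Feasible support: max(0, n - (N - K)) <= x <= min(n, K).
support = range(max(0, n - (N - K)), min(n, K) + 1)
probs = [hypergeom_pmf(x, N, K, n) for x in support]

# Moments computed directly from the PMF.
mean_enum = sum(x * q for x, q in zip(support, probs))
var_enum = sum((x - mean_enum) ** 2 * q for x, q in zip(support, probs))

# Closed-form moments, including the finite population correction.
mean_formula = n * K / N
var_formula = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)

print(mean_enum, mean_formula)
print(var_enum, var_formula)
```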
The continuous family
A continuous distribution assigns probabilities to intervals through a density f(x). The density itself is not a probability. For an interval,
P(a \le X \le b)=\int_a^b f(x)\,dx.
Uniform
For X \sim \operatorname{Uniform}(a,b),
f(x)=\frac{1}{b-a}, \qquad a \le x \le b.
The mean and variance are
E[X]=\frac{a+b}{2}, \qquad \operatorname{Var}(X)=\frac{(b-a)^2}{12}.
Uniform distributions are common in simulation because pseudo-random number generators start with values that behave like draws from \operatorname{Uniform}(0,1).
Normal or Gaussian
For X \sim N(\mu,\sigma^2),
f(x)=\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right].
The mean and variance are
E[X]=\mu, \qquad \operatorname{Var}(X)=\sigma^2.
Normal distributions are symmetric and bell-shaped, and the family is closed under sums: a sum of independent normal variables is again normal. The standard normal is Z \sim N(0,1).
Exponential
For X \sim \operatorname{Exponential}(\lambda) with rate \lambda>0,
f(x)=\lambda e^{-\lambda x}, \qquad x \ge 0.
The mean and variance are
E[X]=\frac{1}{\lambda}, \qquad \operatorname{Var}(X)=\frac{1}{\lambda^2}.
The exponential distribution is memoryless:
P(X>s+t \mid X>s)=P(X>t).
This property is mathematically convenient, but it is often unrealistic for human durations such as unemployment spells or survival times after diagnosis.
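The memoryless identity can be verified directly from the survival function P(X>x)=e^{-\lambda x}. The rate and the times s and t below are arbitrary illustrative choices.

```python
from math import exp

rate = 0.5       # assumed rate, for illustration
s, t = 2.0, 3.0  # assumed elapsed time and additional time


def survival(x, rate):
    """P(X > x) for an exponential distribution with the given rate."""
    return exp(-rate * x)


# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, rate) / survival(s, rate)
unconditional = survival(t, rate)

print(conditional, unconditional)  # the two agree up to rounding
```

Having already waited s units does not change the chance of waiting t more, which is exactly what makes the model implausible for durations that age.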
Gamma
Using shape \alpha>0 and rate \beta>0,
f(x)=\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}, \qquad x>0.
The mean and variance are
E[X]=\frac{\alpha}{\beta}, \qquad \operatorname{Var}(X)=\frac{\alpha}{\beta^2}.
The gamma distribution is flexible for positive, right-skewed data. When \alpha=1, it reduces to the exponential distribution.
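The reduction to the exponential can be confirmed by evaluating both densities at a few points. This sketch uses the shape and rate parameterization given above; the rate value and evaluation points are assumptions.

```python
from math import exp, gamma


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta."""
    return beta ** alpha / gamma(alpha) * x ** (alpha - 1) * exp(-beta * x)


def exponential_pdf(x, rate):
    """Exponential density with the given rate."""
    return rate * exp(-rate * x)


rate = 1.5  # assumed rate, for illustration
points = [0.1, 0.5, 1.0, 2.0, 4.0]

# Largest absolute difference between the two densities on the grid.
max_gap = max(
    abs(gamma_pdf(x, 1.0, rate) - exponential_pdf(x, rate)) for x in points
)
print(max_gap)
```

If a library uses scale rather than rate, pass 1/beta instead; the two parameterizations describe the same family.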
Beta
For X \sim \operatorname{Beta}(\alpha,\beta),
f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}, \qquad 0<x<1.
The mean and variance are
E[X]=\frac{\alpha}{\alpha+\beta},
\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.
Beta distributions model probabilities, proportions, and rates bounded between 0 and 1.
Log-normal
If Y \sim N(\mu,\sigma^2) and X=e^Y, then X has a log-normal distribution. Its density is
f(x)=\frac{1}{x\sigma\sqrt{2\pi}} \exp\left[-\frac{(\log x-\mu)^2}{2\sigma^2}\right], \qquad x>0.
The mean and variance are
E[X]=\exp\left(\mu+\frac{\sigma^2}{2}\right),
\operatorname{Var}(X) = \left(e^{\sigma^2}-1\right)e^{2\mu+\sigma^2}.
The mean on the original scale is larger than the median, which is e^\mu, because right tails pull the average upward.
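The gap between mean and median follows directly from the formulas. The log-scale parameters below are assumed values for illustration.

```python
from math import exp

# Assumed log-scale parameters, for illustration only.
mu, sigma = 10.0, 0.8

median = exp(mu)
mean = exp(mu + sigma ** 2 / 2)
variance = (exp(sigma ** 2) - 1) * exp(2 * mu + sigma ** 2)

# The mean exceeds the median by the factor exp(sigma^2 / 2).
ratio = mean / median

print(f"median = {median:.1f}, mean = {mean:.1f}, ratio = {ratio:.3f}")
```

The ratio depends only on \sigma, so the more spread there is on the log scale, the further the mean sits above the median.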
Chi-square
A chi-square random variable with \nu degrees of freedom can be written as a sum of squared independent standard normal variables:
X = \sum_{j=1}^{\nu} Z_j^2, \qquad Z_j \sim N(0,1).
Its density is
f(x)=\frac{1}{2^{\nu/2}\Gamma(\nu/2)} x^{\nu/2-1}e^{-x/2}, \qquad x>0.
The mean and variance are
E[X]=\nu, \qquad \operatorname{Var}(X)=2\nu.
The chi-square distribution is right-skewed, especially for small degrees of freedom.
Student’s t
If Z \sim N(0,1) and U \sim \chi^2_\nu are independent, then
T=\frac{Z}{\sqrt{U/\nu}}
has a t distribution with \nu degrees of freedom. Its density is
f(t)= \frac{\Gamma((\nu+1)/2)} {\sqrt{\nu\pi}\Gamma(\nu/2)} \left(1+\frac{t^2}{\nu}\right)^{-(\nu+1)/2}.
For \nu>1, E[T]=0. For \nu>2,
\operatorname{Var}(T)=\frac{\nu}{\nu-2}.
The t distribution has heavier tails than the standard normal. As \nu grows, it approaches the standard normal.
F
If U_1 \sim \chi^2_{\nu_1} and U_2 \sim \chi^2_{\nu_2} are independent, then
F=\frac{U_1/\nu_1}{U_2/\nu_2}
has an F distribution with \nu_1 and \nu_2 degrees of freedom. Its density can be written as
f(x)= \frac{(\nu_1/\nu_2)^{\nu_1/2}x^{\nu_1/2-1}} {B(\nu_1/2,\nu_2/2)\left(1+\frac{\nu_1}{\nu_2}x\right)^{(\nu_1+\nu_2)/2}}, \qquad x>0.
For \nu_2>2,
E[F]=\frac{\nu_2}{\nu_2-2}.
For \nu_2>4,
\operatorname{Var}(F) = \frac{2\nu_2^2(\nu_1+\nu_2-2)} {\nu_1(\nu_2-2)^2(\nu_2-4)}.
The F distribution appears in variance comparisons, analysis of variance, and nested regression tests.
Relationships between distributions
Distribution families are connected. These links explain why the same functions appear throughout statistical inference.
If X \sim \operatorname{Binomial}(n,p), n is large, and p is not too close to 0 or 1, then
\frac{X-np}{\sqrt{np(1-p)}} \approx N(0,1).
A continuity correction often improves the approximation:
P(X \le k) \approx \Phi\left(\frac{k+0.5-np}{\sqrt{np(1-p)}}\right).
The rule of thumb np \ge 10 and n(1-p) \ge 10 is only a screening device. Tail probabilities can still be poor when p is near 0 or 1.
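The effect of the continuity correction can be seen by comparing both approximations with the exact binomial CDF. The values n = 30, p = 0.4, and k = 9 are assumptions chosen for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

# Illustrative values (assumptions, not from the text).
n, p, k = 30, 0.4, 9

# Exact P(X <= k) by summing the binomial PMF.
exact = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

mean = n * p
sd = sqrt(n * p * (1 - p))
std_normal = NormalDist()

# Normal approximation without and with the continuity correction.
approx_plain = std_normal.cdf((k - mean) / sd)
approx_corrected = std_normal.cdf((k + 0.5 - mean) / sd)

print(f"exact     = {exact:.4f}")
print(f"plain     = {approx_plain:.4f}")
print(f"corrected = {approx_corrected:.4f}")
```

Repeating the comparison with p near 0 or 1 is a simple way to see how quickly the approximation degrades in the tails.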
A chi-square variable with \nu degrees of freedom is a sum of \nu squared independent standard normals:
\chi^2_\nu = Z_1^2+\cdots+Z_\nu^2.
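The t distribution divides a standard normal by the square root of an independent chi-square variable scaled by its degrees of freedom, which is how an estimated standard error enters: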
T=\frac{Z}{\sqrt{U/\nu}}, \qquad U \sim \chi^2_\nu.
The F distribution is a ratio of two scaled chi-square variables:
F=\frac{U_1/\nu_1}{U_2/\nu_2}.
These constructions show why z, t, \chi^2, and F procedures are so closely related. They also show why degrees of freedom matter: they describe how much independent information enters an estimated variance.
Joint distributions, marginals, and conditionals
A joint distribution describes two or more random variables together. For discrete variables,
p_{X,Y}(x,y)=P(X=x,Y=y).
The marginal distribution of X sums over Y:
p_X(x)=\sum_y p_{X,Y}(x,y).
The conditional distribution of Y given X=x is
p_{Y \mid X}(y \mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)}.
For continuous variables, sums become integrals. If f_{X,Y}(x,y) is a joint density, then
f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy,
and
f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)}.
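For a discrete joint distribution these definitions reduce to summing and dividing table entries. The joint probabilities and helper names below are hypothetical, made up for illustration.

```python
# Hypothetical joint PMF for (X, Y), keyed by (x, y); values sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}


def marginal_x(joint):
    """Marginal PMF of X: sum the joint PMF over y."""
    out = {}
    for (x, _y), prob in joint.items():
        out[x] = out.get(x, 0.0) + prob
    return out


def conditional_y_given_x(joint, x):
    """Conditional PMF of Y given X = x: joint divided by the marginal."""
    px = marginal_x(joint)[x]
    return {y: prob / px for (xx, y), prob in joint.items() if xx == x}


p_x = marginal_x(joint)
cond_given_0 = conditional_y_given_x(joint, 0)

print(p_x)           # marginal of X
print(cond_given_0)  # distribution of Y given X = 0
```

Each conditional distribution sums to one, even though the corresponding row of the joint table does not.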
The bivariate normal distribution is the main continuous example in applied statistics. If (X,Y) is bivariate normal with means \mu_X,\mu_Y, standard deviations \sigma_X,\sigma_Y, and correlation \rho, then the conditional distribution of Y given X=x is normal with mean
E[Y \mid X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),
and variance
\operatorname{Var}(Y \mid X=x) = \sigma_Y^2(1-\rho^2).
This formula has the same linear shape as simple regression. In the bivariate normal case, the conditional mean is exactly linear and the conditional variance does not depend on x.
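The conditional moments are simple to compute once the five parameters are known. The parameter values and the function name below are assumptions for illustration.

```python
def conditional_normal(x, mu_x, mu_y, sd_x, sd_y, rho):
    """Mean and variance of Y given X = x under a bivariate normal model."""
    cond_mean = mu_y + rho * sd_y / sd_x * (x - mu_x)
    cond_var = sd_y ** 2 * (1 - rho ** 2)
    return cond_mean, cond_var


# Illustrative parameters (assumptions, not from the text).
mu_x, mu_y = 170.0, 70.0
sd_x, sd_y = 10.0, 12.0
rho = 0.5

cond_mean, cond_var = conditional_normal(180.0, mu_x, mu_y, sd_x, sd_y, rho)
print(cond_mean, cond_var)
```

Here the slope is \rho\sigma_Y/\sigma_X = 0.6, and conditioning removes a fraction \rho^2 of the variance of Y whatever the value of x.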
The standard normal and z-scores
If X \sim N(\mu,\sigma^2), then the standardized variable
Z=\frac{X-\mu}{\sigma}
has a standard normal distribution. A z-score tells how many standard deviations an observation is from the mean.
Standard normal probabilities use the CDF
\Phi(z)=P(Z \le z).
Quantiles invert the CDF. If \Phi(z_p)=p, then z_p is the pth quantile. For example, z_{0.975}\approx 1.96.
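Standardizing, evaluating the CDF, and inverting it are each one call with Python's standard library. The mean, standard deviation, and observation below are assumed values for illustration.

```python
from statistics import NormalDist

# Assumed population values and a single observation, for illustration.
mu, sigma = 100.0, 15.0
x = 130.0

# z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma

std_normal = NormalDist()      # N(0, 1)
p_below = std_normal.cdf(z)    # Phi(z)
z_975 = std_normal.inv_cdf(0.975)  # quantile with Phi(z_p) = 0.975

print(f"z = {z:.2f}, Phi(z) = {p_below:.4f}, z_0.975 = {z_975:.3f}")
```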
Numerical work should use library functions for CDFs and inverse CDFs. Direct integration and table lookup are less accurate and less reproducible. Floating-point arithmetic can also lose precision in extreme tails. For very small upper-tail probabilities, use survival functions when available rather than subtracting from one. For example, in double-precision arithmetic 1-\Phi(8) typically retains little of its accuracy and 1-\Phi(9) evaluates to zero, while a tail-specific function still returns a meaningful value.
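The loss of precision is easy to reproduce. The sketch below assumes double-precision floats and builds a survival function from the complementary error function; normal_sf is an illustrative helper, not a library function. It uses z = 9, a point where subtraction from one returns exactly zero.

```python
from math import erfc, sqrt
from statistics import NormalDist


def normal_sf(z):
    """Upper-tail probability P(Z > z), computed without subtracting from one."""
    return 0.5 * erfc(z / sqrt(2))


z = 9.0

# Subtracting from one: the CDF has already rounded to 1.0.
by_subtraction = 1 - NormalDist().cdf(z)

# Tail-specific calculation keeps the small probability.
by_survival = normal_sf(z)

print(by_subtraction)
print(by_survival)
```

The same idea applies in other tools: prefer an upper-tail or survival function, or use symmetry and evaluate the CDF at -z.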
Heavy tails and why they matter
A heavy-tailed distribution assigns more probability to extreme values than a normal distribution does. Heavy tails change how averages behave. They can make sample means volatile, standard deviations unstable, and outlier rules misleading.
The Cauchy distribution is the canonical warning. It has density
f(x)=\frac{1}{\pi(1+x^2)}.
It is symmetric around zero, but it has no finite mean and no finite variance. The sample average of Cauchy draws does not settle down to a fixed value as the sample size grows.
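Comparing two-sided tail probabilities shows how much more weight the Cauchy places on extreme values. The sketch uses the closed-form Cauchy CDF, 1/2+\arctan(x)/\pi; the threshold of 5 is an arbitrary illustrative choice.

```python
from math import atan, pi
from statistics import NormalDist


def cauchy_cdf(x):
    """CDF of the standard Cauchy distribution."""
    return 0.5 + atan(x) / pi


threshold = 5.0  # assumed cutoff, for illustration

# Both distributions are symmetric about zero, so P(|X| > c) = 2 * P(X < -c).
cauchy_tail = 2 * cauchy_cdf(-threshold)
normal_tail = 2 * NormalDist().cdf(-threshold)

print(f"Cauchy P(|X| > 5) = {cauchy_tail:.4f}")
print(f"Normal P(|Z| > 5) = {normal_tail:.2e}")
```

An event that is essentially never seen under a standard normal happens in roughly one draw out of eight under a standard Cauchy.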
The log-normal distribution has a finite mean and variance, but its right tail can be long enough that a few observations dominate the sample mean. Income, firm size, claims, and biological concentrations often look closer to log-normal than normal.
The t distribution is a useful cousin of the normal. With low degrees of freedom, it has much heavier tails. For \nu=3, the variance exists and equals \nu/(\nu-2)=3, three times the variance of the standard normal. For \nu=1, the t distribution is the Cauchy distribution and has no mean.
Heavy tails are not automatically bad. They may be the data-generating reality. The problem is using thin-tailed methods while pretending extremes are rare.
Worked example: estimating a normal distribution from a small sample
Suppose we observe a small sample of a continuous measurement:
48,\ 51,\ 52,\ 49,\ 55,\ 53,\ 50,\ 54.
Let n=8. The sample mean is
\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i=51.5.
The sample variance is
s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2=6.0,
so s \approx 2.449.
If we treat the measurement as normally distributed and use the estimated mean and standard deviation as plug-in values, then
P(X>56) \approx 1-\Phi\left(\frac{56-51.5}{2.449}\right).
The z-score is about 1.837, so the probability is about 0.033.
For inference about the mean with such a small sample, use a t distribution rather than pretending \sigma is known. For a predictive probability under a fitted normal model, the plug-in normal calculation is a model-based approximation.
Excel
Put the eight observations in A2:A9.
# Sample estimates
=AVERAGE(A2:A9)
=VAR.S(A2:A9)
=STDEV.S(A2:A9)
# Pr(X > 56) under fitted normal model
=1-NORM.DIST(56, AVERAGE(A2:A9), STDEV.S(A2:A9), TRUE)
# Standard normal CDF and inverse CDF
=NORM.S.DIST(1.96, TRUE)
=NORM.S.INV(0.975)
# General normal quantile
=NORM.INV(0.95, AVERAGE(A2:A9), STDEV.S(A2:A9))
# Other distribution calculations
=BINOM.DIST(6, 10, 0.5, FALSE)
=T.DIST(2.1, 7, TRUE)
=F.DIST(3.2, 4, 12, TRUE)
=CHISQ.DIST(10, 5, TRUE)
# PDF values for plotting a fitted normal curve
# Put x values in C2:C42, for example 44, 44.5, ..., 64
=NORM.DIST(C2, AVERAGE($A$2:$A$9), STDEV.S($A$2:$A$9), FALSE)
To plot the fitted PDF in Excel, put a grid of x values in one column and the density formula in the next column. Insert a column chart or scatter plot with smooth lines. The vertical axis is density, not probability. Probability is area under the curve.
Stata
clear all
input x
48
51
52
49
55
53
50
54
end
summarize x
scalar mu = r(mean)
scalar s = r(sd)
display "Pr(X > 56) = " 1 - normal((56 - mu)/s)
display "Phi(1.96) = " normal(1.96)
display "z_0.975 = " invnormal(0.975)
display "Pr(Binomial(10,.5)=6) = " binomialp(10, 6, .5)
display "Pr(Binomial(10,.5)<=6) = " binomial(10, 6, .5)
display "Pr(t_7 <= 2.1) = " t(7, 2.1)
display "Pr(F_4,12 <= 3.2) = " F(4, 12, 3.2)
display "Pr(chi2_5 <= 10) = " chi2(5, 10)
range grid 44 64 81
gen fitted_pdf = normalden((grid - mu)/s)/s
twoway line fitted_pdf grid
The density calculation uses the standard normal density and rescales it by s. Stata’s normal() and invnormal() are the usual tools for normal CDFs and quantiles.
R
x <- c(48, 51, 52, 49, 55, 53, 50, 54)
mu <- mean(x)
s <- sd(x)
mu
var(x)
s
1 - pnorm(56, mean = mu, sd = s)
pnorm(1.96)
qnorm(0.975)
dnorm(56, mean = mu, sd = s)
pbinom(6, size = 10, prob = 0.5)
dbinom(6, size = 10, prob = 0.5)
pt(2.1, df = 7)
pf(3.2, df1 = 4, df2 = 12)
pchisq(10, df = 5)
grid <- seq(44, 64, length.out = 200)
dens <- dnorm(grid, mean = mu, sd = s)
plot(grid, dens, type = "l", xlab = "x", ylab = "density")
# Optional tidy plotting
# library(ggplot2)
# data.frame(grid, dens) |>
# ggplot(aes(grid, dens)) +
# geom_line() +
# labs(x = "x", y = "density")
Base R has CDF, quantile, and density functions for most standard distributions. The naming pattern is consistent: p for CDF, q for quantile, d for density or mass, and r for random generation.
Julia
using Distributions
using Statistics
using Plots
x = [48, 51, 52, 49, 55, 53, 50, 54]
mu = mean(x)
s = std(x)
mu
var(x)
s
fitdist = Normal(mu, s)
1 - cdf(fitdist, 56)
cdf(Normal(), 1.96)
quantile(Normal(), 0.975)
pdf(fitdist, 56)
cdf(Binomial(10, 0.5), 6)
pdf(Binomial(10, 0.5), 6)
cdf(TDist(7), 2.1)
cdf(FDist(4, 12), 3.2)
cdf(Chisq(5), 10)
grid = range(44, 64, length = 200)
dens = pdf.(fitdist, grid)
plot(grid, dens, xlabel = "x", ylabel = "density", legend = false)
Julia’s Distributions package uses the same pattern across families: create a distribution object, then call pdf, cdf, quantile, or rand.
Common traps
The first trap is confusing a PMF with a PDF. A PMF gives probabilities at points. A PDF gives density at points, and probabilities come from areas. A density can be greater than one without violating probability rules.
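A narrow normal distribution makes the point concrete: its density at the mean exceeds one, yet every interval probability stays between zero and one. The standard deviation of 0.1 is an assumed value for illustration.

```python
from statistics import NormalDist

# A tightly concentrated normal; sigma = 0.1 is an illustrative assumption.
narrow = NormalDist(mu=0.0, sigma=0.1)

# Density at the mean: a height, not a probability.
density_at_mean = narrow.pdf(0.0)

# Probability of falling within one standard deviation: an area.
prob_within_one_sd = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(f"density at mean = {density_at_mean:.3f}")
print(f"P(-0.1 <= X <= 0.1) = {prob_within_one_sd:.4f}")
```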
A second trap is using a normal distribution where a t distribution is appropriate. If the sample is small and the population variance is unknown, inference about a normal mean uses the t distribution with n-1 degrees of freedom. The difference may be small for large samples, but it is visible when n is small.
A third trap is treating the chi-square distribution as symmetric. It is bounded below by zero and right-skewed. Reporting a symmetric normal-style interval for a variance can give impossible lower bounds.
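The impossible lower bound shows up with the numbers from the worked example (n=8, s^2=6). The sketch assumes normal data, for which \operatorname{Var}(s^2)=2\sigma^4/(n-1), a standard result not derived in this chapter, and plugs in s^2 for \sigma^2. It illustrates the trap and is not a recommended interval.

```python
from math import sqrt
from statistics import NormalDist

n = 8     # sample size from the worked example
s2 = 6.0  # sample variance from the worked example

# Plug-in standard error of s^2, assuming normal data:
# Var(s^2) = 2 * sigma^4 / (n - 1), with s2 substituted for sigma^2.
se_s2 = s2 * sqrt(2 / (n - 1))

# Symmetric normal-style 95% interval for the variance.
z = NormalDist().inv_cdf(0.975)
lower = s2 - z * se_s2
upper = s2 + z * se_s2

print(f"interval = ({lower:.3f}, {upper:.3f})")  # lower bound is negative
```

An interval built from chi-square quantiles respects the lower bound of zero; it is not shown here because it needs a chi-square quantile function, which the Python standard library does not provide.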
Another problem is ignoring support. A normal model for a probability, count, or positive duration can produce impossible predictions. Those predictions may be harmless near the center but damaging in tails, simulations, or decision rules.
Parameterization mistakes also matter. Exponential and gamma distributions may use rate or scale. Geometric distributions may count trials or failures. Negative binomial distributions vary across theory books and software packages. Always read the function documentation before comparing results across tools.
Reporting checklist
Name the distribution family and state its parameters. Do not report “normal” without a mean and standard deviation or variance.
State the support. This prevents impossible predictions from slipping through unnoticed.
For discrete variables, report the PMF or the probability statement being computed. For continuous variables, report densities and interval probabilities separately.
Give the mean and variance when they exist. If they do not exist, say so directly.
When using an approximation, name the exact target and the approximation. For example, a normal approximation to a binomial count should report n, p, and any continuity correction.
For small-sample inference about a mean, report the degrees of freedom for the t distribution.
For chi-square and F calculations, report the degrees of freedom in the order used by the software.
When estimating parameters from data, distinguish known parameters from plug-in estimates.
For tail probabilities, use software CDF or survival functions rather than hand-coded subtraction when tails are extreme.