Bayesian inference

The Bayesian framework

Frequentist inference treats the parameter \theta as a fixed unknown and the data as random. Bayesian inference flips the convention. The parameter is itself a random variable, and the analyst encodes prior knowledge about it as a probability distribution \pi(\theta). Given data y generated by a likelihood p(y \mid \theta), the posterior is what you believe about \theta after seeing the data: \pi(\theta \mid y) = \frac{p(y \mid \theta) \, \pi(\theta)}{p(y)} \propto p(y \mid \theta) \, \pi(\theta) The marginal likelihood p(y) = \int p(y \mid \theta) \pi(\theta) \, d\theta is a normalizing constant that does not depend on \theta and can be ignored for most purposes. The slogan: posterior \propto prior \times likelihood.

This single line carries the entire framework. Every Bayesian computation, no matter how elaborate, is an attempt to summarize \pi(\theta \mid y). Sometimes the posterior is available in closed form (conjugate models). More often it is not, and we approximate it by sampling (Markov chain Monte Carlo) or by variational methods. The advantage over frequentist procedures is interpretive: a 95 percent credible interval is a statement about \theta given the data and the prior, not about the long-run behavior of a procedure across hypothetical repetitions. The cost is that the prior has to come from somewhere, and choosing it is an irreducible part of the analysis.

Conjugate priors

A prior is conjugate to a likelihood when the posterior belongs to the same family as the prior. Conjugate pairs are convenient because they give closed-form posteriors and clean updating rules. We work through the three that appear most often.

Beta-Binomial: a proportion

Suppose y \sim \text{Binomial}(n, p) and we place a \text{Beta}(\alpha, \beta) prior on p. The Beta density on (0, 1) is \pi(p) = \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)} The likelihood is proportional to p^y (1 - p)^{n - y}. Multiplying, \pi(p \mid y) \propto p^{\alpha + y - 1} (1 - p)^{\beta + n - y - 1} which is the kernel of a \text{Beta}(\alpha + y, \beta + n - y) density. The posterior parameters are the prior parameters incremented by the count of successes and failures. The prior acts like \alpha + \beta pseudo-observations: \alpha prior successes and \beta prior failures. A \text{Beta}(1, 1) prior is uniform on (0, 1) and adds the equivalent of two prior observations to the analysis.

Normal-Normal: a mean with known variance

Suppose y_1, \ldots, y_n \sim N(\mu, \sigma^2) with \sigma^2 known, and we place a N(\mu_0, \tau_0^2) prior on \mu. The posterior is normal: \mu \mid y \sim N\!\left( \mu_1, \tau_1^2 \right) with precision (the reciprocal of variance) adding linearly, \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_1 = \tau_1^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n \bar y}{\sigma^2} \right) The posterior mean is a precision-weighted average of the prior mean and the data mean. As n \to \infty, the data swamp the prior and \mu_1 \to \bar y. As \tau_0 \to \infty (a diffuse prior), the same limit holds for any fixed n. This is the cleanest case where Bayesian and frequentist answers coincide asymptotically.

Gamma-Poisson: a rate

Suppose y_1, \ldots, y_n \sim \text{Poisson}(\lambda) and the prior on \lambda is \text{Gamma}(\alpha, \beta) with density proportional to \lambda^{\alpha - 1} e^{-\beta \lambda}. The likelihood is proportional to \lambda^{\sum y_i} e^{-n \lambda}. The posterior is \lambda \mid y \sim \text{Gamma}\!\left( \alpha + \sum y_i, \, \beta + n \right) The posterior shape adds the sum of counts; the posterior rate adds the sample size. The prior contributes \alpha pseudo-events and \beta pseudo-units of exposure. A \text{Gamma}(0.001, 0.001) prior is a common “weak” choice that adds almost nothing to a moderate-sized dataset.

Point estimates from the posterior

A posterior distribution conveys more information than any single number. When a number is needed (for reporting, decision, or comparison), three summaries dominate.

Posterior mean. \hat\theta_{\text{mean}} = E[\theta \mid y]. Minimizes posterior expected squared error. The default unless the loss function says otherwise.

Posterior median. \hat\theta_{\text{med}} = \text{median}(\theta \mid y). Minimizes posterior expected absolute error. Less sensitive to long tails. Often preferred when the posterior is skewed.

Maximum a posteriori (MAP). \hat\theta_{\text{MAP}} = \arg\max_\theta \pi(\theta \mid y). The mode of the posterior. Coincides with the maximum likelihood estimate when the prior is flat. Can be misleading in high dimensions because the mode of a curved distribution is not a representative location.

For the Beta-Binomial with \text{Beta}(\alpha, \beta) prior and y successes in n trials, the posterior mean is (\alpha + y) / (\alpha + \beta + n) and the MAP is (\alpha + y - 1) / (\alpha + \beta + n - 2) when both are positive. The two estimates differ by a small bias and converge as n grows.

Credible intervals

A 1 - \alpha credible interval is a set C such that P(\theta \in C \mid y) = 1 - \alpha. Two constructions are standard.

Equal-tail interval. The interval (q_{\alpha/2}, q_{1 - \alpha/2}) between two quantiles of the posterior. Easy to compute, easy to explain, invariant to monotone transformations of the parameter. For a \text{Beta}(8, 4) posterior, the 95 percent equal-tail interval runs from the 2.5th to the 97.5th percentile of that distribution.

Highest posterior density (HPD) interval. The shortest set C such that P(\theta \in C \mid y) = 1 - \alpha. For a unimodal posterior, this is a single interval consisting of the values of \theta with highest posterior density. For a multimodal posterior, the HPD region can be a union of disjoint intervals. HPD is appealing because every point inside has higher density than every point outside, but it is not invariant under nonlinear transformations.

The contrast with frequentist confidence intervals is sharp. A 95 percent confidence interval is a random set that covers the true (fixed) \theta at least 95 percent of the time across repeated experiments. It does not say that the parameter falls inside the realized interval with probability 0.95. A 95 percent credible interval makes exactly that probability statement, conditional on the model and the prior. The two intervals often have similar numerical values for large samples and weak priors, which can encourage interpretive confusion. They are answers to different questions.

Markov chain Monte Carlo

For non-conjugate models, the posterior has no closed form. The trick that opened modern Bayesian computation is the realization that we do not need a closed form; we need a way to draw samples from the posterior. A Markov chain whose stationary distribution is \pi(\theta \mid y) produces such samples. The two foundational algorithms are Metropolis-Hastings and Gibbs.

Metropolis-Hastings

Start at some \theta^{(0)}. At step t, propose a candidate \theta^* from a proposal density q(\theta^* \mid \theta^{(t-1)}). Compute the acceptance probability \alpha = \min\!\left\{1, \, \frac{\pi(\theta^* \mid y) \, q(\theta^{(t-1)} \mid \theta^*)}{\pi(\theta^{(t-1)} \mid y) \, q(\theta^* \mid \theta^{(t-1)})} \right\} With probability \alpha, set \theta^{(t)} = \theta^*; otherwise \theta^{(t)} = \theta^{(t-1)}. The normalizing constant p(y) cancels in the ratio, so only the unnormalized posterior is needed. The chain explores high-probability regions because moves that increase posterior density are always accepted, and moves that decrease density are sometimes accepted (with probability shrinking as the proposed move worsens).

The proposal density is a tuning choice. A common default is a symmetric Gaussian random walk: \theta^* = \theta^{(t-1)} + \varepsilon with \varepsilon \sim N(0, \Sigma). The acceptance ratio simplifies to \pi(\theta^* \mid y) / \pi(\theta^{(t-1)} \mid y). If \Sigma is too small, the chain barely moves and mixes slowly. If \Sigma is too large, almost every proposal is rejected. A target acceptance rate around 0.234 is asymptotically optimal for random-walk Metropolis in high dimensions, with higher rates (around 0.44) for single-parameter cases.

Gibbs sampler

When the conditional distributions \pi(\theta_j \mid \theta_{-j}, y) are themselves easy to sample from (often the case in hierarchical models with conjugate components), Gibbs sampling cycles through the parameters one at a time. At each step, sample \theta_j from its full conditional given the current values of all other parameters and the data. The acceptance rate is 1: every Gibbs step is accepted by construction. Mixing can still be slow when parameters are highly correlated.

Hamiltonian Monte Carlo: a preview

Random-walk Metropolis explores by diffusion, which is slow in high dimensions. Hamiltonian Monte Carlo (HMC) augments the parameter \theta with a momentum variable r and simulates physical dynamics on the joint surface defined by -\log \pi(\theta \mid y) + \frac{1}{2} r^\top M^{-1} r. Long, directed trajectories replace short, random steps. The No-U-Turn Sampler (NUTS) automates the tuning. Stan, PyMC, and Turing.jl all use HMC or NUTS as their default engine. The cost is that HMC needs gradients of the log posterior, which restricts it to differentiable models.

Diagnostics

Sampling provides no guarantee that the chain has reached its stationary distribution. Several diagnostics are routine.

Trace plots. Plot \theta^{(t)} against t for each parameter. A healthy chain looks like a fat horizontal band (“hairy caterpillar”). Trending behavior, slow oscillation, or stuck regions all signal trouble.

\hat R (potential scale reduction). Run multiple chains from dispersed starting points. Compare within-chain and between-chain variance. Values close to 1 (say, less than 1.01) indicate the chains have converged to the same distribution. The original Gelman-Rubin statistic has been refined; the current standard is the split-\hat R of Vehtari et al. (2021).

Effective sample size (ESS). Autocorrelated samples carry less information than independent ones. ESS quantifies how many independent draws the chain is equivalent to. A common rule of thumb: at least 400 ESS per quantity of interest, ideally several thousand. Bulk-ESS and tail-ESS distinguish between the central body and the tails of the posterior, both of which need to be well-explored.

Autocorrelation. Plot the empirical autocorrelation of the chain at increasing lags. Slow decay indicates poor mixing. Thinning (keeping every k-th sample) is sometimes used to reduce storage but does not improve precision per unit of computation; running longer chains is almost always better.

Hierarchical models

Hierarchical models put structure on the prior itself. Suppose we have J groups (schools, hospitals, countries) and within-group means \theta_1, \ldots, \theta_J. A complete-pooling analysis ignores the groups: one mean for all. A no-pooling analysis estimates each \theta_j separately and ignores the others. The hierarchical compromise puts a common prior on the group means, \theta_j \sim N(\mu, \tau^2), \qquad \mu, \tau^2 \sim \pi(\mu, \tau^2) with \mu and \tau themselves estimated from the data. The posterior estimate for \theta_j shrinks toward the overall mean \mu, with the degree of shrinkage governed by \tau: small \tau means strong pooling, large \tau means weak pooling.

The benefit is what Gelman and Hill call partial pooling. Groups with few observations borrow strength from the population. Groups with many observations are mostly informed by their own data. The standard reference is Gelman and Hill (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models. The point estimates from a hierarchical model often outperform both pooled and unpooled alternatives in out-of-sample prediction, a phenomenon known as Stein-type shrinkage.

Model comparison

Bayesian model comparison asks how strongly the data support one model over another. The classical answer is the Bayes factor: \text{BF}_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)} the ratio of marginal likelihoods. Jeffreys’ scale calibrates the result (BF > 10 is strong evidence, BF > 100 is decisive, and so on). Bayes factors have one well-known weakness: they are extremely sensitive to the prior on the parameters within each model. Diffuse priors that work fine for posterior inference can produce arbitrarily small Bayes factors because they spread the prior over implausible regions.

Modern Bayesian practice favors predictive criteria. The widely applicable information criterion (WAIC) estimates expected out-of-sample log predictive density. Leave-one-out cross-validation using Pareto-smoothed importance sampling (PSIS-LOO) provides a more reliable estimate, with diagnostic information about which observations are problematic. The standard reference is Vehtari, Gelman, and Gabry (2017). The loo package in R is the de facto implementation. Both WAIC and LOO can be computed from the same posterior samples that produced the analysis; no extra fitting is required.

Worked example: a Beta-Binomial posterior

Suppose we observe y = 7 successes in n = 10 Bernoulli trials and place a \text{Beta}(1, 1) (uniform) prior on the success probability p. The posterior is p \mid y \sim \text{Beta}(1 + 7, \, 1 + 10 - 7) = \text{Beta}(8, 4) Posterior mean: 8 / (8 + 4) = 0.6667. Posterior median: approximately 0.677 (no clean closed form). MAP: (8 - 1) / (8 + 4 - 2) = 0.70. The 95 percent equal-tail credible interval is the pair of Beta quantiles at 0.025 and 0.975. We compute the same posterior four ways.

Excel

The Beta posterior is available directly via BETA.DIST and BETA.INV. No simulation needed for the closed-form case.

Set up the posterior parameters
B1   =1+7          ' alpha_post = 8
B2   =1+10-7       ' beta_post  = 4

Posterior mean and MAP
B3   =B1/(B1+B2)               ' posterior mean = 0.6667
B4   =(B1-1)/(B1+B2-2)         ' MAP = 0.70

95% equal-tail credible interval
B5   =BETA.INV(0.025, B1, B2)  ' lower bound, approx 0.3902
B6   =BETA.INV(0.975, B1, B2)  ' upper bound, approx 0.8911

Posterior density at p = 0.7 (for plotting)
C1   =0,  C2 = 0.01, ..., C101 = 1
D1   =BETA.DIST(C1, $B$1, $B$2, FALSE)
fill D1:D101 then plot scatter (C, D) for the posterior density curve

Posterior probability that p > 0.5
B7   =1 - BETA.DIST(0.5, B1, B2, TRUE)   ' approx 0.887

The Excel approach exploits the conjugate structure fully. For non-conjugate problems, Excel can do single-parameter Metropolis-Hastings in a few hundred rows but rapidly becomes unwieldy.

Stata

* Data: 7 successes in 10 trials, encoded as a 10-row dataset
clear
set obs 10
gen y = 0
replace y = 1 in 1/7

* Bayesian estimation via MCMC
bayesmh y, likelihood(dbernoulli({p})) ///
    prior({p}, beta(1, 1)) ///
    rseed(20260512) mcmcsize(20000) burnin(2000)

* Posterior summary
bayesstats summary

* 95% equal-tail credible interval (default in Stata)
* For HPD use: bayesstats summary, hpd

* Visualize convergence and posterior
bayesgraph diagnostics {p}
bayesgraph histogram {p}

Stata’s bayesmh defaults to a random-walk Metropolis-Hastings sampler with an adaptive proposal. For this trivial problem the closed-form Beta(8, 4) posterior is the truth and the sampler should reproduce it to several decimal places.

R

# Closed-form approach
alpha_post <- 1 + 7
beta_post  <- 1 + 10 - 7
qbeta(c(0.025, 0.975), alpha_post, beta_post)  # 0.3902 0.8911
alpha_post / (alpha_post + beta_post)          # posterior mean 0.6667

# MCMC via rstanarm
library(rstanarm)
y <- c(rep(1, 7), rep(0, 3))
fit <- stan_glm(
  y ~ 1,
  family = binomial(link = "logit"),
  prior_intercept = normal(0, 2.5),   # on logit scale
  chains = 4, iter = 4000, seed = 20260512
)

# Posterior on the logit scale; transform back to probability
post <- as.matrix(fit)
p_draws <- plogis(post[, "(Intercept)"])
quantile(p_draws, c(0.025, 0.5, 0.975))
mean(p_draws)

# Diagnostics
summary(fit)        # includes Rhat and n_eff per parameter
plot(fit, "trace")  # trace plots

# LOO-CV
library(loo)
loo_fit <- loo(fit)
print(loo_fit)

Note that the rstanarm example uses a logit-scale normal prior, not a Beta. To impose a Beta(1, 1) prior on p directly, write the model in Stan with a beta prior on a parameter declared <lower=0, upper=1>. The default rstanarm prior is mildly informative and gives a posterior close to but not identical to the conjugate Beta(8, 4) answer.

Julia

using Turing, Distributions, Random, StatsPlots
Random.seed!(20260512)

@model function coin(y)
    p ~ Beta(1, 1)
    for i in eachindex(y)
        y[i] ~ Bernoulli(p)
    end
end

y = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
chain = sample(coin(y), NUTS(), MCMCThreads(), 2000, 4)

# Posterior summary
summarystats(chain)
quantile(chain)

# Compare with closed form
using SpecialFunctions
posterior = Beta(8, 4)
quantile.(posterior, [0.025, 0.5, 0.975])
mean(posterior)

# Diagnostics
plot(chain)        # trace and density
ess_rhat(chain)    # ESS and R-hat

Turing.jl writes models in idiomatic Julia and supports HMC, NUTS, and several other samplers. The closed-form Beta(8, 4) provides ground truth: the posterior samples should match its mean (0.6667), median (\approx 0.677), and 95 percent quantiles (\approx 0.390, 0.891).

Common traps

Improper priors that lead to improper posteriors. A prior \pi(\theta) \propto 1 on an unbounded parameter space integrates to infinity. It is not a valid probability distribution. In some models the resulting posterior is still proper (the likelihood integrates), but not always. A flat prior on a variance parameter in a hierarchical model can produce an improper posterior, with samplers that appear to mix but actually drift to infinity. Always check that the joint posterior is integrable.

“Flat” priors are not always uninformative. A uniform prior on p \in (0, 1) corresponds to a logit-scale prior that is highly informative about the location of the logit. A uniform prior on a standard deviation puts most of its mass at moderate-to-large values, which can bias hierarchical models. The Jeffreys prior is one principled response; weakly informative priors (Gelman et al., 2008) are another. The takeaway is that flatness is a property of a parameterization, not of inference.

Ignoring \hat R. A posterior summary based on a chain that has not converged is a number with no statistical meaning. \hat R > 1.01 for any parameter of interest invalidates the analysis. Run more iterations, add chains, reparameterize, or rethink the model. Do not paper over divergence with a longer burn-in alone.

Treating credible intervals like confidence intervals. A 95 percent credible interval is a probability statement about \theta given the prior and the data. A 95 percent confidence interval is a frequentist guarantee about the procedure. Reporting a credible interval and interpreting it as a confidence interval (or vice versa) misrepresents what was computed. The two often agree numerically; the interpretation differs.

Using priors that depend on the data. Empirical Bayes is a respectable methodology, but its outputs are not strictly Bayesian credible intervals. They underestimate posterior uncertainty because the prior parameters were tuned to the data. Full hierarchical Bayes is the cleaner alternative when the same kind of information is to be borrowed across groups.

Reporting checklist

State the prior for every parameter and justify the choice. “Weakly informative” without specifics is not enough.
For MCMC: number of chains, iterations per chain, warmup/burn-in, sampler (NUTS, HMC, Gibbs, Metropolis), thinning if any.
Report \hat R and effective sample size (bulk and tail) for every parameter of interest.
Report posterior mean (or median), 95 percent credible interval, and the type of interval (equal-tail or HPD).
Posterior predictive checks: show plots or summary statistics comparing observed data to draws from the posterior predictive distribution.
For model comparison: report WAIC or LOO with standard errors of the difference, not just point estimates.
Provide the code or a reproducible script. Bayesian analyses are dense enough that reproducibility is a kindness.

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC. The standard graduate-level reference, known as BDA3.
McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). CRC Press. A friendlier entry point with strong intuition-building.
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. The standard reference on partial pooling and hierarchical Bayesian regression.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413 to 1432.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved \hat R for assessing convergence of MCMC. Bayesian Analysis, 16(2), 667 to 718.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434. The clearest pedagogical treatment of HMC and NUTS.
Stan Development Team. (2024). Stan Modeling Language: User’s Guide and Reference Manual. mc-stan.org. Reference for the underlying engine of rstanarm, brms, PyMC, and many others.