Probability foundations

Why probability matters for applied statistics

Applied statistics starts with a modest claim: the data in front of us are one possible realization of a process that could have produced different data. We might observe a sample mean of 47.2, a regression coefficient of 1.8, or 12 successes in 40 trials. Those numbers are fixed after we collect them, but before collection they were uncertain. Probability gives language to that uncertainty.

The word “random” does not mean chaotic or unknowable. It means that we describe outcomes through a probability model. A probability model states which outcomes could occur and how likely they are. Once we have that model, we can ask ordinary applied questions with precision. How variable should a sample mean be if the null hypothesis were true? How likely is a false positive when a disease is rare? How much uncertainty should a confidence interval carry when observations are correlated?

A random variable is a numerical summary of uncertain outcomes. A survey response, a count of hospital readmissions, the waiting time until a customer leaves a queue, and an estimated treatment effect can all be treated as random variables before observation. After observation they become numbers. This before-and-after distinction matters. Probability describes the before. Statistics uses the after to learn about the process that generated the data.

Uncertainty is not a nuisance around the edges of applied work. It is the medium. Standard errors, p-values, likelihoods, posterior distributions, prediction intervals, and simulation all rest on probability statements. If the probability model is poor, the statistical conclusion usually inherits the problem.

Sample spaces and events

A probability model begins with a sample space. The sample space, written \Omega, is the set of all possible outcomes from a random experiment. For one roll of a fair die,

\Omega = \{1,2,3,4,5,6\}.

For two rolls of a die, \Omega contains ordered pairs:

\Omega = \{(i,j): i \in \{1,\ldots,6\}, j \in \{1,\ldots,6\}\}.

An event is a set of outcomes. If A is the event “the die shows an even number,” then

A = \{2,4,6\}.

Set notation keeps probability statements clean. The union A \cup B means A or B occurs. The intersection A \cap B means both occur. The complement A^c means A does not occur. The empty set \emptyset is the event with no outcomes.
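These operations map directly onto the set types found in most programming languages. The following is a minimal Python sketch for one die roll. The second event B ("the die shows more than 3") is an illustrative choice that does not appear in the text above.

```python
# Sample space and events for one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # the die shows an even number
B = {4, 5, 6}   # hypothetical second event: the die shows more than 3

union = A | B              # A or B occurs
intersection = A & B       # both A and B occur
complement_A = omega - A   # A does not occur
empty = A & complement_A   # an event and its complement share no outcomes

print(union, intersection, complement_A, empty)
```

Treating events as sets makes it easy to check a verbal statement against the outcomes it actually contains.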

For finite sample spaces, every subset of \Omega can be an event. For infinite sample spaces, especially continuous ones, probability theory needs a little more care. A sigma-algebra, usually written \mathcal{F}, is a collection of events that we are allowed to assign probabilities to. It must contain \Omega, it must contain complements of its events, and it must contain countable unions of its events. Those closure rules guarantee that if we can discuss A_1,A_2,\ldots, then we can also discuss “at least one of them occurs” and “none of them occurs.”

Most applied work does not require manipulating sigma-algebras by hand. The useful intuition is this: a probability model is not just a list of outcomes. It has three parts: the outcomes, the events whose probabilities the model can evaluate, and the probability assigned to each of those events. For continuous random variables, we normally use the Borel sets on the real line, which include intervals such as (a,b], their countable unions, and many related sets.

Probability axioms and basic properties

Kolmogorov’s axioms define probability as a function P from events to numbers. For a sample space \Omega and event collection \mathcal{F}:

P(A) \ge 0 \quad \text{for every } A \in \mathcal{F},

P(\Omega) = 1,

and for any countable sequence of disjoint events A_1,A_2,\ldots,

P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).

The third axiom is countable additivity. It says that when events cannot occur together, the probability that one of them occurs is the sum of their probabilities.

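Several familiar properties follow from these axioms. First, P(\emptyset)=0. Apply countable additivity to the disjoint sequence A_1=\Omega, A_2=A_3=\cdots=\emptyset, whose union is \Omega:

1 = P(\Omega) = P(\Omega) + \sum_{i=2}^{\infty} P(\emptyset),

so the infinite sum must equal zero, which forces P(\emptyset)=0. Padding a finite list of disjoint events with copies of \emptyset then shows that additivity also holds for finitely many disjoint events.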

Second, complements satisfy

P(A^c) = 1 - P(A).

Third, if A \subseteq B, then probability is monotone:

P(A) \le P(B).

To see this, write B as the disjoint union of A and B \cap A^c. Additivity gives

P(B) = P(A) + P(B \cap A^c) \ge P(A).

For events that may overlap, the addition rule is

P(A \cup B) = P(A) + P(B) - P(A \cap B).

We subtract P(A \cap B) because it appears once in P(A) and once in P(B).
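The addition rule and the complement rule can be checked by brute-force enumeration on the two-roll sample space. The sketch below assumes all 36 ordered pairs are equally likely. The events A and B are illustrative choices, and prob is a helper name introduced here.

```python
from fractions import Fraction
from itertools import product

# Sample space for two rolls: 36 equally likely ordered pairs.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event under the equally-likely model."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 6}            # first roll is a six
B = {w for w in omega if w[0] + w[1] >= 10}    # total is at least 10

lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)                          # both sides of the addition rule
print(prob(omega - A) == 1 - prob(A))    # complement rule
```

Exact fractions avoid floating-point noise, so the two sides can be compared with equality rather than a tolerance.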

Conditional probability and independence

Conditional probability updates an event’s probability after another event is known to have occurred. If P(B)>0, then

P(A \mid B) = \frac{P(A \cap B)}{P(B)}.

The denominator restricts the sample space to cases where B occurred. The numerator keeps the cases where both A and B occurred. Conditional probability is a ratio within a restricted set.

The multiplication rule follows immediately:

P(A \cap B) = P(A \mid B)P(B).

For several events,

P(A_1 \cap \cdots \cap A_k) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2)\cdots P(A_k \mid A_1 \cap \cdots \cap A_{k-1}).

Events A and B are independent if

P(A \cap B) = P(A)P(B).

If P(B)>0, this is equivalent to

P(A \mid B) = P(A).

Knowing that B occurred does not change the probability of A. Independence is not the same as being disjoint. If two events are disjoint and both have positive probability, then the occurrence of one rules out the other. They are dependent in the strongest possible way.
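The contrast between independent and disjoint events can be made concrete on the two-roll sample space. This is a small sketch under the equally-likely model. The four events and the helper names prob and independent are illustrative, not part of the text above.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

def independent(A, B):
    """Check the product definition P(A and B) = P(A) P(B)."""
    return prob(A & B) == prob(A) * prob(B)

first_even = {w for w in omega if w[0] % 2 == 0}
second_six = {w for w in omega if w[1] == 6}
first_one = {w for w in omega if w[0] == 1}
first_two = {w for w in omega if w[0] == 2}

print(independent(first_even, second_six))   # events about different rolls
print(independent(first_one, first_two))     # disjoint events with positive probability
```

The first pair satisfies the product rule. The second pair has an empty intersection, so the product rule fails even though neither event has probability zero.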

Bayes’ theorem

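Bayes’ theorem reverses a conditional probability. Start from the definition of conditional probability, P(A \mid B)=P(A \cap B)/P(B), and apply the multiplication rule to the numerator, P(A \cap B)=P(B \mid A)P(A). If P(A)>0 and P(B)>0,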

P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}.

If A_1,\ldots,A_k partition the sample space, then

P(A_j \mid B) = \frac{P(B \mid A_j)P(A_j)} {\sum_{i=1}^{k} P(B \mid A_i)P(A_i)}.

The prior probability P(A_j) enters the numerator. The denominator is P(B), written out with the law of total probability as a weighted average of P(B \mid A_i) with weights P(A_i). This is where base rates enter the calculation.

Medical testing and base rates

Suppose a disease affects 1 percent of a population. A test has sensitivity 0.95, so P(+ \mid D)=0.95. It has specificity 0.90, so P(- \mid D^c)=0.90 and P(+ \mid D^c)=0.10.

The question patients often care about is P(D \mid +), not P(+ \mid D). Bayes’ theorem gives

P(D \mid +) = \frac{P(+ \mid D)P(D)} {P(+ \mid D)P(D) + P(+ \mid D^c)P(D^c)}.

Plugging in the numbers,

P(D \mid +) = \frac{0.95(0.01)}{0.95(0.01) + 0.10(0.99)} = \frac{0.0095}{0.1085} \approx 0.0876.

Even with a sensitive test, the probability of disease after one positive test is about 8.8 percent. The false positive rate matters because the disease is rare. In a population of 10,000 people, about 100 have the disease. The test correctly flags about 95 of them. Of the 9,900 people without the disease, about 990 receive a false positive. Most positive tests therefore come from people who do not have the disease.
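The same calculation can be wrapped in a small function to see how strongly the answer depends on prevalence. This is a sketch, not a clinical tool. The function name posterior_given_positive and the prevalence values other than 0.01 are illustrative assumptions.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(D | +) from Bayes' theorem with the two-event partition {D, D^c}."""
    true_pos = sensitivity * prevalence                 # P(+ | D) P(D)
    false_pos = (1 - specificity) * (1 - prevalence)    # P(+ | D^c) P(D^c)
    return true_pos / (true_pos + false_pos)

# The example from the text: 1 percent prevalence.
print(round(posterior_given_positive(0.01, 0.95, 0.90), 4))

# Hypothetical prevalences, holding the test's accuracy fixed.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    print(prevalence, round(posterior_given_positive(prevalence, 0.95, 0.90), 4))
```

Holding sensitivity and specificity fixed, the posterior rises with prevalence, which is the base-rate effect expressed as a function.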

A short Monty Hall calculation

In the Monty Hall problem, three doors hide one prize. You pick one door. The host, who knows where the prize is, opens a different door that does not contain the prize. You may stay or switch.

Let C be the event that your first choice is correct. Since the first choice is made before any information,

P(C) = \frac{1}{3}.

If your first choice is correct, switching loses. If your first choice is wrong, the prize is behind one of the other two doors and the host must open the one that hides nothing. The remaining unopened door therefore hides the prize, so switching wins. Therefore

P(\text{switch wins}) = P(C^c) = \frac{2}{3}.

The host’s behavior is part of the probability model. If the host opened a door at random and sometimes revealed the prize, the calculation would change.
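The argument can be checked by enumerating every case exactly. The sketch below fixes the player's first pick at door 0, which loses nothing by symmetry. It also assumes that when the host has two losing doors to choose from, he opens each with probability 1/2. That tie-breaking rule is an assumption added here. It does not change the overall switching probability.

```python
from fractions import Fraction

doors = (0, 1, 2)
pick = 0   # by symmetry, fix the player's first choice
switch_wins = Fraction(0)
stay_wins = Fraction(0)

for prize in doors:
    # The host opens a door that is neither the player's pick nor the prize.
    options = [d for d in doors if d != pick and d != prize]
    for opened in options:
        # Prize location is uniform; the host's tie-break is assumed uniform.
        weight = Fraction(1, 3) * Fraction(1, len(options))
        switched_to = next(d for d in doors if d not in (pick, opened))
        if switched_to == prize:
            switch_wins += weight
        else:
            stay_wins += weight

print(switch_wins, stay_wins)
```

Changing the rule that builds options, for example by letting the host open any unpicked door, changes the model and therefore the answer.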

Random variables

A random variable is a function from the sample space to the real line. If X is the result of one die roll, then X(\omega)=\omega for \omega \in \{1,\ldots,6\}. If Y is the indicator of rolling a six, then

Y = \begin{cases} 1, & \text{if } X=6, \\ 0, & \text{otherwise.} \end{cases}

A discrete random variable has a countable set of possible values. Its probability mass function, or PMF, is

p_X(x) = P(X=x).

The PMF satisfies p_X(x) \ge 0 and \sum_x p_X(x)=1.

A continuous random variable is often described by a density function f_X(x). Probabilities are areas under the density:

P(a \le X \le b) = \int_a^b f_X(x)\,dx.

For a continuous random variable, P(X=x)=0 at any single point, even though intervals can have positive probability.

The cumulative distribution function, or CDF, applies to both discrete and continuous variables:

F_X(x) = P(X \le x).

The CDF is nondecreasing, right-continuous, and satisfies

\lim_{x \to -\infty} F_X(x)=0, \qquad \lim_{x \to \infty} F_X(x)=1.
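For a discrete variable the CDF is a step function built by accumulating the PMF. Here is a minimal sketch for one fair die roll. The helper name cdf is introduced for illustration.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x), obtained by summing the PMF over values at or below x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

print(cdf(0), cdf(3), cdf(3.5), cdf(6))

# Interval probabilities are differences of the CDF: P(2 < X <= 5) = F(5) - F(2).
print(cdf(5) - cdf(2))
```

The value at 3.5 equals the value at 3 because the CDF stays flat between support points and jumps only at them.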

Expectation and variance

The expectation of a discrete random variable is the probability-weighted average

E[X] = \sum_x x p_X(x),

when the sum is well-defined. For a continuous random variable with density f_X,

E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.

Expectation is linear. For constants a and b,

E[aX+bY] = aE[X]+bE[Y],

whether or not X and Y are independent. This property drives much of applied statistics. A sample mean,

\bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i,

has expectation

E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n}E[X_i].

If each X_i has mean \mu, then E[\bar{X}]=\mu.
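Linearity does not need independence, which a fully dependent pair makes easy to see. In the sketch below X is one fair die roll and Y = 7 - X is determined by X. The pair and the constants 2 and 3 are illustrative choices, and expect is a helper name introduced here.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g):
    """E[g(X)] for one fair die roll X."""
    return sum(g(x) * p for x, p in pmf.items())

EX = expect(lambda x: x)
EY = expect(lambda x: 7 - x)                    # Y = 7 - X depends entirely on X
E_sum = expect(lambda x: 2 * x + 3 * (7 - x))   # E[2X + 3Y] computed directly

print(E_sum, 2 * EX + 3 * EY)                   # linearity holds
print(expect(lambda x: x * (7 - x)), EX * EY)   # E[XY] differs from E[X]E[Y]
```

The expectation of the sum splits term by term, while the expectation of the product does not factor, which is where the dependence shows up.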

Variance measures spread around the mean:

\operatorname{Var}(X)=E[(X-E[X])^2].

An equivalent computational form is

\operatorname{Var}(X)=E[X^2]-(E[X])^2.

For constants a and b,

\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X).

Covariance measures how two random variables move together:

\operatorname{Cov}(X,Y)=E[(X-E[X])(Y-E[Y])].

The variance of a sum is

\operatorname{Var}(X+Y) = \operatorname{Var}(X)+\operatorname{Var}(Y) +2\operatorname{Cov}(X,Y).

For n variables,

\operatorname{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\operatorname{Var}(X_i) +2\sum_{i<j}\operatorname{Cov}(X_i,X_j).

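If the variables are independent, all covariance terms are zero and the variance of the sum is the sum of the variances. If they are positively correlated, the variance of the sum is larger than that. This is why clustered data need clustered standard errors: observations within a cluster tend to be positively correlated, so a calculation that assumes independence understates the variance.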
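The two-variable formula can be verified exactly on a small dependent pair. In this sketch X is the first of two fair die rolls and Y is the total of both rolls, so they share a component and are positively correlated. The pair and the helper names are illustrative, and covariance is computed with the equivalent form Cov(X,Y)=E[XY]-E[X]E[Y].

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

def expect(g):
    return sum(g(w) * p for w in outcomes)

def cov(g, h):
    """Cov(g, h) = E[gh] - E[g]E[h]; cov(g, g) is the variance of g."""
    return expect(lambda w: g(w) * h(w)) - expect(g) * expect(h)

def X(w):
    return w[0]           # first roll

def Y(w):
    return w[0] + w[1]    # total of both rolls, which shares the first roll with X

def total(w):
    return X(w) + Y(w)

var_sum = cov(total, total)                       # Var(X + Y) computed directly
pieces = cov(X, X) + cov(Y, Y) + 2 * cov(X, Y)    # right-hand side of the formula
print(var_sum, pieces, cov(X, X) + cov(Y, Y))
```

The last printed value is what an independence assumption would give, and it falls short of the true variance by exactly twice the covariance.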

Moment generating functions and characteristic functions

The moment generating function of X is M_X(t)=E[e^{tX}], when the expectation exists in an open interval around 0. Derivatives at zero recover moments: M_X'(0)=E[X] and M_X''(0)=E[X^2]. Characteristic functions use complex numbers: \varphi_X(t)=E[e^{itX}]. They always exist because |e^{itX}|=1. In applied work, these functions matter most when deriving distributions of sums, proving limit theorems, or checking whether a proposed distributional result is correct. For routine data analysis, CDFs, densities, simulation, and likelihoods are usually more direct.
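The claim that derivatives at zero recover moments can be checked numerically for one fair die roll. The sketch below approximates the derivatives with central finite differences, which stand in for symbolic differentiation. The step size h is an assumption chosen for illustration.

```python
import math

def mgf(t):
    """M_X(t) = E[exp(tX)] for one fair die roll."""
    return sum(math.exp(t * x) for x in range(1, 7)) / 6

h = 1e-4  # step size for the finite-difference approximations (an assumption)
first = (mgf(h) - mgf(-h)) / (2 * h)               # approximates M'(0) = E[X]
second = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h**2    # approximates M''(0) = E[X^2]

print(first, second, second - first**2)
```

The three printed values should be close to 3.5, 91/6, and 35/12, with small numerical error from the finite differences.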

Worked example: rolling a fair die N times

Let X be the result of one fair die roll. The support is \{1,2,3,4,5,6\} and each outcome has probability 1/6.

The expected value is

E[X] = \sum_{x=1}^{6} x\frac{1}{6} = \frac{1+2+3+4+5+6}{6} = 3.5.

The second moment is

E[X^2] = \frac{1^2+2^2+3^2+4^2+5^2+6^2}{6} = \frac{91}{6}.

Therefore

\operatorname{Var}(X) = E[X^2]-(E[X])^2 = \frac{91}{6} - 3.5^2 = \frac{35}{12} \approx 2.9167.

Let S_N be the number of sixes in N independent rolls. Then

S_N \sim \operatorname{Binomial}\left(N,\frac{1}{6}\right).

The probability of at least one six is

P(S_N \ge 1) = 1 - P(S_N=0) = 1 - \left(\frac{5}{6}\right)^N.

For N=20, this is

1-\left(\frac{5}{6}\right)^{20} \approx 0.9739.

The sample mean of N rolls,

\bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X_i,

has

E[\bar{X}_N]=3.5, \qquad \operatorname{Var}(\bar{X}_N)=\frac{35}{12N},

when the rolls are independent.
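For small N these formulas can be confirmed by enumerating every possible sequence of rolls, with no simulation. The sketch below assumes independent fair rolls, so all 6^N sequences are equally likely. The function name mean_moments is introduced for illustration, and the enumeration is only practical for small N.

```python
from fractions import Fraction
from itertools import product

def mean_moments(n_rolls):
    """Exact mean and variance of the sample mean over all 6**n_rolls sequences."""
    outcomes = list(product(range(1, 7), repeat=n_rolls))
    p = Fraction(1, len(outcomes))
    means = [Fraction(sum(w), n_rolls) for w in outcomes]
    e = sum(m * p for m in means)
    v = sum((m - e) ** 2 * p for m in means)
    return e, v

for n in (1, 2, 3, 4):
    e, v = mean_moments(n)
    print(n, e, v, v == Fraction(35, 12 * n))

# P(at least one six in 20 rolls), from the complement rule.
print(float(1 - Fraction(5, 6) ** 20))
```

The exact values give a fixed target against which the simulated output in the language-specific sections can be compared.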

Excel

The following formulas assume the number of rolls N is in cell B1, with 20 as an example. Put simulated die rolls in A2:A21.

B1: 20

# Simulate one fair die roll in A2, then fill down through A21
=RANDBETWEEN(1, 6)

# Alternative built from RAND(); MAX guards the edge case where RAND() returns exactly 0
=MAX(1, CEILING(6*RAND(), 1))

# Sample mean and sample variance of the simulated rolls
=AVERAGE(A2:A21)
=VAR.S(A2:A21)

# Exact theoretical mean and variance for one fair die
# (array constants cannot contain expressions such as 1/6, so divide the sum by 6)
=SUMPRODUCT({1,2,3,4,5,6})/6
=SUMPRODUCT(({1,2,3,4,5,6}-3.5)^2)/6

# Number of sixes in the simulated rolls
=COUNTIF(A2:A21, 6)

# Exact probability of at least one six in N rolls
=1-(5/6)^B1

# Same probability using the binomial CDF
=1-BINOM.DIST(0, B1, 1/6, TRUE)

# Probability of exactly 3 sixes in N rolls
=BINOM.DIST(3, B1, 1/6, FALSE)

# Normal approximation for the sample mean, Pr(mean <= 4)
=NORM.DIST(4, 3.5, SQRT((35/12)/B1), TRUE)

Excel recalculates RAND() whenever the sheet updates. For a stable classroom example, copy the simulated rolls and paste values before discussing the output.

Stata

clear all
set seed 20260512

local N = 20
set obs `N'

gen die = ceil(runiform()*6)
summarize die

gen is_six = die == 6
summarize is_six
count if is_six

display "Theoretical E[X] = " (1+2+3+4+5+6)/6
display "Theoretical Var(X) = " 35/12
display "Pr(at least one six) = " 1 - (5/6)^`N'

display "Pr(exactly 3 sixes) = " binomialp(`N', 3, 1/6)
display "Pr(no more than 3 sixes) = " binomial(`N', 3, 1/6)
display "Normal approx Pr(mean <= 4) = " normal((4 - 3.5)/sqrt((35/12)/`N'))

The command summarize die reports the sample mean and standard deviation of the simulated rolls. Add the detail option, as in summarize die, detail, to see the sample variance as well. The display commands give the target values from the probability model.

R

set.seed(20260512)

N <- 20
die <- sample(1:6, size = N, replace = TRUE)

mean(die)
var(die)

is_six <- die == 6
sum(is_six)
mean(is_six)

EX <- mean(1:6)
VarX <- mean((1:6 - EX)^2)

EX
VarX
1 - (5/6)^N

dbinom(3, size = N, prob = 1/6)
pbinom(0, size = N, prob = 1/6, lower.tail = FALSE)

# A normal approximation for the sample mean
pnorm(4, mean = EX, sd = sqrt(VarX / N))

# A chi-square probability, useful later for variance work
pchisq(10, df = 5)

R’s sample() gives a direct simulation of the die rolls. The binomial functions work with the count of sixes rather than the die value itself.

Julia

using Random
using Distributions
using Statistics

Random.seed!(20260512)

N = 20
die = rand(1:6, N)

mean(die)
var(die)

is_six = die .== 6
sum(is_six)
mean(is_six)

EX = mean(1:6)
VarX = mean((collect(1:6) .- EX).^2)

EX
VarX
1 - (5/6)^N

S = Binomial(N, 1/6)
pdf(S, 3)
1 - cdf(S, 0)

NormalApprox = Normal(EX, sqrt(VarX / N))
cdf(NormalApprox, 4)

Julia separates random number generation from distribution objects. rand(1:6, N) simulates rolls. Binomial(N, 1/6) creates the distribution for the number of sixes.

Common traps

The most common Bayes error is confusing P(A \mid B) with P(B \mid A). A medical test can have a high probability of being positive among people with a disease while the probability of disease among positive tests remains low. The direction of the conditioning bar carries the substance of the question.

The prosecutor’s fallacy is a version of the same error. A forensic expert may report that the evidence is rare among innocent people, that is, that P(E \mid I) is small. That is not the same as saying that the probability of innocence after observing the evidence, P(I \mid E), is small. The latter depends on the prior odds of guilt and on how likely the evidence is under guilt.
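A numerical sketch shows how far apart the two conditional probabilities can be. All three inputs below are hypothetical numbers chosen for illustration. They do not describe any real case or forensic method.

```python
from fractions import Fraction

# All three inputs are hypothetical numbers chosen for illustration.
p_e_given_innocent = Fraction(1, 10_000)   # evidence is rare among innocent people
p_e_given_guilty = Fraction(1)             # evidence is certain under guilt
prior_guilty = Fraction(1, 100_001)        # one true source among 100,001 candidates

prior_innocent = 1 - prior_guilty

# Law of total probability for P(E), then Bayes' theorem for P(I | E).
p_e = p_e_given_innocent * prior_innocent + p_e_given_guilty * prior_guilty
p_innocent_given_e = p_e_given_innocent * prior_innocent / p_e

print(p_e_given_innocent, p_innocent_given_e, float(p_innocent_given_e))
```

Under these assumed inputs, evidence that appears in only 1 of 10,000 innocent people still leaves the probability of innocence at 10/11, because the pool of innocent candidates is so large.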

Mutually exclusive events cannot occur together. Independent events do not change each other’s probabilities. If A and B are mutually exclusive with positive probabilities, then P(A \cap B)=0 but P(A)P(B)>0, so they are not independent.

Base rates are easy to ignore because the likelihood often sounds more diagnostic than it is. A rare disease, rare fraud event, or rare equipment failure can produce many false alarms even when a classifier has good sensitivity and specificity.

Another trap is reading a density value as a probability. For a continuous random variable, f_X(x) is not P(X=x). Probabilities come from integrating over intervals.
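A density value can exceed 1, which a probability never can. The sketch below uses a normal distribution with mean 0 and standard deviation 0.1. Those parameters are hypothetical, chosen only so that the peak of the density is above 1. The helper functions are written with the standard library rather than taken from a statistics package.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f_X(x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Normal CDF F_X(x), written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0.0, 0.1   # hypothetical parameters, chosen so the peak density exceeds 1
density_at_mean = normal_pdf(mu, mu, sigma)
prob_small_interval = normal_cdf(0.01, mu, sigma) - normal_cdf(-0.01, mu, sigma)

print(density_at_mean)       # a density value, which can be larger than 1
print(prob_small_interval)   # a probability, always between 0 and 1
```

The density near 4 is a rate per unit of x. Multiplying it by the interval width 0.02 gives roughly the interval probability, which is the sense in which probabilities are areas.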

Finally, independence is a modeling assumption, not a default property of data. Repeated measurements on the same person, observations from the same school, and transactions from the same firm usually share unobserved features. Treating them as independent can make uncertainty look smaller than it is.
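The size of the understatement can be worked out from the variance-of-a-sum formula. The sketch below assumes every observation has the same variance and every pair has the same correlation rho. That equal-correlation structure is a simplifying assumption, not a general model of clustering, and the numeric inputs are hypothetical.

```python
def var_of_mean(sigma2, n, rho):
    """Variance of the mean of n variables with common variance sigma2
    and common pairwise correlation rho (an equal-correlation assumption)."""
    # Sum of variances plus 2 * (sum over pairs of covariances);
    # there are n(n-1)/2 pairs, each with covariance rho * sigma2.
    var_sum = n * sigma2 + n * (n - 1) * rho * sigma2
    return var_sum / n**2

sigma2, n = 4.0, 25   # hypothetical values
for rho in (0.0, 0.05, 0.2):
    naive = sigma2 / n                    # variance of the mean under independence
    actual = var_of_mean(sigma2, n, rho)
    print(rho, naive, actual, actual / naive)
```

The ratio in the last column equals 1 + (n - 1) rho under this assumption, so even a modest correlation inflates the variance substantially when n is large.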

Reporting checklist

Before reporting a probability calculation, state the random experiment and the sample space in plain language.

Define the event or random variable being measured. Use notation such as A, B, X, and S_N only after the words are clear.

State whether probabilities are unconditional or conditional. If conditional, name the conditioning event.

Report the base rate when using Bayes’ theorem. Without it, the posterior probability cannot be checked.

For random variables, identify whether the distribution is discrete or continuous. Use PMF for discrete variables and PDF or density for continuous variables.

Give expectations and variances with units when the variable has units.

State independence assumptions. If the data are clustered, repeated, matched, or spatially connected, explain how dependence enters the analysis.

When using simulation, set the random seed and report the number of simulations or rolls.

When a formula has an exact value and a simulated estimate, report both. The exact value is the target. The simulation shows sampling variation.
