Point estimation: MOM and MLE

Why point estimation matters

Statistical inference seeks to move from a finite sample of data to conclusions about the broader population. In many applications, we want to summarize our knowledge of a population parameter using a single value. This value is a point estimate. While interval estimation provides a range of plausible values, the point estimate serves as our best guess or most likely value for the parameter under the given model.

A point estimator is a rule or formula, usually denoted as a function of the sample data X_1, X_2, \dots, X_n. We denote a parameter by \theta and its estimator by \hat{\theta} = W(X_1, \dots, X_n). Because the estimator depends on a random sample, it is itself a random variable with its own probability distribution, known as the sampling distribution. The specific value calculated from a particular dataset is the estimate. We choose estimators based on their sampling properties, such as how close they are on average to the true value and how much they vary across different samples. In a Master’s-level context, we are concerned not just with finding “a” number, but with finding an estimator that possesses optimal properties like unbiasedness, consistency, and efficiency.

The method of moments

The method of moments (MOM) is one of the oldest methods for deriving estimators. It relies on the intuition that population characteristics should be reflected in sample characteristics. Specifically, we equate population moments to sample moments and solve for the parameters of interest.

Intuition and approach

The k-th population moment is defined as the expected value of the random variable raised to the k-th power: \mu_k = E[X^k] The k-th sample moment is the average of the observed values raised to the k-th power: \hat{\mu}_k = \frac{1}{n} \sum_{i=1}^n X_i^k

If a model has m unknown parameters \theta_1, \theta_2, ..., \theta_m, we express the first m population moments as functions of these parameters. For many common distributions, these moments are well-known or can be derived using the moment-generating function. We then set these population moments equal to the corresponding sample moments: \mu_1(\theta_1, ..., \theta_m) = \hat{\mu}_1 \mu_2(\theta_1, ..., \theta_m) = \hat{\mu}_2 ... \mu_m(\theta_1, ..., \theta_m) = \hat{\mu}_m

Solving this system of equations for \theta yields the MOM estimators. For example, in a Normal distribution N(\mu, \sigma^2), the first population moment is \mu and the second is \mu^2 + \sigma^2. Setting these to \bar{X} and \frac{1}{n} \sum X_i^2 allows us to solve for \hat{\mu} and \hat{\sigma}^2.

When it works

MOM is often simpler to compute than other methods, particularly when the likelihood function is complex or difficult to maximize. It does not always require a full specification of the probability density function, only the existence of the relevant moments. However, MOM estimators are not always the most efficient. They may have higher variance than maximum likelihood estimators because they do not utilize all the information contained in the sample beyond the moments. They are generally consistent under mild conditions, meaning they converge to the true parameter as the sample size increases. This consistency follows from the Law of Large Numbers, which ensures that sample moments converge to their population counterparts.

Maximum likelihood estimation

Maximum likelihood estimation (MLE) is a more computationally intensive but theoretically powerful approach. It identifies the parameter values that make the observed data most probable under the assumed model.

The likelihood function

Given a sample X_1, ..., X_n of independent and identically distributed (i.i.d.) random variables from a distribution with density or mass function f(x; \theta), the joint density of the sample is the product of the individual densities: L(\theta | x_1, \dots, x_n) = \prod_{i=1}^n f(x_i; \theta) This function is the likelihood function. It is important to distinguish the likelihood from the density. While the density f(x; \theta) is a function of x given a fixed \theta, the likelihood L(\theta) is a function of \theta given the observed data.

Log-likelihood

In practice, it is usually easier to maximize the natural logarithm of the likelihood function, known as the log-likelihood: \ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i; \theta) Since the logarithm is a strictly increasing function, the value of \theta that maximizes the log-likelihood also maximizes the likelihood. The log-likelihood transformation is beneficial because it turns the product in the likelihood function into a sum, which is much easier to differentiate.

Common MLE examples

Bernoulli distribution

For a Bernoulli trial with success probability p, where X \in \{0, 1\}: f(x; p) = p^x (1-p)^{1-x} The log-likelihood for n observations is: \ell(p) = \sum_{i=1}^n [x_i \ln p + (1-x_i) \ln(1-p)] = (\sum x_i) \ln p + (n - \sum x_i) \ln(1-p) Taking the derivative with respect to p and setting it to zero: \frac{d\ell}{dp} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0 Solving for p gives \hat{p} = \frac{1}{n} \sum x_i = \bar{X}, which is the sample proportion of successes.

Normal distribution

For X \sim N(\mu, \sigma^2), the log-likelihood for n observations is: \ell(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 To find the MLEs, we take partial derivatives with respect to \mu and \sigma^2. For \mu: \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum (x_i - \mu) = 0 \implies \hat{\mu} = \bar{X} For \sigma^2, we treat it as a single parameter \theta: \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum (x_i - \mu)^2 = 0 \implies \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum (x_i - \bar{X})^2 Note that this MLE of the variance is biased in finite samples, as it divides by n rather than n-1.

Poisson distribution

For a Poisson rate \lambda: f(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!} The log-likelihood is: \ell(\lambda) = \sum [-\lambda + x_i \ln \lambda - \ln(x_i!)] = -n\lambda + (\sum x_i) \ln \lambda - \sum \ln(x_i!) Setting the derivative to zero: -n + \frac{\sum x_i}{\lambda} = 0 \implies \hat{\lambda} = \bar{X}

Properties of estimators

We evaluate the quality of estimators based on several rigorous criteria.

Consistency

An estimator \hat{\theta}_n is consistent if it converges in probability to the true parameter \theta as the sample size n tends to infinity. This means that with enough data, the estimator will eventually yield the correct value with certainty. Formally, for any \epsilon > 0: \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0 MLEs are generally consistent under mild regularity conditions, such as the support of the distribution not depending on the parameter.

Asymptotic normality

Under certain conditions, MLEs are not only consistent but also follow a normal distribution in the limit: \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, V) where V is the asymptotic variance. This property is crucial because it allows us to construct confidence intervals and perform hypothesis tests for parameters even when the exact finite-sample distribution of the estimator is unknown or complex.

Efficiency and Fisher Information

The Fisher Information I(\theta) is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter \theta. It is defined as the variance of the score function (the derivative of the log-likelihood): I(\theta) = E\left[ \left( \frac{\partial}{\partial \theta} \ln f(X; \theta) \right)^2 \right] Under regularity conditions, this is also equal to the negative expected value of the second derivative: I(\theta) = -E\left[ \frac{\partial^2}{\partial \theta^2} \ln f(X; \theta) \right] The Cramer-Rao Lower Bound (CRLB) provides a theoretical minimum for the variance of any unbiased estimator \hat{\theta}. Specifically: Var(\hat{\theta}) \geq \frac{1}{n I(\theta)} An estimator that achieves this lower bound is said to be efficient. While some unbiased estimators may not reach the CRLB in finite samples, MLEs are asymptotically efficient, meaning they reach the CRLB as n \to \infty.

Sufficient statistics

A statistic T(X) is sufficient for a parameter \theta if the conditional distribution of the sample X given T(X) does not depend on \theta. Intuitively, this means that T(X) contains all the information about \theta that is available in the data. Once we know the value of T(X), looking at the individual data points X_1, \dots, X_n adds no further knowledge about the parameter.

Factorization Theorem

The Fisher-Neyman Factorization Theorem provides a practical way to identify sufficient statistics. It states that T(X) is sufficient for \theta if and only if the joint density can be factored into two parts: f(x; \theta) = g(T(x), \theta) h(x) where g depends on the data only through the statistic T(x), and h is a function of the data that does not involve \theta. For example, in the Poisson distribution, the sum of the observations \sum X_i is a sufficient statistic for the rate \lambda.

Bias-variance tradeoff

The Mean Squared Error (MSE) is a common metric for evaluating the overall performance of an estimator. It decomposes into two components: MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2 where Bias(\hat{\theta}) = E[\hat{\theta}] - \theta. This decomposition highlights a fundamental tradeoff in statistics. An unbiased estimator (Bias = 0) may have a very high variance, while a slightly biased estimator might have a much lower variance, leading to a smaller overall MSE. We often prefer an estimator with a small bias if it yields a significant gain in precision. For instance, in high-dimensional settings, shrinkage estimators like Ridge or Lasso regression deliberately introduce bias to reduce variance and improve predictive performance.

Worked example: Exponential rate parameter

Suppose we have a sample of 100 observations from an exponential distribution with rate parameter \lambda. The density is f(x; \lambda) = \lambda e^{-\lambda x} for x \geq 0. We wish to estimate \lambda.

MOM Estimation

The first population moment (the mean) for an exponential distribution is E[X] = 1/\lambda. Equating this to the sample mean \bar{X}: 1/\lambda = \bar{X} \implies \hat{\lambda}_{MOM} = 1/\bar{X} The MOM estimator of the rate is simply the reciprocal of the sample average.

MLE Estimation

The log-likelihood for n observations is: \ell(\lambda) = \sum_{i=1}^n (\ln \lambda - \lambda x_i) = n \ln \lambda - \lambda \sum x_i To find the maximum, we take the derivative with respect to \lambda: \frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0 \implies \hat{\lambda}_{MLE} = \frac{n}{\sum x_i} = 1/\bar{X} In this specific case, the MOM and MLE estimators are identical. This is often true for parameters that are simple functions of the mean in exponential family distributions.

Excel

To compute the MOM estimate, use the average of the data and take its reciprocal. For MLE, we can use the Solver add-in to numerically maximize the log-likelihood.

A1:A100: [Data]
B1: =1/AVERAGE(A1:A100)  (MOM Estimate)

For MLE via Solver:
C1: 0.5 (Initial guess for lambda)
D1: =LN(C$1)-C$1*A1 (Log-likelihood for first observation)
D1:D100: [Fill down]
E1: =SUM(D1:D100) (Total Log-likelihood)
Go to Data > Solver.
Set Objective: E1
To: Max
By Changing Variable Cells: C1
Constraint: C1 >= 0.0001
Click Solve.

Stata

Stata provides a direct way to calculate the mean and a flexible framework for maximum likelihood.

* Suppose data is in variable 'x'
* MOM
mean x
scalar lambda_mom = 1/_b[x]
display "MOM Estimate: " lambda_mom

* MLE using the built-in mlexp command
* This command numerically maximizes the expression
mlexp (ln({lambda}) - {lambda}*x)

R

R offers several ways to perform estimation, from manual calculation to dedicated optimization routines.

# Generate some data for the example
set.seed(42)
x <- rexp(100, rate = 0.5)

# MOM
lambda_mom <- 1 / mean(x)

# MLE using MASS package (standard for many distributions)
library(MASS)
fit <- fitdistr(x, "exponential")
lambda_mle <- fit$estimate

# MLE using optim for a custom likelihood
# We minimize the negative log-likelihood
ll_exp <- function(lambda, data) {
  if (lambda <= 0) return(Inf)
  -sum(dexp(data, rate = lambda, log = TRUE))
}
res <- optim(par = 1, fn = ll_exp, data = x, method = "Brent", lower = 0, upper = 10)
lambda_mle_optim <- res$par

Julia

Julia’s Distributions.jl package is highly optimized for these tasks.

using Distributions
using Statistics

# Generate sample data
d_true = Exponential(1/0.5) # Julia uses the scale (1/rate) parameterization
x = rand(d_true, 100)

# MOM (Manual)
lambda_mom = 1 / mean(x)

# MLE using Distributions.jl
# fit_mle returns a distribution object with estimated parameters
fit_obj = Distributions.fit_mle(Exponential, x)
# Extract the mean parameter and convert to rate
lambda_mle = 1 / params(fit_obj)[1]

println("MOM: ", lambda_mom)
println("MLE: ", lambda_mle)

Common traps

One common mistake involves confusing the likelihood function with the probability density function. While they share the same functional form, they are interpreted differently. The density is a function of the data given the parameter, used to predict future data. The likelihood is a function of the parameter given the data, used to estimate the parameter.

Another trap is failing to check parameter identification. If multiple parameter values lead to the same likelihood for any data, the parameter is not identified, and the MLE is not unique. This often happens in models with redundant variables or complex latent structures.

Researchers often use small-sample MLEs without acknowledging finite-sample bias. The most famous example is the MLE of the variance in a normal distribution, which is \frac{1}{n} \sum (x_i - \bar{X})^2. This estimator systematically underestimates the true variance. The unbiased estimator uses n-1 in the denominator to account for the fact that one degree of freedom is used to estimate the mean. While this bias disappears as the sample size grows, it can be substantial in small datasets.

Finally, one should ensure that the likelihood function is indeed being maximized. Numerical optimizers can sometimes get stuck in local maxima or fail to converge if the likelihood surface is flat or has multiple peaks. Always check the convergence status and consider using different starting values.

Reporting checklist

  • Specify the likelihood function or the specific moment equations used for estimation.
  • Report the point estimate clearly, along with its units of measurement.
  • Include a measure of uncertainty, such as the standard error or the Fisher Information.
  • State whether any small-sample corrections (like the n-1 adjustment for variance) were applied.
  • For numerical MLE, report the starting values used and confirm that the algorithm achieved convergence to a global maximum.
  • Discuss whether the assumptions required for consistency and efficiency are likely met in the given context.

References

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury Press.
  • Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.