Lecture 8

Bayesian Inference

Week 8 · 90 minutes · Required reading: Jaynes, Ch. 1-3

1. Probability as Extended Logic

From Logic to Plausible Reasoning

Traditional deductive logic is based on syllogisms that provide certain conclusions:

If A is true, then B is true

A is true


Therefore, B is true

If A is true, then B is true

B is false


Therefore, A is false

But real-world reasoning often requires handling uncertainty. Consider this scenario:

Suppose some dark night a policeman walks down a street. He hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out of the window, carrying a bag full of jewelry. The policeman decides immediately that the gentleman is a thief. How does he decide this? — E.T. Jaynes, Probability Theory

Let's analyze this:

The policeman appears to use a weak syllogism:

If A is true, then B is likely

B is true


Therefore, A is likely

Key Question: Why does this reasoning seem so sound? How can we formalize this type of plausible reasoning?

Developing a Theory of Plausible Reasoning

To formalize reasoning under uncertainty, our theory must satisfy:

  1. Degrees of plausibility are represented by real numbers
  2. Qualitative correspondence with common sense
  3. Consistency:
    • All valid reasoning routes give the same result
    • Equivalent states of knowledge have equivalent plausibilities

Cox's Theorem

These requirements uniquely identify the rules of probability theory!

  • Richard Cox (1946) showed that any consistent system for plausible reasoning must follow the rules of probability
  • Probability theory is thus the unique extension of logic to handle uncertainty

The Rules of Probability

From Cox's axioms, we derive:

Fundamental Rules

  • Probability: $P(A|B)$ is the degree of plausibility of proposition A given that B is true
  • Product Rule: $P(A,B|C) = P(A|B,C)P(B|C)$
  • Sum Rule: $P(A|B) + P(\neg A|B) = 1$

By convention: $P(A) = 0$ means A is certainly false; $P(A) = 1$ means A is certainly true

That's it! These simple rules are sufficient for all of Bayesian statistics!
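
As a quick check, the product and sum rules can be verified numerically. Here is a minimal sketch in Python with NumPy, using a made-up joint distribution over two binary propositions:

```python
import numpy as np

# Hypothetical joint distribution P(A, B): rows index A in {true, false},
# columns index B in {true, false}. (All probabilities are implicitly
# conditional on the same background information.)
P_AB = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_B = P_AB.sum(axis=0)       # marginal P(B), summing over A
P_A_given_B = P_AB / P_B     # conditional P(A|B)

# Product rule: P(A, B) = P(A|B) P(B)
assert np.allclose(P_A_given_B * P_B, P_AB)

# Sum rule: P(A|B) + P(not A|B) = 1 for each value of B
assert np.allclose(P_A_given_B.sum(axis=0), 1.0)
```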

Notation and Continuous Variables

For continuous variables, we use probability densities: $p(x)\,dx$ is the probability that $x$ lies in the small interval $(x, x + dx)$, so that $P(a < x < b) = \int_a^b p(x)\,dx$. The product and sum rules carry over, with sums over discrete alternatives replaced by integrals.

Bayesian vs. Frequentist Interpretations

Frequentist

  • Probability = long-run frequency
  • Requires repeatable experiments
  • Assumes randomness is inherent
  • Cannot assign probabilities to hypotheses

Bayesian

  • Probability = degree of belief
  • Works for unique events
  • Uncertainty due to incomplete information
  • Can reason about any proposition

Example illustrating the difference:

Figure: Bayesian dice inference. A single proposition can have different probabilities depending on the available information.

Justifications for the Bayesian Approach

Axiomatic approach
Derivation from axioms for manipulating reasonable expectations (Cox's theorem)
Dutch book approach
If your betting odds violate the rules of probability, an opponent can construct a set of bets (a "Dutch book") that guarantees you lose
Decision theory approach
Every admissible statistical procedure is either Bayesian or a limit of Bayesian procedures

2. Bayesian Inference

What is Inference?

Statistical Inference

The act of drawing conclusions from premises that are not sufficient to determine those conclusions with certainty.

Example: The Urn Problem

Setup

Figure: An urn with colored balls.
  • An urn contains 11 balls: $N_r$ red and $11 - N_r$ blue
  • We draw a ball, record its color, and replace it
  • Repeat 3 times, obtaining: R, B, R

Question: What is $P(N_r | d_1=R, d_2=B, d_3=R)$?

Solution

First, calculate the likelihood:

$$P(d_1=R, d_2=B, d_3=R | N_r) = \frac{N_r}{11} \times \frac{11-N_r}{11} \times \frac{N_r}{11} = \frac{N_r^2(11-N_r)}{11^3}$$

Apply the product rule to get:

$$P(N_r | R,B,R) = \frac{P(R,B,R | N_r) P(N_r)}{P(R,B,R)}$$

Where $P(N_r)$ is the prior over the number of red balls (for example, uniform over $N_r = 0, 1, \ldots, 11$) and the normalizing constant is $P(R,B,R) = \sum_{N_r=0}^{11} P(R,B,R|N_r)\,P(N_r)$.

Figure: Posterior distribution for the number of red balls given the observations R, B, R.
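
A minimal sketch of this calculation in Python with NumPy; the uniform prior over $N_r$ is an assumption made here for illustration:

```python
import numpy as np

N_total = 11
N_r = np.arange(N_total + 1)          # possible numbers of red balls: 0, ..., 11

# Likelihood of the sequence R, B, R given N_r (draws with replacement)
likelihood = (N_r / N_total) ** 2 * (N_total - N_r) / N_total

prior = np.full(N_total + 1, 1 / (N_total + 1))   # assumed uniform prior over N_r

evidence = np.sum(likelihood * prior)             # P(R,B,R), the normalizing constant
posterior = likelihood * prior / evidence         # Bayes' theorem

for n, p in zip(N_r, posterior):
    print(f"P(N_r = {n:2d} | R,B,R) = {p:.3f}")
```

With this prior, the posterior peaks at $N_r = 7$ and vanishes at $N_r = 0$ and $N_r = 11$, which cannot produce the observed sequence.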

Bayes' Theorem

The urn example illustrates the general form of Bayes' theorem:

Bayes' Theorem

$$\color{cyan}{P(\theta_M|D,M)} = \frac{\color{orange}{P(D|\theta_M,M)}\color{red}{P(\theta_M|M)}}{\color{lime}{P(D|M)}}$$

Where:

  • Posterior $P(\theta_M|D,M)$: Updated belief about parameters after seeing data
  • Likelihood $P(D|\theta_M,M)$: Probability of data given parameters
  • Prior $P(\theta_M|M)$: Initial belief about parameters
  • Evidence $P(D|M)$: Marginal likelihood, normalizing constant
Key insight: Bayes' theorem is just the product rule of probability rearranged! It's not a special assumption or axiom.

3. Prior Probabilities

What is a Prior?

Prior Probability

A prior is simply a probability — specifically, the probability of your hypothesis in the absence of the specific data you're about to analyze.

Key points about priors:

  • Every inference requires a prior; there is no inference without assumptions
  • The prior describes your state of knowledge before seeing the data; the posterior describes it afterwards
  • As the data become more informative, the likelihood increasingly dominates the prior

Priors for Discrete Variables

Figure: Poisson distributions for different rate parameters.

Priors for Continuous Variables

For bounded continuous variables $a < x < b$, a natural default is the uniform prior $p(x) = 1/(b-a)$.

For positive rate parameters $\lambda > 0$, a common choice is a prior that is uniform in $\log\lambda$, i.e. $p(\lambda) \propto 1/\lambda$, expressing ignorance about the scale of $\lambda$.

Improper Priors

Warning: Some priors cannot be normalized!

Figure: A uniform prior on an unbounded domain cannot be normalized.

Remember: an improper prior sometimes still yields a proper posterior, but this must be checked, and the evidence $P(D|M)$ is then undefined, so improper priors cannot be used for model comparison. When a proper prior is needed, a broad distribution such as a log-normal is a common alternative.

Figure: Log-normal distributions as proper alternatives to improper priors.
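
A small numerical illustration of the difference, assuming Python with SciPy: the area under the improper $1/\lambda$ prior keeps growing as the upper limit increases, while a log-normal prior integrates to one.

```python
import numpy as np
from scipy import integrate, stats

# Improper prior p(lambda) proportional to 1/lambda: its area on (eps, L) is
# log(L/eps), which grows without bound, so the prior cannot be normalized.
eps = 1e-3
for upper in [1e2, 1e4, 1e6, 1e8]:
    print(f"area of 1/lambda on ({eps}, {upper:.0e}) = {np.log(upper / eps):.1f}")

# A log-normal prior is a proper alternative: its density integrates to 1.
lognormal = stats.lognorm(s=1.0, scale=1.0)                 # broad on the log scale
area, _ = integrate.quad(lognormal.pdf, 0, np.inf)
print(f"area under the log-normal pdf = {area:.3f}")        # ~1.0
```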

Which Prior is Best?

Only the person doing the analysis can answer this! Priors encapsulate expert knowledge (or its absence). This is your opportunity to contribute your expertise to the analysis.

4. Summarizing Uncertainty

Credible Intervals

Bayesian credible intervals summarize uncertainty in parameter estimates:

Figure: A 95% credible interval, [0.41, 0.91], containing 95% of the posterior probability.

Credible Interval (Bayesian)

  • Given the data, 95% probability the parameter is in this interval
  • Direct probability statement
  • Depends on prior

Confidence Interval (Frequentist)

  • 95% of such intervals contain the true value
  • No probability statement about this specific interval
  • Prior-free (supposedly)
Note: For symmetric, unimodal distributions, credible intervals often coincide with confidence intervals, but their interpretations differ fundamentally.
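
In practice, a credible interval is usually read off from posterior samples. A minimal sketch in Python with NumPy; the Beta(8, 4) posterior is a hypothetical stand-in for whatever posterior your analysis produces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over a probability parameter, represented by samples
samples = rng.beta(8, 4, size=100_000)

# Equal-tailed 95% credible interval: the central 95% of posterior probability
lower, upper = np.percentile(samples, [2.5, 97.5])
print(f"95% credible interval: [{lower:.2f}, {upper:.2f}]")
```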

5. Inference in Practice

The Computational Challenge

Bayes' theorem has a troublesome denominator:

$$P(\theta_M|D,M) = \frac{P(D|\theta_M,M)P(\theta_M|M)}{P(D|M)}$$

The evidence $P(D|M)$ requires integration:

$$P(D|M) = \int P(D|\theta_M,M)P(\theta_M|M)d\theta_M$$
Problem:
  • Rarely can this integral be solved analytically
  • For high-dimensional $\theta_M$, numerical integration is infeasible
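
In one dimension the evidence integral is still easy to evaluate numerically. Here is a sketch, assuming Python with SciPy, for a hypothetical coin-flip model (binomial likelihood, Beta(2, 2) prior); the cost of such quadrature grows exponentially with the dimension of $\theta_M$, which is why it breaks down for larger models:

```python
import numpy as np
from scipy import integrate, stats

k, n = 7, 10                      # hypothetical data: 7 heads in 10 flips
prior = stats.beta(2, 2)          # assumed prior over the coin bias theta

def joint(theta):
    # Likelihood P(D|theta) times prior P(theta)
    return stats.binom.pmf(k, n, theta) * prior.pdf(theta)

evidence, _ = integrate.quad(joint, 0.0, 1.0)        # P(D|M)
print(f"evidence P(D|M) = {evidence:.4f}")

# The normalized posterior density then follows from Bayes' theorem
posterior_at_07 = joint(0.7) / evidence
print(f"posterior density at theta = 0.7: {posterior_at_07:.3f}")
```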

Monte Carlo Methods

Monte Carlo Methods

Algorithms that produce random samples to characterize probability distributions.

Figure: Monte Carlo methods take their name from the Monte Carlo casino: they use randomness to solve deterministic problems.

Rejection Sampling

One of the simplest Monte Carlo methods:

Rejection Sampling Algorithm

  1. Find an envelope distribution that bounds the target
  2. Sample uniformly under the envelope
  3. Accept samples under the target distribution
  4. Reject samples above the target

Figure: Rejection sampling. Green samples are accepted; red samples are rejected.
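
A minimal rejection-sampling sketch, assuming Python with NumPy; the unnormalized target density here is made up for illustration, and the envelope is a flat box over its support:

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    # Hypothetical unnormalized target density on [0, 1] (a Beta(3, 2) shape)
    return x ** 2 * (1 - x)

M = 0.15   # envelope height; must exceed the target's maximum (about 0.148)

# Sample uniformly under the envelope, keep the points that fall under the target
x = rng.uniform(0.0, 1.0, size=200_000)
y = rng.uniform(0.0, M, size=200_000)
accepted = x[y < target(x)]

print(f"acceptance rate: {accepted.size / x.size:.2%}")
print(f"sample mean: {accepted.mean():.3f}  (exact Beta(3,2) mean is 0.6)")
```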

The Curse of Dimensionality

Curse of dimensionality

As dimensions increase, the fraction of the envelope occupied by the target diminishes rapidly. For high-dimensional problems, rejection sampling becomes impractical.

Markov Chain Monte Carlo (MCMC)

The Metropolis-Hastings algorithm generates samples by creating a random walk:

Figure: MCMC explores the target distribution through a directed random walk.

Key advantages:

  • Only ratios of the target density are needed, so the troublesome evidence $P(D|M)$ never has to be computed (see the sketch below)
  • It scales to high-dimensional parameter spaces far better than rejection sampling
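
A minimal random-walk Metropolis sketch, assuming Python with NumPy; the unnormalized target is a stand-in, and note that only differences of the log-target (i.e. ratios of the target) are ever used:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # Hypothetical unnormalized log-density (a standard normal serves as a demo)
    return -0.5 * theta ** 2

n_steps, step_size = 50_000, 1.0
chain = np.empty(n_steps)
theta = 3.0                                    # deliberately poor starting point
for i in range(n_steps):
    proposal = theta + step_size * rng.normal()         # symmetric proposal
    # Accept with probability min(1, target(proposal) / target(theta))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
    chain[i] = theta

samples = chain[5_000:]                        # discard an initial burn-in period
print(f"mean ~ {samples.mean():.2f}, sd ~ {samples.std():.2f}")   # ~0 and ~1
```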

MCMC Output

Figure: Left, an MCMC trace showing parameter values over iterations; right, the resulting density estimate.

Convergence and Mixing

MCMC challenges:

  • Convergence: the chain needs an initial burn-in period before its samples follow the target distribution
  • Mixing: successive samples are correlated, so the chain must run long enough to explore the whole distribution

Figure: Burn-in period before the chain converges to the target distribution.

Assessing Convergence

Compare multiple chains from different starting points:

Figure: Multiple chains started from different points should converge to the same distribution.
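
One common convergence diagnostic is the Gelman-Rubin statistic $\hat{R}$, which compares between-chain and within-chain variance; values near 1 indicate the chains agree. Here is a sketch assuming Python with NumPy:

```python
import numpy as np

def r_hat(chains):
    """Gelman-Rubin statistic for chains of shape (n_chains, n_samples)."""
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chain_means.var(ddof=1)             # between-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_hat / W)

# Hypothetical example: four chains that have all reached the same target
rng = np.random.default_rng(3)
chains = rng.normal(size=(4, 2_000))
print(f"R-hat = {r_hat(chains):.3f}")           # close to 1 for converged chains
```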

Assessing Mixing

Use the autocorrelation function:

Figure: The autocorrelation function shows how quickly successive samples become effectively independent.

Effective Sample Size (ESS)
$$N_{\text{eff}} = \frac{N}{\tau}$$

Where $N$ is the total number of samples and $\tau$ is the autocorrelation time. The ESS estimates the number of effectively independent samples in the chain.
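
A rough sketch of estimating $\tau$ and the ESS from a single chain, assuming Python with NumPy; truncating the sum at the first negative autocorrelation is one simple heuristic among several in use:

```python
import numpy as np

def effective_sample_size(chain):
    """Estimate N_eff = N / tau from the empirical autocorrelation function."""
    x = chain - chain.mean()
    n = x.size
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariance, lags 0..n-1
    acf = acov / acov[0]                                  # autocorrelation function
    tau = 1.0
    for rho in acf[1:]:
        if rho < 0:                 # truncate at the first negative value
            break
        tau += 2.0 * rho            # tau = 1 + 2 * sum of positive-lag ACF
    return n / tau

# Hypothetical strongly autocorrelated chain: an AR(1) process with coefficient 0.9
rng = np.random.default_rng(4)
chain = np.empty(5_000)
chain[0] = 0.0
for t in range(1, chain.size):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()

print(f"N = {chain.size}, estimated N_eff = {effective_sample_size(chain):.0f}")
```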

Other Approaches

Monte Carlo Methods
  • Gibbs sampling
  • Hamiltonian Monte Carlo
  • Particle filtering
Deterministic Methods
  • Variational Bayes
  • Laplace approximation

Summary

This lecture introduced Bayesian inference as the extension of logic to handle uncertainty:

  1. Probability as logic: Cox's theorem shows probability theory uniquely extends deductive logic
  2. Bayes' theorem: Simply the product rule rearranged, provides a coherent framework for updating beliefs
  3. Priors: Necessary for any inference, encode background knowledge
  4. Computational methods: MCMC makes Bayesian inference practical for complex problems
  5. Advantages:
    • Direct probability statements about parameters
    • Coherent handling of uncertainty
    • Natural incorporation of prior knowledge
Further Reading:
  • E.T. Jaynes (2003) "Probability Theory: The Logic of Science"
  • MacKay (2003) "Information Theory, Inference, and Learning Algorithms"
  • Gelman et al. (2013) "Bayesian Data Analysis"

Check Your Understanding

  1. Why is probability theory the unique extension of logic for handling uncertainty?
  2. What's the difference between likelihood and posterior probability?
  3. Why are priors necessary for inference?
  4. What is the "curse of dimensionality" in rejection sampling?
  5. How do we assess MCMC convergence and mixing?