Lecture 10

The Coalescent

Week 10 90 minutes Required reading: Wakeley Ch. 1-3, Hein et al. Ch. 3-4

1. Introduction to the Coalescent

The coalescent is one of the most important theoretical frameworks in population genetics and molecular evolution. It provides a mathematical description of how genetic lineages merge backward in time.

The Coalescent

Data: a small genetic sample from a large background population

The coalescent:

  • Is a model of the ancestral relationships of a sample of individuals taken from a larger population
  • Describes a probability distribution on ancestral genealogies (trees) given a population history, N(t)
  • Can convert information from ancestral genealogies into information about population history and vice versa
  • Is a model of ancestral genealogies, not sequences, and its simplest form assumes neutral evolution
  • Can be thought of as a prior on the tree, in a Bayesian setting
Key insight: The coalescent allows us to link observable patterns in genetic data to the demographic history of populations.

Why the Coalescent Matters

The coalescent framework is crucial because it:

2. Theoretical Population Genetics Foundation

The Wright-Fisher Model

Most of theoretical population genetics is based on the idealized Wright-Fisher model of population, which makes several simplifying assumptions:

Wright-Fisher Model Assumptions
  • Constant population size N
  • Discrete generations (no overlapping generations)
  • Complete mixing (random mating)
  • Neutral evolution (no selection)
  • No mutation (in the basic model)
  • No migration (closed population)
Wright-Fisher model
The Wright-Fisher model: each generation is formed by random sampling with replacement from the previous generation
Note: For the purposes of this presentation, the population will be assumed to be haploid, as is the case for many pathogens. The diploid case follows similar principles with adjusted parameters.

From Forward to Backward in Time

Traditional population genetics works forward in time, tracking allele frequencies through generations. The coalescent takes a different approach:

This backward-time approach is computationally efficient because:

3. Kingman's n-Coalescent

John Kingman's groundbreaking work in 1982 established the mathematical foundation for coalescent theory.

Kingman's n-Coalescent

Consider tracing the ancestry of a sample of $k$ individuals from the present, back into the past. This process eventually coalesces to a single common ancestor (concestor) of the sample.

Kingman's n-coalescent describes the statistical properties of such an ancestry when $k$ is small compared to the total population size $N$.

Coalescent process
Tracing lineages backward in time until they coalesce to a common ancestor

The Coalescence of Two Lineages

Let's start with the simplest case: two randomly sampled individuals from a population of size $N$.

Two-Lineage Coalescence

  • By perfect mixing, the probability they share a common ancestor in the previous generation is $1/N$
  • The probability the common ancestor is $t$ generations back is:
    $$P(t) = \frac{1}{N}\left(1-\frac{1}{N}\right)^{t-1}$$
  • This is a geometric distribution with success rate $\lambda = 1/N$
  • Mean time to coalescence: $E[t] = N$ generations
  • Variance: $\text{Var}[t] = N^2(1-1/N) \approx N^2$ for large $N$
Two lineage coalescence
Two lineages coalescing in the Wright-Fisher model

The Coalescence of k Lineages

With $k$ lineages, the analysis becomes more complex but follows similar principles:

k-Lineage Coalescence

With $k$ lineages, there are $\binom{k}{2} = \frac{k(k-1)}{2}$ possible pairs that may coalesce.

  • Coalescence rate: $\lambda_k = \binom{k}{2}/N$
  • Mean time to first coalescence:
    $$E[t_k] = \frac{N}{\binom{k}{2}} = \frac{2N}{k(k-1)}$$
  • After coalescence, we have $k-1$ lineages remaining
Important assumption: This implicitly assumes that $N$ is much larger than $O(k^2)$, so that the probability of two coalescent events in the same generation is negligible.

The Complete Coalescent Process

The complete coalescent process for a sample of $n$ individuals involves successive coalescence events:

Lineages Coalescence Rate Expected Time
$n → n-1$ $\binom{n}{2}/N$ $\frac{2N}{n(n-1)}$
$n-1 → n-2$ $\binom{n-1}{2}/N$ $\frac{2N}{(n-1)(n-2)}$
... ... ...
$3 → 2$ $3/N$ $N/3$
$2 → 1$ $1/N$ $N$

Total expected time to the most recent common ancestor (TMRCA):

$$E[T_{MRCA}] = 2N\sum_{k=2}^n \frac{1}{k(k-1)} = 2N\left(1 - \frac{1}{n}\right)$$
Key result: As $n → ∞$, $E[T_{MRCA}] → 2N$. The expected TMRCA approaches $2N$ generations regardless of sample size!

4. The Coalescent as a Continuous-Time Process

From Discrete to Continuous Time

As the population size $N$ becomes large, the discrete-time Wright-Fisher model converges to a continuous-time process:

Geometric to exponential
The geometric distribution converges to the exponential distribution as population size increases

The Coalescent as a Diffusion Approximation

Kingman (1982) showed that as $N$ grows, the coalescent process converges to a continuous-time Markov chain.

Continuous-Time Coalescent

For $k$ lineages, the waiting time until the next coalescence follows an exponential distribution:

$$f(t_k) = \frac{\binom{k}{2}}{N}\exp\left(-\frac{\binom{k}{2}t_k}{N}\right)$$

For a specific pair of lineages to coalesce:

$$f(t_k) = \frac{1}{N}\exp\left(-\frac{\binom{k}{2}t_k}{N}\right)$$
Wright-Fisher to tree
From Wright-Fisher genealogy to coalescent tree representation

5. The Coalescent Density

Probability Density for a Genealogy

For a genealogy $g$ with $n$ samples and coalescent times $\mathbf{t} = \{t_2, t_3, \ldots, t_n\}$, we can write the probability density:

Coalescent Density Function

$$P(g|N) = \frac{1}{N^{n-1}}\prod_{k=2}^n\exp\left(-\frac{\binom{k}{2}t_k}{N}\right)$$

Where:

  • $t_k$ is the time interval during which there are $k$ lineages
  • The factor $1/N$ appears for each coalescence event (probability of specific pair coalescing)
  • The exponential term accounts for the waiting time with $k$ lineages

Example: Three-Sample Genealogy

For $n = 3$ samples with coalescence times $t_3$ (3→2 lineages) and $t_2$ (2→1 lineages):

$$P(g|N) = \frac{1}{N^2} \exp\left(-\frac{3t_3}{N}\right) \exp\left(-\frac{t_2}{N}\right)$$

This shows how both topology and branch lengths contribute to the genealogy probability.

Properties of the Coalescent Density

6. The Coalescent with Variable Population Size

Real populations rarely maintain constant size. The coalescent can be generalized to handle time-varying population sizes.

Variable Population Size Coalescent

Griffiths and Tavaré (1994) showed that for population size $N(t)$ varying with time:

Coalescent Density with Variable N(t)

The probability density for the first coalescence event at time $t$ given $n$ lineages:

$$f(t) = \frac{\binom{n}{2}}{N(t)} \exp\left(-\int_0^t \frac{\binom{n}{2}}{N(x)} dx\right)$$

Requirements:

  • The coalescent intensity function $1/N(t)$ must be integrable
  • $N(t) > 0$ for all $t$

Common Population Size Functions

Examples of N(t)

1. Exponential growth/decline:

$$N(t) = N_0 e^{-rt}$$

2. Logistic growth:

$$N(t) = \frac{K}{1 + \left(\frac{K}{N_0} - 1\right)e^{-rt}}$$

3. Bottleneck:

$$N(t) = \begin{cases} N_0 & \text{if } t < t_b \text{ or } t > t_b + \Delta t \\ N_b & \text{if } t_b \leq t \leq t_b + \Delta t \end{cases}$$

Effects of Population Size Changes

Different demographic scenarios leave characteristic signatures in genealogies:

Key insight: The shape of a genealogy contains information about historical population size changes. This is the basis for demographic inference from genetic data.

7. Applications and Extensions

The Coalescent as a Tree Prior

In Bayesian phylogenetics, the coalescent serves as a natural prior distribution on gene trees:

$$P(\text{tree}|\theta)$$

Where $\theta$ includes parameters like:

Extensions of the Basic Coalescent

Structured Coalescent
Accounts for population subdivision and migration
Selection at Linked Sites
Modifies coalescence rates due to hitchhiking effects
Recombination
Leads to the ancestral recombination graph (ARG)
Multiple Merger Coalescents
Allows more than two lineages to coalesce simultaneously

Practical Applications

The coalescent framework enables:

  1. Demographic inference: Estimating historical population sizes and growth rates
  2. Species delimitation: Distinguishing population structure from species boundaries
  3. Dating events: Estimating times of population splits or admixture
  4. Selection detection: Identifying deviations from neutral expectations
  5. Epidemiological inference: Tracking pathogen spread and transmission

Software Implementing Coalescent Models

  • BEAST/BEAST2: Bayesian phylogenetics with coalescent priors
  • MIGRATE: Estimation of migration rates and population sizes
  • ms/msprime: Coalescent simulation programs
  • LAMARC: Likelihood analysis with coalescent models
  • IMa2: Isolation with migration models

8. Visualizing the Coalescent Process

Understanding the coalescent often benefits from visualization and simulation. The interactive demonstration in the lecture allows exploration of:

Key Parameters to Explore

Observable Patterns

Through simulation, we can observe:

Exercise: Try simulating genealogies with different population size functions. Notice how bottlenecks create "hourglass" shaped genealogies, while expansion creates "star" phylogenies.

Summary

The coalescent is a fundamental framework in population genetics that:

  1. Provides a null model: Describes genealogies under neutral evolution in idealized populations
  2. Works backward in time: More efficient than forward simulations for genealogical questions
  3. Links demography to genealogy: Population size changes leave signatures in gene trees
  4. Has elegant mathematics:
    • Exponential waiting times between coalescence events
    • Rate depends on number of lineages and population size
    • Generalizes to variable population sizes
  5. Forms the basis for inference: Used as tree priors in Bayesian phylogenetics and for demographic inference
  6. Has many extensions: Can incorporate population structure, selection, recombination, and other biological realities
Key formula to remember: For $k$ lineages in a population of size $N$, the rate of coalescence is $\binom{k}{2}/N$, and the expected waiting time is $N/\binom{k}{2}$.
Recommended Reading:
  • Wakeley (2009) "Coalescent Theory: An Introduction" - Chapters 1-3
  • Hein, Schierup & Wiuf (2005) "Gene Genealogies, Variation and Evolution" - Chapters 3-4
  • Kingman (1982) "The coalescent" - Stochastic Processes and their Applications
  • Griffiths & Tavaré (1994) "Sampling theory for neutral alleles" - Phil Trans R Soc B

Check Your Understanding

  1. Why is the coalescent more efficient than forward-time simulations?
  2. What is the expected TMRCA for a large sample from a population of size N?
  3. How does a population bottleneck affect the rate of coalescence?
  4. Why do all labeled topologies have equal probability under the standard coalescent?
  5. What information about population history can be inferred from genealogy shape?
Previous Lecture Next Lecture