Master Equations and The Structured Coalescent

Generate layout for printing

Lecturer: Alexei Drummond

Chemical master equations

Consider the chemical reaction $$r_1 A_1 + r_2 A_2 + \ldots \overset{\lambda}{\longrightarrow} p_1 A_1 + p_2 A_2 + \ldots$$ where $r_i$ and $p_i$ specify the (integer) number of $A_i$ particles involved on the reactant and product sides, respectively.

Assuming the reaction is occuring in a well-mixed volume and that any interaction of the reactants has a fixed probability of resulting in a reaction, might use $X_i$ to describe the number of $A_i$ and have it evolve under a birth-death process with $$T(\vec{x})=\lambda\prod_i \frac{x_i!}{(x_i-r_i)!}$$

We call master equations describing these processes "chemical master equations".

Note that since the reaction notation uniquely specifies the transition rate $T(\vec{x})$, and the transition rate uniquely specifies the master equation, reaction notation can be used to precisely specify these kinds of birth-death processes.

Example: SIR model

In epidemiology, the stochastic SIR model (a compartmental model) can be described using the following pair of reactions:

\begin{align*} S + I &\overset{\beta}{\longrightarrow} 2I\\ I &\overset{\mu}{\longrightarrow} R\\ \end{align*}

The transition rates are $T_{inf}(n_S,n_I,n_R) = \beta n_S n_I$ with $\vec{v}_{inf}=(-1,1,0)$ and $T_{rem}(n_S,n_I,n_R) = \mu n_I$ with $\vec{v}_{rem}=(0,-1,1)$.

Master equation: $$\partial_t p(n_S,n_I,n_R)=\beta\left[(n_S+1)(n_I-1)p(n_S+1,n_I-1,n_R) - n_Sn_Ip(n_S,n_I,n_R)\right]\\ + \mu\left[(n_I+1)p(n_S,n_I+1,n_R-1)-n_Ip(n_S,n_I,n_R)\right]$$

Approximate mean dynamics: $$\partial_t \bar{n}_S = -\beta \bar{n}_S\bar{n}_I\\ \partial_t \bar{n}_I = \beta \bar{n}_S\bar{n}_I - \mu\bar{n}_I\\ \partial_t \bar{n}_R = \mu \bar{n}_I$$

Example: SIR model

Simulate! (Code here)

What is population structure?

A population is structured if its members possess one or more traits that affect who they interact with. (Which reactions they are involved in.)

Examples of characteristics that induce population structure:

Geographical location/position (phylogeography)
Species membership (gene characteristic)
Membership in epidemiological compartments
- exposed vs infectious
- age category
- ...

Structured Wright-Fisher model

Introduced by Notohara, 1990 (and probably others)
Applies the original Wright-Fisher model to structured populations.
Assumes a single population is divided into subpopulations (demes) of size $N_i$ for $i\in {1\ldots D}$.
Allows for weak migration between demes at rate $q_{ij}$.
Sub-population sizes are unaffected by migration in the long term: imagine islands in an archipelago which have each equilibrated to their own carrying capacities.

Structured Wright-Fisher model

A $D=2$ structured WF generation proceeds as follows:

$P(n'_{ba}|N_b)=\binom{n_b}{n'_{ba}}q_{ba}^{n'_{ba}}(1-q_{ba})^{n_b-n'_{ba}}$

The structured coalescent

Backward in time genealogical process corresponding to taking the coalescent limit of the structured WF process.
There are multiple routes to derivation! (It can be considered a structured WF process limit, but doesn't have to be.)

Coalescence rate in deme $i$:: $$T_{c}^{(i)}(\vec{k})=\binom{k_i}{2}\frac{1}{N_ig}$$
Migration rate $i\rightarrow j$ (backward time):: $$T_{m}^{(ij)}(\vec{k})=k_i m_{ij}$$

Note that $m_{ij} = g^{-1} q_{ji}\frac{N_j}{N_i}$.

The effect of structure on tree shape

Signal for structure remains, even when tree is not coloured.

Failure to account for structure can bias phylogenetic analyses.

(Trees simulated using this MASTER script.)

Two-deme structured coalescent trees

$N_1 = N_2 = 1000$, $m_{12} = m_{21} = 0.0008$. There are 15 samples from each deme, all sampled at the same time.

Two-deme structured coalescent trees

A standard phylogenetic inference method would infer just the tree. Here we show the true trees, tips annotated with known demes.

Expected time to coalesce

For a symmetric 2 deme model ($N_1=N_2=N$ and $m_{12}=m_{21}=m$) we can derive the expected time to coalescence of two lineages:

Define $\tau_d$ and $\tau_s$ as coal. time when lineages are in different/same demes respectively.
Lineages in distinct demes cannot coalesce: $$\langle\tau_d\rangle = \frac{1}{2m} + \langle\tau_s\rangle$$
Lineages in the same deme have probability $2m/(2m + 1/Ng)$ of migrating to distinct demes following the $1/(2m+1/Ng)$ waiting period: $$\langle\tau_s\rangle = \frac{1}{2m+1/Ng} + \frac{1/Ng}{2m+1/Ng} 0 + \frac{2m}{2m+1/Ng}\langle\tau_d\rangle$$
Solving this recursion relation yields: $\langle\tau_s\rangle = 2Ng$ and $\langle\tau_d\rangle = \frac{1}{2m} + 2Ng$

Expected time to coalesce from same deme is independent of $m$! However, the variance is a (decreasing) function of $m$.

Recap: Bayesian Phylogenetic Inference

The usual phylogenetic posterior is:

$$P(T,\mu,\theta|A) = \frac{1}{P(A)}P(A|T,\mu)P(T|\theta)P(\mu)P(\theta)$$

where

$A$ is the multiple sequence alignment,
$T$ is the tree,
$\mu$ are the substitution model parameters,
$\theta$ are the tree prior parameters.

SC inference: Modified tree prior

The standard phylogenetic posterior is modified:

\begin{align*} P(T,C,\mu,\vec{N},m|A,L) = &\frac{1}{P(A)}P(A|T,\mu)\\ &\times P(T,C|\vec{N},m,L)P(\mu)P(\theta) \end{align*}

$C$ are the tree colours (lineage locations),
$L$ are the sample locations.

The probability of a structured coalescent tree is:

\begin{align*} P(T,C|\vec{N},m,L)\propto&\exp\left[-\int_0^{t_r}\left(\sum_i\binom{k_i(t)}{2}\frac{1}{N_ig}+\sum_{ij}k_i(t)m_{ij}\right)\right]\\ &\times\left(\prod_{i}(\frac{1}{N_ig})^{\nu_i}\right)\left(\prod_{ij}(m_{ij})^{\kappa_{ij}}\right) \end{align*}

($\nu_i$ are the number of coalescences in $i$ and $\kappa_{ij}$ are the number of $i$ to $j$ migrations.)

Structured coalescent prior

As we have just seen, the structured coalescent density can be expressed as a product of terms over time intervals from the tips to the root of the ancestral genealogy.

In the standard panmictic coalescent, the number of intervals is known, but in the structured coalescent it's an unknown random variable.

Two the right is the prior distribution of the number of migration events in the two deme, 30 sample example.

Example of structured coalescent inference

New MCMC operators

Vaughan et al (2014)

Comparison to mugration model

Alternative framework for phylogeographic inference.
Introduced by Phillipe Lemey et al., 2009.
Treats location as "just another character" (migration $\leftrightarrow$ mutation)

\begin{align} P(T,\mu,\theta|A,L) =& \frac{1}{P(A)P(L)} P(A|T,\mu)P(L|T,M)\\ &\times P(T|\theta)P(\mu)P(\theta) \end{align}

$L$ are the sampled locations, and
$M$ is a matrix specifying the random walk.

Samples are assumed to be collected in a manner that is blind to their location.

Mugration models use sample location as data.
Just as for genetic data, non-random sampling procedures will bias results.

Implementations of SC inference

MIGRATE-n (Beerli and Palczewski, 2010)
Ewing sampler (Ewing et al., 2004)
MultiTypeTree (BEAST 2, Vaughan et al., 2014)
BASTA (approximate, BEAST 2, De Maio et al., 2015)
MASCOT (better approximation, BEAST 2, Mueller et al., 2017)

Chemical master equations

Example: SIR model

Example: SIR model

What is population structure?

Structured Wright-Fisher model

Structured Wright-Fisher model

The structured coalescent

The effect of structure on tree shape

Two-deme structured coalescent trees

Two-deme structured coalescent trees

Expected time to coalesce

Recap: Bayesian Phylogenetic Inference

SC inference: Modified tree prior

Structured coalescent prior

Example of structured coalescent inference

New MCMC operators

Comparison to mugration model

Implementations of SC inference

Recommended reading