The Coalescent with Recombination
Lecturer: Alexei Drummond

Recombination

  • Occurs in eukaryotes, bacteria and viruses.
  • Eukaryotic recombination occurs during meiosis via chromosomal crossover.
  • Bacterial recombination occurs via
    • phage-mediated transduction,
    • natural transformation, and
    • conjugation.
  • Viral recombination occurs when multiple strains infect a single cell. (Example: reassortment in influenza.)
Extremely important for phylogenetics: affects the validity of the "tree" assumption!

Wright-Fisher model with recombination

  • Consider a WF model with male and female diploid individuals.
  • Focus on a small segment of the genome.
  • Each child selects 1 male and 1 female parent uniformly at random.
  • With probability $r$ (which depends on the segment size) the two homologous chromosomes of a parent recombine before being passed on.

Wright-Fisher model with recombination

  • Since the specific pairing of chromosomes only matters over a single generation, in the long term the haploid approximation is very good:
  • Each child in $i+1$ selects a parent uniformly at random from generation $i$.
  • With probability $r$ an additional parent is selected.
  • In this case, a breakpoint is chosen uniformly on the chromosome, and everything to either the left or the right of it is replaced by the homologous section of the second parent's chromosome.
  • (Equivalent to selecting 2 parents for each child and letting the resulting sequences recombine with some probability.)
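
The haploid approximation above can be sketched as a forward simulation of one generation. This is a minimal illustration, not code from the lecture: the function name, the representation of a chromosome as a list of site labels, and the parameter values are all assumptions. The additional parent is drawn with replacement, so it may coincide with the first parent.

```python
import random

def next_generation(parents, r, rng):
    """Produce one new generation under the haploid approximation.

    parents: list of equal-length chromosomes (lists of site labels),
             each with at least two sites.
    r:       per-child probability that an additional parent contributes.
    """
    n_sites = len(parents[0])
    children = []
    for _ in range(len(parents)):
        child = list(rng.choice(parents))        # first parent, chosen uniformly
        if rng.random() < r:
            other = rng.choice(parents)          # additional parent, chosen uniformly
            cut = rng.randrange(1, n_sites)      # breakpoint between two sites
            if rng.random() < 0.5:
                child[:cut] = other[:cut]        # replace everything to the left
            else:
                child[cut:] = other[cut:]        # replace everything to the right
        children.append(child)
    return children

rng = random.Random(1)
# Label every site with the index of the founder it descends from.
population = [[i] * 10 for i in range(6)]
for _ in range(20):
    population = next_generation(population, r=0.3, rng=rng)
```

After a few generations, chromosomes typically become mosaics of founder labels, which is the forward-time counterpart of sites having different local trees.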

Coalescent with recombination

For fixed recombination rate $\rho=r/g$ in the limit $r\ll 1$, $g\ll 1$ and $N\gg 1$, the genealogical process is the coalescent with recombination (Hudson, 1983):

  • Coalescence rate: $T_c(k)=\binom{k}{2}\frac{1}{Ng}$
  • Recombination rate: $T_r(k)=\rho k$.
  • Recombination breakpoints are chosen uniformly along the sequence: everything to the left descends from one parent, everything else from the other.
  • Each site possesses a local tree.
  • Local trees may reach their MRCAs (grey sites) before the grand MRCA (GMRCA) of the whole process.
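
The two rates above are enough to draw the next event of the backward-time process. The following is a small sketch using a standard competing-exponentials step. The function names and parameter values are hypothetical, and the code only decides when the next event happens and of which kind, not which lineages or sites are involved.

```python
import random
from math import comb

def coalescence_rate(k, N, g):
    """Total coalescence rate among k lineages: C(k, 2) / (N g)."""
    return comb(k, 2) / (N * g)

def recombination_rate(k, rho):
    """Total recombination rate among k lineages: rho * k."""
    return rho * k

def next_event(k, N, g, rho, rng):
    """Return (waiting time, event kind) for the next event with k lineages."""
    c = coalescence_rate(k, N, g)
    r = recombination_rate(k, rho)
    total = c + r
    if total <= 0:
        raise ValueError("no event can occur (k == 1 and rho == 0)")
    wait = rng.expovariate(total)                 # time to the first of the two events
    kind = "coalescence" if rng.random() * total < c else "recombination"
    return wait, kind

# Illustrative values only.
wait, kind = next_event(k=10, N=1000, g=1.0, rho=0.001, rng=random.Random(7))
```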

CwR as a birth-death process

The number of surviving ancestral lineages $k$ evolves under the birth-death process corresponding to the following pair of reactions:

\begin{align*} 2L &\overset{\chi}{\longrightarrow} L\\ L &\overset{\rho}{\longrightarrow} 2L \end{align*}

where $\chi=1/Ng$. The deterministic approximation for the evolution of the mean lineage count $\langle k\rangle$ is \begin{align*} \partial_t \langle k\rangle &\simeq -\frac{\chi}{2}\langle k\rangle(\langle k\rangle-1) + \rho\langle k\rangle\\ &\propto \langle k\rangle(1-\langle k\rangle/k_c) \end{align*} where $k_c=1+2\rho/\chi=1+2Ng\rho$, which reduces to $1+2N\rho$ when $g=1$.

$\langle k\rangle$ stabilizes at $k_c$, but noise eventually drives $k$ to 1 (GMRCA).
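
The behaviour of $k$ can be explored with a stochastic simulation of the two reactions. This is a sketch under assumptions: the function name and parameter values are illustrative, and the fixed point is computed as $1+2\rho/\chi$, which follows from balancing the two reaction propensities and equals $1+2N\rho$ when $g=1$.

```python
import random

def simulate_lineage_count(k0, chi, rho, rng, max_events=1_000_000):
    """Gillespie simulation of 2L -> L (rate chi) and L -> 2L (rate rho).

    Returns a list of (time, k) pairs, stopping the first time k == 1
    or after max_events events, whichever comes first.
    """
    t, k = 0.0, k0
    path = [(t, k)]
    for _ in range(max_events):
        if k == 1:
            break                                 # GMRCA reached
        coal = chi * k * (k - 1) / 2              # propensity of 2L -> L
        reco = rho * k                            # propensity of L -> 2L
        total = coal + reco
        t += rng.expovariate(total)
        k += -1 if rng.random() * total < coal else 1
        path.append((t, k))
    return path

chi, rho = 1.0, 0.5                               # illustrative values only
k_c = 1 + 2 * rho / chi                           # fixed point of the deterministic approximation
path = simulate_lineage_count(k0=10, chi=chi, rho=rho, rng=random.Random(42))
```

For larger $\rho/\chi$ the lineage count can hover near `k_c` for a long time before a fluctuation takes it to 1, which is why the `max_events` guard is included.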

CwR example ARG

(Simulated using this MASTER script.)

Coalescent with gene conversion

A gene conversion event refers to the replacement of a single contiguous stretch of sequence with a homologous stretch from a different parent. Incorporated into a modified CwR process by Wiuf and Hein (2000).

  • Allows similar patterns of site ancestry but with fewer events.
  • Model applicable to prokaryotic recombination.
  • Coalescence and recombination rates as for CwR.
  • Conversions initiate at randomly chosen starting sites and extend for $d$ sites where $P(d)=(1-\delta^{-1})^{d-1}\delta^{-1}$ (geometric distribution).
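
The tract-length distribution can be sampled directly. The sketch below assumes $\delta\geq 1$ and, as a simplification that is not taken from the lecture, truncates a tract at the end of the sequence. All function names are hypothetical.

```python
import random

def tract_pmf(d, delta):
    """P(d) = (1 - 1/delta)^(d-1) * (1/delta) for d = 1, 2, ..."""
    return (1 - 1 / delta) ** (d - 1) / delta

def tract_length(delta, rng):
    """Sample a tract length from the geometric distribution with mean delta."""
    p = 1.0 / delta
    d = 1
    while rng.random() >= p:                      # extend by one site with probability 1 - p
        d += 1
    return d

def sample_conversion(n_sites, delta, rng):
    """Return the half-open site interval [start, end) replaced by a conversion."""
    start = rng.randrange(n_sites)                # uniformly chosen starting site
    end = min(start + tract_length(delta, rng), n_sites)  # truncation is an assumption
    return start, end

rng = random.Random(5)
start, end = sample_conversion(n_sites=1000, delta=50.0, rng=rng)
```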

Sequentially Markov Coalescent (SMC)

  • The number of sites between successive breakpoints is exponentially distributed with rate $\rho T$, where $T$ is the total branch length of the current local tree.
  • Neglects some possible recombinations, e.g. those that do not affect the data.
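
The breakpoint spacing can be sketched as a walk along the sequence. Two assumptions are made here: $\rho$ is treated as a rate per site per unit of branch length, and the local tree lengths are supplied by the caller as a stand-in for actually updating the tree at each breakpoint. The function name is hypothetical.

```python
import random
from itertools import repeat

def breakpoints_along_sequence(seq_length, rho, tree_lengths, rng):
    """Sample breakpoint positions moving left to right along the sequence.

    tree_lengths: iterable giving the total branch length T of each successive
                  local tree (placeholder for a real tree update step).
    Stops at the end of the sequence or when tree_lengths is exhausted.
    """
    pos = 0.0
    points = []
    for T in tree_lengths:
        pos += rng.expovariate(rho * T)           # distance to the next breakpoint
        if pos >= seq_length:
            break
        points.append(pos)
    return points

# Constant T is used purely for illustration; in the SMC, T changes at each breakpoint.
points = breakpoints_along_sequence(100_000, rho=1e-3, tree_lengths=repeat(4.0),
                                    rng=random.Random(3))
```

With constant $T$ this reduces to a Poisson process along the sequence with on average $\rho T$ breakpoints per site.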

Population inference using SMC

  • Li & Durbin (2011) developed an SMC-based HMM on pairs of aligned sequences.
  • Baum-Welch used to estimate parameters and hidden states (local TMRCAs).
  • TMRCA distribution used to produce estimates of population size dynamics.

Bayesian inference of ARGs

We can easily write down an expression for the posterior distribution of the ARG given an alignment:

$$P(G,\rho,N,\mu|A) = \frac{1}{P(A)}P(A|G,\mu)P(G|\rho,N)P(\rho,N,\mu)$$
  • $G$ is the recombination graph,
  • $\rho$ is the recombination rate,
  • $N$ is the population size,
  • $\mu$ is the mutation rate, and
  • $A$ is the alignment.

In practice, this is non-trivial since:

  1. the likelihood of $G$, $P(A|G,\mu)$, is insensitive to many features of $G$, so the data cannot distinguish between many different graphs,
  2. the likelihood surface often contains many distinct peaks, and
  3. the state space of ARGs is huge: much larger than for trees.

Despite this, many approximate algorithms exist.

Algorithms for Bayesian ARG inference

Summary

  • Recombination is an ever-present feature of real evolution.
  • Ignoring recombination when it is present can lead to biased phylogenetic inferences.
  • Coalescent models for recombination produce Ancestral Recombination Graphs (ARGs).
  • Coalescent models are derivable from a modified Wright-Fisher process.
  • Sequentially Markovian approximation can be used to construct an HMM where TMRCAs are hidden states.
  • Full Bayesian inference of ARGs is an ongoing research topic.

Recommended Reading