The multi-species coalescent
Lecturer: Alexei Drummond

The murky boundary between population genetics and phylogenetics

There has been increased interest in analyses of closely related species, where the effects of population genetic processes such as the coalescent can't be ignored.

  • Different gene trees can have different topologies due to incomplete lineage sorting
  • Divergence times of species can be overestimated due to ancestral polymorphism
  • Sometimes the exact species identities of individuals are not known
  • Sometimes researchers identify species based on a split in a single gene tree.

Enter the multi-species coalescent.

All these patterns are the consequence of "lineage drift" within a single population

Gene trees and species trees

Why Ancestral Polymorphisms?

Probabilities of Monophyly, Paraphyly and Polyphyly

Overestimating divergence times

  • Typically we use gene phylogenies to estimate species phylogenies
  • But gene divergence times are older than the species divergence time, because the gene lineages must still coalesce in the ancestral population
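A minimal simulation sketch of this bias, assuming one gene lineage sampled from each of two sister species and a constant ancestral population in which a pair of lineages coalesces at rate 1/N (the parameterisation used for the coalescent prior later in these slides). The function name and the parameter values are illustrative only.

```python
import random

def simulate_gene_divergence(species_split, pop_size, n_loci, seed=1):
    """Gene divergence times for one lineage from each of two sister species.

    Assumption: after the species split (looking back in time) the two
    lineages share a constant ancestral population and coalesce at rate
    1/pop_size, with time in the same units as pop_size.
    """
    rng = random.Random(seed)
    # Gene divergence = species split + waiting time to coalescence.
    return [species_split + rng.expovariate(1.0 / pop_size) for _ in range(n_loci)]

times = simulate_gene_divergence(species_split=1.0, pop_size=0.5, n_loci=10000)
mean_gene_time = sum(times) / len(times)
print(f"species split: 1.0, mean gene divergence: {mean_gene_time:.3f}")
```

Under these assumptions every simulated gene divergence is older than the species split, and the average excess is close to the ancestral population size.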

Overestimating divergence times

From Edwards and Beerli, 2000, Evolution 54: 1839-1854

The multispecies coalescent

Population sizes on a species tree

Define \(N_i\) to be the effective population size at the present for the species \(i \in \{1,2,\dots,n\}\), and \(A_i\) the ancestral effective population size of species \(i\) at the time of the species origin.

Then for all ancestral species \(j \in \{ n+1, n+2, \dots, 2n-1\}\) (represented by internal branches in the species tree) define \(N_j = A_{\text{left}(j)} + A_{\text{right}(j)}\), where \(\text{left}(j)\) is the left descendant species of \(j\) and \(\text{right}(j)\) is the right descendant species.

\[A_k \sim \text{Exp}(\Theta), \quad 1 \leq k \leq 2n\]
\[N_i \sim \text{Gamma}(2, \Theta), \quad 1 \leq i \leq n\]
\[N_j = A_{\text{left}(j)} + A_{\text{right}(j)}, \quad n < j < 2n\]
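The prior above can be sampled directly. The sketch below is illustrative only: it assumes \(\Theta\) is a scale (mean) parameter for both the exponential and the gamma distribution, represents the species tree as a hypothetical `children` dictionary, and draws \(A_k\) only for species that have a parent, since those are the only values the sum uses.

```python
import random

def sample_population_sizes(children, n_tips, theta, seed=1):
    """Draw population sizes on a species tree from the prior.

    children maps each internal node j to its (left, right) descendants;
    extant species are numbered 1..n_tips.
    Assumption: theta is the mean of Exp and the scale of Gamma(2, .).
    Returns (A, N) as dictionaries keyed by node number.
    """
    rng = random.Random(seed)
    non_root = [child for pair in children.values() for child in pair]
    # A_k ~ Exp with mean theta: population size at the origin of species k.
    A = {k: rng.expovariate(1.0 / theta) for k in non_root}
    # N_i ~ Gamma(shape=2, scale=theta): present-day size of extant species i.
    N = {i: rng.gammavariate(2.0, theta) for i in range(1, n_tips + 1)}
    # N_j = A_left(j) + A_right(j) for every ancestral species j.
    for j, (left, right) in children.items():
        N[j] = A[left] + A[right]
    return A, N

# Three species: ((1, 2), 3) with internal nodes 4 and 5 (the root).
children = {4: (1, 2), 5: (4, 3)}
A, N = sample_population_sizes(children, n_tips=3, theta=0.01)
for node in sorted(N):
    print(node, round(N[node], 5))
```

Because the sum of two independent exponentials with the same mean is Gamma(2) distributed, the ancestral \(N_j\) have the same marginal prior as the tip \(N_i\) under this parameterisation.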

Population sizes prior

Species divergence times prior

For a species tree of \(n\) species, define \(T_i\) to be the time at which the species tree goes from having \(i\) to \(i-1\) species, back in time. Additionally define \(\tau_i = T_i - T_{i+1}, i \in \{2,\dots,n-1\}\) and \(\tau_n = T_n\).

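The Yule speciation prior supposes a uniform rate of species birth (\(\lambda\)) on all lineages, so the interval during which there are \(i\) species is exponentially distributed with mean \(1/(i\lambda)\):

\[\tau_i \sim \text{Exp}\left(\frac{1}{i\lambda}\right)\]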
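A short sampling sketch for this prior, assuming the exponential is parameterised by its mean \(1/(i\lambda)\), that is, rate \(i\lambda\). The function name and the example values are illustrative.

```python
import random

def sample_yule_times(n_species, birth_rate, seed=1):
    """Sample divergence times T_n < ... < T_2 under a Yule prior.

    T[i] is the time (before present) at which the species tree goes
    from i to i-1 species, looking back in time.
    """
    rng = random.Random(seed)
    T = {}
    elapsed = 0.0
    for i in range(n_species, 1, -1):
        # tau_i: waiting time while i species exist, rate i * lambda.
        elapsed += rng.expovariate(i * birth_rate)
        T[i] = elapsed
    return T

T = sample_yule_times(n_species=5, birth_rate=2.0)
for i in sorted(T, reverse=True):
    print(f"T_{i} = {T[i]:.4f}")
```

Here `T[2]` is the root height of the species tree, with prior expectation \(\sum_{i=2}^{n} 1/(i\lambda)\).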

More complex species tree priors that admit species extinction (Birth-death prior; Gernhard, 2008) and incomplete sampling (Birth-death-sampling; Stadler, 2009) are also possible.

All of these species tree priors imply a uniform prior on labelled histories.

Coalescent prior for gene trees

Consider a single species in the species tree, within which the number of gene lineages falls from \(u\) to \(v\), giving \(u-v\) intervals that end in a coalescent event (and a final interval without a coalescent event). \(t_k\) is the time during which there are \(k\) lineages. Define \(N(s)\) as the population size of this species at time \(s\), measured from the tipward end of the species branch. Define \(s_i = \sum_{k=i}^{u} t_k\), the time at which the interval with \(i\) lineages ends, with \(s_{u+1} = 0\). The prior density for each interval ending in a coalescent is:

\[f(t_k) = {\frac{1}{N(s_k)}}{\exp\left(-{\int \limits_{s_{k+1}}^{s_k} \frac{\binom{k}{2}}{N(x)} dx }\right)}\]

Coalescent prior for gene trees

The coalescent prior density for the final interval that does not end in a coalescent event is:

\[f(t_v) = {\exp\left(-{\int \limits_{s_{v+1}}^{s_v} \frac{\binom{v}{2}}{N(x)} dx }\right)}\]

Define \(P(g | S)\) to be the total coalescent density for gene tree \(g\), obtained as the product of the interval densities over all intervals in all species of the species tree \((S)\).
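As a worked special case, the sketch below evaluates the log density for one species branch under the simplifying assumption that the population size is constant along the branch, so each integral reduces to \(\binom{k}{2} t_k / N\). This is an illustration only, not the BEAST implementation, and the function name and argument layout are hypothetical.

```python
import math

def log_coalescent_density_constant(interval_lengths, u, pop_size):
    """Log coalescent density for one species branch with constant size.

    interval_lengths[0] is the time spent with u lineages, the next entry
    the time with u-1 lineages, and so on. The last entry is the final
    interval, which ends at the top of the branch without a coalescent.
    """
    log_density = 0.0
    last = len(interval_lengths) - 1
    for idx, t in enumerate(interval_lengths):
        k = u - idx  # number of lineages during this interval
        # Probability of no coalescent during the interval.
        log_density -= math.comb(k, 2) * t / pop_size
        # Every interval except the last ends in a coalescent: factor 1/N.
        if idx < last:
            log_density -= math.log(pop_size)
    return log_density

# Three lineages enter the branch, two coalescent events, one lineage leaves.
value = log_coalescent_density_constant([0.1, 0.2, 0.3], u=3, pop_size=2.0)
print(round(value, 6))
```

Summing this quantity over all branches of the species tree gives the log of the total coalescent density for one gene tree under the constant-size assumption.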

Posterior probability of multi-species coalescent

If we additionally define \(\color{orange}{\Pr(D_g | g)}\) as the phylogenetic likelihood of the sequence data for gene tree \(g\), and \(\color{red}{P(S | \lambda, \Theta)}\) as the prior on the species tree times and population sizes, then the posterior distribution over gene trees, species tree and other parameters is:

$${\color{#007f7f}{P(\mathbf{G},S,\lambda,\Theta | \mathbf{D})}} = \frac{1}{\color{green}Z} \color{orange}{\left[ \prod_{g \in \mathbf{G}} \Pr(D_g|g) \right]} \color{red}{\left[\prod_{g\in\mathbf{G}} P(g|S)\right] P(S | \lambda, \Theta) P(\lambda) P(\Theta)}$$

where \(\mathbf{G}\) is all the gene trees, \(\mathbf{D}\) is all the gene alignments, \(\color{red}{P(\lambda)}\) is the prior for the speciation rate, \(\color{red}{P(\Theta)}\) is the prior on the ancestral population size parameters, and \(\color{green}Z = P(\mathbf{D})\) is the unknown normalising constant.
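On the log scale the posterior becomes a sum of terms, which is the form a sampler would typically evaluate:

\[\log P(\mathbf{G},S,\lambda,\Theta | \mathbf{D}) = -\log Z + \sum_{g \in \mathbf{G}} \log \Pr(D_g|g) + \sum_{g \in \mathbf{G}} \log P(g|S) + \log P(S | \lambda, \Theta) + \log P(\lambda) + \log P(\Theta)\]

Since \(Z\) does not depend on the state, it cancels in the ratio of posterior densities between a proposed and a current state, so a Metropolis-Hastings sampler never needs its value.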

Four gene trees inside a 3-species tree

A rapid radiation of 7 species

Species tree

Species tree and one gene tree

Gene concatenation - a terrible idea

Species tree

"Gene" tree from concatenated genes

Species tree of Pocket Gophers

Data from (Belfiore, 2008)

27 individuals, 7 loci (12 from T. bottae, 23 from others, 1 from outgroup)

Full Bayesian inference under the multi-species coalescent

Data from (Belfiore, 2008), software implemented in BEAST by Heled and Drummond (2010)

Open questions

  • Are there better priors for the population sizes?
  • What are efficient proposal distributions for MCMC on the multispecies coalescent?
  • What about uncertain species identification?
    • How do we characterize the prior distribution on species associations? (geography, morphology...)
  • What about uncertain numbers of species?!
    • How do we characterize the hypothesis space over species trees with a random number of species, and gene tree tips with an uncertain species identity?
    • Reversible-jump, yes, but what proposal distributions? ;-) What about Bayesian stochastic variable selection?