Time-trees and Relaxed phylogenetics

Lecturer: Alexei Drummond

Tree space

A two-dimensional space representing all possible time-trees for the topology ((1,2),3).
$x$ and $y$ are the two inter-coalescent intervals ($t_{root} = x + y$).
Three trees are displayed, with their arithmetic mean tree.
Dashed lines show the shortest distances to mean (i.e. deviations from the mean).
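
As a minimal sketch of this picture: within a single topology, a tree is just a point $(x, y)$, so the arithmetic mean tree is the coordinate-wise mean. The three trees below are made-up values (not the ones in the figure), and the deviations are assumed to be Euclidean distances.

```python
import math

# Each time-tree with topology ((1,2),3) is a point (x, y):
# the two inter-coalescent intervals, with t_root = x + y.
# These three points are hypothetical, chosen only for illustration.
trees = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]

def mean_tree(points):
    """Coordinate-wise arithmetic mean of the interval vectors."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def distance(a, b):
    """Euclidean distance between two trees in the same subspace (assumed metric)."""
    return math.dist(a, b)

mean = mean_tree(trees)                             # the mean tree (x, y)
deviations = [distance(t, mean) for t in trees]     # lengths of the dashed lines
root_age = sum(mean)                                # t_root = x + y for the mean tree
```

This only works as written because all three trees share one topology; averaging across topologies requires a notion of distance through the shared boundaries.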

Tree space

The simplest non-trivial tree space for time-trees ($n=3$)
Each non-degenerate tree topology is a two-dimensional space.
These subspaces meet at a shared edge representing the star tree, which is a one-dimensional subspace (its single parameter is the age of the root).
Dashed lines are shortest distances between the four displayed trees.

Another space of tip-labeled time-trees

Projection of tree space on 4 taxa

This is a projection of a part of 4-taxa tree space. The full number of subspaces is actually 18, but we only show 6 of them here. Also each subspace is actually a cube, but we show them as squares by fixing the tip-most interval.
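
The count of 18 agrees with the standard number of ranked, tip-labeled tree topologies, $n!(n-1)!/2^{n-1}$, assuming each subspace corresponds to one ranked topology. A short sketch (the function name is ours):

```python
from math import comb, prod

def num_ranked_topologies(n):
    """Number of ranked, tip-labeled binary tree topologies on n taxa.

    Going back in time, each coalescence picks one pair out of the k
    lineages that remain, for k = n, n-1, ..., 2.
    """
    return prod(comb(k, 2) for k in range(2, n + 1))

counts = {n: num_ranked_topologies(n) for n in (3, 4, 5)}
```

For $n=3$ this gives the three topologies of the previous slide, and for $n=4$ it gives 18.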

Tree space

Tree space is a complex space made up of a subspace for each tree topology.
For time-trees of $n$ taxa each subspace is the positive orthant of $\mathbb{R}^{n-1}$ or an $(n-1)$-dimensional simplex, depending on the parameterization.
Tree space has a combinatorial component that can be described by a graph whose vertices are distinct tree topologies and whose edges connect topologies whose subspaces share a face.
Tree space can be explored by random walks on the graph connecting tree topologies, but also by larger jumps like subtree-prune-and-regraft (SPR).
The continuous component of tree space, which describes the divergence times on each tree, must also be explored in Bayesian phylogenetic inference.

Genetic distance = rate $\times$ time

The strict molecular clock parameterization

The "substitution tree" is in units of expected substitutions, i.e. genetic distances.

Non-identifiability of rate and time
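
A tiny numerical illustration, with made-up numbers: because the sequence data only inform the product of rate and time, halving the time and doubling the rate gives the same genetic distance.

```python
def genetic_distance(rate, time):
    """Expected substitutions per site along a branch: rate x time."""
    return rate * time

# Two hypothetical (rate, time) pairs that the data cannot tell apart.
slow_and_old = genetic_distance(rate=0.01, time=10.0)
fast_and_young = genetic_distance(rate=0.02, time=5.0)
same_distance = abs(slow_and_old - fast_and_young) < 1e-12
```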

Identifiability via node calibrations

Suppose fossil evidence shows the common ancestor of species A, B and C lived 25-35 Mya.

With a strict molecular clock, the age (or age range) of a single node in the tree is enough to interpolate and extrapolate all other divergence times.

Once a known node age like this "calibrates" the tree, the genetic distances can be separated into an absolute rate and divergence times.
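
A worked sketch of this separation under a strict clock. The node heights in substitutions per site are hypothetical; only the 25-35 Mya range comes from the example above.

```python
def rate_range(height_subs, age_min, age_max):
    """Rate range (subs/site/Myr) implied by an age range for one node."""
    return height_subs / age_max, height_subs / age_min

def age_range(height_subs, rate_min, rate_max):
    """Age range (Myr) of another node, given the calibrated rate range."""
    return height_subs / rate_max, height_subs / rate_min

# Hypothetical heights of two nodes in the substitution tree.
height_abc = 0.06   # common ancestor of A, B and C (the calibrated node)
height_ab = 0.03    # common ancestor of A and B (uncalibrated)

rate_min, rate_max = rate_range(height_abc, age_min=25.0, age_max=35.0)
ab_min, ab_max = age_range(height_ab, rate_min, rate_max)
```

With these numbers the uncalibrated node inherits a proportional range of 12.5 to 17.5 Mya.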

Identifiability via leaf calibrations

Suppose sample C is ancient DNA from subfossil remains dated to 20 thousand years ago.

With a strict molecular clock, calibration is possible with data from non-contemporaneous taxa (e.g. ancient DNA, or samples from different years in a rapidly evolving virus).

Again, once one or more non-contemporaneous leaf nodes like this "calibrate" the time scale of the tree, the genetic distances can be separated into an absolute rate and divergence times.
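
A rough sketch of the idea, assuming a strict clock and hypothetical root-to-tip distances: the ancient sample stopped accumulating substitutions 20 thousand years ago, so its root-to-tip distance falls short of a modern sample's by rate $\times$ 20,000 years.

```python
def rate_from_ancient_tip(d_modern, d_ancient, sample_age):
    """Strict-clock rate implied by one dated ancient sample.

    d_modern, d_ancient: root-to-tip genetic distances (subs/site)
    sample_age: age of the ancient sample in years before present
    """
    return (d_modern - d_ancient) / sample_age

# Hypothetical distances; only the 20,000-year sample age is from the example.
d_modern = 0.010
d_ancient = 0.008

rate = rate_from_ancient_tip(d_modern, d_ancient, sample_age=20_000)  # subs/site/year
root_age = d_modern / rate                                            # years before present
```

In practice the rate and the times are estimated jointly from all sequences rather than from two distances.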

Genetic distance = rate $\times$ time

The relaxed molecular clock parameterization

The "substitution tree" is in units of expected substitutions, i.e. genetic distances.

Non-identifiability of rate and time

Bayesian phylogenetic posterior with a molecular clock

Strict molecular clock:

$$T = \color{darkgreen}{\mu} \times \color{red}{g}$$

$$P(\color{red}{g}, \color{darkgreen}{\mu}, \theta|D) = \frac{1}{\Pr(D)}\Pr(D|T)P(\color{red}{g}|\theta)P(\color{darkgreen}{\mu})P(\theta)$$

Relaxed molecular clock:

$$T = \color{darkgreen}{\vec{\mu}} \star \color{red}{g}$$

$$P(\color{red}{g}, \color{darkgreen}{\vec{\mu}},\theta|D) = \frac{1}{\Pr(D)}\Pr(D|T)P(\color{red}{g}|\theta)P(\color{darkgreen}{\vec{\mu}})P(\theta)$$

where $P(\color{darkgreen}{\vec{\mu}})$ is the (model-based) prior for how much rate variation you allow.
The phylogenetic likelihood only depends on the substitution tree $T$: $\Pr(D|T)$.
The tree prior only depends on the time-tree $\color{red}{g}$: $P(\color{red}{g}|\theta)$.
By fixing $\color{darkgreen}{\vec{\mu}} = 1$ we get a time-tree ($\color{red}{g}$) in units of substitutions (i.e. genetic distance).
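
The mapping from a time-tree and rates to a substitution tree can be sketched as follows. The tree encoding (a `parent` map plus node ages) and all numbers are illustrative assumptions, not a BEAST data structure.

```python
# Time-tree g for topology ((A,B),C): node ages in arbitrary time units.
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
age = {"A": 0.0, "B": 0.0, "C": 0.0, "AB": 1.0, "root": 3.0}

def substitution_tree(parent, age, rate):
    """Branch lengths of T: the rate on each branch times its duration in g."""
    return {
        node: rate[node] * (age[parent[node]] - age[node])
        for node in parent
    }

# Strict clock: one rate shared by every branch.
strict = substitution_tree(parent, age, {node: 0.5 for node in parent})

# Relaxed clock: one rate per branch (2n - 2 = 4 rates for n = 3).
relaxed = substitution_tree(
    parent, age, {"A": 0.5, "B": 1.0, "AB": 2.0, "C": 0.25}
)

# All rates fixed to 1: branch lengths of T equal the durations in g.
unit = substitution_tree(parent, age, {node: 1.0 for node in parent})
```

Only the resulting branch lengths enter the likelihood, while the tree prior sees only `age`.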

How many parameters are in each of these models?

Unrooted tree, no molecular clock constraint

The unrooted substitution tree $T$ has $2n-3$ random variables, one for each branch length.

Strict molecular clock:

$\color{red}{g}$ has $n-1$ random variables, one for the age of each internal node.
$\color{darkgreen}{\mu}$ is a one-dimensional parameter.
This gives a total of $n$ dimensions.

Relaxed molecular clock:

$\color{red}{g}$ has $n-1$ random variables, one for the age of each internal node.
$\color{darkgreen}{\vec{\mu}}$ contains $2n-2$ rate parameters, one for each branch of the rooted tree.
This gives a total of $3n-3$ dimensions.
Only identifiable by priors on rates and times: $P(\color{darkgreen}{\vec{\mu}})$, $P(\color{red}{g}|\theta)$
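
The counts above can be tabulated for any $n$; this small helper (its name is ours) simply restates the formulas.

```python
def clock_model_dimensions(n):
    """Number of continuous parameters per model for n taxa."""
    return {
        "unrooted": 2 * n - 3,           # one length per branch
        "strict": (n - 1) + 1,           # node ages + one rate
        "relaxed": (n - 1) + (2 * n - 2) # node ages + one rate per branch
    }

dims_10 = clock_model_dimensions(10)
```

For 10 taxa this gives 17, 10 and 27 dimensions: the relaxed clock has more parameters than the $2n-3$ branch lengths the likelihood can inform, which is why the priors are needed.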

What models of rate variation should we consider?

Autocorrelated models of rate variation assume the rate evolves down the tree:

$$P(\color{darkgreen}{\vec{\mu}}) = \prod_i P(\mu_i | \mu_{\text{parent}(i)})$$

e.g. $\text{log}(\mu_i) \sim \text{Normal}(\text{log}(\mu_{\text{parent}(i)}), \sigma t_i)$, where $t_i$ is the length of time separating the $i$'th node and its parent in the time tree and $\sigma$ is the rate of evolution of the rate of evolution per unit time in log space.

Uncorrelated models of rate variation assume each branch has a rate drawn independently from a distribution:

$$P(\color{darkgreen}{\vec{\mu}}) = \prod_i P(\mu_i | \nu)$$

where $\nu$ are parameters of the distribution they are drawn from, e.g. $\text{log}(\mu_i) \sim \text{Normal}(M, S)$, where $M$ is the mean of the log rate and $S$ is the standard deviation of the log-rate.
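
The two families can be contrasted by simulating rates on a small tree. This is a sketch under stated assumptions: the tree encoding, `root_rate`, and all numeric values are hypothetical, and in the autocorrelated case $\sigma t_i$ is read as a variance (so the standard deviation is $\sqrt{\sigma t_i}$), which is one possible reading of the notation above.

```python
import math
import random

parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
age = {"A": 0.0, "B": 0.0, "C": 0.0, "AB": 1.0, "root": 3.0}

def uncorrelated_rates(nodes, M, S, rng):
    """Each branch rate is an independent lognormal draw: log(mu_i) ~ Normal(M, S)."""
    return {node: math.exp(rng.gauss(M, S)) for node in nodes}

def autocorrelated_rates(parent, age, root_rate, sigma, rng):
    """Each log rate is centred on the parent's log rate.

    Assumption: the variance of the change is sigma * t_i.
    """
    root = next(v for v in age if v not in parent)
    rates = {root: root_rate}
    # Visit branches from the root downwards so a parent's rate exists first.
    for node in sorted(parent, key=lambda v: age[parent[v]], reverse=True):
        t = age[parent[node]] - age[node]
        sd = math.sqrt(sigma * t)
        rates[node] = math.exp(rng.gauss(math.log(rates[parent[node]]), sd))
    del rates[root]  # the root has no branch above it
    return rates

rng = random.Random(1)
ucln = uncorrelated_rates(parent, M=-1.0, S=0.5, rng=rng)
acln = autocorrelated_rates(parent, age, root_rate=0.5, sigma=0.1, rng=rng)
```

With `sigma=0` the autocorrelated model collapses to a strict clock at `root_rate`, which is a useful sanity check.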

Bayesian MCMC differences

The main model in MrBayes samples unrooted trees with no molecular clock.

Operator proposals have to be able to move from any unrooted tree to any other.
Branch length proposals can propose any positive branch length for any branch.

The main model in BEAST samples rooted time trees with a strict or relaxed molecular clock.

Operators have to maintain constraints: the tree is parameterised on a time scale in which all leaves are fixed to their known ages, and every move on tree topology or divergence times must preserve that constraint.
Natural to operate on divergence times instead of branch lengths.
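
As an illustration of a constraint-preserving move on divergence times, the sketch below redraws the age of one internal (non-root) node uniformly between its oldest child and its parent. It is a simplified example in the spirit of such operators, not BEAST's actual implementation; the tree encoding and names are assumptions.

```python
import random

parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
age = {"A": 0.0, "B": 0.0, "C": 0.0, "AB": 1.0, "root": 3.0}

def children_of(parent):
    """Invert the parent map into a node -> list of children map."""
    children = {}
    for child, p in parent.items():
        children.setdefault(p, []).append(child)
    return children

def propose_node_age(node, parent, age, rng):
    """Propose a new age for an internal, non-root node.

    The new age stays older than both children and younger than the
    parent, so leaf ages and the topology are left untouched.
    """
    lower = max(age[c] for c in children_of(parent)[node])
    upper = age[parent[node]]
    proposal = dict(age)  # copy, so the current state is not modified
    proposal[node] = rng.uniform(lower, upper)
    return proposal

rng = random.Random(42)
new_age = propose_node_age("AB", parent, age, rng)
```

Because the bounds do not depend on the node's current age, this proposal is symmetric and its Hastings ratio is 1.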

Conclusions

The geometry of (time-)trees is understudied and advances could lead to new phylogenetic inference methods and better posterior summaries.
Both leaf and node calibration can produce time-trees in calendar units (new methods that directly model the sampling process can also be used for calibration).
Relaxed molecular clocks have many benefits over unconstrained models for phylogenetic inference:
- They appear to estimate the phylogenetic tree more accurately on real data sets
- They automatically provide estimates of a root position, without the need for an outgroup
- They automatically provide estimates of relative divergence dates, or absolute divergence dates when calibration information is available