1. Understanding Tree Space
Before diving into molecular clocks, we need to understand the mathematical space in which phylogenetic trees exist. This "tree space" has unique geometric properties that affect how we search for optimal trees and summarize posterior distributions.
Tree Space for Time Trees
Consider the simplest non-trivial case: time trees for 3 taxa.
Tree Subspace Properties
- Each tree topology defines a subspace
- For time trees with $n$ taxa, each subspace is $(n-1)$-dimensional
- The dimensions can be parameterized as:
  - Inter-coalescent intervals: form a hypercube (all intervals independent)
  - Node heights: form a simplex (heights must respect temporal ordering)
- Trees can be averaged within a subspace (arithmetic mean shown)
- Distances between trees can be computed (Euclidean distance shown as dashed lines)
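The two parameterizations, and the averaging and distance operations within one subspace, can be sketched as follows. This is a minimal illustration, assuming all tips are sampled at time 0 and that a tree within a fixed ranked topology is stored as a list of internal node heights ordered from youngest to oldest. The function names and example values are hypothetical.

```python
import math

def heights_to_intervals(heights):
    """Convert ordered node heights (youngest first) to inter-coalescent intervals."""
    if any(b < a for a, b in zip(heights, heights[1:])):
        raise ValueError("node heights must be non-decreasing")
    previous = [0.0] + list(heights[:-1])
    return [h - p for h, p in zip(heights, previous)]

def intervals_to_heights(intervals):
    """Convert inter-coalescent intervals back to cumulative node heights."""
    heights, total = [], 0.0
    for x in intervals:
        total += x
        heights.append(total)
    return heights

def mean_tree(trees):
    """Coordinate-wise arithmetic mean of trees that share one ranked topology."""
    n = len(trees)
    return [sum(coords) / n for coords in zip(*trees)]

# Two hypothetical 3-taxon time trees with the same ranked topology (node heights).
tree_a = [1.0, 3.0]
tree_b = [2.0, 5.0]

intervals_a = heights_to_intervals(tree_a)   # [1.0, 2.0]
intervals_b = heights_to_intervals(tree_b)   # [2.0, 3.0]

average_heights = mean_tree([tree_a, tree_b])           # [1.5, 4.0]
distance_heights = math.dist(tree_a, tree_b)            # sqrt(5)
distance_intervals = math.dist(intervals_a, intervals_b)  # sqrt(2)
```

Because the conversion between heights and intervals is linear, the arithmetic mean is the same tree under either parameterization, but the Euclidean distance between two trees generally differs.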
The Complete Tree Space
Key features of tree space for 3 taxa:
- Each non-degenerate tree topology is a two-dimensional space
- These subspaces meet at shared edges representing degenerate topologies
- The star tree (all three taxa diverge simultaneously) is a one-dimensional subspace
- The parameter for the star tree is just the age of the root
Key insight: Tree space has both discrete (topology) and continuous (branch lengths/node times) components. MCMC must explore both!
Alternative Visualizations of Tree Space
Parameterization matters: When using node heights as parameters, the constraint that parent nodes must be older than child nodes creates simplices. In contrast, inter-coalescent intervals are independent, creating hypercubes.
Higher-Dimensional Tree Spaces
For 4 taxa, the tree space becomes more complex:
Tree Space Complexity
As the number of taxa increases:
- 3 taxa: 3 ranked topologies, 2D subspaces
- 4 taxa: 18 ranked topologies, 3D subspaces
- 5 taxa: 180 ranked topologies, 4D subspaces
For 4 taxa: There are 15 rooted topologies total. Of these:
- 12 are "caterpillar" trees (fully pectinate) - each has only 1 ranking
- 3 are balanced trees - each has 2 possible rankings, because the two internal nodes sit on opposite sides of the root and either one can be the older
- Total: 12 × 1 + 3 × 2 = 18 ranked topologies
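The counts above can be checked with the standard closed-form expressions: $(2n-3)!!$ rooted binary topologies and $n!\,(n-1)!/2^{n-1}$ ranked topologies for $n$ labelled taxa. The sketch below is only a sanity check; the function names are hypothetical.

```python
from math import factorial

def rooted_topologies(n):
    """Number of rooted binary labelled topologies: (2n - 3)!! = 3 * 5 * ... * (2n - 3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

def ranked_topologies(n):
    """Number of ranked labelled topologies: n! (n - 1)! / 2^(n - 1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)

for n in (3, 4, 5):
    print(n, rooted_topologies(n), ranked_topologies(n))
# 3 3 3
# 4 15 18
# 5 105 180
```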
Note: Each subspace corresponds to a ranked topology where the temporal order of all coalescence events is specified. When parameterized by inter-coalescent intervals, each subspace forms a hypercube (all intervals can vary independently). When parameterized by node heights, the subspaces form simplices due to temporal ordering constraints.
Implications for Phylogenetic Inference
Understanding tree space helps us appreciate:
- Complexity of the search problem: The number of topologies grows super-exponentially with the number of taxa
- Need for specialized moves: MCMC operators must handle both discrete topology changes and continuous parameter updates
- Challenges in summarization: Averaging trees across topologies is problematic
- Local optima: The discrete nature creates barriers between topology subspaces
3. The Identifiability Problem
Non-identifiability of Rate and Time
Fundamental problem: From genetic distances alone, we cannot separate rate from time. If we double all times and halve the rate, we get the same likelihood!
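A small numerical check makes this concrete. The sketch below assumes the Jukes-Cantor (JC69) substitution model and just two sequences separated by a common ancestor at age `time`; the model choice, the function name, and the counts are illustrative assumptions, not part of the original material.

```python
import math

def jc69_pair_log_likelihood(n_sites, n_diffs, rate, time):
    """Log-likelihood of two aligned sequences under JC69 and a strict clock.

    The two tips are separated by 2 * time, so the expected number of
    substitutions per site between them is d = 2 * rate * time.
    """
    d = 2.0 * rate * time
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # P(site differs)
    p_same = 1.0 - p_diff                             # P(site identical)
    return (n_sites * math.log(0.25)                  # stationary base frequency
            + n_diffs * math.log(p_diff / 3.0)        # each specific mismatch
            + (n_sites - n_diffs) * math.log(p_same))

# Hypothetical data: 1000 sites, 100 of which differ.
ll_fast_short = jc69_pair_log_likelihood(1000, 100, rate=0.01, time=5.0)
ll_slow_long = jc69_pair_log_likelihood(1000, 100, rate=0.005, time=10.0)

# Halving the rate and doubling the time leaves the likelihood unchanged.
print(math.isclose(ll_fast_short, ll_slow_long))
```

The rate and the time enter the likelihood only through their product, which is why the data alone cannot separate them.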
Solutions to the Identifiability Problem
To separate rate and time, we need additional information:
1. Node Calibrations
Use fossil or biogeographic evidence to constrain node ages:
Node Calibration in Practice
- A fossil provides a minimum age (the clade must be at least as old as its oldest fossil)
- Biogeography can provide maximum age (e.g., island age)
- Often specified as probability distributions (e.g., lognormal)
- Multiple calibrations improve precision
2. Tip Calibrations
Use samples from different time points:
Applications of Tip Calibration
- Ancient DNA: Subfossil remains, museum specimens
- Rapidly evolving pathogens: Virus samples from different years
- Laboratory evolution: Samples from known time points
Key principle: With either node or tip calibration, knowing even one absolute time allows us to estimate all other times under a strict clock.
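The key principle can be illustrated with a minimal sketch, assuming a strict clock, contemporaneous tips, and pairwise genetic distances already corrected for multiple substitutions, so that a distance $d$ between two tips relates to their divergence time $t$ by $d = 2 \mu t$. The function names and all numbers are hypothetical.

```python
def calibrate_strict_clock(calibrated_distance, calibrated_age):
    """Estimate the clock rate from one pair of tips with a known divergence time."""
    return calibrated_distance / (2.0 * calibrated_age)

def divergence_time(distance, rate):
    """Convert a pairwise genetic distance into a divergence time under d = 2 * rate * t."""
    return distance / (2.0 * rate)

# One calibrated node: distance 0.10 substitutions/site, known age 10 (e.g., million years).
rate = calibrate_strict_clock(0.10, 10.0)

# Every other divergence can now be dated from its distance alone.
other_distances = {"pair_1": 0.04, "pair_2": 0.16}
estimated_times = {name: divergence_time(d, rate) for name, d in other_distances.items()}
print(rate, estimated_times)
```

In practice the calibration is a distribution rather than a point value, and the rate and times are estimated jointly, but the single known age is what fixes the scale.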
4. Relaxed Molecular Clocks
The strict molecular clock is often too restrictive. Relaxed clocks allow rate variation across branches.
Relaxed Clock Parameterization
Relaxed Molecular Clock
Instead of a single rate $\mu$, we have a vector of rates $\vec{\mu} = (\mu_1, \mu_2, \ldots, \mu_{2n-2})$, one for each of the $2n-2$ branches of a rooted binary tree with $n$ tips.
The substitution tree is computed as: $T = \vec{\mu} \star g$
where $g$ is the time tree and $\star$ denotes element-wise multiplication of each branch's rate by its duration.
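The element-wise product can be sketched in a few lines, assuming the time tree is reduced to a list of branch durations and the rates are stored in the same branch order. The function name, the branch ordering, and the values are hypothetical.

```python
def substitution_branch_lengths(rates, durations):
    """Element-wise product: branch length in substitutions/site = rate * duration."""
    if len(rates) != len(durations):
        raise ValueError("need exactly one rate per branch")
    return [r * t for r, t in zip(rates, durations)]

# Hypothetical 3-taxon time tree ((A,B),C): 2n - 2 = 4 branches.
# Branch order: A, B, C, and the internal branch above (A,B).
durations = [1.0, 1.0, 2.0, 1.0]   # time units
rates = [0.01, 0.02, 0.01, 0.03]   # substitutions/site per time unit

lengths = substitution_branch_lengths(rates, durations)
print(lengths)
```

The time tree is ultrametric (every tip is 2.0 time units from the root), but the resulting substitution tree generally is not, because each branch has its own rate.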
Alternative Relaxed Clock Visualization
Identifiability Under Relaxed Clocks
Critical issue: Relaxed clocks are only identifiable through their priors! The likelihood alone cannot distinguish between fast rates with short times vs. slow rates with long times.
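A one-line worked equation shows why. Assuming the likelihood depends on rates and times only through the substitution branch lengths $\mu_i t_i$ (with $t_i$ the duration of branch $i$), rescaling by any constant $c > 0$ gives:

$$(c\,\mu_i)\left(\frac{t_i}{c}\right) = \mu_i\, t_i \quad \text{for every branch } i$$

The substitution tree, and therefore the likelihood, is identical for every value of $c$. Only the priors on rates and times (including any calibrations) make some values of $c$ more probable than others.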
6. Models of Rate Variation
Autocorrelated Models
Rates evolve along the tree, with child rates similar to parent rates:
Autocorrelated Rate Model
$$P(\textcolor{darkgreen}{\vec{\mu}}) = \prod_i P(\mu_i | \mu_{\text{parent}(i)})$$
Common implementation: Lognormal autocorrelated model
$$\log(\mu_i) \sim \text{Normal}(\log(\mu_{\text{parent}(i)}), \sigma \sqrt{t_i})$$
Where:
- $t_i$ = time duration of branch $i$
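- $\sigma$ = rate of evolution of rates (volatility parameter); the second argument $\sigma \sqrt{t_i}$ is read as a standard deviation, so the variance of the log rate grows linearly with branch duration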
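A minimal simulation sketch of the lognormal autocorrelated model follows. It assumes that $\sigma \sqrt{t_i}$ is a standard deviation, that branches attached directly to the root inherit from a separately specified root rate (an extra parameter not defined above), and that branches are listed so that each parent precedes its children. The function name, tree encoding, and values are hypothetical.

```python
import math
import random

def simulate_autocorrelated_rates(parent, durations, root_rate, sigma, rng):
    """Draw log(rate_i) ~ Normal(log(rate_parent(i)), sigma * sqrt(t_i)) for each branch.

    parent[i] is the index of the branch above branch i, or None if branch i
    descends directly from the root. Parents must appear before children.
    """
    rates = []
    for i, p in enumerate(parent):
        parent_rate = root_rate if p is None else rates[p]
        sd_log = sigma * math.sqrt(durations[i])
        rates.append(math.exp(rng.gauss(math.log(parent_rate), sd_log)))
    return rates

# Hypothetical 3-taxon tree ((A,B),C).
# Branch order: internal (A,B), C, A, B.
parent = [None, None, 0, 0]
durations = [1.0, 2.0, 1.0, 1.0]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
rates = simulate_autocorrelated_rates(parent, durations, root_rate=0.01, sigma=0.3, rng=rng)
print(rates)
```

Setting `sigma` to 0 recovers the strict clock, since every branch then inherits its parent's rate unchanged.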
Uncorrelated Models
Rates are drawn independently for each branch from a shared distribution with hyperparameters $\nu$:
Uncorrelated Rate Model
$$P(\textcolor{darkgreen}{\vec{\mu}}) = \prod_i P(\mu_i | \nu)$$
Common implementations:
- Lognormal: $\log(\mu_i) \sim \text{Normal}(M, S)$
- Exponential: $\mu_i \sim \text{Exponential}(\lambda)$
- Gamma: $\mu_i \sim \text{Gamma}(\alpha, \beta)$
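The three implementations can be sketched with Python's standard `random` module. The parameter conventions are assumptions: `M` and `S` are taken as the mean and standard deviation of the log rate, $\lambda$ as a rate parameter, and $\beta$ as a scale parameter (which is what `gammavariate` expects; if $\beta$ is meant as a rate, pass $1/\beta$ instead). The function name and values are hypothetical.

```python
import random

def sample_uncorrelated_rates(n_branches, model, rng, **params):
    """Draw one independent rate per branch from the chosen distribution."""
    if model == "lognormal":
        # log(rate) ~ Normal(M, S), with S a standard deviation
        draw = lambda: rng.lognormvariate(params["M"], params["S"])
    elif model == "exponential":
        # rate ~ Exponential(lam), with mean 1 / lam
        draw = lambda: rng.expovariate(params["lam"])
    elif model == "gamma":
        # rate ~ Gamma(alpha, beta), with beta a scale, so mean = alpha * beta
        draw = lambda: rng.gammavariate(params["alpha"], params["beta"])
    else:
        raise ValueError(f"unknown model: {model}")
    return [draw() for _ in range(n_branches)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_branches = 2 * 5 - 2   # 2n - 2 branches for n = 5 taxa

lognormal_rates = sample_uncorrelated_rates(n_branches, "lognormal", rng, M=-4.6, S=0.5)
exponential_rates = sample_uncorrelated_rates(n_branches, "exponential", rng, lam=100.0)
gamma_rates = sample_uncorrelated_rates(n_branches, "gamma", rng, alpha=2.0, beta=0.005)
```

Unlike the autocorrelated model, no tree structure is needed here: each branch rate is drawn without reference to its parent.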
Autocorrelated
- Biologically motivated
- Smooth rate changes
- Good for closely related species
- Can extrapolate rates
Uncorrelated
- More flexible
- Allows sudden rate changes
- Good for diverse datasets
- Simpler to implement