Phylogenetics
Lecturer: Alexei Drummond

Molecules as Documents of Evolutionary History

  • Macromolecules contain information about the processes and history that formed them
    HIV-1 (UK)  ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT
    HIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT
    
    
  • However, this information is not complete, so the full history must be inferred
  • One of the aims of computational biology is to decipher the information held in molecular sequences about the process and history of evolution

Phylogenetics

  • Similarity (due to homology) is viewed as evidence of common ancestry
    • Homology: the result of inheritance from a common ancestor
  • Phylogenetic trees are used to portray relationships based upon common ancestry
  • Monophyletic groups (clades) contain species that are more closely related to each other than to any species outside the group
  • Phylogenetics has in recent years become a statistical science based on probabilistic models of evolution (more on this in later lectures).

A typical molecular phylogenetic analysis

  1. Collect homologous sequences
  2. Construct a multiple sequence alignment
  3. Phylogeny reconstruction
  4. Test the reliability of the estimated phylogeny
  5. Interpretation and application of phylogenies

Applications of phylogenetics

  • Inferring relationships among species and genes
  • Estimating divergence times
  • Identifying functional elements in comparisons of genomic sequences
  • Detecting molecular adaptation
  • Forensics
  • Studying the emergence and spread of viral pandemics
  • many more...

Types of phylogenies and representations

Bifurcating (binary) and multifurcating trees

  • In a rooted tree a polytomy is a node with more than 2 children.
  • In an unrooted tree a polytomy is a node of degree 4 or greater.

Rooted and unrooted trees

Many of the early phylogeny-reconstruction methods developed were unable to identify the root of the tree, so unrooted trees were inferred. This includes the maximum parsimony method, as well as those distance, likelihood and Bayesian methods that do not assume a molecular clock.

Rooting trees using an outgroup

The same unrooted tree

Anatomy of a tree

How many trees are there?

For $n$ taxa there are:
$$T_n^{(R)}=(2n−3)(2n−5)\dots(3)(1)$$
rooted, binary trees

$n$    # trees                  Comment
4      15                       enumerable by hand
5      105                      enumerable by hand on a rainy day
6      945                      enumerable by computer
7      10395                    still searchable very quickly on computer
8      135135                   about the number of hairs on your head
9      2027025                  greater than the population of Auckland
10     34459425                 $\approx$ upper limit for exhaustive search
20     $8.2 \times 10^{21}$     $\approx$ upper limit of branch-and-bound searching
48     $3.21 \times 10^{70}$    $\approx$ the number of particles in the Universe
136    $2.11 \times 10^{267}$   number of trees to choose from in the "Out of Africa" data (Vigilante et al. 1991)

Counting unrooted trees with stepwise addition algorithm

$$T_3 = 1 \rightarrow T_4 = 1 \times 3 \rightarrow T_5 = 1 \times 3 \times 5$$
$T_n = 1 \times 3 \times 5 \times \dots \times (2n−5)$ unrooted trees with $n$ leaves

$T_{20} \approx 2.2 \times 10^{20}$, $T_{53} \approx$ the number of particles in the universe
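The two counting formulas can be checked with a few lines of Python. This is a minimal sketch; the function names `count_rooted` and `count_unrooted` are chosen here for illustration and do not come from the lecture.

```python
def odd_product(k):
    """Return 1 * 3 * 5 * ... * k for odd k (1 if k < 1)."""
    product = 1
    for factor in range(1, k + 1, 2):
        product *= factor
    return product


def count_rooted(n):
    """Number of rooted binary trees on n taxa: (2n-3)(2n-5)...(3)(1)."""
    return odd_product(2 * n - 3)


def count_unrooted(n):
    """Number of unrooted binary trees on n taxa: 1 * 3 * ... * (2n-5)."""
    return odd_product(2 * n - 5)


for n in (4, 5, 10, 20):
    print(n, count_rooted(n), count_unrooted(n))
```

An unrooted tree with $n$ leaves has $2n-3$ branches, and placing the root on any one of them gives a distinct rooted tree, so the rooted count for $n$ taxa equals the unrooted count for $n+1$ taxa.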

The topological distance between two trees

The partition distance (Robinson and Foulds, 1981) is defined as the total number of bipartitions that are in one tree but not in the other.
Each internal branch defines a bipartition (split) of species.
The distance between these two trees is 2.
The partition distance ranges from 0 to $2(n − 3)$ for $n$ species
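The partition distance can be sketched in Python by collecting the bipartition induced by each internal branch and counting those found in only one tree. The nested-tuple tree encoding, the function names and the three example trees below are assumptions made for illustration; they are not the trees from the lecture.

```python
def leaf_set(node):
    """All leaf names below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Non-trivial bipartitions of the taxa, one per internal branch."""
    taxa = leaf_set(tree)
    found = set()

    def visit(node):
        below = leaf_set(node)
        other = taxa - below
        # Keep only splits with at least two taxa on each side.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def partition_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))


# Hypothetical unrooted five-taxon trees, written with a trifurcation at the top.
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
t3 = (("A", "D"), "B", ("C", "E"))
print(partition_distance(t1, t2), partition_distance(t1, t3))
```

Here `t1` and `t2` share the split DE|ABC and differ in one split each, while `t1` and `t3` share no splits and so reach the maximum $2(n-3) = 4$ for $n = 5$.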

Phylogenetic reconstruction

There are essentially two types of data for phylogenetic tree estimation:
  • Distance data, usually stored in a distance matrix, e.g. DNA-DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
  • Character data, usually stored in a character array, e.g. multiple sequence alignment of DNA sequences, morphological characters.
Distances
     A  B  C  D  E
A    0  3  5  6  5
B    3  0  4  7  6
C    5  4  0  5  4
D    6  7  5  0  1
E    5  6  4  1  0
Characters
     1  2  3  4  5  6  7  8  9
A    1  0  0  0  1  1  0  1  1
B    0  1  0  0  1  1  1  1  1
C    0  0  1  0  0  0  1  1  1
D    0  0  0  1  0  0  0  0  0
E    0  0  0  0  0  0  0  0  0

Phylogenetic reconstruction

There are a huge number of possible trees even for small data sets.
We have three options:
  • Construct a tree according to some clustering algorithm
  • Assign a goodness of fit criterion (an objective function) and find the tree(s) which optimise(s) this criterion
  • Admit it is a statistical estimation problem and find the most probable phylogenies under a statistical model.

Clustering algorithms

Common clustering algorithms are neighbour-joining and UPGMA (Unweighted Pair-Group Method using Arithmetic averages). Clustering algorithms are usually very fast and simple, but there is no explicit optimality criterion, so
  • the method provides no measure of how good the tree is!
  • we do not get any idea about other potential trees - were there any better trees?

Clustering algorithms

  • The UPGMA and neighbour-joining (NJ) methods are both greedy heuristics which join, at each step, the two closest sub-trees that are not already joined. NJ measures closeness using rate-corrected distances.
  • They are based on the minimum evolution principle.
  • An important concept in both of these methods is a pair of neighbours, which is defined as two nodes that are connected via a single node.

UPGMA example

Distances
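     A    B    C    D
A    0
B    8    0
C    7    9    0
D    12   14   11   0
A and C are the closest pair ($d_{AC} = 7$), so they are joined first. The distances from the new cluster (AC) to B and D are averages:
$$d_{B(AC)} = (d_{BA} + d_{BC})/2 = (8 + 9)/2 = 8.5$$
$$d_{D(AC)} = (d_{DA} + d_{DC})/2 = (12 + 11)/2 = 11.5$$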

UPGMA example

Distances
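       (AC)   B     D
(AC)   0
B      8.5    0
D      11.5   14    0
B is now closest to (AC) ($d_{B(AC)} = 8.5$), so it is joined next. The distance from the new cluster (ABC) to D is the average over its three leaves:
$$d_{(ABC)D} = (d_{AD} + d_{BD} + d_{CD})/3 = (12 + 14 + 11)/3 \approx 12.33$$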

UPGMA example

Distances
        (ABC)   D
(ABC)   0
D       12.33   0
UPGMA produces an ultrametric rooted tree and assumes that evolution is clock-like, i.e. it assumes that the rate of substitution is the same on all branches of the tree.
A tree is ultrametric when all leaves have the same age/height.
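The three steps of the UPGMA example can be reproduced with a short Python sketch. The function name `upgma`, the nested-tuple cluster labels and the dictionary-of-pairs distance store are illustrative choices, not part of the lecture. Each node is placed at a height of half the distance between the two clusters it joins, which is what makes the tree ultrametric.

```python
from itertools import combinations


def upgma(names, dist):
    """Greedy UPGMA clustering.

    dist maps frozenset({a, b}) to the distance between a and b.
    Returns a list of (cluster, height) pairs in the order the clusters form.
    """
    d = dict(dist)
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        new = (a, b)
        # Size-weighted average = plain average over all leaf pairs.
        for c in sizes:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[new] = size_a + size_b
        merges.append((new, height))
    return merges


names = ["A", "B", "C", "D"]
matrix = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
dist = {
    frozenset((names[i], names[j])): matrix[i][j]
    for i in range(len(names))
    for j in range(i)
}
merges = upgma(names, dist)
for cluster, height in merges:
    print(cluster, round(height, 3))
```

On the example matrix this joins A with C first, then B, then D, matching the worked steps.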

UPGMA weaknesses

Distances
     A    B    C    D
A    0
B    8    0
C    7    9    0
D    12   14   11   0
There is a (non clock-like) tree with a different topology that fits the distance matrix perfectly!

Neighbour-joining algorithm
(Saitou and Nei, 1987)

  • Most widely-used distance-based method for phylogenetic reconstruction
  • UPGMA illustrated that it is not enough to pick the closest neighbors (at least when there is rate heterogeneity across branches)
  • Idea: take into account averaged distances to other leaves as well
  • Produces an unrooted tree

The basic idea

  • We start by computing the "average distance" from $i$ to every other taxon $$r_i = \frac{1}{n-2} \sum_j d_{ij}$$
  • We then compute new corrected distances for all pairs $(i,j)$: $$d_{ij}^{*} = d_{ij} - r_i - r_j$$
  • We are effectively pushing each node $i$ closer to all other nodes by $r_i$.

The basic idea

$$d_{ij}^{*} = d_{ij} - r_i - r_j$$
The effect is to correct for long branches.

$$d= \begin{bmatrix} 0 & 8 &7 & 12 \\ 8 & 0 & 9 & 14 \\ 7 & 9 & 0 & 11 \\ 12 & 14 & 11 & 0 \end{bmatrix}$$
$$r = [13.5, 15.5, 13.5, 18.5]$$
$$d^{*}= \begin{bmatrix} 0 & -21 & -20 & -20 \\ -21 & 0 & -20 & -20 \\ -20 & -20 & 0 & -21 \\ -20 & -20 & -21 & 0 \end{bmatrix}$$
Applying the neighbour-joining algorithm to the same distance matrix as in the UPGMA example gives the smallest corrected distances ($-21$) for the pairs AB and CD, so A is grouped with B and C with D, instead of A with C.
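The values of $r$ and $d^*$ for this matrix can be recomputed directly from the two formulas. The sketch below uses plain Python lists; the helper names `average_distances` and `corrected_distances` are made up for this example.

```python
def average_distances(d):
    """r_i = sum_j d_ij / (n - 2) for a symmetric distance matrix d."""
    n = len(d)
    return [sum(row) / (n - 2) for row in d]


def corrected_distances(d):
    """d*_ij = d_ij - r_i - r_j off the diagonal, 0 on the diagonal."""
    n = len(d)
    r = average_distances(d)
    return [
        [0 if i == j else d[i][j] - r[i] - r[j] for j in range(n)]
        for i in range(n)
    ]


d = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
r = average_distances(d)
d_star = corrected_distances(d)
print(r)
for row in d_star:
    print(row)
```

Although A and C have the smallest raw distance (7), the correction lowers the AB and CD entries more, because B and D are far from everything else on average.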

Neighbour joining

  • We use an algorithm very similar to UPGMA to select the two closest nodes, $i$ and $j$, based on the corrected distances ($d^*$).
  • We join these into a cluster and make a new node $k$ to correspond to their ancestor
  • The distance from $i$ to the new node $k$ is computed by $$d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$$ and the distance from $j$ follows as $d_{jk} = d_{ij} - d_{ik}$.
  • Nodes $i$ and $j$ are removed from the pool and replaced by $k$. The uncorrected distance from $k$ to each remaining node $m$ is updated as $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$, and $d^*$ is then recalculated.
  • See Higgs and Attwood (2005; pp. 166-169) for details.
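The whole procedure can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not code from the lecture or the textbook: the function name, the `node0`, `node1`, ... labels and the edge-list output are invented here, and the distance from a new node $k$ to a remaining node $m$ uses the standard update $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$. Ties in $d^*$ are broken by taking the first pair found.

```python
from itertools import combinations


def neighbour_joining(names, matrix):
    """Return the unrooted NJ tree as a list of (node, node, branch_length)."""
    d = {
        a: {b: matrix[i][j] for j, b in enumerate(names)}
        for i, a in enumerate(names)
    }
    nodes = list(names)
    edges = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Average distance r_i over the nodes still in the pool.
        r = {i: sum(d[i][j] for j in nodes if j != i) / (n - 2) for i in nodes}
        # Closest pair under the corrected distance d*_ij = d_ij - r_i - r_j.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        k = f"node{new_id}"
        new_id += 1
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        d_jk = d[i][j] - d_ik
        edges.append((i, k, d_ik))
        edges.append((j, k, d_jk))
        # Distances from the new node to every node that remains.
        d[k] = {k: 0.0}
        for m in nodes:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


names = ["A", "B", "C", "D"]
matrix = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
edges = neighbour_joining(names, matrix)
for edge in edges:
    print(edge)
```

On the four-taxon matrix from the UPGMA example, the branch lengths returned add up exactly to every entry of the matrix (for instance A to D is $3 + 1 + 8 = 12$), so the resulting non-clock-like tree fits the distances perfectly.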

Time complexity of the clustering algorithms

  • Both of these clustering-based algorithms take $O(n^3)$ time once we have the distance matrix.
  • There are $n$ steps and in each step we do:
    1. find the smallest distance
    2. join these two taxa
    3. compute the distance from the new ancestor to all others
  • Step 1 takes $O(n^2)$ time and the other two steps take $O(n)$ time, so the $n$ steps together take $O(n^3)$.
Note: There is an alternative (and much harder to follow) formulation of the UPGMA algorithm that takes $O(n^2)$ time.

Least squares distance-based phylogenetics

$d_{ij}$     Human    Chimp    Gorilla  Orangutan
Human        0
Chimp        0.0965   0
Gorilla      0.1140   0.1180   0
Orangutan    0.1849   0.2009   0.1947   0

$$\begin{aligned} S = &(d_{12} - \hat{d}_{12})^2 + (d_{13} - \hat{d}_{13})^2 + \\ &(d_{14} - \hat{d}_{14})^2 + (d_{23} - \hat{d}_{23})^2 + \\ &(d_{24} - \hat{d}_{24})^2 + (d_{34} - \hat{d}_{34})^2 \end{aligned}$$

$$\hat{d}_{12} = t_1 + t_2, \quad \hat{d}_{13} = t_1 + t_0 + t_3, \dots$$
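Here $\hat{d}_{ij}$ is the distance predicted by the tree, i.e. the sum of the branch lengths on the path between taxa $i$ and $j$, with $t_0$ the internal branch and $t_1, \dots, t_4$ the branches leading to the four taxa. There are six data points (distances) and five free parameters (branch lengths) to numerically optimise so as to minimise $S$.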

Least squares distance-based phylogenetics

Tree           $t_0$      $t_1$ (H)   $t_2$ (C)   $t_3$ (G)   $t_4$ (O)   $S$
((H,C),G,O)    0.008840   0.043266    0.053280    0.058908    0.135795    0.000035
((H,G),C,O)    0.0        0.046212    0.056227    0.061854    0.138742    0.000140
((H,O),C,G)    0.0        0.046212    0.056227    0.061854    0.138742    0.000140
(H,G,C,O)      0.0        0.046212    0.056227    0.061854    0.138742    0.000140
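To make the objective function concrete, the sketch below evaluates $S$ for the tree ((H,C),G,O) using the branch lengths in the first row of the table. It only evaluates $S$ for given branch lengths and does not perform the numerical optimisation. The helper names and the `cherry` argument (the pair of taxa on one side of the internal branch) are assumptions made for illustration.

```python
observed = {
    ("H", "C"): 0.0965,
    ("H", "G"): 0.1140,
    ("C", "G"): 0.1180,
    ("H", "O"): 0.1849,
    ("C", "O"): 0.2009,
    ("G", "O"): 0.1947,
}


def predicted_distance(a, b, branch, cherry):
    """Path length between leaves a and b on a four-taxon unrooted tree.

    branch maps each leaf to its external branch length and "internal" to t_0;
    cherry holds the two leaves on one side of the internal branch.
    """
    same_side = (a in cherry) == (b in cherry)
    return branch[a] + branch[b] + (0.0 if same_side else branch["internal"])


def sum_of_squares(branch, cherry):
    """S = sum over pairs of (observed - predicted)^2."""
    return sum(
        (d - predicted_distance(a, b, branch, cherry)) ** 2
        for (a, b), d in observed.items()
    )


# Branch lengths for ((H,C),G,O), taken from the first row of the table.
branch_hc = {
    "internal": 0.008840,
    "H": 0.043266,
    "C": 0.053280,
    "G": 0.058908,
    "O": 0.135795,
}
s_hc = sum_of_squares(branch_hc, cherry={"H", "C"})
print(f"{s_hc:.6f}")
```

The result agrees with the tabulated $S$ up to rounding in the last digit, since the branch lengths are themselves rounded to six decimal places.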

Reading: Trees and Distance Methods

  • Bioinformatics and Molecular Evolution, Higgs and Attwood (2005), sections 8.1 and 8.3
  • Computational Molecular Evolution, Ziheng Yang (2006), sections 3.1-3.3
  • Biological Sequence Analysis, Durbin et al. (1998), sections 7.1-7.3
  • An Introduction to Bioinformatics Algorithms, Jones and Pevzner (2004), Chapter 10