Phylogenetics

Lecturer: Alexei Drummond

Molecules as Documents of Evolutionary History

Macromolecules contain information about the processes and history that formed them

HIV-1 (UK.) ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT
HIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT

However, this information is not complete, so the full history must be inferred
One of the aims of computational biology is to decipher the information held in molecular sequences about the process and history of evolution

Phylogenetics

Similarity (due to homology) is viewed as evidence of common ancestry

Homology: the result of inheritance from a common ancestor

Phylogenetic trees are used to portray relationships based upon common ancestry
Monophyletic groups (clades) contain species that are more closely related to each other than to any species outside the group
Phylogenetics has in recent years become a statistical science based on probabilistic models of evolution (more on this in later lectures).

A typical molecular phylogenetic analysis

Collect homologous sequences
Construct a multiple sequence alignment
Phylogeny reconstruction
Test the reliability of the estimated phylogeny
Interpretation and application of phylogenies

Applications of phylogenetics

Inferring relationships among species and genes
Estimating divergence times
Identifying functional elements in comparisons of genomic sequences
Detecting molecular adaptation
Forensics
Studying the emergence and spread of viral pandemics
many more...

Types of phylogenies and representations

Bifurcating (binary) and multifurcating trees

In a rooted tree a polytomy is a node with more than 2 children.
In an unrooted tree a polytomy is a node of degree 4 or greater.

Rooted and unrooted trees

Many of the early phylogeny-reconstruction methods developed were unable to identify the root of the tree, so unrooted trees were inferred. This includes the maximum parsimony method, as well as those distance, likelihood and Bayesian methods that do not assume a molecular clock.

Rooting trees using an outgroup

The same unrooted tree

Anatomy of a tree

How many trees are there?

For $n$ taxa there are:

$$T_n^{(R)}=(2n-3)(2n-5)\dots(3)(1)$$

rooted, binary trees

$n$	#trees	comment
4	15	enumerable by hand
5	105	enumerable by hand on a rainy day
6	945	enumerable by computer
7	10395	still searchable very quickly on computer
8	135135	about the number of hairs on your head
9	2027025	greater than the population of Auckland
10	34459425	$\approx$ upper limit for exhaustive search
20	$8.2 \times 10^{21}$	$\approx$ upper limit of branch-and-bound searching
48	$3.21 \times 10^{70}$	$\approx$ the number of particles in the Universe
136	$2.11 \times 10^{267}$	number of trees to choose from in the "Out of Africa" data (Vigilante et al. 1991)
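
The product formula is easy to evaluate directly. The following Python sketch (the function name `count_rooted_trees` is an illustrative choice, not part of the lecture) multiplies the odd factors $3, 5, \dots, 2n-3$ and can be used to check the smaller entries of the table.

```python
def count_rooted_trees(n):
    """Number of rooted, binary, leaf-labelled trees on n >= 2 taxa."""
    count = 1
    for factor in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count


# Print the exact counts for the rows that are small enough to read.
for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, count_rooted_trees(n))
```

Python integers have arbitrary precision, so the same function also returns exact values for the larger rows.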

Counting unrooted trees with stepwise addition algorithm

$$T_3 = 1 \rightarrow T_4 = 1 \times 3 \rightarrow T_5 = 1 \times 3 \times 5$$
$T_n = 1 \times 3 \times 5 \times \dots \times (2n-5)$ unrooted trees with $n$ leaves

$T_{20} = 2.2 \times 10^{20}$, $T_{53} \approx$ number of particles in the universe
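
The stepwise-addition argument can be written as a short loop. This is a minimal sketch, assuming we start from the single unrooted tree on 3 leaves and use the fact that an unrooted binary tree with $m$ leaves has $2m-3$ branches, so the $n$-th leaf can be attached in $2(n-1)-3 = 2n-5$ places. The function name is hypothetical.

```python
def count_unrooted_trees(n):
    """Number of unrooted, binary, leaf-labelled trees on n >= 3 leaves."""
    count = 1  # T_3: three leaves joined at one internal node
    for leaves in range(4, n + 1):
        # The new leaf can be attached to any branch of the previous tree.
        count *= 2 * leaves - 5
    return count


for n in (3, 4, 5, 6):
    print(n, count_unrooted_trees(n))
```

Note that `count_unrooted_trees(5)` gives 15, the same as the number of rooted trees on 4 taxa: rooting an unrooted tree is equivalent to attaching one extra leaf.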

The topological distance between two trees

The partition distance (Robinson and Foulds, 1981) is defined as the total number of bipartitions that are in one tree but not in the other.

Each internal branch defines a bipartition (split) of species.

The distance between these two trees is 2.
The partition distance ranges from 0 to $2(n-3)$ for $n$ species
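
A minimal sketch of the partition distance, assuming trees are written as nested tuples of taxon names (a representation chosen here for illustration). Each internal branch is recorded as the side of the split that does not contain a fixed reference taxon, so the same split is always stored the same way. The three example trees are hypothetical and are not the trees from the slide figure.

```python
def bipartitions(tree, taxa):
    """Non-trivial splits of a tree given as nested tuples of taxon names."""
    reference = min(taxa)  # store each split as the side without this taxon
    found = set()

    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaf_set(child) for child in node))
        side = taxa - below if reference in below else below
        if 2 <= len(side) <= len(taxa) - 2:  # skip leaf branches and the root
            found.add(side)
        return below

    leaf_set(tree)
    return found


def partition_distance(tree1, tree2, taxa):
    """Number of splits found in exactly one of the two trees."""
    taxa = frozenset(taxa)
    return len(bipartitions(tree1, taxa) ^ bipartitions(tree2, taxa))


taxa = "ABCDE"
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))  # shares the split DE | ABC with t1
t3 = (("A", "D"), "B", ("C", "E"))  # shares no split with t1

print(partition_distance(t1, t2, taxa))
print(partition_distance(t1, t3, taxa))
```

With $n = 5$ the maximum is $2(n-3) = 4$, which is reached by `t1` and `t3`.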

Phylogenetic reconstruction

There are essentially two types of data for phylogenetic tree estimation:

Distance data, usually stored in a distance matrix, e.g. DNA-DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
Character data, usually stored in a character array, e.g. multiple sequence alignment of DNA sequences, morphological characters.

Distances

	A	B	C	D	E
A	0	3	5	6	5
B	3	0	4	7	6
C	5	4	0	5	4
D	6	7	5	0	1
E	5	6	4	1	0

Characters

	1	2	3	4	5	6	7	8	9
A	1	0	0	0	1	1	0	1	1
B	0	1	0	0	1	1	1	1	1
C	0	0	1	0	0	0	1	1	1
D	0	0	0	1	0	0	0	0	0
E	0	0	0	0	0	0	0	0	0
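
The two example tables are related: counting the characters at which two taxa differ reproduces the distance matrix shown under "Distances". The sketch below checks this; the function name `hamming` and the string encoding of the rows are illustrative choices.

```python
from itertools import combinations

# Rows of the character array, one string of 0/1 states per taxon.
characters = {
    "A": "100011011",
    "B": "010011111",
    "C": "001000111",
    "D": "000100000",
    "E": "000000000",
}


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


distances = {
    (p, q): hamming(characters[p], characters[q])
    for p, q in combinations(characters, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

Going from characters to distances discards information: many different character arrays give the same distance matrix.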

Phylogenetic reconstruction

There are a huge number of possible trees even for small data sets.

We have three options:

Construct a tree according to some clustering algorithm
Assign a goodness of fit criterion (an objective function) and find the tree(s) which optimise(s) this criterion
Admit it is a statistical estimation problem and find the most probable phylogenies under a statistical model.

Phylogenetic reconstruction

Clustering algorithms

Common clustering algorithms are Neighbour-joining and UPGMA (Unweighted Pair-Group Method using Arithmetic averages). Clustering algorithms are usually very fast and simple, but there is no explicit optimality criterion, so

the method provides no measure of how good the tree is!
we do not get any idea about other potential trees - were there any better trees?

Clustering algorithms

The UPGMA and neighbour-joining (NJ) methods are both greedy heuristics which join, at each step, the two closest* sub-trees that are not already joined.
They are based on the minimum evolution principle.
An important concept in both of these methods is a pair of neighbours, which is defined as two nodes that are connected via a single node:

* NJ uses rate-corrected distances

UPGMA example

Distances

	A	B	C	D
A	0
B	8	0
C	7	9	0
D	12	14	11	0

$$d_{B(AC)} = (d_{BA} + d_{BC})/2 = (8 + 9)/2 = 8.5$$
$$d_{D(AC)} = (d_{DA} + d_{DC})/2 = (12 + 11)/2 = 11.5$$

UPGMA example

Distances

	AC	B	D
AC	0
B	8.5	0
D	11.5	14	0

$$d_{(ABC)D} =(d_{AD} +d_{BD} +d_{CD})/3=(12+14+11)/3 \approx 12.33$$

UPGMA example

Distances

	ABC	D
ABC	0
D	12.33	0

UPGMA produces an ultrametric rooted tree and assumes that evolution is clock-like, i.e. it assumes that the rate of substitution is the same on all branches of the tree.

A tree is ultrametric when all leaves have the same age/height.
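
The worked example can be reproduced with a short sketch of UPGMA. It is a simple version, not an optimised one: clusters are tuples of leaf names, and the distance between two clusters is recomputed each time as the average over all pairs of leaves, one from each cluster. All names in the code are illustrative.

```python
from itertools import combinations

leaf_dist = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("AD"): 12,
    frozenset("BC"): 9, frozenset("BD"): 14, frozenset("CD"): 11,
}


def cluster_distance(x, y, leaf_dist):
    """Average distance over all leaf pairs (one leaf from each cluster)."""
    total = sum(leaf_dist[frozenset((i, j))] for i in x for j in y)
    return total / (len(x) * len(y))


def upgma(names, leaf_dist):
    """Return the merges as (cluster, cluster, distance) in joining order."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Greedy step: join the two closest clusters.
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair, leaf_dist))
        merges.append((x, y, cluster_distance(x, y, leaf_dist)))
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
    return merges


merges = upgma("ABCD", leaf_dist)
for x, y, d in merges:
    # The new internal node is placed at height d / 2 above the leaves.
    print(x, y, d, d / 2)
```

The merges come out in the order A with C (distance 7), then B (8.5), then D (about 12.33), matching the three matrices above. Placing each new node at half the cluster distance is what makes the resulting tree ultrametric.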

UPGMA weaknesses

Distances

	A	B	C	D
A	0
B	8	0
C	7	9	0
D	12	14	11	0

There is a (non clock-like) tree with a different topology that fits the distance matrix perfectly!
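
One such tree can be found by hand. Of the three pairings, $d_{AB} + d_{CD} = 19$ is the smallest while $d_{AC} + d_{BD} = d_{AD} + d_{BC} = 21$, which points to the topology ((A,B),(C,D)) with an internal branch of length $(21 - 19)/2 = 1$. Solving for the leaf branches gives the lengths used in the sketch below; these values were worked out from the matrix for this illustration and do not appear on the slide.

```python
from itertools import combinations

dist = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("AD"): 12,
    frozenset("BC"): 9, frozenset("BD"): 14, frozenset("CD"): 11,
}

# Candidate tree ((A,B),(C,D)): leaf branch lengths and one internal branch.
leaf_branch = {"A": 3, "B": 5, "C": 3, "D": 8}
internal_branch = 1
side = {"A": 0, "B": 0, "C": 1, "D": 1}  # which end of the internal branch


def path_length(x, y):
    """Sum of branch lengths on the path between leaves x and y."""
    crosses = side[x] != side[y]
    return leaf_branch[x] + leaf_branch[y] + (internal_branch if crosses else 0)


fits = all(path_length(x, y) == dist[frozenset((x, y))]
           for x, y in combinations("ABCD", 2))
print(fits)
```

The leaf branches have very different lengths (3, 5, 3 and 8), so this tree is far from clock-like, which is why UPGMA fails to recover it.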

Neighbour-joining algorithm
(Saitou and Nei, 1987)

Most widely-used distance-based method for phylogenetic reconstruction
UPGMA illustrated that it is not enough to pick the closest neighbours (at least when there is rate heterogeneity across branches)
Idea: take into account averaged distances to other leaves as well
Produces an unrooted tree

The basic idea

We start by computing the “average distance” from $i$ to every other taxon: $$r_i = \frac{1}{n-2} \sum_j d_{ij}$$
We then compute new corrected distances for all pairs $(i,j)$: $$d_{ij}^{*} = d_{ij} - r_i - r_j$$
We are effectively pushing each node $i$ closer to all other nodes by $r_i$.

The basic idea

$$d_{ij}^{*} = d_{ij} - r_i - r_j$$

The effect is to correct for long branches.

$$d= \begin{bmatrix} 0 & 8 &7 & 12 \\ 8 & 0 & 9 & 14 \\ 7 & 9 & 0 & 11 \\ 12 & 14 & 11 & 0 \end{bmatrix}$$
$$r = [13.5, 15.5, 13.5, 18.5]$$
$$d^{*}= \begin{bmatrix} 0 & -21 & -20 & -20 \\ -21 & 0 & -20 & -20 \\ -20 & -20 & 0 & -21 \\ -20 & -20 & -21 & 0 \end{bmatrix}$$

This is the result of the Neighbour-joining algorithm on the same distance matrix as used in the UPGMA example. A is grouped with B, and C with D, rather than A with C.
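
The vector $r$ and the corrected matrix $d^{*}$ above can be recomputed in a few lines. This is a minimal sketch in plain Python; the variable names are illustrative, and the diagonal of $d^{*}$ is set to 0 to match the matrix as displayed.

```python
names = ["A", "B", "C", "D"]
d = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
n = len(d)

# r_i: row sum divided by n - 2 (the diagonal entry contributes 0).
r = [sum(row) / (n - 2) for row in d]

# Corrected distances d*_ij = d_ij - r_i - r_j for i != j.
d_star = [
    [0 if i == j else d[i][j] - r[i] - r[j] for j in range(n)]
    for i in range(n)
]

print(r)
for name, row in zip(names, d_star):
    print(name, row)
```

The smallest corrected distances are for the pairs A, B and C, D, even though A and C have the smallest uncorrected distance.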

Neighbour joining

We use an algorithm very similar to UPGMA to select the two closest nodes, $i$ and $j$, based on the corrected distances ($d^*$).
We join these into a cluster and make a new node $k$ to correspond to their ancestor
The distance from $i$ to the new node $k$ is computed by $$d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$$
Nodes $i$ and $j$ are removed from the pool and replaced by $k$. The uncorrected distances are updated based on the $d_{ik}$ calculation and $d^*$ is then recalculated.
See Higgs and Attwood (2005; pp166-169) for details.
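
The steps above can be sketched as a complete loop. Two details are assumptions taken from the standard description of the algorithm rather than from the slide: the distance from the new node $k$ to each remaining node $m$ is updated as $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$, and the loop stops when two nodes remain, which are then joined by a single branch. Ties are broken by taking the first minimal pair. Node labels such as `node0` are hypothetical.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Return the unrooted tree as a list of (node, node, branch length).

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
             for i in nodes}
        # Pick the pair with the smallest corrected distance d*.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ij = d[frozenset((i, j))]
        k = f"node{next_id}"
        next_id += 1
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Assumed update rule for distances from the new node k.
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


dist = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("AD"): 12,
    frozenset("BC"): 9, frozenset("BD"): 14, frozenset("CD"): 11,
}
edges = neighbour_joining("ABCD", dist)
for edge in edges:
    print(edge)
```

On the example matrix this joins A with B and C with D, and the branch lengths it returns reproduce every entry of the distance matrix exactly.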

Time complexity of the clustering algorithms

Both of these clustering-based algorithms take $O(n^3)$ time once we have the distance matrix.
There are n steps and in each step we do:
1. find the smallest distance
2. join these two taxa
3. compute the distance from the new ancestor to all others
Step 1 takes $O(n^2)$ time and the other two steps take $O(n)$ time each.

Note: There is an alternative (and much harder to follow) formulation of the UPGMA algorithm that takes $O(n^2)$ time.

Least squares distance-based phylogenetics

$d_{ij}$	Human	Chimp	Gorilla	Orangutan
Human	0
Chimp	0.0965	0
Gorilla	0.1140	0.1180	0
Orangutan	0.1849	0.2009	0.1947	0

$$\begin{aligned} S = &(d_{12} - \hat{d}_{12})^2 + (d_{13} - \hat{d}_{13})^2 + \\ &(d_{14} - \hat{d}_{14})^2 + (d_{23} - \hat{d}_{23})^2 + \\ &(d_{24} - \hat{d}_{24})^2 + (d_{34} - \hat{d}_{34})^2 \end{aligned}$$

$$\hat{d}_{12} = t_1 + t_2, \quad \hat{d}_{13} = t_1 + t_0 + t_3, \dots$$

There are six data points (distances) and five free parameters (branch lengths) to numerically optimize so as to minimise S.

Least squares distance-based phylogenetics

Tree	$t_0$	$t_1$ (H)	$t_2$ (C)	$t_3$ (G)	$t_4$ (O)	S
((H,C),G,O)	0.008840	0.043266	0.053280	0.058908	0.135795	0.000035
((H,G),C,O)	0.0	0.046212	0.056227	0.061854	0.138742	0.000140
((H,O),C,G)	0.0	0.046212	0.056227	0.061854	0.138742	0.000140
(H,G,C,O)	0.0	0.046212	0.056227	0.061854	0.138742	0.000140
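
The objective $S$ can be evaluated for any set of branch lengths. The sketch below does this for the first tree and for the star tree, using the distances and branch lengths as printed (the function and variable names are illustrative). Because the printed values are rounded, the recomputed $S$ should be expected to agree with the tabulated column only approximately.

```python
observed = {
    ("H", "C"): 0.0965, ("H", "G"): 0.1140, ("H", "O"): 0.1849,
    ("C", "G"): 0.1180, ("C", "O"): 0.2009, ("G", "O"): 0.1947,
}


def sum_of_squares(t0, leaf, grouped=("H", "C")):
    """S for the tree with `grouped` on one side of the internal branch t0."""
    total = 0.0
    for (x, y), d in observed.items():
        # The path crosses the internal branch when exactly one leaf is grouped.
        crosses = (x in grouped) != (y in grouped)
        fitted = leaf[x] + leaf[y] + (t0 if crosses else 0.0)
        total += (d - fitted) ** 2
    return total


# Tree ((H,C),G,O) with the branch lengths from the first row of the table.
s_hc = sum_of_squares(
    0.008840, {"H": 0.043266, "C": 0.053280, "G": 0.058908, "O": 0.135795})

# Star tree (t0 = 0). t1 = 0.046212 is the value that reproduces S = 0.000140.
s_star = sum_of_squares(
    0.0, {"H": 0.046212, "C": 0.056227, "G": 0.061854, "O": 0.138742})

print(f"{s_hc:.6f} {s_star:.6f}")
```

The tree that groups human with chimp has a clearly smaller $S$ than the star tree. The other two resolved trees do no better than the star tree because their best-fitting internal branch length is 0.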

Reading: Trees and Distance Methods

Bioinformatics and Molecular Evolution, Higgs and Attwood (2005), sections 8.1 and 8.3
Computational Molecular Evolution, Ziheng Yang (2006), sections 3.1-3.3
Biological Sequence Analysis, Durbin et al (1998), sections 7.1-7.3
An Introduction to Bioinformatics Algorithms, Jones and Pevzner (2004), Chapter 10
