Lecture 5

Introduction to Phylogenetics

Week 5
90 minutes
Required reading: Yang Ch. 3, Higgs & Attwood Ch. 8

1. Molecules as Documents of Evolutionary History

Molecular sequences contain a wealth of information about evolutionary processes and history. By comparing sequences from different organisms, we can infer their evolutionary relationships.

Example: HIV-1 Sequence Variation

HIV-1 (UK)  ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT
HIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT
        

The differences between these sequences (four substitutions and a three-base gap) tell us about their evolutionary divergence.
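A minimal sketch of how the differences in the alignment above can be located programmatically. The function name `compare_aligned` is hypothetical, and the final ratio (differing sites divided by ungapped columns, often called the p-distance) is a common summary statistic rather than something this lecture defines.

```python
uk = "ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT"
usa = "ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT"

def compare_aligned(a, b):
    """Return 1-based positions of substitutions and of gapped columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    substitutions, gaps = [], []
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if "-" in (x, y):
            gaps.append(pos)           # column contains a gap character
        elif x != y:
            substitutions.append(pos)  # two different nucleotides
    return substitutions, gaps

subs, gaps = compare_aligned(uk, usa)
# Proportion of differing sites among ungapped columns (assumed summary).
p_distance = len(subs) / (len(uk) - len(gaps))
print(subs, gaps, round(p_distance, 3))
```

Gapped columns are excluded from the ratio here; other conventions for handling gaps exist.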

Key points about molecular evolution:

What is Phylogenetics?

Phylogenetics: The study of evolutionary relationships among organisms or genes, typically represented as trees showing patterns of descent from common ancestors.

Core concepts in phylogenetics:

A Typical Phylogenetic Analysis

  1. Collect homologous sequences - Identify sequences that share common ancestry
  2. Construct multiple sequence alignment - Align sequences to identify homologous positions
  3. Phylogeny reconstruction - Infer the tree using various methods
  4. Test reliability - Assess confidence in the estimated phylogeny
  5. Interpretation and application - Use the phylogeny to answer biological questions

Applications of Phylogenetics

Phylogenetic methods have diverse applications across biology:

2. Types and Anatomy of Phylogenetic Trees

Tree Representations

Figure: Types of phylogenetic tree representations. Different ways to represent the same phylogenetic relationships.

Phylogenetic trees can be drawn in various formats:

Bifurcating vs. Multifurcating Trees

Figure: Bifurcating and multifurcating trees. Comparison of fully resolved (bifurcating) and partially resolved (multifurcating) trees.

Polytomy
  • In a rooted tree: A node with more than 2 children
  • In an unrooted tree: A node of degree 4 or greater

Polytomies can represent either uncertainty in relationships or rapid diversification events.

Rooted vs. Unrooted Trees

Figure: Rooted versus unrooted trees. The same evolutionary relationships shown as rooted and unrooted trees.

Key differences:

Rooting Trees Using an Outgroup

Figure: Rooting trees with an outgroup. Using an outgroup (distantly related species) to determine the root position.

Tip: The outgroup should be clearly outside the group of interest but still closely related enough to align sequences reliably.

Tree Anatomy

Figure: Anatomy of a phylogenetic tree. Key components of a phylogenetic tree.

Essential tree terminology:

3. The Universe of Possible Trees

How Many Trees Are There?

The number of possible tree topologies grows explosively with the number of taxa. For $n$ taxa, the number of rooted, binary trees is:

$$T_n^{(R)} = (2n-3)(2n-5)\cdots(3)(1) = \frac{(2n-3)!}{2^{n-2}(n-2)!}$$

Tree Numbers in Perspective

n (taxa)   # Rooted trees   Context
4          15               Enumerable by hand
5          105              Enumerable by hand on a rainy day
10         34,459,425       ≈ Upper limit for exhaustive search
20         8.2 × 10²¹       ≈ Upper limit for branch-and-bound
48         3.2 × 10⁷⁰       ≈ Number of particles in the universe
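The counts in the table can be reproduced directly from the product form $(2n-3)(2n-5)\cdots(3)(1)$. The sketch below is illustrative only; the function name `rooted_trees` is hypothetical.

```python
def rooted_trees(n):
    """Number of rooted, binary trees on n labelled taxa: (2n-3)(2n-5)...(3)(1)."""
    count = 1
    for factor in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count

for n in (4, 5, 10, 20, 48):
    print(n, f"{rooted_trees(n):.3g}")
```

Python integers have arbitrary precision, so the product stays exact even for large $n$; the `.3g` format is only for display.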

Counting Unrooted Trees

Figure: Stepwise addition algorithm. Building trees by stepwise addition shows how tree numbers grow.

The number of unrooted trees can be calculated using stepwise addition:
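A minimal sketch of the stepwise-addition argument, under the usual assumptions: three taxa admit exactly one unrooted tree with three branches, a new taxon can be attached to any existing branch, and each attachment adds two branches. The function name `unrooted_trees` is hypothetical.

```python
def unrooted_trees(n):
    """Count unrooted, binary trees on n >= 3 taxa by stepwise addition."""
    trees, branches = 1, 3      # the single unrooted tree on three taxa
    for _ in range(4, n + 1):   # add taxa number 4, 5, ..., n
        trees *= branches       # one new tree per branch the taxon can join
        branches += 2           # attaching splits a branch and adds a tip branch
    return trees

print([unrooted_trees(n) for n in range(3, 8)])
```

The product that results is $1 \cdot 3 \cdot 5 \cdots (2n-5)$, so the number of unrooted trees on $n+1$ taxa equals the number of rooted trees on $n$ taxa given by the formula above.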

Computational Challenge: The vast number of possible trees makes exhaustive searching impossible for even moderate numbers of taxa. This is why we need heuristic search strategies.

Measuring Tree Differences

Robinson-Foulds Distance

The Robinson-Foulds distance (also called the partition distance) between two trees is the total number of bipartitions (splits) that are in one tree but not the other.

Figure: Robinson-Foulds distance example. Each internal branch defines a bipartition of taxa. The RF distance counts non-shared bipartitions.
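The definition can be sketched in a few lines of code. This is an illustrative implementation, not taken from any particular package: trees are written as nested tuples of taxon names (an assumed representation), each split is stored as the side that does not contain a fixed reference taxon, and the example trees are made up.

```python
def leaf_set(tree):
    """All taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side lacking the reference taxon."""
    taxa = leaf_set(tree)
    reference = min(taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if reference in below else below
        if 1 < len(side) < len(taxa) - 1:  # skip tip branches and the root
            found.add(side)
        return below

    visit(tree)
    return found

def rf_distance(t1, t2):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))
```

Here both trees share the split {D, E} versus the rest, and each has one split the other lacks, so the distance is 2. Because splits ignore the root position, the function treats its inputs as unrooted trees.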

Properties of the Robinson-Foulds distance:

4. Overview of Phylogenetic Reconstruction

Types of Data

Phylogenetic reconstruction uses two main types of data:

Distance Data

Pairwise dissimilarities stored in a distance matrix

    A  B  C  D  E
A   0  3  5  6  5
B   3  0  4  7  6
C   5  4  0  5  4
D   6  7  5  0  1
E   5  6  4  1  0

Character Data

Discrete states for each taxon at each position

    1  2  3  4  5  6  7  8  9
A   1  0  0  0  1  1  0  1  1
B   0  1  0  0  1  1  1  1  1
C   0  0  1  0  0  0  1  1  1
D   0  0  0  1  0  0  0  0  0
E   0  0  0  0  0  0  0  0  0
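The two data types are connected: character data can be converted to distance data, though not the reverse. The lecture does not say how its example distance matrix was obtained, but counting the positions at which two rows of the character matrix differ (the Hamming distance) reproduces it exactly, as this short check shows.

```python
characters = {
    "A": "100011011",
    "B": "010011111",
    "C": "001000111",
    "D": "000100000",
    "E": "000000000",
}

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

taxa = sorted(characters)
matrix = [[hamming(characters[s], characters[t]) for t in taxa] for s in taxa]
for name, row in zip(taxa, matrix):
    print(name, row)
```

Raw difference counts are only one possible distance. For real sequences, corrected distances under a substitution model are typically used instead.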

Reconstruction Approaches

Figure: Phylogenetic reconstruction methods. Overview of major phylogenetic reconstruction approaches.

Given the enormous number of possible trees, we have three main strategies:

  1. Clustering algorithms: Build tree using a specific algorithm (fast but no optimality criterion)
  2. Optimality criteria: Define a score and find the tree(s) that optimize it
  3. Statistical inference: Find the most probable trees under an evolutionary model

5. Clustering Methods

Overview of Clustering Approaches

Clustering algorithms like UPGMA and Neighbor-Joining are:

Limitations:
  • No explicit optimality criterion
  • No measure of how good the tree is
  • No information about alternative trees

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

UPGMA is one of the simplest tree-building methods:

UPGMA Algorithm

  1. Start with each taxon as a separate cluster
  2. Find the pair of clusters with smallest distance
  3. Join them into a new cluster
  4. Calculate distances from new cluster to all others using average distances
  5. Repeat until all taxa are joined

UPGMA Example

Step-by-step UPGMA

Initial distances:

    A   B   C   D
A   0   8   7   12
B   8   0   9   14
C   7   9   0   11
D   12  14  11  0

Step 1: Join A and C (smallest distance = 7)

New distances:

  • $d_{B(AC)} = (d_{BA} + d_{BC})/2 = (8 + 9)/2 = 8.5$
  • $d_{D(AC)} = (d_{DA} + d_{DC})/2 = (12 + 11)/2 = 11.5$

Figure: UPGMA final tree. Final UPGMA tree showing ultrametric property (all leaves equidistant from root).

Key assumption: UPGMA assumes a molecular clock (constant evolutionary rate), producing an ultrametric tree where all leaves are equidistant from the root.
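The whole example can be run end to end with a short sketch of the algorithm. This is an illustrative implementation, not a reference one: the function name `upgma` and the dictionary-of-pairs representation are assumptions, and the "average distance" in step 4 is taken to be the average over all leaf pairs (weighted by cluster size), which for two single taxa reduces to the simple mean used in Step 1.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the merges as (cluster label, node height) in the order they occur."""
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)                       # frozenset({x, y}) -> distance
    merges = []
    while len(sizes) > 1:
        # Step 2: find the closest pair of clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2  # ultrametric: node sits at half the distance
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # Step 4: size-weighted average distance to every remaining cluster.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
        merges.append((merged, height))
    return merges

names = "ABCD"
rows = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i, j in combinations(range(4), 2)}
merges = upgma(names, dist)
for label, height in merges:
    print(label, round(height, 3))
```

On the example matrix this joins A and C first (height 3.5), then adds B at distance 8.5 (height 4.25), and finally D at average distance $(12 + 14 + 11)/3 \approx 12.33$.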

UPGMA Weaknesses

UPGMA can fail when evolutionary rates vary:

Figure: UPGMA failure case. A non-clock-like tree with different topology that fits the distance matrix perfectly.

Neighbor-Joining (NJ) Algorithm

Saitou and Nei (1987) developed NJ to address UPGMA's limitations:

Key Innovation

NJ corrects for unequal evolutionary rates by considering the average distance from each taxon to all others.

The NJ Algorithm

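  1. Compute "average distance" for each taxon:
    $$r_i = \frac{1}{n-2} \sum_j d_{ij}$$
  2. Calculate corrected distances:
    $$d_{ij}^* = d_{ij} - r_i - r_j$$
  3. Join the pair $(i, j)$ with smallest $d_{ij}^*$ at a new node $k$
  4. Calculate branch lengths to the new node:
    $$d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j), \qquad d_{jk} = d_{ij} - d_{ik}$$
  5. Calculate distances from the new node $k$ to every remaining taxon $m$:
    $$d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$$
  6. Repeat until only two nodes remain, then connect them
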
Figure: Neighbor-joining result. NJ correctly groups AB and CD, unlike UPGMA which incorrectly grouped AC.
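A compact sketch of NJ applied to the four-taxon matrix from the UPGMA example. It uses the formulas for $r_i$, $d_{ij}^*$ and the branch lengths as given above, together with the standard NJ update $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for the distance from a new node $k$ to each remaining taxon $m$. The function name `neighbor_joining` and the data layout are assumptions, and ties in $d_{ij}^*$ are broken by taking the first pair encountered.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the joins (i, j, branch to i, branch to j) and the final connecting edge."""
    nodes = list(names)
    d = dict(dist)  # frozenset({x, y}) -> distance
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: "average distance" r_i, with the 1/(n-2) scaling.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
             for i in nodes}
        # Steps 2-3: choose the pair with the smallest corrected distance.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ij = d[frozenset((i, j))]
        # Step 4: branch lengths from i and j to the new node k.
        len_i = 0.5 * (d_ij + r[i] - r[j])
        len_j = d_ij - len_i
        k = f"({i},{j})"
        nodes = [m for m in nodes if m not in (i, j)]
        for m in nodes:  # distance from the new node to each remaining node
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        nodes.append(k)
        joins.append((i, j, len_i, len_j))
    final_edge = (nodes[0], nodes[1], d[frozenset(nodes)])
    return joins, final_edge

names = "ABCD"
rows = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i, j in combinations(range(4), 2)}
joins, final_edge = neighbor_joining(names, dist)
print(joins, final_edge)
```

For this matrix the sketch joins A with B (branches 3 and 5) and C with D (branches 3 and 8), connected by an internal branch of length 1. Summing branch lengths along each path recovers all six input distances, for example $d_{AD} = 3 + 1 + 8 = 12$.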

Time Complexity

Both UPGMA and NJ have $O(n^3)$ time complexity:

Note: An optimized UPGMA implementation can achieve $O(n^2)$ complexity, though it's more complex to implement.

6. Least Squares Distance Methods

Unlike clustering methods, least squares approaches have an explicit optimality criterion: minimize the difference between observed and tree-implied distances.

The Least Squares Criterion

Primate Example

$d_{ij}$     Human   Chimp    Gorilla  Orangutan
Human       0       0.0965   0.1140   0.1849
Chimp               0        0.1180   0.2009
Gorilla                      0        0.1947
Orangutan                             0

For any tree topology with branch lengths, we can calculate the sum of squared differences:

$$S = \sum_{i<j} (d_{ij} - \hat{d}_{ij})^2$$

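Where:
  • $d_{ij}$ is the observed distance between taxa $i$ and $j$
  • $\hat{d}_{ij}$ is the tree-implied distance between taxa $i$ and $j$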

Figure: Tree with branch lengths. Tree-implied distances are sums of branch lengths along paths.

Optimization Process

For the primate example:

Results for Different Topologies

Tree Topology                        Least Squares Score (S)
((Human,Chimp),Gorilla,Orangutan)    0.000035
((Human,Gorilla),Chimp,Orangutan)    0.000140
((Human,Orangutan),Chimp,Gorilla)    0.000140

Figure: Least squares trees comparison. Different tree topologies with optimized branch lengths and their fit to the data.
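For four taxa the least squares fit can be worked out in closed form, which makes the scores easy to check. The sketch below is a quartet-only shortcut, not a general least squares solver, and rests on stated assumptions: the function name `quartet_ls` is hypothetical, the internal branch is constrained to be non-negative (collapsing to a star tree if the unconstrained estimate is negative), and terminal branches are assumed to stay non-negative. For a quartet ((a,b),(c,e)), the only way the data can misfit is through the difference between the two "cross" sums of distances.

```python
d = {
    frozenset(("Human", "Chimp")): 0.0965,
    frozenset(("Human", "Gorilla")): 0.1140,
    frozenset(("Human", "Orangutan")): 0.1849,
    frozenset(("Chimp", "Gorilla")): 0.1180,
    frozenset(("Chimp", "Orangutan")): 0.2009,
    frozenset(("Gorilla", "Orangutan")): 0.1947,
}

def dist(x, y):
    return d[frozenset((x, y))]

def quartet_ls(a, b, c, e):
    """Internal branch length and score S for the unrooted quartet ((a,b),(c,e))."""
    within = dist(a, b) + dist(c, e)   # pairs on the same side of the internal branch
    cross1 = dist(a, c) + dist(b, e)   # pairs separated by the internal branch
    cross2 = dist(a, e) + dist(b, c)
    internal = (cross1 + cross2) / 4 - within / 2
    if internal >= 0:
        return internal, (cross1 - cross2) ** 2 / 4
    # Negative estimate: fix the internal branch at zero (star tree).
    star = ((within - cross1) ** 2 + (within - cross2) ** 2
            + (cross1 - cross2) ** 2) / 6
    return 0.0, star

topologies = [
    ("Human", "Chimp", "Gorilla", "Orangutan"),
    ("Human", "Gorilla", "Chimp", "Orangutan"),
    ("Human", "Orangutan", "Chimp", "Gorilla"),
]
scores = [quartet_ls(*t) for t in topologies]
for t, (internal, s) in zip(topologies, scores):
    print(f"(({t[0]},{t[1]}),{t[2]},{t[3]})  internal={internal:.5f}  S={s:.6f}")
```

With the four-decimal distances in the table, this gives roughly 0.000036 for the first topology and 0.000140 for the other two. The first value differs slightly from the tabulated 0.000035, presumably because the tabulated scores were computed from less heavily rounded distances. The second topology ends up as a star tree under the non-negativity constraint, and the third has an internal branch very close to zero.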

Advantages and Limitations

Advantages:

Limitations:

Summary

This lecture introduced the fundamentals of phylogenetic analysis:

  1. Evolutionary information: Molecular sequences contain historical information that can be decoded through phylogenetic analysis
  2. Tree basics: Understanding tree anatomy, rooted vs. unrooted trees, and different representations
  3. Tree space: The number of possible trees grows explosively, making exhaustive searches impossible
  4. Distance methods:
    • Clustering (UPGMA, NJ): Fast but no optimality measure
    • Least squares: Explicit criterion but computationally intensive
  5. Key challenges: Accounting for rate variation, searching vast tree space, and quantifying uncertainty

Recommended Reading:
  • Higgs & Attwood (2005) "Bioinformatics and Molecular Evolution" - Sections 8.1, 8.3
  • Yang (2006) "Computational Molecular Evolution" - Sections 3.1-3.3
  • Durbin et al. (1998) "Biological Sequence Analysis" - Sections 7.1-7.3
  • Saitou & Nei (1987) "The neighbor-joining method" - Mol Biol Evol 4:406-425

Check Your Understanding

  1. Why do unrooted trees contain less information than rooted trees?
  2. How does the number of possible trees change as you add more taxa?
  3. What assumption does UPGMA make that NJ does not?
  4. How do clustering methods differ from optimality-based methods?
  5. What information do we need to root an unrooted tree?

Phylogenetic Software

  • MEGA: User-friendly GUI for distance and parsimony methods
  • RAxML: Fast maximum likelihood phylogenetic inference
  • IQ-TREE: Efficient phylogenetic software with model selection
  • BEAST: Bayesian phylogenetic inference with molecular dating
  • FigTree: Graphical viewer for phylogenetic trees