Introduction to Phylogenetics - Bayesian phylogenetics lectures

1. Molecules as Documents of Evolutionary History

Molecular sequences contain a wealth of information about evolutionary processes and history. By comparing sequences from different organisms, we can infer their evolutionary relationships.

Example: HIV-1 Sequence Variation

HIV-1 (UK)  ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT
HIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT

The differences (highlighted) between these sequences tell us about their evolutionary divergence.

Key points about molecular evolution:

Macromolecules contain information about the processes and history that formed them
This information is incomplete, so the full history must be inferred
Computational biology aims to decipher the information held in molecular sequences about evolutionary processes and history

What is Phylogenetics?

Phylogenetics

The study of evolutionary relationships among organisms or genes, typically represented as trees showing patterns of descent from common ancestors.

Core concepts in phylogenetics:

Homology: Similarity due to inheritance from a common ancestor
Phylogenetic trees: Branching structure representing evolutionary relationships based on common ancestry
Monophyletic groups (clades): Groups containing species more closely related to each other than to any species outside the group
Statistical inference: Modern phylogenetics uses probabilistic models of evolution

A Typical Phylogenetic Analysis

Collect homologous sequences - Identify sequences that share common ancestry
Construct multiple sequence alignment - Align sequences to identify homologous positions
Phylogeny reconstruction - Infer the tree using various methods
Test reliability - Assess confidence in the estimated phylogeny
Interpretation and application - Use the phylogeny to answer biological questions

Applications of Phylogenetics

Phylogenetic methods have diverse applications across biology:

Inferring relationships among species and genes
Estimating divergence times
Identifying functional elements through comparative genomics
Detecting molecular adaptation
Forensics and paternity testing
Studying the emergence and spread of viral pandemics
Conservation biology and biodiversity assessment

2. Types and Anatomy of Phylogenetic Trees

Tree Representations

Different ways to represent the same phylogenetic relationships

Phylogenetic trees can be drawn in various formats:

Rectangular (dendogram): Branch lengths shown horizontally
Circular: Taxa arranged in a circle
Radial: Branches radiate from center

Bifurcating vs. Multifurcating Trees

Comparison of fully resolved (bifurcating) and partially resolved (multifurcating) trees

Polytomy

In a rooted tree: A node with more than 2 children
In an unrooted tree: A node of degree 4 or greater

Polytomies can represent either uncertainty in relationships or rapid diversification events.

Rooted vs. Unrooted Trees

The same evolutionary relationships shown as rooted and unrooted trees

Key differences:

Rooted trees show the direction of evolution and identify the common ancestor
Unrooted trees show relationships but not evolutionary direction
Many methods (parsimony, distance, likelihood without molecular clock) produce unrooted trees

Rooting Trees Using an Outgroup

Using an outgroup (distantly related species) to determine the root position

Tip: The outgroup should be clearly outside the group of interest but still closely related enough to align sequences reliably.

Tree Anatomy

Key components of a phylogenetic tree

Essential tree terminology:

Nodes: Points where branches meet (internal nodes = ancestors, tips/leaves = observed taxa)
Branches/Edges: Lines connecting nodes (represent evolutionary lineages)
Branch lengths: Can represent time or amount of evolutionary change
Root: The common ancestor of all taxa in the tree
Tips/Leaves: The observed taxa (species, genes, etc.)

3. The Universe of Possible Trees

How Many Trees Are There?

The number of possible tree topologies grows explosively with the number of taxa. For $n$ taxa, the number of:

Rooted, binary trees: $$T_n^{(R)} = (2n-3)(2n-5)\cdots(3)(1) = \frac{(2n-3)!}{2^{n-2}(n-2)!}$$

Tree Numbers in Perspective

n (taxa)	# Rooted Trees	Context
4	15	Enumerable by hand
5	105	Enumerable by hand on a rainy day
10	34,459,425	≈ Upper limit for exhaustive search
20	8.2 × 10²¹	≈ Upper limit for branch-and-bound
48	3.2 × 10⁷⁰	≈ Number of particles in the universe

Counting Unrooted Trees

Building trees by stepwise addition shows how tree numbers grow

The number of unrooted trees can be calculated using stepwise addition:

Start with 3 taxa: only 1 possible unrooted tree
Each new taxon can be added to any branch
Formula: $T_n = 1 × 3 × 5 × \cdots × (2n-5)$ for $n$ taxa

Computational Challenge: The vast number of possible trees makes exhaustive searching impossible for even moderate numbers of taxa. This is why we need heuristic search strategies.

Measuring Tree Differences

Robinson-Foulds Distance

The partition distance between two trees is the total number of bipartitions (splits) that are in one tree but not the other.

Each internal branch defines a bipartition of taxa. The RF distance counts non-shared bipartitions.

Properties of the Robinson-Foulds distance:

Ranges from 0 (identical trees) to $2(n-3)$ for $n$ taxa
Simple to compute but can be sensitive to small changes
Treats all bipartitions equally (no weighting by branch length)

4. Overview of Phylogenetic Reconstruction

Types of Data

Phylogenetic reconstruction uses two main types of data:

Distance Data

Pairwise dissimilarities stored in a distance matrix

	A	B	C	D	E
A	0	3	5	6	5
B	3	0	4	7	6
C	5	4	0	5	4
D	6	7	5	0	1
E	5	6	4	1	0

Character Data

Discrete states for each taxon at each position

	1	2	3	4	5	6	7	8	9
A	1	0	0	0	1	1	0	1	1
B	0	1	0	0	1	1	1	1	1
C	0	0	1	0	0	0	1	1	1
D	0	0	0	1	0	0	0	0	0
E	0	0	0	0	0	0	0	0	0

Reconstruction Approaches

Overview of major phylogenetic reconstruction approaches

Given the enormous number of possible trees, we have three main strategies:

Clustering algorithms: Build tree using a specific algorithm (fast but no optimality criterion)
Optimality criteria: Define a score and find the tree(s) that optimize it
Statistical inference: Find the most probable trees under an evolutionary model

5. Clustering Methods

Overview of Clustering Approaches

Clustering algorithms like UPGMA and Neighbor-Joining are:

Usually very fast (polynomial time)
Simple to implement and understand
Based on greedy heuristics (make locally optimal choices)

Limitations:

No explicit optimality criterion
No measure of how good the tree is
No information about alternative trees

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

UPGMA is one of the simplest tree-building methods:

UPGMA Algorithm

Start with each taxon as a separate cluster
Find the pair of clusters with smallest distance
Join them into a new cluster
Calculate distances from new cluster to all others using average distances
Repeat until all taxa are joined

UPGMA Example

Step-by-step UPGMA

Initial distances:

	A	B	C	D
A	0	8	7	12
B	8	0	9	14
C	7	9	0	11
D	12	14	11	0

Step 1: Join A and C (smallest distance = 7)

New distances:

$d_{B(AC)} = (d_{BA} + d_{BC})/2 = (8 + 9)/2 = 8.5$
$d_{D(AC)} = (d_{DA} + d_{DC})/2 = (12 + 11)/2 = 11.5$

Final UPGMA tree showing ultrametric property (all leaves equidistant from root)

Key assumption: UPGMA assumes a molecular clock (constant evolutionary rate), producing an ultrametric tree where all leaves are equidistant from the root.

UPGMA Weaknesses

UPGMA can fail when evolutionary rates vary:

A non-clock-like tree with different topology that fits the distance matrix perfectly

Neighbor-Joining (NJ) Algorithm

Saitou and Nei (1987) developed NJ to address UPGMA's limitations:

Key Innovation

NJ corrects for unequal evolutionary rates by considering the average distance from each taxon to all others.

The NJ Algorithm

Compute "average distance" for each taxon: $r_i = \frac{1}{n-2} \sum_j d_{ij}$
Calculate corrected distances: $d_{ij}^* = d_{ij} - r_i - r_j$
Join the pair with smallest $d_{ij}^*$
Calculate branch lengths to new node: $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$

NJ correctly groups AB and CD, unlike UPGMA which incorrectly grouped AC

Time Complexity

Both UPGMA and NJ have $O(n^3)$ time complexity:

$n$ steps (joining operations)
Each step requires finding minimum distance: $O(n^2)$
Updating distances: $O(n)$

Note: An optimized UPGMA implementation can achieve $O(n^2)$ complexity, though it's more complex to implement.

6. Least Squares Distance Methods

Unlike clustering methods, least squares approaches have an explicit optimality criterion: minimize the difference between observed and tree-implied distances.

The Least Squares Criterion

Primate Example

$d_{ij}$	Human	Chimp	Gorilla	Orangutan
Human	0	0.0965	0.1140	0.1849
Chimp		0	0.1180	0.2009
Gorilla			0	0.1947
Orangutan				0

For any tree topology with branch lengths, we can calculate the sum of squared differences:

S = \sum_{i<j} (d_{ij} - \hat{d}_{ij})^2

Where:

$d_{ij}$ = observed distance between taxa $i$ and $j$
$\hat{d}_{ij}$ = tree-implied distance (sum of branch lengths on path)

Tree-implied distances are sums of branch lengths along paths

Optimization Process

For the primate example:

6 data points (pairwise distances)
5 free parameters (branch lengths)
Use numerical optimization to minimize $S$

Results for Different Topologies

Tree Topology	Least Squares Score (S)
((Human,Chimp),Gorilla,Orangutan)	0.000035 ✓
((Human,Gorilla),Chimp,Orangutan)	0.000140
((Human,Orangutan),Chimp,Gorilla)	0.000140

Different tree topologies with optimized branch lengths and their fit to the data

Advantages and Limitations

Advantages:

Explicit optimality criterion
Can compare different topologies
Provides branch length estimates

Limitations:

Still need to search tree space (NP-hard problem)
Assumes distances are measured without error
No statistical framework for uncertainty

Summary

This lecture introduced the fundamentals of phylogenetic analysis:

Evolutionary information: Molecular sequences contain historical information that can be decoded through phylogenetic analysis
Tree basics: Understanding tree anatomy, rooted vs. unrooted trees, and different representations
Tree space: The number of possible trees grows explosively, making exhaustive searches impossible
Distance methods:
- Clustering (UPGMA, NJ): Fast but no optimality measure
- Least squares: Explicit criterion but computationally intensive
Key challenges: Accounting for rate variation, searching vast tree space, and quantifying uncertainty

Recommended Reading:

Higgs & Attwood (2005) "Bioinformatics and Molecular Evolution" - Sections 8.1, 8.3
Yang (2006) "Computational Molecular Evolution" - Sections 3.1-3.3
Durbin et al. (1998) "Biological Sequence Analysis" - Sections 7.1-7.3
Saitou & Nei (1987) "The neighbor-joining method" - Mol Biol Evol 4:406-425

Check Your Understanding

Why do unrooted trees contain less information than rooted trees?
How does the number of possible trees change as you add more taxa?
What assumption does UPGMA make that NJ does not?
How do clustering methods differ from optimality-based methods?
What information do we need to root an unrooted tree?

Phylogenetic Software

MEGA: User-friendly GUI for distance and parsimony methods
RAxML: Fast maximum likelihood phylogenetic inference
IQ-TREE: Efficient phylogenetic software with model selection
BEAST: Bayesian phylogenetic inference with molecular dating
FigTree: Graphical viewer for phylogenetic trees