Lecture 2

Sequence Homology

Week 2 | 90 minutes | Required reading: Ch. 2-3

1. Introduction to Sequence Homology

Definition of Homology
Homology is the existence of a shared ancestry between a pair of biological traits or sequences.

Understanding homology is fundamental to phylogenetic analysis. When we observe similarity between sequences, we form hypotheses about their evolutionary relationships.

Key Concepts

Types of Homologs

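Different evolutionary events can lead to different types of homologous relationships: speciation gives rise to orthologs, gene duplication to paralogs, and horizontal gene transfer to xenologs.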

Note: The distinction between these types of homology is crucial for accurate phylogenetic reconstruction, as they imply different evolutionary scenarios.

Examples from Nature

The globin family provides excellent examples of homologous proteins across diverse organisms:

[Figure: Ring-tailed lemur beta globin structure]
[Figure: Goldfish beta globin structure]
[Figure: Soybean leghemoglobin structure]

2. Pairwise Alignment

The Goal of Pairwise Alignment

Aligning one sequence with another allows us to assess the homology between the two sequences.

Alignment serves several crucial purposes in sequence analysis:

Dot Plots

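A dot plot is a simple visual method for assessing pairwise similarity: a dot is placed at position (i, j) wherever character i of one sequence matches character j of the other, so shared regions appear as diagonal lines. Consider these two sequences: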

  1. "decided to build three ships, three Arks in space"
  2. "decided to build three Arks in space"
[Figure: Dot plot showing diagonal lines indicating runs of matching sites]
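
The dot plot for the two example sentences can be computed directly. The sketch below is one possible way to do it, not the method used to draw the figure: `dot_plot` and `diagonal_runs` are hypothetical helper names, and the minimum run length of 10 is an arbitrary threshold chosen to suppress short chance matches.

```python
def dot_plot(x, y):
    """Return a len(x) by len(y) matrix of booleans, True where x[i] == y[j]."""
    return [[a == b for b in y] for a in x]


def diagonal_runs(matrix, min_len):
    """Return (i, j, length) for every maximal diagonal run of matches
    that is at least min_len cells long."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    runs = []
    for i in range(rows):
        for j in range(cols):
            if not matrix[i][j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue
            length = 0
            while (i + length < rows and j + length < cols
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


x = "decided to build three ships, three Arks in space"
y = "decided to build three Arks in space"
runs = diagonal_runs(dot_plot(x, y), min_len=10)
for i, j, length in runs:
    print(i, j, repr(x[i:i + length]))
```

The long runs correspond to the shared prefix "decided to build three " and the shared suffix " three Arks in space"; the offset between the two diagonals reflects the words present only in the first sentence.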

Scoring Alignments

Alignments are evaluated using scoring systems that assign numeric values to each column:

x' = a - c g g - t s
y' = a w - g c c t t

Column types and their typical scores:
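
As a minimal sketch, each column of the example alignment can be classified as a match, a mismatch, or a gap and then scored. The numeric values used here (+1, -1, -2) are illustrative assumptions, not the scores from the lecture, and the function names are hypothetical.

```python
def classify_columns(x_aln, y_aln, gap="-"):
    """Label each alignment column as 'match', 'mismatch', or 'gap'."""
    if len(x_aln) != len(y_aln):
        raise ValueError("aligned sequences must have equal length")
    kinds = []
    for a, b in zip(x_aln, y_aln):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap or b == gap:
            kinds.append("gap")
        elif a == b:
            kinds.append("match")
        else:
            kinds.append("mismatch")
    return kinds


def score_alignment(x_aln, y_aln, match=1, mismatch=-1, gap_score=-2):
    """Sum per-column scores; the default values are assumed for illustration."""
    values = {"match": match, "mismatch": mismatch, "gap": gap_score}
    return sum(values[kind] for kind in classify_columns(x_aln, y_aln))


x_aln = "a-cgg-ts"   # x' from the example above, spaces removed
y_aln = "aw-gcctt"   # y' from the example above, spaces removed
print(classify_columns(x_aln, y_aln))
print(score_alignment(x_aln, y_aln))
```

The example alignment has three matches, two mismatches, and three gap columns, so under the assumed values the total is 3 - 2 - 6 = -5.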

Scoring Methods

Two main approaches are used for scoring alignments:

1. Model-based Scoring

Based on evolutionary models, particularly log-odds scoring:

$$s(a,b) = \frac{1}{\lambda}\log \frac{p_{a,b}}{f_a f_b}$$

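Where \(p_{a,b}\) is the probability that residues \(a\) and \(b\) are aligned in homologous sequences, \(f_a\) and \(f_b\) are their background frequencies, and \(\lambda\) is a scaling constant that sets the units of the score.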
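
The log-odds formula is easy to evaluate numerically. The sketch below assumes the natural logarithm and takes the joint probability `p_ab`, the background frequencies `f_a` and `f_b`, and the scale `lam` as inputs; the probabilities in the usage lines are made-up values for illustration only.

```python
import math


def log_odds_score(p_ab, f_a, f_b, lam=1.0):
    """s(a,b) = (1/lambda) * log(p_ab / (f_a * f_b)), natural log assumed."""
    return math.log(p_ab / (f_a * f_b)) / lam


# Uniform background over four nucleotides (an assumption for this example).
f = 0.25

# A pair seen together more often than chance scores positive...
print(log_odds_score(0.125, f, f, lam=math.log(2)))
# ...and a pair seen together less often than chance scores negative.
print(log_odds_score(0.01, f, f, lam=math.log(2)))
```

Setting `lam` to `math.log(2)` expresses the score in bits: a pair that is aligned twice as often as expected by chance scores 1 bit.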

2. Empirical Scoring

Based on observed substitution patterns in real sequences, as in the PAM and BLOSUM families of substitution matrices.

Important: Different matrices are appropriate for different levels of sequence similarity. Choosing the right matrix is crucial for accurate alignment.

Evolutionary Interpretation

For homologous sequences with divergence time \(t\) and mutation rate \(\mu\):

[Figure: Variation of match/mismatch probabilities with evolutionary distance \(d = \mu t\)]
[Figure: Variation of match/mismatch scores with evolutionary distance]
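
The qualitative behaviour can be reproduced with a simple substitution model. The sketch below assumes the Jukes-Cantor model with uniform nucleotide frequencies, which the source does not name, and treats \(d\) as the expected number of substitutions per site separating the two sequences; how that maps onto \(\mu t\) depends on the rate convention.

```python
import math


def jc_match_probability(d):
    """Probability that two homologous sites are identical under the
    Jukes-Cantor model (assumed here), at distance d."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


def jc_log_odds(d):
    """Return (match_score, mismatch_score) as natural-log odds for d > 0."""
    p_same = jc_match_probability(d)
    # Joint probability of a given identical pair is (1/4) * p_same, and the
    # background probability is (1/4) * (1/4), so the odds ratio is 4 * p_same.
    match = math.log(4.0 * p_same)
    # The remaining probability is shared equally among 3 possible mismatches.
    mismatch = math.log(4.0 * (1.0 - p_same) / 3.0)
    return match, mismatch


for d in (0.1, 0.5, 1.0, 2.0):
    match, mismatch = jc_log_odds(d)
    print(d, round(jc_match_probability(d), 3), round(match, 3), round(mismatch, 3))
```

Under this model the match probability decays from 1 towards 1/4, and both scores approach 0 as \(d\) grows: at large distances a match carries almost no evidence for homology and a mismatch almost none against it.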

Gap Penalties

[Figure: Illustration of gap penalties in sequence alignment]

Two main approaches for penalizing gaps in alignments:

  1. Linear gap penalty:
    $$\gamma(g) = -gd$$
    where \(d\) is the gap penalty and \(g\) is the gap length
  2. Affine gap penalty:
    $$\gamma(g) = -d - (g-1)e$$
    where \(d\) is the gap opening penalty and \(e\) is the gap extension penalty
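
The two penalty functions can be written out directly. This is a minimal sketch; the parameter values in the usage lines are arbitrary examples, not recommended settings.

```python
def linear_gap(g, d):
    """Linear penalty: gamma(g) = -g * d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine penalty: gamma(g) = -d - (g - 1) * e, for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


# Compare how the two schemes treat one long gap (example values only).
print(linear_gap(3, d=8))
print(affine_gap(3, d=12, e=2))
```

With \(e < d\), the affine scheme charges less for extending an existing gap than for opening a new one, so a single long gap is preferred over several short gaps of the same total length.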

3. Pairwise Alignment Algorithms

Dynamic Programming

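Dynamic programming is a powerful method for solving combinatorial optimization problems. For problems that decompose into overlapping sub-problems, such as alignment under an additive scoring scheme, it is guaranteed to find an optimal solution. It is based on the principle of optimality: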

A sub-optimal solution of a sub-problem cannot be part of the optimal solution of the full problem.
[Figure: Illustration of the principle of optimality in dynamic programming]

Key features of dynamic programming:

Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm (1970) performs global alignment using dynamic programming.

Algorithm Overview

Given sequences X = (x₁, x₂, ..., xₘ) and Y = (y₁, y₂, ..., yₙ), where m and n are the lengths of sequences X and Y respectively, define F(i,j) as the score of the best alignment between the first i characters of X and the first j characters of Y.

Recurrence Relation

$$F(i,j) = \max\begin{cases} F(i-1,j-1) + s(x_i,y_j) & \text{match/mismatch}\\ F(i,j-1) - d & \text{gap in X}\\ F(i-1,j) - d & \text{gap in Y} \end{cases}$$

Initialization
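
With the linear gap penalty \(d\) used in the recurrence, the standard boundary conditions (as given, for example, in Durbin et al.) fill the first row and column with the cost of aligning a prefix entirely against gaps:

$$F(0,0) = 0, \qquad F(i,0) = -id \quad (1 \le i \le m), \qquad F(0,j) = -jd \quad (1 \le j \le n)$$

The score of the optimal global alignment is then read from \(F(m,n)\).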

Biological Application

The Needleman-Wunsch algorithm is particularly useful when:

  • Comparing full-length protein sequences from different species
  • Aligning complete genes to study synteny
  • Example: Aligning human and mouse insulin genes to study conservation

Computational Complexity

  • Time complexity: O(mn) where m and n are sequence lengths
  • Space complexity: O(mn) for the full matrix
  • Traceback: O(m+n)
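
The recurrence, boundary conditions, and traceback fit in a short function. The sketch below is a minimal illustration rather than a reference implementation: it assumes a simple match/mismatch score in place of a substitution matrix, a linear gap penalty `d`, the boundary values F(i,0) = -id and F(0,j) = -jd, and arbitrary default parameter values.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty.
    Returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # O(mn) space: the full (m+1) x (n+1) matrix.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d

    # O(mn) time: each cell looks at three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,    # gap in X
                          F[i - 1][j] - d)    # gap in Y

    # O(m+n) traceback from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("ACGT", "AGT")
print(score)
print(ax)
print(ay)
```

When several alignments share the optimal score, the traceback above returns only one of them; the order in which the three cases are tested decides which.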

Smith-Waterman Algorithm

The Smith-Waterman algorithm computes local alignment, finding the best alignment of subsequences while ignoring scores of regions on either side.

[Figure: Local alignment finds high-scoring subsequences]

Key Difference

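The main difference from Needleman-Wunsch is the addition of a fourth option, 0, in the recurrence. Two further changes follow from it: the first row and column are initialized to 0, and the traceback starts at the highest-scoring cell and stops at the first cell with score 0. The modified recurrence is: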

$$F(i,j) = \max\begin{cases} F(i-1,j-1) + s(x_i,y_j)\\ F(i,j-1) - d\\ F(i-1,j) - d\\ 0 & \text{start new alignment} \end{cases}$$

This allows the algorithm to:

[Figure: Example of Smith-Waterman algorithm finding local alignments]
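
A local alignment can be sketched with small changes to the global algorithm. The code below is a minimal illustration under assumed settings: a simple match/mismatch score, a linear gap penalty `d`, a first row and column of zeros, and arbitrary default parameter values. It reports only one best-scoring local alignment.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty.
    Returns (score, aligned_x, aligned_y) for one best local alignment."""
    m, n = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # First row and column stay 0, so an alignment may start anywhere.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,                                        # start new alignment
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("xxxACGTxxx", "yyACGTyy"))
```

For local alignment to be meaningful, the expected score of aligning random residues should be negative; otherwise long chance alignments outscore genuinely similar regions.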

Specialized Variants

Repeated Matches

For finding multiple occurrences of a motif in a longer sequence:

[Figure: Finding repeated motifs using modified dynamic programming]

Overlap Matches

For sequence assembly and finding overlapping sequences:

[Figure: Overlap alignment for sequence assembly]
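
One common formulation of overlap alignment, following Durbin et al., leaves overhanging ends unpenalized: the first row and column are set to 0, the global recurrence is used unchanged, and the best score is taken over the last row and last column. The sketch below computes only that score, with an assumed match/mismatch scheme, a linear gap penalty, and a hypothetical function name.

```python
def overlap_score(x, y, match=1, mismatch=-1, d=2):
    """Best overlap-alignment score between x and y with free end overhangs."""
    m, n = len(x), len(y)
    # First row and column are 0: leading overhangs cost nothing.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Trailing overhangs are free too: the alignment may end anywhere
    # on the bottom row or the right-hand column.
    return max(max(F[m]), max(row[n] for row in F))


# The suffix "CGTACG" of the first read matches the prefix of the second.
print(overlap_score("AAAACGTACG", "CGTACGTTTT"))
```

In an assembly setting, a high overlap score between two reads suggests that they come from adjacent, overlapping positions in the original sequence.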

Summary and Next Steps

In this lecture, we've covered the fundamental concepts of sequence homology and pairwise alignment:

These concepts form the foundation for:

Further Reading:
  • Durbin et al. (1998) "Biological Sequence Analysis" - Chapters 2-3
  • Felsenstein (2004) "Inferring Phylogenies" - Chapter 11
  • Original papers: Needleman & Wunsch (1970), Smith & Waterman (1981)