Lecture 3

Multiple Sequence Alignment

Week 3 · 90 minutes · Required reading: Durbin et al. Ch. 6

1. What is a Multiple Sequence Alignment?

Multiple sequence alignment (MSA) is the process of aligning three or more biological sequences (protein or nucleic acid) to identify regions of similarity that may indicate functional, structural, or evolutionary relationships.

[Figure: Example of a multiple sequence alignment of Ebola virus sequences]

Formal Definition

Given sequences $X^{(1)},\ldots,X^{(N)}$ of lengths $n_1,\ldots,n_N$, we seek aligned sequences $A^{(1)},\ldots,A^{(N)}$ of length $n\geq\max\{n_i\}$ such that:

  • We can obtain $X^{(i)}$ from $A^{(i)}$ by removing gap characters
  • No columns contain all gaps
  • The score of the alignment is optimal
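
The first two conditions are easy to check mechanically. The following is a minimal sketch, assuming sequences and alignment rows are plain Python strings and `-` is the gap character; the function name `is_valid_alignment` is made up for illustration. It does not test optimality, which depends on the scoring function.

```python
def is_valid_alignment(seqs, rows, gap="-"):
    """Check the structural conditions of the formal definition."""
    if len(seqs) != len(rows):
        return False
    # All rows must have the same length n.
    if len({len(r) for r in rows}) != 1:
        return False
    # Removing gap characters from row i must give back sequence i.
    if any(r.replace(gap, "") != s for s, r in zip(seqs, rows)):
        return False
    # No column may consist of gaps only.
    return not any(all(c == gap for c in col) for col in zip(*rows))


seqs = ["ACTCAT", "AGTCT", "ACGTCT"]
rows = ["A-CTCAT", "A-GTC-T", "ACGTC-T"]
print(is_valid_alignment(seqs, rows))
```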

Notation

Throughout this lecture, we'll use the following notation:

Sequence $i$:
$X^{(i)} = (x^{(i)}_1,x^{(i)}_2,\ldots,x^{(i)}_{n_i})$
Row $i$ in alignment:
$A^{(i)} = (a^{(i)}_1,a^{(i)}_2,\ldots,a^{(i)}_n)$
Column $j$ in alignment:
$A_j = (a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(N)})$

Why Multiple Sequence Alignment?

Multiple sequence alignments are essential for:

2. Scoring Multiple Sequence Alignments

Sum of Pairs (SP) Scoring

The most commonly used scoring function for MSAs is the sum of pairs method, which calculates the total score as the sum of all pairwise alignment scores.

Column Score

For column $j$ of the alignment (using the notation above, with $k$ and $l$ indexing sequences):

$$S(A_j) = \sum_{k=1}^{N-1}\sum_{l=k+1}^{N} s(a_j^{(k)},a_j^{(l)})$$

where $s(a,b)$ is the pairwise substitution score between characters $a$ and $b$.

Example

Consider the following alignment:

A-CTCAT
A-GTC-T
ACGTC-T

Let's calculate scores for specific columns:
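
One way to carry out this calculation is sketched below. The scoring values are assumptions chosen for illustration only (match +1, mismatch -1, residue against gap -2, gap against gap 0); they are not taken from the lecture, and a real analysis would use a substitution matrix such as BLOSUM62.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed toy pairwise score s(a, b); gap-gap pairs score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(col):
    """Sum-of-pairs score of one column: sum over all unordered pairs."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


alignment = ["A-CTCAT", "A-GTC-T", "ACGTC-T"]
scores = [column_score(col) for col in zip(*alignment)]
print(scores)       # [3, -4, -1, 3, 3, -4, 3]
print(sum(scores))  # 3
```

Under these assumed scores, the first column (A, A, A) contributes three matching pairs, while the second column (-, -, C) contributes one gap-gap pair and two residue-gap pairs.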

Problems with Sum of Pairs Scoring

Theoretical Issues with SP Scoring:
  • Substitution scores were derived as log-odds scores for pairwise comparisons
  • The mathematically correct approach would use log-odds scores for triples, quadruples, etc.
  • No probabilistic justification for the sum of pairs approach

Mathematically, the issue is that:

$$s(a,b,c) = \log\frac{p_{abc}}{f_af_bf_c} \neq \log\frac{p_{ab}}{f_af_b} + \log\frac{p_{ac}}{f_af_c} + \log\frac{p_{bc}}{f_bf_c}$$
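
A small worked case makes the inequality concrete. Assume, purely for illustration, a two-letter alphabet in which the three aligned residues are always identical and both letters are equally likely, so $p_{aaa} = 1/2$, $p_{aa} = 1/2$ for every pair, and $f_a = 1/2$. Then:

$$\log\frac{p_{aaa}}{f_a^3} = \log\frac{1/2}{1/8} = \log 4 \qquad\text{but}\qquad 3\log\frac{p_{aa}}{f_a^2} = 3\log\frac{1/2}{1/4} = \log 8$$

In this toy case the sum of pairs overcounts the evidence, because the third pair adds no information beyond the first two.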

Tree-based Scoring

A more biologically motivated approach considers the evolutionary tree relating the sequences:

[Figure: Comparison of sum of pairs scoring versus tree-based scoring]

Tree-based scores are thought to be more biologically appropriate, but they have practical limitations:

Despite its theoretical limitations, sum of pairs scoring is almost always used in practice due to its computational simplicity.

3. Multidimensional Dynamic Programming

The Naïve Approach

We can extend the pairwise alignment dynamic programming approach to multiple sequences. Define $F(i_1,i_2,\ldots,i_N)$ as the score of the best alignment of the prefixes of sequences $1, 2, \ldots, N$ that end at positions $i_1, i_2, \ldots, i_N$.

The recurrence relation becomes:

$$F(i_1,i_2,\ldots,i_N) = \max \begin{cases} F(i_1-1,i_2-1,\ldots,i_N-1) + S(x^{(1)}_{i_1},x^{(2)}_{i_2},\ldots,x^{(N)}_{i_N})\\ F(i_1, i_2-1, \ldots,i_N-1) + S(-,x^{(2)}_{i_2},\ldots,x^{(N)}_{i_N})\\ F(i_1-1, i_2, i_3-1, \ldots,i_N-1) + S(x^{(1)}_{i_1},-,x^{(3)}_{i_3},\ldots,x^{(N)}_{i_N})\\ \vdots\\ \text{(all $2^N-1$ combinations of gaps and characters)} \end{cases}$$
[Figure: Visualization of the multidimensional dynamic programming matrix]
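
The recurrence can be sketched directly in Python for very small inputs. This is a minimal illustration, not an efficient implementation: it assumes sum-of-pairs column scores with made-up values (match +1, mismatch -1, residue-gap -2, gap-gap 0) and linear gap costs, and it returns only the optimal score, without traceback. The names `sp_column` and `msa_dp_score` are hypothetical.

```python
from itertools import combinations, product


def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (assumed toy scores; gap-gap = 0)."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def msa_dp_score(seqs):
    """Optimal SP score of a multiple alignment by N-dimensional DP."""
    n = len(seqs)
    # Each move says, per sequence, whether the column consumes a residue (1)
    # or holds a gap (0). The all-gap move is excluded: 2^N - 1 moves.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    F = {(0,) * n: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is visited.
    for idx in product(*(range(len(s) + 1) for s in seqs)):
        if not any(idx):
            continue
        best = None
        for m in moves:
            prev = tuple(i - step for i, step in zip(idx, m))
            if min(prev) < 0:
                continue
            col = [s[i - 1] if step else "-" for s, i, step in zip(seqs, idx, m)]
            cand = F[prev] + sp_column(col)
            if best is None or cand > best:
                best = cand
        F[idx] = best
    return F[tuple(len(s) for s in seqs)]


print(msa_dp_score(["ACGT", "ACGT", "ACGT"]))  # 12
```

The dictionary `F` holds one entry per cell of the $N$-dimensional matrix, so this sketch is only usable for a handful of short sequences.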

Computational Complexity

The naïve dynamic programming approach has prohibitive computational requirements:

Complexity Analysis:
  • Space complexity: $O(n^N)$ - need to store $\prod_{q=1}^{N}n_q$ values
  • Time complexity: $O(2^N n^N)$ - each cell requires maximizing over $2^N-1$ possibilities

Example Calculation

For an alignment of 5 sequences, each 100 characters long:

  • Space needed: $101^5 \approx 10^{10}$ numbers
  • With 32-bit integers (4 bytes each): about 42 GB (~39 GiB) of memory
  • Time: Must compute $10^{10}$ cells, each requiring comparison of $2^5-1 = 31$ values
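
These figures can be reproduced with a few lines of arithmetic. The helper name `dp_cost` is made up for this sketch; it assumes one stored value per cell and one candidate per move.

```python
def dp_cost(lengths, bytes_per_cell=4):
    """Cells, memory in bytes, and candidate evaluations for naive MSA DP."""
    cells = 1
    for n in lengths:
        cells *= n + 1                 # positions 0..n along each axis
    moves = 2 ** len(lengths) - 1      # gap/residue combinations per cell
    return cells, cells * bytes_per_cell, cells * moves


cells, mem_bytes, evaluations = dp_cost([100] * 5)
print(cells)                 # 10510100501
print(mem_bytes / 2 ** 30)   # memory in GiB
print(evaluations)           # 325813115531
```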

The MSA Algorithm

Carrillo & Lipman (1988) developed an improved algorithm that uses bounds on pairwise alignments to reduce the search space:

[Figure: The MSA algorithm reduces the search space by constraining pairwise alignment scores]

While this is a significant improvement, it's still impractical for most real-world alignment problems with more than a few sequences.

4. Progressive Alignment Methods

Given the computational limitations of exact algorithms, practical MSA methods use heuristic approaches. The most successful of these is progressive alignment.

General Strategy

[Figure: Progressive alignment builds up the MSA by successively aligning sequences or groups]

Key decisions in progressive alignment:

  1. Order of alignments: Which sequences to align first?
  2. Alignment strategy: Allow only sequence-to-group, or also group-to-group?
  3. Scoring method: How to score alignments involving groups?

Guide Trees

Progressive alignment algorithms use "guide trees" to determine the order of alignment:

[Figure: A guide tree determines the order in which sequences are progressively aligned]
Note: Guide trees are rough phylogenetic trees used only to bootstrap the alignment process. They are not suitable for phylogenetic inference.

UPGMA Tree Construction

The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is commonly used to build guide trees:

UPGMA Example

Given the distance matrix:

    A   B   C   D
A   -   4   8   8
B       -   8   8
C           -   6
D               -

Algorithm steps:

  1. Find minimum distance: d(A,B) = 4
  2. Join A and B to form cluster E
  3. Recalculate distances to E
  4. Repeat until all sequences are joined
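
The steps above can be sketched as follows for the example matrix. This is a minimal illustration: clusters are represented as nested tuples (so the cluster called E above appears as `('A', 'B')`), node heights are taken as half the joining distance, and ties are broken by whichever pair is encountered first. The function name `upgma` and the input format are assumptions of this sketch.

```python
from itertools import combinations


def upgma(labels, d):
    """UPGMA clustering. d maps pairs (a, b) of leaf labels to distances."""
    dist = {frozenset(k): v for k, v in d.items()}
    size = {x: 1 for x in labels}
    active = list(labels)
    merges = []
    while len(active) > 1:
        # Steps 1 and 2: find the closest pair of clusters and join them.
        a, b = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        new = (a, b)
        merges.append((a, b, dist[frozenset((a, b))] / 2))
        size[new] = size[a] + size[b]
        # Step 3: distance to the new cluster is the size-weighted average.
        for c in active:
            if c != a and c != b:
                dist[frozenset((new, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / size[new]
        active = [c for c in active if c != a and c != b] + [new]
    return active[0], merges


d = {("A", "B"): 4, ("A", "C"): 8, ("A", "D"): 8,
     ("B", "C"): 8, ("B", "D"): 8, ("C", "D"): 6}
tree, merges = upgma(["A", "B", "C", "D"], d)
print(tree)                        # (('A', 'B'), ('C', 'D'))
print([h for _, _, h in merges])   # [2.0, 3.0, 4.0]
```

For this matrix, A and B join first at distance 4, then C and D at distance 6, and the two clusters join last at average distance 8.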

The Feng-Doolittle Algorithm

Published in 1987, this was one of the first practical progressive alignment algorithms:

  1. Calculate pairwise distances: Align all pairs of sequences and convert alignment scores to evolutionary distances:
    $$d(k,l) = -\log\frac{S(A^{(k)},A^{(l)})-S_{\text{rand}}}{S_{\text{max}}-S_{\text{rand}}}$$
  2. Build guide tree: Use UPGMA or similar clustering algorithm
  3. Progressive alignment: Starting from the leaves, align sequences following the guide tree
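
The score-to-distance conversion in step 1 can be written as a small function. This sketch assumes the three scores are supplied by the caller (how $S_{\text{rand}}$ and $S_{\text{max}}$ are estimated is outside its scope), and returning infinity for a non-positive normalized score is a choice made here for illustration.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """d = -log((S_obs - S_rand) / (S_max - S_rand))."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Observed score no better than random: treat as maximally distant.
        return math.inf
    return -math.log(s_eff)


print(feng_doolittle_distance(100, 100, 20))  # identical sequences: distance 0
print(feng_doolittle_distance(60, 100, 20))   # -log(0.5), about 0.693
```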

Sequence-to-Group Alignment

[Figure: Aligning a sequence to an existing group alignment]

Group-to-Group Alignment

[Figure: Aligning two group alignments together]
"Once a gap, always a gap" rule: After alignment, gap symbols are replaced by a neutral character 'X'. This encourages gaps to occur in the same columns in subsequent alignments.

Progressive Misalignment

A major limitation of progressive alignment is that errors made early in the process cannot be corrected later:

[Figure: Example of how early alignment decisions can lead to suboptimal final alignments]

Profile Alignment

Modern progressive alignment methods use profile alignment to align groups of sequences:

[Figure: Profile alignment considers the frequency of each character at each position]

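The total sum-of-pairs score of the merged alignment decomposes as $S(A) + S(B) + S(A \times B)$, where $S(A)$ and $S(B)$ are the scores within each group and $S(A \times B)$ sums over all cross pairs (one sequence from each group). Because the alignments within $A$ and $B$ are already fixed and gap-gap pairs score zero, only the cross term needs to be optimized.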
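
A profile, and the contribution of one aligned column pair to the cross term $S(A \times B)$, might be computed as in the sketch below. The function names and the toy scoring function are assumptions for illustration. The column score returned here is the average cross-pair score; multiplying it by the product of the two group sizes gives that column's share of $S(A \times B)$.

```python
from collections import Counter


def build_profile(rows):
    """Per-column character frequencies of an aligned group."""
    n = len(rows)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*rows)]


def profile_column_score(pa, pb, score):
    """Average pairwise score between one column of each profile."""
    return sum(fa * fb * score(a, b)
               for a, fa in pa.items()
               for b, fb in pb.items())


def toy_score(a, b):
    """Assumed scores: match +1, mismatch -1, residue-gap -2, gap-gap 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


group_a = build_profile(["AC", "AG"])
group_b = build_profile(["AC"])
print(group_a)  # [{'A': 1.0}, {'C': 0.5, 'G': 0.5}]
print(profile_column_score(group_a[1], group_b[1], toy_score))  # 0.0
```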

CLUSTALW

Thompson, Higgins, and Gibson (1994) developed CLUSTALW, which became one of the most widely used MSA programs (with over 55,000 citations!).

CLUSTALW is essentially the Feng-Doolittle algorithm with profile alignment and many additional heuristics:

  1. Calculate all pairwise distances
  2. Construct guide tree using UPGMA or neighbor-joining
  3. Progressively align using sequence-sequence, sequence-profile, and profile-profile alignment

Modern Progressive Alignment Tools

  • Clustal Omega: Successor to CLUSTALW, faster and more accurate
  • MUSCLE: Uses iterative refinement after initial progressive alignment
  • MAFFT: Offers various algorithms optimized for different scenarios

5. Iterative Refinement Methods

Iterative refinement methods attempt to improve an initial alignment by making small changes and accepting improvements (hill climbing to a local optimum).

The Barton-Sternberg Algorithm (1987)

  1. Find the two most similar sequences and align them
  2. Find the sequence most similar to the profile of the current alignment and add it
  3. Repeat step 2 until all sequences are included
  4. Refinement phase: Remove each sequence and realign it to the profile of the others
  5. Repeat step 4 until convergence or a fixed number of iterations
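
The refinement phase (steps 4 and 5) can be illustrated with a toy hill-climbing loop. This is a simplified sketch and not the published algorithm: it keeps the alignment width fixed, realigns the removed sequence by brute force over all gap placements (feasible only for tiny inputs), scores with assumed sum-of-pairs values, and accepts a change only if the score strictly improves. All function names are hypothetical.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment (assumed toy scores)."""
    total = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def best_row(seq, others, width):
    """Try every placement of seq's residues in `width` columns."""
    best = None
    for cols in combinations(range(width), len(seq)):
        row = ["-"] * width
        for c, ch in zip(cols, seq):
            row[c] = ch
        row = "".join(row)
        s = sp_score(others + [row])
        if best is None or s > best[0]:
            best = (s, row)
    return best[1]


def refine(rows, max_rounds=10):
    """Remove each sequence in turn and realign it to the others."""
    rows = list(rows)
    width = len(rows[0])
    for _ in range(max_rounds):
        changed = False
        for k in range(len(rows)):
            others = rows[:k] + rows[k + 1:]
            new_row = best_row(rows[k].replace("-", ""), others, width)
            if sp_score(others + [new_row]) > sp_score(rows):
                rows[k] = new_row
                changed = True
        if not changed:  # converged to a local optimum
            break
    return rows


start = ["ACGT", "ACGT", "A-CG"]
refined = refine(start)
print(sp_score(start), sp_score(refined))  # -2 6
print(refined)                             # ['ACGT', 'ACGT', 'ACG-']
```

A practical implementation would realign each sequence to the profile of the others with dynamic programming and would also remove any all-gap columns that arise; both are omitted here to keep the loop itself visible.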

Other Iterative Approaches

6. Practical Considerations and Limitations

General Limitations

Important considerations when using MSA algorithms:
  • These algorithms maximize match scores, but the "best" scoring alignment may not be biologically correct
  • Progressive alignments deteriorate as more sequences are added
  • Early mistakes in progressive alignment are frozen and cannot be corrected
  • Manual correction is often necessary for alignments of divergent sequences

Statistical Approaches: Fitting vs. Modeling

There are two philosophical approaches to sequence alignment:

Statistical Fitting

  • Count change frequencies in real data
  • Build empirical descriptions (e.g., BLOSUM62)
  • Use log-odds ratios for scoring
  • Apply in ad hoc algorithms (BLAST, ClustalW)

Probabilistic Modeling

  • Define evolutionary process models
  • Specify substitution and indel rates
  • Estimate parameters using likelihood/Bayesian methods
  • Co-estimate alignment and phylogeny

BAli-Phy: A Bayesian Approach

Suchard and Redelings (2006) developed BAli-Phy, which jointly estimates alignment and phylogeny using Bayesian inference:

[Figure: BAli-Phy provides uncertainty estimates for both alignment and phylogeny]

This approach is statistically principled but computationally intensive, which limits its use to smaller datasets.

Summary

  1. Scoring: Multiple sequence alignments are typically scored using sum of pairs, despite theoretical limitations
  2. Exact algorithms: Dynamic programming can find optimal alignments but needs $O(n^N)$ space and $O(2^N n^N)$ time, making it impractical for most problems
  3. Progressive alignment: Practical heuristic that builds alignments incrementally using a guide tree
  4. Iterative refinement: Can improve initial alignments but may still converge to local optima
  5. Trade-offs: Current methods balance biological accuracy with computational feasibility
  6. Future directions: Bayesian approaches offer principled solutions but need computational improvements
Recommended Reading:
  • Durbin et al. (1998) "Biological Sequence Analysis" - Chapter 6
  • Thompson et al. (1994) "CLUSTAL W" - Nucleic Acids Research
  • Notredame (2007) "Recent evolutions of multiple sequence alignment algorithms" - PLOS Comp Bio

Check Your Understanding

  1. Why is sum of pairs scoring theoretically problematic?
  2. What makes exact dynamic programming impractical for multiple sequences?
  3. How do guide trees influence progressive alignment quality?
  4. What is the "once a gap, always a gap" rule and why is it used?
  5. When might you choose iterative refinement over simple progressive alignment?