Multiple sequence alignment
Lecturer: Alexei Drummond

What is a Multiple Sequence Alignment?

Given sequences $X^{(1)},\ldots,X^{(N)}$ of lengths $n_1,\ldots,n_N$, seek rows $A^{(1)},\ldots,A^{(N)}$ of common length $n\geq\max\{n_i\}$ such that

  • $X^{(i)}$ is obtained from $A^{(i)}$ by removing gap characters,
  • no column consists entirely of gaps,
  • the score of the alignment is optimal.

Notation

Sequence $i$
$X^{(i)} = (x^{(i)}_1,x^{(i)}_2,\ldots,x^{(i)}_{n_i})$
Row $i$ in alignment
$A^{(i)} = (a^{(i)}_1,a^{(i)}_2,\ldots,a^{(i)}_n)$
Column $j$ in alignment
$A_j = (a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(N)})$

Scoring Multiple Sequence Alignments

Scoring function: Sum of pairs (SP)

Column score:
$$S(A_j) = \sum_{k=1}^{N}\sum_{l=k+1}^{N} s(a_j^{(k)},a_j^{(l)})$$
     A-CTCAT
     A-GTC-T
     ACGTC-T
Column 3 score:
$s(C,G) + s(C,G) + s(G,G) = \text{match}+2\times\text{mismatch}$
Column 6 score:
$s(A,-) + s(A,-) + s(-,-) = -2\times\text{gap} + s(-,-) = -2\times\text{gap}$, taking $s(-,-)=0$.
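
The two column scores above can be checked with a short sketch. The numerical values of match, mismatch and gap are not given in the lecture; the values below (1, -1 and a gap penalty of 2) are assumptions chosen only for illustration, and `s`, `column_score` and `sp_score` are hypothetical helper names.

```python
from itertools import combinations

# Assumed illustrative values; GAP is a penalty, so it is subtracted.
MATCH, MISMATCH, GAP = 1, -1, 2

def s(a, b):
    """Pairwise score of two alignment characters, with s('-', '-') = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -GAP
    return MATCH if a == b else MISMATCH

def column_score(column):
    """Sum-of-pairs score of one column: sum of s over all unordered row pairs."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_score(rows):
    """SP score of a whole alignment: sum of the column scores."""
    return sum(column_score(col) for col in zip(*rows))

rows = ["A-CTCAT", "A-GTC-T", "ACGTC-T"]
columns = list(zip(*rows))
col3 = column_score(columns[2])   # match + 2 * mismatch
col6 = column_score(columns[5])   # -2 * gap
total = sp_score(rows)
```

With these assumed values, column 3 scores `MATCH + 2 * MISMATCH` and column 6 scores `-2 * GAP`, in agreement with the worked example.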

Problems with SP scoring of MSAs

  • Substitution scores are derived as log-odds scores for pairwise comparisons.
  • The "right thing" would be to construct log-odds scores for triples, quadruples, etc. (depending on the number of rows in the alignment). E.g. \begin{align*} s(a,b,c) & = \log\frac{p_{abc}}{f_af_bf_c}\\ & \neq \log\frac{p_{ab}}{f_af_b} + \log\frac{p_{ac}}{f_af_c} + \log\frac{p_{bc}}{f_bf_c} \end{align*}
  • No probabilistic justification!

Another way to think about the problem

Tree-based scores

Thought to be the most biologically appropriate, but

  • We don't know the tree.
  • We need to infer the characters on internal nodes to identify most parsimonious history.
  • There may be different trees for different parts of the alignment (recombination!)
Sum of pairs is almost always used in practice.

Multidimensional Dynamic Programming

Naïve approach

Define $F(i_1,i_2,\ldots,i_N)$ to be the score of the best alignment up to the subsequences ending in $x^{(1)}_{i_1}, x^{(2)}_{i_2}, \ldots, x^{(N)}_{i_N}$.

We can then find the following recurrence relation:

\begin{align*}
F(i_1,i_2,\ldots,i_N)=\max\left\{\begin{array}{lcl}
F(i_1-1,i_2-1,\ldots,i_N-1) & + & S(x^{(1)}_{i_1},x^{(2)}_{i_2},\ldots,x^{(N)}_{i_N})\\
F(i_1, i_2-1, \ldots,i_N-1) & + & S(-,x^{(2)}_{i_2},\ldots,x^{(N)}_{i_N})\\
F(i_1-1, i_2, i_3-1, \ldots,i_N-1) & + & S(x^{(1)}_{i_1},-,\ldots,x^{(N)}_{i_N}) \\
& \vdots &\\
F(i_1-1, i_2-1, \ldots,i_N) & + & S(x^{(1)}_{i_1},x^{(2)}_{i_2},\ldots,-)\\
F(i_1, i_2, i_3-1, \ldots,i_N-1) & + & S(-,-,\ldots,x^{(N)}_{i_N})\\
& \vdots &\\
F(i_1, i_2-1, \ldots, i_{N-1}-1, i_N) & + & S(-,x^{(2)}_{i_2},\ldots,-)\\
& \vdots &
\end{array}\right.
\end{align*}
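
As a minimal sketch, the recurrence can be written directly for very small inputs. The scoring values (match 1, mismatch -1, gap penalty 2, $s(-,-)=0$) and the function names are assumptions for illustration; only the optimal score is returned, with no traceback.

```python
from itertools import combinations, product

def s(a, b, match=1, mismatch=-1, gap=2):
    """Pairwise score with assumed illustrative values; s('-', '-') = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -gap
    return match if a == b else mismatch

def column_score(column):
    """Sum-of-pairs score of a single alignment column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def naive_msa_score(seqs):
    """Optimal SP score of N sequences via the full N-dimensional recurrence."""
    n_seqs = len(seqs)
    # The 2^N - 1 possible moves: 1 = emit a residue in that row, 0 = emit a gap.
    steps = [d for d in product((0, 1), repeat=n_seqs) if any(d)]
    F = {(0,) * n_seqs: 0}
    # product() yields index tuples in lexicographic order, so every
    # predecessor cell has been filled before it is needed.
    for idx in product(*(range(len(x) + 1) for x in seqs)):
        if idx in F:
            continue
        best = None
        for d in steps:
            prev = tuple(i - di for i, di in zip(idx, d))
            if min(prev) < 0:
                continue  # this move would step outside the table
            col = [x[i - 1] if di else "-" for x, i, di in zip(seqs, idx, d)]
            cand = F[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        F[idx] = best
    return F[tuple(len(x) for x in seqs)]

score = naive_msa_score(["ACTCAT", "AGTCT", "ACGTCT"])
```

The dictionary `F` holds one entry per index tuple and each entry is a maximum over all non-zero moves, which is where the exponential cost in $N$ comes from.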

Computational burden of naïve DP approach

  • $F$ consists of $\prod_{q=1}^{N}(n_q+1)$ entries (indices run from $0$ to $n_q$):
    • $O(n^N)$ space complexity.
  • Computing each element of $F$ requires maximizing over $2^N-1$ possibilities:
    • $O(2^Nn^N)$ time complexity.
    • (traceback negligible cost in comparison)
Even aside from the time cost, the space requirement for this algorithm is prohibitive. Storing $F$ for an alignment of five 100-character sequences requires $101^5\simeq 10^{10}$ numbers. Assuming 32-bit integers, this equates to $\sim$39 GiB of memory!
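
The arithmetic behind this estimate can be reproduced in a few lines. The variable names are illustrative, and the 4 bytes per entry corresponds to the 32-bit integers assumed above.

```python
import math

lengths = [100] * 5                          # five sequences of 100 characters
cells = math.prod(n + 1 for n in lengths)    # indices run from 0 to n_q
bytes_needed = cells * 4                     # 32-bit integers, 4 bytes each
gib = bytes_needed / 2**30                   # bytes -> GiB
moves_per_cell = 2 ** len(lengths) - 1       # candidates in each maximization

print(f"{cells:,} entries, about {gib:.1f} GiB, {moves_per_cell} moves per entry")
```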

Improved DP algorithm

The improved algorithm of Lipman et al. is a big improvement over the naïve approach, but it is still impractical for most alignment problems.

Multiple Alignment Software

Really need approximation methods.

Different techniques:

  1. Progressive global alignment of the sequences, starting with an alignment of the most similar sequences and then building a full alignment by adding more sequences.
  2. Iterative methods that make an initial alignment of groups of sequences and then iteratively refine the alignment to achieve a better result. (Barton-Sternberg, Simulated annealing, stochastic hill climbing, genetic algorithms.)
  3. Use of probabilistic models of the indel and substitution process to do statistical inference of alignment. ("Statistical alignment.")

Progressive alignment

General idea

Decisions:

  1. Order of alignments.
  2. Alignment of sequence to group only, or allow group to group.
  3. Method of alignment, scoring function.

Guide trees

  • Progressive alignment algorithms employ "guide trees" to guide the order in which sequences are progressively aligned.
  • In practice these are very rough phylogenetic trees (cheap to compute), unsuitable for serious phylogenetic inference but used instead simply to bootstrap the alignment process.

The UPGMA tree-building algorithm

  • Clustering algorithm based on pairwise distances.
  • Requires some distance $d(k,l)$ for every pair of items.

Distance matrix:

\begin{equation*} d = \begin{array}{ccccc} & A & B & C & D \\ A & - & 4 & 8 & 8 \\ B & & - & 8 & 8 \\ C & & & - & 6 \\ D & & & & - \end{array} \end{equation*}

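Distance matrix after merging the closest pair, $A$ and $B$ ($d=4$), into cluster $E$, with $d(E,C)$ and $d(E,D)$ taken as averages over the members of $E$:

\begin{equation*} d = \begin{array}{cccc} & E & C & D \\ E & - & 8 & 8 \\ C & & - & 6 \\ D & & & - \end{array} \end{equation*}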

Distance matrix after merging $C$ and $D$ ($d=6$) into cluster $F$:

\begin{equation*} d = \begin{array}{ccc} & E & F \\ E & - & 8 \\ F & & - \end{array} \end{equation*}
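
The sequence of merges in this example can be reproduced with a minimal UPGMA sketch. It is an illustration rather than a reference implementation: clusters are represented as nested tuples instead of being given new names such as $E$ and $F$, ties are broken by iteration order, and the function name `upgma` is hypothetical.

```python
from itertools import combinations

def upgma(labels, d):
    """Return the merges (left, right, distance) in the order they are made.

    labels: list of leaf names.
    d: dict mapping a pair (a, b) of leaf names to their distance.
    """
    dist = {frozenset(pair): value for pair, value in d.items()}
    size = {x: 1 for x in labels}          # current clusters and their sizes
    merges = []
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset((a, b))]))
        new = (a, b)
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the distances from its two children.
        for c in size:
            if c == a or c == b:
                continue
            dist[frozenset((new, c))] = (
                size[a] * dist[frozenset((a, c))]
                + size[b] * dist[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
    return merges

d = {("A", "B"): 4, ("A", "C"): 8, ("A", "D"): 8,
     ("B", "C"): 8, ("B", "D"): 8, ("C", "D"): 6}
merges = upgma(["A", "B", "C", "D"], d)
```

For the matrix above this merges $A$ with $B$ at distance 4, then $C$ with $D$ at distance 6, and finally the two clusters at distance 8.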

Distance matrix:

\begin{equation*} d = \begin{array}{cccc} & E & F \\ E & - & 8 \\ F & & - \end{array} \end{equation*}

Feng-Doolittle: Overview

Progressive alignment algorithm published in 1987 (Feng and Doolittle, J. Mol. Evol.).

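  1. Calculate a triangular matrix of $\binom{N}{2}$ distances between all pairs of $N$ sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise "distances": \[ d(k,l) = -\log\frac{S(A^{(k)},A^{(l)})-S_{\text{rand}}}{S_{\text{max}}-S_{\text{rand}}} \] Here $S_{\text{max}}$ is the average of the scores obtained by aligning each of the two sequences to itself, and $S_{\text{rand}}$ is the expected score for aligning two random sequences of the same length and residue composition.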
  2. Construct guide tree using UPGMA or some other clustering algorithm.
  3. Starting from the first internal node added to the tree, align the child nodes (which may be two sequences, a sequence and an alignment, or two alignments). Repeat for all other nodes in the order that they were added to the tree, until all sequences have been aligned.
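
The score-to-distance conversion in step 1 is simple enough to sketch directly. The function name and the example scores below are made up for illustration; how $S_{\text{rand}}$ and $S_{\text{max}}$ are estimated is left outside the sketch.

```python
import math

def feng_doolittle_distance(s_obs, s_rand, s_max):
    """Convert a raw pairwise alignment score into an approximate distance.

    s_obs:  observed score of the pairwise alignment
    s_rand: score expected for unrelated (random) sequences
    s_max:  score expected for identical sequences
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)   # normalized score
    if s_eff <= 0:
        raise ValueError("observed score is no better than random")
    return -math.log(s_eff)

# Hypothetical scores: identical sequences give distance 0, and the
# distance grows as the observed score approaches the random baseline.
d_identical = feng_doolittle_distance(s_obs=100, s_rand=20, s_max=100)
d_halfway = feng_doolittle_distance(s_obs=60, s_rand=20, s_max=100)
```

The guard reflects a limitation of the formula itself: the logarithm is undefined when the observed score does not exceed the random baseline.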

Feng-Doolittle: Sequence$\leftrightarrow$Group alignment

Feng-Doolittle: Group$\leftrightarrow$Group alignment

Feng-Doolittle

  • After each alignment is completed, gap symbols are replaced by a neutral "X" character.
  • Feng and Doolittle call this the rule of "once a gap, always a gap".
  • Encourages gaps to occur in same columns in subsequent alignments.

Progressive mis-alignment

Profile alignment


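Total alignment score: $S(A) + S(B) + S(A\times B)$, where $S(A)$ and $S(B)$ are the SP scores within the two already-aligned groups and $S(A\times B)$ sums the scores of all cross pairs (one row from each group). Gaps are inserted as whole columns and $s(-,-)=0$, so $S(A)$ and $S(B)$ are fixed and only $S(A\times B)$ needs to be optimized.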
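
This decomposition of the SP score can be checked numerically. The sketch below assumes illustrative scoring values (match 1, mismatch -1, gap penalty 2, $s(-,-)=0$); the two small groups and all function names are made up for the example.

```python
from itertools import combinations, product

def s(a, b, match=1, mismatch=-1, gap=2):
    """Pairwise score with assumed illustrative values; s('-', '-') = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -gap
    return match if a == b else mismatch

def sp_score(rows):
    """SP score: all unordered row pairs, summed over columns."""
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def cross_score(rows_a, rows_b):
    """S(A x B): pairs with one row from each group, summed over columns."""
    return sum(
        s(a, b)
        for col_a, col_b in zip(zip(*rows_a), zip(*rows_b))
        for a, b in product(col_a, col_b)
    )

# Two already-aligned groups, padded to a common length by a profile alignment.
group_a = ["ACGT-", "AC-T-"]
group_b = ["A-GTA", "ACGTA"]

total = sp_score(group_a + group_b)
parts = sp_score(group_a) + sp_score(group_b) + cross_score(group_a, group_b)
```

Here `total` and `parts` agree because every pair of rows lies either within one group or across the two groups.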

CLUSTALW

  • Thompson, Higgins, Gibson (1994) (55k citations!)
  • Widely used implementation of profile-based progressive multiple alignment
  • Essentially Feng-Doolittle, but uses profile alignment.

Overview:

  1. Calculate matrix of $\binom{N}{2}$ distances between all pairs of $N$ sequences.
  2. Construct a guide tree using a clustering algorithm such as UPGMA or neighbour-joining.
  3. Progressively align at internal nodes in order of decreasing similarity using sequence-sequence, sequence-profile and profile-profile alignment.

Employs many other heuristics.

Iterative refinement

I.e. "hill climbing". Slightly change solution to improve score. Converge to local optimum.

E.g. Barton-Sternberg (1987) multiple alignment:

  1. Find the two sequences with the highest pairwise similarity and align them using standard dynamic programming alignment.
  2. Find sequence most similar to a profile of the alignment of the first two and align using profile-sequence alignment.
  3. Repeat 2 until all sequences have been included.
  4. Remove sequence $X^{(1)}$ and realign it to a profile of the remaining aligned sequences $X^{(2)},\ldots,X^{(N)}$ by profile-sequence alignment. Repeat for sequences $X^{(2)},\ldots,X^{(N)}$.
  5. Repeat previous step a fixed number of times, or until the alignment score converges to some value.

Alignment: General considerations

  • These algorithms simply try to maximize an alignment score (roughly, the number of matches)
    • Even the "best" alignment may not be the correct biological one.
  • Multiple alignments are done progressively
    • Such alignments get progressively worse as you add sequences.
    • Mistakes that occur during the alignment process are frozen in.
  • Unless the sequences are very similar you will almost certainly have to correct manually.
Is there a better alternative?

Statistics: fitting vs modeling

  • Statistical fitting of sequence variation:
    • Count frequencies of changes in real data sets
    • Build empirical statistical descriptions of the data (BLOSUM62)
    • Compare observed frequencies to well-defined null hypothesis for testing (log-odds ratio and scores)
    • Use scores in ad hoc algorithms for search and alignment (BLAST and ClustalX)
  • Probabilistic models of sequence evolution:
    • Describe a probabilistic model in terms of a process of evolution, rates of substitution, insertion and deletion.
    • Estimate parameters of the model and compare models using Bayes factors.
    • Use Bayesian inference to co-estimate alignment and evolutionary history, together with estimate uncertainties.

BAli-Phy

Suchard and Redelings, Bioinformatics (2006)

Summary

  1. Multiple sequence alignments usually scored using the sum of pairwise alignment scores. (Sum of pairs, SP.)
  2. The score can be maximized using dynamic programming, but the time scales as $O(2^Nn^N)$ and the memory as $O(n^N)$, where $n$ is the typical sequence length and $N$ is the number of sequences in the alignment.
  3. Lipman et al. improved this scaling, but the method is still impractical for all but very small alignments.
  4. Progressive alignment (e.g. Feng-Doolittle) is a very approximate method based on building up an alignment from successive pairwise alignments.
  5. It is computationally practical but has many flaws, some of which are addressed via iterative refinement.
  6. The philosophically optimal approach is a Bayesian model-based joint inference of the MSA and phylogeny.

Recommended reading

  • Chapter 6 of "Biological Sequence Analysis", Durbin et al., CUP, 1998.