Aligning one sequence with another allows us to assess the homology between the two sequences.
Alignment allows us to break down the question of sequence similarity into a large number of smaller questions about character similarity.
Pairwise alignment is the basis of multiple sequence alignment, which in turn forms the basis for phylogenetic reconstruction from molecular data.
A simple means of assessing pairwise homology visually.
Consider the following pair of sequences:
A sub-optimal solution of a sub-problem cannot be part of the optimal solution of the full problem.
x = a c g g t s
y = a w g c c t t

x' = a - c g g - t s
y' = a w - g c c t t
The score matrix (or substitution matrix) $s$ contains the column scores of every possible character pair. The score of a column containing the character pair $a,b$ is given by the matrix entry $s(a,b)$.
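As a rough illustration of column scoring, the sketch below sums $s(a,b)$ over the columns of the gapped pair $x'$, $y'$ shown earlier. The match and mismatch values, the linear gap penalty `d`, and the function names are assumptions chosen for the example, not values given in the text.

```python
def column_score(a, b, match=1, mismatch=-1, d=2):
    """Score one alignment column; '-' marks a gap.

    match, mismatch and d are hypothetical illustration values.
    """
    if a == "-" or b == "-":
        return -d  # a gap column costs the (assumed) linear penalty d
    return match if a == b else mismatch  # stands in for s(a, b)


def alignment_score(x_aln, y_aln):
    """Sum the column scores of two equal-length gapped sequences."""
    if len(x_aln) != len(y_aln):
        raise ValueError("aligned sequences must have equal length")
    return sum(column_score(a, b) for a, b in zip(x_aln, y_aln))


x_aln = "a-cgg-ts"  # x' from the example above
y_aln = "aw-gcctt"  # y' from the example above
print(alignment_score(x_aln, y_aln))
```

With a real substitution matrix, `column_score` would look up $s(a,b)$ in the matrix instead of comparing characters.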
What we want to know is whether two sequences are homologous (evolutionarily related) or not, so we want an alignment score that reflects that. Theory says that if you want to compare two hypotheses, a good score is the log-odds score: the logarithm of the ratio of the likelihoods of your two hypotheses. If we assume that each aligned residue pair is statistically independent of the others (biologically dubious, but mathematically convenient), the alignment score is the sum of the individual log-odds score for each aligned residue pair.
Sean R Eddy, Nature Biotechnology, 2004
The numerator ($p_{ab}$) is the likelihood of the hypothesis we want to test: that these two residues are correlated because they’re homologous. Thus, $p_{ab}$ are the target frequencies: the probability that we expect to observe residues a and b aligned in homologous sequence alignments. The denominator is the likelihood of a null hypothesis: that these two residues are uncorrelated and unrelated, occurring independently.
Sean R Eddy, Nature Biotechnology, 2004
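Written out, the log-odds score described in the quotations is $s(a,b) = \log \frac{p_{ab}}{f_a f_b}$, where $f_a$ and $f_b$ are the background frequencies of the two residues. The sketch below computes it for made-up numbers. The base-2 logarithm, the absence of a scaling factor, and all frequency values are assumptions for illustration only.

```python
import math


def log_odds(p_ab, f_a, f_b):
    """Log-odds score in bits: log2 of target frequency over null expectation.

    p_ab is the target frequency of the aligned pair; f_a and f_b are the
    background frequencies of the two residues (hypothetical inputs).
    """
    return math.log2(p_ab / (f_a * f_b))


# A pair seen more often in homologous alignments than chance predicts
# gets a positive score ...
print(log_odds(p_ab=0.02, f_a=0.05, f_b=0.05))
# ... and a pair seen less often than chance gets a negative score.
print(log_odds(p_ab=0.001, f_a=0.05, f_b=0.05))
```

A score of zero means the pair occurs exactly as often as expected under the null hypothesis.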
Two main approaches:
Given sequences:
\begin{align*} X &= (x_1, x_2, \ldots, x_m)\\ Y &= (y_1, y_2, \ldots, y_n) \end{align*}
Define $F(i,j)$ to be the score of the best alignment between
\begin{align*} &(x_1, x_2, \ldots, x_i)\\ &(y_1, y_2, \ldots, y_j) \end{align*}

| Case | Alignment | Score |
| --- | --- | --- |
| Optimal alignment | $\begin{array}{l} y_1, y_2, y_3, \ldots, y_j \\ x_1, x_2, x_3, \ldots, x_i \end{array}$ | $F(i,j)$ |
| Comes from... | $\begin{array}{lc} y_1, y_2, y_3, \ldots, y_{j-1} & y_j \\ x_1, x_2, x_3, \ldots, x_{i-1} & x_i \end{array}$ | $F(i-1,j-1) + s(x_i,y_j)$ |
| Or... | $\begin{array}{lc} y_1, y_2, y_3, \ldots, y_{j-1} & y_j \\ x_1, x_2, x_3, \ldots, x_{i} & - \end{array}$ | $F(i,j-1) - d$ |
| Or... | $\begin{array}{lc} y_1, y_2, y_3, \ldots, y_{j} & - \\ x_1, x_2, x_3, \ldots, x_{i-1} & x_i \end{array}$ | $F(i-1,j) - d$ |
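Therefore,
\begin{align*} F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j)\\ F(i,j-1) - d\\ F(i-1,j) - d \end{cases} \end{align*}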
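The three cases for global alignment can be turned into a table-filling procedure. The following is a minimal sketch, not a reference implementation: the boundary values $F(i,0) = -id$ and $F(0,j) = -jd$ are the usual convention but are an assumption here, as are the scoring function `simple_score`, the default `d=2`, and the function names. Traceback of the alignment itself is omitted.

```python
def simple_score(a, b, match=1, mismatch=-1):
    """Hypothetical substitution score s(a, b); values are not from the text."""
    return match if a == b else mismatch


def global_alignment_table(x, y, s=simple_score, d=2):
    """Return F, where F[i][j] is the best global alignment score of
    x[:i] against y[:j] under a linear gap penalty d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions: a prefix aligned against nothing is all gaps.
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i][j - 1] - d,                          # y_j aligned with a gap
                F[i - 1][j] - d,                          # x_i aligned with a gap
            )
    return F


F = global_alignment_table("acggts", "awgcctt")
print(F[-1][-1])  # F(m, n): score of the best global alignment
```

Each cell depends only on its three neighbours above and to the left, so the table is filled in $O(mn)$ time.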
Given sequences:
\begin{align*} Y &= (y_1, y_2, \ldots, y_n)\\ X &= (x_1, x_2, \ldots, x_m) \end{align*}
Define $F(i,j)$ to be the score of the best suffix alignment between
$(y_s, y_{s+1}, \ldots, y_j)$ where $s\leq j$ and
$(x_r, x_{r+1}, \ldots, x_i)$ where $r\leq i$.
This includes the empty alignment with score $0$.
| Case | Alignment | Score |
| --- | --- | --- |
| Optimal alignment | $\begin{array}{l} y_s, y_{s+1}, y_{s+2}, \ldots, y_j \\ x_r, x_{r+1}, x_{r+2}, \ldots, x_i \end{array}$ | $F(i,j)$ |
| Comes from... | $\begin{array}{lc} y_s, y_{s+1}, \ldots, y_{j-1} & y_j \\ x_r, x_{r+1}, \ldots, x_{i-1} & x_i \end{array}$ | $F(i-1,j-1) + s(x_i,y_j)$ |
| Or... | $\begin{array}{lc} y_s, y_{s+1}, \ldots, y_{j-1} & y_j \\ x_r, x_{r+1}, \ldots, x_{i} & - \end{array}$ | $F(i,j-1) - d$ |
| Or... | $\begin{array}{lc} y_s, y_{s+1}, \ldots, y_{j} & - \\ x_r, x_{r+1}, \ldots, x_{i-1} & x_i \end{array}$ | $F(i-1,j) - d$ |
| Or... | $\begin{array}{l} y_s, y_{s+1}, y_{s+2}, \ldots, y_{j} \\ x_r, x_{r+1}, x_{r+2}, \ldots, x_{i} \end{array}$ | $0$ |
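Taking the maximum over the four cases above gives a local alignment procedure. The sketch below fills the table and reports the highest-scoring cell, which marks where the best local alignment ends. The zero boundary values, the scoring function `simple_score`, the default `d=2`, and the function names are assumptions made for this example. Traceback is omitted.

```python
def simple_score(a, b, match=1, mismatch=-1):
    """Hypothetical substitution score s(a, b); values are not from the text."""
    return match if a == b else mismatch


def local_alignment(x, y, s=simple_score, d=2):
    """Return (best score, (i, j) of the best cell, table F) for the
    best-scoring local alignment of x and y with linear gap penalty d."""
    m, n = len(x), len(y)
    # Assumed boundaries: row 0 and column 0 stay at 0 (empty alignment).
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,                                        # start a new alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i][j - 1] - d,                          # y_j aligned with a gap
                F[i - 1][j] - d,                          # x_i aligned with a gap
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    return best, best_pos, F


best, end, _ = local_alignment("ttacgtt", "ggacgcc")
print(best, end)  # best local score and the cell where that alignment ends
```

Because every cell can fall back to $0$, the best score is never negative, and a traceback from the best cell would stop at the first cell with value $0$.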
$X$ is the (short) sequence containing the motif
$Y$ is the (long) target sequence
Given sequences:
\begin{align*} Y &= (y_1, y_2, \ldots, y_n)\\ X &= (x_1, x_2, \ldots, x_m) \end{align*}
Define $F(i,j)$ (for $i\geq 1$) to be the best sum of match scores in
$(y_1, y_2, y_3, \ldots, y_j)$ and
$(x_1, x_2, x_3, \ldots, x_i)$,
assuming $y_j$ is in a matched region and the match ends in $x_i$ or $y_j$.
$F(i,j)$ has a different meaning here!
Row $0$ therefore marks unmatched regions and ends of matches in $Y$.
Don't penalize overhanging ends: $F(i,0)=F(0,j)=0$.
Otherwise,