Homology and alignment


Lecturer: Alexei Drummond

Sequence Homology

Definition of Homology

Homology is the existence of a shared ancestry between a pair of biological traits or sequences.

Expect homologous sequence regions to display a degree of similarity.
The statement that two sequence regions are homologous is an evolutionary hypothesis based on similarity: it is rarely possible to observe homology directly.
Homologous sequences are referred to as homologs.
Different kinds of homologs:
- Orthology: ancestral sequences separated by speciation.
- Paralogy: ancestral sequences separated by gene duplication.
- Xenology: homologs due to horizontal gene transfer.
- ...

Ring-tailed lemur beta globin

Goldfish beta globin

Soybean leghemoglobin

Pairwise Alignment

The goal of pairwise alignment

Aligning one sequence with another allows us to assess the homology between the two sequences.

Alignment allows us to break down the question of sequence similarity into a large number of smaller questions about character similarity.

Pairwise alignment is the basis of multiple sequence alignment which itself forms the basis for phylogenetic reconstruction from molecular data.

Dot Plots

A simple means of assessing pairwise homology visually.

Consider the following pair of sequences:

decided to build three ships, three Arks in space
decided to build three Arks in space

https://www.jwhitham.org/magrathea/adiff.html

Each row corresponds to a position on seq 1, each column to a position on seq 2.
Pixel $(i,j)$ is coloured if the characters at site $i$ on seq 1 and $j$ on seq 2 match.
Diagonal lines indicate runs of matching sites.
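
The dot plot rule above can be sketched in a few lines of Python. This is a minimal illustration only: the function names dot_plot and render are hypothetical, and the text rendering with "*" and "." stands in for coloured pixels.

```python
def dot_plot(seq1, seq2):
    """Return a grid where cell (i, j) is True when seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Rows follow seq 1, columns follow seq 2, as in the description above.
grid = dot_plot("three ships", "three Arks")
print(render(grid))
```

Runs of "*" along a diagonal correspond to stretches where the two sequences match site by site.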

Dynamic Programming

Method for solving combinatorial optimization problems.
Guaranteed to give optimal solution.
Generalization of "divide-and-conquer".
Relies on principle of optimality:

Sub-optimal solution of a sub-problem cannot be part of the optimal solution of the full problem.

Problems suitable for solution by dynamic programming are said to have
optimal substructure.

Principle of Optimality

Keys to efficiency

Computation is carried out from the bottom-up.
Store all solutions to sub-problems in a table.
All possible sub-problems solved exactly once, beginning with smallest sub-problems.
Work up to original problem instance.
Only optimal solutions to sub-problems are used to compute solution to problem at next level.
Don't carry out computation in recursive, top-down manner
- same sub-problems would be solved many times.
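
To illustrate why the top-down approach is wasteful, the sketch below counts how many calls a recursion without a table would make, compared with the number of table cells a bottom-up computation fills. It assumes, purely for illustration, that sub-problem (i, j) depends on (i-1, j-1), (i-1, j) and (i, j-1); the function names are hypothetical.

```python
def naive_calls(i, j):
    """Count the calls made by a top-down recursion that stores nothing."""
    if i == 0 or j == 0:
        return 1  # boundary sub-problem, answered directly
    # One call for (i, j) itself plus all calls made by its three dependencies.
    return 1 + naive_calls(i - 1, j - 1) + naive_calls(i - 1, j) + naive_calls(i, j - 1)


def table_cells(m, n):
    """Number of sub-problems solved exactly once by a bottom-up table."""
    return (m + 1) * (n + 1)


for size in (2, 4, 6, 8):
    print(size, naive_calls(size, size), table_cells(size, size))
```

The call count grows exponentially with the problem size, while the table grows only as the product of the two dimensions.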

Pairwise amino acid alignment

Sequences

            x = a c g g t s
            y = a w g c c t t

Alignment

            x' = a - c g g - t s
            y' = a w - g c c t t

Scoring Alignments

Numeric score associated with each column.
Total score given by sum of column scores.
Column types:
- Identical (+ve score)
- Conservative (+ve score)
- Non-conservative (-ve score)
- Gap (-ve score)

Alignment

            x' = a - c g g - t s
            y' = a w - g c c t t

The score matrix (or substitution matrix) $s$ contains the column score for every possible pair of characters. The score of a column containing the character pair $a,b$ is given by the matrix entry $s(a,b)$.
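
As a rough sketch of how the total score is assembled from column scores, the Python below scores the example alignment. The scoring values are assumptions made for illustration (match +2, mismatch -1, gap penalty 2), not values from the lecture, and the names simple_score and score_alignment are hypothetical.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


def score_alignment(x_aln, y_aln, s, gap_penalty):
    """Sum the column scores of two aligned strings of equal length."""
    if len(x_aln) != len(y_aln):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(x_aln, y_aln):
        if a == "-" or b == "-":
            total -= gap_penalty  # gap column
        else:
            total += s(a, b)      # match or mismatch column
    return total


# The example alignment x', y' with gaps written as '-'.
print(score_alignment("a-cgg-ts", "aw-gcctt", simple_score, gap_penalty=2))
```

A real substitution matrix such as BLOSUM would replace simple_score with a table lookup, so that conservative substitutions can score positively.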

Scoring methods

Model-based
- Log-odds scoring
Empirical
- Often used for amino acid alignments
- PAM matrices
- BLOSUM matrices
- JTT
- WAG
Different matrices used depending on the level of similarity of the sequences.
- How do you know the similarity before constructing an alignment?

Log-odds matrices

What we want to know is whether two sequences are homologous (evolutionarily related) or not, so we want an alignment score that reflects that. Theory says that if you want to compare two hypotheses, a good score is the log-odds score: the logarithm of the ratio of the likelihoods of your two hypotheses. If we assume that each aligned residue pair is statistically independent of the others (biologically dubious, but mathematically convenient), the alignment score is the sum of the individual log-odds score for each aligned residue pair.

Sean R Eddy, Nature Biotechnology, 2004

Log-odds matrices

\begin{equation*} s(a,b) = \frac{1}{\lambda}\log \frac{p_{a,b}}{f_a f_b} \end{equation*}

The numerator ($p_{ab}$) is the likelihood of the hypothesis we want to test: that these two residues are correlated because they’re homologous. Thus, $p_{ab}$ are the target frequencies: the probability that we expect to observe residues a and b aligned in homologous sequence alignments. The denominator is the likelihood of a null hypothesis: that these two residues are uncorrelated and unrelated, occurring independently.

Sean R Eddy, Nature Biotechnology, 2004
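
A small numerical sketch of the log-odds formula follows. The function name log_odds_score and the example frequencies are assumptions chosen for illustration; they are not taken from any published matrix.

```python
import math


def log_odds_score(p_ab, f_a, f_b, lam=1.0):
    """s(a,b) = (1/lambda) * log(p_ab / (f_a * f_b))."""
    return math.log(p_ab / (f_a * f_b)) / lam


# Hypothetical numbers: background frequencies of 0.25 each.
# A pair seen more often than chance (0.2 > 0.25 * 0.25) scores positively.
print(log_odds_score(0.2, 0.25, 0.25))
# A pair seen less often than chance (0.01 < 0.0625) scores negatively.
print(log_odds_score(0.01, 0.25, 0.25))
```

The sign of the score therefore says whether the pair is more or less likely under homology than under the independent null model.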

Evolutionary interpretation of match/mismatch scores

$a,b$ homologous

$d=0.1$ is roughly 90% similarity
$d$ is average number of changes per site

$a,b$ non-homologous

Jukes-Cantor substitution model

All mutations equally likely
- $a\leftrightarrow b$ occur at the same rate for all character states $a$ and $b$.
- $a,b \in \{A,C,G,T\}$ for DNA
- $a,b \in \{A,R,N,D,\ldots,W,Y,V\}$ for amino acids
All characters equally likely (equal base frequencies)
- $\{0.25, 0.25, 0.25, 0.25\}$ for DNA
- $\{0.05, \ldots, 0.05\}$ for proteins

Interpretation of match/mismatch scores (DNA)

$P_{a=b}=\frac{1}{4} + \frac{3}{4}e^{-\frac{4}{3}d}$
\begin{align*} P_{a\neq b} &= 1-P_{a=b}\\ &= \frac{3}{4} - \frac{3}{4}e^{-\frac{4}{3}d} \end{align*}

$d=0.1$ is roughly 90% similarity
$d$ is average number of changes per site

$\lim_{d\rightarrow\infty}P_{a=b}(d)=\frac{1}{4}$

$\lim_{d\rightarrow\infty}P_{a\neq b}(d)=\frac{3}{4}$

Log-odds match score

\begin{align*} s(a,a) &= \frac{1}{\lambda}\log\frac{P_{aa}(d)}{\lim_{d\rightarrow\infty}P_{aa}}\\ &=\frac{1}{\lambda}\log\frac{P_{aa}(d)}{1/4} \end{align*}

Log-odds mismatch score

\begin{align*} s(a,b) &= \frac{1}{\lambda}\log\frac{P_{ab}(d)}{\lim_{d\rightarrow\infty}P_{ab}}\\ &=\frac{1}{\lambda}\log\frac{P_{ab}(d)}{3/4} \end{align*}
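
The match and mismatch scores can be tabulated against $d$ with a short sketch. It uses the standard Jukes-Cantor form for DNA, with exponent $-\frac{4}{3}d$, under which $d=0.1$ gives a match probability of about 0.906, in line with "roughly 90% similarity". The function names and the choice $\lambda = 1$ are assumptions for illustration.

```python
import math


def p_match(d):
    """Jukes-Cantor probability that two homologous DNA sites are identical."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


def p_mismatch(d):
    """Probability that the two sites differ."""
    return 1.0 - p_match(d)


def match_score(d, lam=1.0):
    """Log-odds match score: compare P(match) with its d -> infinity limit 1/4."""
    return math.log(p_match(d) / 0.25) / lam


def mismatch_score(d, lam=1.0):
    """Log-odds mismatch score: compare P(mismatch) with its limit 3/4 (needs d > 0)."""
    return math.log(p_mismatch(d) / 0.75) / lam


for d in (0.1, 0.5, 1.0, 2.0):
    print(f"d={d}: P_match={p_match(d):.3f} "
          f"s_match={match_score(d):.3f} s_mismatch={mismatch_score(d):.3f}")
```

As $d$ grows, both scores shrink towards zero, because homologous and unrelated sequences become indistinguishable.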

Variation of match/mismatch probs with $d$

Variation of match/mismatch scores with $d$

What about gaps?

Two main approaches:

Linear score: $$\gamma(g) = -gd$$ where $g$ is the length of the gap and $d$ is the gap penalty per gapped position
Affine score: $$\gamma(g) = -d - (g-1)e$$ where $d$ is the gap-open penalty and $e$ is the gap extension penalty
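
The two gap scores can be compared directly with a minimal sketch. The function names and the penalty values used below are illustrative assumptions.

```python
def linear_gap(g, d):
    """Linear gap score: every gapped position costs d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: d for opening the gap, e for each further position."""
    return -d - (g - 1) * e


# Hypothetical penalties: linear d = 8; affine open d = 12, extend e = 2.
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, 8), affine_gap(g, 12, 2))
```

With an extension penalty smaller than the opening penalty, one long gap costs less than several short ones of the same total length.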

Pairwise Alignment Algorithms

Needleman & Wunsch algorithm

Dynamic programming algorithm for global alignment.
Needleman & Wunsch (1970), modified by Gotoh (1982)

Assumptions

Linear gap penalty $d$
Symmetric scoring matrix $s$: $$s(a,b)=s(b,a)$$ $$s(a,-)=s(-,a) = -d$$

Principle of Optimality

Given sequences:

\begin{align*} X &= (x_1, x_2, \ldots, x_m)\\ Y &= (y_1, y_2, \ldots, y_n) \end{align*}

Define $F(i,j)$ to be the score of the best alignment between

\begin{align*} &(x_1, x_2, \ldots, x_i)\\ &(y_1, y_2, \ldots, y_j)\\ \end{align*}

Dynamic Programming recurrences

Optimal alignment	$y_1, y_2, y_3, \ldots, y_j$ $x_1, x_2, x_3, \ldots, x_i$	$F(i,j)$
Comes from...	$y_1, y_2, y_3, \ldots, y_{j-1}$ $y_j$ $x_1, x_2, x_3, \ldots, x_{i-1}$ $x_i$	$F(i-1,j-1) + s(x_i,y_j)$
Or...	$y_1, y_2, y_3, \ldots, y_{j-1}$ $y_j$ $x_1, x_2, x_3, \ldots, x_{i}$ $\phantom{x_i}$	$F(i,j-1) - d$
Or...	$y_1, y_2, y_3, \ldots, y_{j}$ $\phantom{y_j}$ $x_1, x_2, x_3, \ldots, x_{i-1}$ $x_i$	$F(i-1,j) - d$

Therefore,

\begin{align*} F(i,j) = \max\left\{\begin{array}{l} F(i-1,j-1) + s(x_i,y_j)\\ F(i,j-1) - d\\ F(i-1,j) - d \end{array}\right. \end{align*}

Basis

$F(0,0) = 0$
$F(i,0) = F(i-1,0) + s(x_i,-) = -id$, the score of aligning $x_1, x_2, x_3, \ldots, x_i$ against $i$ gaps.
$F(0,j) = F(0,j-1) + s(-,y_j) = -jd$, the score of aligning $y_1, y_2, y_3, \ldots, y_j$ against $j$ gaps.
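
The recurrence and basis can be put together in a minimal Python sketch of global alignment with a linear gap penalty. The function names, the integer scoring function simple_score (+2 match, -1 mismatch) and the gap penalty of 2 are assumptions for illustration; the affine-gap modification is not covered here.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


def needleman_wunsch(x, y, s, d):
    """Global alignment; returns (score, aligned x, aligned y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # basis: prefix of x against gaps
        F[i][0] = F[i - 1][0] - d
    for j in range(1, n + 1):          # basis: prefix of y against gaps
        F[0][j] = F[0][j - 1] - d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Traceback from F(m, n) to F(0, 0), preferring the diagonal move on ties.
    x_aln, y_aln = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            x_aln.append(x[i - 1]); y_aln.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            x_aln.append(x[i - 1]); y_aln.append("-"); i -= 1
        else:
            x_aln.append("-"); y_aln.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(x_aln)), "".join(reversed(y_aln))


score, xa, ya = needleman_wunsch("acggts", "awgcctt", simple_score, 2)
print(score)
print(xa)
print(ya)
```

The traceback compares scores for exact equality, which is safe with integer scores; with floating-point scores it would be more robust to store the chosen move in a second table.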

Example

Online interactive demo

Smith-Waterman algorithm

Computes local alignment
Looks for best alignment of subsequences of X and Y, ignoring scores of regions on either side.

Principle of Optimality

Given sequences:

\begin{align*} Y &= (y_1, y_2, \ldots, y_n)\\ X &= (x_1, x_2, \ldots, x_m) \end{align*}

Define $F(i,j)$ to be the score of the best suffix alignment between

$(y_s, y_{s+1}, \ldots, y_j)$ where $s\leq j$

and

$(x_r, x_{r+1}, \ldots, x_i)$ where $r\leq i$

Includes empty alignment with score $0$.

Dynamic Programming recurrences

Optimal alignment	$y_s, y_{s+1}, y_{s+2}, \ldots, y_j$ $x_r, x_{r+1}, x_{r+2}, \ldots, x_i$	$F(i,j)$
Comes from...	$y_s, y_{s+1}, \ldots, y_{j-1}$ $y_j$ $x_r, x_{r+1}, \ldots, x_{i-1}$ $x_i$	$F(i-1,j-1) + s(x_i,y_j)$
Or...	$y_s, y_{s+1}, \ldots, y_{j-1}$ $y_j$ $x_r, x_{r+1}, \ldots, x_{i}$ $\phantom{x_i}$	$F(i,j-1) - d$
Or...	$y_s, y_{s+1}, \ldots, y_{j}$ $\phantom{y_j}$ $x_r, x_{r+1}, \ldots, x_{i-1}$ $x_i$	$F(i-1,j) - d$
Or...	$y_s, y_{s+1}, y_{s+2}, \ldots, y_{j}$ $x_r, x_{r+1}, x_{r+2}, \ldots, x_{i}$	$0$

\begin{align*} F(i,j) = \max\left\{\begin{array}{l} F(i-1,j-1) + s(x_i,y_j)\\ F(i,j-1) - d\\ F(i-1,j) - d\\ 0 \end{array}\right. \implies F(i,0) = F(0,j) = 0 \end{align*}
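
A minimal Python sketch of this local alignment recurrence follows. The function names, the integer scoring function simple_score (+2 match, -1 mismatch) and the gap penalty of 2 are assumptions for illustration. The traceback starts at the highest-scoring cell and stops at the first cell with score 0.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


def smith_waterman(x, y, s, d):
    """Local alignment; returns (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # F(i,0) = F(0,j) = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached.
    x_aln, y_aln = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            x_aln.append(x[i - 1]); y_aln.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            x_aln.append(x[i - 1]); y_aln.append("-"); i -= 1
        else:
            x_aln.append("-"); y_aln.append(y[j - 1]); j -= 1
    return best, "".join(reversed(x_aln)), "".join(reversed(y_aln))


print(smith_waterman("xxabcxx", "yyabcyy", simple_score, 2))
```

Compared with the global algorithm, the only changes are the extra 0 in the maximum, the zero boundary, and the choice of where the traceback starts and stops.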

Example

Durbin, Eddy, Krogh, Mitchison, "Biological sequence analysis", Cambridge Uni Press, 1998

Repeated matches

For long sequences we may be interested in all local alignments
with significant score ($>$ threshold $T$)
e.g. copies of a repeated domain or motif in a protein.

$X$ is the (short) sequence containing the motif

$Y$ is the (long) target sequence

Method is asymmetric.

Principle of Optimality

Given sequences:

\begin{align*} Y &= (y_1, y_2, \ldots, y_n)\\ X &= (x_1, x_2, \ldots, x_m) \end{align*}

Define $F(i,j)$ (for $i\geq 1$) to be the best sum of match scores in

$(y_1, y_2, y_3, \ldots, y_j)$

and

$(x_1, x_2, x_3, \ldots, x_i)$

assuming $y_j$ is in a matched region and match ends in $x_i$ or $y_j$.

$F(i,j)$ has a different meaning here!

Ends of matches

$F(0,0) = 0$
$F(0,j)$ is the best sum of completed match scores to $(y_1, y_2,\ldots, y_j)$ assuming that $y_j$ is not in a matched region.

\begin{align*} F(0,j) = \max\left\{\begin{array}{l} F(0,j-1)\\ F(i,j-1) - T,~~i=1,\ldots,m \end{array}\right. \end{align*}

Row $0$ therefore marks unmatched regions and ends of matches in $Y$.

General recurrence

\begin{align*} F(i,j) = \max\left\{\begin{array}{l} F(0,j)\\ F(i-1,j-1) + s(x_i, y_j)\\ F(i,j-1) - d\\ F(i-1,j) - d \end{array}\right. \end{align*}
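
The two recurrences can be combined in a short Python sketch that returns the total score over all matches, where each match contributes its score minus $T$. Several details are assumptions made for illustration rather than statements from the lecture: the function names, the scoring values, the zero initialisation of $F(i,0)$, and the extra final step that closes a match running to the end of $Y$. No traceback is included.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


def repeated_matches(x, y, s, d, T):
    """x: short motif (rows i), y: long target (columns j). Returns (total, F)."""
    m, n = len(x), len(y)
    # Assumption: column 0 is all zeros (nothing of y has been matched yet).
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Row 0: stay unmatched, or end a match that scored above threshold T.
        ended = max((F[i][j - 1] for i in range(1, m + 1)), default=float("-inf"))
        F[0][j] = max(F[0][j - 1], ended - T)
        for i in range(1, m + 1):
            F[i][j] = max(F[0][j],                                  # start a new match
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # extend diagonally
                          F[i][j - 1] - d,                          # gap in x
                          F[i - 1][j] - d)                          # gap in y
    # Assumed final step: a match may end at the last position of y.
    ended = max((F[i][n] for i in range(1, m + 1)), default=float("-inf"))
    total = max(F[0][n], ended - T)
    return total, F


total, _ = repeated_matches("abc", "abcxxabc", simple_score, d=2, T=3)
print(total)
```

In this example the motif occurs twice in the target, and each occurrence scores 6, so each contributes 6 - 3 to the total.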

Example

Durbin, Eddy, Krogh, Mitchison, "Biological sequence analysis", Cambridge Uni Press, 1998

Overlap matches

Don't penalize overhanging ends: $F(i,0)=F(0,j)=0$.

Otherwise,

\begin{align*} F(i,j) = \max\left\{\begin{array}{l} F(i-1,j-1) + s(x_i,y_j)\\ F(i,j-1)-d\\ F(i-1,j)-d \end{array}\right. \end{align*}
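
A minimal Python sketch of overlap matching follows. It assumes that the best overlap ends somewhere on the bottom row or right-hand column of the table and that the traceback stops on reaching the top row or left-hand column, so that overhangs at both ends are free. The function names and scoring values are illustrative assumptions.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


def overlap_alignment(x, y, s, d):
    """Overlap match; returns (score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # F(i,0) = F(0,j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # The alignment must end on the right-hand column or bottom row.
    candidates = [(F[i][n], i, n) for i in range(m + 1)]
    candidates += [(F[m][j], m, j) for j in range(n + 1)]
    best, i, j = max(candidates)
    # Traceback until the top row or left-hand column is reached.
    x_aln, y_aln = [], []
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            x_aln.append(x[i - 1]); y_aln.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            x_aln.append(x[i - 1]); y_aln.append("-"); i -= 1
        else:
            x_aln.append("-"); y_aln.append(y[j - 1]); j -= 1
    return best, "".join(reversed(x_aln)), "".join(reversed(y_aln))


# The end of x overlaps the start of y.
print(overlap_alignment("xxxabc", "abcyyy", simple_score, 2))
```

Unlike local alignment, cells inside the table may go negative here, because only the overhanging ends are exempt from penalties.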

Example

Durbin, Eddy, Krogh, Mitchison, "Biological sequence analysis", Cambridge Uni Press, 1998