$$P(X_{i}|X_{i-1},X_{i-2},\ldots,X_0) = P(X_i|X_{i-1})$$
$$P(X_{i+1}=y|X_{i}=x)=P(X_i=y|X_{i-1}=x)$$
Homogeneous Markov chains can be represented as finite state machines.
Arrows can be labeled with transition probabilities to complete the description.
Including the begin and end states explicitly allows the length of the chain to be modelled.
If the transition probabilities to the end state are uniform, i.e. $M_{x\mathcal{E}}=p~~\forall x$, then the length has a geometric distribution:
$$P(L=l)=P(X_{l+1}=\mathcal{E})=(1-p)^{l}p$$

The most probable path $\vec{Y}$ through the hidden state space is:
\begin{align*} \vec{Y}^* &= \underset{\vec{Y}}{\text{argmax}}~P(\vec{Y}|\vec{X})\\ & = \underset{\vec{Y}}{\text{argmax}}~\frac{P(\vec{Y},\vec{X})}{P(\vec{X})}\\ & = \underset{\vec{Y}}{\text{argmax}}~P(\vec{Y},\vec{X}) \end{align*}

Define $v_k(i)$ as the joint probability of the most probable (sub)path ending with observation number $i$ in state $k$. If this is known for all states $k$, one can compute: $$v_l(i+1) = e_l(x_{i+1})\max_{k}(v_k(i)M_{kl})$$
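A minimal sketch of this recursion (the Viterbi algorithm) in Python; the two-state transition matrix $M$, emission table $e$, initial distribution, and observation sequence are made-up toy values:

```python
import numpy as np

def viterbi(obs, init, M, E):
    """Most probable hidden path via v_l(i+1) = e_l(x_{i+1}) * max_k v_k(i) M_kl."""
    L, n = len(obs), M.shape[0]
    v = np.zeros((L, n))              # v[i, k] = v_k(i)
    ptr = np.zeros((L, n), dtype=int)
    v[0] = init * E[:, obs[0]]
    for i in range(1, L):
        for l in range(n):
            scores = v[i - 1] * M[:, l]
            ptr[i, l] = scores.argmax()
            v[i, l] = E[l, obs[i]] * scores.max()
    path = [int(v[-1].argmax())]      # traceback from the best final state
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return path[::-1]

# toy two-state chain ("fair" vs. "loaded" coin; the numbers are illustrative)
M = np.array([[0.9, 0.1],
              [0.1, 0.9]])           # M[k, l] = P(next state l | state k)
E = np.array([[0.5, 0.5],
              [0.1, 0.9]])           # E[k, x] = e_k(x)
init = np.array([0.5, 0.5])
print(viterbi([1, 1, 1, 0, 0], init, M, E))   # [0, 0, 0, 0, 0]
```

With these persistent self-transitions, switching states is expensive, so a short run of three 1s is not enough to pull the path into the "loaded" state.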
Optimal substructure!
The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word "programming". I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.

Richard Bellman, "Eye of the Hurricane: An Autobiography", 1984
(Optimal path through network whiteboard example.)
This algorithm computes the probability of the sequence of observations, $P(\vec{X})$, given the model.
It is trivial to derive.
Define $f_k(i) = P(X_1,\ldots,X_i,Y_i=k)$. Then $$f_l(i+1) = e_l(x_{i+1})\sum_{k}f_k(i)M_{kl}$$ and $P(\vec{X})=\sum_k f_k(L)$ (or $\sum_k f_k(L)M_{k\mathcal{E}}$ with an explicit end state).
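A sketch of the forward algorithm in Python: $f$ is filled in left to right, and $P(\vec{X})$ is the sum over the final column. The two-state parameters below are invented toy values:

```python
import numpy as np

def forward(obs, init, M, E):
    """f[i, k] = f_k(i) = P(x_1..x_i, Y_i = k)."""
    L, n = len(obs), M.shape[0]
    f = np.zeros((L, n))
    f[0] = init * E[:, obs[0]]
    for i in range(1, L):
        f[i] = E[:, obs[i]] * (f[i - 1] @ M)   # sum over previous states k
    return f

# toy two-state chain; rows of E are states, columns are symbols
M = np.array([[0.9, 0.1], [0.1, 0.9]])
E = np.array([[0.5, 0.5], [0.1, 0.9]])
init = np.array([0.5, 0.5])
f = forward([1, 1, 0], init, M, E)
print(f[-1].sum())   # P(X), summed over the final hidden states
```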
It is interesting to know the posterior probability of hidden states at each site.
This is NOT the posterior distribution over paths. Stringing together the most probable states may not even produce a valid path!
Start by considering: \begin{align*} P(\vec{X}, Y_i=k) &= P(X_1,\ldots,X_i,Y_i=k)P(X_{i+1},\ldots,X_L|X_1,\ldots,X_{i},Y_i=k)\\ & = P(X_1,\ldots,X_i,Y_i=k)P(X_{i+1},\ldots,X_L|Y_i=k)\\ & \equiv f_k(i) b_k(i) \end{align*}
Use a recursion similar to that of the forward algorithm to find $b_k(i)$.
Then $$P(Y_i=k|\vec{X}) = \frac{f_k(i)b_k(i)}{P(\vec{X})}$$ where $P(\vec{X})$ is computed using either the forward or backward algorithms.
This algorithm implements the recursion required to compute $b_k(i)$.
Define $b_k(i) = P(X_{i+1},\ldots,X_L|Y_i=k)$. Then $$b_k(i) = \sum_{l}M_{kl}e_l(x_{i+1})b_l(i+1)$$ with $b_k(L)=1$ (or $M_{k\mathcal{E}}$ with an explicit end state).
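Putting the two passes together gives the posterior state probabilities $P(Y_i=k|\vec{X}) = f_k(i)b_k(i)/P(\vec{X})$. A self-contained sketch, again with invented two-state toy parameters:

```python
import numpy as np

def forward(obs, init, M, E):
    """f[i, k] = f_k(i), filled left to right."""
    f = np.zeros((len(obs), M.shape[0]))
    f[0] = init * E[:, obs[0]]
    for i in range(1, len(obs)):
        f[i] = E[:, obs[i]] * (f[i - 1] @ M)
    return f

def backward(obs, M, E):
    """b[i, k] = b_k(i), filled right to left; b_k(L) = 1 with no explicit end state."""
    b = np.ones((len(obs), M.shape[0]))
    for i in range(len(obs) - 2, -1, -1):
        b[i] = M @ (E[:, obs[i + 1]] * b[i + 1])
    return b

M = np.array([[0.9, 0.1], [0.1, 0.9]])
E = np.array([[0.5, 0.5], [0.1, 0.9]])
init = np.array([0.5, 0.5])
obs = [1, 1, 0]
f, b = forward(obs, init, M, E), backward(obs, M, E)
posterior = f * b / f[-1].sum()   # P(Y_i = k | X)
print(posterior.sum(axis=1))     # each site's posterior sums to 1
```

Since $\sum_k f_k(i)b_k(i) = P(\vec{X})$ at every site $i$, the rows of `posterior` all sum to one, which is a useful sanity check.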
When the hidden state trajectory $\vec{Y}$ is known, one can use the following estimators for the transition and emission probabilities: $$M_{kl} = \frac{A_{kl}}{\sum_{l'}A_{kl'}}$$ where $A_{kl}$ is the number of transitions from state $k$ to state $l$ on the trajectory.
Similarly, $$e_{k}(x) = \frac{E_{k}(x)}{\sum_{x'}E_{k}(x')}$$ where $E_k(x)$ is the number of emissions of $x$ from state $k$.
These are maximum likelihood (point) estimates. Pseudo-count offsets can be used to avoid problems due to overfitting and insufficient data.
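A sketch of these counting estimators with a pseudo-count offset; the path and observation sequence are made up, and states/symbols are assumed to be integers $0,\ldots,n-1$:

```python
import numpy as np

def estimate(path, obs, n_states, n_symbols, pseudo=1.0):
    """M_kl = A_kl / sum_l' A_kl' and e_k(x) = E_k(x) / sum_x' E_k(x'),
    with a pseudo-count added to every cell before normalising."""
    A = np.full((n_states, n_states), pseudo)
    Ecnt = np.full((n_states, n_symbols), pseudo)
    for k, l in zip(path, path[1:]):
        A[k, l] += 1                 # transition counts A_kl
    for k, x in zip(path, obs):
        Ecnt[k, x] += 1              # emission counts E_k(x)
    return (A / A.sum(axis=1, keepdims=True),
            Ecnt / Ecnt.sum(axis=1, keepdims=True))

# invented trajectory and observations for a 2-state, 2-symbol model
M_hat, e_hat = estimate([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], 2, 2)
print(M_hat)
```

The pseudo-counts guarantee every row normalises to a proper distribution even for transitions or emissions never seen on the trajectory.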
The following FSM describes an HMM that generates pairs of aligned sequences:
Use a single HMM to model an entire multiple sequence alignment:
VGA--HAGEY V----NVDEV VEA--DVAGH IAGADNGAGV
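As a first step toward such a model, one can tally residue counts per alignment column; normalising these (with pseudo-counts) would give match-state emission estimates. A sketch over the alignment above, with the simplification that every column is treated the same, whereas a full profile HMM would assign mostly-gap columns to insert states:

```python
from collections import Counter

aln = ["VGA--HAGEY",
       "V----NVDEV",
       "VEA--DVAGH",
       "IAGADNGAGV"]

# residue counts per column, ignoring gap characters
counts = [Counter(c for c in col if c != '-') for col in zip(*aln)]
print(counts[0])   # Counter({'V': 3, 'I': 1})
```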