$$P(X_{i}|X_{i-1},X_{i-2},\ldots,X_0) = P(X_i|X_{i-1})$$
$$P(X_{i+1}=y|X_{i}=x)=P(X_i=y|X_{i-1}=x)$$
Homogeneous Markov chains can be represented as finite state machines.
Arrows can be labeled with transition probabilities to complete the description.
Including the begin and end states explicitly allows the length of the chain to be modelled.
If the transition probabilities to the end state are uniform, i.e. $M_{x\mathcal{E}}=p~~\forall x$, then the length has a geometric distribution:
$$P(L=l)=P(X_{l+1}=\mathcal{E})=(1-p)^{l}p$$

The most probable path $\vec{Y}$ through the hidden state space is:
\begin{align*} \vec{Y}^* &= \underset{\vec{Y}}{\text{argmax}}~P(\vec{Y}|\vec{X})\\ & = \underset{\vec{Y}}{\text{argmax}}~\frac{P(\vec{Y},\vec{X})}{P(\vec{X})}\\ & = \underset{\vec{Y}}{\text{argmax}}~P(\vec{Y},\vec{X}) \end{align*}

Define $v_k(i)$ as the joint probability of the most probable (sub)path ending with observation number $i$ in state $k$. If this is known for all states $k$, one can compute: $$v_l(i+1) = e_l(x_{i+1})\max_{k}(v_k(i)M_{kl})$$
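A minimal sketch of this recursion (the Viterbi algorithm) in Python; the two-state transition matrix $M$, emission table $e$, initial distribution, and observation sequence are made-up toy values:

```python
import numpy as np

def viterbi(obs, init, M, E):
    """Most probable hidden path via v_l(i+1) = e_l(x_{i+1}) * max_k v_k(i) M_kl."""
    L, n = len(obs), M.shape[0]
    v = np.zeros((L, n))              # v[i, k] = v_k(i)
    ptr = np.zeros((L, n), dtype=int)
    v[0] = init * E[:, obs[0]]
    for i in range(1, L):
        for l in range(n):
            scores = v[i - 1] * M[:, l]
            ptr[i, l] = scores.argmax()
            v[i, l] = E[l, obs[i]] * scores.max()
    path = [int(v[-1].argmax())]      # traceback from the best final state
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return path[::-1]

# toy two-state chain ("fair" vs. "loaded" coin; the numbers are illustrative)
M = np.array([[0.9, 0.1],
              [0.1, 0.9]])           # M[k, l] = P(next state l | state k)
E = np.array([[0.5, 0.5],
              [0.1, 0.9]])           # E[k, x] = e_k(x)
init = np.array([0.5, 0.5])
print(viterbi([1, 1, 1, 0, 0], init, M, E))   # [0, 0, 0, 0, 0]
```

With these persistent self-transitions, switching states is expensive, so a short run of three 1s is not enough to pull the path into the "loaded" state.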
Optimal substructure!
The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word "programming". I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.

Richard Bellman, "Eye of the Hurricane: An Autobiography", 1984
(Optimal path through network whiteboard example.)
This algorithm computes the probability of the sequence of observations, $P(\vec{X})$, given the model.
It is trivial to derive.
Define $f_k(i) = P(X_1,\ldots,X_i,Y_i=k)$. Then $$f_l(i+1) = e_l(x_{i+1})\sum_{k}f_k(i)M_{kl}$$ and $P(\vec{X})=\sum_k f_k(L)$ (or $\sum_k f_k(L)M_{k\mathcal{E}}$ with an explicit end state).
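A sketch of the forward algorithm in Python: $f$ is filled in left to right, and $P(\vec{X})$ is the sum over the final column. The two-state parameters below are invented toy values:

```python
import numpy as np

def forward(obs, init, M, E):
    """f[i, k] = f_k(i) = P(x_1..x_i, Y_i = k)."""
    L, n = len(obs), M.shape[0]
    f = np.zeros((L, n))
    f[0] = init * E[:, obs[0]]
    for i in range(1, L):
        f[i] = E[:, obs[i]] * (f[i - 1] @ M)   # sum over previous states k
    return f

# toy two-state chain; rows of E are states, columns are symbols
M = np.array([[0.9, 0.1], [0.1, 0.9]])
E = np.array([[0.5, 0.5], [0.1, 0.9]])
init = np.array([0.5, 0.5])
f = forward([1, 1, 0], init, M, E)
print(f[-1].sum())   # P(X), summed over the final hidden states
```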
It is interesting to know the posterior probability of hidden states at each site.
This is NOT the posterior distribution over paths. Stringing together the most probable states may not even produce a valid path!
Start by considering: \begin{align*} P(\vec{X}, Y_i=k) &= P(X_1,\ldots,X_i,Y_i=k)P(X_{i+1},\ldots,X_L|X_1,\ldots,X_{i},Y_i=k)\\ & = P(X_1,\ldots,X_i,Y_i=k)P(X_{i+1},\ldots,X_L|Y_i=k)\\ & \equiv f_k(i) b_k(i) \end{align*}
Use a recursion similar to that of the forward algorithm to find $b_k(i)$.
Then $$P(Y_i=k|\vec{X}) = \frac{f_k(i)b_k(i)}{P(\vec{X})}$$ where $P(\vec{X})$ is computed using either the forward or backward algorithms.
This algorithm implements the recursion required to compute $b_k(i)$.
Define $b_k(i) = P(X_{i+1},\ldots,X_L|Y_i=k)$. Then $$b_k(i) = \sum_{l}M_{kl}e_l(x_{i+1})b_l(i+1)$$ with $b_k(L)=1$ (or $M_{k\mathcal{E}}$ with an explicit end state).
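Putting the two passes together gives the posterior state probabilities $P(Y_i=k|\vec{X}) = f_k(i)b_k(i)/P(\vec{X})$. A self-contained sketch, again with invented two-state toy parameters:

```python
import numpy as np

def forward(obs, init, M, E):
    """f[i, k] = f_k(i), filled left to right."""
    f = np.zeros((len(obs), M.shape[0]))
    f[0] = init * E[:, obs[0]]
    for i in range(1, len(obs)):
        f[i] = E[:, obs[i]] * (f[i - 1] @ M)
    return f

def backward(obs, M, E):
    """b[i, k] = b_k(i), filled right to left; b_k(L) = 1 with no explicit end state."""
    b = np.ones((len(obs), M.shape[0]))
    for i in range(len(obs) - 2, -1, -1):
        b[i] = M @ (E[:, obs[i + 1]] * b[i + 1])
    return b

M = np.array([[0.9, 0.1], [0.1, 0.9]])
E = np.array([[0.5, 0.5], [0.1, 0.9]])
init = np.array([0.5, 0.5])
obs = [1, 1, 0]
f, b = forward(obs, init, M, E), backward(obs, M, E)
posterior = f * b / f[-1].sum()   # P(Y_i = k | X)
print(posterior.sum(axis=1))     # each site's posterior sums to 1
```

Since $\sum_k f_k(i)b_k(i) = P(\vec{X})$ at every site $i$, the rows of `posterior` all sum to one, which is a useful sanity check.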
When the hidden state trajectory $\vec{Y}$ is known, one can use the following estimators for the transition and emission probabilities: $$M_{kl} = \frac{A_{kl}}{\sum_{l'}A_{kl'}}$$ where $A_{kl}$ is the number of transitions from state $k$ to state $l$ on the trajectory.
Similarly, $$e_{k}(x) = \frac{E_{k}(x)}{\sum_{x'}E_{k}(x')}$$ where $E_k(x)$ is the number of emissions of $x$ from state $k$.
These are maximum likelihood (point) estimates. Pseudo-count offsets can be used to avoid problems due to overfitting and insufficient data.
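A sketch of these counting estimators with a pseudo-count offset; the path and observation sequence are made up, and states/symbols are assumed to be integers $0,\ldots,n-1$:

```python
import numpy as np

def estimate(path, obs, n_states, n_symbols, pseudo=1.0):
    """M_kl = A_kl / sum_l' A_kl' and e_k(x) = E_k(x) / sum_x' E_k(x'),
    with a pseudo-count added to every cell before normalising."""
    A = np.full((n_states, n_states), pseudo)
    Ecnt = np.full((n_states, n_symbols), pseudo)
    for k, l in zip(path, path[1:]):
        A[k, l] += 1                 # transition counts A_kl
    for k, x in zip(path, obs):
        Ecnt[k, x] += 1              # emission counts E_k(x)
    return (A / A.sum(axis=1, keepdims=True),
            Ecnt / Ecnt.sum(axis=1, keepdims=True))

# invented trajectory and observations for a 2-state, 2-symbol model
M_hat, e_hat = estimate([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], 2, 2)
print(M_hat)
```

The pseudo-counts guarantee every row normalises to a proper distribution even for transitions or emissions never seen on the trajectory.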
The following FSM describes an HMM that generates pairs of aligned sequences:
Use a single HMM to model an entire multiple sequence alignment:
VGA--HAGEY V----NVDEV VEA--DVAGH IAGADNGAGV
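As a first step toward such a model, one can tally residue counts per alignment column; normalising these (with pseudo-counts) would give match-state emission estimates. A sketch over the alignment above, with the simplification that every column is treated the same, whereas a full profile HMM would assign mostly-gap columns to insert states:

```python
from collections import Counter

aln = ["VGA--HAGEY",
       "V----NVDEV",
       "VEA--DVAGH",
       "IAGADNGAGV"]

# residue counts per column, ignoring gap characters
counts = [Counter(c for c in col if c != '-') for col in zip(*aln)]
print(counts[0])   # Counter({'V': 3, 'I': 1})
```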