Model-based Phylogenetic Inference
Lecturer: Alexei Drummond
Modelling neutral sequence evolution
Why use models?
We need a model to relate what we observe (data) to what we want to
know (hypotheses and parameters).
Inference is not possible without a model!
We almost invariably choose
probabilistic models for molecular evolution, because we don't
know enough about mutation to construct a deterministic model.
Modelling genetic change
- Given two or more aligned nucleotide or amino acid sequences,
usually the first goal is to calculate some measure of sequence
similarity (or conversely distance).
- The easiest distance to compute is the p-distance: the number of
differences between two aligned sequences relative to their length.
- The p-distance is the Hamming distance normalized by the
length of the sequence. Therefore it is the proportion of positions
at which the sites differ.
- The p-distance can also be considered the probability that the two
sequences differ at a random site.
Modelling genetic change
p-distance
$p$-distance = 3/7 $\simeq 0.43$.
- Proportion of differences between two sequences.
- Usually underestimates the true distance: genetic (or evolutionary) distance $d$.
Multiple, parallel and back-substitutions
Relationship between $p$-distance and genetic distance
Continuous-time Markov chains (CTMCs)
- CTMCs are the continuous-time generalization of Markov chains
- state $X(t)$ is a function of a continuous time parameter.
- Obey the Chapman-Kolmogorov equation:
$$P(X(t_1)|X(t_0))=\sum_{X(t_i)}P(X(t_1)|P(X(t_i))P(X(t_i)|P(X(t_0))$$
Informally: you can find the probability of going from state $X(t_0)$ to $X(t_1)$ by averaging over all
intermediate states $X(t_i)$ at intermediate time $t_i$.
CTMC Mathematical Details (Optional)
This section is intended for mathematically-inclined individuals.
- Can be written in the following differential form:
\begin{align*}
\frac{d}{dt}p(x,t|x_0,t_0)&=\sum_{x'\neq x}\left[p(x',t|x_0,t_0)Q_{x'x}-p(x,t|x_0,_0)Q_{xx'}\right]\\
&=\sum_{x'}Q_{x'x}p(x',t|x_0,t_0)
\end{align*}
defining $Q_{xx}=-\sum_{x'\neq x}Q_{xx'}$.
- This form is known as the
Master Equation or
Kolmogorov forward equation. Linear, so solution is "simply" $\vec{p}(t)=\exp[Qt]\vec{p}(0)$.
Relationship between Q and P in CTMCs
Instantaneous Rate Matrix $Q$
- Describes rates of character changes in an infinitesimal time interval
- Characters can be nucleotides or amino acids
- Diagonal elements are defined such that the sum of each row is zero
Transition Probability Matrix $P(t)$
- Describes probabilities of character changes over time interval t
- Elements are probabilities of observing a specific character at time t, given the starting character at time 0
Relationship between $Q$ and $P(t)$
$P(t) = exp(Qt)$, where $exp(Qt)$ is the matrix exponential of the product of $Q$ and $t$
- If you know the instantaneous rates (Q), you can calculate the probabilities over any time interval $t$ using the matrix exponential
- Matrix exponential can be calculated using eigenvalue decomposition or Taylor series expansion
Stationary Distribution
- As $t$ approaches infinity, $P(t)$ converges to the stationary (or equilibrium) distribution.
- Represents long-term probabilities of observing each character, independent of the starting character.
CTMC example
Consider a system with two states (0 and 1) and a transition rate matrix
$$Q = \left[\begin{array}{cc}
- & 2 \\
1 & -
\end{array}\right]$$
Gives rise to the following trajectories:
Time $\Delta t$ spent in state before transition:
$$ P(\Delta t|x) = \lambda e^{-\lambda \Delta t}$$
where $\lambda = \sum_{x'\neq x}Q_{xx'} = -Q_{xx}$.
Time-reversible CTMCs and detailed balance
A CTMC is time-reversible if and only if it satisfies the property:
\begin{equation}
\pi_iQ_{ij} = \pi_jQ_{ji},
\end{equation}
or equivalently the detailed balance property: $$\pi_iP_{ij}(t) = \pi_jP_{ji}(t).$$
From this it can be shown that a CTMC is time-reversible iff there exists a row-vector of state probabilities $\Pi$ such that: \begin{equation} \Pi Q = 0 \end{equation}
Consider again a system with two states (0 and 1) and a transition rate matrix:
$$Q = \left[\begin{array}{cc}
-2 & 2 \\
1 & -1
\end{array}\right]$$
Then it is easy to show that $\Pi = [\frac{1}{3}, \frac{2}{3}]$ is a solution, so that this CTMC is time-reversible and $\Pi$ is the equilibrium distribution over states.
From rates to probabilities: matrix exponential
Transition probability matrix is the matrix exponential of the instantaneous rate matrix:
\[ P(t) = \exp(Q t) = \sum_{k = 0}^{\infty} \frac{ (Q t)^k}{k!} = \mathbf I + tQ + \frac{(t \mathbf Q)^2}{2} + \frac{(t \mathbf Q)^3}{6} + \ldots\]
Can sometimes calculate analytically
For general \(Q\), use numerical methods from numerical libraries.
The matrix exponential follows the rules we would hope for the exponential function. For example:
- \(\exp(\mathbf 0) = \mathbf I_n\)
- when \(\mathbf{AB} = \mathbf{BA}\), \(\exp(\mathbf{A+B}) =\exp(\mathbf{A})\exp(\mathbf{B})\)
- When \(\mathbf B\) is invertible, \(\exp(\mathbf{BAB}^{-1}) = \mathbf{B}\exp(\mathbf{A})\mathbf B^{-1}\)
- If \(\mathbf A = \mathrm{diag}(a_1,\ldots,a_n)\), then \(\exp(\mathbf A) = \mathrm{diag}(\exp(a_1),\ldots,\exp(a_n))\).
Matrix exponential example
Suppose there are two states, 0 and 1.
Let the rate matrix be
\[Q =
\renewcommand{\arraystretch}{1.25}
\left[ \begin{array}{cc}
-1 & 1 \\
0.6 & -0.6
\end{array} \right].\]
To get the probability matrix, use the matrix exponential:
\[ P(1) = \exp(1.Q) = \mathbf I + Q + \frac{ \mathbf Q^2}{2} + \frac{ \mathbf Q^3}{6} + \ldots = \renewcommand{\arraystretch}{1.25}
\left[ \begin{array}{cc}
0.501 & 0.499 \\
0.299 & 0.701
\end{array} \right] \]
So if we started in state 1 at time 0, at time 1 there is a 29.9% chance of being in state 0 and a 70.1% chance of being in state 1.
Matrix exponential example
By time 2, the preference for state 1 is obvious:
\[ P(2) = \exp(2Q) = \renewcommand{\arraystretch}{1.25}
\left[ \begin{array}{cc}
0.400 & 0.600 \\
0.360 & 0.640
\end{array} \right] \]
Remember, can calculate for any value of \(t\), not just integers:
\[ P(0.234) = \exp(0.234 \times Q) = \renewcommand{\arraystretch}{1.25}
\left[ \begin{array}{cc}
0.805 & 0.195 \\
0.117 & 0.883
\end{array} \right] \]
Matrix exponential example
If we ran it for an infinitely long period of time, get the same probabilities regardless of start state:
\[ P(\infty ) =\renewcommand{\arraystretch}{1.25}
\left[ \begin{array}{cc}
0.375 & 0.625 \\
0.375 & 0.625
\end{array} \right] \]
Each row is a copy of the equilibrium distribution or stationary distribution for the process.
Want this property of stationarity for models of DNA substitution so that
\[ P(\infty ) =\renewcommand{\arraystretch}{1.25}
\left[ \begin{array}{cccc}
\pi_A & \pi_C & \pi_G & \pi_T \\
\pi_A & \pi_C & \pi_G & \pi_T \\
\pi_A & \pi_C & \pi_G & \pi_T \\
\pi_A & \pi_C & \pi_G & \pi_T
\end{array} \right] \]
\(\pi = [\pi_A , \pi_C , \pi_G , \pi_T]\) is the stationary distribution of the bases.
Jukes-Cantor model of nucleotide substitution
$$
Q = \mu\left[\begin{array}{cccc}
-1 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\
\frac{1}{3} & -1 & \frac{1}{3} & \frac{1}{3}\\
\frac{1}{3} & \frac{1}{3} & -1 & \frac{1}{3}\\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} &-1
\end{array}\right]
$$
\begin{align*}
P(X(t)=j|X(0)=i) = P_{ij}(t) &= \left[e^{Qt}\right]_{ij}\\
&=\left\{\begin{array}{ll}
\frac{1}{4}+\frac{3}{4}e^{-\frac{4}{3}\mu t} & \text{for } i=j\\
\frac{1}{4}-\frac{1}{4}e^{-\frac{4}{3}\mu t} & \text{for } i\neq j
\end{array}\right.
\end{align*}
Genetic distance under Jukes-Cantor
The probability that a site is distinct after time $t$ is
\begin{align*}
p_{\text{diff}}&=\sum_{x'\neq x} P(X(t)=x'|X(0)=x)\\
&=\frac{3}{4} - \frac{3}{4}e^{-\frac{4}{3}\mu t}
\end{align*}
Equating $p_{\text{diff}}$ with the
$p$-distance and solving for $\mu t$ yields
$$\hat{d} = \widehat{\mu t} = -\frac{3}{4}\log(1-\frac{4}{3}p).$$
- Accounts for multiple, parallel, back-substitutions.
- Only correct if $p$ is exact: infinite sequences.
- Diverges if $p≥3/4$.
Other CTMC substitution models: Kimura 80
$$
Q = \left[\begin{array}{cccc}
\cdot & \alpha & \beta & \beta \\
\alpha & \cdot & \beta & \beta \\
\beta & \beta & \cdot & \alpha \\
\beta & \beta & \alpha &\cdot
\end{array}\right]
$$
- Also "Kimura two parameter" model.
- Allows different rates for transitions and transversions.
- Equilibrium base frequencies equal.
Other CTMC substitution models: HKY
$$
Q = \left[\begin{array}{cccc}
\cdot & \alpha\pi_G & \beta\pi_G & \beta\pi_G \\
\alpha\pi_A & \cdot & \beta\pi_A & \beta\pi_A \\
\beta\pi_C & \beta\pi_C & \cdot & \alpha\pi_C \\
\beta\pi_T & \beta\pi_T & \alpha\pi_T &\cdot
\end{array}\right]
$$
- Due to Hasegawa, Kishino, Yano (1985).
- Allows different transition transversion rates.
- Allows equilibrium base frequencies to differ.
- Most complex model for which analytical solutions are available.
Other CTMC substitution models: GTR
$$
Q = \left[\begin{array}{cccc}
\cdot & \alpha\pi_G & \beta\pi_G & \gamma\pi_G \\
\alpha\pi_A & \cdot & \delta\pi_A & \epsilon\pi_A \\
\beta\pi_C & \delta\pi_C & \cdot & \eta\pi_C \\
\gamma\pi_T & \epsilon\pi_T & \eta\pi_T &\cdot
\end{array}\right]
$$
- All models so-far have been reversible.
- GTR is the most general time-reversible model.
- Reversibility is a mathematical convenience: no biological reason!
- No useful analytical solution available (matrix exponential doesn't count).
Variable rates among sites
Gamma distribution rate heterogeneity allows for variation
in evolutionary rate among sites according to
$$f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}r^{\alpha-1}e^{-\alpha r}$$
Analytical solution exists for JC distance with site rate heterogeneity:
$$\hat{d}=-\frac{3}{4}\alpha\left[1-\left(1-\frac{4}{3}p\right)^{-1/\alpha}\right]$$
Genetic distance variation between models
- HIV-1B vs HIV-O/SIVcpz/HIV-1C: env gene:
| $p$-distance | JC69 | K80 | Tajima-Nei |
HIV-O | 0.391 | 0.552 | 0.560 | 0.572 |
SIVcpz | 0.266 | 0.337 | 0.340 | 0.427 |
HIV-1C | 0.163 | 0.184 | 0.187 | 0.189 |
- HIV-1B vs HIV-O/HIV-1C: pol gene:
| $p$-distance | JC69 | K80 | Tajima-Nei |
HIV-O | 0.257 | 0.315 | 0.318 | 0.324 |
HIV-1C | 0.103 | 0.111 | 0.113 | 0.114 |
Least-squares Phylogenetic Inference
- Can use genetic distance as the basis for least-squares-based phylogenetic inference.
- Define cost of tree as the (weighted) sum of the squares of the
difference between the genetic distance on the tree and the
distance estimated from the sequence alignment:
$$C(T) = \sum_i\sum_j w_{ij}(d_{ij}(T)-\hat{d}_{ij})^2$$
- Search for tree $T$ with lowest $C(T)$.
Comments:
- Ignores correlations due to shared ancestry.
- Consistent estimator: converges to truth in the limit of infinitely long alignment.
The Likelihood Function
The likelihood for a parameter $\theta$ under model $M$ and given data $D$ is defined as
$L(\theta|D,M)\equiv P(D|\theta,M)$.
The likelihood of $\theta$ is NOT a probability distribution over $\theta$!
Example:
- Coin tossed 5 times (identical, independent): $D=(H,T,T,H,T)$.
- Probability of sequence:
\begin{align*}
P(D|f)&=f\times (1-f)\times(1-f)\times f\times(1-f)\\
&=f^2(1-f)^3\\
&=L(f|D)
\end{align*}
Maximum likelihood inference
Sensible methods of parameter estimation from empirical data often
correspond to values which maximize the likelihood function.
Maximum likelihood estimate of coin fairness parameter $f$ for $n$ heads:
$$\log L(f|D) = n\log f + (N-n)\log(1-f)$$
$$\left.\frac{\partial}{\partial f}\log L(f|D)\right|_{f=f_{ML}} = \frac{n}{f_{ML}} - \frac{N-n}{1-f_{ML}} = 0$$
$$\implies f_{ML} = \frac{n}{N}$$
What is the likelihood of a tree?
- $$L(T|A)\equiv P(A|T) = \sum_x\sum_w P(x)P(A|x)P(w|x)P(G|w)P(G|w)$$
- For $n$ taxa there are $4^{n-1}$ possible internal node states!
- Solve using Dynamic Programming!
Felsenstein's Pruning Algorithm
- Introduced by Joseph Felsenstein, 1973.
- Recursion based on the conditional likelihood of a subtree below node $k$ having state $s$: $L_k(s)$.
- For leaves, $L_k(s)=\delta_{s,l}$ where $l$ is the leaf state.
For internal nodes,
$$L_k(s) = \left(\sum_xP(x|s)L_{c_l}(x)\right)\left(\sum_yP(y|s)L_{c_r}(y)\right)$$
Time complexity for $m$ sites: $m(n-1)4^2$. This is the workhorse of computational phylogenetics.
Maximum likelihood for Phylogenetics
- PhyML (Guindon and Gascuel, 2003)
- Hill-climbing algorithm which attempts to find ML trees.
- RAxML (e.g. Stamatakis, 2005)
- Another hill-climbing algorithm, optimized for large trees.
- FastTree (Price, Dehal, Arkin, 2009)
- Uses efficient Neighbour-Joining algorithm to find starting tree.
- Updates tree using heuristics based on tree length.
- Inference from alignments of $>10^5$ sequences feasible.
Problems with ML
- Only ever gives point estimates: no indication of uncertainty.
- Addressed using more ad hocery: bootstrap algorithms.
- Lacks rigorous theoretical basis: $L(\theta|D)$ is not a probability!
- No clear way of incorporating additional (prior?) information into the analysis.
Need a better approach!
Summary
- The genetic distance between sequence pairs can be estimated using
a CTMC model for neutral molecular evolution.
- Addresses some of the problems with the parsimony approach.
- Have focused on 4 state DNA models, but it is possible to consider
models for proteins (20 amino acid states) and models that take
into account codons and the genetic code (64 states).
- Substitution model approaches to estimating genetic distance do not
deal with insertions and deletions of genetic material: they presuppose
a sequence alignment.
- Can be used within the context of a least-squares framework to find
best-fitting phylogenies.
- Better used as the basis for a tree likelihood.
- Likelihood can be used on its own for phylogenetic inference.