Leliaert et al., Euro. J. Phycology, 2014
Basic MCMC Theory
Think of MCMC like exploring a fitness landscape: current position = current tree & parameters, proposed moves = small "mutations", acceptance/rejection = selection. We don't need the whole landscape—just the relative heights of where we are vs. where we're going.
MCMC works by simulating a stochastic trajectory according to:
$$p_{i+1}(x)=\int p_i(x')W(x|x')\mathrm{d}x'$$
where
$$W(x'|x) = q(x'|x)\alpha(x'|x) + \left(1-\int q(x''|x)\alpha(x''|x)\mathrm{d}x''\right)\delta(x'-x)$$
A sufficient (but unnecessary!) condition for $\pi(x)$ to be the equilibrium distribution is detailed ballance, which implies:
$$\pi(x')q(x|x')\alpha(x|x')=\pi(x)q(x'|x)\alpha(x'|x)$$
which is satisfied by the following acceptance probability:
$$\alpha(x'|x)=\min\left[1, \frac{\pi(x')}{\pi(x)}\times\frac{q(x|x')}{q(x'|x)}\right]$$
Basic MCMC Theory (continued)
$$\alpha(x'|x)=\min\left[1, \frac{\pi(x')}{\pi(x)}\times\frac{q(x|x')}{q(x'|x)}\right]$$
- By tuning the acceptance probability, any probability distribution can be targeted.
- Only the ratio $\pi(x')/\pi(x)$ appears: normalization need not be known.
- The only requirement for $q(x'|x)$ is that it produce random walks that explore the entire state space for which $\pi(x)$ has support (cf. irreducibility).
- For high-dimensional state spaces, $W(x'|x)$ is usually decomposed
into several distinct transitions: $W(x'|x)=\sum_j W_j(x'|x)$ that each
modify some small part of the state and individually satisfy detailed
balance in the steady state.
- The ratio $q(x|x')/q(x'|x)$ is called the Hastings Ratio and accounts for any asymmetry in the proposal rates between $x$ and $x'$.
Proposal distributions/operators for Trees
Need to identify a set of proposals $q_j(x'|x)$ when $x$ is a point in
the space of rooted time trees.
Stopping criteria
- How can we tell when a phylogenetic MCMC calculation has reached equilibrium?
- How do we know when we've collected enough samples?
- One approach is to compute ESS (using empirical estimates of autocorrelation time) for each
parameter and a number of tree summary statistics (e.g. tree height and tree "length").
- Assume that once all of these scores are sufficiently high
(e.g. > 200) we have adequately sampled the posterior.
- Examine output of several independent (and independently
initialized chains. A necessary (not sufficient) condition for
convergence is that sample distributions converge as the chains
converge to equilibrium.
Are we there yet?
The AWTY application applies a number of different statistics to assess
the convergence of the tree state. It relies heavily on comparing
the result of multiple runs.
Parameter samples
Logs of individual parameters can be considered samples from the marginal posteriors of those parameters.
Tree samples
- Summary statistics (e.g. age of the MRCA of a pair of taxa) can be
computed and their marginal distributions assessed.
- A number of different approaches can be used to produce a "summary" tree meant
to represent the best guess at the true tree. (None of these algorithms are perfect!!)
- Strict consensus
- Include only clades that appear in ALL sampled trees.
- Majority rule consensus
- Include clades that appear in >50% of the sampled trees.
- Maximum clade credibility tree
- The sampled topology for which the product of the posterior clade probabilities is maximized.
In BEAST, summary trees are produced using the MCC tree method via the program TreeAnnotator.
Bayesian Phylogenetic Inference Software
Popular Bayesian phyogenetic inference software:
- MrBayes (Huelsenbeck and Ronquist, Bioinformatics, 2001)
- Early CLI program for phylogenetic inference.
- RevBayes (Höhna et al., Syst. Biol., 2016)
- R-like syntax for specifying phylogenetic models.
- BEAST/BEAST2 (Drummond and Rambaut, 2007; Bouckaert et al., 2014)
- XML specification of phylogenetic models. Extensible.
Some software implementing special models:
- MIGRATE
- Performs inference under the Structured Coalescent. (i.e. subpopulation sizes, migration rates, ancestral locations.)
- ClonalFrame/ClonalOrigin
- Infer bacterial Ancestral Recombination Graphs (generalizations of trees when recombination is present).
... there are many others!
Summary
- Bayesian phylogenetic inference is generally done by adding a tree
prior to the tree likelihood. This implicitly assumes neutral
evolution.
- Tree priors link the shape of the tree to parameters such as
speciation rates, extinction rates, sampling rates, population
sizes and migration rates.
- Development of these priors is a particularly rich area of current
research.
- Due to the immense size of the state spaces involved and the use of
distributions which are analytically intractable, almost all
Bayesian phylogenetic inference uses MCMC.
- Deciding on stopping criteria is difficult. Many necessary
requirements for convergence, but no sufficient requirements!
- Methods for summarizing sampled tree distributions exist, but may
be misleading. At the moment, sticking to summary statistics for
particular characteristics of interest is the only safe
option.
Recommended Reading
- Chapter 10: Bayesian phylogenetic inference
Also recommended:
- Felsenstein (2003), Inferring Phylogenies
- Drummond & Bouckaert (2015), Bayesian Evolutionary Analysis with BEAST