The strong syllogism (modus ponens):
if $A$ is true, then $B$ is true
$A$ is true
therefore, $B$ is true
The strong syllogism (modus tollens):
if $A$ is true, then $B$ is true
$B$ is false
therefore, $A$ is false
Suppose some dark night a policeman walks down a street. He hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out of the window, carrying a bag full of jewelry. The policeman decides immediately that the gentleman is a thief. How does he decide this?
The policeman seems to be applying the following weak syllogism:
if $A$ is true, then $B$ is likely
$B$ is true
therefore, $A$ becomes more plausible
Our theory must satisfy the following requirements (Cox's desiderata):
- Degrees of plausibility are represented by real numbers.
- The theory must agree qualitatively with common sense.
- The theory must be consistent: if a conclusion can be reached in more than one way, every way must give the same result.
These requirements are enough to uniquely identify the essential rules of probability theory!
- The probability $P(A|B)$ is the degree of plausibility of proposition $A$ given that $B$ is true.
- Product rule: $P(A|BC)P(B|C) = P(AB|C)$
- Sum rule: $P(A|B)+P(\neg A|B) = 1$
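Both rules can be checked numerically on a small discrete example. The joint table below is invented purely for illustration, and the background information $C$ appearing in the rules above is left implicit:

```python
# Joint probabilities P(A, B) over the truth values of two propositions.
# The numbers are made up for illustration.
joint = {
    (True, True): 0.30,
    (True, False): 0.20,
    (False, True): 0.10,
    (False, False): 0.40,
}

def P(a=None, b=None):
    """Marginal or joint probability obtained by summing the table."""
    return sum(p for (av, bv), p in joint.items()
               if (a is None or av == a) and (b is None or bv == b))

# Sum rule: P(A|B) + P(not A|B) = 1
P_A_given_B = P(a=True, b=True) / P(b=True)
P_notA_given_B = P(a=False, b=True) / P(b=True)
assert abs(P_A_given_B + P_notA_given_B - 1.0) < 1e-12

# Product rule: P(A|B) P(B) = P(A, B)
assert abs(P_A_given_B * P(b=True) - P(a=True, b=True)) < 1e-12
```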
By convention, $P(A)=0$ indicates that $A$ is certainly false, while $P(A)=1$ means $A$ is certainly true.
Strictly speaking, probabilities only ever concern propositions, i.e. statements that are either true or false:
- Tim is a cat
- $N=5$
A statement such as $P(N)$ is therefore as meaningless as $P(\text{Tim})$.
However, where propositions concern the value of a variable like $N$, we often use $P(n)$ as shorthand for $P(N=n)$.
Abusing notation, this is sometimes written as $P(N)$.
Traditionally, probability has been defined in terms of relative frequencies of outcomes of repeated random (weakly controlled) "experiments".
There are several problems here:
- Experiments are assumed to be repeatable.
- Randomness is assumed to be a property of the system itself.
- It completely ignores $\sim 400$ years of physics.
The Bayesian interpretation treats probability as a measure of the plausibility of propositions conditional on available information.
A single proposition can therefore have multiple probabilities depending on the available information!
- $f(x)$ is a probability density.
- It is normalized: $\int_0^{10}f(x)dx=1$
- It is positive: $f(x)\geq 0$
- At a given point $f(x)$ may be $>1$!
Otherwise, $f(x)$ follows the standard rules of probability.
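The "$f(x)$ may exceed 1" point is easy to demonstrate. A sketch using an invented uniform density on $[0, 0.5]$, normalized over the same $[0,10]$ domain as above:

```python
# A density can exceed 1 pointwise while still integrating to 1.
# Illustrative choice (not from the text): uniform density on [0, 0.5],
# so f(x) = 2 there and 0 elsewhere on [0, 10].

def f(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# f exceeds 1 at, e.g., x = 0.25 ...
assert f(0.25) > 1

# ... yet a midpoint-rule integral over [0, 10] still gives 1.
n = 100000
dx = 10.0 / n
integral = sum(f((i + 0.5) * dx) * dx for i in range(n))
assert abs(integral - 1.0) < 1e-6
```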
In fact, there are several justifications:
Inference is the act of deriving logical conclusions from premises assumed to be true.
Statistical inference generalises this to situations where the premises are not sufficient to draw conclusions without uncertainty.
What is "data"?
How many of the balls in the urn are red? In other words, what is $P(N_r|d_1=R,d_2=B,d_3=R)$?
Given the description of the process, it is more tractable to consider $$P(d_1,d_2,d_3|N_r)=P(d_3|N_r,d_1,d_2)P(d_2|N_r,d_1)P(d_1|N_r)$$
Knowing nothing about the internal arrangement of the balls in the urn, we must have $P(d_1=R|N_r)=N_r/11$.
In general $P(d_2|N_r,d_1)$ depends on the result of the first draw!
Cheat by assuming the urn is shaken between draws and that we know nothing of the relevant physics, so that $P(d_2|N_r,d_1)=P(d_2|N_r)$.
Applying the product rule to $P(N_r,R,B,R)$ in both orders yields: \begin{align} P(N_r|R,B,R)P(R,B,R)&=P(R,B,R|N_r)P(N_r)\\ P(N_r|R,B,R)& = \frac{1}{P(R,B,R)}P(R,B,R|N_r)P(N_r) \end{align}
The term $P(R,B,R)$ is a function only of the data, so it acts as a normalizing constant. The term $P(N_r)$ specifies the plausibility of each possible value of $N_r$ in the absence of the data.
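The urn posterior can be computed by brute-force enumeration. A minimal sketch, assuming a uniform prior $P(N_r)$ over $N_r=0,\dots,11$ (the prior choice is mine; the urn size of 11 and the independence between draws follow the text):

```python
# Posterior over the number of red balls N_r given draws R, B, R.
# Assumptions: 11 balls total, independent draws (the "shaken urn"),
# and a uniform prior over N_r = 0..11 (my choice, for illustration).

N_BALLS = 11
data = ["R", "B", "R"]

def likelihood(n_red):
    """P(R, B, R | N_r), with draws treated as independent."""
    p_red = n_red / N_BALLS
    prob = 1.0
    for d in data:
        prob *= p_red if d == "R" else (1.0 - p_red)
    return prob

prior = [1.0 / (N_BALLS + 1)] * (N_BALLS + 1)        # uniform over 0..11
unnorm = [likelihood(n) * prior[n] for n in range(N_BALLS + 1)]
evidence = sum(unnorm)                                # P(R, B, R)
posterior = [u / evidence for u in unnorm]

# The posterior rules out the extremes: an urn with no red balls cannot
# produce an R, and an all-red urn cannot produce a B.
assert posterior[0] == 0.0 and posterior[N_BALLS] == 0.0
```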
$$\color{cyan}{P(\theta_M|D,M)} = \frac{\color{orange}{P(D|\theta_M,M)}\color{red}{P(\theta_M|M)}}{\color{lime}{P(D|M)}}$$
Here $\theta_M$ are parameters of some model $M$ and $D$ is data assumed to be generated by that model.
The components of the equation even have names: $P(\theta_M|D,M)$ is the posterior, $P(D|\theta_M,M)$ the likelihood, $P(\theta_M|M)$ the prior, and $P(D|M)$ the evidence.
A probability! The probability of whatever you're interested in, but in the absence of possibly relevant data.
In principle, any two (rational) people with access to the same information should specify exactly the same prior.
In practice this often isn't true.
NO!
- It is not possible to do inference without making assumptions.
- Priors allow previous knowledge to be incorporated.
- Frequentist (and Likelihoodist) methods also use priors: it's just not clear what they are!
For a continuous variable $a<x<b$, sensible priors may include the uniform density $f(x)=1/(b-a)$.
For a rate variable $\lambda>0$, one might similarly try a flat prior $f(\lambda)=c$.
Hold on, how can we choose a value of $c$ in $f(\lambda)=c$ so that $f(\lambda)$ is normalized on the domain of $\lambda$?
We can't! This $f(\lambda)$ is not a true probability density: it is an *improper* prior.
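Even so, an improper flat prior can still produce a proper, normalizable posterior. A sketch under invented assumptions (a Poisson likelihood with observed count $k=3$; nothing here is from the text):

```python
import math

# An improper flat prior f(lambda) = c can still give a proper posterior.
# Illustrative setup: Poisson likelihood with observed count k = 3 and a
# flat prior on lambda > 0. Dropping constants (c and 1/k! cancel on
# normalization), the unnormalized posterior is lambda^k e^{-lambda},
# whose integral over (0, inf) is k! = 6 -- finite, so it normalizes.

k = 3

def unnorm_posterior(lam):
    return lam**k * math.exp(-lam)

# Crude trapezoid integration on a wide grid; the tail beyond 60 is
# negligibly small.
n, upper = 200000, 60.0
h = upper / n
integral = h * (sum(unnorm_posterior(i * h) for i in range(1, n))
                + 0.5 * unnorm_posterior(upper))
assert abs(integral - math.factorial(k)) < 1e-3
```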
It is important to remember that there is no universally "correct" prior: only the person doing the analysis can decide which prior reflects the available information!
Priors encapsulate expert knowledge (or its absence). This is your opportunity to contribute your hard-won expertise to the analysis.
Here the interval is $[0.41, 0.91]$.
INTEGRATION
Bayes' theorem has a troublesome denominator: $$ P(\theta_M|D,M)=\frac{P(D|\theta_M,M)P(\theta_M|M)}{P(D|M)} $$
The quantity $P(D|M)$ is a normalizing constant, which is the result of integrating the numerator over all $\theta_M$: $$P(D|M)=\int P(D|\theta_M,M)P(\theta_M|M)d\theta_M$$
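In low dimensions this integral can be evaluated directly. A sketch with a toy model of my choosing (coin flips with a uniform prior on the bias $\theta$), where the evidence has the known closed form $1/(n+1)$:

```python
import math

# Evidence P(D|M) by direct numerical integration over the parameter.
# Toy model (my choice, not from the text): D = k heads in n flips of a
# coin with bias theta, uniform prior on theta in (0, 1). Analytically,
# Integral C(n,k) theta^k (1-theta)^(n-k) dtheta = 1/(n+1).

n_flips, k_heads = 10, 7

def integrand(theta):
    # P(D|theta, M) * P(theta|M), with P(theta|M) = 1 on (0, 1).
    return (math.comb(n_flips, k_heads)
            * theta**k_heads * (1 - theta)**(n_flips - k_heads))

m = 100000
h = 1.0 / m
evidence = sum(integrand((i + 0.5) * h) * h for i in range(m))  # midpoint rule
assert abs(evidence - 1.0 / (n_flips + 1)) < 1e-6
```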
In our context, Monte Carlo methods are algorithms which produce random samples of values in order to characterize a probability distribution over those values.
Usually, the algorithms we deal with seek to produce an arbitrary number of independent samples of $\theta_M$ drawn from the posterior distribution $P(\theta_M|D,M)$.
Reject the red samples; the accepted (green) samples are then draws from the target distribution.
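A minimal sketch of rejection sampling, with a made-up target density $f(x)=2x$ on $[0,1]$ enclosed by a scaled uniform proposal:

```python
import random

# Rejection sampling from the target f(x) = 2x on [0, 1] (my choice, for
# illustration), enclosed by a uniform proposal g scaled by M = 2, so
# that f(x) <= M * g(x) everywhere. A proposal point under the target
# curve is accepted ("green"); one above it is rejected ("red").

random.seed(0)
M = 2.0

def sample_target():
    while True:
        x = random.random()        # proposal draw from g
        u = random.random()
        if u < (2.0 * x) / M:      # falls under the target curve: accept
            return x
        # otherwise reject and try again

samples = [sample_target() for _ in range(50000)]
mean = sum(samples) / len(samples)
# Under f(x) = 2x the true mean is 2/3; the sample mean should be close.
assert abs(mean - 2 / 3) < 0.01
```

The acceptance rate is $1/M$, which already hints at the dimensionality problem discussed next.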
In general, the fraction of an enclosing distribution occupied by a target distribution diminishes rapidly as the number of dimensions increases.
This means that for problems with large numbers of unknown parameters, rejection sampling will only ever produce rejects!
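The collapse with dimension can be illustrated by the fraction of the cube $[-1,1]^d$ occupied by the inscribed unit ball, a simple stand-in for a target inside an enclosing distribution (the setup is mine, not from the text):

```python
import random

# Fraction of the enclosing cube [-1, 1]^d occupied by the unit ball,
# estimated by Monte Carlo: this is the acceptance rate a rejection
# sampler for the ball would achieve, and it shrinks fast with d.

random.seed(1)

def acceptance_fraction(d, trials=100000):
    hits = 0
    for _ in range(trials):
        point = [random.uniform(-1, 1) for _ in range(d)]
        if sum(c * c for c in point) <= 1.0:
            hits += 1
    return hits / trials

fracs = [acceptance_fraction(d) for d in (2, 5, 10)]
# Exact values: pi/4 ~ 0.785 in 2D, ~0.164 in 5D, ~0.0025 in 10D.
assert fracs[0] > fracs[1] > fracs[2]
```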
The number of steps required for the autocorrelation function to decay to the vicinity of 0 is the gap between effectively independent samples, $\tau$.
The effective sample size (ESS $\approx N/\tau$ for a chain of length $N$) is a rough estimate of the number of effectively independent samples a chain has generated.
You should really only consider the order of magnitude of the ESS.
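One way to estimate $\tau$ and the ESS, sketched on a synthetic AR(1) chain standing in for MCMC output (the chain, the correlation $\rho=0.9$, and the truncation threshold are all my choices):

```python
import random

# ESS estimate for a correlated chain. We generate an AR(1) process
# x_t = rho * x_{t-1} + noise (rho = 0.9, an arbitrary stand-in for MCMC
# output), estimate its autocorrelation function, sum it until it has
# decayed near 0, and divide the chain length by the resulting tau.

random.seed(2)
rho, n = 0.9, 100000
chain = [random.gauss(0, 1)]
for _ in range(n - 1):
    chain.append(rho * chain[-1] + random.gauss(0, 1))

mean = sum(chain) / n
var = sum((x - mean) ** 2 for x in chain) / n

def autocorr(lag):
    s = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return s / ((n - lag) * var)

# Integrated autocorrelation time: tau = 1 + 2 * sum of autocorrelations,
# truncated once the (noisy) estimate drops near 0.
tau = 1.0
for lag in range(1, 200):
    r = autocorr(lag)
    if r < 0.05:
        break
    tau += 2.0 * r

ess = n / tau
# For AR(1) the true tau is (1 + rho) / (1 - rho) = 19, so ESS ~ n / 19 --
# only the order of magnitude of ess should be trusted.
```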