Monday, December 30, 2013

How to interpret posterior probabilities?

What does it mean, for instance, to report a posterior probability (pp) of 0.87 for the monophyly of a given group of species (say Archaea)? Intuitively, a stronger support than 0.82 and a weaker support than 0.95. But what does this "0.87" mean in absolute terms?

How to interpret a 95% credible interval of, say, 90 to 110 Myr for the age of the last common ancestor of placental mammals?

The subjective Bayes answer is that subjective probabilities make sense only if associated with utilities. In the rather abstract situation of phylogenetic inference, utilities are difficult to define. Strictly speaking, a pp of 0.87 for the monophyly of Archaea would mean, for instance, that you are ready to bet up to 87 euros, given that you will earn 100 euros if Archaea turn out to be monophyletic and 0 otherwise. But this interpretation sounds a bit silly.
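To make the arithmetic of the bet explicit, here is a minimal sketch. The function name `expected_gain` and its parameters are illustrative assumptions, not anything standard: the expected gain of staking a given amount is the pp times the payout, minus the stake, and the break-even stake is the pp times the payout.

```python
def expected_gain(pp, stake, payout=100.0):
    """Expected net gain (in euros) of staking `stake` on a hypothesis
    that pays `payout` if true and 0 otherwise, given posterior probability pp.
    Hypothetical helper, for illustration only."""
    return pp * payout - stake


# With pp = 0.87 and a payout of 100 euros, the bet is favorable below
# a stake of 87 euros and unfavorable above it.
for stake in (80.0, 87.0, 90.0):
    print(stake, round(expected_gain(0.87, stake), 6))
```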

Another possible way to grasp the meaning of subjective probabilities is to imagine other, less abstract situations, get a feeling for what they mean in these new contexts, and then transfer your perception of the implied weight of evidence back into the initial context. A pp of 0.87 should be interpreted as a similar strength of support in any situation: a pp of 0.87 for the monophyly of Archaea means the same strength of support as a pp of 0.87 in favor of the monophyly of a completely different group in a completely unrelated study (say, mammals), or even as a pp of 0.87 in favor of the presence of positive selection in a given gene, or as a pp of 0.87 that a given gene is associated with hypertension in a genome-wide association study.

So, then, let us suppose that you run a Bayesian statistical analysis to assess whether a given gene is associated with hypertension. This is a preliminary analysis, on the basis of which you could decide to investigate the case further, now using experimental methods, but this requires you to invest time and money. If you estimate that your loss if you investigate the gene and it turns out not to be associated with hypertension is about 10 times greater than your loss if you don't investigate and thereby miss an important gene that is associated with hypertension, then, rationally, you should investigate further only if the posterior odds are at least 10:1 (or equivalently, pp ≥ 10/11 ≈ 0.91). Similarly, requiring a posterior probability of 0.99 before investigating would mean that you consider a false positive about 99 times more expensive than a false negative, etc.
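The decision rule in this hypertension example can be sketched in a few lines. This is only an illustration under the assumptions stated in the text (two actions, two losses, zero loss for a correct decision). The function names and the loss units are hypothetical. Investigating costs (1 - pp) times the false-positive loss in expectation, not investigating costs pp times the false-negative loss, and the two are equal when the posterior odds match the loss ratio.

```python
def investigation_threshold(loss_false_positive, loss_false_negative):
    """Posterior probability above which investigating has the lower
    expected loss, assuming correct decisions cost nothing."""
    return loss_false_positive / (loss_false_positive + loss_false_negative)


def should_investigate(pp, loss_false_positive, loss_false_negative):
    """Compare the expected losses of the two actions directly."""
    expected_loss_if_investigating = (1.0 - pp) * loss_false_positive
    expected_loss_if_skipping = pp * loss_false_negative
    return expected_loss_if_investigating < expected_loss_if_skipping


# A false positive 10 times as costly as a false negative: threshold is 10/11.
print(round(investigation_threshold(10.0, 1.0), 3))
# A pp of 0.87 falls short of that threshold; a pp of 0.95 clears it.
print(should_investigate(0.87, 10.0, 1.0), should_investigate(0.95, 10.0, 1.0))
```

Read the other way round, a threshold of 0.99 corresponds to a loss ratio of 99 to 1.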

By considering such hypothetical cases (and, I guess, ideally, by practising Bayesian decision making in real life), you progressively calibrate your mind to Bayesian probabilities (and utilities).

All this sounds a bit abstract and introspective, but after all, the semantics of p-values is also fairly hypothetical. I guess that, apart from their nominal frequentist calibration, the essential property of p-values is that they have the same meaning in all situations, and thus we can progressively calibrate our perception of the strength of evidence associated with p-values through our practical experience with using them.

Similarly, Bayesian probabilities are supposed to have a homogeneous meaning, and thus, as you are getting more experienced with the Bayesian paradigm, you will progressively get a good feeling of what strength of evidence a given pp is supposed to imply.

There is a ghost, however, looming over all this discussion: can we, at least under certain conditions, interpret posterior probabilities in frequentist terms?


  1. Love your new blog, Nicolas!

    You say "we can progressively calibrate our perception of the strength of evidence associated with p-values". But I thought that measuring the strength of evidence is something that p-values cannot do. I also don't know how to compare p-values obtained from different data sets and under different modeling assumptions.

    1. thanks, Vladimir

      To be honest, I am not yet totally certain about how to interpret p-values. But I guess I was just saying that, since p-values are uniform under the null, it makes sense to say that a p-value of, say, p=1e-4, means as extreme an event under H0 in all circumstances.

      Then, of course, depending on the situations, it may take more or less strength of evidence against the null in order to reject it (in Bayesian terms, prior odds for H0 and H1 also matter).

      Or are you alluding to other problems with p-values? Like Valen Johnson's recent paper? Or other, more classical arguments?

  2. I was referring to arguments about effect size, sample size, etc. The idea being that in, say, linear regression, a small sample size and a large effect can give you the same p-value as a large sample size and a small effect. In other words, the p-value does not tell us about the magnitude of the departure from the null. Hence my hesitation about whether one can calibrate p-values from experience. Even though I have seen many p-values in my life, if you give me a new model and a new data set, with a p-value of 1e-4, I wouldn't know if I should be impressed.