A shibboleth is a custom, such as a choice of phrasing, that distinguishes one group of people from another. The term goes back to the Hebrew Bible, in which the inhabitants of Gilead identify members of the tribe of Ephraim by the way they pronounce the word “shibboleth” (Hebrew for the part of the plant containing grain, or flood/torrent). Those who are identified as such are promptly killed.
A more modern shibboleth is the interpretation of a confidence interval. If you calculate a confidence interval and subsequently claim that, with a probability of 95%, the parameter of interest lies within this interval, you will not get killed. A far worse fate awaits you: Those in the know will assume that you are a noob and have no idea what you are talking about.[1] Confidence intervals are a great source of confusion for students learning statistics. They may have the impression that everything they have mastered so far leads up to this interpretation, only to learn that if they do interpret it that way, they are absolutely wrong. So, what’s going on here?
What a confidence interval is
Let’s start from the basics. Confidence intervals are calculated in the context of Frequentist statistics. They always target a specific parameter of interest, such as the average height in a population of interest. Now, the data we look at are a random sample from said population.[2] We use these data to calculate, for example, a 95% confidence interval for the mean height. The idea of the 95% confidence interval is that, if we repeatedly draw samples from that population and calculate the 95% confidence interval each time, 95% of those intervals will contain the true parameter (the actual average height in the population).[3]
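To make that long-run idea concrete, here’s a minimal simulation sketch in Python (my own toy example, not from the original analysis; the population mean, SD, and sample size are made-up numbers, and scipy’s t-based interval stands in for whatever interval you’d compute in practice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical population: mean height 170 cm, SD 10 cm (made-up numbers).
true_mean, sd, n, n_reps = 170.0, 10.0, 50, 10_000

covered = 0
for _ in range(n_reps):
    sample = rng.normal(true_mean, sd, size=n)
    # Standard t-based 95% confidence interval for the mean.
    lo, hi = stats.t.interval(0.95, n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    covered += lo <= true_mean <= hi

print(f"Share of intervals containing the true mean: {covered / n_reps:.3f}")
# Prints a value close to the nominal 0.95.
```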
We do, however, usually only draw one sample. Upfront, we can expect that the confidence interval calculated from this sample will contain the true parameter with a probability of 95%. Now we have drawn the sample and calculated the confidence interval. What’s the probability that the confidence interval contains the true parameter?
What a confidence interval isn’t
One may be forgiven for assuming that the answer is 95%. This is the answer that most people would expect. For example, consider the following everyday scenario. You know that in the population, 35% of people are born without any wisdom teeth. You meet a stranger whose mouth is initially shut. What is the probability that they have no wisdom teeth? If you’re a reasonable person, you’d probably guess 35%. Then, like any reasonable person, you kindly ask the stranger to open their mouth wide so you can take a closer look, for science.
Why does this not work for the 95% confidence interval? Because the 95% confidence interval is a Frequentist concept. Whereas wisdom teeth are built on wisdom, Frequentism builds on a Frequentist interpretation of probabilities, which builds on frequencies, hence the name. More precisely, in Frequentist statistics, a probability is the limit of a relative frequency across an infinite number of trials. You draw an infinite number of samples, calculate the confidence interval for a parameter of interest, count the relative frequency of confidence intervals that actually do contain the parameter of interest, and end up with 95%. The 95% thus refers to a property of the whole procedure of calculating a confidence interval, and to a (hypothetical) endless string of confidence intervals. It’s also what you expect for the next confidence interval if you have not yet observed the data. But once the data are in, it no longer applies.
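If you want to see the limit-of-relative-frequency idea in miniature, here’s a toy sketch: treat “this interval contains the parameter” as an event with long-run frequency 0.95 and watch the running relative frequency settle down. (The Bernoulli stand-in is an assumption for illustration, not an actual string of confidence intervals.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: each trial is "this interval contains the parameter",
# an event with long-run relative frequency 0.95.
hits = rng.random(1_000_000) < 0.95

for n in (10, 100, 10_000, 1_000_000):
    print(f"after {n:>9} trials: relative frequency {hits[:n].mean():.4f}")
# The relative frequency converges to 0.95 as the number of trials grows.
```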
But then what does apply after the data are in? For example, what is the probability that the 95% confidence interval contains the true parameter? Or what would the interval look like that does contain the true parameter with a probability of 95%? Well, Frequentism simply does not deal in this type of probability. It’s not like the confidence interval is the wrong answer to the question; it’s that the question is considered outside the scope of the framework. And whereof one cannot speak, thereof one must be silent. You’ve shown your wisdom teeth to a statistician, and they ask you to shut your mouth again.
I do not think that this is intuitive at all. The Frequentist limits on the interpretation of probability are simply not aligned here with what most of us would be willing to say in everyday life. Sometimes people try to elaborate by explaining that in Frequentist statistics, the confidence interval either contains the true parameter (1) or not (0), so any probability in between cannot apply. Personally, I feel like this makes Frequentism sound dumber than it is. What’s the probability that Johnny has no wisdom teeth? Well, that was determined at his birth, so it’s either 0% or 100%. Sorry, I don’t make the rules. Now of course, given our knowledge, it’s reasonable to say the probability is 35%, but this probability represents our state of knowledge about the world and is thus a Bayesian probability. It’s just not the Frequentist way of life.
But does it matter?
As far as I can tell, a great amount of energy is exerted to ensure the correct interpretation of confidence intervals is propagated. Maybe that energy could be exerted more purposefully.
One angle on this is that maybe somebody else is a Frequentist and calculates a confidence interval, but then you look at it, and actually you’re more of a Bayesian yourself. So, you look at that 95% confidence interval and think to yourself, “well, there’s a 95% probability that the true parameter is somewhere in there.” This would indeed be your best bet, unless you had additional information (e.g., prior beliefs) that you could incorporate. So, there’s a Get Out of Jail Free card if you did not commit the confidence interval crime but are merely a Bayesian witness.[4]
Of course, if you’re a Bayesian, you might as well go ahead and calculate a credible interval instead. Credible intervals have precisely the interpretation that everybody who is not a hardcore Frequentist wishes confidence intervals had. The cost of this is that you have to specify a prior, that is, you have to state upfront how plausible the various parameter values are. But the thing is, we can use off-the-shelf Bayesian analysis, which usually comes with an “objective/uninformative” prior. Intellectually, that’s the Bayesian equivalent of the shruggie emoji (“Average height in the population? Oh boy, that could be anything! Maybe it’s 500 cm? Maybe -10 cm? Who am I to tell without having seen any data!”). And with such a prior, the credible interval will often be fairly similar to the Frequentist confidence interval (but not always!).[5] So, the confidence interval does not have the interpretation that we want, but it is very similar to the credible interval (if we essentially don’t have any a-priori knowledge about what’s plausible), which in turn does have the interpretation that we want.
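To illustrate the “fairly similar, but not always” point, here’s a sketch of the simplest conjugate case: a normal mean with known SD, where the posterior is available in closed form. All numbers are made up; the point is that as the prior gets wider, the credible interval converges to the Frequentist z-interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma, n = 10.0, 50                      # data SD assumed known, for simplicity
sample = rng.normal(170.0, sigma, size=n)
xbar = sample.mean()

# Frequentist 95% z-interval for the mean.
se = sigma / np.sqrt(n)
print("confidence interval:", stats.norm.interval(0.95, loc=xbar, scale=se))

# Conjugate normal prior on the mean: Normal(prior_mean, prior_sd^2).
# As prior_sd grows, the prior approaches "flat" and the credible
# interval converges to the confidence interval above.
prior_mean = 150.0                       # deliberately off, to show the pull
for prior_sd in (5.0, 50.0, 5000.0):
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + n * xbar / sigma**2)
    ci = stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))
    print(f"credible interval, prior sd {prior_sd:>6}:", ci)
```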
Another angle on this is: What are the downstream consequences of rampant misinterpretation of confidence intervals? As far as I can tell, the whole exercise seems mostly concerned with language. If I say “with a probability of 95%, this interval contains the true parameter”, I commit a faux pas. If I say “here’s my interval; in the long run, 95% of intervals created in this manner contain the true value”, I exhibit technical sophistication. But will any of my downstream inferences look different? Is any consumer of my findings going to deal with the information differently?
Which is to say, if we ignore the whole shibboleth aspect of it – if you misinterpret confidence intervals, people will think you are bad at stats[6] – the misinterpretation of confidence intervals is probably mostly an epiphenomenon. It does not have major consequences for subsequent scientific inferences; at least I cannot think of any. This makes it different from other statistical faux pas, such as interpreting p values as the probability that the null hypothesis is true, or assuming that all p values and confidence intervals (or credible intervals, for that matter) in the literature can be taken at face value.
But does it matter!
Smart people still get upset about other people misinterpreting confidence intervals. One valid concern is that the misinterpretation of confidence intervals is a sign of lacking statistical literacy. That may well be true, although I feel like it mostly demonstrates that people have not fully absorbed the Frequentist mindset.[7] In any case, there are enough other signs of lacking statistical literacy in the literature, many of which do have downstream consequences.
Another concern I sense from the Bayesian side is that if only people realise that (1) confidence intervals cannot be interpreted that way, but (2) credible intervals can be interpreted that way, they’ll become Bayesians. While I do not know whether it will play out that way empirically, I do agree that credible intervals are probably a better match for people’s inferential needs if they are all so eager to misinterpret confidence intervals. Morey et al. (2015) make a good case (see also Figure 1).
At this point, it is good to keep in mind that statisticians are, of course, very much in the business of caring a great deal about statistics in its own right. That is not a bad thing at all, it’s sort of their job. But it can lead to a slightly different list of priorities than caring about statistics strictly as an instrument for scientific inference. If we care about statistics only for instrumental reasons, misinterpreted confidence intervals are probably not the biggest problem; maybe they are not even in the top 10 – we have lots of other stuff to worry about. Such as people misunderstanding how hypothesis testing works, or people being completely confused about what their analysis is supposed to show in the first place, and about whether they are trying to do causal inference or not. But of course I’d say such a thing! I’m a causal inference person, after all.
Lastly, you can always rely on the good ol’ trusty 100% CI. Regardless of whether you’re a Bayesian or a Frequentist, it always contains the truth.
Appendix 1: The parameters are fixed, the data are random (and vice versa)
People like to say that in Frequentist statistics “the parameters are fixed and the data are random”, whereas in Bayesian statistics “the data are fixed and the parameters are random.” This is also sometimes invoked when somebody asks about the correct interpretation of confidence intervals.
In general and after a significant amount of exegesis, I can see how the whole fixed versus random thing makes sense on a certain level.
In Frequentism, the general reasoning starts by assuming a certain fixed, known parameter (say, that a group difference on the population level is zero). We then think about what we expect for repeated random draws of data (say, the expected distribution of observed group differences in samples of a certain size if the population difference is actually zero). And then we use this to draw inferences – for example, if the observed group difference in a sample is so large that it would be unlikely to occur if the group difference on the population level is zero, we discard the hypothesis that the group difference on the population level is zero. At all points in the process, probabilities refer to relative frequencies of observed data.
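As a toy sketch of that reasoning (my own example, with made-up numbers): simulate the null world in which the population-level group difference is zero, and check how often it produces a sample difference at least as large as some hypothetical observed one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Null world: the population-level group difference is exactly zero.
n_per_group, sd, n_sims = 30, 1.0, 100_000
group_a = rng.normal(0.0, sd, (n_sims, n_per_group)).mean(axis=1)
group_b = rng.normal(0.0, sd, (n_sims, n_per_group)).mean(axis=1)
null_diffs = group_a - group_b

observed_diff = 0.6   # hypothetical observed group difference
# Relative frequency of null-world samples at least as extreme as observed:
p = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"approximate two-sided p-value: {p:.4f}")
# If this is small, we discard the hypothesis of zero population difference.
```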
In contrast, in Bayesian statistics, we have observed some data and then try to infer which parameter would have most likely resulted in these data. So, we do end up with probabilities for parameters (which quantify our degree of belief, rather than relative frequencies).
If we now equate “has a probability attached to it” with “random” (we think of it as a random variable) and “has no probability attached to it” with “fixed”, I guess it works out.
The thing is, this just describes differences in the inferential procedure, not some deeper difference about the world and how we make sense of it. It’s not like Bayesians don’t believe in true parameters that take on certain values; the true parameters are just unknown – but we can make probability statements about them, yay! For Frequentists, the true parameters are unknown as well, even if the whole inferential procedure involves reasoning about scenarios with known parameters (such as “assuming the group difference in the population is actually zero”, aka assuming the null hypothesis).
But then invoking “the parameters are fixed, the data are random” as an explanation for why confidence intervals should not be interpreted in a certain manner seems to boil down to “In this Frequentist house we do not make probability statements of this type.” Which, fair enough, is essentially the same as the explanation further above. But I find it confusing because it sounds a lot more mysterious, as if there was some fundamental disagreement about the state of the world – rather than just a difference in how we try to learn about the world.
Appendix 2: A biblically accurate alternative interpretation of confidence intervals
There is an alternative fully Frequentist interpretation of confidence intervals that more closely connects them to the idea of null-hypothesis significance testing: The 95% confidence interval contains all possible population parameters that, had they been our null hypothesis, would not have been rejected at an alpha of .05 given our data.[8] One could shorten this to “the confidence interval contains all parameter values that we cannot reject” or “the confidence interval contains all parameter values that are compatible with the data.” The appeal of this is that it’s snappier than any statement about coverage, while not being wrong according to Frequentist logic. It also does feel like an explanation, although when you think about it, it assumes that you have understood null-hypothesis significance testing, and null-hypothesis significance testing is confusing in its own right. If you can assume that people have understood this one complicated thing, can’t you just assume that they have also understood confidence intervals and don’t need any interpretation at all? Then again, if you are deeply confused about the whole matter, maybe this is an elegant solution that buries any confusion one level deeper, where it won’t upset people who really care about confidence intervals.
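Here’s a quick sketch of that duality, using scipy’s one-sample t-test on made-up data: scan a grid of candidate null values, keep those the test does not reject at an alpha of .05, and compare with the standard 95% interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(170.0, 10.0, size=50)   # made-up height data

# Keep every candidate population mean that a two-sided one-sample
# t-test at alpha = .05 would NOT reject, given these data.
grid = np.linspace(160.0, 180.0, 4001)
kept = [mu0 for mu0 in grid
        if stats.ttest_1samp(sample, popmean=mu0).pvalue >= 0.05]
print("inverted-test interval: ", (min(kept), max(kept)))

# The standard 95% t-interval agrees, up to the grid resolution.
print("95% confidence interval:",
      stats.t.interval(0.95, len(sample) - 1,
                       loc=sample.mean(), scale=stats.sem(sample)))
```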
Acknowledgments: Martin Modrák and Casper Albers were so kind to provide helpful feedback on this blog post. As always, they bear no responsibility if I got anything wrong. That’s all Malte’s fault.
Footnotes
[1] And they will say so on ~~Twitter~~ Bluesky, and they will all link to a StackOverflow thread from 2009 which they say “should be quite educational”, but it’s just pages and pages of arguments between GutenTagistician42 and Chi_senbergUncertainty over specific phrases permitted when interpreting confidence intervals.
[2] There is a side issue here that the data are almost never a random sample from a population that is a-priori well-defined. For the present purpose, we’re going to resolve this by cope: The population is whatever population your sample is a random sample from.
[3] 95% refers to the nominal coverage probability. The “coverage guarantee” of the confidence interval must hold for repeated simulations from any parameter value individually. In that manner, confidence intervals describe the “worst case” scenario, which in turn causes issues in non-trivial models, as Martin Modrák points out. We are going to assume throughout that the confidence interval at stake is well-behaved and estimated within the context of a fairly simple model, to keep things simple.
[4] Malte asked: If a Frequentist provides a Frequentist interval and you interpret it as a Bayesian, aren’t you the one committing the crime? But as a Bayesian, it’s perfectly legal to take whatever you can get to feed your inference machine. See Figure 1. Going one step further, you could actually just be a Frequentist who is less squeamish about probabilities, a position that Norm Matloff champions here. If you look at the whole thread and the quote posts, you will see that some people strongly disagree.
[5] This is a bit of a trope in the Bayesian/Frequentist debate. At some point, somebody will point out that in many common scenarios, inferences end up the same. Then somebody else writes a reply to point out scenarios in which this is not the case. Then they have a civil disagreement on social media, and live happily ever after.
[6] And you may miss points on your forthcoming stats exam. If you have been taught the Frequentist way, write down the Frequentist answer.
[7] Just like growth mindset, except you frequently confuse yourself.
[8] Usually, the confidence intervals that people report are two-sided confidence intervals, and so they correspond to two-sided tests. It’s also possible to report confidence intervals that correspond to one-sided tests, although those are a bit confusing, because on one side they extend towards (minus or plus) infinity. If you ever get into the awkward situation of wanting to report a confidence interval from which one can read off the results of one-sided tests with an alpha of .05, I’d recommend simply calculating the 90% confidence interval. That can be used to conduct one-sided tests with an alpha of .05 (in both directions).
Comments

Good post except for this statement that suggests Bayesian approaches are ultimately better, even if the difference may be small.
“Another concern I sense from the Bayesian side is that if only people realise that (1) confidence intervals cannot be interpreted that way, but (2) credible intervals can be interpreted that way, they’ll become Bayesians. While I do not know whether it will play out that way empirically, I do agree that credible intervals are probably a better match for people’s inferential needs if they are all so eager to misinterpret confidence intervals. Morey et al. (2015) make a good case (see also Figure 1).”
There is a cost to using the right language in the interpretation of “uncertainty intervals” (the range of plausible values that cannot be ruled out by the data). To compute a Bayesian interval, Bayesians have to specify an a priori assumption about the true effect size. This prior can be true or false, and as it is not based on data, it is most likely to be false. Data are used to reduce the falseness of the prior, but especially when the data set is small, the prior will still bias the credibility interval. As a result, all credibility intervals are conditional on a prior and will vary for different priors. So, Bayesians would properly have to say, “Conditional on my prior beliefs, I can assert that the 95% credibility interval contains the true value.” This is done in election forecasts, when results of one survey are used to update an existing model based on other information. It makes little sense to rely on some researcher’s prior when they present the results of their study. Who would trust a researcher who says, “Conditional on my prior beliefs, the results support my hypothesis”? Frequentist statistics avoid this problem, which is why they have remained the dominant approach despite Bayesians’ criticism for 100 years.
Hi Uli, thanks for chiming in! Yes, you are absolutely right about the downside (or upside, depending on whom you ask) of conditioning on prior beliefs. I’ll adjust the language a bit; I think in a previous draft I had a side remark “at the cost of having to specify a prior.” Given that the post may be interpreted as “just go Bayesian, it solves this at no cost,” probably good to re-insert some information.
Wow I forgot Rickrolling was a thing
I admit it’s a bit of a vintage meme.
I’m a noob in this field, and I’ve never laughed so hard in my life! I guess I’ve officially leveled up to ‘math nerd’ status 🤓🤣
Is it correct to say that the probability is 1 if the true value happens to be in the confidence interval, and 0 if it is not, and we just don’t know which it is?
I am not sure this would be correct — people have “explained” it that way to me, but others have pointed out that this is not frequentist canon (possibly because it still assigns probabilities at all?) and, thus, a straw man.