TL;DR: Publication bias is a bitch, but poor hypothesising may be worse.
Estimated reading time: 10 minutes
And this is what I struggle with, right, with Registered Reports and this idea that we should be focusing on process and soundness and all this stuff. If there’s two papers that have equally good methods, that, before I knew the results, I would have said they were equally well-posed questions, but one reports a cure for cancer and the other reports a failed attempt to cure cancer – I’m gonna like the cure for cancer more, and I can’t escape feeling like at some point, you know, that shit matters.
First, a clarification: Sanjay does like Registered Reports (RRs)! He gave the following comment on his comment (meta Sanjay): “Looking at that quote in writing (rather than spoken) and without any context, it might sound like I’m ambivalent about RRs, but that’s not the case. I fully support the RR format and I don’t think what I said is a valid reason not to have them.” The issue is further discussed in a new Black Goat episode.
I have to admit this statement was a bit startling when I first heard it – Sanjay doesn’t like null results? But, but… publication bias! Ok, I should say that I am a bit über-anxious when it comes to this issue. I think that our collective bias against null results is one of the main causes of the replication crisis, and that makes me want everyone to embrace null results like their own children and dance in a circle around them singing Kumbaya.
But Sanjay is right of course – we all want a cure for cancer;Except for nihilists. (There’s nothing to be afraid of, Donny.) we want to find out what is true, not what isn’t. And that is why positive results will always feel more interesting and more important to us than null results. This is the root of publication bias: Every player in the publication system (readers, journal editors, reviewers, authors) is biased against null results. And every player expects every other player to be biased against null results and tries to cater for that to make progress.
Of course there are exceptions – sometimes we don’t buy a certain theory or don’t want it to be true (e.g. because it competes with our own theory). In these cases we can be biased against positive results. But overall, on average, all things being equal, I would say a general bias towards positive findings is a fair assessment of our system and ourselves, and this is what we’ll talk about today.
In this post, I will try to make the case for null results. I’m up against Sanjay Srivastava’s gut feeling, so this better be convincing. Ok, here we go: Four reasons why we should love null results.
1) We are biased against null results
Because we find positive results more important and interesting, we fall prey to motivated reasoning: We will judge studies reporting positive results to be of higher quality than studies reporting null results. We will retrospectively downgrade our impression of the methods of a paper after learning it produced null results. And we will be more motivated to find flaws in the methods, while the methods of papers reporting positive results will get an easier pass.I like the illustration of motivated reasoning by Gilovich (1991), explained here: Evidence in favour of your prior convictions will be examined taking a “Can I believe this?” stance, whereas evidence in opposition to your prior beliefs will be examined taking a “Must I believe this?” stance. The must stance typically gives evidence a much harder time to pass the test.
This means that as soon as we know the results of a study, we are no longer competent judges of the used research methods. But we know that sound methods make and break what we can learn from a study. This is why we absolutely must shield ourselves from judging papers based on or after knowing the results.
In other words: It’s ok to prefer the cancer-cure-finding RR to the non-cancer-cure-finding RR.Well… not really, though. If this leads to better outcomes for authors of positive results (fame, citations, grants), you still select for people who are willing and able to game the remaining gameable aspects of the system. But they have to be RRs, because we are guaranteed to fool ourselves if we base publication decisions on this feeling.
Reason #1: We should love null results to counter our tendency to underestimate their quality.
2) Null results are unpopular because our epistemology sucks
NB: I tried to avoid going down the epistemological and statistical rabbit holes of NHST and instead focus on the practical surface of NHST as it’s commonly used by psychologists, with all the shortcomings this entails.
This section was partly inspired by Daniël Lakens’ recent workshop in Munich, where we looked at the falsifiability of hypotheses in published papers.
I think one reason why null results are unpopular is that they don’t tell us if the hypothesis we are interested in is likely to be false or not.
The most common statistical framework in psychology is null hypothesis significance testing (NHST). We start out with a shiny new hypothesis, Hshinynew, which typically postulates an effect; a difference between conditions or a relationship between variables. But, presumably because we like it so much that we wouldn’t want it to come to any harm, we never actually test it. Instead, we set up and test a null hypothesis (H0) of un-shiny stuff: no effect, no difference or no relationship.Better known as Gigerenzer’s “null ritual” If our test comes up significant, p < .05, we reject H0, accept Hshinynew, and fantasise about how much ice cream we could buy from our hypothetical shiny new grant money. But what happens when p ≥ .05? P-hacking aside: When was the last time you read a paper saying “turns out we were wrong, p > .05″? NHST only tests H0. The p-value says nothing about the probability of Hshinynew being true. A non-significant p-value means that either H0 is true or you simply didn’t have enough power to reject it. In a Bayesian sense, data underlying a non-significant p-value can be strong evidence for the null or it can be entirely inconclusive (and everything in between).
“In science, the only failed experiment is one that does not lead to a conclusion.”
(Mack, 2014, p. 030101-1)
Maybe it’s just me, but I do find strong evidence for H0 interesting. Or, if you’re not a fan of Bayesian thinking: rejecting Hshinynew with a low error rate.These two are not identical of course, but you get the idea. I assume that we don’t reject Hshinynew whenever p ≥ .05 mainly because we like it too much. But we could, and thanks to Neyman & Pearson we would know our error rate (rejecting Hshinynew when it is true): beta, more commonly known as 1-power. With 95% power, you wouldn’t even fool yourself more often when rejecting Hshinynew than when rejecting H0.
There must be a catch, right? Of course there is. Das Leben ist kein Ponyhof, as we say in German (life isn’t a pony farm). As you know from every painful minute spent on the Sample Size Samba, power depends on effect size. With 95% power for detecting dshinynew, you have less than 95% power for detecting anything smaller than dshinynew. So the catch is that we must commit ourselves to defining Hshinynew more narrowly than “something going on” and think about which effect size we expect or are interested in.
I think we could gain some null-result fans back if we set up our hypothesis tests in a way that would allow us to conclude more than “H0 could not be rejected for unknown reasons”. This would of course leave us with a lot less wiggle space to explain how our shiny new hypothesis is still true regardless of our results – in other words, we would have to start doing actual science, and science is hard.
Reason #2: Yeah, I kind of get why you wouldn’t love null results in these messy circumstances. But we could love them more if we explicitly made our alternative hypotheses falsifiable.
3) Null results come before positive results
Back to the beginning of this post: We all want to find positive results. Knowing what’s true is the end goal of this game called scientific research. But most of us agree that knowledge can only be accumulated via falsification. Due to several unfortunate hiccups of the nature of our existence and consciousness, we have no direct access to what is true and real. But we can exclude things that certainly aren’t true and real.
Imagine working on a sudoku – not an easy-peasy one from your grandma’s gossip magazinesbracing myself for a shitstorm of angry grannies but one that’s challenging for your smartypants brain. For most of the fields you’ll only be able to figure out the correct number because you can exclude all other numbers. Before you finally find that one number, progress consists in ruling out another number.
Now let’s imagine science as one huge sudoku, the hardest one that ever existed. Let’s say our future depends on scientists figuring it out. And we don’t have much time. What you’d want is a) putting the smartest people of the planet on it, and b) a Google spreadsheet,
because Google spreadsheets rock so that they could make use of anyone else’s progress instantly. You would want them to tell each other if they found out that a certain number does not go into a certain field.
Reason #3: We should love null results because they are our stepping stones to positive results, and although we might get lucky sometimes, we can’t just decide to skip that queue.
4) Null results are more informative
The number of true findings in the published literature depends on something significance tests can’t tell us: The base rate of true hypotheses we’re testing. If only a very small fraction of our hypotheses are true, we could always end up with more false positives than true positives (this is one of the main points of Ioannidis’ seminal 2005 paper).
When Felix Schönbrodt and Michael Zehetleitner released this great Shiny app a while ago, I remember having some vivid discussions with Felix about what the rate of true hypotheses in psychology may be. In his very nice accompanying blog post, Felix included a flowchart assuming 30% true hypotheses. At the time I found this grossly pessimistic: Surely our ability to develop hypotheses can’t be worse than a coin flip? We spent years studying psychology! We have theories! We are really smart! I honestly believed that the rate of true hypotheses we study should be at least 60%.
A few months ago, this interesting paper by Johnson, Payne, Want, Asher, & Mandal came out. They re-analysed 73 effects from the RP:P data and tried to model publication bias. I have to admit that I’m not maths-savvy enough to understand their model and judge their method,I tell myself it’s ok because this is published in the Journal of the American Statistical Association. but they estimate that over 700 hypothesis tests were run to produce these 73 significant results. They assume that power for tests of true hypotheses was 75%, and that 7% of the tested hypotheses were true. Seven percent.
Uh, umm… so not 60% then. To be fair to my naive 2015 self: this number refers to all hypothesis tests that were conducted, including p-hacking. That includes the one ANOVA main effect, the other main effect, the interaction effect, the same three tests without outliers, the same six tests with age as covariate, … and so on.
Let’s see what these numbers mean for the rates of true and false findings. If you’re anything like me, you can vaguely remember that the term “PPV” is important for this, but you can’t quite remember what it stands for and that scares you so much that you don’t even want to look at it if you’re honest… For the frightened rabbit in me and maybe in you, I’ve made a wee table to explain the PPV and its siblings NPV, FDR, and FOR.
Ok, now we got that out of the way, let’s stick the Johnson et al. numbers into a flowchart. You see that the PPV is shockingly low: Of all significant results, only 53% are true. Wow. I must admit that even after reading Ioannidis (2005) several times, this hadn’t quite sunk in. If the 7% estimate is anywhere near the true rate, that basically means that we can flip a coin any time we see a significant result to estimate if it reflects a true effect.
But I want to draw your attention to the negative predictive value. The chance that a non-significant finding is true is 98%! Isn’t that amazing and heartening? In this scenario, null results are vastly more informative than significant results.
I know what you’re thinking: 7% is ridiculously low. Who knows what those statisticians put into their Club Mate when they calculated this? For those of you who are more like 2015 Anne and think psychologists are really smart, I plotted the PPV and NPV for different levels of power across the whole range of the true hypothesis rate, so you can pick your favourite one. I chose five levels of power: 21% (neuroscience estimate by Button et al., 2013), 75% (Johnson et al. estimate), 80% and 95% (common conventions), and 99% (upper bound of what we can reach).
The plot shows two vertical dashed lines: The left one marks 7% true hypotheses, as estimated by Johnson et al. The right one marks the intersection of PPV and NPV for 75% power: This is the point at which significant results become more informative than negative results. That happens when more than 33% of the studied hypotheses are true. So if Johnson et al. are right, we would need to up our game from 7% of true hypotheses to a whopping 33% to get to a point where significant results are as informative as null results!
This is my take-home message: We are probably in a situation where the fact that an effect is significant doesn’t tell us much about whether or not it’s real. But: Non-significant findings likely are correct most of the time – maybe even 98% of the time. Perhaps we should start to take them more seriously.
Reason #4: We should love null results because they are more likely to be true than significant results.
Especially Reason #4 has been quite eye-opening for me and thrown up a host of new questions – is there a way to increase the rate of true hypotheses we’re testing? How much of this is due to bad tests for good hypotheses? Did Johnson et al. get it right? Does it differ across subfields, and if so, in what way? Don’t we have to lower alpha to increase the PPV given this dire outlook? Or go full Bayesian? Should replications become mandatory?In the Johnson et al. scenario, two significant results in a row boost the PPV to 94%.
I have no idea if I managed to shift anyone’s gut feeling the slightest bit. But hey, I tried! Now can we do the whole Kumbaya thing please?
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376.
Dawson, E., Gilovich, T., & Regan, D. T. (2002). Motivated reasoning and performance on the Wason Selection Task. Personality and Social Psychology Bulletin, 28(10), 1379–1387. doi: 10.1177/014616702236869
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–606.
Gilovich, T. (1991). How we know what isn’t so: The fallibility of human reason in everyday life. New York: Free Press.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517), 1-10. doi: 10.1080/01621459.2016.1240079
Mack, C. (2014). In Praise of the Null Result. Journal of Micro/Nanolithography, MEMS, and MOEMS, 13(3), 030101.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Footnotes [ + ]
|1.||↑||Except for nihilists. (There’s nothing to be afraid of, Donny.)|
|2.||↑||I like the illustration of motivated reasoning by Gilovich (1991), explained here: Evidence in favour of your prior convictions will be examined taking a “Can I believe this?” stance, whereas evidence in opposition to your prior beliefs will be examined taking a “Must I believe this?” stance. The must stance typically gives evidence a much harder time to pass the test.|
|3.||↑||Well… not really, though. If this leads to better outcomes for authors of positive results (fame, citations, grants), you still select for people who are willing and able to game the remaining gameable aspects of the system.|
|4.||↑||Better known as Gigerenzer’s “null ritual”|
|5.||↑||These two are not identical of course, but you get the idea.|
|6.||↑||bracing myself for a shitstorm of angry grannies|
|7.||↑||I tell myself it’s ok because this is published in the Journal of the American Statistical Association.|
|8.||↑||In the Johnson et al. scenario, two significant results in a row boost the PPV to 94%.|