A bit more than a month ago, I arrived in Ann Arbor, Michigan, to spend my summer working with some of the wonderful data sets being collected at the Institute for Social Research of the University of Michigan and to enjoy the unique flair of a US college town (why does it smell like weed everywhere?). So far, I think I have acclimated to local habits pretty well. I’m pre-emptively asking people how they are doing; I’m making sure to tone up my enthusiasm (i.e. “nice” → “that’s amazing!”, “sounds good” → “What a great idea!”, “like it” → “love it to death”) to avoid any misunderstandings. I think I have almost figured out how recycling works here (although it is very confusing; after all, we Germans take our recycling quite seriously). I even feel tempted to start talking about my bigger goals and plans, and how I strive to become the best version of myself (just to prevent people from thinking I’m a slob; actually, I’m sort of the best version of myself that is currently on the market per definition, but then maybe I’m just too lazy to apply counterfactuals to my everyday life, and I’m not sure whether I would be a better and/or happier person if I did). Many times, I enjoy the increased display of positive affect.
However, there is that one thing that I can’t quite get over: The low tolerance for negativity, including most forms of direct or even indirect criticism, in social interactions in the academic context.
I give a talk that goes fairly well. Afterwards, there are some very engaged questions from faculty members whose opinion I value highly. However, later one of the US students asks me whether I didn’t mind that I got roasted. I’m utterly confused because I normally assume that critical questions are a good sign: People are awake and thinking about the topic (instead of sitting in the back of the room, checking their emails, slowly letting their mind drift towards the relaxing summer vacation that is not really going to happen anyway because this is academia, after all).
A student gives a talk about an issue relevant to ethnic minorities. I think that they are asking a great question, but it’s pretty darn obvious that the research design is totally useless to address that great question. For a second, I think about saying something—but then again, I’ve already said a lot of critical stuff that day, so I just cross my fingers and hope somebody else will say something. After all, the central problem is so clear, and it has just been brought up in the previous talk, so I know that the others know—somebody has to bring it up, right? But that doesn’t happen, and so all following “questions” are just affirmations of the importance of the research topic plus the ubiquitous suggestion that W could moderate the association because that totally makes sense following theory X by author YZ.
A friend tells me that they were involved in the preparation of future student instructors. They have been instructed to instruct the instructors to never tell a student that they are wrong. “You are on the right path!”: could be okay if used sensibly. “That was a good try!”: off limits. “That’s wrong”: nope nope nope. (College rankings, and how US universities have been turned into corporations and what that means for academia, are probably a topic for a separate blog post…)
So there you have it. I don’t want to reiterate the tone discussion that we had some time ago within the psychological community. I don’t even want to talk about cultural differences in the way we express criticism, although that is certainly an interesting topic. (Have you heard about the Dutch? Wow, they are so rude!)
I personally, emotionally, viscerally hate being criticized. Even if it’s just opening a manuscript file that has been overhauled or gently annotated by my advisor (who is definitely more the gentle type) or our proofreader (who is also extremely friendly; love her to death. OMG, see what a month in the US did to me): it hits you like a thousand knives stabbing you all over your body.
Life is like a box of chocolates, scientific criticism isn’t. (Photo by Jennifer Pallian)
And yet I think being criticized is an essential part of science.
If nobody criticizes my work, I won’t learn that I’m wrong. Or probably I will know that I’m wrong because that’s my null hypothesis (as with most null hypotheses in psychology, p < .05 most of the time, obviously; just kidding), but I will never learn that I am much wronger than I think I am, and in a multitude of ways. We all need critical feedback from our scientific community to learn. Sparing others from potential negative emotions will do them a disservice in the long run, because science is about figuring out the truth, not about feeling good.
But even though valid scientific criticism can help us learn, it need not be constructive in the sense in which the word is normally used: “Maybe instead of trying it this way, wouldn’t it be nice if you tried it that way?” If somebody points out a valid problem in my work, it is not their job to suggest a solution; that is first and foremost my job, because I decided to dive into that particular research question and now have to figure out the best way to address it. If there is actually no proper solution: tough luck for me.
Scientific criticism does not even need to be nice and friendly. Sure, I generally prefer people who don’t act like assholes, and I firmly believe that being an asshole is not the most effective way to interact with others. But the question of whether a critic is being an asshole is orthogonal to the validity of the scientific argument they raise. If they make a good point, they make a good point.
So the next time I see a research design gone awry, I will probably say something. And so should you.
And this is what I struggle with, right, with Registered Reports and this idea that we should be focusing on process and soundness and all this stuff. If there’s two papers that have equally good methods, that, before I knew the results, I would have said they were equally well-posed questions, but one reports a cure for cancer and the other reports a failed attempt to cure cancer – I’m gonna like the cure for cancer more, and I can’t escape feeling like at some point, you know, that shit matters.
First, a clarification: Sanjay does like Registered Reports (RRs)! He gave the following comment on his comment (meta Sanjay): “Looking at that quote in writing (rather than spoken) and without any context, it might sound like I’m ambivalent about RRs, but that’s not the case. I fully support the RR format and I don’t think what I said is a valid reason not to have them.” The issue is further discussed in a new Black Goat episode.
I have to admit this statement was a bit startling when I first heard it – Sanjay doesn’t like null results? But, but… publication bias! Ok, I should say that I am a bit über-anxious when it comes to this issue. I think that our collective bias against null results is one of the main causes of the replication crisis, and that makes me want everyone to embrace null results like their own children and dance in a circle around them singing Kumbaya.
But Sanjay is right of course: we all want a cure for cancer (except for nihilists; there’s nothing to be afraid of, Donny); we want to find out what is true, not what isn’t. And that is why positive results will always feel more interesting and more important to us than null results. This is the root of publication bias: every player in the publication system (readers, journal editors, reviewers, authors) is biased against null results. And every player expects every other player to be biased against null results and tries to cater to that to make progress.
Of course there are exceptions – sometimes we don’t buy a certain theory or don’t want it to be true (e.g. because it competes with our own theory). In these cases we can be biased against positive results. But overall, on average, all things being equal, I would say a general bias towards positive findings is a fair assessment of our system and ourselves, and this is what we’ll talk about today.
In this post, I will try to make the case for null results. I’m up against Sanjay Srivastava’s gut feeling, so this better be convincing. Ok, here we go: Four reasons why we should love null results.
1) We are biased against null results
Because we find positive results more important and interesting, we fall prey to motivated reasoning: we will judge studies reporting positive results to be of higher quality than studies reporting null results. We will retrospectively downgrade our impression of the methods of a paper after learning it produced null results. And we will be more motivated to find flaws in the methods, while the methods of papers reporting positive results get an easier pass. (I like the illustration of motivated reasoning by Gilovich (1991): evidence in favour of your prior convictions is examined with a “Can I believe this?” stance, whereas evidence in opposition to your prior beliefs is examined with a “Must I believe this?” stance. The must stance gives evidence a much harder time passing the test.)
This means that as soon as we know the results of a study, we are no longer competent judges of the research methods used. But we know that sound methods make or break what we can learn from a study. This is why we absolutely must shield ourselves from judging papers based on, or after knowing, the results.
In other words: it’s ok to prefer the cancer-cure-finding RR to the non-cancer-cure-finding RR. (Well… not really, though. If this leads to better outcomes for authors of positive results (fame, citations, grants), you still select for people who are willing and able to game the remaining gameable aspects of the system.) But they have to be RRs, because we are guaranteed to fool ourselves if we base publication decisions on this feeling.
Reason #1: We should love null results to counter our tendency to underestimate their quality.
2) Null results are unpopular because our epistemology sucks

(NB: I tried to avoid going down the epistemological and statistical rabbit holes of NHST and instead focus on the practical surface of NHST as it’s commonly used by psychologists, with all the shortcomings this entails. This section was partly inspired by Daniël Lakens’ recent workshop in Munich, where we looked at the falsifiability of hypotheses in published papers.)
I think one reason why null results are unpopular is that they don’t tell us if the hypothesis we are interested in is likely to be false or not.
The most common statistical framework in psychology is null hypothesis significance testing (NHST). We start out with a shiny new hypothesis, Hshinynew, which typically postulates an effect: a difference between conditions or a relationship between variables. But, presumably because we like it so much that we wouldn’t want it to come to any harm, we never actually test it. Instead, we set up and test a null hypothesis (H0) of un-shiny stuff: no effect, no difference, no relationship (better known as Gigerenzer’s “null ritual”). If our test comes up significant, p < .05, we reject H0, accept Hshinynew, and fantasise about how much ice cream we could buy with our hypothetical shiny new grant money. But what happens when p ≥ .05? P-hacking aside: when was the last time you read a paper saying “turns out we were wrong, p > .05”? NHST only tests H0. The p-value says nothing about the probability of Hshinynew being true. A non-significant p-value means that either H0 is true or you simply didn’t have enough power to reject it. In a Bayesian sense, data underlying a non-significant p-value can be strong evidence for the null or it can be entirely inconclusive (and everything in between).
“In science, the only failed experiment is one that does not lead to a conclusion.”
(Mack, 2014, p. 030101-1)
Maybe it’s just me, but I do find strong evidence for H0 interesting. Or, if you’re not a fan of Bayesian thinking: rejecting Hshinynew with a low error rate. (These two are not identical of course, but you get the idea.) I assume that we don’t reject Hshinynew whenever p ≥ .05 mainly because we like it too much. But we could, and thanks to Neyman & Pearson we would know our error rate for rejecting Hshinynew when it is true: beta, more commonly known as 1 − power. With 95% power, you wouldn’t even fool yourself more often when rejecting Hshinynew than when rejecting H0.
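To make that symmetry concrete, here is a small simulation sketch (the effect size and sample size are made-up illustrations, not numbers from the post): with roughly 95% power, treating p ≥ .05 as grounds for rejecting Hshinynew is wrong only about 5% of the time when Hshinynew is actually true, mirroring the 5% error rate of rejecting H0 when H0 is true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d, n = 0.5, 105  # n per group gives ~95% power for d = 0.5 at alpha = .05

def sim_pvalues(true_d, n, n_sims=20_000):
    # Simulate many two-sample t-tests with a given true effect size
    a = rng.normal(0.0, 1.0, size=(n_sims, n))
    b = rng.normal(true_d, 1.0, size=(n_sims, n))
    return stats.ttest_ind(b, a, axis=1).pvalue

# When Hshinynew is true (d = 0.5): p >= .05, i.e. wrongly rejecting it, is rare
p_h1 = sim_pvalues(d, n)
print((p_h1 >= .05).mean())  # ≈ .05, i.e. beta = 1 - power

# When H0 is true (d = 0): p < .05 occurs at the alpha level
p_h0 = sim_pvalues(0.0, n)
print((p_h0 < .05).mean())   # ≈ .05
```

With both error rates at about 5%, rejecting Hshinynew on a non-significant result is no more reckless than rejecting H0 on a significant one.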
There must be a catch, right? Of course there is. Das Leben ist kein Ponyhof, as we say in German (life isn’t a pony farm). As you know from every painful minute spent on the Sample Size Samba, power depends on effect size. With 95% power for detecting dshinynew, you have less than 95% power for detecting anything smaller than dshinynew. So the catch is that we must commit ourselves to defining Hshinynew more narrowly than “something going on” and think about which effect size we expect or are interested in.
I think we could gain some null-result fans back if we set up our hypothesis tests in a way that would allow us to conclude more than “H0 could not be rejected for unknown reasons”. This would of course leave us with a lot less wiggle space to explain how our shiny new hypothesis is still true regardless of our results – in other words, we would have to start doing actual science, and science is hard.
Reason #2: Yeah, I kind of get why you wouldn’t love null results in these messy circumstances. But we could love them more if we explicitly made our alternative hypotheses falsifiable.
3) Null results come before positive results
Back to the beginning of this post: We all want to find positive results. Knowing what’s true is the end goal of this game called scientific research. But most of us agree that knowledge can only be accumulated via falsification. Due to several unfortunate hiccups of the nature of our existence and consciousness, we have no direct access to what is true and real. But we can exclude things that certainly aren’t true and real.
Imagine working on a sudoku: not an easy-peasy one from your grandma’s gossip magazines (bracing myself for a shitstorm of angry grannies) but one that’s challenging for your smartypants brain. For most of the fields, you’ll only be able to figure out the correct number because you can exclude all other numbers. Before you finally find that one number, progress consists in ruling out another number.
Now let’s imagine science as one huge sudoku, the hardest one that ever existed. Let’s say our future depends on scientists figuring it out, and we don’t have much time. What you’d want is a) to put the smartest people on the planet on it, and b) a Google spreadsheet (because Google spreadsheets rock), so that they could make use of anyone else’s progress instantly. You would want them to tell each other if they found out that a certain number does not go into a certain field.
Reason #3: We should love null results because they are our stepping stones to positive results, and although we might get lucky sometimes, we can’t just decide to skip that queue.
4) Null results are more informative
The number of true findings in the published literature depends on something significance tests can’t tell us: The base rate of true hypotheses we’re testing. If only a very small fraction of our hypotheses are true, we could always end up with more false positives than true positives (this is one of the main points of Ioannidis’ seminal 2005 paper).
When Felix Schönbrodt and Michael Zehetleitner released this great Shiny app a while ago, I remember having some vivid discussions with Felix about what the rate of true hypotheses in psychology may be. In his very nice accompanying blog post, Felix included a flowchart assuming 30% true hypotheses. At the time I found this grossly pessimistic: Surely our ability to develop hypotheses can’t be worse than a coin flip? We spent years studying psychology! We have theories! We are really smart! I honestly believed that the rate of true hypotheses we study should be at least 60%.
A few months ago, this interesting paper by Johnson, Payne, Wang, Asher, & Mandal came out. They re-analysed 73 effects from the RP:P data and tried to model publication bias. I have to admit that I’m not maths-savvy enough to understand their model and judge their method (I tell myself it’s ok because this is published in the Journal of the American Statistical Association), but they estimate that over 700 hypothesis tests were run to produce these 73 significant results. They assume that power for tests of true hypotheses was 75%, and that 7% of the tested hypotheses were true. Seven percent.
Uh, umm… so not 60% then. To be fair to my naive 2015 self: this number refers to all hypothesis tests that were conducted, including p-hacking. That includes the one ANOVA main effect, the other main effect, the interaction effect, the same three tests without outliers, the same six tests with age as covariate, … and so on.
Let’s see what these numbers mean for the rates of true and false findings. If you’re anything like me, you can vaguely remember that the term “PPV” is important for this, but you can’t quite remember what it stands for and that scares you so much that you don’t even want to look at it if you’re honest… For the frightened rabbit in me and maybe in you, I’ve made a wee table to explain the PPV and its siblings NPV, FDR, and FOR.
Ok, now we got that out of the way, let’s stick the Johnson et al. numbers into a flowchart. You see that the PPV is shockingly low: Of all significant results, only 53% are true. Wow. I must admit that even after reading Ioannidis (2005) several times, this hadn’t quite sunk in. If the 7% estimate is anywhere near the true rate, that basically means that we can flip a coin any time we see a significant result to estimate if it reflects a true effect.
But I want to draw your attention to the negative predictive value. The chance that a non-significant finding is true is 98%! Isn’t that amazing and heartening? In this scenario, null results are vastly more informative than significant results.
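For those who want to check the arithmetic behind the flowchart, the four quantities can be computed directly from the base rate of true hypotheses, power, and alpha. A minimal sketch using the Johnson et al. estimates:

```python
def diagnostic_rates(base_rate, power, alpha=0.05):
    """Return PPV, NPV, FDR, FOR for a given rate of true hypotheses."""
    tp = base_rate * power              # true positives
    fp = (1 - base_rate) * alpha        # false positives
    fn = base_rate * (1 - power)        # false negatives (missed true effects)
    tn = (1 - base_rate) * (1 - alpha)  # true negatives
    ppv = tp / (tp + fp)   # P(hypothesis true | significant result)
    npv = tn / (tn + fn)   # P(hypothesis false | non-significant result)
    return ppv, npv, 1 - ppv, 1 - npv   # FDR = 1 - PPV, FOR = 1 - NPV

ppv, npv, fdr, for_ = diagnostic_rates(base_rate=0.07, power=0.75)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")  # PPV = 53%, NPV = 98%
```

Plugging in 7% true hypotheses and 75% power reproduces both numbers from the text: 53% of significant results are true, while 98% of non-significant results correctly flag a false hypothesis.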
I know what you’re thinking: 7% is ridiculously low. Who knows what those statisticians put into their Club Mate when they calculated this? For those of you who are more like 2015 Anne and think psychologists are really smart, I plotted the PPV and NPV for different levels of power across the whole range of the true hypothesis rate, so you can pick your favourite one. I chose five levels of power: 21% (neuroscience estimate by Button et al., 2013), 75% (Johnson et al. estimate), 80% and 95% (common conventions), and 99% (upper bound of what we can reach).
The plot shows two vertical dashed lines: The left one marks 7% true hypotheses, as estimated by Johnson et al. The right one marks the intersection of PPV and NPV for 75% power: This is the point at which significant results become more informative than negative results. That happens when more than 33% of the studied hypotheses are true. So if Johnson et al. are right, we would need to up our game from 7% of true hypotheses to a whopping 33% to get to a point where significant results are as informative as null results!
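The crossover point can also be derived in closed form: setting PPV = NPV and solving gives true-hypothesis odds of √(α(1 − α) / (power · (1 − power))). A quick sketch to verify the 33% figure (the function name is mine, not from the post):

```python
import math

def crossover_base_rate(power, alpha=0.05):
    """Base rate of true hypotheses at which PPV equals NPV."""
    beta = 1 - power
    odds = math.sqrt(alpha * (1 - alpha) / (power * beta))
    return odds / (1 + odds)  # convert odds to a probability

print(round(crossover_base_rate(0.75), 2))  # 0.33
print(round(crossover_base_rate(0.95), 2))  # 0.5
```

A neat side effect of the formula: with 95% power and alpha = .05, the crossover sits at exactly 50%, so even then significant results only become the more informative kind once at least half of our hypotheses are true.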
This is my take-home message: We are probably in a situation where the fact that an effect is significant doesn’t tell us much about whether or not it’s real. But: Non-significant findings likely are correct most of the time – maybe even 98% of the time. Perhaps we should start to take them more seriously.
Reason #4: We should love null results because they are more likely to be true than significant results.
Especially Reason #4 has been quite eye-opening for me and thrown up a host of new questions – is there a way to increase the rate of true hypotheses we’re testing? How much of this is due to bad tests for good hypotheses? Did Johnson et al. get it right? Does it differ across subfields, and if so, in what way? Don’t we have to lower alpha to increase the PPV given this dire outlook? Or go full Bayesian? Should replications become mandatory?In the Johnson et al. scenario, two significant results in a row boost the PPV to 94%.
I have no idea if I managed to shift anyone’s gut feeling the slightest bit. But hey, I tried! Now can we do the whole Kumbaya thing please?
References Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376.
Dawson, E., Gilovich, T., & Regan, D. T. (2002). Motivated reasoning and performance on the Wason Selection Task. Personality and Social Psychology Bulletin, 28(10), 1379–1387. doi: 10.1177/014616702236869
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–606.
Gilovich, T. (1991). How we know what isn’t so: The fallibility of human reason in everyday life. New York: Free Press.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517), 1-10. doi: 10.1080/01621459.2016.1240079
Mack, C. (2014). In Praise of the Null Result. Journal of Micro/Nanolithography, MEMS, and MOEMS, 13(3), 030101.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
TL;DR: What’s an age effect net of all time-varying covariates? The sound of one hand clapping.
Recently, we submitted a paper with some age trajectories of measures of individuals’ (un-)well-being. We thought of these trajectories in the most descriptive way: how do these measures change across the life course, all things considered? While this might not be the most interesting kind of research question, because it doesn’t directly answer why stuff happens, I’m a fan of simple descriptive studies and think they should have a place in our domain; Paul Rozin wrote a great piece on the importance of descriptive studies.
Anyway, the editor asked us to justify why we did not include any time-varying covariates (e.g. income, education, number of children, health) in our analysis of age trajectories. I thought the editor had requested an actual justification; my co-author (an economist) thought the editor just wanted to tell us that we should throw in all sorts of covariates. I felt too lazy to re-run all analyses and create new figures and tables (Ruben, on the other hand, would probably get a twitch in his right eye if he found out I did not automate making my figures and tables), plus I always get a weird twitch in my left eye when somebody asks for “statistical control” without additional justification. So instead I looked into the (scientific) literature on the midlife crisis and tried to figure out how people have justified the inclusion of control variables in analyses of age effects on well-being.
Cat ownership, a time-varying covariate. (Pic: pixabay.com)
Whether or not life satisfaction dips in middle adulthood (somewhere between ages 45 and 64) before rising again in older age (before dipping again in a terminal decline; if you read that parenthesis, you’re now in the mortality salience condition of my Terror Management Theory study) has been hotly debated by psychologists and economists. There are a lot of papers out there on the subject, and personally, I’m totally agnostic regarding the existence of the midlife crisis: ask me again in 20 years, if I’m not too busy driving my Porsche. But there are a lot of interesting methodological questions that arise when trying to answer this question.
A brief list of stuff I don’t want to talk about in this post, which are important nonetheless:
the Age-Period-Cohort conundrum: In short, this requires us to make certain assumptions when we want to identify age/period/cohort effects. That’s okay though, every researcher needs to make assumptions from time to time.
longitudinal vs. cross-sectional data: Both can have their pros and cons.
what we can learn from lab studies in which researchers recruit older people and then compare their performance on an arbitrary task X to the performance of their convenient undergraduate sample. How do you reasonably match 60 year old people that decided to participate in lab studies onto a younger sample of psych majors that really just want to get their freakin’ course credit?
lots of other interesting stuff you can do with longitudinal data that is more interesting than simple descriptive trajectories
But let’s get back to the topic our editor raised: Should we control for time-varying covariates such as income, marital status, health? The logic seems straightforward: Wouldn’t we want to “purge” the relationship between age and life satisfaction from other factors?
Quite obviously, a lot of stuff changes as we age. We get older, get our degrees, start a decent job and make some money (or, alternatively, start a blog), maybe marry and settle down, or travel to an Ashram to awaken our inner goddess and spite our conservative parents, or maybe just get a lot of cats.
Mount Midlife Crisis, not to be confused with Mount Doom. (Art: Hakuin Ekaku)
To control for these variables might be wrong for two distinct reasons, and I will start with the somewhat more obscure one.
First, our time-varying covariate might actually be causally affected by life satisfaction. This argument has been raised regarding the statistical control of marital status by the late Norval Glenn (2009). He simulated a data situation in which (1) life satisfaction is stable across the life course and (2) starting from age 21, only the 10 happiest people marry each year. He then demonstrated that controlling for marital status will result in a spurious trajectory, that is, a pronounced decline of life satisfaction over the life course, even though we know that there’s no age effect in the underlying data. If you have read this blog before and the data situation sounds somewhat familiar: marital status would be one of the infamous colliders that you should not control for, because if you do, I will come after you. (And you should be scared, because I can deliver angry rants about inappropriate treatment of third variables. I might bring cookies, and hope that you have decent coffee, because this might take longer.) If marital status is affected by age (the older you are, the more likely you are to be married), and if satisfied people are more likely to marry, marital status becomes a collider of its two causes and should not be controlled.
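Glenn’s argument is easy to reproduce. Here is a minimal numpy sketch in the spirit of his simulation (cohort sizes and marriage rates are invented for illustration): satisfaction is flat across age by construction, and in each cohort the happiest people are the ones who are married; controlling for marital status then conjures up a spurious negative age trend.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000                                 # people per age cohort (made up)
age = np.repeat(np.arange(21, 61), n)
sat = rng.normal(size=age.size)          # life satisfaction: flat in age by design

# Each year, the happiest unmarried people marry: the share married grows
# with age and always comes from the top of the satisfaction distribution.
married = np.zeros(age.size, dtype=float)
for a in range(21, 61):
    idx = age == a
    share = (a - 21) * 0.02              # 0% married at 21, 78% at 60 (toy numbers)
    if share > 0:
        cut = np.quantile(sat[idx], 1 - share)
        married[idx] = (sat[idx] > cut).astype(float)

def ols_slope_on_age(y, *controls):
    # OLS via least squares; return the coefficient on age
    X = np.column_stack([np.ones_like(y), age, *controls])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(ols_slope_on_age(sat))           # ~0: no age trend in the raw data
print(ols_slope_on_age(sat, married))  # clearly negative: collider bias
```

Within each marital-status group, older cohorts are less positively selected on happiness, so conditioning on the collider manufactures a decline that simply is not there in the marginal trajectory.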
The second reason is somewhat more obvious: in many cases, the time-varying covariates will mediate the effects of age on your outcome. That is probably most obvious for health: health declines with age, and decreases in health affect life satisfaction. (They obviously do, though the fact that life satisfaction remains stable until a certain age despite decreases in health has been labeled the happiness paradox.) So life satisfaction might decrease with age because of declining health. Now what does it mean if we control for this potential mediator?
Well, it means that we estimate the age effect net of the parts that are mediated through health. That is not inherently nonsensical; we just have to interpret the estimate properly. For example, Andrew Oswald was quoted in Vol. 30 of the Observer: “[But] encouragingly, by the time you are 70, if you are still physically fit then on average you are as happy and mentally healthy as a 20 year old.” Now this might indeed be encouraging for people who think they are taking great care of their health and predict that they will be healthy by the time they are 70; but whether it’s encouraging on average strongly depends on the average health at age 70.
For example, if we assume that only the luckiest 1% of the population will be physically fit at that age, 99% will end up unhappier than 20-year-olds (whether or not 20-year-olds are very happy is a different question). That doesn’t sound very optimistic any more, does it? The lucky one percent might also be very special with respect to other characteristics such as income, and a message such as “the wealthy will still be happy at age 70, whereas the poor are wasting away because of a lack of health care” again doesn’t sound very encouraging. For the record, I’m not claiming that this is happening, but these are all scenarios compatible with the simple statement that those who are physically fit at age 70 are as mentally healthy as 20-year-olds.
So the estimated association has its own justification but must be interpreted carefully. (Actually, there is yet another problem, pointed out by Felix Thoemmes in the comments section of this post: a mediator is almost always also a collider, as it is by definition caused by the independent variable of interest and, most likely, by some other factors. So you would actually have to control the backdoor paths from the mediator in turn, or else your estimate of the direct effect, whatever it reflects, will be biased again.) Additionally, it renders the “remaining” age effect hard to interpret, so it might not be very enlightening to look at age effects net of the effects of time-varying covariates. Let’s assume you “control” for all sorts of stuff that happens in life as people age (marital status, education, income, number of children, maybe also number of good friends, cat ownership, changes in health, and, while we’re at it, why don’t we also control for the stuff underlying changes in health, such as the functioning of organs and cell damage?) and still find a significant age effect.
What does that mean? Well, it means that you haven’t included all time-varying covariates that are relevant to life satisfaction because age effects must necessarily be mediated through something. The sheer passing of time only has effects because stuff happens in that time.
The “stuff” might be all sorts of things, and we might be inclined to consider some of that stuff more psychologically meaningful than the rest. For example, we might not consider changes in marital status or physical health to be “genuinely psychological”, so we might decide to control for these things to arrive at a purely psychological age effect. Such a “purely psychological” age effect might then be driven by, e.g., people’s attitude towards the world: people might get more optimistic and thus more satisfied, controlling for other life circumstances. But I would again be careful with such interpretations, because of the collider problem outlined before, and because of the somewhat arbitrary distinction between physical and social-role changes as opposed to psychological changes.
In other words: what you should or shouldn’t control for always depends on your research question. If you study living things and control for life, don’t be surprised if your results seem a bit dull.
Ruben, on the other hand, would probably get a twitch in his right eye if he found out I did not automate making my figures and tables.
before dipping again in a terminal decline. If you read that footnote, you’re now in the mortality salience condition of my Terror Management Theory study.
Or, alternatively, start a blog.
And you should be scared because I can deliver angry rants about inapproriate treatment of third variables. I might bring cookies and hope that you have decent coffee because this might take longer.
They obviously do, though the fact that life satisfaction remains stable until a certain age despite decreases in health has been labeled the happiness paradox.
“The pursuit of knowledge is, I think, mainly actuated by love of power. And so are all advances in scientific technique.”
― Bertrand Russell
Today on the 100% CI we have to stop the jokes for a moment. Today, we will talk about being disappointed by your idols, about science criticism being taken too far. Here, we reveal how Dr. Andrew Gelman, the prominent statistician and statistics blogger, abused his power.
We do not make this accusation lightly, but our discoveries leave no other conclusion: In his skepticism of psychological science, Dr. Gelman lost sight of right and wrong. What follows is a summary of the evidence we obtained.
This March the four of us visited the Applied Statistics Center at Columbia University for a brief workshop on Stan, the probabilistic programming language. We were excited about this opportunity to learn from one of our personal heroes.
The course, however, did not live up to our expectations. Frequently, Dr. Gelman would interrupt the course with diatribes against psychological science. On the second-to-last afternoon, we were supposed to write our first Stan code in a silent study session. We were left alone in the Gelman lab and our minds wandered. Our attention was drawn to a particularly large file drawer that turned out to be unlocked. What we discovered can only be described as profoundly shocking:
But this particular file drawer problem was different: The lab log revealed that Dr. Gelman was desperate to obtain evidence against the phenomenon – and failed repeatedly. Initially, he invested enormous resources to run experiments with extraordinary four-digit sample sizes to “nail the coffin shut on Power Pose”, as a hand-written note on an early report reads. The data painted a very clear picture, and it was not to his liking. As it dawned on him that, contrary to his personal convictions, Power Posing might be a real phenomenon, he began to stack the deck.
Instead of simple self-reports, he tried manifest behavioral observations and even field studies where the effect was expected to vanish. Power Pose prevailed. He deliberately reduced study samples to the absurdly low numbers often criticized on his very own blog. But even in his last attempts, with 1-β almost equal to α: Power Pose prevailed. As more and more evidence in favor of Power Posing was gathered, the research became… sloppy. Conditions were dropped, outliers removed, moderators randomly added, and, yes, even p-values were rounded up. Much to Dr. Gelman’s frustration, Power Pose prevailed. He was *unable* to collect data in favor of the null hypothesis.
He thought he had one final Bayesian trick up his sleeve: By hiring a skilled hypnotist he manipulated his priors, his own beliefs (!) in Power Posing. But even with these inhumane levels of disbelief, the posterior always indicated beyond a doubt: Power Pose prevailed. It was almost like the data were trying to tell him something – but Dr. Gelman had forgotten how to listen to evidence a long time ago.
In a recent publication, Simmons and Simonsohn analyzed the evidential value of the published literature on Power Posing. The centerpiece of their research is a p-curve (figure below, left graph) on the basis of which they “conclusively reject the null hypothesis that the sample of existing studies examines a detectable effect.” Had Dr. Gelman not hidden his findings in a file drawer, Simmons and Simonsohn’s conclusions would have been dramatically different (right graph).
Initially, we couldn’t believe that he would go this far just to win an argument. We were sure there must have been some innocuous explanation – yet we also did not want to confront him with our suspicions right away. We wanted to catch him red-handed.
Thus, we decided to infiltrate one of his studies, which he was covertly advertising under the obvious pseudonym Mr. Dean Wangle. He administered the study wearing a fake moustache and a ridiculous French beret, but his voice is unmistakeable. Below is a video of an experimental session that we were able to record with a hidden camera. The footage is very tough to watch.
Combined, the evidence leaves only one conclusion: Andrew Gelman betrayed science in his war on power posing.
Does playing violent video games increase aggression? No, but violent video game research kinda does.
What makes advertisements persuasive? David Hasselhoff, obvs.
Are 5%/25%/75% of the population addicted to social media? It’s almost like humans have a fundamental need for social interactions.
Who are these people that watch porn? Literally everyone everywhere.
Why do we enjoy cat videos so much? WHY???
These are the typical research questions media psychologists are concerned with. Broadly, media psychology describes and explains human behavior, cognition, and affect with regards to the use and effects of media and technology. Thus, it’s a hybrid discipline that borrows heavily from social, cognitive, and educational psychology in both its theoretical approaches and empirical traditions. The difference between a social psychologist and a media psychologist who both study video game effects is that the former publishes their findings in JPSP while the latter designs “What Twilight character are you?” self-tests (TEAM EDWARD FTW!) for perezhilton.com to avoid starving. And so it goes.
New is always better
A number of media psychologists are interested in improving psychology’s research practices and quality of evidence. Under the editorial oversight of Nicole Krämer, the Journal of Media Psychology (JMP), the discipline’s flagship journal (by “flagship” I mean one of two journals nominally dedicated to this research topic, the other being Media Psychology; it’s basically one of those People’s Front of Judea vs. Judean People’s Front situations), not only signed the Transparency and Openness Promotion Guidelines, it has also become one of roughly fifty journals that offer the Registered Reports format.
To promote preregistration in general and the new submission format at JMP in particular, the journal launched a fully preregistered Special Issue on “Technology and Human Behavior” dedicated exclusively to empirical work that employs these practices. Andy Przybylski (who, for reasons I can’t fathom, prefers being referred to as a “motivational researcher”) and I were fortunate enough to be the guest editors of this issue.
The papers in this issue are nothing short of amazing – do take a look at them even if they fall outside your usual area of interest. All materials, data, analysis scripts, reviews, and editorial letters are available here. I hope that these contributions will serve as an inspiration and model for other (media) researchers, and encourage scientists studying media to preregister designs and share their data and materials openly.
Media Psychology BC (Before Chambers)
If you already suspected that in all this interdisciplinary higgledy-piggledy, media psychology did not only inherit its parent disciplines’ merits, but also some of their flaws, you’re probably (read: unerring-absolute-100%-pinpoint-unequivocally-no-ifs-and-buts-dead-on-the-money) correct. Fortunately, Nicole was kind enough to allot our special issue editorial more space than usual, so that we could report a meta-scientific analysis of the journal’s past and illustrate how some of the new practices can improve the evidential value of research. To this end, we surveyed a) availability of data, b) errors in the reporting of statistical analyses, and c) sample sizes and statistical power of all 147 studies in N = 146 original research articles published in JMP between volume 20/1, when it became an English-language publication, and volume 28/2 (the most recent issue at the time this analysis was planned). This blog post is a summary of the analyses in our editorial, which — including its underlying raw data, analysis code, and code book — is publicly available at https://osf.io/5cvkr/.
Availability of Data and Materials
Historically the availability of research data in psychology has been poor. Our sample of JMP publications suggests that media psychology is no exception to this, as we were not able to identify a single publication reporting a link to research data in a public repository or the journal’s supplementary materials.
Statistical Reporting Errors
A recent study by Nuijten et al. (2015) indicates a high error rate in the Null Hypothesis Significance Tests (NHSTs) reported in psychological research articles. To make sure such inconsistencies were avoided for our special issue, we validated all accepted research reports with statcheck 1.2.2, a package for the statistical programming language R that works like a spellchecker for NHSTs by automatically extracting reported statistics from documents and recomputing p-values. (p-values are recomputed from the reported test statistics and degrees of freedom. For the purpose of recomputation, it is thus assumed that test statistics and degrees of freedom are correctly reported, and that any inconsistency is caused by errors in the reporting of p-values. The actual inconsistencies, however, can just as well be caused by errors in the reporting of test statistics and/or degrees of freedom.)
For our own analyses, we scanned all n_emp = 131 JMP publications reporting data from at least one empirical study (147 studies in total) with statcheck to obtain an estimate of the reporting error rate in JMP. Statcheck extracted a total of 1036 NHSTs reported in n_nhst = 98 articles. Forty-one publications (41.8% of n_nhst) reported at least one inconsistent NHST (max = 21), i.e. reported test statistics and degrees of freedom did not match reported p-values. Sixteen publications (16.3% of n_nhst) reported at least one grossly inconsistent NHST (max = 4), i.e. the reported p-value is < .05 while the recomputed p-value is > .05, or vice versa. Thus, a substantial proportion of publications in JMP seem to contain inaccurately reported statistical analyses, some of which might affect the conclusions drawn from them (see Figure 1).
Caution is advised when speculating about the causes of these inconsistencies. Many of them are probably clerical errors that do not alter the inferences or conclusions in any way. (For example, in 20 cases the authors reported p = .000, which is mathematically impossible; for each of these, the recomputed p-value was < .001. Other inconsistencies might be explained by authors not declaring that their tests were one-tailed, which is relevant for their interpretation.) However, we note with some concern that clerical error is unlikely to be the only cause: in 19 out of 23 cases the reported p-values were equal to or smaller than .05 while the recomputed p-values were larger than .05, whereas the opposite pattern was observed in only four cases. If incorrectly reported p-values resulted merely from clerical errors, we would expect inconsistencies in both directions to occur at approximately equal frequencies.
All of these inconsistencies can easily be detected using the freely available R package statcheck or, for those who do not use R, in your browser via www.statcheck.io.
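For intuition, here is a toy version of the consistency check in Python (the function names are mine; the real statcheck parses APA-formatted results from text and uses the exact t, F, r, χ², and z distributions – the normal distribution below only covers the z-test case):

```python
import math

def p_from_z(z: float) -> float:
    """Two-tailed p-value of a z statistic, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

def consistent(z: float, reported_p: float, decimals: int = 3) -> bool:
    """Flag a reported p-value that cannot be the (rounded) recomputed one."""
    return abs(p_from_z(z) - reported_p) <= 0.5 * 10 ** (-decimals)

print(round(p_from_z(2.20), 3))   # 0.028
print(consistent(2.20, 0.028))    # True: matches the recomputed value
print(consistent(2.20, 0.014))    # False: looks like an undeclared one-tailed test
```

The last line illustrates the one-tailed ambiguity mentioned above: p = .014 is exactly half of the recomputed two-tailed value, so the flag may reflect unreported one-tailed testing rather than a typo.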
Sample Sizes and Statistical Power
High statistical power is paramount in order to reliably detect true effects in a sample and, thus, to correctly reject the null hypothesis when it is false. Further, low power reduces the confidence that a statistically significant result actually reflects a true effect. A generally low-powered field is more likely to yield unreliable estimates of effect sizes and low reproducibility of results. We are not aware of any previous attempts to estimate average power in media psychology.
Strategy 1: Reported power analyses. One obvious strategy for estimating average statistical power is to examine the reported power analyses in empirical research articles. Searching all papers for the word “power” yielded 20 hits, and just one of these was an article that reported an a priori determined sample size. (In the 19 remaining articles, power is mentioned, for example, to demonstrate observed or post-hoc power (which is redundant with reported NHSTs), to suggest larger samples should be used in future research, or to explain why an observed nonsignificant “trend” would in fact be significant had the statistical power been higher.)
Strategy 2: Analyze power given sample sizes. Another strategy is to examine the power for different effect sizes given the average sample size (S) found in the literature. The median sample size in JMP is 139 with a considerable range across all experiments and surveys (see Table 1). As in other fields, surveys tend to have healthy sample sizes apt to reliably detect medium to large relationships between variables.
For experiments (including quasi-experiments), the outlook is a bit different. With a median sample size per condition/cell of 30.67, the average power of experiments published in JMP to detect small differences between conditions (d = .20) is 12%; it is 49% for medium effects (d = .50) and 87% for large effects (d = .80). Even when assuming that the average effect examined in the media psychological literature could be as large as those in social psychology (d = .43), our results indicate that the chance that an experiment published in JMP will detect it is 38% – worse than flipping a coin (an operation that would be considerably less expensive).
Table 1. Sample sizes and power of studies published in JMP volumes 20/1 to 28/2. n = number of published studies; MD_S = median sample size; MD_S/cell = median sample size per condition; 1-β(r=.1/d=.2) / 1-β(r=.3/d=.5) / 1-β(r=.5/d=.8) = power to detect small/medium/large bivariate relationships/differences between conditions.
For between-subjects designs, mixed designs, and the total we assumed independent t-tests. For within-subjects designs we assumed dependent t-tests. All tests two-tailed, α = .05. Power analyses were conducted with the R package pwr 1.20.
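The power figures above can be reproduced to within about a percentage point with a normal approximation to the two-sample t-test (a sketch, not the pwr code we actually used; the approximation runs slightly high for cells this small):

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_two_sample(d: float, n_per_cell: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed independent-samples t-test
    with effect size d and n_per_cell observations per condition."""
    z_crit = 1.959964  # normal critical value for alpha = .05, two-tailed
    ncp = d * math.sqrt(n_per_cell / 2)  # noncentrality under the alternative
    return phi(ncp - z_crit) + phi(-ncp - z_crit)

# median JMP cell size: 30.67
for d in (0.20, 0.43, 0.50, 0.80):
    print(f"d = {d:.2f}: power = {power_two_sample(d, 30.67):.2f}")
```

Plugging in d = .20, .50, and .80 recovers the 12%/49%/87% figures from the text (give or take a rounding point, since the exact calculation uses the noncentral t distribution).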
Feeling the Future of Media Psychology
The above observations could lead readers to believe that we are concerned about the quality of publications in JMP in particular. If anything, the opposite is true, as this journal recently committed itself to a number of changes in its publishing practices to promote open, reproducible, high-quality research. These analyses are simply another step in a phase of sincere self-reflection. Thus, we would like these findings, troubling as they are, to be taken not as a verdict, but as an opportunity for researchers, journals, and organizations to reflect similarly on their own practices and hence improve the field as a whole.
One key area which could be improved in response to these challenges is how researchers create, test, and refine the psychological theories used to study media. Like other psychology subfields, media psychology is characterized by the frequent emergence of new theories purporting to explain phenomena of interest. (As James Anderson recently put it in a very clever paper, as usual: “Someone entering the field in 2014 would have to learn 295 new theories the following year.”) This generativity may, in part, be a consequence of the fuzzy boundaries between exploratory and confirmatory modes of social science research.
Both modes of research – confirming hypotheses and exploring uncharted territory – benefit from preregistration. Drawing this distinction helps the reader determine which hypotheses carefully test ideas derived from theory and previous empirical studies, and it liberates exploratory research from the pressure to present an artificial hypothesis-testing narrative.
As technology experts, media psychology researchers are well positioned to use and study new tools that shape our science. A range of new web-based platforms have been built by scientists and engineers at the Center for Open Science, including their flagship, the OSF, and preprint services like PsyArXiv. Designed to work with scientists’ existing research flows, these tools can help prevent data loss due to hardware malfunctions, misplacement (including mindless grad students and hungry dogs), or relocations of researchers, while enabling scientists to claim more credit by allowing others to use and cite their materials, protocols, and data. A public repository for media psychology research materials is already in place.
Like psychological science as a whole, media psychology faces a pressing credibility gap. Unlike some other areas of psychological inquiry (such as meta science), however, media research – whether concerning the Internet, video games, or film – speaks directly to everyday life in the modern world. It affects how the public forms its perceptions of media effects, and how professional groups and governmental bodies make policies and recommendations. Precisely because it informs professional policy, the findings disseminated to caregivers, practitioners, and educators should rest on an empirical foundation of sufficient rigor.
We are, on balance, optimistic that media psychologists can meet these challenges and lead the way for psychologists in other areas. This special issue and the registered reports submission track present an important step in this direction and we thank the JMP editorial board, our expert reviewers, and of course, the dedicated researchers who devoted their limited resources to this effort.
The promise of building an empirically-based understanding of how we use, shape, and are shaped by technology is an alluring one. We firmly believe that incremental steps taken towards scientific transparency and empirical rigor will help us realize this potential.
If you read this entire post, there’s a 97% chance you’re on Team Edward.
Scroll to the very end of this post for an addendum. (If you only see footnotes, you have scrolled too far.)
Reading skills of children correlate with their shoe size. The number of storks in an area correlates with the birth rate. Ice cream sales correlate with deaths by drowning. Maybe they used different examples to teach you, but I’m pretty sure that we’ve all learned about confounding variables during our undergraduate studies. After that, we’ve probably all learned that third variables ruin inference, yadda yadda, and that obviously the only way to ever learn anything about cause and effect is proper experiments, with randomization and stuff. End of story, not much more to learn about causality. (YMMV, and I hope that there are psych programs out there that teach more about causal inference in non-experimental settings.) Throw in some “control variables” and pray to Meehl that some blanket statement (“Experimental studies are needed to determine whether…”) will make your paper publishable anyway.
Causal inference from observational data boils down to assumptions you have to make (there’s no free lunch in causal inference – inference from your experiment, for example, depends on the assumption that your randomization worked; and then there’s the whole issue that the effects you find in your experiment might have literally nothing to do with the world outside the lab, so don’t think that experiments are an easy way out of this misery) and third variables you have to take into account. I’m going to talk about a third variable problem today: conditioning on a collider. You might not have heard of this before, but every time you condition on a collider, a baby stork gets hit by an oversized shoe filled with ice cream (just to make sure: I don’t endorse any form of animal cruelty) and the quality of the studies supporting your own political view deteriorates. (If you are already aware of colliders, you will probably want to skip the following stupid jokes and smugness and continue with the last two paragraphs, in which I make a point about viewpoint bias in reviewers’ decisions.)
Let’s assume you were interested in the relationship between conscientiousness and intelligence. You collect a large-ish sample of N = 10,000 (as we say in German: “Gönn dir!” – treat yourself!) and find a negative correlation between intelligence and conscientiousness of r = -.372 (see Figure 1).
However, your sample consisted only of college students. Now you might be aware that there is a certain range restriction in intelligence of college students (compared to the overall population), so you might even go big and claim that the association you found is probably an underestimation! Brilliant.
The collider – being a college student – rears its ugly head. Being a college student is positively correlated with intelligence (r = .426). It is also positively correlated with conscientiousness (r = .433). (Just to make sure: This is fake data. Fake data should not be taken as evidence for the actual relationship between a set of variables, though some of the more crafty and creative psychologists might disagree.) Let’s assume that conscientiousness and intelligence have a causal effect on college attendance, and that they are actually not correlated at all in the general population, see Figure 2.
If you select a college sample (i.e. the pink dots), you will find a negative correlation between conscientiousness and intelligence of, guess what, exactly r = -.372, because this is how I generated my data. There is a very intuitive explanation for the case of dichotomous variables (the collider problem is just the same for continuous measures): In the population, there are smart lazy people, stupid diligent people, smart diligent people, and stupid lazy people. (Coincidentally, you will find each of the four combinations represented among the members of The 100% CI at any given point in time, but we randomly reassign these roles every week.) In your hypothetical college sample, you would have smart lazy people, stupid diligent people, and smart diligent people, but no stupid lazy people, because they don’t make it to college. (Ha, ha, ha.) Thus, in your college sample, you will find a spurious correlation between conscientiousness and intelligence. (Notice that you might very well be able to replicate this association in every college sample you can get. In that sense, the negative correlation “holds” in the population of all college students, but it results from selection into the sample – not from causal processes between conscientiousness and intelligence, or even good old-fashioned confounding variables – and doesn’t tell you anything about the correlation in the general population.)
By the way, additionally sampling a non-college sample and finding a similar negative correlation among non-college peeps wouldn’t strengthen your argument: You are still conditioning on a collider. From Figure 2, you can already guess a slight negative relationship in the blue cloud (if you are really good at guessing correlations – it’s a skill you can train! – you might even see that it’s about r = -.200), and pooling all data points and estimating the relationship between IQ and conscientiousness while controlling for the collider results in r = -.240. Maybe a more relevant example: If you find a certain correlation in a clinical sample, and you find the same correlation in a non-clinical sample, that doesn’t prove it’s real in the not-so-unlikely case that ending up in the clinical sample is a collider caused by the variables you are interested in.
On an abstract level: Whenever X1 (conscientiousness) and X2 (intelligence) both cause Y (college attendance) in some manner, conditioning on Y will bias the relationship between X1 and X2 and potentially introduce a spurious association (or hide an existing link between X1 and X2, or exaggerate an existing link, or reverse the direction of the association…). Conditioning can mean a range of things, including all sorts of “control”: Selecting respondents based on their values on Y, or on anything that is caused by Y (the whole collider logic also extends to so-called descendants of a collider)? That’s conditioning on a collider. Statistically controlling for Y? That’s conditioning on a collider. Generating propensity scores based on Y to match your sample for this variable? That’s conditioning on a collider. Running analyses separately for Y = 0 and Y = 1? That’s conditioning on a collider. Washing your hair in a long, relaxing shower at CERN? You better believe that’s conditioning on a collider. If survival depends on Y, there might be no way for you to not condition on Y unless you raise the dead.
When you start becoming aware of colliders, you might encounter them in the wild, aka everyday life. For example, I have noticed that among my friends, those who study psychology (X1) tend to be less aligned with my own political views (X2). The collider is being friends with me (Y): Psychology students are more likely to become friends with me because, duh, that’s how you find your friends as a student (X1->Y). People who share my political views are more likely to become friends with me (X2->Y). Looking at my friends, they are either psych peeps or socialist anti-fascist freegan feminists. (This might sound like I want to imply that the other authors of this blog are fascists, but that wasn’t true last time I checked.) Even though those two things are possibly positively correlated in the overall population (actually, I’m pretty damn sure that the average psych student is more likely to be a socialist anti-fascist freegan feminist than the average person who is not a psychology student), the correlation in my friends sample is negative (X1 and X2 are negatively correlated conditional on Y).
Other examples: I got the impression that bold claims are negatively correlated with methodological rigor in the published psychological literature – but maybe that’s just because both flashy claims and methodological rigor increase the chances of publication, and we simply never get to see the stuff that is both boring and crappy. (This might come as less of a surprise to you if you’re a journal editor, because you get to see the whole range.)
At some point, I got the impression that female (X1) professors were somewhat smarter (X2) than male professors, and based on that, one might conclude that women are smarter than men. But female professors might just be smarter because tenure (Y) is less attainable for women (X1->Y; for whatever reason – you can add mediators such as discrimination) and more attainable for smart people (X2->Y), so that only very smart women become professors while some mediocre males can also make it. The collider strikes again!
Tenure and scientific eminence are nice examples in general because they are colliders for a fuckload of variables. For example, somebody suggested that women were singled out as instances of bad science because of their gender. Leaving aside the issue of whether women are actually overrepresented among the people who have been shamed for sloppy research (I actually have no clue whether that’s true or not; I just don’t have any data and no intuition on that matter), such an overrepresentation would neither tell us that women are unfairly targeted nor that women are more prone to bad research practices. (Notice that both accounts would equal a causal effect of gender, as the arrows are pointing away from “gender” and end at “being criticised for bad research”, no matter what happens in between. Of course, the parts in between might be highly informative.) Assuming that women (X1) have worse chances to get into the limelight than men, but overstating the implications of your evidence (X2) helps with getting into the limelight, we could find that women in the limelight (conditioning on Y) are more likely to have overstated their evidence, because the more tempered women simply didn’t make it. That’s obviously just wild speculation, but in everyday life, people are very willing to speculate about confounding variables, so why not speculate about a collider for a change?
Which leads to the last potential collider that I would like you to consider. Let’s assume that the methodological rigor of a paper (X1) makes you more likely to approve of it as a reviewer. Furthermore, let’s assume that you – to some extent – prefer papers that match your own bias (X2). (For example, I believe that the metric system is objectively superior to others, so I wouldn’t approve of a paper that champions the measurement of baking ingredients in the unit of horse hooves. If you think I chose this example because it sounds so harmless, you haven’t heard me rant about US-letter format yet.) Even if research that favors your point of view is on average just as good as research that tells a different story (X1 and X2 are uncorrelated), your decision to let a paper pass or not (Y) will introduce a negative correlation: The published papers that match your viewpoint will, on average, be worse. (Plomin et al. claimed that the controversy surrounding behavioral genetics led to the extra effort necessary to build a stronger foundation for the field, which is the flipside of this argument.)
So peeps, if you really care about a cause, don’t give mediocre studies an easy time just because they please you: At some point, the whole field that supports your cause might lose its credibility because so much bad stuff got published.
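If you like, you can check the reviewer scenario with the same recipe (again a Python sketch with invented numbers): rigor and fit with the reviewer’s views are independent by construction, yet among accepted papers, the ones that pleased the reviewer are, on average, less rigorous.

```python
import random
from statistics import mean

random.seed(7)
N = 100_000
rigor = [random.gauss(0, 1) for _ in range(N)]  # methodological rigor (X1)
match = [random.gauss(0, 1) for _ in range(N)]  # fit with reviewer's views (X2), independent of rigor
# acceptance (Y, the collider) depends on both, plus reviewer mood
accepted = [r + m + random.gauss(0, 1) > 1 for r, m in zip(rigor, match)]

rigor_pleasing = mean(r for r, m, a in zip(rigor, match, accepted) if a and m > 0)
rigor_opposing = mean(r for r, m, a in zip(rigor, match, accepted) if a and m <= 0)
print(round(rigor_pleasing, 2))  # lower: pleasing papers needed less rigor to get in
print(round(rigor_opposing, 2))  # higher: opposing papers had to compensate with quality
```

The papers that made it in despite opposing the reviewer’s views had to clear the bar on rigor alone, which is exactly the mechanism behind the warning above.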
Addendum: Fifty Shades of Colliders
Since publishing this post, I have learned that a more appropriate title would have been “That one weird third variable problem that gets mentioned quite a bit across various contexts but somehow people seem to lack a common vocabulary so here is my blog post anyway also time travel will have had ruined blog titles by the year 2100.”
One of my favorite personality psychologists (also: one of the few people I think one can call “personality psychologist” without offending them – not sure though, *hides*), Sanjay Srivastava, blogged about the “selection-distortion effect” before it was cool, back in 2014.
Neurodevelopmental psychologist Dorothy Bishop talks about the perils of correlational data in research on developmental disorders in this awesome blog post and describes the problems of within-groups correlations.
Last but not least, Patrick Forscher just started a series of blog posts about causality (the first and second posts are already up), starting from scratch. I highly recommend his blog for a more systematic yet entertaining introduction to the topic! (No CERN jokes though. Those are the100.ci-exclusive!)
If you only see footnotes, you have scrolled too far.
YMMV and I hope that there are psych programs out there that teach more about causal inference in non-experimental settings.
“Experimental studies are needed to determine whether…”
Added bonus: After reading this, you will finally know how to decide whether or not a covariate is necessary, unnecessary, or even harmful.
I have been informed that only grad students can afford to actually read stuff, which is kind of bad, isn’t it?
There’s no free lunch in causal inference. Inference from your experiment, for example, depends on the assumption that your randomization worked. And then there’s the whole issue that the effects you find in your experiment might have literally nothing to do with the world that happens outside the lab, so don’t think that experiments are an easy way out of this misery.
Just to make sure: I don’t endorse any form of animal cruelty.
If you are already aware of colliders, you will probably want to skip the following stupid jokes and smugness and continue with the last two paragraphs in which I make a point about viewpoint bias in reviewers’ decisions.
As we say in German: “Gönn dir!”
Just to make sure: This is fake data. Fake data should not be taken as evidence for the actual relationship between a set of variables (though some of the more crafty and creative psychologists might disagree).
The collider problem is just the same for continuous measures.
Coincidentally, you will find each of the four combinations represented among the members of The 100% CI at any given point in time, but we randomly reassign these roles every week.
Ha, ha, ha.
Notice that you might be very well able to replicate this association in every college sample you can get. In that sense, the negative correlation “holds” in the population of all college students, but it is a result from selection into the sample (and not causal processes between conscientiousness and intelligence, or even good old fashioned confounding variables) and doesn’t tell you anything about the correlation in the general population.
If you are really good at guessing correlations (it’s a skill you can train!) you might even see that it’s about r = -.200.
or anything that is caused by Y, because the whole collider logic also extends to so-called descendants of a collider
This might sound like I want to imply that the other authors of this blog are fascists, but that wasn’t true last time I checked.
Actually, I’m pretty damn sure that the average psych student is more likely to be a socialist anti-fascist freegan feminist than the average person who is not a psychology student.
This might come as less of a surprise to you if you’re a journal editor because you get to see the whole range.
For whatever reason, you can add mediators such as discrimination
[Disclaimer: I am not an EEG expert. I probably got some things wrong. Please let me know about them.]
TL;DR: I reviewed four infant ERP studies on the same topic and found that their results are maximally incongruent with each other. Yet the analytical choices made in the papers differ too much to even allow the conclusion that there probably is no underlying effect.
If you just want to skim this post, you can skip to the short summaries at the end of each section, which I highlighted so they’re easy to find.
Estimated reading time (excl. tables): 17 minutes
Some weeks ago, I reviewed an EEG paper on infants’ perception of biological motion. The authors cited four older studies that report ERP correlates of 5- and 8-month-old children’s ability to discriminate normal human motion, such as walking, from different forms of unnatural or non-biological motion.
Because I wasn’t familiar with this literature and wanted to be a good reviewer, I went and had a look at these fab four. What I found was a combined sample size of 51, four different analysed time windows and region-of-interest combinations, a left-skewed p-curve, and a lot of question marks on my forehead. This blog post is a story of my journey digging through these papers to see what they can tell us about infants’ perception of biological motion.
The four studies in question:
Hirai, M., & Hiraki, K. (2005). An event-related potentials study of biological motion perception in human infants. Brain Research: Cognitive Brain Research, 22, 301–304.
Marshall, P. J. & Shipley, T. F. (2009). Event-related potentials to point-light displays of human action in five-month-old infants. Developmental Neuropsychology, 34(3), 368-377. doi: 10.1080/87565640902801866
You have probably seen videos of point-light displays (PLDs) of human motion before: Single dots represent the joints of a person and despite this seemingly impoverished setup (compared to a normal video recording), it is surprisingly easy to recognise the displayed action, e.g., a walking person. I didn’t embed an example video because I don’t want to scare away my new pals with an Elsevier lawsuit this early, but the Biomotion Lab at Queen’s University has put some of their cool stimuli online.
Whenever you find that you can perform some cognitive task with apparent ease (like recognising a bunch of moving dots as a walking person), a developmental psychologist somewhere gets very itchy and really, really wants to know at what exact point between its nonexistence and current giant-walnut-like state your brain acquired this intriguing skill.
The four papers I’m reviewing here look for EEG correlates of this previously found behavioural effect via event-related potentials (ERPs). My aim is to find out if they can tell us something about what happens on an infant’s scalp when they watch PLDs of upright human motion. I will first give a very brief summary of each study and then compare their analytical choices and results with a focus on the contrast between upright biological motion (BM) and “non-biological motion” (nBM).
You will notice that in almost all cases, the nBM PLDs consist of points whose motion paths and velocity are identical to the points in the BM PLDs. “Non-biological” thus refers to the relation of the individual points to each other (in scrambled PLDs, where the points’ starting positions have been randomised) or to the orientation of the displayed figure (in inverted PLDs that have been turned upside down).
Because EEG results depend heavily on many boring but important technical details about recording and analysis, this post contains a bunch of big, cluttered tables with a tiny font size, which I feel terrible about, yet not terrible enough to spare you them. I simply didn’t find a more elegant way to include all this information. Feel free to ignore most of it if you’re only here for the stats rage (it’s totally a thing) – though for a good laugh I recommend having a look at Table 1 for sample sizes and exclusion rates.
Alright! Fasten your seatbelts, here we go:
HH05 (Hirai & Hiraki, 2005)
The rationale is simple: We know infants are sensitive to biological motion, but nobody has looked at neural correlates of this before, so let’s check it out. HH05 investigate 8-month-olds’ ERPs in reaction to PLDs of upright walking compared to scrambled motion (the points’ paths and velocity are identical to the upright condition, but their starting points are randomised – check “scrambled” in the Biomotion Lab animation to get an idea). Each trial lasts 510 ms.
RHS06 (Reid et al., 2006)
RHS06 look at the same age group (8-month-olds). But in contrast to HH05, they compare upright motion to inverted motion (turning the animation upside down). With 1000 ms, their trials are twice as long as HH05’s. Another difference is that they use two different kinds of movement: walking and kicking, thus creating a 2×2 condition design (action type: walking vs kicking × orientation: upright vs inverted). What’s funny about this is that they do not once mention why they added the kicking motion, and in the remainder of the paper collapse walking and kicking into a simple contrast between upright and inverted.
My p-hacking alarm bells started to make some clunking noises when I first read this. You just don’t add stuff to an experiment and then never mention it again, especially when it sets your study apart from previous ones. You only do that when you tried something and it didn’t work. Please tell me if this is an unfair assumption.
RHLS08 – 3 conditions: upright (walking & kicking), corrupted (walking with backward-flexing knees), impossible (kicking with spinning leg)
MS09 – 2 conditions: upright (walking, running, throwing a ball, kicking a ball) vs scrambled
* Refers to each cell of the original 2×2 (action × orientation) design if I understand correctly.
** Refers to the two main conditions (upright vs inverted).
**** Plus 12 kids who wouldn’t even wear the EEG cap. I suspect similar rates in the other papers, where they are not reported.
RHLS08 (Reid et al., 2008)
RHLS08 again investigate 8-month-olds, but throw another set of considerations into the mix: They compare upright motion (again walking and kicking) to a) a “corrupted body schema” condition where the walking PLDs were edited such that the knees bent backwards, and b) a “biomechanically impossible” condition where the kicking PLDs were edited such that the kicking leg seemed to come off and spin in a circle. Trial length is again 1000 ms.
The dropout rate in this study struck me as odd: with a final sample of N=15 and 40 exclusions, it is 3.2x as high as in RHS06 (N=12, 10 exclusions). What happened there?
This relatively high attrition rate was due to three experimental conditions in the present study when compared to the standard two conditions in most infant ERP studies. (p. 162)
Ok, but wait a second… Didn’t RHS06 start out with even more conditions (four)? Is that why they forgot about one of their factors halfway through their paper and changed it to a simple contrast between upright and inverted PLDs?
I’m not liking this.
MS09 (Marshall & Shipley, 2009)
MS09 go back to the basics – upright versus scrambled motion, but this time with 5-month-olds. Their trials are twice as long as RHS06’s and RHLS08’s (2000-2300 ms). For unnamed reasons they use four different types of action: walking, running, throwing a ball, and kicking a ball. Each condition consists of only four trials of each of these actions (16 upright, 16 scrambled). Here’s their justification for the low trial number:
ERP averages composed of less than 20 trials are not unusual in the infant visual ERP literature (e.g., de Haan & Nelson, 1997; Snyder, Webb, & Nelson, 2002), especially in studies involving dynamic social stimuli (Reid et al., 2008; Striano, Reid, & Hoehl, 2006). (p. 370)
Ah, the classic “we used a shitty design because others do it too” argument. What I find more worrying than the low total number of trials are the fairly heterogeneous stimuli: I would not expect the brain to react identically when viewing displays of continuous walking versus distinct goal-directed actions involving an inanimate object (throwing/kicking a ball). What can we expect from an average of only eight instances of each of these? I’m not an EEG expert but this simply isn’t going to work.
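To get a feel for the numbers – a back-of-the-envelope sketch with made-up values, not figures from MS09: averaging over n trials shrinks trial-to-trial EEG noise on the ERP estimate by a factor of √n, so eight trials per action type simply don’t buy you much.

```python
import numpy as np

# Toy numbers (my assumptions, not from the papers): an ERP effect of a
# few microvolts buried in single-trial EEG noise of ~50 µV SD.
single_trial_noise_uv = 50.0

for n_trials in (8, 16, 100):
    # Averaging n trials reduces the noise on the ERP estimate by sqrt(n)
    noise_after_averaging = single_trial_noise_uv / np.sqrt(n_trials)
    print(n_trials, round(noise_after_averaging, 1))
# 8 trials leave ~17.7 µV of residual noise – several times larger than a
# typical effect of a few µV; even 100 trials leave 5 µV.
```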
Summary: overview We have two studies comparing upright and scrambled motion (HH05 and MS09), one comparing upright and inverted motion (RHS06), and one comparing upright motion with a “corrupted body schema” condition and a “biomechanically impossible” condition (RHLS08). Three studies look at 8-month-olds and one looks at 5-month-olds.
Table 2: EEG recording and preprocessing

HH05: 62 electrodes (Geodesic Sensor Net); 0.1-100 Hz bandpass, 30 Hz low-pass; baseline: 100 ms pre trial
RHS06: 19 electrodes (10-20 system); no filters reported; baseline: 100 ms pre trial
RHLS08: 23 electrodes (10-20 system); 0.1 Hz high-pass, 35 Hz low-pass; baseline: 100 ms pre trial + first 100 ms of trial
MS09: 10-20 system, Electro-Cap; 0.1 Hz high-pass, 100 Hz low-pass; baseline: 100 ms pre trial
Design & analysis
Which dependent variables did the studies look at? In other words: Which time windows at which electrode sites were analysed? (See Table 2 for boring EEG recording details.)
HH05 define a target time window at 200-300 ms after trial onset based on adult ERP data. To me this sounds surprising, because as a rule of thumb I would expect infant ERPs to show up later than corresponding adult ERPs (because let’s face it, babies are a bit dim). But anyway, at least they do give a justification. They pick 26 electrodes in the occipitotemporal region as their target area (see Figure 1) and compare right and left hemisphere (13 electrodes on each side). They do not provide any justification for either the chosen area or the fact that they compare left and right hemisphere (now their design has turned into a 2×2 interaction: stimulus type × lateralisation).
RHS06 stick with the time window of 200-300 ms, with the analysis of lateralisation effects, and, broadly speaking, with the target area (“posterior”): They compare P3 and CP5 on the left with P4 and CP6 on the right. Interestingly, they do not cite HH05, even though they submitted their paper almost a year after HH05 had been published online. Instead, RHS06 justify the time window (and the search for lateralisation effects) by pointing to studies reporting an N170 in adults in response to BM and the claim that “in infant research, the P220–P290 waveform has been named the ‘infant N170’” (p. 212). Alright, sounds legit. Their justification for the target area is less consistent: Again, they cite the adult-N170 literature, which reported this effect “at a variety of posterior locations, including occipital (O1, O2), temporal (T7, T8) and parietal cortex (P7, P3, P4, P8)” (p. 212). Sadly, the reason why they then confined their analyses to P3, P4, CP5, and CP6 remains a mystery for the reader.
Somewhat unexpectedly, RHLS08 cite both themselves (RHS06) and HH05 as a reference for looking at parietal regions, but quietly drop CP5/CP6 and the lateralisation effect (P3 and P4 are now being analysed jointly and not compared with each other). What really stunned me is that they changed the analysed time window to 300-700 ms without any justification. This means their analysis window at the parietal region does not even overlap with HH05 and RHS06.
A variation of the old time window comes into play again for the newly-added frontal target area: They include a region composed of F7, F8, F3, F4, Fz, FC3, FC4, C3, C4, and Cz at 200-350 ms (again without justification), because they hypothesise “differential processing in parietal and frontal regions” (p. 162) for the contrast between corrupted and impossible PLDs.
There’s one more thing. All other papers use the 100 ms directly preceding the trial for baseline correction; only RHLS08 use 100 ms pre trial plus the first 100 ms of the trial. Their justification for this makes no sense in light of the other studies:
This ensured that differences in the ERP were due to factors associated with motion rather than a reaction to observed differences between the conditions in the initial configuration of the point lights. (p. 164)
MS09 go on a big fishing expedition and test the full trial length from 0-2000 ms in 100-ms bins separately for P3, P4, P7, P8, T7, T8, O1, and O2 (citing Jokisch et al., 2005; HH05; and RHS06). They also hypothesise a lateralisation effect, citing RHS06, but never directly compare any electrodes from the right and left hemisphere. MS09 thus run 20 separate tests for each of 8 electrodes (160 tests in total) and – spoiler alert – do not correct for multiple comparisons.
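Just to spell out what 160 uncorrected tests mean – even under the best possible case for MS09, with all tests independent (they aren’t; neighbouring time bins are highly correlated) and no true effect anywhere:

```python
# Family-wise error rate for 160 uncorrected tests at alpha = .05,
# assuming independent tests and all null hypotheses true.
alpha, n_tests = 0.05, 160

fwer = 1 - (1 - alpha) ** n_tests
print(round(fwer, 4))            # 0.9997 - at least one false positive is all but guaranteed
print(round(alpha * n_tests))    # 8 - expected number of false-positive bins
```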
Summary: design & analyses We have three different time windows for the BM versus nBM contrast (HH05 and RHS06: 200-300 ms, RHLS08: 300-700 ms, MS09: 0-2000 ms), and a fourth one if we include RHLS08’s search for a frontal difference between corrupted and impossible motion (200-350 ms). All studies look at “somewhat” posterior/parietal electrode sites, but in many creative combinations: a large ill-defined area on the left vs on the right (HH05), P3 and CP5 on the left vs P4 and CP6 on the right (RHS06), P3 and P4 combined (RHLS08), and an 8-electrode carnage involving P3, P4, P7, P8, T7, T8, O1, and O2 (MS09).
Table 3: Analyses and results

HH05
- contrast: upright vs scrambled; 26 electrodes collapsed into 2 sites (left vs right)
- tests: laterality × stimulus type ANOVA; main effects; upright vs scrambled in RH*
- statistics: F(1,6)=7.1; reported as ns; F(1,12)=7.1

RHS06
- contrast: upright vs inverted; left posterior (P3, CP5) vs right posterior (P4, CP6)
- tests: laterality × stimulus type ANOVA; interaction; main effects; simple effects
- statistics: F(1,11)=6.767; “no other effects found”

RHLS08
- parietal: 3×1 ANOVA F(2,28)=3.535**; t-test upright vs impossible t(14)=2.312; t-test upright vs corrupted not reported; t-test impossible vs corrupted t(14)=1.803***
- frontal: 3×1 ANOVA F(2,28)=5.517; t-tests not reported

MS09
- tests: Wilcoxon signed-rank tests on mean amplitude in 100 ms bins across whole trials & each electrode
- statistics: no test statistics reported; electrodes & time frames reported as “p<.05”: P3: 800-2000 ms; P4: 1300-2000 ms; P7: 500-2000 ms; P8: 500-2000 ms; O2: 800-1300 ms; T8: 600-2000 ms

* RH = right hemisphere
** Reported as “a statistical trend” (p. 164)
*** Reported as “p = .05” (p. 164)
Test statistics and summary statistics are summarised in Table 3 and Table 4, respectively, and the directions of effects are shown in Figure 1. I will ignore the results for the frontal region examined by RHLS08, because they added this to investigate the perception of “corrupted body schema” motion and I decided to focus on the contrast of upright vs impossible motion.
Up until the results section, I expected HH05 to look for a main effect of stimulus type. This main effect is implied to be not significant: “only the laterality x stimulus type interaction was significant” (p. 302). Luckily they thought of lateralisation just in time! (Phew!) Taking this into account, they find a significant interaction: upright motion had a more negative amplitude than scrambled motion in the right hemisphere, but this contrast was reversed and not significant in the left hemisphere.
HH05 do not correct for multiple comparisons (despite testing one interaction effect and two main effects), which the interaction effect would not have held up to: F(1, 6) = 7.1, p = .037.
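For the record, the correction is a one-liner (using the numbers reported above):

```python
# HH05 report the interaction at F(1, 6) = 7.1, p = .037, but they ran
# three tests (two main effects plus the interaction) without correction.
p_interaction = 0.037
n_tests = 3

# Bonferroni: divide alpha by the number of tests
alpha_bonferroni = 0.05 / n_tests
print(round(alpha_bonferroni, 4))        # 0.0167
print(p_interaction < alpha_bonferroni)  # False - the interaction doesn't survive
```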
In contrast to HH05, RHS06 do predict an interaction of stimulus type and lateralisation, which is exactly what they find (F(1, 11) = 6.767, p = .025). Here, however, the amplitude for upright motion in the right hemisphere is significantly more positive than for inverted motion. One could argue that scrambled (HH05) and inverted (RHS06) PLDs may well elicit very different ERPs and that a reversed effect may thus not be surprising. But it’s important to note that the ERPs for upright motion look completely different in the two papers: Within the right hemisphere, mean amplitude in HH05 is roughly -9 μV, SE = 3 μV (taken from Figure 2B in the manuscript), whereas in RHS06 it is +1.95 μV, SE = 1.23 μV (p. 212-213). The difference between these values is d = 1.7!
RHLS08 do not mention lateralisation. They hypothesise a simple contrast between upright and impossible motion in the parietal area. What’s funny is that they cite HH05 to predict a more positive amplitude for upright stimuli even though we just saw that HH05 found a more negative amplitude:
Based on previous research (e.g. Hirai and Hiraki, 2005), we hypothesized that the perception of biological facets of the stimuli would manifest themselves in a parietal location with an increase in positivity for the biological motion compared to the biomechanically impossible motion. (p. 162)
They find a main effect of condition and a significant simple contrast between upright and impossible stimuli (t(14)=2.312, p = .037), which would not hold up to Bonferroni correction (they performed at least two post-hoc tests: corrupted vs impossible is not significant, upright vs corrupted is not reported – I have a quantum theory of unreported test results: they are simultaneously significant and not significant until you look at them, which immediately renders them not significant). Interestingly, mean amplitude for upright motion is positive as in RHS06, but this time less positive than the amplitude for impossible motion, despite being way larger than in RHS06: M = 6.28 μV, SE = 2.57 μV. This is noteworthy because the number represents an average across both hemispheres, not just the right hemisphere as in RHS06. If the amplitude for upright motion had been smaller in the left hemisphere, as it was in RHS06 and HH05, this should have attenuated the overall effect, and an average of this magnitude would be even less likely.
It may be hard to believe that these contradictory results could become even messier, but MS09 add yet another pattern to the mix: For the mid-parietal electrodes P3 and P4, they find significantly more positive activation for upright motion from 800 ms onwards (well outside the analysis window of any of the other studies), but for lateral parietal electrodes P7 and P8, the amplitude is less positive/more negative from 500 ms onwards. I don’t quite know what to make of this due to their creative 160-uncorrected-tests approach and the fact that they do not report any test statistics but only state “p<.05” for any given effect. Sadly this means that their results cannot be used for a p-curve analysis.
Summary: results Two papers find an interaction of stimulus and lateralisation with a greater difference between BM and nBM stimuli in the right hemisphere (HH05 and RHS06) – but the differences are in opposite directions. The other two papers find a significant difference between BM and nBM across hemispheres at mid-parietal electrodes P3 and P4 (RHLS08 and MS09) – but these two differences are again in opposite directions. Additionally, MS09 find an effect at lateral parietal electrodes P7 and P8, which again is in the opposite direction of their mid-parietal effect.
I don’t think I could have made up four less congruent results if I’d tried.
Table 4: Comparison of ERP amplitudes for upright motion
* Exact values were not provided in the text; the given values are estimates based on Figure 2B in the manuscript.
** MS09 do not provide amplitude means. Amplitude signs were taken from Figure 1 in the manuscript.
Table 4 summarises the incongruity of ERPs across papers for upright BM alone. The most tragic aspect of this is that we cannot even sum up all effects and conclude that taken together, there is none: The analysed time windows and scalp regions were shifted around so much between studies that these contradictory findings might still be compatible with each other!
So – do infants show an observable ERP effect when they’re viewing PLDs of biological versus non-biological motion? I ran a p-curve analysis on the results of HH05, RHS06, and RHLS08 (MS09 couldn’t be included because they don’t report test statistics or exact p-values). I stuck to the instructions of Simonsohn, Nelson, and Simmons and made a p-curve disclosure table. (This was the first time I did this so thoroughly, and it was a great experience – I can very much recommend it. It forces you to track down the authors’ actual hypotheses and to think about which analysis would test them. It sounds trivial, but it can be quite adventurous in the case of a not-so-tidy paper.) Three effects (the lateralisation × stimulus type interaction of HH05 and RHS06 and the upright vs impossible parietal contrast of RHLS08) are of course too small a sample for definitive conclusions, and the binomial tests for evidential value and lack of evidential value both come up not significant (p = .875 and p = .2557, respectively). But… just look at it!
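For the curious: p = .875 happens to be exactly what a one-sided binomial test gives when one of three significant p-values falls below .025 – my best guess at what the p-curve app computed here, so treat the counting as an assumption on my part:

```python
from math import comb

# The p-curve binomial test for evidential value: under the null, a
# significant p-value falls below .025 half the time. Assuming (my
# reconstruction, not taken from the app's output) that one of the three
# p-values counted as < .025, the one-sided p-value is P(X >= 1 | n=3, p=.5).
n, k = 3, 1
p_value = sum(comb(n, i) * 0.5**n for i in range(k, n + 1))
print(p_value)  # 0.875 - matching the reported test for evidential value
```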
I have developed a new rule of thumb to decide if I believe the findings in a paper: If all p’s are ≧ .025, I’m not having any of it. Of course that can happen for true effects, but in three consecutive papers? Papers that weren’t preregistered? Papers that throw red flags of obscure if-I-fail-to-mention-it-it’s-not-a-lie phrasing in your face like confetti at a carnival parade? I don’t think so.
Now you may say: But this research is 8 to 12 years old! Times have changed, and it seems like the motion perception folks have drawn the right conclusions from this carnage and stopped publishing papers on it. Right? Well, the reason I looked into this literature in the first place was that I reviewed a paper trying to build on it just this January.
I very much hope that infant ERP standards have improved since 2009, but the fact that a paper called “How to get statistically significant effects in any ERP experiment (and why you shouldn’t)” was published in Psychophysiology in December 2016 indicates that it’s probably not all good yet.
This story is an example of how cumulative science cannot work. If you want to build on the non-replicated work of someone else, we first need to know whether you can replicate the original effect. If you can’t, that’s fine! If you find a somewhat different effect: that’s fine too! There might be truth in your findings. But we need to know about it. Tell us what happened at time X at electrodes Y and Z when you first looked at those, because we know you did.
Preregister your hypotheses and make exploratory analyses great again! Ok, I realise that preregistration wasn’t a thing in 2008, but from what I’ve heard, deduction and truthful reporting were. Exploratory analyses dressed up as confirmatory despite running counter to previous studies or even your own predictions are ridiculously easy to see through. Your readers aren’t that cheap. We can learn a lot from the results of data exploration, but only if we know the full context.
And, for the sake of completeness (sigh): No, N = 7 is not ok. N = 15 isn’t either, especially when we’re talking about 15 wiggly little monsters who hate EEG caps like nothing else and will decide that it’s time to go home after 35 trials. I’m not even criticising the huge exclusion rates – I have worked in an infant EEG lab and I know it’s impossible to get around them. But if you honestly don’t even have the resources for 20 “good” participants (writing this causes almost physical pain when you know what’s needed are 3-digit numbers), team up with other labs (MASSIVE kudos to Michael Frank for starting this movement of saving developmental psychology’s neck) or just leave it be. Especially if your research isn’t a matter of life and death.
[Some wise yet witty concluding words I haven’t found yet]
A few more random oddities in case you’re interested.
At the end of their paper, HH05 briefly report having tested a younger age group: “In a preliminary study, five 6-month-old infants were also measured for their ERPs during perception of BM and SM. Contrary to the 8-month-old infants, we did not find a significant ERP difference between the responses to BM and SM. However, we cannot conclude that 6-month-old infants do not process BM in such a small subject pool” (p. 303). I like how N = 5 is too small a sample but N = 7 isn’t.
The (upright) stimuli used in RHS06 and RHLS08 sound identical, but RHLS08 do not cite their earlier paper (although the reason might have been to stay anonymous toward reviewers).
EEG recording in RHS06 and RHLS08 sounds identical too, but the former report recording 19 scalp electrodes and the latter 23, which seems strangely arbitrary. Also I would like to point out again that RHS06 do not report any filters.
“EEG was recorded continuously with Ag–AgCl electrodes from 19 scalp locations of the 10–20 system, referenced to the vertex (Cz). Data was amplified via a Twente Medical Systems 32-channel REFA amplifier. Horizontal and vertical electrooculargram were recorded bipolarly. Sampling rate was set at 250 Hz. EEG data was re-referenced offline to the linked mastoids” (p. 212)
“EEG was recorded continuously with Ag-AgCl electrodes from 23 scalp locations of the 10–20 system, referenced to the vertex (Cz). Data were amplified via a Twente Medical Systems 32-channel REFA amplifier. Horizontal and vertical electrooculargram were recorded bipolarly. Sampling rate was set at 250 Hz. EEG data were baseline corrected and re-referenced offline to the linked mastoids. Data were filtered with high and low-pass filters from 0.1 to 35 Hz” (p. 164)
RHS06 use a strange formulation to describe the time frame they analysed: “For statistical analysis a time window was chosen around the amplitude peak of the effect from 200 to 300 ms after stimulus onset” (p. 212). Does that mean they averaged the amplitude between 200 and 300 ms like HH05 did? Or did they look for a peak somewhere between 200 and 300 ms and then analysed a time bin of unknown onset and length around this peak?
RHLS08 use the same mysterious description: “For statistical analysis a time window was chosen in parietal regions (P3, P4) around the amplitude peak of the effect from 300–700 ms after stimulus onset” (p. 164). Interestingly, they use quite different language to describe the analysed time window in frontal regions: “For assessment of differences in frontal electrodes, we considered the mean amplitude in the three conditions from 200–350 ms after stimulus onset” (p. 164). Huh, so it’s not that they’re simply not able to use less convoluted language to tell us about how they computed an average. I can’t help but read these descriptions as “we looked at time window X but we won’t tell you which exact time bins within X we analysed”.
But hey, I’m still somewhat optimistic about the future of psychology (alternative explanation: I’m just an optimistic person, but I’ve noticed that heritability estimates don’t really make for entertaining blog posts) – here’s why:
Sometimes, it helps to take a more historical perspective to realize that we have come a long way: starting from an Austrian dude with a white beard who sort of built his whole theory of the development of the human mind on a single boy who was scared of horses, and who didn’t seem overly interested in a rigorous test of his own hypotheses, to, well, a present in which psychologists at least acknowledge that stuff should be tested empirically. Notice that I don’t want to imply that Freud was the founding father of psychology (that was, of course, Wilhelm Wundt, and I’m not only saying this because I am pretty sure that the University of Leipzig would revoke my degrees if I claimed otherwise). However, Freud is of – strictly historical – importance to my own subfield, personality psychology. Comparing the way he worked to the way we conduct our research today makes it obvious that things have changed for the better. Sure, personality psychology might be more boring and flairless nowadays, but really all I care about is that it is accurate.
You don’t even have to go back in time that far: sometimes, I have to read journal articles from the 80s. (Maybe the main reason why I care about good science is that sloppy studies make literature search even more tedious than it would be anyway.) Sure, not all journal articles nowadays are the epitome of honest and correct usage of statistics, but you really don’t stumble across “significant at the p < .20 level” frequently these days. And if you’re lucky, you will even get a confidence interval or an effect size estimate!
And you don’t even have to look at psychology: A Short History of Nearly Everything used to be my favorite book when I was in high school, and later, as a grad student, reading about the blunder years of other disciplines that grew up fine nonetheless (to varying degrees, obviously – but hey, did you know that plate tectonics became accepted among geologists as late as the 1960s?) gave me great hope that psychology is not lost.
Psychologists are starting to try to replicate their own as well as other researchers’ work – and often fail, which is great for science because this is how we learn things (for example, that some effects only work under very strict boundary conditions, such as “effect occurs only in this one lab, and probably only at that one point in time”).
We now have Registered Reports, in which peer review happens before the results are known – such a simple yet brilliant idea to prevent undesirable results from simply disappearing into the file drawer.
To date, 367 people have signed the Peer Reviewers’ Openness Initiative and will now request that data, stimuli, and materials be made public whenever possible (it can get complicated though), and 114 people have signed the Commitment to Research Transparency, which calls for reproducible scripts and open data for all analyses, but also states that the grading of a PhD thesis has to be independent of statistical significance or successful publication (really, this seems like a no-brainer, but then again, some people seem to mistake the ability to find p < .05 for scientific skill).
The psychology department of the Ludwig-Maximilians-Universität Munich explicitly embraced replicability and transparency in its job ad for a social psychology professorship. That’s by no means the norm yet, and I’m not sure whether this particular case worked out, but one can always dream.
The publication landscape is changing, too.
People are starting to upload preprints of their articles, which is a long overdue step in the right direction. Collabra is a new journal with a community-centered model that makes Open Access affordable to everyone.
Old journals are changing, too: Psychological Science now requires a data availability statement with each submission. The Journal of Research in Personality requires a research disclosure statement and invites replications (there are more examples, but these two come to mind because of their awesome editors). There are also journals that take a, uhm, more incremental approach to open and replicable science: I think it’s great that the recent editorial of the Journal of Personality and Social Psychology: Attitudes and Social Cognition concludes that underpowered studies are a problem, but somehow I feel like the journal (or the subfield?) is lagging a few years behind in the whole discussion about replicable science.
Additionally, media attention has been drawn to failed replications, sloppy research, and overhyped claims such as power pose, the whole infamous pizzagate story (not the story about the pedophile ring – the story about the psychological study that took place at an all-you-can-eat pizza buffet), and the weak evidence behind brain training games. Now you might disagree, but I take it as a positive sign that parts of the media are falling out of love with catchy one-shot studies, because that whole love affair has probably been damaging psychology by rewarding all the wrong behaviors. (Anne is skeptical about this point because she doubts that this is indicative of actual rethinking rather than a new kind of sexiness: the debunking of previously sexy findings. Julia is probably unable to give an unbiased opinion on this, as she happens to be the first author of a very sexy debunking paper. Now please excuse me while I give yet another interview about the non-existent effects of birth order on personality.)
And last but not least, we are using the internet now. A lot of the bad habits of psychologists – incomplete method sections, unreported failed experiments, data secrecy – are legacy bugs of the pre-internet era. A lot of the pressing problems of psychology are now discussed more openly thanks to social media. Imagine a grad student trapped in a lab, stubbornly trying to find evidence for that one effect, filing away one failed experiment after the other. What would that person have done 20 years ago? How would they ever have learned that this is not a weakness of their own lab, but a problem endemic to a system that only allows for the publication of polished, too-good-to-be-true results? (In case you know, tell me! I’d really like to know what it was like to get a PhD in psychology back then.) Nowadays, I’d hope that they would get an anonymous blog and bitch about these issues in public. Science benefits from less secrecy and less veneer.
Sometimes I get all depressed when I hear senior or mid-career people stating that the current scientific paradigm in psychology is splendid; that each failed replication only tells us that there is another hidden moderator we can discover in an exciting new experiment performed on 70 psychology undergrads; that early-career researchers who are dissatisfied are just lazy or envious or lack flair; and that people who care about statistics are destructive iconoclasts (what an amazing name for a rock band).
This is where we have to go back to the historical perspective.
While I would love to see a complete overhaul of psychology within just one generation of scientists, maybe it will take a bit longer: social change supposedly happens when cohorts get replaced (cf. Ryder, N. B. (1965). The cohort as a concept in the study of social change. American Sociological Review, 30, 843–861). Most people who are now professors were scientifically socialized under very different norms, and I can see how it is hard to accept that things have changed, especially if your whole identity is built around successes that are now under attack. (I have all the more respect for the seniors who are not like that but instead update their opinions, cf. Kahneman’s comment here, or who even actively push others to update their opinions, for example, the old farts I met at SIPS.) But really, what matters in the long run – and I guess we all agree that science will go on after we are all dead – is that the upcoming generation of researchers is informed about past mistakes and learns how to do proper science. Which is why you should probably go out and get active: teach your grad students about the awesome new developments of the last years; talk to your undergraduates about the replication crisis.
Bewildered students who are unable to grasp why psychologists haven’t been pre-registering their hypotheses and sharing their data all along are what keeps me optimistic about the future.
In my scientific work I strive to be as open as possible. Unfortunately, I work with data that I cannot de-identify well enough to share (aka weird sex diaries) and data that simply isn’t mine to share (aka the reproductive histories of all Swedish people since 1950). To compensate for this embarrassing lack of openness, I’ve tried to devise other ways of being transparent about my work. After a few failures (nobody ever tuned into my RStudio-speedrun Twitch channel), here is one that has worked quite well for me in the past. I now think it might be interesting even for people who don’t have to compensate for anything.
I did this because there’s a bunch of problems you can run into if you just share data and code:
Incomplete directions or code. Maybe you have some private knowledge like “First you need to set the working directory to C:\DavemasterDave\Nature_paper\PNAS_paper\JPSP_paper\PAID_paper\data, load the statistical packages in the order passed down to us from our elder advisors, rename this variable, and pray to Meehl.”
Inconsistent package versions. To reproduce the analyses, you used dplyr v0.5.0 instead of v0.4.9. Turns out they changed distinct() to mean “throw away all my data and don’t warn me about it”.
Hassle. First you download the code from the supplement file. Then get the data from Dryad. Then put them in the same folder and install the following packages (oh by the way you’ll need to compile this one from source for which you’ll need R, Rstudio, Rtools, and OS X Xcode command line tools). Oh and to get the codebook you just have to scale Mount Doom. Just don’t awake my advisor and you’re almost there.
Poor documentation of the code. What does the function transformData_???_profit do? What do the variables prc_stirn_b, fertile_fab mean? Is RJS_6R the one that’s reverse-coded or did you name it this way after reversing RJS_6? And the all-time classic: Are women sex == 2 or sex == 1 (or twice as sex as men)?
It starts with the cleaned data. When I work on a project, I spend 80% of my time on data preparation, cleaning, checking. Yet when I download reproducible files, they often start with a file named cleaned_final_2_Sunday23July_BCM.csv. Something went missing there. And there might be mistakes in that missing bit. In fact it’s approximately 99% more likely that I made a mistake cleaning my data than that Paul Bürkner made a mistake writing brms. As psychological science gets more complex, we should share our pipelines for wrangling data.
Last, but most aggravatingly: Loss. Turns out personal university pages aren’t particularly reliable places to store your data and code, and neither is that backup you made on that USB stick which you lent to your co-worker who... fuck.
Reproducible websites solve these problems. Side effects may include agonising uncertainty about whether you should release them and whether mistakes will be found (fun fact: nobody reads the supplement. I’ve been including a video of Rick Astley in every single one of my supplements and so far nobody noticed. Another fun fact: nobody reads blog posts longer than 1000 characters, so admitting this here poses no risk at all).
This is the stack that I use to make my reproducible websites.
R & RStudio. RStudio is an integrated development environment. This is not strictly necessary, but as RStudio makes or maintains a lot of the packages necessary to achieve reproducibility, using their editor is a smart choice.
Packrat. Packrat solves messes caused by different R package versions. Unfortunately, it’s a bit immature, so I’d currently recommend activating it only in the final stage of your project. If you often work on many projects simultaneously and know the grief it causes when a package update in one project breaks your code in another, it might be worth the hassle to keep it activated.
Rmarkdown (knitr). Markdown is an easy-to-learn markup language. Rmarkdown lets you document your R code, put the graphs where they belong, etc. It now also generates websites and bibliographies with little additional work.
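To give a flavour of what such a document looks like, here is a minimal hypothetical `.Rmd` skeleton (the title, chunk labels, and data file name are made up for illustration). Knitting it produces a standalone HTML page in which each chunk’s code, output, and graphs appear together:

````markdown
---
title: "Online supplement"   # page title of the rendered HTML
output: html_document        # knit to a standalone web page
---

## Data preparation

```{r load-data}
# hypothetical raw data file
raw <- read.csv("raw_data.csv")
```

## Results

```{r overview-plot}
# the figure is embedded directly below this chunk in the HTML
plot(raw)
```
````

Running `rmarkdown::render("supplement.Rmd")` (or clicking Knit in RStudio) turns this into the web page readers actually see.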
Git & GitHub. Git does distributed version tracking. Many scientists work alone, so Git may seem like overkill, but a) when you go open, you will never be alone again (cue sappy music), and b) the features GitHub offers (notably: hosting for your website through GitHub Pages) make up for Git’s somewhat steep learning curve. RStudio provides a rustic visual interface for Git; I personally prefer SourceTree.
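For readers who haven’t touched Git before, the core loop is small. A minimal sketch (project folder, author name, and email are placeholders; adding a GitHub remote and enabling GitHub Pages are separate steps not shown here):

```shell
# Create a throwaway project folder and put it under version control.
repo="$(mktemp -d)/my_supplement"   # placeholder project name
mkdir -p "$repo"
cd "$repo"
git init -q

# Stage the files that make up the reproducible website.
echo "# Online supplement" > README.md
git add README.md

# Commit with an explicit identity so the sketch runs on a fresh machine.
git -c user.name="Jane Doe" -c user.email="jane@example.org" \
  commit -q -m "Initial commit"

# One line per commit: the project's audit trail.
git log --oneline
```

After adding a remote (`git remote add origin …` followed by `git push`), the rendered HTML in the repository can be served to readers via GitHub Pages.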
Zenodo. They will permanently archive your website (and anything else you care to share) for free. If you make your stuff public, you get a citable DOI, which will lead to your document, even if Github and Zenodo should one day cease to exist. Zenodo can be integrated with Github, so that releases on Github are automatically uploaded to Zenodo.
To make this stack work well together, there’s a few hurdles to clear. And let me be completely frank with you: You still need R, RStudio, and a working knowledge of Mount Doom’s geography. But your readers will only need a web browser to make sense of your work (okay, printing works too, but one of my co-authors once tried to print out all my Mplus outputs, deforesting a substantial part of Honduras in the process).
To make it easier, I’ve uploaded a starter RStudio project that you can fork on Github to start out with a configuration that worked well for me in the past. I’ve tried to paper over some of the rough edges that this stack still has and I added some initial structure and code, so you can adapt it.
With these projects I document my entire research process using Rmarkdown (e.g. loading & processing the raw data, wrangling it into the right shape, analysing it, making graphs).
But instead of sharing the raw scripts (which only make sense when run interactively with the data), I create a website where readers see the code together with the resulting graphs and other output.
I use the reports generated this way to make sense of my own results and to share extensive results with my co-authors. Some friends even write their complete manuscripts using Scholarly Markdown and papaja, but for me this is more about not losing all the interesting details that can’t make it into the manuscript.
Here are two recent projects where I’ve used this stack (or an earlier version of it):
https://rubenarslan.github.io/paternal_age_fitness/ – the online supplement for this manuscript. This one is the largest I’ve made so far (~90 pages, >7000 graphs) and involved a complex setup including models run on a cluster, but documented offline. In this project I needed to repeat the same commands for many different samples, so I had to learn to compartmentalise my Markdown into components (more on that another day).
https://rubenarslan.github.io/generation_scotland_pedigree_gcta/ – the online supplement for this manuscript. This one is much simpler, just a single page with a lot of tabs. Here we presented a few models from our model selection procedure in the manuscript, but we wanted to show what results other choices would have produced and how different components influenced one another.
The stack should also work with some modifications for Shiny apps.
(this post was jointly written by Malte & Anne; in a perfect metaphor for academia, WordPress doesn’t know how to handle multiple authors)
We believe in scientific openness and transparency, and consider unrestricted access to data underlying publications indispensable. Therefore, we (not just the authors of this post, but all four of us) signed the Peer Reviewers’ Openness (PRO) Initiative, a commitment to only offer comprehensive review for, or recommend the publication of, a manuscript if the authors make their data and materials publicly available, unless they provide compelling reasons why they cannot do so (e.g. ethical or legal restrictions; the data-hungry dog of a former grad student whose name you forgot is not a compelling reason).
As reviewers, we enthusiastically support PRO and its values.
Also as reviewers, we think PRO can be a pain in the arse.
Ok, not really. But advocating good scientific practice (like data sharing) during peer review can result in a dilemma.
This is how it’s supposed to work: 1) You accept an invitation for review. 2) Before submitting your review, you ask the editor to relay a request to the authors to share their data and materials (unless they have already done so). 3) If authors agree – fantastic. If authors decline and refuse to provide a sensible rationale why their data cannot be shared – you reject the paper. Simple.
So far, so PRO. But here’s where it gets hairy: What happens when the action editor handling the paper refuses to relay the request for data, or even demands that such a request is removed from the written review?
Here is a reply Anne recently got from an editor after a repeated PRO request (the editor apologised for overlooking the first email – most likely an honest mistake; talk to Chris Chambers if you want to hear a few stories about the funny tendency of uncomfortable emails to get lost in the post):
“We do not have these requirements in our instructions to authors, so we can not ask for this without discussing with our other editors and associate editors. Also, these would need to involve the publication team. For now, we can relieve you of the reviewing duties, since you seem to feel strongly about your position.
Let me know if this is how we should proceed so we do not delay the review process further for the authors.”
Much like judicial originalists insist on interpreting the US constitution literally as it was written by a bunch of old white dudes more than two centuries ago, editors will sometimes cite existing editorial guidelines by which authors obligate themselves to share data on request, but only after a paper has been published – which has got to be the “Don’t worry, I use protection” argument of academia (we picked a heterosexual male perspective here, but we’re open to suggestions for other lewd examples). Also, we know that this system simply does. not. work.
As reviewers, it is our duty to evaluate submitted research reports, and data are not just an optional part of empirical research – they are the empirical research (the German Psychological Society agrees!). You wouldn’t accept a research report based on the promise that “theory and hypotheses are available on request”, right? (Except when reviewing for Cyberpsychology.)
PRO sets “data or it didn’t happen” as a new minimum standard for scientific publications. As a consequence, comprehensive review should only be offered for papers that meet this minimum standard. The technically correct (the best kind of correct) application of the PRO philosophy to the particular case of the Unimpressed Editor is straightforward: when they decide – on behalf of the authors! – that data should not or will not be shared, the principled consequence is to withdraw the offer to review the submission. As they say, PRO before the status quo.
Withdrawing from the process, however, decreases the chance that the data will be made publicly accessible, and thus runs counter to PRO’s ideals. As we say in German, “Operation gelungen, Patient tot” – surgery successful, patient deceased.
Adhering strictly to PRO would work great if everybody participated: the pressure on non-compliant journals would become too heavy. Then again, if everybody already participated, PRO wouldn’t be a thing. In the world of February 2017, editors can just appoint the next best reviewer, who might simply not care about open data (withdrawing from review might still have an impact even absent a major boycott, by causing the editors additional hassle and delaying the review process – then again, that delay would unfairly harm the authors, too). And couldn’t you push for a better outcome if you kept your foot in the door? Then again, if all PRO signatories eroded the initiative’s values that way, the day of reaching the critical mass for a significant boycott would never come.
A major concern here is that the authors are never given the chance to consider the request, although they might be receptive to the arguments presented. If an increased rate of data sharing is the ultimate goal, what is more effective: boycotting journals that actively suppress such requests by invited reviewers, or loosening the demands and merely suggesting that data should be shared, so that at least the gist gets through?
There are two very different ways to respond to such editorial decisions, and we feel torn because each seems to betray the values of open, valuable, proper scientific research. You ask: What is the best strategy in the long run? Door in the face! Foot in the door! Help, I’m trapped in a revolving door! We would really like to hear your thoughts on this!
RE: Door in the face
Thank you very much for the quick response.
Of course I would have preferred a different outcome, but I respect your decision not to request something from the authors that wasn’t part of the editorial guidelines they implicitly agreed to when they submitted their manuscript.
What I do not agree with are the journal’s editorial guidelines themselves, for the reasons I provided in my previous email. It seems counterproductive to invite peers as “gatekeepers” while withholding relevant information that is necessary for them to fulfill their duty until the gatekeeping process has been completed.
Your decision not even to relay my request for data sharing to the authors (who might gladly comply!), unfortunately, bars me from providing a comprehensive review of the submission. It is impossible for me to reach a recommendation about the research as a whole when I am only able to consider parts of it.
Therefore, I ask that you unassign me as a reviewer, and not invite me again for review except for individual manuscripts that meet these standards, or until the editorial policy has changed.
RE: Foot in the door
Thank you very much for the quick response.
Of course I would have preferred a different outcome, but I respect your decision not to request something from the authors that wasn’t part of the editorial guidelines they implicitly agreed to when they submitted their manuscript.
In fact, those same principles should apply to me as a reviewer, as I, too, agreed to review the submission under those rules. Therefore, in spite of the differences in my own personal standards versus those presented in your editorial guidelines, I have decided to complete my review of the manuscript as originally agreed upon.
You will see that I have included a brief paragraph on the benefits of data sharing in my review. I neither demand that the authors share their data, nor will I hold it against them if they decline to do so at this point. I simply hope they are persuaded by the scientific arguments presented in my review and elsewhere – in fact, I hope that you are, too.
I appreciate this open and friendly exchange, and I hope that you will consider changing the editorial guidelines to increase the openness, robustness, and quality of the research published in your journal.