TL;DR: What’s an age-effect net of all time-varying covariates? The sound of one hand clapping.
Recently, we submitted a paper with some age trajectories of measures of individuals’ (un-)well-being. We thought of these trajectories in the most descriptive way: How do these measures change across the life course, all things considered? And really while this might not be the most interesting research question because it doesn’t directly answer why stuff happens, I’m a fan of simple descriptive studies and think they should have a place in our domain; Paul Rozin wrote a great piece on the importance of descriptive studies.
Anyway, the editor asked us to justify why we did not include any time-varying covariates (e.g. income, education, number of children, health) in our analysis of age trajectories. I thought the editor had requested an actual justification; my co-author (an economist) thought the editor just wanted to tell us that we should throw in all sorts of covariates. I felt too lazy to re-run all analyses and create new figures and tables, plus I always get a weird twitch in my left eye when somebody asks for “statistical control” without additional justification, so instead I looked into the (scientific) literature on the midlife crisis and tried to figure out how people have justified the inclusion of control variables in the analyses of age effects on well-being. [Footnote: Ruben, on the other hand, would probably get a twitch in his right eye if he found out I did not automate making my figures and tables.]
Cat ownership, a time-varying covariate. (Pic: pixabay.com)
Whether or not life satisfaction dips in middle adulthood (somewhere between age 45-64) before rising again in older age has been hotly debated by psychologists and economists. [Footnote: ...before dipping again in a terminal decline. If you read that footnote, you’re now in the mortality salience condition of my Terror Management Theory study.] There are a lot of papers out there on that subject and personally, I’m totally agnostic regarding the existence of the midlife crisis – ask me again in 20 years, if I’m not too busy driving my Porsche. But there are a lot of interesting methodological questions that arise when trying to answer this question.
A brief list of stuff I don’t want to talk about in this post, which are important nonetheless:
the Age-Period-Cohort conundrum: In short, this requires us to make certain assumptions when we want to identify age/period/cohort effects. That’s okay though, every researcher needs to make assumptions from time to time.
longitudinal vs. cross-sectional data: Both can have their pros and cons.
what we can learn from lab studies in which researchers recruit older people and then compare their performance on an arbitrary task X to the performance of their convenient undergraduate sample. How do you reasonably match 60 year old people that decided to participate in lab studies onto a younger sample of psych majors that really just want to get their freakin’ course credit?
lots of other interesting stuff you can do with longitudinal data that is more interesting than simple descriptive trajectories
But let’s get back to the topic our editor raised: Should we control for time-varying covariates such as income, marital status, health? The logic seems straightforward: Wouldn’t we want to “purge” the relationship between age and life satisfaction from other factors?
Quite obviously, a lot of stuff changes as we age. We get older, get our degrees, start a decent job and make some money, [Footnote: Or, alternatively, start a blog.] maybe marry and settle down or travel to an Ashram to awaken our inner goddess and spite our conservative parents or maybe just get a lot of cats.
Mount Midlife Crisis, not to be confused with Mount Doom. (Art: Hakuin Ekaku)
To control for these variables might be wrong for two distinct reasons, and I will start with the somewhat more obscure one.
First, our time-varying covariate might actually be causally affected by life satisfaction. This argument has been raised regarding the statistical control of marital status by the late Norval Glenn (2009). He simulated a data situation in which (1) life satisfaction is stable across the life course and (2) starting from 21, only the 10 happiest people marry each year. He then demonstrated that controlling for marital status will result in a spurious trajectory, that is, a pronounced decline of life satisfaction over the life course even though we know that there’s no age effect in the underlying data. If you have read this blog before and the data situation sounds somewhat familiar to you: Marital status would be one of the infamous colliders that you should not control for because if you do, I will come after you. [Footnote: And you should be scared because I can deliver angry rants about inappropriate treatment of third variables. I might bring cookies and hope that you have decent coffee because this might take longer.] If marital status is affected by age (the older you are, the more likely you are to be married), and if satisfied people are more likely to marry, marital status becomes a collider of its two causes and should not be controlled.
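The logic of that simulation can be sketched in a few lines. This is not Glenn's code, just a minimal illustration under assumptions of my own: a hypothetical cohort of 1,000 people, standard-normal satisfaction scores that never change, and marriages from age 21 to 60.

```python
import random
from statistics import fmean

random.seed(2009)  # arbitrary seed, only for reproducibility

N_PEOPLE = 1000          # hypothetical cohort size (assumption)
MARRIAGES_PER_YEAR = 10  # "only the 10 happiest people marry each year"
AGES = range(21, 61)     # assumption: marriages happen from age 21 to 60

# (1) Satisfaction is fixed per person, so there is no age effect by construction.
#     Sorting from happiest to unhappiest makes the marriage rule trivial to apply.
satisfaction = sorted((random.gauss(0, 1) for _ in range(N_PEOPLE)), reverse=True)

trajectory = {}
for years_elapsed, age in enumerate(AGES, start=1):
    # (2) After k years, the married group is simply the k * 10 happiest people.
    cutoff = years_elapsed * MARRIAGES_PER_YEAR
    married, unmarried = satisfaction[:cutoff], satisfaction[cutoff:]
    trajectory[age] = (fmean(satisfaction), fmean(married), fmean(unmarried))

for age in (21, 40, 60):
    overall, married, unmarried = trajectory[age]
    print(f"age {age}: overall {overall:+.2f}, "
          f"married {married:+.2f}, unmarried {unmarried:+.2f}")
```

The overall mean is flat, but within each marital status the mean drops year after year: the married group keeps absorbing less happy people, and the unmarried group keeps losing its happiest members. Holding marital status constant therefore produces a decline that no individual ever experiences.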
The second reason is somewhat more obvious: In many cases, the time-varying covariates will mediate the effects of age on your outcome. That is probably most obvious for health: Health declines with age. Decreases in health affect life satisfaction. [Footnote: They obviously do, though the fact that life satisfaction remains stable until a certain age despite decreases in health has been labeled the happiness paradox.] So life satisfaction might decrease with age because of declining health. Now what does it mean if we control for this potential mediator?
Well, it means that we estimate the age effect net of the parts that are mediated through health. That is not inherently nonsensical, we just have to interpret the estimate properly. For example, Andrew Oswald was cited in Vol. 30 of the Observer: “[But] encouragingly, by the time you are 70, if you are still physically fit then on average you are as happy and mentally healthy as a 20 year old.” Now this might be indeed encouraging for people who think they are taking great care of their health and predict that they will be healthy by the time they are 70; but whether it’s encouraging on average strongly depends on the average health at age 70.
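To make "net of the parts that are mediated through health" concrete, here is a small sketch on fake data in which age affects satisfaction only through health. All effect sizes are invented for illustration. The "controlled" age coefficient is obtained by residualizing both age and satisfaction on health, which gives the same coefficient as a regression on age and health together.

```python
import random
from statistics import fmean

random.seed(70)  # arbitrary seed

def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(y, x):
    """Residuals of y after removing its linear association with x."""
    slope = ols_slope(x, y)
    intercept = fmean(y) - slope * fmean(x)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

n = 5000
age = [random.uniform(20, 80) for _ in range(n)]
# Fake data-generating process (assumption): health declines with age,
# and satisfaction depends on health only, not directly on age.
health = [-0.05 * a + random.gauss(0, 1) for a in age]
satisfaction = [h + random.gauss(0, 1) for h in health]

total_age_effect = ols_slope(age, satisfaction)
net_age_effect = ols_slope(residualize(age, health),
                           residualize(satisfaction, health))

print(f"age effect, all things considered: {total_age_effect:+.3f}")
print(f"age effect, controlling for health: {net_age_effect:+.3f}")
```

In this setup the descriptive age trend is clearly negative, while the age coefficient net of health hovers around zero. Neither number is wrong; they answer different questions.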
For example, if we assume that only the luckiest 1% of the population will be physically fit at that age, 99% will end up unhappier than 20 year olds (whether or not 20-year-olds are very happy is a different question). That doesn’t sound very optimistic any more, does it? The lucky one percent might also be very special with respect to other characteristics such as income, and a message such as “the wealthy will be still happy at age 70, whereas the poor are wasting away because of a lack of health care” again sounds not very encouraging. For the record, I’m not claiming that this is happening, but those are all scenarios that are aligned with the simple statement that those who are physically fit at age 70 are as mentally healthy as 20 year olds.
So the estimated association has its own justification but must be interpreted carefully. Additionally, it renders the “remaining” age effect hard to interpret, so it might not be very enlightening to look at age effects net of the effects of time-varying covariates. Let’s assume you “control” all sorts of stuff that happens in life as people age – marital status, education, income, number of children, maybe also number of good friends, cat ownership, changes in health, and while we are at it, why don’t we also control for the stuff that is underlying changes in health, such as functioning of organs and cell damage? – and still find a significant age effect.
What does that mean? Well, it means that you haven’t included all time-varying covariates that are relevant to life satisfaction because age effects must necessarily be mediated through something. The sheer passing of time only has effects because stuff happens in that time.
The “stuff” might be all sorts of things, and we might be inclined to consider that stuff more or less psychologically meaningful. For example, we might not consider changes in marital status or physical health to be “genuinely psychological”, so we might decide to control for these things to arrive at a purely psychological age effect. Such a “purely psychological” age effect might then be driven by e.g. people’s attitude towards the world. For example, people might get more optimistic and thus more satisfied controlling for other life circumstances. But I would again be careful with those interpretations, because of the collider problem outlined before and because of the somewhat arbitrary distinction between physical changes and social role changes as opposed to psychological changes.
In other words: what you should or shouldn’t control for always depends on your research question. If you study living things and control for life, don’t be surprised if your results seem a bit dull.
“The pursuit of knowledge is, I think, mainly actuated by love of power. And so are all advances in scientific technique.”
― Bertrand Russell
Today on the 100% CI we have to stop the jokes for a moment. Today, we will talk about being disappointed by your idols, about science criticism being taken too far. Here, we reveal how Dr. Andrew Gelman, the prominent statistician and statistics blogger, abused his power.
We do not make this accusation lightly, but our discoveries leave no other conclusion: In his skepticism of psychological science, Dr. Gelman lost sight of right and wrong. What follows is a summary of the evidence we obtained.
This March the four of us visited the Applied Statistics Center at Columbia University for a brief workshop on Stan, the probabilistic programming language. We were excited about this opportunity to learn from one of our personal heroes.
The course, however, did not live up to our expectations. Frequently, Dr. Gelman would interrupt the course with diatribes against psychological science. On the second-to-last afternoon, we were supposed to write our first Stan code in a silent study session. We were left alone in the Gelman lab and our minds wandered. Our attention was drawn to a particularly large file drawer that turned out to be unlocked. What we discovered can only be described as profoundly shocking:
But this particular file drawer problem was different: The lab log revealed that Dr. Gelman was desperate to obtain evidence against the phenomenon – and failed repeatedly. Initially, he invested enormous resources to run experiments with extraordinary four-digit sample sizes to “nail the coffin shut on Power Pose”, as a hand-written note on an early report reads. The data painted a very clear picture, and it was not to his liking. As it dawned on him that, contrary to his personal convictions, Power Posing might be a real phenomenon, he began to stack the deck.
Instead of simple self-reports, he tried manifest behavioral observations and even field studies where the effect was expected to vanish. Power Pose prevailed. He deliberately reduced study samples to the absurdly low numbers often criticized on his very own blog. But even in his last attempts with 1-β almost equal to ɑ: Power Pose prevailed. As more and more evidence in favor of Power Posing was gathered, the research became… sloppy. Conditions were dropped, outliers removed, moderators randomly added, and, yes, even p-values were rounded up. Much to Dr. Gelman’s frustration, Power Pose prevailed. He was *unable* to collect data in favor of the null hypothesis.
He thought he had one final Bayesian trick up his sleeve: By hiring a skilled hypnotist he manipulated his priors, his own beliefs (!) in Power Posing. But even with these inhumane levels of disbelief, the posterior always indicated beyond a doubt: Power Pose prevailed. It was almost like the data were trying to tell him something – but Dr. Gelman had forgotten how to listen to evidence a long time ago.
In a recent publication, Simmons and Simonsohn analyzed the evidential value of the published literature on Power Posing. The centerpiece of their research is a p-curve (figure below, left graph) on the basis of which they “conclusively reject the null hypothesis that the sample of existing studies examines a detectable effect.” Had Dr. Gelman not hidden his findings in a file drawer, Simmons and Simonsohn’s conclusions would have been dramatically different (right graph).
Initially, we couldn’t believe that he would go this far just to win an argument. We were sure there must have been some innocuous explanation – yet we also did not want to confront him with our suspicions right away. We wanted to catch him red-handed.
Thus, we decided to infiltrate one of his studies, which he was covertly advertising under the obvious pseudonym Mr. Dean Wangle. He administered the study wearing a fake moustache and a ridiculous French beret, but his voice is unmistakeable. Below is a video of an experimental session that we were able to record with a hidden camera. The footage is very tough to watch.
Combined, the evidence leaves only one conclusion: Andrew Gelman betrayed science in his war on power posing.
Does playing violent video games increase aggression? [Footnote: No, but violent video game research kinda does.] What makes advertisements persuasive? [Footnote: David Hasselhoff, obvs.] Are 5%, 25%, 75% of the population addicted to social media? [Footnote: It’s almost like humans have a fundamental need for social interactions.] Who are these people that watch porn? [Footnote: Literally everyone everywhere] Why do we enjoy cat videos so much? [Footnote: WHY???]
These are the typical research questions media psychologists are concerned with. Broadly, media psychology describes and explains human behavior, cognition, and affect with regards to the use and effects of media and technology. Thus, it’s a hybrid discipline that borrows heavily from social, cognitive, and educational psychology in both its theoretical approaches and empirical traditions. The difference between a social psychologist and a media psychologist that both study video game effects is that the former publishes their findings in JPSP while the latter designs “What Twilight character are you?” self-tests [Footnote: TEAM EDWARD FTW!] for perezhilton.com to avoid starving. And so it goes.
New is always better
A number of media psychologists are interested in improving psychology’s research practices and quality of evidence. Under the editorial oversight of Nicole Krämer, the Journal of Media Psychology (JMP), the discipline’s flagship journal, [Footnote: By “flagship” I mean one of two journals nominally dedicated to this research topic, the other being Media Psychology. It’s basically one of those People’s Front of Judea vs. Judean People’s Front situations.] not only signed the Transparency and Openness Promotion Guidelines, it has also become one of roughly fifty journals that offer the Registered Reports format.
To promote preregistration in general and the new submission format at JMP in particular, the journal launched a fully preregistered Special Issue on “Technology and Human Behavior” dedicated exclusively to empirical work that employs these practices. Andy Przybylski [Footnote: Who, for reasons I can’t fathom, prefers being referred to as a “motivational researcher”] and I were fortunate enough to be the guest editors of this issue.
The papers in this issue are nothing short of amazing – do take a look at them even if it is outside of your usual area of interest. All materials, data, analysis scripts, reviews, and editorial letters are available here. I hope that these contributions will serve as an inspiration and model for other (media) researchers, and encourage scientists studying media to preregister designs and share their data and materials openly.
Media Psychology BC [Footnote: Before Chambers]
If you already suspected that in all this interdisciplinary higgledy-piggledy, media psychology did not only inherit its parent disciplines’ merits, but also some of their flaws, you’re probably unerring-absolute-100%-pinpoint-unequivocally-no-ifs-and-buts-dead-on-the-money correct. Fortunately, Nicole was kind enough to allot our special issue editorial more space than usual in order to report a meta-scientific analysis of the journal’s past and to illustrate how some of the new practices can ameliorate the evidential value of research. For this reason, we surveyed a) availability of data, b) errors in the reporting of statistical analyses, and c) sample sizes and statistical power of all 147 studies in N = 146 original research articles published in JMP between volume 20/1, when it became an English-language publication, and volume 28/2 (the most recent issue at the time this analysis was planned). This blog post is a summary of the analyses in our editorial, which (including its underlying raw data, analysis code, and code book) is publicly available at https://osf.io/5cvkr/.
Availability of Data and Materials
Historically the availability of research data in psychology has been poor. Our sample of JMP publications suggests that media psychology is no exception to this, as we were not able to identify a single publication reporting a link to research data in a public repository or the journal’s supplementary materials.
Statistical Reporting Errors
A recent study by Nuijten et al. (2015) indicates a high rate of reporting errors in reported Null Hypothesis Significance Tests (NHSTs) in psychological research reports. To make sure such inconsistencies were avoided for our special issue, we validated all accepted research reports with statcheck 1.2.2, a package for the statistical programming language R that works like a spellchecker for NHSTs by automatically extracting reported statistics from documents and recomputing p-values. [Footnote: p-values are recomputed from the reported test statistics and degrees of freedom. Thus, for the purpose of recomputation, it is assumed that test statistics and degrees of freedom are correctly reported, and that any inconsistency is caused by errors in the reporting of p-values. The actual inconsistencies, however, can just as well be caused by errors in the reporting of test statistics and/or degrees of freedom.]
For our own analyses, we scanned all n_emp = 131 JMP publications reporting data from at least one empirical study (147 studies in total) with statcheck to obtain an estimate for the reporting error rate in JMP. Statcheck extracted a total of 1036 NHSTs reported in n_nhst = 98 articles. Forty-one publications (41.8% of n_nhst) reported at least one inconsistent NHST (max = 21), i.e. reported test statistics and degrees of freedom did not match reported p-values. Sixteen publications (16.3% of n_nhst) reported at least one grossly inconsistent NHST (max = 4), i.e. the reported p-value is < .05 while the recomputed p-value is > .05, or vice-versa. Thus, a substantial proportion of publications in JMP seem to contain inaccurately reported statistical analyses, of which some might affect the conclusions drawn from them (see Figure 1).
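The two definitions above (inconsistent vs. grossly inconsistent) can be illustrated with a toy check. This is emphatically not statcheck: to stay within Python's standard library it only handles z statistics, the function names and the tolerance are my own inventions, and it ignores refinements such as rounding of the reported test statistic.

```python
from statistics import NormalDist

def recompute_p(z, two_tailed=True):
    """Recompute a p-value from a reported z statistic."""
    tail = 1 - NormalDist().cdf(abs(z))
    return 2 * tail if two_tailed else tail

def check_reported_test(z, reported_p, alpha=0.05, tolerance=0.0005):
    """Compare a reported p-value with the recomputed one.

    Returns (recomputed_p, inconsistent, grossly_inconsistent).
    'tolerance' is a crude stand-in for rounding to three decimals.
    """
    p = recompute_p(z)
    inconsistent = abs(p - reported_p) > tolerance
    # Gross inconsistency: reported and recomputed p fall on opposite sides of alpha.
    gross = inconsistent and ((reported_p < alpha) != (p < alpha))
    return p, inconsistent, gross

# Made-up examples, not taken from any JMP article.
examples = [(2.20, 0.028), (2.50, 0.020), (1.80, 0.040)]
for z, reported in examples:
    p, inconsistent, gross = check_reported_test(z, reported)
    print(f"z = {z:.2f}, reported p = {reported:.3f}, recomputed p = {p:.3f}, "
          f"inconsistent = {inconsistent}, gross = {gross}")
```

The first made-up result checks out, the second is inconsistent but still significant, and the third flips from significant to non-significant once recomputed.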
Caution is advised when speculating about the causes of the inconsistencies. Many of them are probably clerical errors that do not alter the inferences or conclusions in any way. [Footnote: For example, in 20 cases the authors reported p = .000, which is mathematically impossible (for each of these, the recomputed p was < .001). Other inconsistencies might be explained by authors not declaring that their tests were one-tailed (which is relevant for their interpretation).] However, with some concern, we observe that clerical error is unlikely to be the only cause, as in 19 out of 23 cases the reported p-values were equal to or smaller than .05 while the recomputed p-values were larger than .05, whereas the opposite pattern was observed in only four cases. Indeed, if incorrectly reported p-values resulted merely from clerical errors, we would expect inconsistencies in both directions to occur at approximately equal frequencies.
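How surprising is a 19-to-4 split if both directions were equally likely? A quick back-of-the-envelope binomial calculation gives a rough idea. It is not part of the editorial's analysis, and it assumes the 23 cases are independent, which is doubtful because several of them may come from the same article.

```python
from math import comb

n_cases = 23          # cases where reported and recomputed p disagree about p < .05
n_toward_sig = 19     # reported p at or below .05, recomputed p above .05

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

one_sided = binomial_upper_tail(n_toward_sig, n_cases)
two_sided = min(1.0, 2 * one_sided)  # symmetric because p = 0.5

print(f"P(at least {n_toward_sig} of {n_cases} in one direction) = {one_sided:.4f}")
print(f"two-sided = {two_sided:.4f}")
```

Under the equal-frequency assumption, a split this lopsided would be rare, which fits the concern that not all of these errors are random slips.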
All of these inconsistencies can easily be detected using the freely available R package statcheck or, for those who do not use R, in your browser via www.statcheck.io.
Sample Sizes and Statistical Power
High statistical power is paramount in order to reliably detect true effects in a sample and, thus, to correctly reject the null hypothesis when it is false. Further, low power reduces the confidence that a statistically significant result actually reflects a true effect. A generally low-powered field is more likely to yield unreliable estimates of effect sizes and low reproducibility of results. We are not aware of any previous attempts to estimate average power in media psychology.
Strategy 1: Reported power analyses. One obvious strategy for estimating average statistical power is to examine the reported power analyses in empirical research articles. Searching all papers for the word “power” yielded 20 hits and just one of these was an article that reported an a priori determined sample size. [Footnote: In the 19 remaining articles power is mentioned, for example, to either demonstrate observed or post-hoc power (which is redundant with reported NHSTs), to suggest larger samples should be used in future research, or to explain why an observed nonsignificant “trend” would in fact be significant had the statistical power been higher.]
Strategy 2: Analyze power given sample sizes. Another strategy is to examine the power for different effect sizes given the average sample size (S) found in the literature. The median sample size in JMP is 139 with a considerable range across all experiments and surveys (see Table 1). As in other fields, surveys tend to have healthy sample sizes apt to reliably detect medium to large relationships between variables.
For experiments (including quasi-experiments), the outlook is a bit different. With a median sample size per condition/cell of 30.67, the average power of experiments published in JMP to detect small differences between conditions (d = .20) is 12%, 49% for medium effects (d = .50), and 87% for large effects (d = .80). Even when assuming that the average effect examined in the media psychological literature could be as large as those in social psychology (d = .43), our results indicate that the chance that an experiment published in JMP will detect them is 38%, worse than flipping a coin. [Footnote: An operation that would be considerably less expensive.]
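These power figures can be roughly reproduced by hand. The sketch below is not the pwr-based analysis from the editorial: it uses a normal approximation to the two-sample, two-tailed t-test so that it runs on Python's standard library alone, and the function name is hypothetical. The approximation is slightly optimistic at this sample size, so expect values about a percentage point above the ones reported.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-sample test for effect size d.

    Normal approximation: the test statistic is treated as N(d * sqrt(n / 2), 1).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

median_n_per_cell = 30.67  # median sample size per condition, as reported above
for d in (0.20, 0.43, 0.50, 0.80):
    power = approx_power_two_groups(d, median_n_per_cell)
    print(f"d = {d:.2f}: approximate power = {power:.2f}")
```

Reading the function the other way around is just as instructive: with about 31 participants per cell, only effects of roughly d = .50 or larger have at least an even chance of being detected.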
Table 1. Sample sizes and power of studies published in JMP volumes 20/1 to 28/2. n = number of published studies; MD_S = median sample size; MD_S/cell = median sample size per condition; 1-β (r = .1 / d = .2), 1-β (r = .3 / d = .5), 1-β (r = .5 / d = .8) = power to detect small/medium/large bivariate relationships/differences between conditions.
For between-subjects, mixed designs, and total we assumed independent t-tests. For within-subjects designs we assumed dependent t-tests. All tests two-tailed, α = .05. Power analyses were conducted with the R package pwr 1.20.
Feeling the Future of Media Psychology
The above observations could lead readers to believe that we are concerned about the quality of publications in JMP in particular. If anything, the opposite is true, as this journal recently committed itself to a number of changes in its publishing practices to promote open, reproducible, high-quality research. These analyses are simply another step in a phase of sincere self-reflection. Thus, we would like these findings, troubling as they are, to be taken not as a verdict, but as an opportunity for researchers, journals, and organizations to reflect similarly on their own practices and hence improve the field as a whole.
One key area which could be improved in response to these challenges is how researchers create, test, and refine psychological theories used to study media. Like other psychology subfields, media psychology is characterized by frequent emergence of new theories which purport to explain phenomena of interest. [Footnote: As James Anderson recently put it in a very clever paper (as usual): “Someone entering the field in 2014 would have to learn 295 new theories the following year.”] This generativity may, in part, be a consequence of the fuzzy boundaries between exploratory and confirmatory modes of social sciences research.
Both modes of research – confirming hypotheses and exploring uncharted territory – benefit from preregistration. Drawing this distinction helps the reader determine which hypotheses carefully test ideas derived from theory and previous empirical studies, and it liberates exploratory research from the pressure to present an artificial hypothesis-testing narrative.
As technology experts, media psychology researchers are well positioned to use and study new tools that shape our science. A range of new web-based platforms have been built by scientists and engineers at the Center for Open Science, including their flagship, the OSF, and preprint services like PsyArXiv. Designed to work with scientists’ existing research flows, these tools can help prevent data loss due to hardware malfunctions, misplacement, [Footnote: Including mindless grad students and hungry dogs] or relocations of researchers, while enabling scientists to claim more credit by allowing others to use and cite their materials, protocols, and data. A public repository for media psychology research materials is already in place.
Like psychological science as a whole, media psychology faces a pressing credibility gap. Unlike some other areas of psychological inquiry, [Footnote: such as meta science] however, media research (whether concerning the Internet, video games, or film) speaks directly to everyday life in the modern world. It affects how the public forms their perceptions of media effects, and how professional groups and governmental bodies make policies and recommendations. In part because it is key to professional policy, empirical findings disseminated to caregivers, practitioners, and educators should be built on an empirical foundation with sufficient rigor.
We are, on balance, optimistic that media psychologists can meet these challenges and lead the way for psychologists in other areas. This special issue and the registered reports submission track present an important step in this direction and we thank the JMP editorial board, our expert reviewers, and of course, the dedicated researchers who devoted their limited resources to this effort.
The promise of building an empirically-based understanding of how we use, shape, and are shaped by technology is an alluring one. We firmly believe that incremental steps taken towards scientific transparency and empirical rigor will help us realize this potential.
If you read this entire post, there’s a 97% chance you’re on Team Edward.
Scroll to the very end of this post for an addendum. [Footnote: If you only see footnotes, you have scrolled too far.]
Reading skills of children correlate with their shoe size. Number of storks in an area correlates with birth rate. Ice cream sales correlate with deaths by drowning. Maybe they used different examples to teach you, but I’m pretty sure that we’ve all learned about confounding variables during our undergraduate studies. After that, we’ve probably all learned that third variables ruin inference, yadda yadda, and obviously the only way to ever learn anything about cause and effect is proper experiments, with randomization and stuff. End of the story, not much more to learn about causality. [Footnote: YMMV and I hope that there are psych programs out there that teach more about causal inference in non-experimental settings.] Throw in some “control variables” and pray to Meehl that some blanket statement “Experimental studies are needed to determine whether…” will make your paper publishable anyway.
Causal inference from observational data boils down to assumptions you have to make [Footnote: There’s no free lunch in causal inference. Inference from your experiment, for example, depends on the assumption that your randomization worked. And then there’s the whole issue that the effects you find in your experiment might have literally nothing to do with the world that happens outside the lab, so don’t think that experiments are an easy way out of this misery.] and third variables you have to take into account. I’m going to talk about a third variable problem today, conditioning on a collider. You might not have heard of this before, but every time you condition on a collider, a baby stork gets hit by an oversized shoe filled with ice cream [Footnote: Just to make sure: I don’t endorse any form of animal cruelty.] and the quality of the studies supporting your own political view deteriorates. [Footnote: If you are already aware of colliders, you will probably want to skip the following stupid jokes and smugness and continue with the last two paragraphs in which I make a point about viewpoint bias in reviewers’ decisions.]
Let’s assume you were interested in the relationship between conscientiousness and intelligence. You collect a large-ish sample of N = 10,000 [Footnote: As we say in German: “Gönn dir!” (roughly: “Treat yourself!”)] and find a negative correlation between intelligence and conscientiousness of r = -.372 (see Figure 1).
However, your sample consisted only of college students. Now you might be aware that there is a certain range restriction in intelligence of college students (compared to the overall population), so you might even go big and claim that the association you found is probably an underestimation! Brilliant.
The collider – being a college student – rears its ugly head. Being a college student is positively correlated with intelligence (r = .426). It is also positively correlated with conscientiousness (r = .433). [Footnote: Just to make sure: This is fake data. Fake data should not be taken as evidence for the actual relationship between a set of variables (though some of the more crafty and creative psychologists might disagree).] Let’s assume that conscientiousness and intelligence have a causal effect on college attendance, and that they are actually not correlated at all in the general population, see Figure 2.
If you select a college sample (i.e. the pink dots), you will find a negative correlation between conscientiousness and intelligence of, guess what, exactly r = -.372, because this is how I generated my data. There is a very intuitive explanation for the case of dichotomous variables: [Footnote: The collider problem is just the same for continuous measures.] In the population, there are smart lazy people, stupid diligent people, smart diligent people and stupid lazy people. [Footnote: Coincidentally, you will find each of the four combinations represented among the members of The 100% CI at any given point in time, but we randomly reassign these roles every week.] In your hypothetical college sample, you would have smart lazy people, stupid diligent people, smart diligent people but no stupid lazy people because they don’t make it to college. [Footnote: Ha, ha, ha.] Thus, in your college sample, you will find a spurious correlation between conscientiousness and intelligence. [Footnote: Notice that you might be very well able to replicate this association in every college sample you can get. In that sense, the negative correlation “holds” in the population of all college students, but it is a result from selection into the sample (and not causal processes between conscientiousness and intelligence, or even good old fashioned confounding variables) and doesn’t tell you anything about the correlation in the general population.]
By the way, additionally sampling a non-college sample and finding a similar negative correlation among non-college peeps wouldn’t strengthen your argument: You are still conditioning on a collider. From Figure 2, you can already guess a slight negative relationship in the blue cloud (if you are really good at guessing correlations, which is a skill you can train, you might even see that it’s about r = -.200), and pooling all data points and estimating the relationship between IQ and conscientiousness while controlling for the collider results in r = -.240. Maybe a more relevant example: If you find a certain correlation in a clinical sample, and you find the same correlation in a non-clinical sample, that doesn’t prove it’s real in the not-so-unlikely case that ending up in the clinical sample is a collider caused by the variables you are interested in.
On an abstract level: Whenever X1 (conscientiousness) and X2 (intelligence) both cause Y (college attendance) in some manner, conditioning on Y will bias the relationship between X1 and X2 and potentially introduce a spurious association (or hide an existing link between X1 and X2, or exaggerate an existing link, or reverse the direction of the association…). Conditioning can mean a range of things, including all sorts of “control”: Selecting respondents based on their values on Y (or on anything that is caused by Y, because the whole collider logic also extends to so-called descendants of a collider)? That’s conditioning on a collider. Statistically controlling for Y? That’s conditioning on a collider. Generating propensity scores based on Y to match your sample for this variable? That’s conditioning on a collider. Running analyses separately for Y = 0 and Y = 1? That’s conditioning on a collider. Washing your hair in a long, relaxing shower at CERN? You better believe that’s conditioning on a collider. If survival depends on Y, there might be no way for you to not condition on Y unless you raise the dead.
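If you want to watch a collider at work, here is a minimal simulation sketch. It is not the code behind Figure 2: the sample size, the extra noise term, the cutoff of 0.5 and all names (`pearson_r`, `simulate_collider`) are assumptions made up for illustration, so the correlations will not match the values quoted above.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_collider(n=50_000, cutoff=0.5, seed=1):
    """Fake data: X1 and X2 are independent, but both raise the chance of Y = 1."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        consc = rng.gauss(0, 1)  # X1, conscientiousness (standardised)
        iq = rng.gauss(0, 1)     # X2, intelligence (standardised)
        luck = rng.gauss(0, 1)   # everything else that gets you into college
        college = consc + iq + luck > cutoff  # Y, the collider
        people.append((consc, iq, college))

    def r(rows):
        return pearson_r([p[0] for p in rows], [p[1] for p in rows])

    return {
        "population": r(people),                              # no conditioning
        "college": r([p for p in people if p[2]]),            # Y = 1 only
        "non_college": r([p for p in people if not p[2]]),    # Y = 0 only
    }


result = simulate_collider()
for group, value in result.items():
    print(f"{group:12s} r = {value:+.3f}")
```

Under these assumptions, the population correlation should hover around zero while both subgroups show a negative one, which is the "replicating it in the non-college sample doesn't help" point from above in miniature.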
When you start becoming aware of colliders, you might encounter them in the wild, aka everyday life. For example, I have noticed that among my friends, those who study psychology (X1) tend to be less aligned with my own political views (X2). The collider is being friends with me (Y): Psychology students are more likely to become friends with me because, duh, that’s how you find your friends as a student (X1->Y). People who share my political views are more likely to become friends with me (X2->Y). Looking at my friends, they are either psych peeps or socialist anti-fascist freegan feminists. (This might sound like I want to imply that the other authors of this blog are fascists, but that wasn’t true last time I checked.) Even though those two things are possibly positively correlated in the overall population (actually, I’m pretty damn sure that the average psych student is more likely to be a socialist anti-fascist freegan feminist than the average person who is not a psychology student), the correlation in my friends sample is negative (X1 and X2 are negatively correlated conditional on Y).
Other examples: I got the impression that bold claims are negatively correlated with methodological rigor in the published psychological literature, but maybe that’s just because both flashy claims and methodological rigor increase chances of publication and we just never get to see the stuff that is both boring and crappy? (This might come as less of a surprise to you if you’re a journal editor because you get to see the whole range.)
At some point, I got the impression that female (X1) professors were somewhat smarter (X2) than male professors, and based on that, one might conclude that women are smarter than men. But female professors might just be smarter because tenure (Y) is less attainable for women (X1->Y; for whatever reason, you can add mediators such as discrimination) and more likely for smart people (X2->Y), so that only very smart women become professors but some mediocre males can also make it. The collider strikes again!
Tenure and scientific eminence are nice examples in general because they are colliders for a fuckload of variables. For example, somebody had suggested that women were singled out as instances of bad science because of their gender. Leaving aside the issue of whether women are actually overrepresented among the people who have been shamed for sloppy research (I actually have no clue whether that’s true or not, I just don’t have any data and no intuition on that matter), such an overrepresentation would tell us neither that women are unfairly targeted nor that women are more prone to bad research practices. (Notice that both accounts would equal a causal effect of gender, as the arrows are pointing away from “gender” and end at “being criticised for bad research”, no matter what happens in between. Of course, the parts in between might be highly informative.) Assuming that women (X1) have worse chances to get into the limelight than men, but overstating the implications of your evidence (X2) helps with getting into the limelight, we could find that women in the limelight (conditioning on Y) are more likely to have overstated their evidence because the more tempered women simply didn’t make it. That’s obviously just wild speculation, but in everyday life, people are very willing to speculate about confounding variables, so why not speculate a collider for a change?
Which leads to the last potential collider that I would like you to consider. Let’s assume that the methodological rigor of a paper (X1) makes you more likely to approve of it as a reviewer. Furthermore, let’s assume that you – to some extent – prefer papers that match your own bias (X2). (For example, I believe that the metric system is objectively superior to others, so I wouldn’t approve of a paper that champions the measurement of baking ingredients in the unit of horse hooves. If you think I chose this example because it sounds so harmless, you haven’t heard me rant about US-letter format yet.) Even if research that favors your point of view is on average just as good as research that tells a different story (X1 and X2 are uncorrelated), your decision to let a paper pass or not (Y) will introduce a negative correlation: The published papers that match your viewpoint will on average be worse. (Plomin et al. claimed that the controversy surrounding behavioral genetics led to the extra effort necessary to build a stronger foundation for the field, which is the flipside of this argument.)
So peeps, if you really care about a cause, don’t give mediocre studies an easy time just because they please you: At some point, the whole field that supports your cause might lose its credibility because so much bad stuff got published.
Addendum: Fifty Shades of Colliders
Since publishing this post, I have learned that a more appropriate title would have been “That one weird third variable problem that gets mentioned quite a bit across various contexts but somehow people seem to lack a common vocabulary so here is my blog post anyway also time travel will have had ruined blog titles by the year 2100.”
One of my favorite personality psychologists, Sanjay Srivastava (also: one of the few people I think one can call “personality psychologist” without offending them; not sure though, *hides*), blogged about the “selection-distortion effect” before it was cool, back in 2014.
Neuro-developmental psychologist Dorothy Bishop talks about the perils of correlational data in the research of developmental disorders in this awesome blog post and describes the problems of within-groups correlations.
Last but not least, Patrick Forscher just started a series of blog posts about causality (first and second post are already up), starting from scratch. I highly recommend his blog for a more systematic yet entertaining introduction to the topic! (No CERN jokes though. Those are the100.ci-exclusive!)
If you only see footnotes, you have scrolled too far.
YMMV and I hope that there are psych programs out there that teach more about causal inference in non-experimental settings.
“Experimental studies are needed to determine whether…”
Added bonus: After reading this, you will finally know how to decide whether or not a covariate is necessary, unnecessary, or even harmful.
I have been informed that only grad students can afford to actually read stuff, which is kind of bad, isn’t it?
There’s no free lunch in causal inference. Inference from your experiment, for example, depends on the assumption that your randomization worked. And then there’s the whole issue that the effects you find in your experiment might have literally nothing to do with the world that happens outside the lab, so don’t think that experiments are an easy way out of this misery.
Just to make sure: I don’t endorse any form of animal cruelty.
If you are already aware of colliders, you will probably want to skip the following stupid jokes and smugness and continue with the last two paragraphs in which I make a point about viewpoint bias in reviewers’ decisions.
As we say in German: “Gönn dir!”
[Disclaimer: I am not an EEG expert. I probably got some things wrong. Please let me know about them.]
TL;DR: I reviewed four infant ERP studies on the same topic and found that their results are maximally incongruent with each other. Yet the analytical choices made in the papers differ too much to even allow the conclusion that there probably is no underlying effect.
If you just want to skim this post, you can skip to the short summaries at the end of each section, which I highlighted so they’re easy to find.
Estimated reading time (excl. tables): 17 minutes
Some weeks ago, I reviewed an EEG paper on infants’ perception of biological motion. The authors cited four older studies that report ERP correlates of 5- and 8-month-old children’s ability to discriminate normal human motion, such as walking, from different forms of unnatural or non-biological motion.
Because I wasn’t familiar with this literature and wanted to be a good reviewer, I went and had a look at these fab four. What I found was a combined sample size of 51, four different analysed time windows and region-of-interest combinations, a left-skewed p-curve, and a lot of question marks on my forehead. This blog post is a story of my journey digging through these papers to see what they can tell us about infants’ perception of biological motion.
The four studies in question:
Hirai, M., & Hiraki, K. (2005). An event-related potentials study of biological motion perception in human infants. Brain Research: Cognitive Brain Research, 22, 301–304.
Marshall, P. J. & Shipley, T. F. (2009). Event-related potentials to point-light displays of human action in five-month-olds infants. Developmental Neuropsychology, 34(3), 368-377. doi: 10.1080/87565640902801866
You have probably seen videos of point-light displays (PLDs) of human motion before: Single dots represent the joints of a person and despite this seemingly impoverished setup (compared to a normal video recording), it is surprisingly easy to recognise the displayed action, e.g., a walking person. I didn’t embed an example video because I don’t want to scare away my new pals with an Elsevier lawsuit this early, but the Biomotion Lab at Queen’s University has put some of their cool stimuli online.
Whenever you find that you can perform some cognitive task with apparent ease (like recognising a bunch of moving dots as a walking person), a developmental psychologist somewhere gets very itchy and really, really wants to know at what exact point between its nonexistence and current giant-walnut-like state your brain acquired this intriguing skill.
The four papers I’m reviewing here look for EEG correlates of this previously found behavioural effect via event-related potentials (ERPs). My aim is to find out if they can tell us something about what happens on an infant’s scalp when they watch PLDs of upright human motion. I will first give a very brief summary of each study and then compare their analytical choices and results with a focus on the contrast between upright biological motion (BM) and “non-biological motion” (nBM).
You will notice that in almost all cases, the nBM PLDs consist of points whose motion paths and velocity are identical to the points in the BM PLDs. “Non-biological” thus refers to the relation of the individual points to each other (in scrambled PLDs, where the points’ starting positions have been randomised) or to the orientation of the displayed figure (in inverted PLDs that have been turned upside down).
Because EEG results depend heavily on many boring but important technical details about recording and analysis, this post contains a bunch of big, cluttered tables with a tiny font size which I feel terrible about yet not terrible enough to spare you them. I simply didn’t find a more elegant way to include all this information. Feel free to ignore most of it (for a good laugh I recommend having a look at Table 1 for sample sizes and exclusion rates, though) if you’re only here for the stats rage. (It’s totally a thing.)
Alright! Fasten your seatbelts, here we go:
HH05 (Hirai & Hiraki, 2005)
The rationale is simple: We know infants are sensitive to biological motion, but nobody has looked at neural correlates of this before, so let’s check it out. HH05 investigate 8-month-olds’ ERPs in reaction to PLDs of upright walking compared to scrambled motion (the points’ paths and velocity are identical to the upright condition, but their starting points are randomised – check “scrambled” in the Biomotion Lab animation to get an idea). Each trial lasts 510 ms.
RHS06 (Reid et al., 2006)
RHS06 look at the same age group (8-month-olds). But in contrast to HH05, they compare upright motion to inverted motion (turning the animation upside down). With 1000 ms, their trials are twice as long as HH05’s. Another difference is that they use two different kinds of movement: walking and kicking, thus creating a 2×2 condition design (action type: walking vs kicking x orientation: upright vs inverted). What’s funny about this is that they do not once mention why they added the kicking motion, and in the remainder of the paper collapse walking and kicking into a simple contrast between upright and inverted.
My p-hacking alarm bells started to make some clunking noises when I first read this. You just don’t add stuff to an experiment and then never mention it again, especially when it sets your study apart from previous ones. You only do that when you tried something and it didn’t work. Please tell me if this is an unfair assumption.
Table 1 (fragment), experimental conditions:
RHLS08, 3 conditions: upright (walking & kicking) vs corrupted (walking with backward-flexing knees) vs impossible (kicking with spinning leg)
MS09, 2 conditions: upright (walking, running, throwing a ball, kicking a ball) vs scrambled
* Refers to each cell of the original 2x2 (action x orientation) design if I understand correctly.
** Refers to the two main conditions (upright vs inverted).
**** Plus 12 kids who wouldn’t even wear the EEG cap. I suspect similar rates in the other papers that are not reported.
RHLS08 (Reid et al., 2008)
RHLS08 again investigate 8-month-olds, but throw another set of considerations into the mix: They compare upright motion (again walking and kicking) to a) a “corrupted body schema” condition where the walking PLDs were edited such that the knees bent backwards, and b) a “biomechanically impossible” condition where the kicking PLDs were edited such that the kicking leg seemed to come off and spin in a circle. Trial length is again 1000 ms.
The dropout rate in this study struck me as odd: with a final sample of N=15 and 40 exclusions, the ratio of excluded to included infants (40/15 ≈ 2.7) is 3.2x as high as in RHS06 (N=12, 10 exclusions; 10/12 ≈ 0.8). What happened there?
This relatively high attrition rate was due to three experimental conditions in the present study when compared to the standard two conditions in most infant ERP studies. (p. 162)
Ok, but wait a second… Didn’t RHS06 start out with even more conditions (four)? Is that why they forgot about one of their factors halfway through their paper and changed it to a simple contrast between upright and inverted PLDs?
I’m not liking this.
MS09 (Marshall & Shipley, 2009)
MS09 go back to the basics – upright versus scrambled motion, but this time with 5-month-olds. Their trials are twice as long as RHS06’s and RHLS08’s (2000-2300 ms). For unnamed reasons they use four different types of action: walking, running, throwing a ball, and kicking a ball. Each condition consists of only four trials of each of these actions (16 upright, 16 scrambled). Here’s their justification for the low trial number:
ERP averages composed of less than 20 trials are not unusual in the infant visual ERP literature (e.g., de Haan & Nelson, 1997; Snyder, Webb, & Nelson, 2002), especially in studies involving dynamic social stimuli (Reid et al., 2008; Striano, Reid, & Hoehl, 2006). (p. 370)
Ah, the classic “we used a shitty design because others do it too” argument. What I find more worrying than the low total number of trials are the fairly heterogeneous stimuli: I would not expect the brain to react identically when viewing displays of continuous walking versus distinct goal-directed actions involving an inanimate object (throwing/kicking a ball). What can we expect from an average of only eight instances of each of these? I’m not an EEG expert but this simply isn’t going to work.
Summary: overview
We have two studies comparing upright and scrambled motion (HH05 and MS09), one comparing upright and inverted motion (RHS06), and one comparing upright motion with a “corrupted body schema” condition and a “biomechanically impossible” condition (RHLS08). Three studies look at 8-month-olds and one looks at 5-month-olds.
Table 2: EEG recording and preprocessing (cells grouped by study, in the order they appear; missing cells are left out)
HH05: 62, Geodesic Sensor Net; 0.1 - 100 Hz bandpass; 30 Hz low-pass; 100 ms pre trial
RHS06: 100 ms pre trial
RHLS08: 0.1 Hz high-pass; 35 Hz low-pass; 100 ms pre trial + first 100 ms of trial
MS09: (10-20 system), Electro-Cap; 0.1 Hz high-pass; 100 Hz low-pass; 100 ms pre trial
Design & analysis
Which dependent variables did the studies look at? In other words: Which time windows at which electrode sites were analysed? (See Table 2 for boring EEG recording details.)
HH05 define a target time window at 200-300 ms after trial onset based on adult ERP data. To me this sounds surprising because as a rule of thumb I would expect infant ERPs to show up later than corresponding adult ERPs (because let’s face it, babies are a bit dim). But anyway, at least they do give a justification. They pick 26 electrodes in the occipitotemporal region as their target area (see Figure 1) and compare right and left hemisphere (13 electrodes on each side). They do not provide any justification for either the chosen area or the fact that they compare left and right hemisphere (now their design turned into a 2×2 interaction: stimulus type x lateralisation).
RHS06 stick with the time window of 200-300 ms, with the analysis of lateralisation effects, and, broadly speaking, with the target area (“posterior”): They compare P3 and CP5 on the left with P4 and CP6 on the right. Interestingly, they do not cite HH05, even though they submitted their paper almost a year after HH05 had been published online. Instead, RHS06 justify the time window (and the search for lateralisation effects) by pointing to studies reporting an N170 in adults in response to BM and the claim that “in infant research, the P220–P290 waveform has been named the ‘infant N170’” (p. 212). Alright, sounds legit. Their justification for the target area is less consistent: Again, they cite the adult-N170 literature, which reported this effect “at a variety of posterior locations, including occipital (O1, O2), temporal (T7, T8) and parietal cortex (P7, P3, P4, P8)” (p. 212). Sadly, the reason why they then confined their analyses to P3, P4, CP5, and CP6 remains a mystery for the reader.
Somewhat unexpectedly, RHLS08 cite both themselves (RHS06) and HH05 as a reference for looking at parietal regions, but quietly drop CP5/CP6 and the lateralisation effect (P3 and P4 are now being analysed jointly and not compared with each other). What really stunned me is that they changed the analysed time window to 300-700 ms without any justification. This means their analysis window at the parietal region does not even overlap with HH05 and RHS06.
A variation of the old time window comes into play again for the newly-added frontal target area: They include a region composed of F7, F8, F3, F4, Fz, FC3, FC4, C3, C4, and Cz at 200-350 ms (again without justification), because they hypothesise “differential processing in parietal and frontal regions” (p. 162) for the contrast between corrupted and impossible PLDs.
There’s one more thing. All other papers use the 100 ms directly preceding the trial for baseline correction; only RHLS08 use 100 ms pre trial plus the first 100 ms of the trial. Their justification for this makes no sense in light of the other studies:
This ensured that differences in the ERP were due to factors associated with motion rather than a reaction to observed differences between the conditions in the initial configuration of the point lights. (p. 164)
MS09 go on a big fishing expedition and test the full trial length from 0-2000 ms in 100-ms bins separately for P3, P4, P7, P8, T7, T8, O1, and O2 (citing Jokisch et al., 2005; HH05; and RHS06). They also hypothesise a lateralisation effect, citing RHS06, but never directly compare any electrodes from the right and left hemisphere. MS09 thus run 20 separate tests for each of 8 electrodes (160 tests in total) and – spoiler alert – do not correct for multiple comparisons.
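For a sense of scale, here is a back-of-the-envelope calculation. It assumes that all 160 tests are independent and that every null hypothesis is true, which adjacent time bins and neighbouring electrodes certainly violate, so read the first number as rough orientation only. The function name is made up for this sketch.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


n_tests = 20 * 8  # 100-ms bins x electrodes, as described for MS09
alpha = 0.05

fwer = familywise_error_rate(n_tests, alpha)
expected_false_positives = n_tests * alpha  # holds with or without independence

print(f"P(at least one false positive): {fwer:.4f}")
print(f"Expected number of false positives: {expected_false_positives:.0f}")
```

In other words, with this many uncorrected tests a handful of “significant” bins is what you would expect to see even if nothing at all were going on.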
Summary: design & analyses
We have three different time windows for the BM versus nBM contrast (HH05 and RHS06: 200-300 ms, RHLS08: 300-700 ms, MS09: 0-2000 ms), and a fourth one if we include RHLS08’s search for a frontal difference between corrupted and impossible motion (200-350 ms). All studies look at “somewhat” posterior/parietal electrode sites, but in many creative combinations: a large ill-defined area on the left vs on the right (HH05), P3 and CP5 on the left vs P4 and CP6 on the right (RHS06), P3 and P4 combined (RHLS08), and an 8-electrode carnage involving P3, P4, P7, P8, T7, T8, O1, and O2 (MS09).
Table 3: Analyses and results
target time window
upright vs scrambled
26 electrodes collapsed into 2 sites (left vs right)
laterality vs stimulus type
- main effects - upright vs scrambled in RH*
F(1,6)=7.1 reported as ns F(1,12)=7.1
upright vs inverted
left posterior: P3, CP5
right posterior: P4, CP6
laterality vs stimulus type
- interaction - main effects
- simple effects
F(1,11)=6.767 “no other effects found”
parietal: 3x1 ANOVA t-test upright vs impossible
t-test upright vs corrupted
t-test imposs. vs corrupted frontal: 3x1 ANOVA t-test upright vs impossible
t-test upright vs corrupted
t-test imposs. vs corrupted
F(2,28)=3.535 t(14)=2.312 not reported
t(14)=1.803 F(2,28)=5.517 not reported
Wilcoxon signed-rank tests on mean amplitude in 100 ms bins across whole trials & each electrode
No test statistics reported. Electrodes & time frames reported as “p<.05”:
P3: 800-2000 ms
P4: 1300-2000 ms
P7: 500-2000 ms
P8: 500-2000 ms
O2: 800-1300 ms
T8: 600-2000 ms
* RH = right hemisphere
** Reported as “a statistical trend” (p. 164)
*** Reported as “p = .05” (p. 164)
Test statistics and summary statistics are summarised in Table 3 and Table 4, respectively, and the directions of effects are shown in Figure 1. I will ignore the results for the frontal region examined by RHLS08, because they added this to investigate the perception of “corrupted body schema” motion and I decided to focus on the contrast of upright vs impossible motion.
Up until the results section, I expected HH05 to look for a main effect of stimulus type. This main effect is implied to be not significant: “only the laterality x stimulus type interaction was significant” (p. 302). Luckily they thought of lateralisation just in time! (Phew!) Taking this into account, they find a significant interaction: upright motion had a more negative amplitude than scrambled motion in the right hemisphere, but this contrast was reversed and not significant in the left hemisphere.
HH05 do not correct for multiple comparisons (despite testing one interaction effect and two main effects), and the interaction effect would not have survived such a correction: F(1, 6) = 7.1, p = .037.
In contrast to HH05, RHS06 do predict an interaction of stimulus type and lateralisation, which is exactly what they find (F(1, 11) = 6.767, p = .025). Here, however, the amplitude for upright motion in the right hemisphere is significantly more positive than for inverted motion. One could argue that scrambled (HH05) and inverted (RHS06) PLDs may well elicit very different ERPs and that a reversed effect may thus not be surprising. But it’s important to note that the ERPs for upright motion look completely different in the two papers: Within the right hemisphere, mean amplitude in HH05 is roughly -9 μV, SE = 3 μV (taken from Figure 2B in the manuscript), whereas in RHS06 it is +1.95 μV, SE = 1.23 μV (p. 212-213). The difference between these values is d = 1.7!
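For the curious, here is one way to arrive at a standardised difference of that size. This is a reconstruction under assumptions, not necessarily the calculation behind the number above: it takes N = 7 and N = 12 from the degrees of freedom of the two F-tests, converts each SE to an SD via SD = SE × √N, and treats the two samples as independent groups. All variable names are made up for this sketch.

```python
from math import sqrt


def sd_from_se(se, n):
    """Recover a standard deviation from a standard error of the mean."""
    return se * sqrt(n)


# Upright motion, right hemisphere (means and SEs as quoted in the text)
hh05 = {"mean": -9.0, "se": 3.0, "n": 7}     # n assumed from F(1, 6)
rhs06 = {"mean": 1.95, "se": 1.23, "n": 12}  # n assumed from F(1, 11)

var_hh05 = sd_from_se(hh05["se"], hh05["n"]) ** 2
var_rhs06 = sd_from_se(rhs06["se"], rhs06["n"]) ** 2
diff = rhs06["mean"] - hh05["mean"]

# Variant 1: standardise by the root of the unweighted mean of both variances
d_unweighted = diff / sqrt((var_hh05 + var_rhs06) / 2)

# Variant 2: classic pooled SD, weighting each variance by its degrees of freedom
pooled_var = ((hh05["n"] - 1) * var_hh05 + (rhs06["n"] - 1) * var_rhs06) / (
    hh05["n"] + rhs06["n"] - 2
)
d_pooled = diff / sqrt(pooled_var)

print(f"d (unweighted variances) = {d_unweighted:.2f}")
print(f"d (pooled SD)            = {d_pooled:.2f}")
```

Whichever variant you prefer, the two supposedly identical conditions end up well over one and a half standard deviations apart.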
RHLS08 do not mention lateralisation. They hypothesise a simple contrast between upright and impossible motion in the parietal area. What’s funny is that they cite HH05 to predict a more positive amplitude for upright stimuli even though we just saw that HH05 found a more negative amplitude:
Based on previous research (e.g. Hirai and Hiraki, 2005), we hypothesized that the perception of biological facets of the stimuli would manifest themselves in a parietal location with an increase in positivity for the biological motion compared to the biomechanically impossible motion. (p. 162)
They find a main effect of condition and a significant simple contrast between upright and impossible stimuli (t(14)=2.312, p = .037), which would not hold up to Bonferroni correction (they performed at least two post-hoc tests: corrupted vs impossible is not significant, upright vs corrupted is not reported). (I have a quantum theory of unreported test results: They are simultaneously significant and not significant until you look at them, which immediately renders them not significant.) Interestingly, mean amplitude for upright motion is positive like in RHS06, but this time less positive than the amplitude for impossible motion despite being way larger than in RHS06: M = 6.28 μV, SE = 2.57 μV. This is noteworthy because the number represents an average across both hemispheres, not just of the right hemisphere as in RHS06. If the amplitude for upright motion had been smaller in the left hemisphere like it was in RHS06 and HH05, this should have attenuated the overall effect and an average of this magnitude would be even less likely.
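The Bonferroni remarks in this section boil down to very little arithmetic. Here is a small sketch; the number of tests per paper is my reading of what was run (three for HH05, “at least two” for RHLS08), and the helper name is made up.

```python
def survives_bonferroni(p, n_tests, alpha=0.05):
    """True if p is below the Bonferroni-adjusted threshold alpha / n_tests."""
    return p < alpha / n_tests


reported = [
    # (label, p-value as quoted in the text, number of tests assumed here)
    ("HH05 interaction", 0.037, 3),              # one interaction + two main effects
    ("RHLS08 upright vs impossible", 0.037, 2),  # lower bound: "at least two" tests
]

for label, p, m in reported:
    verdict = "survives" if survives_bonferroni(p, m) else "does not survive"
    print(f"{label}: p = {p}, threshold = {0.05 / m:.4f} -> {verdict}")
```

Since two is only a lower bound for RHLS08, the real threshold would be stricter still.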
It may be hard to believe that these contradictory results could become even messier, but MS09 add yet another pattern to the mix: For the mid-parietal electrodes P3 and P4, they find significantly more positive activation for upright motion from 800 ms onwards (well outside the analysis window of any of the other studies), but for lateral parietal electrodes P7 and P8, the amplitude is less positive/more negative from 500 ms onwards. I don’t quite know what to make of this due to their creative 160-uncorrected-tests approach and the fact that they do not report any test statistics but only state “p<.05” for any given effect. Sadly this means that their results cannot be used for a p-curve analysis.
Summary: results
Two papers find an interaction of stimulus and lateralisation with a greater difference between BM and nBM stimuli in the right hemisphere (HH05 and RHS06) – but the differences are in opposite directions. The other two papers find a significant difference between BM and nBM across hemispheres at mid-parietal electrodes P3 and P4 (RHLS08 and MS09) – but these two differences are again in opposite directions. Additionally, MS09 find an effect on lateral parietal electrodes P7 and P8, which again is in the opposite direction of their mid-parietal effect.
I don’t think I could have made up four less congruent results if I’d tried.
Table 4: Comparison of ERP amplitudes for upright motion
* Exact values were not provided in the text; the given values are estimates based on Figure 2B in the manuscript.
** MS09 do not provide amplitude means. Amplitude signs were taken from Figure 1 in the manuscript.
Table 4 summarises the incongruity of ERPs across papers for upright BM alone. The most tragic aspect of this is that we cannot even sum up all effects and conclude that taken together, there is none: The analysed time windows and scalp regions were shifted around so much between studies that these contradictory findings might still be compatible with each other!
So – do infants show an observable ERP effect when they’re viewing PLDs of biological versus non-biological motion? I ran a p-curve analysis on the results of HH05, RHS06, and RHLS08 (MS09 couldn’t be included because they don’t report test statistics or exact p-values). I stuck to the instructions of Simonsohn, Nelson, and Simmons and made a p-curve disclosure table. (The first time I did this so thoroughly and it was a great experience – I can very much recommend it. It forces you to track down the authors’ actual hypotheses and to think about which analysis would test them. It sounds trivial but it can be quite adventurous in the case of a not-so-tidy paper.) Three effects (I included the lateralisation x stimulus type interaction effect of HH05 and RHS06 and the upright vs impossible parietal contrast of RHLS08) are of course too small a sample for definitive conclusions, and the binomial tests for evidential value and lack of evidential value both come up not significant (p = .875 and p = .2557, respectively). But… Just look at it!
I have developed a new rule of thumb to decide if I believe the findings in a paper: If all p’s are ≧ .025, I’m not having any of it. Of course that can happen for true effects, but in three consecutive papers? Papers that weren’t preregistered? Papers that throw red flags of obscure if-I-fail-to-mention-it-it’s-not-a-lie phrasing in your face like confetti at a carnival parade? I don’t think so.
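For the curious, the binomial test for evidential value is simple enough to sketch by hand. It asks how surprising the number of significant p-values below .025 would be if low and high significant p-values were equally likely. The snippet below is a minimal Python sketch, not the p-curve app's own code: the function name is made up for illustration, and the count of one low p-value out of three is a back-calculation from the reported p = .875, not a number taken from the disclosure table.

```python
from math import comb


def binomial_right_skew_p(n_low: int, n_sig: int) -> float:
    """P(X >= n_low) for X ~ Binomial(n_sig, 0.5).

    n_low: number of significant p-values below .025 (hypothetical input)
    n_sig: total number of significant p-values in the p-curve
    """
    favourable = sum(comb(n_sig, k) for k in range(n_low, n_sig + 1))
    return favourable / 2 ** n_sig


# All possible outcomes for a p-curve with three significant effects.
for n_low in range(4):
    print(n_low, binomial_right_skew_p(n_low, 3))
```

With only three p-values, even three out of three below .025 would give p = .125, so this particular test cannot come up significant here whatever the p-curve looks like. That is one more reason to treat it as a description rather than a verdict.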
Now you may say: But this research is 8 to 12 years old! Times have changed and it seems like the motion perception folks have drawn the right conclusions from this carnage and stopped publishing papers on it. Right? Well. The reason I looked into this literature in the first place was that I reviewed a paper trying to build on it just this January.
I very much hope that infant ERP standards have improved since 2009, but the fact that a paper called “How to get statistically significant effects in any ERP experiment (and why you shouldn’t)” was published in Psychophysiology in December 2016 indicates that it’s probably not all good yet.
This story is an example of how cumulative science cannot work. If you want to build on the non-replicated work of someone else, we first need to know if you can replicate the original effect. If you can’t, that’s fine! If you find a somewhat different effect: That’s fine too! There might be truth in your findings. But we need to know about it. Tell us what happened at time X at electrodes Y and Z when you first looked at those, because we know you did.
Preregister your hypotheses and make exploratory analyses great again! Ok, I realise that preregistration wasn’t a thing in 2008, but from what I’ve heard, deduction and truthful reporting were. Exploratory analyses dressed up as confirmatory despite running counter to previous studies or even your own predictions are ridiculously easy to see through. Your readers aren’t that cheap. We can learn a lot from the results of data exploration, but only if we know the full context.
And, for the sake of completeness (sigh): No, N = 7 is not ok. N = 15 isn’t either, especially when we’re talking about 15 wiggly little monsters who hate EEG caps like nothing else and will decide that it’s time to go home after 35 trials. I’m not even criticising the huge exclusion rates – I have worked in an infant EEG lab and I know it’s impossible to get around that. But if you honestly don’t even have the resources for 20 (writing this causes almost physical pain when you know what’s needed are 3-digit numbers) “good” participants, team up with other labs (MASSIVE kudos to Michael Frank for starting this movement of saving developmental psychology’s neck) or just leave it be. Especially if your research isn’t a matter of life and death.
[Some wise yet witty concluding words I haven’t found yet]
A few more random oddities in case you’re interested.
At the end of their paper, HH05 briefly report having tested a younger age group: “In a preliminary study, five 6-month-old infants were also measured for their ERPs during perception of BM and SM. Contrary to the 8-month-old infants, we did not find a significant ERP difference between the responses to BM and SM. However, we cannot conclude that 6-month-old infants do not process BM in such a small subject pool” (p. 303). I like how N = 5 is too small a sample but N = 7 isn’t.
The (upright) stimuli used in RHS06 and RHLS08 sound identical, but RHLS08 do not cite their earlier paper (although the reason might have been to stay anonymous toward reviewers)
EEG recording in RHS06 and RHLS08 sounds identical too, but the former report recording 19 scalp electrodes and the latter 23, which seems strangely arbitrary. Also I would like to point out again that RHS06 do not report any filters.
“EEG was recorded continuously with Ag–AgCl electrodes from 19 scalp locations of the 10–20 system, referenced to the vertex (Cz). Data was amplified via a Twente Medical Systems 32-channel REFA amplifier. Horizontal and vertical electrooculargram were recorded bipolarly. Sampling rate was set at 250 Hz. EEG data was re-referenced offline to the linked mastoids” (p. 212)
“EEG was recorded continuously with Ag-AgCl electrodes from 23 scalp locations of the 10–20 system, referenced to the vertex (Cz). Data were amplified via a Twente Medical Systems 32-channel REFA amplifier. Horizontal and vertical electrooculargram were recorded bipolarly. Sampling rate was set at 250 Hz. EEG data were baseline corrected and re-referenced offline to the linked mastoids. Data were filtered with high and low-pass filters from 0.1 to 35 Hz” (p. 164)
RHS06 use a strange formulation to describe the time frame they analysed: “For statistical analysis a time window was chosen around the amplitude peak of the effect from 200 to 300 ms after stimulus onset” (p. 212). Does that mean they averaged the amplitude between 200 and 300 ms like HH05 did? Or did they look for a peak somewhere between 200 and 300 ms and then analyse a time bin of unknown onset and length around this peak?
RHLS08 use the same mysterious description: “For statistical analysis a time window was chosen in parietal regions (P3, P4) around the amplitude peak of the effect from 300–700 ms after stimulus onset” (p. 164). Interestingly, they use quite different language to describe the analysed time window in frontal regions: “For assessment of differences in frontal electrodes, we considered the mean amplitude in the three conditions from 200–350 ms after stimulus onset” (p. 164). Huh, so it’s not that they’re simply not able to use less convoluted language to tell us about how they computed an average. I can’t help but read these descriptions as “we looked at time window X but we won’t tell you which exact time bins within X we analysed”.
For a good laugh I recommend having a look at Table 1 for sample sizes and exclusion rates though.
It’s totally a thing.
Because let’s face it, babies are a bit dim.
I have a quantum theory of unreported test results: They are simultaneously significant and not significant until you look at them, which immediately renders them not significant.
But hey, I’m still somewhat optimistic about the future of psychology, here’s why: (Alternative explanation: I’m just an optimistic person. But I’ve noticed that heritability estimates don’t really make for entertaining blog posts.)
Sometimes, it helps to take a more historical perspective to realize that we have come a long way. Starting from an Austrian dude with a white beard who sort of built his whole theory of the development of the human mind on a single boy who was scared of horses, and who didn’t seem to be overly interested in a rigorous test of his own hypotheses to, well, at least nowadays psychologists acknowledge that stuff should be tested empirically. Notice that I don’t want to imply that Freud was the founding father of psychology. (It was, of course, Wilhelm Wundt, and I’m not only saying this because I am pretty sure that the University of Leipzig would revoke my degrees if I claimed otherwise.) However, he is of – strictly historical – importance to my own subfield, personality psychology. Comparing the way Freud worked to the way we conduct our research today makes it obvious that things changed for the better. Sure, personality psychology might be more boring and flairless nowadays, but really all I care about is that it is accurate.
You don’t even have to go back in time that far: Sometimes, I have to read journal articles from the 80s. (Maybe the main reason why I care about good science is that sloppy studies make literature search even more tedious than it would be anyway.) Sure, not all journal articles nowadays are the epitome of honest and correct usage of statistics but really you don’t stumble across “significant at the p < .20 level” frequently these days. And if you’re lucky, you will even get a confidence interval or an effect size estimate!
And you don’t even have to look at psychology. A Short History of Nearly Everything used to be my favorite book when I was in high school and later as a grad student, reading about the blunder years of other disciplines that grew up fine nonetheless (to varying degrees, obviously. But hey, did you know that plate tectonics became accepted among geologists as late as the 1960s?) gave me great hope that psychology is not lost.
Psychologists are starting to try to replicate their own as well as other researchers’ work – and often fail, which is great for science because this is how we learn things. (For example, that some effects only work under very strict boundary conditions, such as “effect occurs only in this one lab, and probably only at that one point in time.”)
We now have Registered Reports in which peer review happens before the results are known, which is such a simple yet brilliant idea to avoid that undesirable results simply disappear in the file drawer.
To date, 367 people have signed the Peer Reviewers’ Openness Initiative and will now request that data, stimuli and materials are made public whenever possible (it can get complicated though), and 114 people have signed the Commitment to Research Transparency that calls for reproducible scripts and open data for all analyses but also states that the grading of a PhD thesis has to be independent of statistical significance (really this seems to be a no-brainer, but then again, some people seem to mistake the ability to find p < .05 for scientific skills) or successful publication.
The psychology department of the Ludwig-Maximilians-Universität Munich explicitly embraced replicability and transparency in their job ad for a social psychology professor. That’s by no means the norm yet, and I’m not sure whether this particular case worked out, but one can always dream.
The publication landscape is changing, too.
People are starting to upload preprints of their articles, which is a long overdue step in the right direction. Collabra is a new journal with a community-centered model to make Open Access affordable to everyone.
Old journals are changing, too: Psychological Science now requires a data availability statement with each submission. The Journal of Research in Personality requires a research disclosure statement and invites replications. (There are more examples but these two come to my mind because of their awesome editors. There are also journals that take a, uhm, more incremental approach to open and replicable science. For example, I think it’s great that the recent editorial of the Journal of Personality and Social Psychology: Attitudes and Social Cognition concludes that underpowered studies are a problem, but somehow I feel like the journal (or the subfield?) is lagging a few years behind in the whole discussion about replicable science.)
Additionally, media attention has been drawn to failed replications, sloppy research, or overhyped claims such as power pose, the whole infamous pizzagate story (not the story about the pedophile ring, the story about the psychological study that took place at an all-you-can-eat pizza buffet), and the weak evidence behind brain training games. Now you might disagree about that, but I take it as a positive sign that parts of media are falling out of love with catchy one-shot studies because I feel like that whole love affair has probably been damaging psychology by rewarding all the wrong behaviors. (Anne is skeptical about this point because she doubts that this is indicative of actual rethinking as compared to a new kind of sexiness: debunking of previously sexy findings. Julia is probably unable to give an unbiased opinion on this as she happens to be the first author of a very sexy debunking paper. Now please excuse me while I will give yet another interview about the non-existent effects of birth order on personality.)
And last but not least, we are using the internet now. A lot of the bad habits of psychologists – incomplete method sections, unreported failed experiments, data secrecy – are legacy bugs of the pre-internet era. A lot of the pressing problems of psychology are now discussed more openly thanks to social media. Imagine a grad student trapped in a lab stubbornly trying to find evidence for that one effect, filing away one failed experiment after the other. What would that person have done 20 years ago? How would they ever have learned that this is not a weakness of their own lab, but an endemic problem to a system that only allows for the publication of polished too-good-to-be-true results? (In case you know, tell me! I’d really like to know what it was like to get a PhD in psychology back then.) Nowadays, I’d hope that they would get an anonymous blog and bitch about these issues in public. Science benefits from less secrecy and less veneer.
Sometimes I get all depressed when I hear senior or mid-career people stating that the current scientific paradigm in psychology is splendid; that each failed replication only tells us that there is another hidden moderator we can discover in an exciting new experiment performed on 70 psychology undergrads; that early-career researchers who are dissatisfied are just lazy or envious or lack flair; and that people who care about statistics are destructive iconoclasts. (What an amazing name for a rock band.)
This is where we have to go back to the historical perspective.
While I would love to see a complete overhaul of psychology within just one generation of scientists, maybe it will take a bit longer. Social change supposedly happens when cohorts are getting replaced (cf. Ryder, N. B. (1965). The cohort as a concept in the study of social change. American Sociological Review, 30, 843-861). Most people who are now professors were scientifically socialized under very different norms, and I can see how it is hard to accept that things have changed, especially if your whole identity is built around successes that are now under attack. (I have all the more respect for the seniors who are not like that but instead update their opinions, cf. Kahneman’s comment here, or even actively push others to update their opinions, for example, the old farts I met at SIPS.) But really what matters in the long run – and I guess we all agree that science will go on after we are all dead – is that the upcoming generation of researchers is informed about past mistakes and learns how to do proper science. Which is why you should probably go out and get active: teach your grad students about the awesome new developments of the last years; talk to your undergraduates about the replication crisis.
Bewildered students who are unable to grasp why psychologists haven’t been pre-registering their hypotheses and sharing their data all along are what keep me optimistic about the future.
In my scientific work I strive to be as open as possible. Unfortunately I work with data that I cannot de-identify well enough to share (aka weird sex diaries) and data that simply isn’t mine to share (aka the reproductive histories of all Swedish people since 1950). To compensate for this embarrassing lack of openness I’ve tried to devise other ways of being transparent about my work. After a few failures (nobody ever tuned into my Rstudio-speedrun-Twitch-channel), this is one that’s worked quite well for me in the past. I now think it might be interesting even for people who don’t have to compensate for anything.
I did this, because there’s a bunch of problems you can run into if you just share data and code:
Incomplete directions or code. Maybe you have some private knowledge like “First you need to set the working directory to C:\DavemasterDave\Nature_paper\PNAS_paper\JPSP_paper\PAID_paper\data, load the statistical packages in the order passed down to us from our elder advisors, rename this variable, and pray to Meehl.”
Inconsistent package versions. You used dplyr v0.5.0 instead of v0.4.9 to reproduce the analyses. Turns out they changed distinct() to mean “throw away all my data and don’t warn me about it”.
Hassle. First you download the code from the supplement file. Then get the data from Dryad. Then put them in the same folder and install the following packages (oh by the way you’ll need to compile this one from source for which you’ll need R, Rstudio, Rtools, and OS X Xcode command line tools). Oh and to get the codebook you just have to scale Mount Doom. Just don’t awake my advisor and you’re almost there.
Poor documentation of the code. What does the function transformData_???_profit do? What do the variables prc_stirn_b, fertile_fab mean? Is RJS_6R the one that’s reverse-coded or did you name it this way after reversing RJS_6? And the all-time classic: Are women sex == 2 or sex == 1 (or twice as sex as men)?
It starts with the cleaned data. When I work on a project, I spend 80% of my time on data preparation, cleaning, checking. Yet when I download reproducible files, they often start with a file named cleaned_final_2_Sunday23July_BCM.csv. Something went missing there. And there might be mistakes in that missing bit. In fact it’s approximately 99% more likely that I made a mistake cleaning my data than that Paul Bürkner made a mistake writing brms. As psychological science gets more complex, we should share our pipelines for wrangling data.
Last, but most aggravatingly: Loss. Turns out personal university pages aren’t particularly reliable places to store your data and code and neither is that backup you made on that USB stick which you lent to your co-worker who.. fuck.
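The documentation problem in particular is cheap to fix at the source. As a minimal sketch (in Python for brevity, although the projects described in this post are R-based), a cleaning script can carry its own codebook and keep raw and reverse-coded items under separate, explained names. Everything specific here is an assumption for illustration: the coding 1 = male and 2 = female, the 1 to 7 response scale, and the helper `reverse_code` are all made up.

```python
# Hypothetical codebook: these codings are assumptions, not taken from a real dataset.
SEX_LABELS = {1: "male", 2: "female"}
SCALE_MIN, SCALE_MAX = 1, 7  # assumed response scale for the item RJS_6


def reverse_code(score: int) -> int:
    """Flip a score on the assumed 1-7 scale, so 1 becomes 7 and 7 becomes 1."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError(f"score {score} is outside {SCALE_MIN}-{SCALE_MAX}")
    return SCALE_MIN + SCALE_MAX - score


# Two made-up raw rows standing in for the uncleaned data file.
raw_rows = [{"sex": 2, "RJS_6": 2}, {"sex": 1, "RJS_6": 7}]

cleaned_rows = [
    {
        "sex": SEX_LABELS[row["sex"]],         # readable label instead of a bare 1 or 2
        "RJS_6": row["RJS_6"],                 # raw item is kept untouched
        "RJS_6R": reverse_code(row["RJS_6"]),  # the R suffix marks the reversed copy
    }
    for row in raw_rows
]

for row in cleaned_rows:
    print(row)
```

Keeping both `RJS_6` and `RJS_6R` in the cleaned data means a reader never has to guess whether the reversal has already happened, and an out-of-range value fails loudly instead of being flipped silently.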
Reproducible websites solve these problems. Side effects may include agonising uncertainty about whether you should release them and whether mistakes will be found (fun fact: nobody reads the supplement. I’ve been including a video of Rick Astley in every single one of my supplements and so far nobody noticed. Another fun fact: nobody reads blog posts longer than 1000 characters, so admitting this here poses no risk at all).
This is the stack that I use to make my reproducible websites.
R & RStudio. RStudio is an integrated development environment. This is not strictly necessary, but as RStudio makes or maintains a lot of the packages necessary to achieve reproducibility, using their editor is a smart choice.
Packrat. Packrat solves messes due to different R package versions. Unfortunately, it’s a bit immature, so I’d currently recommend activating it only in the final stage of your project. If you’re often working on many projects simultaneously and you know the grief it causes when a package update in one project breaks your code in another, it might be worth the hassle to keep it activated.
Rmarkdown (knitr). Markdown is an easy to learn markup language. Rmarkdown lets you document your R code, put the graphs where they belong etc. It also now generates websites and bibliographies with little additional work.
Git & Github. Git does distributed version tracking. Many scientists work alone, so Git may seem like overkill, but a) when you go open, you will never be alone again (cue sappy music) and b) the features Github offers (notably: hosting for your website through Github pages) make up for Git’s somewhat steep learning curve. RStudio provides a rustic visual interface for Git; I personally prefer SourceTree.
Zenodo. They will permanently archive your website (and anything else you care to share) for free. If you make your stuff public, you get a citable DOI, which will lead to your document, even if Github and Zenodo should one day cease to exist. Zenodo can be integrated with Github, so that releases on Github are automatically uploaded to Zenodo.
To make this stack work well together, there’s a few hurdles to clear. And let me be completely frank with you: You still need R, RStudio, and a working knowledge of Mount Doom’s geography. But your readers will only need a web browser to make sense of your work (okay, printing works too, but one of my co-authors once tried to print out all my Mplus outputs, deforesting a substantial part of Honduras in the process).
To make it easier, I’ve uploaded a starter RStudio project that you can fork on Github to start out with a configuration that worked well for me in the past. I’ve tried to paper over some of the rough edges that this stack still has and I added some initial structure and code, so you can adapt it.
With these projects I document my entire research process using Rmarkdown (e.g. loading & processing the raw data, wrangling it into the right shape, analysing it, making graphs).
But instead of sharing the raw scripts (which only make sense when run interactively with the data), I create a website where readers see the code together with the resulting graphs and other output.
I use the reports generated like this to make sense of my own results and to share extensive results with my co-authors. Some friends even write their complete manuscripts using Scholarly Markdown and Papaja, but for me this is more about not losing all the interesting details that can’t make it into the manuscript.
Here’s two recent projects where I’ve used this stack (or an earlier version):
https://rubenarslan.github.io/paternal_age_fitness/ – the online supplement for this manuscript. This one is the largest I’ve made so far (~90 pages, >7000 graphs) and involved a complex setup including models run on a cluster, but documented offline. In this project I was in the situation that I wanted to repeat the same commands for different samples a lot, so that I had to learn to compartmentalise my Markdown into components (more on that on another day).
https://rubenarslan.github.io/generation_scotland_pedigree_gcta/ – the online supplement for this manuscript. This one is much simpler, just a single page with a lot of tabs. Here we presented a few models from our model selection procedure in the manuscript, but we wanted to show the results other choices would have had and how different components influenced one another.
The stack should also work with some modifications for Shiny apps.
(this post was jointly written by Malte & Anne; in a perfect metaphor for academia, WordPress doesn’t know how to handle multiple authors)
We believe in scientific openness and transparency, and consider unrestricted access to data underlying publications indispensable. Therefore, we (not just the authors of this post, but all four of us) signed the Peer Reviewers’ Openness (PRO) Initiative, a commitment to only offer comprehensive review for or recommend the publication of a manuscript if the authors make their data and materials publicly available, unless they provide compelling reasons (the data-hungry dog of a former grad student whose name you forgot is not a compelling reason) why they cannot do so (e.g. ethical or legal restrictions).
As reviewers, we enthusiastically support PRO and its values.
Also as reviewers, we think PRO can be a pain in the arse.
Ok, not really. But advocating good scientific practice (like data sharing) during peer review can result in a dilemma.
This is how it’s supposed to work: 1) You accept an invitation for review. 2) Before submitting your review, you ask the editor to relay a request to the authors to share their data and materials (unless they have already done so). 3) If authors agree – fantastic. If authors decline and refuse to provide a sensible rationale why their data cannot be shared – you reject the paper. Simple.
So far, so PRO. But here’s where it gets hairy: What happens when the action editor handling the paper refuses to relay the request for data, or even demands that such a request is removed from the written review?
Here is a reply Anne recently got from an editor after a repeated PRO request (the editor apologised for overlooking the first email – most likely an honest mistake. Talk to Chris Chambers if you want to hear a few stories about the funny tendency of uncomfortable emails to get lost in the post):
“We do not have these requirements in our instructions to authors, so we can not ask for this without discussing with our other editors and associate editors. Also, these would need to involve the publication team. For now, we can relieve you of the reviewing duties, since you seem to feel strongly about your position.
Let me know if this is how we should proceed so we do not delay the review process further for the authors.”
Much like judicial originalists insist on interpreting the US constitution literally as it was written by a bunch of old white dudes more than two centuries ago, editors will sometimes cite existing editorial guidelines by which authors obligate themselves to share data on request, but only after a paper has been published, which has got to be the “Don’t worry, I use protection” argument of academia. (We picked a heterosexual male perspective here but we’re open to suggestions for other lewd examples.) Also, we know that this system simply does. not. work.
As reviewers, it is our duty to evaluate submitted research reports, and data are not just an optional part of empirical research – they are the empirical research (the German Psychological Society agrees!). You wouldn’t accept a research report based on the promise that “theory and hypotheses are available on request”, right? (Except when reviewing for Cyberpsychology.)
PRO sets “data or it didn’t happen” as a new minimum standard for scientific publications. As a consequence, comprehensive review should only be offered for papers that meet this minimum standard. The technically correct (the best kind of being correct) application of the PRO philosophy for the particular case of the Unimpressed Editor is straightforward: When they decide – on behalf of the authors! – that data should or will not be shared, the principled consequence is to withdraw the offer to review the submission. As they say, PRO before the status quo.
Withdrawing from the process, however, decreases the chance that the data will be made publicly accessible, and thus runs counter to PRO’s ideals. As we say in German, “Operation gelungen, Patient tot” – surgery successful, patient deceased.
Adhering strictly to PRO would work great if everybody participated: The pressure on non-compliant journals would become too heavy. Then again, if everybody already participated, PRO wouldn’t be a thing. In the world of February 2017, editors can just appoint the next best reviewer (withdrawing from review might still have an impact in the absence of a major boycott by causing the editors additional hassle and delaying the review process – then again, this latter part would unfairly harm the authors, too) who might simply not care about open data – and couldn’t you push for a better outcome if you kept your foot in the door? Then again, if all PRO signatories eroded the initiative’s values that way, the day of reaching the critical mass for a significant boycott would never come.
A major concern here is that the authors are never given the chance to consider the request although they might be receptive to the arguments presented. If increased rates of data sharing are the ultimate goal, what is more effective: boycotting journals that actively suppress such demands by invited reviewers, or loosening up the demands and merely suggesting that data should be shared so at least the gist of it gets through?
There are two very different ways to respond to such editorial decisions, and we feel torn because each seems to betray the values of open, valuable, proper scientific research. You ask: What is the best strategy in the long run? Door in the face! Foot in the door! Help, I’m trapped in a revolving door! We would really like to hear your thoughts on this!
RE: Door in the face
Thank you very much for the quick response.
Of course I would have preferred a different outcome, but I respect your decision not to request something from the authors that wasn’t part of the editorial guidelines they implicitly agreed to when they submitted their manuscript.
What I do not agree with are the journal’s editorial guidelines themselves for the reasons I provided in my previous email. It seems counterproductive to invite peers as “gatekeepers” while withholding relevant information that is necessary for them to fulfill their duty until the gatekeeping process has been completed.
Your decision not even to relay my request for data sharing to the authors (although they might gladly do so!), unfortunately, bars me from providing a comprehensive review of the submission. It is literally impossible for me to conclude a recommendation about the research as a whole when I’m only able to consider parts of it.
Therefore, I ask that you unassign me as a reviewer, and not invite me again for review except for individual manuscripts that meet these standards, or until the editorial policy has changed.
RE: Foot in the door
Thank you very much for the quick response.
Of course I would have preferred a different outcome, but I respect your decision not to request something from the authors that wasn’t part of the editorial guidelines they implicitly agreed to when they submitted their manuscript.
In fact, those same principles should apply to me as a reviewer, as I, too, agreed to review the submission under those rules. Therefore, in spite of the differences in my own personal standards versus those presented in your editorial guidelines, I have decided to complete my review of the manuscript as originally agreed upon.
You will see that I have included a brief paragraph on the benefits of data sharing in my review. I neither demand the authors share their data nor will I hold it against them if they refuse to do so at this point. I simply hope they are persuaded by the scientific arguments presented in my review and elsewhere. In fact, I hope that you are too.
I appreciate this open and friendly exchange, and I hope that you will consider changing the editorial guidelines to increase the openness, robustness, and quality of the research published in your journal.
Not just the authors of this post, but all four of us.
The data-hungry dog of a former grad student whose name you forgot is not a compelling reason.
The editor apologised for overlooking the first email – most likely an honest mistake. Talk to Chris Chambers if you want to hear a few stories about the funny tendency of uncomfortable emails to get lost in the post.
We picked a heterosexual male perspective here but we’re open to suggestions for other lewd examples.
Withdrawing from review might still have an impact in the absence of a major boycott by causing the editors additional hassle and delaying the review process – then again, this latter part would unfairly harm the authors, too.
Academia is a strange place. There are a lot of implicit norms and unspoken rules which, to make it worse, can vary by field, subfield, across countries, and over time. For example: How do you write an email in an academic setting? Should your mails be polite or is it already impolite to waste the reader’s time with polite fluff? How do you address a professor who you (1) have never met in person, (2) met in person once but they likely don’t remember, (3) had a beer with at a conference but they likely don’t remember? Do you shake hands? How do you start a collaboration and are you sure you want to wear this pair of jeans/cat shirt/three-piece suit to the next conference?
It takes time to figure these things out and to finally feel comfortable in academic interactions – even more so for students from working-class families who can’t draw on experiences from their parents, or for students from parts of the world with considerably different academic norms.
So how can we help people to feel welcome in our strange insider’s club?
I have three suggestions, and as it happens, none of them is about tone. Actually, I believe that changing the tone is quite an ineffective way to make people feel welcome and valued because it is just cosmetics: Communication can be extremely hostile while maintaining a picture-perfect, all-friendly, nonviolent surface. I suspect that people who champion tone monitoring hope that talking nicely for long enough will transform attitudes. However, I faintly remember learning in the first year of my undergraduate that Sapir-Whorf is not well substantiated.
Furthermore, setting well-intended rules about the tone of interactions might just add another layer of conventions that poses yet another obstacle for outsiders. What I suggest is that we do not tackle tone, but instead try to change the underlying climate.
Start admitting that you are sometimes wrong
Many students start with the assumption that people with the fancy “Dr.” attached to their name or (gasp) professors have privileged access to the secrets of the world and are thus close to infallibility. Anne pointed me towards Perry’s Scheme, a model of how college students come to understand knowledge, that succinctly summarizes this first level of understanding: The authorities know.
However, social interactions get pretty one-sided if one side assumes that the other side is never wrong, and it unnecessarily reinforces power differentials (that exist anyway, and that are probably not always conducive to scientific progress, but we will keep this for another blog post). It also greatly obscures how science – as opposed to esotericism – is supposed to work. Anecdotal data ahead: I have never felt particularly unwelcome in academia, and I blame this on the fact that both my parents have a PhD. Now before we all get excited about social transmission of educational attainment, I will quickly add that I was not raised by the doctors but by my down-to-earth mother-of-eight Catholic grandma. However, I still got the strong impression that academic rank does not predict how often a person is right about things that are outside of their specific narrow subfield. Of course there is a German word for this idea: Fachidiot, a narrowly specialized person who is an idiot when it comes to anything else. In fact, I might have had a phase in which I firmly believed that a PhD indicates that a person is always wrong.
Coincidentally, this also relates to the one piece of career advice I got from my dad: It’s important to hang out at conferences because there you can actually see with your own eyes that everybody cooks with water, which is the German way to say that everybody puts their pants on one leg at a time. There is a quick fix to the misconception that academics are always right: Just communicate that you are fallible and be honest about the things that you are uncertain about. If you need a role model for this type of behavior, I recommend Stefan Schmukle, who has been my academic advisor since the second year of my undergraduate and is probably the main reason why I did not leave psychology for a more lucrative and less frustrating career path. Stefan openly admits his knowledge gaps when he teaches and stresses that he keeps learning a lot. Funnily enough, it does not undermine his authority in front of his students according to the data available to me, which includes both quantitative (student evaluations and teaching awards) and qualitative (intensive student interviews over a beer or two) evidence.
Positive side effects of admitting that you are sometimes wrong might include (1) students feeling more respected because of your honesty, (2) students learning that psychology is not an arcane art accessible only to privileged old white men, and (3) sending a strong signal that you are, in fact and despite all your glorious achievements, a human being. Which already leads to my second suggestion.
Show others it’s okay to have a life. Have a life.
This is important not only because you probably enjoy having a life, but also because it avoids any sort of mystification of what it means to be an academic. If we establish the norm that being an academic implies working from early morning until late in the night, seven days a week and especially between Christmas and New Year’s Eve, a lot of people might actually decide that they don’t want to feel welcome in academia. If your subfield actually requires this type of commitment, then please be frank about it so that junior researchers can decide early on whether they want to sacrifice literally everything else that makes life fun. However, if your job does not require you to sacrifice your life completely, it’s great to signal to others that you are, in fact, a human being with a family, hobbies, and other stuff that you do in your free time, like binge-watching Gilmore Girls or blindfolded speed runs of your favorite childhood video games.
I don’t have any data to back up this claim, but I’m pretty sure most humans enjoy the company of other humans above the company of restless and efficient publication machines. Overworking is not a sustainable lifestyle for most people, and it does not create a particularly welcoming climate. It also leads to a race to the bottom which makes life worse for everyone, so maybe work less (and unionize). As a senior scientist, don’t make overworking the norm.
As it happens, this point also maps onto the one piece of solid career-related advice my mother passed on to me. Her professor told her that she was spending too much time in the library instead of getting to know her peers in the evenings. In my personal interpretation, I’m not reading that as advice to go “networking”, but to do things that are actually fun because we know what all work and no play did to Jack.
Don’t act as if willpower/grit/self-control/discipline/ambition/perseverance will lead to success
In the current predominant culture, especially in academia – and a bit more in the US than in my control group, Germany – success is often equated with the result of some sort of internal strength. If only you tried a bit harder, if only you got a bit more organized, if only you started getting up earlier, if only you gave a bit more, if only you networked more efficiently, your efforts would finally pay off. It’s all fine and dandy to try your best and to try to actively regulate your behavior, but I fear we have brought this to a point at which the attitude is getting toxic. First, it opens the door to self-exploitation. Second, it makes people more willing to comply with exploitative structures, which is great for the maintenance of the status quo, but not so much for early career researchers who end up working endless hours. Third, if internal strength inevitably leads to success, having no success implies that you lack some sort of internal strength, or worse, that you are a failure.
But, most importantly, it’s just not true that trying as hard as possible will lead to success, and that success will lead to some sort of bliss that compensates for all the hard work. Success depends on multiple factors, and even if we assume that effort contributes quite a bit, there are still plenty of factors outside of our control: innate abilities, external factors such as being surrounded by people that support you (vs. having an advisor who is still fully absorbed in the rat race and exploits you for their own purposes) and a lot of randomness. Anyone who has ever submitted a paper to a journal will probably agree that there is a lot of randomness in the current academic system in psychology – if you’ve never encountered some level of arbitrary decision making, you’ve been pretty lucky (q.e.d.). Then, the story goes, you should of course accept the things outside of your control but work hard with those that you can control – such as your ambitions and your perseverance. But can we even control these things? Frankly, I don’t know. But Ruben pointed out that we know few interventions that improve conscientiousness reliably, and that grit (which is basically conscientiousness) is partially heritable. Based on my experience, trying hard is much harder for some people than for others. I can indeed be as disciplined as I want, but I cannot will what I want.
Last but not least, I don’t think that bliss necessarily awaits those who work hard and end up being successful. We have yet to hear of the lucky person who got tenured and immediately reached a state of inner peace as a result. In fact, when I look into the office next door, I get the impression that the daily grind is not that different with a nice title in front of your name. (It certainly is more comfortable with respect to financial security, but not everybody can end up being a professor, so maybe we need structural change instead of individual struggle to tackle the precarious employment situation in academia.) However, this outlook does not seem too dull to me: In our lab, we are being nice to each other and we agree that our job is (to some extent) about doing science – not that much about gaming the system to get somewhere where you can finally, if you are lucky, do science.
TL;DR: It’s fine if you are sometimes wrong, don’t sweat it. Don’t make overworking the norm. Don’t give students the impression they just have to try hard enough to make it because deep down, we all know that this is not how it works.
This memory also spoiled Arrival for me. Still a great movie though.
After publishing this post, it has been pointed out three times that I shouldn’t bash linguistic relativity. To add more nuance to my argument, let me add that, as far as I know, there is substantial evidence for a weaker form of the Sapir-Whorf account (which also seems to be misnamed), which I consider plausible. Of course, I’m only bashing the form of linguistic relativity displayed in Arrival.
After publishing this post, my boyfriend read it and I additionally have to add the disclaimer that yes, Arrival was a great movie and, yes, maybe, assuming that those aliens are so different from humans and way more advanced etc., probably in a parallel universe that does not adhere to our physics, maybe it could work like this.
I’m sorry, Mum. It wasn’t you, it was puberty.
You know what does undermine your authority in front of the students? Pretending to know something that you don’t know while not even being aware that the smarter students can easily tell you are just pretending. There is a German word for the student’s feeling in such a situation, it’s called fremdschämen.
The author of this article only indulges in one of these two activities but knows for a fact that at least one tenured individual in her proximity indulges in the other one.
I’m pretty sure there is a reason why Schopenhauer is not particularly popular with motivational coaches.
A research parasite, a destructo-critic, a second-stringer, and a methodological terrorist walk into a bar. Their collective skepticism creates a singularity, so they morph into a flairless superbug and start a blog just to make things worse for everyone.
This is, roughly, our origin story. Who are we? We are The 100% CI, bound by a shared passion for horrible puns and improving our inferences through scientific openness and meta science.
You know what I need in my life right now? Another blog on meta science!, said no one ever. Ok, sure, that’s fair, BUT:
We are 4 Germans, which approximates 1 Gelman according to our calculations. (Analysis scripts are available upon request. If you request them, we will not respond to your emails for several months. Also a grad student ate lost them.)
We will be blogging about other stuff. This week alone we will have posts on