On the origin of psychological research practices, with special regard to self-reported nostril width

The longer I have been in psychological research, the more I wonder why we do things the way we do. Why do we sometimes get all fancy about some aspects of statistical modeling (e.g., imputation of missing values on a high-performance cluster that takes days) and sometimes stay rather unsophisticated (e.g., treating a three-point response scale as a continuous outcome), often within a single manuscript? Why has virtually every possible methods criticism already been voiced, with almost all of it going unheeded (e.g., mediation analysis still going strong despite clear problems), except for the rare blockbuster (e.g., awareness of p-hacking leading to preregistration)?

Of course there are many possible levels of explanation; science is a complex endeavor by a community of actors embedded in a wider society; some fields of psych seem to be doing much better than others; and I should obviously dive into the deep literature on philosophy and history of science instead of indulging in blogging; yadda yadda. But, at least for now, fuck nuance: In this post, I will present my personal Theory of Everything (TOE, not to be confused with TROETE).

An Exciting New Theory of Everything

Scientists want to answer research questions–I think it’s fair to start from that premise rather than from a more cynical one (“scientists want to further their career”).[1] To do so, they need to choose the right means, and this will be influenced by their knowledge, material constraints, and whatever the rest of the field does. The last point is crucial, and it does not contradict the notion that researchers genuinely want to answer questions, since learning from each other is a good idea under many circumstances: developing new practices is hard, and so researchers will be more likely to use practices they were taught in grad school or picked up from their peers.

So, when does a certain practice–e.g., a study design, a way to collect data, a particular statistical approach–“succeed” and start to dominate journals?

It must be capable of surviving a multi-stage selection procedure: 

  1. Implementation must be sufficiently affordable so that researchers can actually give it a shot
  2. Once the authors have added it to a manuscript, it must be retained until submission
  3. The resulting manuscript must enter the peer-review process and survive it (without the implementation of the practice getting dropped on the way)
  4. The resulting publication needs to attract enough attention post-publication so that readers will feel inspired to implement it themselves, fueling the eternally turning Samsara wheel of publication-oriented science

Figure 1. The multi-stage selection model of psychological practices.

Step 1 selects for practices of data collection that are comparatively cheap, such as self-reports or proxies of the thing you’re actually interested in, or sometimes self-reports of proxies of the things you’re actually interested in (self-reported nostril width on a scale from 1: Disney’s Pocahontas to 10: saiga antelope). Of course, affordability is not all that matters–sometimes it may even be a disadvantage in the grand scheme of things–and cutting-edge brain imaging research shows that expensive modes of data collection can still become popular, in particular if they result in advantages later in the process. What is also really helpful for mastering Step 1 is an SPSS implementation.

Optimizing Interest

Steps 2, 3 and 4 largely select for the same features, because (1) these steps involve human judgment about whether an implementation of a certain practice is interesting and (2) researchers often optimize for publishability and try to anticipate what reviewers and editors will want to see.[2] Or at least I’m trying to do that, to some degree (it’s not the only relevant criterion!), and from collaborating with others, I got the impression that it’s a general phenomenon (more so among more senior researchers). Does that contradict the notion that researchers genuinely want to answer research questions? I don’t think so, in particular given that finding an answer is rather pointless if you never get to show it to anybody.

So, what determines whether an implementation of a practice is interesting? 

One low-hanging fruit is nice plots, because most people like to look at nice pictures [add citation to neuroscience study proving this]. They don’t even need to be colorful. For example, the standard plots that accompany mediation and moderation analyses (and any conceivable combination of the two) aren’t exactly impressionist masterpieces, but they sure suggest that some profound understanding has been gained. And who doesn’t like a good profound understanding? Word clouds are also pretty cool because they are easy to understand yet give readers the impression they may discover something exciting.[3] Likewise, I do think that Specification Curve Analysis has an edge over its competitors because the plot that comes with it is pretty nice (and it can return a p value, yay). That said, Steegen et al. (2016) didn’t come up with a particularly fancy plot for their “Multiverse Analysis”, but they surely chose the catchier name, so it was only a matter of time until somebody would call a Specification Curve plot a “multiverse plot”.

Another obvious plus is the capability of producing false positive results. This is comparable to the advantage that fake news enjoy: If the constraints of reality are removed, it’s easier to produce interesting stories. For example, moderation analysis allows researchers to test a large number of product terms, which will most likely lead to some false-positive finding at some point. Mediation analysis is really good at producing indirect effects, in particular if certain assumptions are ignored (see Indirect Effect Ex Machina for instructions). Both get bonus points for demonstrating a more nuanced understanding of reality — surely the human mind is complex, so there must be many “boundary conditions” and “processes.” 
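
To put a rough number on that intuition, here is a quick simulation sketch of my own (purely illustrative, not taken from any of the papers mentioned): a bunch of pure-noise predictors, a pure-noise outcome, and a separate moderation model for every possible product term.

```python
# Quick illustrative simulation (my own sketch): how often does at least one
# product term come out "significant" when nothing is going on at all?
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k, n_sims, alpha = 200, 8, 500, 0.05

studies_with_a_hit = 0
for _ in range(n_sims):
    X = rng.normal(size=(n, k))   # k predictors, unrelated to anything
    y = rng.normal(size=n)        # outcome, also pure noise
    found_one = False
    # fit a separate moderation model for every pair of predictors
    for i in range(k):
        for j in range(i + 1, k):
            design = sm.add_constant(
                np.column_stack([X[:, i], X[:, j], X[:, i] * X[:, j]])
            )
            if sm.OLS(y, design).fit().pvalues[-1] < alpha:
                found_one = True
    studies_with_a_hit += found_one

print(f"Share of null 'studies' with at least one significant interaction: "
      f"{studies_with_a_hit / n_sims:.0%}")
# With 28 product terms tested per study, this lands far above the nominal 5%.
```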

Can you feel SE-Magic?

But in some situations, a significant result would actually not be considered “positive.” For example, when you’re testing a cool idea with the help of structural equation modeling (SEM), the default null hypothesis is that your awesome model could have actually generated the data–and surely you wouldn’t want to reject your own model! One way to quantify model misfit is the Root Mean Square Error of Approximation (RMSEA), and I’ve been informed[4] that there is an extraordinarily popular paper that suggests an RMSEA cut-off of .08 (instead of the usual more stringent .05). Surely, the practice of setting a less demanding cut-off is more capable of producing interesting results as it becomes easier to not reject one’s preferred model.
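
For reference, the usual point estimate of the RMSEA scales the excess χ² by the degrees of freedom and the sample size (I am using the common N − 1 parameterization here; software implementations differ in the details):

```latex
\widehat{\mathrm{RMSEA}} = \sqrt{\max\!\left(\frac{\chi^2 - df}{df\,(N - 1)},\ 0\right)}
```

Because the cut-off enters quadratically, moving it from .05 to .08 lets a model carry roughly 2.5 times as much excess χ² (at fixed df and N) and still be waved through.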

Actually, one could argue that SEM fit-indices were created for the very purpose of allowing researchers to retain interesting models–Karl Jöreskog, the developer of the SEM software package LISREL, supposedly self-reported that GFI was invented to please unhappy LISREL users who complained about their large chi-squares (the dominant metric for evaluating models before fit indices entered the scene) always leading to the rejection of their models (listen to Quantitude, Episode 14: Model Fit & The Curse of the Black Pearl). Now, that doesn’t mean that SEM fit-indices are bad or shouldn’t be used![5] It’s just telling that in other contexts, psychologists don’t seem quite so eager to come up with alternative metrics just so that they don’t have to reject the null hypothesis when p < .05.

SEM really is a treasure trove of practices to achieve interesting results. Consider the popular practice of latent variable modeling to assess the bivariate association between two constructs: it’s pretty neat because the correlation will always increase as measurement error is taken into account.[6] Now one could naively think that this would lead researchers to apply latent variable modeling in other scenarios as well, e.g., when controlling for confounding constructs. But this actually decreases the chances of concluding that a certain construct of interest has “incremental validity”, which is really much more exciting than concluding that whatever you’ve come up with is actually redundant. So, no big surprise that the practice hasn’t really spread despite this excellent article by Jacob Westfall and Tal Yarkoni.
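
The textbook intuition behind that boost is Spearman’s classical correction for attenuation, which is roughly what a well-behaved latent variable model buys you for a simple bivariate association:

```latex
r_{\mathrm{latent}} \approx \frac{r_{\mathrm{observed}}}{\sqrt{\mathrm{rel}_X \cdot \mathrm{rel}_Y}}
```

Since reliabilities are at most 1, the denominator is at most 1, so the disattenuated correlation can only grow in magnitude: two measures correlating at .30 with reliabilities of .70 each come out at about .43.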

Figure 2. A quote always enhances an article.

Of course, there are also smaller fish out there, like those hacky data transformations that are taken for granted in certain subfields but would really astonish any reader from outside the field (luckily, such readers are rare). There are all these papers out there essentially saying “common practice X is fucked”, but hey, maybe the fuckedness is a feature, not a bug?

Understanding Goldilocks understanding

On a less obvious note, Steps 2 to 4 also favor methods that researchers in the particular field understand a little bit, but not too much.

Considering the people doing the research, it’s really hard to use a practice you don’t understand at all, but it also gets harder if you understand too much and become aware of all the potential assumptions and pitfalls.

Considering the editors and reviewers, if they get the feeling that they are reading something closer to quantum mechanics (e.g., a manuscript containing formulas that are not merely decorative but must be actively parsed), they might reject it right away, or at least pass the manuscript to another person who is more likely to have the relevant expertise. But if editors and reviewers have an excellent grasp of a certain practice, they will be able to spot all sorts of flaws and issues, which may result in a rejection or at least turn the whole review process into a drag. There’s a sweet spot in between where papers can be churned out quite effectively, although its size depends on how brazen researchers are, how hard reviewers actually try to understand the papers they are assessing (one might suspect that not all of them try at all, exhibit A), and how lenient they’re willing to be when it comes to common flaws.

Figure 3. The Goldilocks principle of optimal understanding. Illustration courtesy of Dr. Lexis “Lex” Brycenet.

Now, all of this means that psychological practices are evolving. But unfortunately, the selective pressures don’t seem to favor practices particularly suitable for providing valid answers–there are even reasons to believe that unsuitable practices might have an edge.[7] This is a bit disheartening, in particular because it means that the initial goal gets lost along the way. Can you still have a long and rewarding professional career? Sure! Will there still be some accumulation of knowledge? It’s possible! But there may also be a lot of wasted effort, busywork, and the nagging feeling that things aren’t really going anywhere.

Discussion

What could be done to intervene?

An individual opt-out from dominant practices may work for some–sure, it’s inconvenient and it might hinder one’s career, but there are always niches to be crafted. This may lead you in one of two directions: (a) you end up publishing your work, but maybe at a somewhat slower pace because you need to explain yourself a lot, and others don’t pick up your practices, so they remain a small blip in the overall picture; or (b) whatever you’re doing differently actually takes off–congratulations! Now people start doing the same thing, and it’s only a matter of time until it is used in completely unsuitable contexts. So, the overall dynamics don’t change at all unless we assume that everybody decides to opt out (“hey look, science can easily be fixed if everybody simply decides to do good science!”).

There’s been lots of talk about getting the incentives right by adding badges, asking for Open Science practices in job ads, etc. While I don’t oppose any of these measures, I’ve gotten a bit cynical about incentives. It sure does seem like badges are getting gamed quite a bit, and while slightly different job ads send a nice signal, they seem to have turned certain practices into yet another box to tick, regardless of whether their specific implementation did anything to improve the quality of the research. So some of these measures may help, but I’m not convinced they’re sufficient to change the equilibrium.

Lastly, I have seen lots of talk about major system overhauls, like completely abandoning any sort of formalized metric during the hiring process so that researchers are finally “free” to do the best possible science. I was on that team for a long time, but two things changed my mind: Ruben, and the insight that “just reading the freaking papers” doesn’t help much if the people doing the reading are from the same pool as the reviewers and editors who select papers for publication.

Sorry if that all sounds horribly depressing or cynical. Maybe things are fine the way they are because scientific progress is bound to occur only at the very top of the field, and all the busywork heading nowhere is necessary inefficiency? Okay, that doesn’t sound any less cynical (and elitist on top). Actually, I do think that change might be possible–we might be able to reach escape velocity.

Improved training may lead to reviewers and editors more capable of questioning standard operating procedures (generational replacement for the win). Registered Reports may embolden researchers to actually try the best possible analysis without the risk of non-publication, if only to satisfy pesky reviewers. Maybe we can get in some outside reviewers who question established practices. And a stronger focus on quality rather than quantity during hiring and promotion may give everybody more room to breathe and ever so slightly shift priorities. 

Am I still mildly optimistic about the future of psychology? I’d say I’m a solid 6 on a scale from 0 to 10.

I would like to thank a literal banana whose criticism of the Big Five pushed me to finally finish this post.

Update [12th November 2020]: Paul Smaldino and Cailin O’Connor have published an article in which they demonstrate how input from outside disciplines (such as peer review) can break down barriers to methodological improvement. I find their paper very convincing, and after reading it, I have to adjust my self-report: my optimism about the future of psychology is now more of an 8.

Footnotes
1 I’m including myself in any criticism voiced in this blog post. If you get the impression that my TOE steps on yours, please simply assume that I’m just talking about myself here.
2 If you’re really good at this, you won’t deliver exactly what they want to see, but intentionally leave some minor totally inconsequential deviations. That way, they can fulfill their duties and raise some issues without thinking too hard about your manuscript, which is an absolute win-win situation.
3 Examples in this blog post have been carefully hand-picked to throw shade on its author.
4 Example courtesy of Paul Bürkner, via Ruben.
5 If they didn’t exist, people would probably simply use very small samples to be able to retain their favored model.
6 It may also increase because the model is inappropriate, but I don’t think word has spread yet, and in any case the reason doesn’t really matter–stats are mostly causally opaque to mortals.
7 Smaldino & McElreath arrive at the same conclusion in The Natural Selection of Bad Science, although they model somewhat different dynamics (with a focus on labs rather than on practices) and certainly do a much more thorough job because (1) they actually set up a formal model and (2) actually know stuff about evolution. In any case, this post certainly was inspired by their work. Also their title is so great I considered stealing it.

5 thoughts on “On the origin of psychological research practices, with special regard to self-reported nostril width”

  1. Delightfully engaging, witty, on target.

    Your parenthetical about finding a citation for the claim that fMRI images enhance belief in textual claims brought a smile. As you may know, Michael, Newman, Vuorre, Cumming, and Garry (2013)
    https://link.springer.com/article/10.3758/s13423-013-0391-6
    reported multiple efforts to replicate the effect reported by McCabe and Castel (2008).

    1. Say what! I didn’t know about the replications, thanks for pointing that out!

  2. Great post, as always.

    I only disagree with “Now, that doesn’t mean that SEM fit-indices are bad or shouldn’t be used!” I would contend that they really should never be used, because the degree of misfit does not reliably correspond to the degree of misspecification. As far as I can tell, this is not just due to the influence of nuisance parameters on model fit, but rather an inescapable problem with fit indices– we will never know how far off from the true model we are just from looking at the discrepancy between the actual and model-implied matrices. SEM is, I think, a great example of ‘goldilocks understanding’: Lots of social scientists know that they can advance their careers using models that seem profound and are easy to fit. But the review process is extremely lenient on fundamentals, like model testing, making plausible distributional assumptions, etc.

    If I think my model is still “approximately” correct even after the chi-square test is telling me that something is wrong beyond what can be explained by sampling error, I should simply argue for that conclusion, preferably after examining local fit issues and alternative models. (I am entirely guilty of relying on approximate fit myself, but I hopefully have a career ahead of me to correct course.) If I can’t convincingly argue for that conclusion, then a whole bunch of sensitivity checks are probably in order, which might be the best way to go anyway. I think the use of fit indices has given researchers license to not even ponder what might be going wrong in their model. If they did, they might realize that they should maybe try to address existing measures’ limitations (rather than the current practice of taking the measure as given, no matter what mistakes the developers inevitably made in creating it) or re-think the construct (which, as you allude to with regard to the Big Five, should probably be the top priority of personality psychology right now).

    1. Thank you! Your point sounds fair to me — maybe there really isn’t any scenario where fit indices make sense. When we teach CFA in the context of our test design seminar, it’s all about investigating local misfit. That said, I suspect the issues with SEM are much more fundamental (i.e., a model may fit well but still be nonsensical). I think psych is great at fooling itself that those arrows don’t imply assumptions about the underlying causal structure. But thinking about those issues and the related hard measurement issues would mean that we had to stop business as usual…

      1. I was taught a bit about local fit in my coursework, but I failed to understand its importance until recently because I had internalized the idea that it is OK to assume that whatever parallel analysis+EFA tells you is approximately correct. I couldn’t agree with you more that resisting causal interpretations of SEM is standard. I wish there was a way to stop business as usual; I think the hardness of a measurement issue is probably a good heuristic for how theoretically important it would be to solve it.
