Who would win, 100 duck-sized strategic ambiguities vs. 1 horse-sized structured abstract?

It is the curse of transparency that the more you disclose about your research process, the more there is to criticize. If you write a preregistration, every minor change of protocol can be uncovered and held against you. If you share your analysis code, every minor typo can be detected and interpreted as a sign of your general incompetence. If you share your data, your worst enemy may figure out that including one particular covariate reverses the result.

If you don’t do any of these things, nobody will ever find out that your study on the effects of birth order position on dairy preferences started out as a longitudinal study on political polarization, and that you determined statistical significance by having your hamster run over a printout of the data table, which actually contains 95% missing values that were coded as zero.

Transparency at times appears masochistic because it reduces the room for strategic ambiguity.[1] If you instead remain opaque, the picture of the actual research process stays vague, and readers have to choose their own interpretation of what precisely happened. These interpretations tend to be, at the time and on average, more charitable than reality. For example, we know by now that computational reproducibility is low in certain fields (see the section on reproducibility in Nosek et al., 2021). Yet when we hear of a finding and don’t know anything about its computational reproducibility, we probably implicitly assume that it reproduces just fine (unless we’re already completely disillusioned about the scientific process). Some people even champion the view that we should be as charitable as possible unless we have evidence to the contrary. Which, of course, renders strategic ambiguity even more attractive.


Analysis Goal Ambiguity

But strategic ambiguity in psychological research goes beyond standard issues of research transparency. If you take a look at studies involving observational data, there is often a ton of strategic ambiguity about the analysis goal (we discussed this in Grosz et al., 2020). If you read the introduction and the discussion, the goal of the study seems to be to provide an estimate of a certain causal effect. But if you look at the methods and results, you often find language that carefully maintains plausible deniability. Here, we are trying to predict things after adjusting for…; this one thing accounts for that other thing; oh look, there is a longitudinal association! These could just as well be statements about statistical features of the data, nothing else, and they lead to a weird disconnect between the different sections of an article.

For example, if this is about prediction—why isn’t the predictive performance of the model evaluated? Why is there no actual real-life use case for the predictive model? Why are so few predictors included and individual coefficients interpreted? When we say that something “accounts for” something, that feels like an explanation of some sort—but which one is it? Maternal characteristics, such as education, account for much of the association between breastfeeding and the child’s cognitive abilities. According to Cesario, different rates of exposure to police through violent crime account for the overall per capita racial disparities in being fatally shot by police. One thing accounts for the other, but the structure of the suggested explanations is completely different (confounding vs. mediation). And why would we get more excited about a longitudinal association than about a cross-sectional one? The answer is that one of them rules out certain alternative explanations—but alternative to what exactly? Alternatives to a causal interpretation, yet despite all the enthusiasm for longitudinal data, it is surprisingly hard to find psychologists who explicitly talk about the relationship between longitudinal data and causal inference.[2]

Discussion sections also routinely include some hedging to the effect that the results cannot be interpreted as causal effects and that future experimental studies are needed. If we took these words literally, they would not only contradict half of the remaining article, but also prompt us to wonder (1) why we even read the article if the most interesting interpretation is not warranted and (2) when all those future experimental studies are going to happen.

This type of strategic ambiguity about the analysis goal is not limited to observational studies — experimental manipulations can also leave you to wonder: what the heck just happened? What were the authors even trying to show? How do the design and the statistical analyses relate to whatever it is they wanted to show? I have now spent enough time in psychological research that I no longer chalk my confusion up to a lack of experience. The studies really are confusing.[3]


Don’t Hate the Player

At this point, I should point out that I do not believe that all this strategic ambiguity is anybody’s fault in particular. Eisenberg (1984) argues that ambiguity in organizational communication develops in response to multiple and often conflicting goals. That may well be true for ambiguity in research, too. If I want to write a paper for a top psychology journal, I need to tell an interesting story (which will likely involve some form of general causal claim, because that is how stories work), but I also need to respect the doctrine “correlation does not imply causation.”[4] Ambiguity allows me to enjoy the taste of causal inference without ever touching the forbidden fruit, or put another way, to have my fruit cake and eat it too. I may also just imitate what I see in successfully published articles, because of course we learn from each other.

All that ambiguity isn’t always unsatisfying for readers either. For example, it can create a great illusion of understanding, albeit at the cost of actual understanding (Smaldino, 2016). The pinnacle of this genre is the empirical study that is essentially an elaborately told story interspersed with occasional, vaguely related empirical illustrations. I have heard people rave about such articles, and they are occasionally expanded to book-length format. While there is nothing wrong with storytelling and well-crafted narratives, something seems off when data play such a minor role in a field that is quick to point out that it is in the business of very serious quantitative empirical science.
At the same time, ambiguity is intellectually frustrating, and it often results in studies whose actual contribution to the literature is questionable—it is impossible to assess whether a study provides a good answer if the question remains unclear in the first place. And it seems virtually impossible to do cumulative science in this manner; empirical observations just co-exist horizontally.[5] In the end, it probably boils down to a lot of wasted talent and time, because I’m sure we could collectively do better.

Of course, my diagnosis is far from original, and others have suggested remedies for ambiguity and related ailments of psychology. Among these are a greater focus on theory (e.g., see this Perspectives on Psychological Science special issue) and/or more formal modeling (e.g., Smaldino, 2016; Guest & Martin, 2021, in the aforementioned special issue). There have also been calls for greater consensus building (e.g., Leising et al., 2021), which could be counted among these remedies as well. When I read these suggestions, I find myself nodding along, but at the same time they are very far removed from the average study in my field. The counterfactual world in which psychologists very carefully consider how theory is linked to the empirical phenomena of interest, or in which they routinely build formal models of the process of interest, is certainly one in which strategic ambiguity is less of a problem. I just don’t see us getting there anytime soon.

So the intervention I am going to suggest is a bit humbler in scope:

Quantitative empirical journal articles should be accompanied by structured abstracts that contain the sections “Theoretical Estimand” and “Identification Strategy.”

The theoretical estimand is “the central quantity of each analysis […] that exists outside of any statistical model.” This quote is from Lundberg, Johnson, & Stewart (2021) who, in my opinion, wrote the definitive piece on theoretical estimands that everybody should read. The identification strategy is your strategy to ensure that the empirical estimand (whatever you get out of your observed data) actually corresponds to the theoretical estimand of interest.

Let’s unpack that a bit to see what it could look like in practice.

In the framework by Lundberg et al., the theoretical estimand consists of a unit-specific quantity and a target population. In the simplest case, we might want to describe the distribution of some variable. For example, our unit-specific quantity may be an individual’s self-reported life satisfaction, and we want to aggregate this over the population of a part of Germany. The unit-specific quantity may also involve unobservable quantities. For example, if we are interested in the effect of employment status on life satisfaction, the unit-specific quantity would be the difference between two potential outcomes: one’s life satisfaction if one were employed minus one’s life satisfaction if one were unemployed. The possibility of incorporating unobservables, which Lundberg et al. call “liberating”, also makes the abstraction quite versatile. For example, we could accommodate estimands that involve latent variables; say, an individual’s subjective well-being rather than their reported life satisfaction.
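For the notation-inclined, here is one way to write down these two examples in Lundberg et al.’s “unit-specific quantity, aggregated over a target population” template (the symbols are my own shorthand, not theirs):

\[
\theta_{\text{descriptive}} = \frac{1}{N} \sum_{i \in \text{target population}} Y_i,
\qquad
\theta_{\text{causal}} = \frac{1}{N} \sum_{i \in \text{target population}} \big( Y_i(\text{employed}) - Y_i(\text{unemployed}) \big),
\]

where \(Y_i\) is person \(i\)’s life satisfaction and \(Y_i(\text{employed})\), \(Y_i(\text{unemployed})\) are the two potential outcomes. The second quantity is perfectly well-defined even though we can never observe both potential outcomes for the same person.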

Experimental studies of course also have theoretical estimands, and here, there’s some interesting room for differentiation. Imagine you did a study in which you let people rate the attractiveness of potential dating partners, depicted either with or without a pet. Now your estimand may be the contrast in the attractiveness ratings (with vs. without pet) for the particular people depicted on the photos you use within your study; or it may instead be, more generally, the contrast in the attractiveness ratings averaged across all potential photos of people, even those you did not use for your study. In the latter case, a generalization across stimuli is already folded into the theoretical estimand, and it may thus be closer to the hypothesis that you actually intend to test.[6]
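To make the difference concrete, here is a rough sketch in the same shorthand as above (mine, not taken from any particular paper): let \(Y_i(\text{pet}, p)\) denote person \(i\)’s attractiveness rating of photo \(p\) in the pet version.

\[
\theta_{\text{these photos}} = \mathbb{E}_i\!\left[ \frac{1}{|P_{\text{study}}|} \sum_{p \in P_{\text{study}}} \big( Y_i(\text{pet}, p) - Y_i(\text{no pet}, p) \big) \right],
\qquad
\theta_{\text{photos in general}} = \mathbb{E}_i\, \mathbb{E}_{p}\big[ Y_i(\text{pet}, p) - Y_i(\text{no pet}, p) \big],
\]

where the outer expectation runs over the target population of raters and, in the second estimand, the inner one over some population of photos. Only the second estimand makes a claim about photos nobody ever rated, which is exactly what will later require an additional identification assumption.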

The target population is often neglected in psychology, but explicating it nicely fits with the idea of “constraints on generality” (Simons, Shoda, & Lindsay, 2017). For convenience samples, there’s a cheap way out: “well, I intend to generalize to whatever population my sample happens to be representative of, checkmate!” But it is quite possible that the sample happens to be representative of nothing in particular, so it would be preferable to consider whether we can nonetheless aggregate the unit-specific quantity over a more meaningful population, for example, with the help of poststratification. I’ve sometimes encountered the notion that saying one’s research is merely descriptive dodges the many requirements and assumptions underlying causal inference. But even mere description requires a meaningful target population to be of interest, unless one means to say that mere description means abandoning inference altogether.[7]
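Poststratification itself is easy enough to write down (a minimal sketch; the hard part is obtaining the population shares and believing that the sample is representative within each stratum):

\[
\hat{\theta}_{\text{target population}} = \sum_{g} w_g \, \hat{\theta}_g,
\]

where \(g\) indexes strata (say, age-by-education cells), \(\hat{\theta}_g\) is the estimate within stratum \(g\) of the sample, and \(w_g\) is that stratum’s known share of the target population.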

Another potential twist occurs when the theoretical estimand actually concerns a situation that is different from the experimental set-up. For example, we may be interested in the effects of a drug on performance in a “real-life” situation, but we won’t actually let drugged pilots fly airplanes, so instead we use a flight simulator. Again, this could be folded into our theoretical estimand, with downstream consequences for our identification strategy. This can be conceptualized as the “transport” of a causal effect.

Now, I’m open to the possibility that some meaningful quantitative empirical research questions do not result in well-defined theoretical estimands. In that case, the estimand section can be substituted with a very clear explanation of what exactly the authors are trying to achieve with their analysis. Of course, one might argue that any article should include such an explanation anyway, so why bother with theoretical estimands? But I do believe that theoretical estimands are an immensely useful tool to elicit clear descriptions of study goals. Telling people “just be clear!” does not tell them how they could achieve clarity; asking them for their theoretical estimands sets a transparent expectation for what this clarity is supposed to look like.

Identification Strategies

When it comes to the identification strategy, we have to pay for our flights of fancy. How can we link our theoretical estimand to the cold and harsh reality of observable data? What assumptions are necessary?

If we settle for something “easy”, like self-reported life satisfaction across the people of Saxony, it is sufficient to collect a random sample of people from Saxony and ask them to self-report their life satisfaction. Now if you think about that more carefully, it probably is not that easy—will you get all people to participate? What if missingness is non-random?—which only highlights that even description requires careful (causal!) thinking.
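Take the mean as the concrete summary; then the core of the problem fits into one line. What we actually compute is the mean among those who responded, and that only informs us about the population estimand under an assumption about non-response. If \(R_i = 1\) indicates that a sampled person responded,

\[
\mathbb{E}[Y \mid R = 1] = \mathbb{E}[Y] \quad \text{holds if } R \perp Y,
\]

and if responding depends on life satisfaction itself (or on causes of it), the two quantities can drift arbitrarily far apart unless we model the missingness.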

Within causal inference frameworks, considering identification is a central task, and links can be formalized with different approaches (see the section “Identification” in Lundberg et al.). Assumptions may, for example, take the form of “there are no common causes of X and Y (except for…)” or, if longitudinal data are involved, “there are no time-varying confounders that affect X and Y.”
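For readers who prefer symbols: in potential-outcomes notation, the textbook version of the “no unmeasured confounders” assumption and the identification result it buys look roughly like this (a generic sketch, not tied to any particular study):

\[
Y(x) \perp X \mid Z
\quad \Longrightarrow \quad
\mathbb{E}[Y(x)] = \sum_{z} \mathbb{E}[Y \mid X = x, Z = z] \, P(Z = z),
\]

where \(Z\) is the set of covariates we adjust for; together with consistency and positivity, the counterfactual mean on the left can then be computed entirely from observable quantities on the right.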

If our theoretical estimand involves latent variables, our identification strategy will have to involve measurement assumptions to ensure that our statistical analyses can inform us about these unobservable quantities.
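In the simplest case, such a measurement assumption could be a one-indicator factor model (purely an illustration):

\[
\text{reported life satisfaction}_i = \lambda \cdot \text{SWB}_i + \varepsilon_i, \qquad \varepsilon_i \perp \text{SWB}_i,
\]

where the estimand is defined in terms of the latent \(\text{SWB}_i\), and assumptions about \(\lambda\) and \(\varepsilon_i\) (for example, measurement invariance across the groups being compared) are what license conclusions about the latent quantity from the observed reports.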

If our theoretical estimand was supposed to be some generalization across stimuli, our identification strategy will have to involve a stimulus set that is representative of the space of interest.

If our theoretical estimand is about flying airplanes and we only got a flight simulator, we will have to make assumptions about how these situations map onto each other—for example, about potential effect modifiers—and these can be derived systematically (Pearl & Bareinboim, 2014).
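In the simplest scenario covered by that framework, the resulting transport formula looks like this (glossing over the graphical conditions under which it holds):

\[
P^{*}(y \mid do(x)) = \sum_{z} P(y \mid do(x), z) \, P^{*}(z),
\]

where \(P\) refers to the study setting (the simulator), \(P^{*}\) to the target setting (actual flights), and \(Z\) is a set of measured effect modifiers that captures all relevant differences between the two settings.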

A sophist may remark that any inference involves an unlimited number of auxiliary assumptions, including unknown or even unknowable ones, which you unfortunately won’t be able to squeeze into the 150-word limit for Psychological Science abstracts. I concede that this may be the case (there’s always more space in the supplement, though). My suggestion is not meant as a bullet-proof strategy to unearth literally all assumptions. Life is short, scientists have to prioritize what they care about, and history may always prove us wrong. Some assumptions may have become invisible to us and may introduce biases that only people from other fields can spot. But that does not mean that we should turn into methodological nihilists—we can still try to be transparent about the central assumptions underlying our inferences.

Obstacles

I see a couple of potential problems with my suggestion. First, many may struggle to define their theoretical estimands. That is, of course, a feature, not a bug. Statistical consultants routinely report on Twitter how their first job is to figure out what on earth their clients are even trying to do, and being a peer reviewer often feels similar. But this, for once, is a task that I believe only authors themselves can take charge of; iteratively fixing it during the peer-review process can result in a lot of frustration for all parties involved. Demanding a theoretical estimand transparently communicates to authors what is expected of them, prior to submission.

More importantly, we would need to provide resources to help reviewers assess whether theoretical estimands are well-defined, whether the identification strategy is valid, and whether the central assumptions are spelled out appropriately. For this, we would need a collection of worked examples—common theoretical estimands, common identification strategies, and their assumptions. If reviewers cannot assess the information provided by the authors, there’s always a risk that such a structured demand turns into an empty box-ticking exercise.

Lastly, it may take considerable effort to convince people that structured abstract sections add anything. I used to consider structured abstracts quite off-putting—I really dig the narrative and verbose style[8]—until I participated in a language coding project by Noah Haber, Sarah Wieten et al. We tried to code the language used to describe the association between the central variables in observational health studies. Most of the abstracts I got to review were structured, and coding them was a breeze. But I had two unstructured abstracts in the coding set, and explicitly trying to recover a central piece of information (“how are X and Y related according to the authors”) was a nightmare. This really gave me the impression that the imposed structure enforces a certain discipline in thinking. People may be afraid that this results in a conceptual straitjacket, and in any case structured abstracts probably reduce the effects of “being able to craft nice narratives” on “being read widely”, which may be deemed a good or a bad thing. If we want to keep the “flair” of unstructured abstracts, a dual system could also be a solution: give people 200 words to tell a nice story, but then add a box with important key information (theoretical estimand, identification strategy, sample size, etc.).

Here’s a challenge, if you want to give structured abstracts a shot: Take an abstract you wrote, write the sections “Estimand” and “Identification Strategy” and post them on Twitter with the hashtag #estimand.

Appendix: An annotated example

In 2015, we (Rohrer, Egloff & Schmukle) published a study on the effects of birth order position on personality. Here is how I’d phrase the structured abstract sections:

Theoretical Estimand(s): The causal effects of birth order position on personality scores in three populations (contemporary Germany, Great Britain, and the United States). This effect is defined as the contrast between the potential outcomes if one were born first vs. second (vs. third, etc.) within one’s family.
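In potential-outcomes shorthand (mine, not how we phrased it in the paper), the first-vs.-second contrast would be something like

\[
\tau_{1,2} = \mathbb{E}\big[ Y_i(\text{born first}) - Y_i(\text{born second}) \big],
\]

evaluated separately for each of the three target populations, where \(Y_i(\cdot)\) denotes person \(i\)’s personality score under the respective birth order position.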

Notes: The estimand of this study was underspecified, probably because the counterfactuals involved are so confusing. If you’re a firstborn, what’s the counterfactual world in which you are a secondborn? Did your parents have a kid before you were born? Or are we talking about a scenario in which the fertilized eggs of both you and your future siblings can be swapped out? In-vitro birth order experiments, anyone? I still think it makes sense to think of the effects of birth order position, and that they are in principle independent of the effects of parental age etc., but some more work would be needed to conceptualize them properly–which would be a precondition for successful identification.

Identification Strategies and Assumptions: Our central identifying assumption is that, conditional on the number of children within a sibship and on age, birth order position is as good as random. In our between-family analyses, we stratify by sibship size. In the within-family analyses, we only compare children from the same family, so sibship size is already accounted for. The representative nature of the survey studies we used is meant to ensure that we actually recover the average for the target population.

Why do we need to control for sibship size? The larger the family, the more laterborn children it contains. Thus, if we naively compared firstborns to laterborns, the second group would include more individuals from larger families (which may systematically differ on many variables). Why do we need to control for age? In within-family analyses in particular, the firstborn is always older than the secondborn and so on. Thus, one may accidentally confuse age effects with birth order effects.
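A minimal sketch of what this amounts to in model form (simplified relative to the actual analyses reported in the paper): the identifying assumption is that, given sibship size and age, birth order is as good as random,

\[
Y_i(k) \perp \text{birth order}_i \mid \text{sibship size}_i, \text{age}_i,
\]

and the between- and within-family variants can then be written as

\[
\text{between: } Y_{ij} = \beta \, \text{BO}_{ij} + \gamma \, \text{age}_{ij} + \delta_{s(j)} + \varepsilon_{ij},
\qquad
\text{within: } Y_{ij} = \beta \, \text{BO}_{ij} + \gamma \, \text{age}_{ij} + \alpha_j + \varepsilon_{ij},
\]

where \(j\) indexes families, \(\delta_{s(j)}\) are sibship-size strata, and \(\alpha_j\) are family fixed effects, which absorb sibship size along with everything else that is constant within a family.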

When would our identification strategy fail? Consider a situation in which parents decide whether or not to have another child based on the personality of their previous children. In such a situation, particularly challenging kids may end up lastborn because of their personality. In such a scenario, our analyses would not successfully recover the effects of birth order position. This potential threat to validity is quite real, see e.g. Jokela (2010).

One issue we sort of ignored is parental age. We have a hand-wavy footnote on the subject, but there is a real chance that any birth order effect may also simply reflect an effect of parental age (or parental age may mask real birth order effects). I was aware of the issue at the time because Ruben told me about it at the European Conference on Personality 2014.

What I have learned from this exercise: In hindsight, we should have put more thought into the definition of birth order effects and the involved counterfactuals. That would have probably affected how we presented results. For example, we could have started with a naive comparison (here is what you get if you simply compare first- and laterborns across all sibship sizes), then the same comparison stratified by sibship size, which, I believe, is close to what people have in mind when they think of birth order effects — is the first of two kids systematically different from the second of two kids? And then discuss under which assumptions this plausibly recovers the effects of birth order position on personality, depending on how we conceptualize them. I don’t think our central conclusions would have changed much, but we would have arrived at them through slightly different arguments, which ideally would have been somewhat more consistent.

Footnotes

1 Strategic ambiguity is an idea that I picked up from Smaldino (2016), who in turn cites Eisenberg (1984).
2 Relationship status: It’s complicated
3 However, as the confusion has intensified over time, early-onset cognitive impairment remains a plausible alternative explanation.
4 “Near a great forest there lived a poor woodcutter and his wife, and his two children. They had very little to bite or to sup, and once, when there was great dearth in the land, they decided to take the children into the forest. However, correlation does not imply causation, and we cannot rule out that they would have decided to do so regardless of famine. Future fairy tales should…”
5 To quote Anne: chaos-in-the-brickyard metaphor alert.
6 I think this is what Tal Yarkoni is trying to say in his piece on the “generalizability crisis”, although I’m not sure I can follow all of his points.
7 There has been a call to abandon statistical inference, but it turns out that it calls for treating “statistical results as being much more incomplete and uncertain than is currently the norm”, which is probably rather uncontroversial.
8 You may have noticed, if you have come this far.