The cross-lagged panel model debate in psychology provides the backdrop for this blog post; knowledge of its dark secrets is not necessary to follow along. But if you *do* want the CliffsNotes, read this footnote.^{[1]}
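To make the cross-lagged regressions described in footnote 1 concrete, here is a minimal simulation sketch. All variable names and coefficient values are made up for illustration; real applications would use SEM software, and—this is the whole point of the debate—these coefficients can be confounded by correlated stable traits, which this simulation deliberately omits.

```python
# Sketch of the CLPM idea: regress each variable at time 2 on both
# variables at time 1. Hypothetical two-wave data, no trait confounding.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulate a true cross-lagged effect of x1 on y2 only.
x1 = rng.normal(size=n)
y1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)             # x predicted only by itself
y2 = 0.5 * y1 + 0.3 * x1 + rng.normal(size=n)  # cross-lagged path x1 -> y2

def ols(y, *predictors):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

_, stability_y, cross_xy = ols(y2, y1, x1)  # y2 ~ y1 + x1
_, stability_x, cross_yx = ols(x2, x1, y1)  # x2 ~ x1 + y1

print(f"cross-lagged x -> y: {cross_xy:.2f}")  # recovers ~0.3
print(f"cross-lagged y -> x: {cross_yx:.2f}")  # ~0
```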

What I realized during this panel is that personality psychologists (and, by extension, other subfields of psychology that like to do fancy modeling) very much approach statistical modeling from the statistics side. They run some analyses that return numbers. Then, they work their way toward an interpretation of those numbers.

In this blog post, I want to argue that this leads to a lot of confusion, and that there is another way – starting from the desired interpretation and then working towards a statistical analysis that can (hopefully) deliver.

For example, let’s consider longitudinal modeling. Researchers have two variables and are interested in their “interrelations” or “how they relate over time” or “how X and Y contribute to each other.” Then they mostly apply some out-of-the-box model which is usually referred to by acronym, such as the CLPM (cross-lagged panel model) or the RI-CLPM (random intercept cross-lagged panel model) that were the subject of the conference symposium mentioned above, or maybe the ARTS (autoregressive trait-state model) or the STARTS (stable trait autoregressive trait and state model), or maybe the LGCM, or a LCSM. These are all part of the SEM family, but to spice things up, somebody may also pick an HLM/MLM/REM/MEM.^{[2]} You get the idea. Finally, all or most resulting coefficients are interpreted, one way or another, often following some template people acquired when learning about the model (“this number here means that”) combined with some freestyle variations.

Of course, not everybody works with longitudinal data, and so there are variations of approach. For example, researchers may set up some model to “decompose the variance” in some measures and then see how the variance components “are related.” Or they may work with self- and other-report data of some construct and somehow transform and aggregate these in various ways to derive some metric, and then correlate said metric with something else, and then try to interpret the resulting correlations between correlations. In a sense, there’s a lot of highly creative data analysis going on in personality research.

What these approaches have in common is that the starting point is the “method”, the statistical model or analysis that is applied. Which method is chosen is determined partly by the data structure but also by tradition. For example, if you have nested data (of the non-longitudinal type), there is some default sentiment that you have to deal with that by means of multilevel modeling – multiple articles have tried to make the point that if you are not interested in the whole multilevel structure, maybe there are more minimalistic solutions, but that remains a fringe viewpoint. If, instead (or additionally), you have self- and other-reports, you have to “control for normative desirability” to derive “distinctive personality profiles” which you can then correlate to arrive at supposedly interpretable correlations. Certain trends come and go; for example, bifactor models were popular in certain literatures that are rather foreign to me, but they seem to have fallen out of favor. One may be forgiven for thinking that the whole CLPM debate is another example of such a trend—who knows what will come after the RI-CLPM.^{[3]}

This particular way of dealing with statistics leads to some peculiarities. Even substantive personality researchers will often specialize in certain model classes and become very adept at interpreting their moving parts (“oh, this variance term here means that…”). From the outside, this may give the impression that personality researchers have particularly well-developed methods skills, and I wouldn’t disagree with that.^{[4]} We also end up with a weird mix of conservative rigidity and postmodern nihilism. Considering the former, if you have a particular type of data, people will act as if one particular model were obviously the only right choice. Incidentally, this makes it very hard for non-psychologists to publish in personality journals; your fixed-effects model might get rejected because in this house we analyze longitudinal data differently. Considering the latter, there is still an underlying sentiment that “any model goes.” For example, if the CLPM results in confounded inferences, that’s your problem as the researcher who overinterpreted the model. The model did nothing wrong; it did precisely what it was supposed to do. What is wrong are your inferences. Classic beginner’s mistake. But surely somebody out there will have the right research question for the model you reported.

Now, here’s an alternative approach one could envision. One could start from a clearly defined analysis goal, such as “I am interested in the causal effect of X on Y in target population Z.” Or maybe one could simply be interested in the distribution of X in a certain population Z; or maybe even just in the (unconditional, bivariate) correlation between X and Y in Z. I am the last person to tell people what they should be interested in—but the first one to tell them that if they don’t tell me what they are trying to estimate in the first place, why even bother.

These analysis goals are so-called theoretical estimands, and the wonderful paper by Lundberg et al. (2021) explains that they should be described in precise terms that exist outside of any statistical model. It also illustrates how to do so. To supplement this approach, given how many researchers insist that they are interested in prediction rather than causal inference, I am willing to concede that one could also start from a clearly described scenario in which one actually wants to make predictions—predictions in the sense of predictions (e.g., trying to predict how satisfied two romantic partners will be after one year, based on their personality right now), not in the oh-it’s-only-a-correlation-hm-but-maybe-it-is-also-more-no-but-definitely-not-causal sense in which it is often used in psychology (e.g., this new questionnaire predicts 0.1% of the variance in subjective well-being above and beyond this 120-item Big Five questionnaire; for more ramblings on incremental validity see also section 3.2 in Rohrer, 2024).

Now, we still need to figure out how to actually learn something about the theoretical estimand of interest, or alternatively, how to best predict the outcome. Depending on the estimand and the available data, we may actually end up using a CLPM/RI-CLPM/ARTS/STARTS/LGCM/LCSM/SEM/HLM/MLM/REM/MEM after all. But if the analysis goal is causal inference, then quite likely we will realize that we additionally have to adjust for at least some covariates to reduce confounding. And we wouldn’t want to interpret every single coefficient that the model returns. In fact, many of the coefficients may be uninterpretable (this is known as the Table 2 fallacy). But that’s not a bug; we don’t need to be bothered by it if the coefficient corresponding to our estimand of interest is interpretable (big if). So we might end up with a similar model, but approach it in a different spirit, and our interpretation of the results may be a lot more focused.
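A minimal sketch of this point, with entirely hypothetical variables: adjusting for a confounder C can recover the coefficient for our estimand of interest (the effect of X), while C’s own coefficient in the very same regression does not carry the analogous interpretation—that’s the Table 2 fallacy in miniature.

```python
# Hypothetical setup: C confounds X -> Y. Adjusting for C recovers the
# effect of X; C's coefficient in that regression is only its *direct*
# effect holding X fixed, not its total effect (Table 2 fallacy).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

c = rng.normal(size=n)                      # confounder
x = 0.8 * c + rng.normal(size=n)            # C -> X
y = 0.4 * x + 0.5 * c + rng.normal(size=n)  # true effect of X is 0.4

def slopes(y, *preds):
    """OLS slopes, intercept dropped."""
    X = np.column_stack([np.ones(len(y)), *preds])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

(naive,) = slopes(y, x)     # confounded: noticeably larger than 0.4
b_x, b_c = slopes(y, x, c)  # b_x ~ 0.4; b_c ~ 0.5 is NOT C's total effect
                            # (that would be 0.5 + 0.8 * 0.4 = 0.82)

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {b_x:.2f}")
```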

If instead the analysis goal is just estimating a correlation, we might as well end up just calculating a correlation. Researchers sometimes come up with a hierarchy from “spurious” correlations to “true”/”interpretable”/”robust” correlations. The latter are usually correlations after more or less successfully conditioning on confounders that may bias the correlation relative to some estimand of interest. But the estimand of interest isn’t spelled out, and so it seems like a convoluted attempt to tweak the concept of a correlation until it provides the answer to the unarticulated research question. But if you are really just interested in the correlation, all that statistical control and other contortions may be unnecessary.^{[5]} And if you are *really* interested in *actual* prediction, knock yourself out but do it properly.
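A small illustration of why the distinction matters (hypothetical variables throughout): the plain correlation and the “controlled” partial correlation are different quantities answering different questions, so “controlling for Z” does not make the plain correlation any more true—it replaces it with another estimand.

```python
# The correlation of X and Y, and their partial correlation given Z,
# are simply different estimands. Here X and Y correlate only via Z.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = 0.7 * z + rng.normal(size=n)

r_xy = np.corrcoef(x, y)[0, 1]  # the plain (bivariate) correlation

def residualize(v, z):
    # Regress v on z (no intercept needed; simulated means are ~0).
    return v - z * (v @ z) / (z @ z)

# Partial correlation: correlate what's left of X and Y after removing Z.
r_partial = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"corr(X, Y):           {r_xy:.2f}")      # clearly positive
print(f"partial corr given Z: {r_partial:.2f}") # ~0
```

Neither number is “wrong”; the question is which one corresponds to what you actually want to know.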

Estimands aren’t magic; being upfront about the analysis goal does not guarantee valid inferences. To ensure that we can actually recover the theoretical estimand of interest from the data, assumptions are necessary.^{[6]} For example, if we want to learn about causal effects based on observational data, those assumptions will often have to be quite strong. Maybe they are prohibitively strong, which leaves us in the same spot as doing things the other way around—here are some numbers, we are not sure what to make of them. So:

We should bother because moving from statistics to interpretations is clearly confusing people. A lot. I would be willing to say that in personality research, people being confused about what they try to achieve with their often supposedly sophisticated statistical models is among the top sources of research waste. Years of PhD students’ lives are consumed learning and implementing arcane analyses that may not even be the best way to address their research questions at hand. Months of reviewers’ and editors’ lives are wasted trying to figure out what the hell the authors are even trying to do, often in a lengthy back-and-forth. There are just so many debates consuming the time and energy of researchers that could be resolved if everybody had a clearer idea of their estimand in the first place.^{[7]}

I know the whole estimand thing is a tough sell for multiple reasons.

First of all, if we explicitly acknowledge that there’s something out there in the world about which we want to make statements, it makes our findings potentially fallible. In contrast, mere statistics won’t let you down. A conditional association is what it is. Maybe we shouldn’t overinterpret the data and instead just let them speak for themselves? The bad news is that if you’re doing substantive research, substantive interpretations are going to happen either way—otherwise the whole exercise of running statistical analyses would be pointless. Being vague about the desired interpretation may maintain some degree of plausible deniability and potentially offload the error to others (“oh, you shouldn’t have interpreted our numbers like that!1”). But honestly, that’s just a cowardly way to do science.^{[8]} Life is too short for that.

Naturally, you may still feel bad about communicating your preferred interpretation with too much certainty. That’s good, actually. You can communicate this uncertainty throughout your manuscript and use the limitations section to point out how mistaken assumptions may invalidate inferences.^{[9]} Just don’t try to squeeze that uncertainty into the analysis goal (“maybe we’re trying to do causal inference; maybe we’re trying to do prediction; maybe it’s a secret, more complex third thing”). Own your estimand (“we’re trying to make statements about this; here’s our answer; here’s how it could be wrong”).

Another part is that people will say “oh, but of course you need to pick the right model for the research question at hand, everybody knows that!” I admit the underlying insight may be trivial—if you are not clear about what you want to do, it’s really hard to do it well. But the devil is in the details. I see a lot of researchers pay lip service to the notion that you need to pick the right model for your research goal. Then they will memorize which vaguely phrased research questions can supposedly be answered by which coefficient in what model. And then if you look into the literature, it’s predictably still a mess.^{[10]} So maybe having a clear research question and picking the right model for the task is not all that trivial. Here, the estimands framework enforces rigor because it provides a systematic way to spell out the analysis goal. And then in the next step, thinking about the necessary identifying assumptions enforces rigor about how to connect the research question to the statistical model.

Another reason why this is a tough sell is that it to some extent devalues researchers’ hard-earned stats skills. *Nobody* likes to see their skills devalued. If you have invested years into mastering the details of all the moving parts of SEM, or HLM, or weird index variables derived from other variables (looking at you, profile correlation and Euclidean distance people), surely you would rather keep applying those skills than switch to an estimand angle. The estimand angle may sometimes result in the insight that some sort of (generalized?) linear model is sufficient for the task; and maybe also that, for example, the choice of covariates (which connects to the identification assumptions) matters more than the choice of statistical analysis.

So, it’s important to point out that the skills that are needed to move from statistics to substance—the skills to figure out how certain model parameters can be interpreted, and when inferences go wrong—are of course valuable and useful within the estimands framework, that is, when moving from substance to statistics. The mapping still goes both ways, and in practice there will be some degree of back and forth. For example, one may realize that one’s statistical model behaves weirdly in certain circumstances, which in turn alerts us that we may have missed assumptions necessary to link the theoretical estimand to the empirical estimand. Really, committing to estimands does not change anything about the underlying statistics.

It’s just a different angle from which to approach things that may help clear up some confusion. And, beyond this, it also just makes sense to take the theoretical estimand as a guiding light if you’re a substantive researcher: you mainly do statistics to answer substantive questions, not because you love statistics so much.^{[11]} So let’s start from clearly defined theoretical estimands and move on from there.

Further readings:

- Lundberg et al. (2021): What is your estimand? Defining the target quantity connects statistical evidence to theory. This is an instant modern classic for the social sciences. Admittedly a bit much if you are completely new to causal inference, but in any case worth the effort.
- Auspurg & Brüderl (2021): Has the credibility of the social sciences been credibly destroyed? Reanalyzing the “Many Analysts, One Data Set” project. This provides a great illustration of how unclear estimands create confusion and connects it to meta-scientific discussions about so-called researcher degrees of freedom.
- Kahan et al. (2024): The estimands framework: a primer on the ICH E9(R1) addendum. If there’s any field in which estimands have become somewhat mainstream, it’s medical research/biostatistics/health research/epi/not-sure-how-to-call-it. This article nicely explains the intricacies of estimands within the “simple” context of medical trials (i.e., in a context where it may not be obvious that one could come up with different estimands).

Further ramblings on this blog:

- Mülltiverse Analysis. It has become “a thing” to run a lot of analyses in psychology (e.g., multiverse analysis, specification curve analysis). This raises questions about the underlying estimand, which I discuss in this post.
- Who would win, 100 duck-sized strategic ambiguities vs. 1 horse-sized structured abstract? In which I go all in and demand that we make it mandatory to spell out a clear estimand in the abstract.
- Causal Inference | Hypothesis Testing | All at Once. Maybe you don’t have an estimand because you’re just testing some empirical prediction of your theoretical model? Here, I argue that the same rules still apply.

Footnotes

[1] Starting point: Once upon a time, psychologists used the cross-lagged panel model (CLPM) to draw sort-of-causal-but-maybe-it’s-not-causal-more-research-is-needed inferences without a care. What the cross-lagged panel model does is essentially regress Y on both Y at an earlier time point and X at an earlier time point. If the coefficient of X is significant, that’s a significant cross-lagged effect. Simultaneously, you regress X on both X at an earlier time point and Y at an earlier time point to test for cross-lagged effects in the other direction. Maybe you can already tell why it would be contentious to interpret those estimates causally – it seems a bold proposition that controlling for an earlier outcome would be sufficient to take care of all confounding. Cue vaguely threatening music. Enter Hamaker et al. (2015), whose “critique of the cross-lagged panel model” points out that these “effects” will be confounded if the two constructs have some degree of trait-like stability, and if their stable parts are correlated. This paper had a huge impact on the field (easter egg: make sure you check out footnote 1 of that paper to get some idea of how causality was treated in the psych methods literature, back in the day). In any case, Hamaker et al. say that correlated random intercepts should be included for both constructs. This accounts for potential confounding via the traits (and essentially results in “within-person” estimates, similar to what you would get in a fixed-effects model). The carefree days are over. Once the random intercepts are included, fewer cross-lagged effects turn out to be significant. Orth et al. (2021) to the rescue: maybe it is fine to use the model without the random intercepts, as it supposedly tests “conceptually distinct psychological and developmental processes.” This is a tempting proposition because it implies both theoretical nuance and more significant findings. Further vindication for the original cross-lagged panel model is provided by Lüdtke and Robitzsch’s “critique of the random intercept cross-lagged panel model” (2021). Suffice it to say that this text will be subject to exegesis for years to come. Meanwhile, the editor of at least one major personality journal appears slightly exasperated because time is a flat circle (Lucas, 2023: Why the cross-lagged panel model is almost never the right choice). My own small contribution to this debate is pointing out that if people mainly use these models to draw causal inferences, maybe we should focus on causal inference (Rohrer & Murayama, 2023: These are not the effects you are looking for). It’s not like longitudinal data somehow magically solved causal inference.

[2] These all refer to the same class of models, FML.

[3] I disagree with the underlying sentiment though. If causal inference is the goal and longitudinal data are meant to improve the chances of successful identification, accounting for “the trait”—stable between-person differences—isn’t a question of fashion; it’s the sensible default.

[4] The comparison group obviously matters. But in principle, I believe that personality researchers are on average quite technically competent; I also believe that we could make better use of those powers (hence, this blog post).

[5] This is an oversimplification – the correlation in your possibly selective sample may be a biased estimator of the correlation in the population of interest; in that case you may still need to worry about and take into account third variables. But this is usually not why psychologists invoke third variables. Additionally, there may be concerns about measurement biases; I think those mostly can be rephrased as concerns about confounding (which is also reflected by the models that people usually use to tackle them, mostly under the implicit assumption that whatever biases the measurement is not correlated with some underlying substantive variables).

[6] These should *also* be spelled out explicitly. But I have come around to believe that explicit estimands are the thing we have to tackle first to get anywhere. If I know your estimand but not your assumptions, I can figure them out on my own and there’s little room to argue about that. If I know your assumptions but not your estimand, I can *maybe* figure out your estimand. But reverse-engineering is tedious and error-prone, and the vagueness with which researchers articulate their research questions makes it frustrating—they may always claim that the estimand implied by their analyses and assumptions was not the one they had in mind. Peer review in a quantitative empirical science shouldn’t have to involve that much hermeneutics, and yet here we are.

[7] Examples involve the back-and-forth regarding the whole purpose of the marshmallow test (prediction or explanation?), the endless debate about the age trajectory of happiness (what’s an age trajectory anyway?), and the major confusions that arise in Many Analysts projects. More details and references in section 4.1 here.

[8] To me, it’s most clearly exemplified by how psychologists treat causal inference based on observational data. They will just try so hard to imply causality without ever owning their causal claims.

[9] That results in much more interesting limitations sections than the usual nod to external validity (“Because this study took place in Luxembourg and only included psychology undergraduate students, findings may not generalize to the Global South”).

[10] There’s a parallel phenomenon in the “theory crisis” discussion. People will point out that psychology often lacks rigorous theorizing, which is very true. But then sometimes you look at what those people consider serious theorizing, and it turns out that it’s mostly boxes containing a hodgepodge of variables, haphazardly connected by arrows based on either common sense or flimsy/confounded empirical studies; the central theoretical prediction being that “everything may be connected.” So if I hear somebody talk about how psychology needs more theory, I keep my guard up until I can confirm that this is not the type of theorizing they have in mind.

[11] Unless that’s *really* your thing; this is a kink-shaming free space.

It’s probably fair to say that many psychological researchers are somewhat confused about causal inference. That’s very understandable given the minimal amount of training most of us receive on the topic, but also rather unfortunate given that a big chunk of psychological research is about understanding broad causal patterns.^{[1]}I lifted the phrase “broad causal patterns” from Angela Potochnik’s “Idealization and the Aims of Science” which argues that it is such patterns that are the path to human understanding. Due to the causal complexity of the world, getting there requires idealizations – assumptions that are made without regard for whether they are true (and often in full knowledge that they are false); hence the title of the book.

In this blog post, I want to tackle two related misconceptions that I have encountered. The first one is the notion that whether something is a causal effect or not depends on the specific mechanisms involved. The second one is the notion that when causal effects vary between people, that’s somehow a big issue and invalidates our inferences. To get these out of the way, we will start with some basics—i.e., the big secret of what we actually mean when we talk about causal effects.

Nobody knows for sure. I have read a surprising number of articles in psychology that vaguely point to philosophical problems in the definition of causality and then somehow end up citing David Hume (“There must be a constant union betwixt the cause and effect”).^{[2]}And that’s still the better version of this genre; the worse one starts talking about quantum physics. There are indeed philosophical discussions surrounding the concept of causality, and I’m sure some of these are also relevant to applied researchers. At the same time, it’s probably fair to say that most researchers across most quantitative empirical fields have subscribed to an interventionist framework of causality—sometimes explicitly, in psychology usually implicitly.

In that framework, causal effects are defined with reference to some (sometimes merely hypothetical) intervention. For example (Figure 1), what would it mean for my well-being right now (Y) if I had taken an aspirin this morning (X = 1)? The causal effect of said aspirin on my well-being is then defined as the contrast between my well-being with aspirin (Y^{X=1}) and my well-being without aspirin (Y^{X=0}).^{[3]}In Pearl’s framework, the notion of a hypothetical, surgical intervention is represented by the do()-operator. As I cannot simultaneously take the aspirin (X = 1) *and* not take it (X = 0), we can only observe one of the outcomes involved in the causal effect; the other one remains a so-called counterfactual. Y^{X=0} and Y^{X=1} are referred to as *potential outcomes* because they are, well, the potential outcomes that could be observed depending on the state of the world (aspirin vs. no aspirin).

This is how we define individual-level causal effects. It’s really quite narrow and unassuming; the complications and assumptions all enter because one of the potential outcomes is destined to remain unobserved. This is also referred to as the fundamental problem of causal inference (a phrase attributed to Holland, 1986). The standard “solution” involves declaring that we cannot possibly know the individual-level effects and then trying to come up with some smart solution that still allows us to make certain statements. For example, if we randomly assign people to conditions (aspirin versus no aspirin), the mean difference between the groups should be an unbiased estimator of the average of the individual-level causal effects.
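The logic of this “solution” can be sketched in a few lines of simulation (all numbers hypothetical). Only in a simulation can we generate *both* potential outcomes for every person; random assignment then reveals one per person, and the group mean difference tracks the average of the individual-level effects—even though those effects vary across people.

```python
# Potential-outcomes sketch: simulate both outcomes per person, reveal
# one via randomization, compare the estimate to the true average effect.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

y0 = rng.normal(loc=5.0, size=n)                 # well-being without aspirin
effect = rng.normal(loc=1.0, scale=0.5, size=n)  # effects vary across people
y1 = y0 + effect                                 # well-being with aspirin

true_ate = effect.mean()  # average of the individual-level causal effects

treat = rng.integers(0, 2, size=n).astype(bool)  # random assignment
observed = np.where(treat, y1, y0)               # only one outcome per person

estimate = observed[treat].mean() - observed[~treat].mean()

print(f"true average effect: {true_ate:.2f}")
print(f"randomized estimate: {estimate:.2f}")
```

Note that the estimate recovers the *average* effect; the individual-level effects (and their variability) remain hidden, which is the fundamental problem in action.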

Importantly, if my well-being is better if I take the aspirin rather than not (Y^{X=1} > Y^{X=0}), that’s a causal effect of taking the aspirin—regardless of the specific underlying mechanism. Maybe it’s because the aspirin exerted anti-inflammatory properties, thus biologically reducing some underlying cause of suffering. Maybe it’s because I told myself “now that I have taken an aspirin, I’ll be fine” and subsequently went out and enjoyed the day rather than hiding in the dark bedroom wearing one of those cooling masks that make you look like migraine man, one of the less effective superheroes. There’s a certain black box character to causal effects, but that’s not really a bug. Imagine we always needed to understand the full underlying mechanism to be able to declare a causal effect. The causal chain could be broken down further and further, leading to levels of abstraction that are probably not very conducive to human understanding (“and then this specific precursor of prostaglandin is blocked from reaching this one particular active site of the enzyme, and then…”).^{[4]}*Quantum physics intensifies.*

Still, defining causal effects without regard to mechanisms can lead to some degree of confusion. Consider Sandy Jencks’s thought experiment (recounted by Kathryn Paige Harden in *The Genetic Lottery*) of a world in which a nation refuses to send children with red hair to school. That would constitute a causal effect of red hair on literacy. The intervention here is easy enough to imagine; if for some sick reason we decided to dye a kid’s hair red, that would (causally) decrease their chances of becoming literate. We can even go a step further and say that red hair *genes* causally affect literacy (that’s the original point of the thought experiment). All of this may feel wrong for various reasons.

People’s ideas of causality are sometimes entangled with the notion of blame (“Guns don’t kill people, people kill people”)—and clearly, we should not blame redheads for their bad outcomes, but rather society for its sick ways. So how could we say that red hair is a (or even *the*) cause? And the idea of causal effects of red hair seems to imply certain pathways, even more so when genes are invoked as a cause. It feels like the affected individual should do something that results in the outcome (maybe redheads are intrinsically lazier?), or maybe there should even be some biological explanation (maybe genes that cause red hair also impair brain function?). In any case, for a causal effect of red hair, it seems like something more deterministic and inevitable should be going on than “this weird nation decided that redheads are not allowed to go to school, you won’t believe what happened next.”

But the definition of a causal effect within the interventionist framework is indifferent to all of that. Within the specific population, changing hair color at a young age changes literacy later in life, so it’s a causal effect. The precise mechanisms don’t matter.

This indifference to causal pathways also applies to experiments. Consider the notion of demand effects: In one of the experimental conditions, the experimenter implicitly communicates that the participant ought to behave a certain way, and the participant complies. This may not be the mechanism you had in mind when planning the experiment, but it’s still a causal effect of the experimental condition—had the participant been in a different condition, the experimenter would have communicated something else implicitly, and the participant’s behavior would have been different. Sometimes, experimentalists will refer to such unintended pathways from the experimental condition to the outcome as “confounds.” Fair enough, they do confound conclusions with respect to the effects of the *intended* pathway (Figure 2). However, they do not constitute confounders in another sense; they are not common causes of both the independent and the dependent variable of interest. Instead, they are unwanted mediators.^{[5]}

From the redhead example above, it should already be clear that causal effects are not to be thought of as immutable building blocks of reality, as fixed laws of nature. If the anti-redhead nation stops discriminating based on hair color, or if we look at a neighboring nation that instead discriminates against blondes, the effects of red hair on literacy will look different. These would usually be filed under concerns of generalizability.^{[6]}

But there is really no reason to only think in terms of “tractable” variation that we can explain. Earlier, we discussed the notion of the individual-level causal effect, which already sort of implies that every individual may have their own causal effect. In many ways, such (unexplained) effect heterogeneity is the default assumption in the causal inference literature, and then we try to work around it and somehow estimate some average of such individual-level effects.

I often have the impression that the psychological literature starts from the opposite notion: that causal effects are the same for everyone. If somebody then raises the possibility of heterogeneity (or “omitted moderators”), some people will be like “stop the presses, this changes everything.” Some go so far as to say that the estimated (average) effects are suddenly meaningless because they do not necessarily reflect anybody’s individual-level causal effect.^{[7]} Sure, it would be very *nice* to know everybody’s individual-level causal effect, but for many research questions, that’s simply out of the question due to the fundamental problem of causal inference; so some sort of average is often the best we can get. If we can get that at all.

On a related note, psychological researchers will sometimes hear (or actively teach) that main effects cannot be interpreted when there is an interaction. One version restricts this prohibition to cross-over interactions (in which the effect actually changes sign depending on a third variable), another one says “don’t interpret main effects in the presence of an interaction PERIOD.” From the perspective that effects may *always* be heterogeneous, that prohibition appears a bit puzzling. After all, it may as well be possible that the treatment interacts with something we did not observe, or just happens to have an effect with the opposite sign in some random individuals for inexplicable reasons—would the prohibition extend to those scenarios?

But I think it helps make sense of the prohibition if we consider it within the context of a fully factorial experimental study. Let’s say that A has a positive effect on the outcome when B = 0, and a negative effect when B = 1; these are the conditional effects. What’s the main effect of A? Some average of the two conditional effects; conventionally we may want it to be precisely in the middle between the two. This would be the average effect of A in a population in which B = 0 for half of the people, and B = 1 for the other half. But remember that this is a fully factorial experimental design, so we decide how many people get B = 0 and how many get B = 1. So we could adjust those numbers to get literally any average effect that lies between the two conditional effects; the average effect would be up to us (Figure 3).^{[8]}
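To make that concrete with invented numbers: the “main effect” of A is just a weighted average of the conditional effects, and in a fully factorial design the experimenter picks the weights.

```python
# Hypothetical conditional effects of A in a 2x2 factorial design:
effect_b0 = +1.0  # A helps when B = 0
effect_b1 = -1.0  # A hurts when B = 1

def main_effect_of_a(share_b1: float) -> float:
    """Average effect of A when a fraction share_b1 of units receive B = 1."""
    return (1 - share_b1) * effect_b0 + share_b1 * effect_b1

print(main_effect_of_a(0.50))  # balanced design: 0.0
print(main_effect_of_a(0.25))  # 0.5
print(main_effect_of_a(0.75))  # -0.5
```

Any value between the two conditional effects is attainable simply by shifting the allocation.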

But now imagine that only A is an experimentally manipulated factor, maybe some emergency medication, administered after accidents to stop bleeding. B is a particular genetic mutation. For people without the genetic mutation (B = 0), the medication works just fine. For people with the genetic mutation (B = 1), it actually makes things worse. How many people have B = 0 and how many have B = 1 is completely outside of our control. Let’s say only 0.05% of the population carry the mutation. Now, the average effect comes out positive; if we give somebody the medication, we can expect to help them, potentially saving lives. We may thus recommend usage of the medication in emergency situations, even if we usually won’t know whether the patient belongs to the 0.05% who are hurt. Of course, if we learned that in fact 30% of the population carry the mutation, the average effect would change and a different recommendation may result. But in any case, we need to worry about the average effect in the actual population that we want to treat, which is meaningful and can inform our decisions, despite the presence of a cross-over interaction.^{[9]}
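The same weighted average applies here, except now the weights are fixed by nature rather than by design (effect sizes are made up for illustration):

```python
# Invented effect sizes for the emergency-medication example:
effect_no_mutation = +2.0  # medication helps when B = 0
effect_mutation = -5.0     # medication harms when B = 1

def average_effect(prevalence: float) -> float:
    """Average effect in a population where `prevalence` carry the mutation."""
    return (1 - prevalence) * effect_no_mutation + prevalence * effect_mutation

print(average_effect(0.0005))  # 0.05% carriers: clearly positive
print(average_effect(0.30))    # 30% carriers: negative
```

With these invented numbers, the sign of the average effect (and hence the recommendation) flips once the prevalence exceeds 2/7; the cross-over interaction itself never changes, only the population does.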

Some psychologists have taken the idea that things may be moderated to an extreme. First, if you don’t find an effect, maybe check for moderators; there could be some subtle cross-over interaction that leads to an average effect of zero when there is actually so much exciting stuff happening. Second, if somebody else fails to replicate your study, clearly there must be some hidden moderator that just happens to have the wrong value in the replication study. While both of these things can be plausible in certain scenarios, in combination they lead to a literature filled with unreplicable zombie claims that just cannot be killed. So at this point, we should also acknowledge that (1) cross-over interactions that lead to effect cancellation are probably rare in the wild^{[10]} and (2) if you start invoking hidden moderators, it probably means that you failed to clearly define a target population in the first place.^{[11]}

So: Causal effects are indifferent to the mechanisms that contribute to them.^{[12]} For example, if we estimate the causal effect of gender on income, that will include all sorts of things, including differences in interests and preferences, women dropping out of the labor force after having children, and direct discriminatory behavior by decision makers. If that “feels wrong” to you because some of these things shouldn’t count, it’s likely that you are not interested in the causal effect of gender on income per se, but in some more specific mechanism (maybe you’re just interested in the effect that remains after removing the influence of certain pathways that you deem justified, i.e., a bias). Or maybe you are not interested in the lifetime effects of gender (“What if you had been born a boy?”) but rather in the effects of a more immediate gender change (“What if you suddenly turned into a boy today but kept all of your previous credentials?”). Admittedly, for gender (and for biological sex),^{[13]} the interventions are a bit more hypothetical than for red hair; but if it is possible in Animal Crossing, certainly it’s in the realm of the conceivable.

One area in which people seem to have a particularly hard time differentiating between questions about causal effects and questions about specific mechanisms is genetics. It just seems very tempting to interpret the effects of genes as immutable laws of nature that arise in all contexts, which makes them a lot more contentious than they would be in some alternative universe in which everybody is assigned to read Harden’s “The Genetic Lottery.”^{[14]} Maybe there is something special about the possibility of biological pathways that triggers brain areas responsible for specific deterministic causal narratives. More fMRI studies are needed.^{[15]}

Causal effects may also vary in arbitrary ways. Some of this variability may be tractable, some may be intractable. Either way is fine; it doesn’t mean that average effect estimates are somehow wrong—if all goes well (big if!) they are exactly that, average effect estimates. If it feels like that’s “not enough” to do proper science, it may be helpful to recall that due to the fundamental problem of causal inference, the individual-level effects will often be out of reach. Holland (1986) distinguishes between the “scientific solution” and the “statistical solution” to this problem. In the former, scientists can employ reasonable invariance assumptions. For some laboratory equipment, I may assume that its measured outcome at an earlier time point is equivalent to its potential outcome for the same condition right now (i.e., if I did the same thing to it, the same thing would happen), so if I do something else to it and then something different happens, it’s plausible to conclude that my action had a causal effect. This type of invariance assumption may not always be feasible for living beings,^{[16]} and so we need some other solution. The statistical solution relies on the fact that things can average out to return meaningful answers. Maybe the naming of the two solutions is unfortunate given how researchers sometimes regress into physics envy (“Are you saying that behavioral research does not rely on scientific solutions?”).

But, to put it another way: the scientific solution works if your research object is so well-behaved that it makes causal inference easy. The statistical solution is needed if your research object is trying its very best to give you a hard time, as humans, cats, and other critters are prone to do.

Footnotes

↑1 | I lifted the phrase “broad causal patterns” from Angela Potochnik’s “Idealization and the Aims of Science” which argues that it is such patterns that are the path to human understanding. Due to the causal complexity of the world, getting there requires idealizations – assumptions that are made without regard for whether they are true (and often in full knowledge that they are false); hence the title of the book. |
---|---|

↑2 | And that’s still the better version of this genre, the worse one starts talking about quantum physics. |

↑3 | In Pearl’s framework, the notion of a hypothetical, surgical intervention is represented by the do()-operator. |

↑4 | Quantum physics intensifies. |

↑5 | In the Campbellian validity system, those would be threats to construct validity (but not to internal validity). |

↑6 | Which can also be tackled from within a causal inference framework, see e.g., Deffner et al., 2022. In this particular example, we would be able to supplement our knowledge that the effects of red hair on literacy fully depend on laws that prohibit redheads from attending school. We would then reasonably conclude that the effects only generalize to other nations with such laws. |

↑7 | I have rambled about this before in footnote 2 of this blog post. I know, I know; I’m getting old and repetitive. |

↑8 | Do the main effects in your ANOVA output actually correspond to meaningful average effects? Maybe; it depends on the design and the sum of squares used (Graefe et al., 2022). Frankly, everything that I learn about ANOVA squarely sums up to the conclusion that ANOVA is just too confusing. ENOV already! |

↑9 | When I worked on a manuscript with Arthur Chatton, a biostatistician, I noticed that he seemed to care a lot about target populations. But that makes a lot of sense if you start from the notion that your intervention may even harm some people – any relevant conclusion will depend on how things average out in your population of interest. In contrast, if you think that the effect is about the same for everyone, you don’t really need to care about representing your target population well, and I guess that’s how psychologists usually operate (despite their insistence that everything is super complex and moderated in subtle ways). |

↑10 | Although it looks like they occupied the fantasies of experimental social psychologists of a certain era. |

↑11 | To be fair, most of psychology is bad at that. So maybe we do deserve the endlessly repetitive “hidden moderators” debate for our sins. |

↑12 | This does not imply that mechanisms are indifferent to causality; claims about mechanisms are claims about how things causally unfold in the world. That means that claims about mechanisms come with all the standard causal inference problems, and then some – because they require the successful causal identification of multiple path-specific effects. Sometimes people try to weasel their way out by claiming they are merely “demonstrating that the data are compatible with a theoretically plausible mechanism”; alas, such demonstrations only provide a severe test of the underlying theory if the underlying causal assumptions are plausibly met. |

↑13 | There is a whole literature on the question of whether sex and/or gender can be meaningful causal variables. The funny thing about biological sex in particular is that its effects can be identified quite plausibly, as biological sex seems to be pretty much randomized at conception, turning this into a natural experiment. But if you actually try to define the individual-level causal effect for biological sex, you are comparing “you with the biological sex you actually have” with “you if you had received different chromosomes etc., which arguably might no longer be you but is instead a different person.” |

↑14 | Chapter 5, “A Lottery of Life Chances”, includes a very nice discussion of what causal effects of genes are (and aren’t). There is also an article by Madole and Harden (2022) which elaborates on the matter and which is likely well worth your time. |

↑15 | I am currently reading Sarah Thornton’s “Tits Up” in which at some point Thornton argues that associations between breastfeeding and IQ are not just confounding because “recent MRI scans have revealed a human milk ‘dose-response’ in brain morphology.” That does not seem like a particularly compelling argument from a causal inference perspective – after all, there may as well be common cause confounders between breastfeeding and white matter – but the argument “feels” like it works. |

↑16 | Except maybe for some cases of repeated within-subject experimentation for which it is plausible to assume no carry-over effects. If this happens to be applicable to your research question—knock yourself out, it’s a great design. |

- *p* values (*the* canonical point of contention)
- Bayes factors
- structural equation modeling
- instrumental variable estimation
- propensity score matching/weighting/anything
- point-and-click moderated mediation analysis
- causal inference based on observational data in general
- logistic regression
- linear probability models
- complex systems modeling
- *anything* involving machine learning for psychologists
- multiverse analysis
- preregistration
- etc.

I will use “statistical approaches” as a label for this collection of heterogeneous things. Including preregistration is a bit of a stretch – it is a whole workflow and the people who argue about it are not necessarily methods/stats people – but I think many of the same arguments apply, so bear with me for now.

Does it make sense to say that a statistical approach is good or bad? People certainly do pass judgment on them.

One way to evaluate statistical approaches is to think of them as potential interventions into the world (independent variable) and consider the quality of the resulting scientific inferences (the outcome). If people used Bayes factors, would science be better off?

This is a causal question, and it is an underspecified one.

First of all, causal effects involve contrasts between at least two states of the world, so we need to compare the Bayes factor world to some other world. Here, this would likely be the incumbent – mostly a *p* value world with some confidence intervals thrown into the mix – but unless specified, it could also be some other hypothetical world (e.g., a world in which a lot more people subscribe to the not-so-new new statistics of effect size estimation).

Then, we need to think about the treatment. For now, imagine that we could surgically intervene on practices — at least hypothetically, at some point in time, we can change precisely which statistical approach people use, without changing anything else. For example, we could get people to use propensity score matching in precisely those situations in which they used to use regression adjustment before.

Even with a well-defined treatment, causal effects can be heterogeneous (what works for you and your research needn’t work for everybody else) and we need to consider how we want to deal with that when judging a statistical approach.

For example, researchers’ pre-existing skill level may vary, which may in turn affect whether and to which degree their work is improved if they use statistical approach A rather than B. For researchers at high skill levels, using propensity scores for third variable adjustment may afford them higher flexibility, resulting in better inferences as they can more readily model complex interactions. Researchers at a low skill level may fail to realize those benefits and actually end up making *worse* inferences because the complexity of the procedure distracts them and gives them the impression that something magical was happening, rather than third variable adjustment.

These are conditional causal effects, holding a third variable (researcher skill) constant at a given level (high, low). Evaluating the conditional causal effect at a low skill level is a good way to bash any new and sufficiently complex statistical approach; unskilled researchers will probably botch new and complex things. Evaluating the conditional causal effect at a high skill level is a good way to defend any new and complex statistical approach; skilled researchers will probably apply it well and arrive at good inferences.

Both of these can be true at the same time; which one should we consider more important in our overall judgment? If we’re afraid that bad science crowds out all the good stuff and wastes huge amounts of money, we may be more concerned about what happens at the low skill level. If we believe that actual scientific advancement only happens at the top level and nobody reads the bad stuff anyway, we may be more concerned about what happens at the high-skill level.^{[1]}Adam Mastroianni writes about weak-link versus strong-link problems: Weak-link problems means that the overall quality depends on the quality of the worst stuff; strong link-problems means that the overall quality depends on the quality of the best stuff. He argues that science is a strong-link problem, which would imply that we should mostly care about what happens at the high-skill level. Simine Vazire points out that even if science is a strong-link problem, we’re in trouble if we can’t tell what’s strong vs. weak science.

More “agnostically”, we could simply average across researchers’ pre-existing skills in the population; thus weighting the conditional effects: if there are many more people who will apply this badly, we may end up at a negative effect despite a positive effect among a minority. This is akin to evaluating a marginal effect rather than a conditional effect.^{[2]}In psychology, I sometimes encounter the misconception that when effects are heterogeneous, the average effect is wrong in some meaningful way. It is true that the average effect may not apply to any single individual; just like no single individual in a sample with an average of *M* = 100 may actually precisely score 100, even if the distribution is perfectly normal. But the average effect is still the average of the effects; nobody (I hope) says that the mean is wrong just because it does not always reflect anybody’s actual value. Maybe people are confused because in psychology, the implicit assumption is that the coefficient we estimate is indeed the effect for every single person; when they see evidence that this is not the case, their brain short-circuits. Maybe this has to do with statistical training focusing on factorial experiments, in which researchers can control the distribution of the factors. In case of interactions, the average effect of one factor will be a feature of the distribution of the other factor, which is a feature of the design and can be arbitrarily chosen by the experimenter. In the causal inference literature, the default assumption is that effects vary between individuals. From that angle, effects that average over one thing or another are often the best we can get; individual-level effect estimates would require within-subject experimentation with additional assumptions.
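To illustrate with invented numbers: suppose the approach hurts inferences for a low-skill majority but helps for a high-skill minority. The conditional effects and the marginal effect then tell different, but fully compatible, stories.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical population: 20% high-skill researchers.
high_skill = rng.random(n) < 0.2

# Invented individual-level effects on inference quality:
# helpful for the high-skill minority, mildly harmful otherwise.
effects = np.where(high_skill,
                   rng.normal(2.0, 0.5, n),
                   rng.normal(-0.5, 0.5, n))

print(effects[high_skill].mean())   # conditional effect, high skill: ~ +2
print(effects[~high_skill].mean())  # conditional effect, low skill: ~ -0.5
print(effects.mean())               # marginal effect: ~ 0
```

With this particular skill mix, proponents can quote the first number and opponents the second, while the marginal effect—what you would get if everybody adopted the approach—sits near zero.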

The problem is that we can’t know for sure how skills are distributed in the population.

The average skill level of course lies somewhere below the skill level of you, my dear reader. But where does it lie in absolute terms? If you think that most people struggle to grasp the basics of simple linear regression, you may think that adding structural equation modeling on top will do little to improve things. If you think that people already have a firm grasp on the methods they use, adding complexity and flexibility likely won’t hurt. Furthermore, intervening on statistical approaches may also have effects on subsequent skill level. Maybe if people start using some more complex approach, their skill level slowly catches up, and in the end, inferences get better over time and they lived happily ever after. So we can also evaluate causal effects at different points in time and arrive at different conclusions. An example where the time point of evaluation *could* make a difference is multiverse analysis/specification curve analysis: maybe the paper reporting the analysis itself doesn’t arrive at the best possible inferences (no positive effect on research quality here and now), but the statistical approach could improve researchers’ understanding of the role of various data analytic decisions (e.g., which ones are arbitrary vs which ones change the analysis target; positive effect on research quality some time in the future).

Apart from researcher skill, the effects of any statistical approach may vary by research question. Some statistical approaches already entail a certain research question. For example, if you use a predictive machine learning algorithm, you will address a predictive research question. So we could evaluate the effects of a certain statistical approach conditional on the matching research question: Will the predictive machine learning algorithm improve inferences when a predictive research question is of interest? Unfortunately, that’s not always how approaches are adopted. For example, researchers trying to address causal research questions may also try to apply the predictive machine learning algorithm, even though it’s not what they’re actually interested in. For those scenarios, the evaluation may come out less positive. Opponents of an approach may naturally focus on the effect of the statistical approach conditional on people applying it as mindlessly as possible; proponents may naturally focus on people applying it precisely where it is advantageous over the status quo. In practice, we are guaranteed to get a mix with some applications that are very sound and others that are very stupid and should somehow average over that.

Here again, things may change over time because the statistical approach affects the research questions people ask. I am fully convinced that this happens; if you’ve used predictive models to address causal questions for long enough, it may be easier to simply change your research questions to predictive ones. Whether that’s good or bad is a question for another blog post.

In reality, we cannot surgically intervene on research practices. We can at the very best suggest what people ought to do (maybe a bit forcefully during the review process) and decide what to pass on to the coming generations of researchers through teaching. These complex interventions will often deliberately target researchers’ skills as well — if your whole point is getting people to use approach X through this intensive tutorial, then evaluating the effect of X at the pre-intervention skill level seems moot. Because you don’t want people to apply X at their pre-intervention skill level, you want them to take the whole intervention and develop the skills to use X as well as possible. These complex interventions will also affect research questions. For example, I have written a lot about causal inference based on observational data; this may be taken as active encouragement for people to switch to causal research questions based on observational data, for which I have been criticized. That criticism has a point; I am certain some very bad studies drawing causal conclusions based on observational data were inspired by my work. (Of course, this itself is causal inference on observational data, so I feel vindicated regardless. Let’s hope my *average* effect on research quality is positive.)

The effects of the full-blown complex intervention will exert themselves through more convoluted paths. For example, if approach X produces particularly sexy results, it may in the short term be taken up selectively by researchers who don’t care much for rigorous implementation. In the long term, that may enable those researchers’ careers to thrive, leaving less space for researchers who do better work, thus shifting the pool of researchers in the field toward a lower level of skills. That could, in principle, be true even if the hypothetical surgical intervention in which everyone in the original population adopts X would have had a positive effect on research quality. Additionally, many things will come with opportunity costs. The time you spend writing down a preregistration is time that you cannot spend improving your research design in some other way; and so for some people,^{[3]}who would never *p*-hack and who report everything transparently anyway, preregistration may not only be “redundant at best” but may actively prevent them from realizing the best possible research they can. Whether that is enough to render the average effect of preregistration negative is a different question.

So we are in a situation in which we can evaluate a statistical approach in many different ways that can result in seemingly contradictory but fully compatible answers. On top of that, any evaluation will involve a number of unknown variables. What do researchers’ skills actually look like? Which research questions are they trying to tackle? We can only guess the answers based on information that is available to us, which will be a mix of our reading of the published literature and local information (“Everybody in my department is doing median splits, send help!”). The latter is very likely to be biased because we are not surrounded by a representative sample of our field of research. The former is also biased because the published literature is not representative of the average research that is being conducted; publication itself acts as a distorting selection filter. And how will things play out in practice, who will pick up what, and how? Again we can only guess based on our experience with how the world and people work.

I thus think it’s inevitable that people will disagree about the merits of various statistical approaches, even if there is high agreement about all facts that can be verified. People put higher weights on different types of effects (“sure, this would work if everybody was a saint, *however*…”) and have different guesstimates of both the status quo (e.g., researchers’ skills) and how things will work out in the long run (can researchers simply ignore misapplications of this approach when the resulting plots look *so nice*?).

There are some edge cases for which it’s easy to actually average over a lot of factors and still come out with a pretty solid assessment. If the math underlying a certain approach is just wrong in a meaningful way, it’s very unlikely that the net effect is positive. If the approach tackles a research question that elicits neither interest nor understanding, maybe just don’t (looking at you, MANOVA, keep on burnin). If the data requirements for valid inferences just won’t be met anytime soon and if things go haywire in unpredictable ways whenever requirements aren’t met, maybe just don’t.^{[4]}“Error-free measurements of all variables contained in the system? At this time of year? At this time of day? In this part of the country, localized entirely within your dataset?”

But oftentimes there will be a lot of room for disagreement. And so much uncertainty at every step that I’m having a hard time coming down with strong opinions myself.

That said, I still often enjoy watching from the sidelines. Things get a bit repetitive as one ages; I personally don’t need another season of Bayesians vs. Frequentists.^{[5]}They should have ended after Reverend Bayes woke up to realize that he had not in fact mathematically proven the existence of God, but instead been in a coma, during which the love of his life eloped with a Frequentist. But somebody else may still profit. Because whenever I encounter a debate for the first time, I feel like there’s something to learn. Sometimes, it’s something about the relative strengths and pitfalls of various statistical approaches and how things can be botched at various skill levels. Sometimes, it’s something about people’s underlying expectations of how other researchers behave. Sometimes, it’s about underlying values, such as whether or not statisticians ought to be held accountable for misapplications of stats. Sometimes, it’s even about people’s beliefs about the role of science in society.

Sometimes there is educational value, sometimes there is at least entertainment value. But I can’t get very passionate myself.

…UNLESS somebody says something against defining estimands and spelling out assumptions. If you want to do that, how about we go outside and settle this like emotionally stunted men.

Footnotes

↑1 | Adam Mastroianni writes about weak-link versus strong-link problems: weak-link problems mean that the overall quality depends on the quality of the worst stuff; strong-link problems mean that the overall quality depends on the quality of the best stuff. He argues that science is a strong-link problem, which would imply that we should mostly care about what happens at the high-skill level. Simine Vazire points out that even if science is a strong-link problem, we’re in trouble if we can’t tell what’s strong vs. weak science. |
---|---|

↑2 | In psychology, I sometimes encounter the misconception that when effects are heterogeneous, the average effect is wrong in some meaningful way. It is true that the average effect may not apply to any single individual; just like no single individual in a sample with an average of M = 100 may actually precisely score 100, even if the distribution is perfectly normal. But the average effect is still the average of the effects; nobody (I hope) says that the mean is wrong just because it does not always reflect anybody’s actual value. Maybe people are confused because in psychology, the implicit assumption is that the coefficient we estimate is indeed the effect for every single person; when they see evidence that this is not the case, their brain short-circuits. Maybe this has to do with statistical training focusing on factorial experiments, in which researchers can control the distribution of the factors. In case of interactions, the average effect of one factor will be a feature of the distribution of the other factor, which is a feature of the design and can be arbitrarily chosen by the experimenter. In the causal inference literature, the default assumption is that effects vary between individuals. From that angle, effects that average over one thing or another are often the best we can get; individual-level effect estimates would require within-subject experimentation with additional assumptions. |


The piece was short and opinionated and we reported only a few graphs^{[1]}one of which I had to correct. Together with Farid Anvari (who took the lead) and Lorenz Oehler, we followed up on this piece with a preprint in which we take a more in-depth look at measure and construct proliferation in psychological science. In it, we break things up by subdiscipline and reflect on what exactly these patterns can tell us.

Just as we were wrapping up the preprint, Communications Psychology published an opinion piece by Iliescu et al. (2024) that is sort of but not really a reply to our Toothbrushes paper. The authors argue, as they write in their title, that “Proliferation of measures contributes to advancing psychological science”.

Now, to be honest, we were very tempted to turn this into an epic^{[2]}academic battle^{[3]}armchair back-and-forth that’s less about being right and more about who gets the last word. We may even have created an AI-written, AI-vocalised rap battle. However, on closer inspection, the steam kind of dissipated. We now suspect we agree more than the exchange (or a rap battle) would lead a cursory reader to think. We decided to write this blog anyway because if there is one thing that advances psychological science, it’s the proliferation of comments and replies. So, some quick clarifications are in order. We hope that the preprint also helps to clarify how we view the problem; it gave us more room for nuance than the brief commentary did.

As is so often the case in psychology, some of the disagreement may be coming from an unclear definition of the concepts involved. Some of the authors’ responses can be read as if they understood us as saying that the creation of new measures should be curtailed. For example, they say that “We suggest that proliferation of measures is not per se a negative phenomenon, but strongly depends on how it is situated and that it can be bound into the very fabric of how psychological science develops.”

We do not disagree that growth in the number of available measures can be good. However, by *proliferation* we mean excessive, unchecked growth. We especially find the “unchecked” part problematic — many measures are not pretested, subjected to stringent construct validation, checked for redundancy, or in fact ever used again outside of the primary study for which they were developed. We created the SOBER guidelines precisely because we want *checked* growth, not because we want measurement development to cease.

The key problem we’d therefore like to solve is fragmentation, which we operationalize using normalized Shannon entropy (“closer to 0%, a few measures dominate; closer to 100%, measures are used an equal number of times across publications”). We argue that as long as measures are created and discarded rapidly, without being integrated into a broader body of work through construct validation etc., we will see fragmentation. We are absolutely fine with new measures being created if they meet quality standards (e.g., our SOBER guidelines).
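For readers who want the operationalization spelled out: here is a minimal sketch in Python of normalized Shannon entropy over measure-usage counts (the counts below are invented for illustration).

```python
import numpy as np

def normalized_entropy(counts):
    """Shannon entropy of usage counts, scaled to [0, 1] by dividing by log(K)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -(p * np.log(p)).sum() / np.log(len(counts))

# One measure dominates the literature -> entropy well below 1
dominated = normalized_entropy([91, 1, 1, 1, 1, 1, 1, 1, 1, 1])
# All ten measures used equally often -> maximal entropy of 1
even = normalized_entropy([10] * 10)
print(dominated, even)
```

The normalization by log(K) is what lets us compare rubrics with different numbers of measures on a common 0–100% scale.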

We suspect one of the key things causing disagreement is the area of focus. Some of the authors are or were editors of measurement-focused journals. They see researchers at the cutting edge who invest their time to push the envelope and improve measurement in psychology, authors who are on top of how best practices evolve in their field. But a lot of measures are developed outside of assessment-focused journals, with comparatively little effort, in the course of primary research. Few of them may be memorable, but collectively they are many. One of the key points illustrated by the data in our Toothbrushes paper, the new preprint and of course Flake & Fried’s schmeasurement work is that developing bespoke measures as a sort of afterthought in primary research appears to be common: most measures are used just once or only a few times and researchers are often not transparent about their procedure and adjustments.

We think operationalising and quantifying fragmentation is very helpful because, clearly, informed readers can disagree. For example, Iliescu et al. write

“In fact, one could argue that there are few (if any) psychological constructs that exhibit as much measurement proliferation as [general mental ability (GMA)] showing that sometimes theory and validity development go hand in hand with the proliferation of measures.”

Iliescu et al. 2024

This was surprising to us. We *agree* that the GMA/intelligence literature is a good literature to emulate with respect to measurement. Research validating novel measures against existing measures is common, finds convergence (latent correlations approach 1), and the effort going into measure development is high. Most people do not create an intelligence test ad hoc, in fact we doubt there exist tests of general mental ability that have been used only once or twice. We agree this is a comparatively good field.

But we think things are worse for many or most other constructs and measures. According to our operationalisation of fragmentation, it is not true that there are “few (if any) psychological constructs that exhibit as much measurement proliferation”. We have data to support our position. In APA PsycTests, tests are classified in broad rubrics. With a normalised entropy of .51 for *aptitude and achievement* and of .53 for *intelligence*, these two classifications are in the bottom five out of 31 classifications. So other classifications, like *Personality* with .69 or *Organizational, Occupational, and Career Development* with .79, clearly have higher normalised entropy. Now, maybe the authors think our operationalisation of fragmentation is incorrect, or they consider the comparison unfair in other ways. But with a precise operationalisation and quantification, we can discuss this and maybe come to an agreement.

We don’t rely exclusively on the measure of fragmentation to make the case that proliferation occurs. We also point to the fact that there are thousands of measures and constructs published every single year and that the vast majority of these are used only a handful of times. And the thousands of measures and constructs recorded by the APA in their databases are only a part of the full number of constructs and measures that are published, since the APA’s recording standards involve recording only measures that are not “marginal”. APA’s recording also does not take into account the sort of flexibility in measurement that focused reviews of the literature often unearth.

While we agree with the authors of the commentary on some cases of desirable growth, we also see evidence for unchecked, excessive growth: i.e., proliferation. Some of the authors of the reply seem to agree with this.

So, maybe there is more agreement about the state of specific fields, but we differ in what measures we are commonly exposed to and hence what we focused on in our recommendations and comparisons. Some of the authors of the reply work in cognitive or clinical psychology. According to our new preprint, these subdisciplines are comparatively less fragmented than personality and social psychology or than I/O psychology, our respective backgrounds. A good way to see this is to look at the treemap plots. I/O looks a lot more fragmented than clinical (see all treemaps here, but give it some time to load).

In summary, we do not want to prevent the development of new measures. We do indeed want to raise the bar for releasing new measures into the wild and for modifying existing measures. However, our hope is that a higher bar will lead teams to pool their efforts, so that we get a few good measures and well-validated revisions and translations rather than hundreds of poorly validated measures that are hardly comparable.

Our vision is one of open, living standards that evolve over time at internet speed, like HTML, not one of copyrighted, crusty standards that cannot evolve until all the subcommittees and sub-subcommittees have found a way to meet, like the DSM. We are very interested in discussing how to best achieve this vision, what role bottom-up and top-down processes have to play, how to avoid the failure mode of proliferating standards, and in hearing about the experiences of those involved in related efforts like HiTOP.


Measurement invariance is usually defined as a statistical property that shows that a measure (e.g., an ability test, or a personality questionnaire) measures the same construct in different groups (e.g., in North Americans and South Africans; in young people and old people). Another way to put it is that the measure works the same way in each group. In either case, when a measure is invariant, we can use it to compare the construct between the groups (e.g., to test whether there are personality differences between young people and old people). Or so they say.^{[2]}John Protzko told me he received pushback for saying that people say measurement invariance implies that the measure measures the same thing in different groups. So, in his preprint on measurement invariance, he now reviews the evidence that measurement invariance is commonly interpreted this way–which it is.

This always sounded like magic to me. I’m sometimes uncertain whether a given measure measures anything at all – how could a statistical procedure be so much smarter and determine whether the *same* potentially ineffable thing is being measured in different groups, just based on some correlations and mean values? It just strains credulity.

In the following, I will provide a brief conceptual introduction to measurement invariance, but I will also try to convince you of two things. First, bad news: tests of measurement invariance indeed cannot achieve magical things. Second, good news: measurement invariance is much more interesting than the literature pretends. Measurement invariance often appears like a mandatory statistical bar that needs to be cleared before interesting substantive statements can be made. This mindset renders a lack of invariance (i.e., variance) a tragedy to be avoided at all cost because it puts a premature end to a substantive project. But in fact, it is a lack of measurement invariance that tells us there may be something *substantively* interesting going on. As always, the secret ingredient will be a causal perspective.

Configural invariance means that the “form” of the models is the same in the groups of interest. Form entails both the number of latent variables and whether the loadings are non-zero to begin with. For example, let’s say we were interested in whether an extraversion measure consisting of five items is “configurally invariant” for young and old people (Fig. 1). What we would do to check this is run a one-dimensional factor model in the young group, and another one in the old group. If the one-dimensional model fits in both groups, and no item belongs to the construct in one group but not the other one, we could claim configural invariance. Or, more correctly, we could not reject the null hypothesis of configural invariance.

Configural invariance is mostly like a weaker form of metric invariance, which is why we are going to discuss the interesting substantive stuff in the next section.

Metric invariance means that for each item, the loading of the factor on the item is the same in the two groups (or, again more precisely, that we cannot reject the null hypothesis that the loadings are the same), see Fig. 2. So, for example, the loading of extraversion on “I am the life of the party” needs to be the same for young people as for old people. The standard way to test this is to run a multi-group structural equation model (SEM) and then constrain the loadings to be equal across groups and check how the model fit changes; if it gets a lot worse, we would reject the equality of loadings and thus reject metric invariance.^{[3]}A little bit worse may be okay. If in doubt, check with your local statistician or consult the literature on the ongoing debates about appropriate cut-offs for measurement invariance testing.
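To make the logic tangible, here is a small simulation sketch (Python; all loadings and sample sizes are invented). In a one-factor model with unit factor variance, the implied covariance between items j and k is λ_j·λ_k, so an item’s loading can be recovered from covariance “triads.” If item 5’s loading differs between groups, that difference shows up in exactly the covariances that the equal-loadings constraint restricts — which is the signal the chi-square difference test picks up.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, loadings):
    """Item responses from a one-factor model with unit factor variance."""
    eta = rng.normal(size=n)  # latent factor scores
    eps = rng.normal(scale=0.6, size=(n, len(loadings)))
    return eta[:, None] * loadings + eps

# Hypothetical loadings: item 5 works for the young but barely for the old
young = simulate(50_000, np.array([0.9, 0.8, 0.7, 0.8, 0.7]))
old   = simulate(50_000, np.array([0.9, 0.8, 0.7, 0.8, 0.1]))

def item5_loading(X):
    c = np.cov(X, rowvar=False)
    # cov(j, k) = λj·λk implies λ5 = sqrt(c54·c53 / c43)  (0-indexed below)
    return np.sqrt(c[4, 3] * c[4, 2] / c[3, 2])

lam_young, lam_old = item5_loading(young), item5_loading(old)
print(lam_young, lam_old)  # roughly 0.7 vs. 0.1
```

A real multi-group SEM estimates all loadings jointly by maximum likelihood rather than via triads, but the underlying information is the same: the covariance pattern involving the “variant” item cannot be reproduced with one shared loading.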

What does that mean, though? Let’s take a step back. In structural equation modeling, we are trying to explain the data through a model. The data are our observed item responses; in particular the associations between them.^{[4]}More technically, the data that are being explained in structural equation models are usually the elements of the variance-covariance matrix (a matrix that for each item contains its variance and its covariance with all other items; think correlation matrix, but unstandardized). Thus, SEM results can often be fully reproduced from the variance-covariance matrix alone. Sometimes, the mean structure is also modeled (e.g., when testing scalar invariance, see later section). The model is a latent variable model in which some (unobservable) construct affects the item responses, and the magnitude of these causal effects is quantified by the corresponding loadings. So, this is a causal model. Measurement invariance means that the same causal model can explain the data in all groups.
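To see what “explaining the data” means in the covariance-matrix sense, here is a sketch with hypothetical parameter values: a one-factor model implies the covariance matrix Σ = λλ′ + diag(θ), and a large sample generated from that model reproduces Σ closely. Measurement invariance then amounts to the same λ (and, depending on the level tested, intercepts and θ) doing this job in every group.

```python
import numpy as np

lam   = np.array([0.9, 0.8, 0.7, 0.8, 0.7])  # factor loadings (hypothetical)
theta = np.full(5, 0.36)                     # residual variances (0.6 ** 2)

# Model-implied covariance matrix of the five items: Sigma = lam lam' + diag(theta)
sigma = np.outer(lam, lam) + np.diag(theta)

# A large sample generated from the same model reproduces Sigma closely
rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)[:, None] * lam + rng.normal(scale=0.6, size=(n, 5))
max_err = np.abs(np.cov(X, rowvar=False) - sigma).max()
print(max_err)  # near zero
```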

Before we get to the galaxy-brain interpretation of this, let’s talk about a corollary: when we are testing for measurement invariance, we are testing causal models based on observational data. Any factor model is a causal model based on observational data. Oops-a-daisy! The results are always conditional on causal assumptions; here, the assumed factor model. Without those assumptions, we don’t get anywhere because the observed associations between the items could be explained by an infinitude of alternative data-generating mechanisms.^{[5]}Here I am subscribing to a realist interpretation of factor models, in which the arrows are arrows and not a shorthand for “an association but *somehow* also directed.” The realist stance is forwarded in older writings by Borsboom, Mellenbergh and van Heerden (2003, 2004). A lot of researchers are probably more comfortable subscribing to a pragmatist interpretation in which a factor model is just a good way of summarizing a large number of items. I am not sure such an interpretation is 100% consistent. First of all, when calculating a factor score, items with higher loadings will have a bigger impact. Statistically speaking, those will be the items that have higher correlations with other variables. But why would we want to give higher weight to those items when the goal is to simply summarize the variables? Giving equal weights to the items (aka, just taking the average) seems more justifiable if we are completely agnostic to what caused the data (and thus also about how error may have affected values). Second, once we try to do anything else with those latent variables, inferences can end up badly biased if the true data generating mechanism was a different one (Rhemtulla et al., 2020). Unless somebody convinces me otherwise (come at me pragmabro), I think the pragmatist interpretation is mostly a cope to avoid facing that much of empirical psychology rests on very strong assumptions. 
That puts “I’m just using this factor model for pragmatic reasons” into the same category as “oh no I do not want to make causal claims I am just trying to do prediction” (see footnote 12) and “non-causal mediation analysis” (which isn’t a thing; mediation is statistically indistinguishable from confounding–the difference lies in causal assumptions).

So saying that measurement invariance establishes that the same thing is measured in different groups is a bit of a stretch. It’s more that measurement invariance tests whether certain things are the same across groups, conditional on the assumption that the latent variable model holds.

But let’s get to the galaxy-brain take, which was once revealed to me in a ~~dream~~ manuscript by Denny Borsboom.^{[6]}And then shortly afterwards I stumbled across it again in Paulewicz and Blaut’s paper on the general causal cumulative model of ordinal response. Baader-Meinhof, much? Measurement invariance implies that all effects of the group variable on the items are fully mediated by the latent construct.^{[7]}More generally speaking (and still keeping things casual), measurement invariance implies that all group differences in the items “go through” the latent construct. So, measurement invariance would also be violated if there is a common cause that confounds the group variable with the items. In such a scenario, the measure would be non-invariant, but it is a type of non-invariance that can be “explained away” *if* we additionally observe and model the common cause (which would allow us to arrive at unbiased conclusions about group differences in the latent construct, despite non-invariance). After posting the blog post, Borysław Paulewicz kindly reached out to me and patiently explained what a complete, causal definition of measurement (non-)invariance has to look like. You can find it on page 14 of his manuscript on the general causal cumulative model of ordinal response. It says that a response R (in our case, the observed items) is a biased measure of the latent construct T with respect to some manifest variable V (in our case, the group variable) if there exists an active causal path (one without a collider) between the response R and the manifest variable V such that the latent construct T is not a node on this path.

Let that sink in (Fig. 3).^{[8]}I have collapsed the multiple items into a single “observed measures” node because that looks more elegant (in DAGs, a single node for multi-dimensional variables is fair game), and also because we can think about measurement (in)variance in scenarios in which there is only a single item or outcome. We wouldn’t be able to test it with the standard procedures, but that doesn’t mean that it couldn’t be a potential concern if we have some indication that the single item could function differently in groups.

In the scenario with measurement invariance, all the group differences in the observed measure can be “explained away” by group differences in the latent construct. That also means that if we do know the group differences in the latent construct (Group → Latent construct), we essentially already know what’s happening on the item level. To figure out the precise group differences in the items, we would still additionally need the measurement model (Latent construct → Observed measures); this would for example inform us that the latent construct has only weak effects on a particular item, which in turn tells us that the group difference on that item will be smaller. But anyway, in this scenario, the group differences on the latent construct are a good summary of the group differences in the observed items and we might as well skip reporting the group differences for each single item.

That is not the case in the scenario in which there is no measurement invariance. Here as well, group differences in the latent construct will lead to group differences in the items, but there are also group differences in the items that arise through some other mechanism. Thus, the group differences in the latent construct no longer give us the full story of what is going on on the item level.

Let’s tie that back to metric invariance and our extraversion example, Fig. 4. Among very young people (who cannot yet decide on their own whether they can attend parties) or very old people (for whom party attendance is no longer normative), reporting to be “the life of the party” may be less reflective of extraversion. This means that the effects of extraversion on item responses are modified by age, which could be depicted by an arrow pointing at the loading. Alternatively,^{[9]}If you don’t want to make Richard McElreath unhappy in DAG style, you could just draw an arrow from age to the item; because everything is fully non-parametric in a DAG, all variables that jointly affect another one may interact in any conceivable way. In any case, just looking at the Age → Extraversion part wouldn’t tell us the full story about what is going on with the “variant” Item No. 5.

Probably the most intuitive examples for metric non-invariance are cases in which the effects of the factor on at least one item will vary dramatically between groups, to the point that it ceases to be a good item in one group. On Bluesky, Ed Hagen shared a wonderful example: an intelligence test item in which kids are supposed to identify what is missing from a picture. The picture shows two ice skaters, and on one of the ice skates a blade is missing. That’s easy enough to spot if you know anything about ice skating. However, the test was used to measure intelligence in South African children. That’s probably not the best item to measure intelligence in a context in which ice skating likely is a much more niche hobby than in North America.

Scalar invariance means that for each item, the intercept is the same. The statistical way to put this is a bit opaque; it means that when the observed scores are regressed on each factor, the intercepts are equal across groups. This means that group differences in the item responses are fully accounted for by group differences in the latent construct.

From a causal perspective, scalar invariance is violated if the group variable affects the item scores through some mechanism that is not the latent construct. Now, above we already said that this is the case when metric invariance is violated, but the causal story behind the violation is a slightly different one. Above, the group variable affected the magnitude of the effect of the construct on the item (which led to differences in the loadings). One could also say that when metric invariance is violated, there’s an interaction effect that we don’t want (group variable x construct → items). Here, considering violations of scalar invariance, the group variable “directly” affects the item responses. One could also say that there is a main effect that we don’t want (group variable → items).^{[10]}If you’re more of a dag person: In a proper DAG, violations of metric invariance and violations of scalar invariance would look the same because all variables that jointly affect another one are allowed to interact by default without any need for arrows on arrows.

For example (Fig. 5), regardless of whatever effect age has on extraversion, age may additionally make respondents feel less vital and thus less likely to report that “lively” describes them well. In such a scenario, there will be age differences in the variant item that are not explained by age differences in extraversion.
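A minimal simulation sketch of this scenario (Python; all numbers invented): both age groups have the same latent extraversion distribution and the same loading, but age lowers the “lively” response directly. The item means then differ even though the latent means do not — a group difference that latent extraversion cannot explain away.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 100_000, 0.7

# Same latent extraversion distribution (mean 0) in both age groups
eta_young = rng.normal(size=n)
eta_old   = rng.normal(size=n)

# Direct age effect on the item: the intercept drops by 0.5 for the old group
lively_young = 0.0 + lam * eta_young + rng.normal(scale=0.6, size=n)
lively_old   = -0.5 + lam * eta_old  + rng.normal(scale=0.6, size=n)

mean_gap   = lively_young.mean() - lively_old.mean()   # about 0.5
latent_gap = eta_young.mean() - eta_old.mean()         # about 0
print(mean_gap, latent_gap)
```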

That’s what we found in an analysis of Australian data: “Lively” goes down quite a bit with age (Fig. 6), and this isn’t just age differences in extraversion (Seifert et al., 2023). There is something else that makes people feel less lively as they age, and it kicks in early. Now that I think about it, maybe this depressing finding isn’t the best way to sell our paper.

Anyway, we were interested in age effects on self-reported personality and analyzed longitudinal panel data from Australia, Germany and the Netherlands. In general, metric invariance actually seemed unproblematic in those data; the data were mostly compatible with a model in which the effects of the personality traits on the item responses are the same for people at different ages (equal loadings). But things looked much worse for scalar invariance.^{[11]}To test measurement invariance over age, without artificially splitting people into age groups, we applied local structural equation modeling (LSEM). Think of it as multigroup SEM across individual years of age, but estimated such that parameters change smoothly with age rather than bouncing around. This, in turn, opens up the possibility that for individual items, the age effects look quite different than for the factor. So we also looked at the age trajectories of the individual items to see which items diverge. That may seem like an anticlimactic “solution” to measurement invariance; it doesn’t involve any sophisticated modeling. But from my perspective, it’s true to the idea that if measurement invariance is violated, the latent variable does not tell the whole story that is to be told on the item level. So, why not tell the story on the item level? That may not always be an option in small data sets, but for us it was.

What other ways forward could there be?

Regarding liveliness—if we want to insist that all age effects on such items must be mediated through latent constructs—one solution would be to consider that extraversion may consist of multiple subfactors with different age trajectories. The Big Five Inventory 2 (BFI-2) splits up extraversion into three dimensions, sociability, assertiveness, and energy level. “Lively” belongs to energy level which likely shows the most pronounced age decline; “Shy”, “Extroverted”, and “Talkative” likely belong to sociability which we would expect to show less pronounced age effects. In such a manner, measurement invariance could be restored once the set of included items had been expanded, and we could once again start working with the latent variable.

Another way to more immediately restore measurement invariance would be to kick out the offending item. That may of course shift the meaning of the latent factor away from what was originally intended and may reduce comparability across studies, but I’m sure this could sometimes be a sensible solution–if implemented transparently. Readers should be fully aware of which items end up in the model that produces the final results, but that of course holds for all latent variable models. (#AlwaysLookAtTheItems)

Another approach aims at establishing “partial invariance”, which relies on some items being assumed to be invariant (thus providing an anchor for comparisons), while others are allowed to “function differentially.” And there is *a lot* more stuff being discussed in the psychological methods community–see for example Robitzsch and Lüdtke (2022) who argue that measurement invariance is not a prerequisite for meaningful and valid group comparisons.

I feel like in psychology, there’s a tendency to only consider the measurement model itself, as if it were some free-floating thing, and then go on and tinker with it. But, for example, there is no reason to insist that all effects on personality items are mediated through invariant personality constructs. So, alternatively, we could try to model additional pathways from age to “Lively”; maybe the noninvariant intercepts can partly be explained by adding some measure of physical health that provides an alternative pathway from age to liveliness. This step requires some degree of domain knowledge, which is always the case when it comes to causality.

Residual invariance means that for each item, the residual variance—the variance of the ominous E pointing into the items—is the same. We can again phrase this statistically: if we regressed the item scores on the factor, the variance of the remaining residual would be the same in the groups (i.e., there would be homoscedasticity). If this is violated, multiple causal stories could be at play. For example, with increasing age, the effect of the latent construct on the item responses could remain the same on average but become more uncertain, leading to more variability in the residuals at higher ages. But it could also be possible that, with increasing age, the effects of some variables unrelated to the construct of interest become more pronounced, or that new factors start to affect the item response and thus induce additional residual variance. The thing about the residual is that it captures everything that’s not explained in the model, and explaining changes in the amount of unexplained things seems a bit futile. Residual invariance is often not tested because it’s not necessary for latent mean comparisons. It’s a bit of an anticlimactic level to end on.
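As a quick illustration of what such a violation looks like statistically, one can simulate two groups with an identical loading but different residual variances; regressing the item on the factor within each group then reveals the heteroscedasticity (all numbers here are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
factor = rng.standard_normal(n)

# Same loading (.7) in both groups, but group B gets extra item-specific noise
# (e.g., additional influences on the item at higher ages) -- illustrative values
item_a = 0.7 * factor + 0.5 * rng.standard_normal(n)  # group A: residual SD = .5
item_b = 0.7 * factor + 0.9 * rng.standard_normal(n)  # group B: residual SD = .9

def resid_var(item, f):
    """Variance left over after regressing the item on the factor."""
    slope = np.cov(item, f)[0, 1] / np.var(f, ddof=1)
    return np.var(item - slope * f, ddof=1)

va, vb = resid_var(item_a, factor), resid_var(item_b, factor)
print(round(va, 2), round(vb, 2))  # group B shows the larger residual variance
```

With residual invariance, the two printed numbers would agree up to sampling error; here group B's residual variance is more than three times as large.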

Sometimes, people will bring up measurement invariance when they have a particular mechanism in mind that may render item responses incomparable between groups. To figure out what will be flagged as a violation of measurement invariance, it helps to draw the causal graph one has in mind.

For example, consider a hypothetical scenario in which teenage girls today report higher levels of mental distress than teenage girls ten years ago. Somebody may raise the issue that nowadays mental health is discussed a lot more, and so teenage girls today may simply be more sensitive to symptoms, giving higher responses to items asking about mental distress even when the underlying level of mental distress is the same. That would render item responses incomparable in a certain sense, and one might thus suspect that this would be revealed once we test for measurement invariance.

But consider the graph in Fig. 7, in which “girls these days” are indeed reporting different levels of mental distress at least partly because of differences in sensitivity. But this actually does not violate measurement invariance (as tested in tests of measurement invariance), because all group differences are still mediated by the latent variable that is being modeled, reported mental distress: Teenage girls today may have different levels of true mental distress which leads to reported mental distress which leads to higher item responses (the “good” path that we would want to target); teenage girls today may have different levels of sensitivity to mental distress which leads to reported mental distress which leads to higher item responses (the “bad” path that some would consider spurious).

So, this is a measurement concern that would not result in a statistically detectable violation of measurement invariance. You *are* measuring “the same thing” (reported mental distress), but it is not *exactly* the thing you are trying to get at (“true” mental distress). The flip side of this is that it may look like your measure is invariant, but that doesn’t mean that the intended inferences are automatically valid.

Going a step further, measurement invariance may also occur…without any meaningful measurement. John Protzko has a study in which respondents filled out either the search for meaning in life [sub]scale, or the very same scale with any mention of meaning or purpose replaced by the (meaningless) term “gavagai.” Taking the questionnaire version as group variable, search for meaning in life and search for meaning in gavagai seemed to be invariant (equal form, equal loadings, and equal intercepts couldn’t be rejected). He concludes that measurement invariance is at best a necessary condition for showing evidence that one is measuring the same thing in different populations, given that it may hold even when nothing meaningful is being measured in one group. I find this quite comforting: we humans may sometimes be unable to tell whether a scale measures anything at all, but at least measurement invariance doesn’t know better either. So I concur with John’s conclusion to the extent that I am willing to simply copy it to conclude this content.^{[12]}Crémieux kindly reached out to me to inform me that actually, John’s analysis may simply be missing violations of measurement invariance that are clearly there when the data are analyzed correctly. Jordan Lasker makes this point in a preprint and provides a re-analysis. And here I thought I had found a nice way to wrap things up with an empirical finding…

Statistically, when we test measurement invariance, we test whether certain parameters are equal across groups within factor models. From a causal perspective, measurement invariance means that all effects of the group variable on the observed measures are mediated by the latent factor. The whole reasoning is conditional on the assumption that the latent variable model holds to begin with.
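The statistical side of this summary can be written compactly. In standard LISREL-style notation (my choice, not the post's), the nested levels of invariance amount to equality constraints across groups g:

```latex
% Measurement model for the item vector y of person i in group g:
y_{ig} = \nu_g + \Lambda_g \eta_i + \varepsilon_{ig},
  \qquad \operatorname{Cov}(\varepsilon_{ig}) = \Theta_g

% Configural invariance: same pattern of zero/nonzero loadings in every \Lambda_g
% Metric invariance (equal loadings):        \Lambda_g = \Lambda
% Scalar invariance (equal intercepts):      additionally \nu_g = \nu
% Residual invariance (strict invariance):   additionally \Theta_g = \Theta
```

The causal reading is that the group variable reaches the items only through η; a direct arrow from the group to an item shows up as a violated constraint at the corresponding level.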

Philipp Sterner, Florian Pargent, Dominik Deffner, and David Goretzko just dropped a cool preprint titled “A Causal Framework for the Comparability of Latent Variables.” Many of our points overlap, but they use a different framing and make use of selection nodes, as they are building on a framework for generalizability. While the serious article format doesn’t allow for many memes (boo!), they make up for it by providing a lot more technical detail and examples.

*Many thanks to John Protzko who provided helpful feedback that led to some additions to this blog post, and to Ingo Seifert who took a critical look as well. As always, all errors aren’t theirs but my co-bloggers’.*

Footnotes

↑1 | A possible exception to this may be the literature on intelligence, in which measurement invariance is a relevant topic of substantive debate. |

↑2 | John Protzko told me he received pushback for saying that people say measurement invariance implies that the measure measures the same thing in different groups. So, in his preprint on measurement invariance, he now reviews the evidence that measurement invariance is commonly interpreted this way–which it is. |

↑3 | A little bit worse may be okay. Always check with your local statistician or the literature if you have concerns; there are ongoing debates about the appropriate cut-offs for measurement invariance testing. |

↑4 | More technically, the data that are being explained in structural equation models are usually the elements of the variance-covariance matrix (a matrix that for each item contains its variance and its covariance with all other items; think correlation matrix, but unstandardized). Thus, SEM results can often be fully reproduced from the variance-covariance matrix alone. Sometimes, the mean structure is also modeled (e.g., when testing scalar invariance, see later section). |

↑5 | Here I am subscribing to a realist interpretation of factor models, in which the arrows are arrows and not a shorthand for “an association but somehow also directed.” The realist stance is put forward in older writings by Borsboom, Mellenbergh and van Heerden (2003, 2004). A lot of researchers are probably more comfortable subscribing to a pragmatist interpretation in which a factor model is just a good way of summarizing a large number of items. I am not sure such an interpretation is 100% consistent. First of all, when calculating a factor score, items with higher loadings will have a bigger impact. Statistically speaking, those will be the items that have higher correlations with other variables. But why would we want to give higher weight to those items when the goal is to simply summarize the variables? Giving equal weights to the items (aka, just taking the average) seems more justifiable if we are completely agnostic to what caused the data (and thus also about how error may have affected values). Second, once we try to do anything else with those latent variables, inferences can end up badly biased if the true data generating mechanism was a different one (Rhemtulla et al., 2020). Unless somebody convinces me otherwise (come at me pragmabro), I think the pragmatist interpretation is mostly a cope to avoid facing that much of empirical psychology rests on very strong assumptions. That puts “I’m just using this factor model for pragmatic reasons” into the same category as “oh no I do not want to make causal claims I am just trying to do prediction” (see footnote 12) and “non-causal mediation analysis” (which isn’t a thing; mediation is statistically indistinguishable from confounding–the difference lies in causal assumptions). |

↑6 | And then shortly afterwards I stumbled across it again in Paulewicz and Blaut’s paper on the general causal cumulative model of ordinal response. Baader-Meinhof much? |

↑7 | More generally speaking (and still keeping things casual), measurement invariance implies that all group differences in the items “go through” the latent construct. So, measurement invariance would also be violated if there is a common cause that confounds the group variable with the items. In such a scenario, the measure would be non-invariant, but it is a type of non-invariance that can be “explained away” if we additionally observe and model the common cause (which would allow us to arrive at unbiased conclusions about group differences in the latent construct, despite non-invariance). After posting the blog post, Borysław Paulewicz kindly reached out to me and patiently explained what a complete, causal definition of measurement (non-)invariance has to look like. You can find it on page 14 of his manuscript on the general causal cumulative model of ordinal response. It says that a response R (in our case, the observed items) is a biased measure of the latent construct T with respect to some manifest variable V (in our case, the group variable) if there exists an active causal path (one without a collider) between the response R and the manifest variable V such that the latent construct T is not a node on this path. |

↑8 | I have collapsed the multiple items into a single “observed measures” node because that looks more elegant (in DAGs, a single node for multi-dimensional variables is fair game), and also because we can think about measurement (in)variance in scenarios in which there is only a single item or outcome. We wouldn’t be able to test it with the standard procedures, but that doesn’t mean that it couldn’t be a potential concern if we have some indication that the single item could function differently in groups. |

↑9 | If you don’t want to make Richard McElreath unhappy |

↑10 | If you’re more of a dag person: In a proper DAG, violations of metric invariance and violations of scalar invariance would look the same because all variables that jointly affect another one are allowed to interact by default without any need for arrows on arrows. |


There is a peculiar pattern in psychology: researchers discover that their construct of interest can be “disentangled” into two subfactors. Those two subfactors show a small to medium positive correlation with each other, but their associations with *other* variables diverge. One usually seems to be adaptive, in the sense that it is positively associated with good things (well-being, social functioning,…). That’s the bright side of the construct. The other one seems to be maladaptive in the sense that it is negatively associated with good things and positively associated with bad things. That’s the dark side of the construct. Some examples:

Personality researchers distinguish between different subfactors of narcissism. There are many different models here; to pick one example, Back et al. (2013) suggest the two subfactors *narcissistic admiration*—a tendency to seek social success by means of self-promotion (e.g., “I will someday be famous”)—and *narcissistic rivalry*—a tendency to prevent social failure by means of self-defense (“I enjoy it when another person is inferior to me”). While these two factors show a medium to large positive intercorrelation, their associations with social success diverge: admiration is associated with increasing popularity over time; rivalry is associated with decreasing popularity (Leckelt et al., 2015).

Enns, Cox and Clara (2002) distinguish between adaptive perfectionism (e.g., “I strive to be as perfect as can be”) and maladaptive perfectionism (e.g., “If I fail at work/school, I am a failure as a person”). While the two factors show a medium to large positive intercorrelation, only maladaptive perfectionism shows a large positive correlation with depression proneness. If adaptive perfectionism is linked to depression proneness at all, the correlation is negative.

Krasko, Schweitzer and Luhmann (2021) suggest two subfactors for people’s happiness goal orientations, which are interindividual differences in the extent to which people value and pursue happiness. Happiness-related strivings refers to a tendency to actively move toward happiness (“I often overcome challenges to become happy”); happiness-related concerns refers to worries about one’s happiness (“I am worried that I might be unhappy in the future”). The two factors are moderately correlated. Happiness strivings are associated with high positivity and successful mood regulation strategies; happiness concerns are associated with anxiety and poor regulation strategies.

And for a change of pace, one from health research: Barthels, Barrada and Roncero (2019) suggest two subfactors of eating styles related to orthorexia. Healthy orthorexia is characterized by seeking out healthy foods (“I feel good when I eat healthy food”), orthorexia nervosa in contrast is characterized by negative affect related to food (“Thoughts about healthy eating do not let me concentrate on other tasks”). These two factors show a medium intercorrelation. Healthy orthorexia is associated with increased positive affect, whereas orthorexia nervosa is associated with increased negative affect.

More constructs that may fit the bill:

- affiliative motive (“bright positive side” and “dark negative side,” see here)
- envy (“benign” and “malicious,” see here)
- passion (“harmonious” and “obsessive,” see here)
- pride (“authentic” and “hubristic,” see here)
- rumination (“positive” and “negative,” see here)
- self-monitoring (“acquisitive” and “protective,” see here).

How come? One explanation is that there is some overarching law of personality. In my forthcoming book, “The Yin and Yang of Personality,” I will explain… Just kidding.

Such “bright side/dark side” constructs can be generated if we (intentionally or unintentionally) pick items that combine aspects of one construct of interest (e.g., perfectionism) and a second potentially unrelated construct (e.g., emotional stability) which overall tends to correlate with things that are considered desirable (or, alternatively, undesirable; the signs of the association don’t matter). In this scenario, the standard data analytic approach will result in two correlated factors, one being “adaptive,” the other one being “maladaptive.” The thing is, those factors don’t necessarily reflect the reality of the factors that generated the data.

First, I will present simulated data to explain what’s going on on the statistical side of things. Then, I will present an empirical example to show how the way we slice the construct may affect subsequent substantive conclusions. In the discussion, I will suggest different ways to think about what’s going on with all those bright/dark side constructs.^{[1]}Lastly, I will demonstrate that academic signposting (explicitly telling readers what comes next) is a bad writing habit that can only be justified if there is a mismatch between readers’ expectations and reality (“It’s more of an article than a blog post”).

Let’s assume we were interested in assessing interindividual differences in people’s relationship with chocolate. How much people like chocolate may be deemed a neutral trait: Some people like chocolate more than others, and there is no a priori reason to assume that they differ much in terms of psychological adjustment (except for some edge cases, looking at you 100%-cocoa-no-added-sugar sickos).

To develop a self-report measure of chocolate liking, we generate eight items. But here comes the twist: We select our items so that four of them not only reflect chocolate liking but also high levels of emotional stability. Possible items could be “As long as chocolate exists, I can see the bright side of life”; “I can really enjoy eating chocolate”; “When I eat chocolate, I feel great”; and “A good bar of chocolate makes my day.” For the other four items, we go for phrasings that not only reflect chocolate liking, but also low levels of emotional stability. Possible items could be “I get concerned if there is no chocolate available”; “A lack of chocolate makes me very sad”; “Sometimes I worry I might run out of chocolate”; “Chocolate helps me to deal with the high amount of stress in my life.” Why, you may be wondering, would you add such mix-ins to your chocolate? We will discuss that later, so just stick with me for now.

In this thought ~~experiment~~ observational survey study, we know the underlying process that generates people’s responses to these items, see Fig. 2. Responses are generated by two underlying latent factors, chocolate liking and emotional stability. These factors are uncorrelated.^{[2]}Ruben maintains that this is not the case and that there should be a small positive correlation between the two. Allowing for such a correlation would not change the general pattern, and our next example will allow for a correlation between the two underlying factors, so all the Bridget Jones lovers out there can mentally insert *r* = .2 here. Chocolate liking loads on all items to the same extent, but emotional stability loads negatively on half of the items and positively on the second half of items.
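A minimal version of this data-generating mechanism can be sketched in a few lines (the loading values are my own illustrative choices, not necessarily those behind the post's figures):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Two uncorrelated latent factors
liking = rng.standard_normal(n)      # chocolate liking
stability = rng.standard_normal(n)   # emotional stability

# Hypothetical loadings: liking (a) on all eight items; stability (b) positive
# on the first four ("bright") items, negative on the last four ("dark") items
a, b = 0.6, 0.5
signs = np.array([1, 1, 1, 1, -1, -1, -1, -1])
noise_sd = np.sqrt(1 - a**2 - b**2)  # residual SD so item variances are ~1
items = (a * liking[:, None]
         + b * signs[None, :] * stability[:, None]
         + noise_sd * rng.standard_normal((n, 8)))

# Eigenvalues of the item correlation matrix: exactly two exceed 1
# (Kaiser criterion), pointing to a two-factor solution
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]
print((eigvals > 1).sum())  # -> 2
```

The `items` matrix is the kind of data the EFA below operates on; every standard extraction rule will agree on two factors here.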

Now, let us collect data from 2,000 fictitious respondents and analyze it following standard practices. In a first step, we perform an exploratory factor analysis (EFA) with the help of the R package *psych*. For this purpose, we first need to determine how many factors to extract. There are different ways to inform this decision, but here, fortunately, they all converge anyway (Fig. 3): We definitely should extract two factors.

If we move on and extract two factors, *psych* will automatically return a rotated solution, and even if you use software that doesn’t rotate by default, you’re probably going to rotate anyway. However, let’s first take a look at the unrotated solution in Table 1.

If you compare those numbers to Fig. 2, you can see that they reflect the underlying data-generating mechanism well. There is one factor that loads quite strongly on all items (chocolate liking), and there is a second factor (emotional stability) with somewhat weaker absolute loadings, half of which have a positive sign and half of which have a negative sign. So far, so good.

But results from exploratory factor analysis are usually rotated to improve interpretability. In the unrotated version, the first factor will “soak up” as much explanatory power as possible and have many high loadings; rotation distributes the explanatory power of the factors more evenly and we often aim for a so-called simple structure in which each item “belongs” to a single factor. So, let’s rotate those factors!

We are going to use the default setting (oblimin rotation), although which of the standard rotations you pick here won’t make a big difference.

As we can see in Table 2 (newly added columns to the right), we still have two factors—the rotation doesn’t really “change” the factors, it only rotates them within the eight-dimensional space defined by the responses to all eight items. If you struggle to visualize this, for an imperfect approximation, stretch out your fingers like in Fig. 4 and then vigorously twist your hand around.^{[3]}This corresponds to a situation in which you use three orthogonal factors (your three fingers) to explain the space around you. Given that the space around you has three dimensions to begin with, your hand model is perfect. Hands! Wow.

But the pattern of loadings now looks completely different. The first factor loads on the first four items and the second factor loads on the second four items; all cross-loadings are close to zero. The resulting factors are correlated, *r* = .49.

What happened here? First, the unrotated factor analysis solution recovered the data-generating mechanism that we actually plugged into our analysis: one factor that strongly loads onto all items, and then a second factor that loads positively onto half of the items and negatively onto the other half.^{[4]} That the unrotated solution corresponds to the truth is only because of the data-generating mechanism we simulated. If it always worked like that, this blog post would be a lot shorter. But then rotation led to a “simpler” version in which only one factor loads onto each of the items. From a “purely statistical” perspective, the rotation makes no difference whatsoever—any rotation of the solution is observationally equivalent, which means that it fits onto the observed data equally well. Conventionally, the rotated solution is reported. It’s just like in the song: *oblimin, oblada, life goes on, brah*.
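The claim that rotation makes no difference statistically can be checked directly: for any orthogonal rotation matrix T, the rotated loadings imply exactly the same covariance matrix and thus fit any data set equally well. A minimal sketch with made-up loading values (plain numpy, not the *psych* workflow used in the post):

```python
import numpy as np

# Unrotated loadings (hypothetical values mimicking the example): factor 1 loads
# on everything, factor 2 with alternating sign on the two halves of the items
L = np.array([[0.6,  0.5]] * 4 + [[0.6, -0.5]] * 4)
theta = np.diag(np.full(8, 1 - 0.6**2 - 0.5**2))  # residual variances

def implied_cov(loadings, resid):
    """Model-implied covariance matrix of the item responses."""
    return loadings @ loadings.T + resid

phi = np.deg2rad(45)
T = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])  # an orthogonal rotation matrix
L_rot = L @ T  # rotated loadings look very different from L...

# ...but imply the very same covariance matrix, i.e., the same fit
assert np.allclose(implied_cov(L, theta), implied_cov(L_rot, theta))
```

The algebra behind the assertion is just `L @ T @ T.T @ L.T == L @ L.T` whenever T is orthogonal; oblique rotations work the same way once the factor correlation matrix is carried along.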

But we can see that if we aim to give a substantive interpretation to these numbers, rotation makes all the difference. The unrotated version tells us that there are two independent factors which one could, for example, label “chocolate liking” and “emotional stability.” The rotated version tells us that there are two correlated factors. The first one may be labeled “chocolate obsession” (“I get concerned if there is no chocolate available”; “A lack of chocolate makes me very sad”; “Sometimes I worry I might run out of chocolate”; “Chocolate helps me to deal with the high amount of stress in my life”), the second may be labeled “chocolate appreciation” (“As long as chocolate exists, I can see the bright side of life”; “I can really enjoy eating chocolate”; “When I eat chocolate, I feel great”; “A good bar of chocolate makes my day”): The bright and dark side of liking chocolate.

We know that the Chocolate Obsession/ChocOlate Appreciation (COCOA) structure did not actually generate the data. Maybe its emergence from the data is a quirk of exploratory factor analysis? So let’s check whether confirmatory factor analysis would be able to discard the COCOA model.^{[5]} We should do this with new data, though one could do it with the data underlying the EFA (to cite Anne: “CFA confirms the factor structure you found in an EFA in the same dataset? d’uh”). Luckily, it won’t make a difference for the bright/dark side thing because it is not a matter of sampling variability. In any case, we specify a model with two factors which are allowed to correlate. One of the factors is measured with the first four items; the other one is measured with the second four items, as suggested by our exploratory factor analysis after rotation.

This model fits the data well according to conventional criteria (CFI = .99, RMSEA = .02, SRMR = .02). In fact, the data didn’t really have any chance to reject the model because *the true data-generating model (Fig. 2) is observationally equivalent to the bright/dark side version*. That means that both models predict the same empirical data (the same variance-covariance matrix of the item responses); both models will fit the data equally well.
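The observational equivalence can be verified by hand: choosing simple-structure loadings and a factor correlation derived from the true model's parameters reproduces the identical implied covariance matrix. A sketch with hypothetical loading values (with these numbers the implied factor correlation comes out lower than the .49 in the post's simulation):

```python
import numpy as np

a, b = 0.6, 0.5  # hypothetical loadings: liking on all items, stability +/- on halves

# Model A (the truth): two orthogonal factors
LA = np.array([[a,  b]] * 4 + [[a, -b]] * 4)
sigma_A = LA @ LA.T  # common part of the implied covariance matrix

# Model B (COCOA): simple structure with correlated factors
c = np.sqrt(a**2 + b**2)             # within-factor loading
phi = (a**2 - b**2) / (a**2 + b**2)  # factor correlation (~.18 with these values)
LB = np.array([[c, 0]] * 4 + [[0, c]] * 4)
Phi = np.array([[1, phi], [phi, 1]])
sigma_B = LB @ Phi @ LB.T

# Identical implied covariances: the data cannot tell the two models apart
# (the residual variances, omitted here, are the same in both models)
assert np.allclose(sigma_A, sigma_B)
```

Within a bright-bright or dark-dark pair, both models imply a covariance of a² + b²; across the halves, both imply a² − b²; that is all the covariance matrix contains, so the fit statistics have to agree.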

With this additional “confirmation” of the COCOA model, we may want to look for further validation by investigating how the two factors are associated with other constructs. For example, we may want to explore how these two factors are related to depressiveness. We find that chocolate appreciation is associated with reduced depressiveness (*r* = -.29); chocolate obsession is associated with increased depressiveness (*r* = .31). This pattern further supports the idea that there are two distinguishable constructs, and the difference matters from a substantive perspective. Chocolate obsession is worrying from a mental health perspective; chocolate appreciation in contrast may even be a protective factor. More research is needed.

What happened here? We simulated depressiveness as a random variable that is negatively affected by emotional stability. Since chocolate appreciation mixes chocolate liking with high emotional stability, it will be negatively correlated with depressiveness. And since chocolate obsession mixes chocolate liking with low emotional stability, it will be positively correlated with depressiveness. In fact, the whole chocolate thing doesn’t have anything to do with the correlation with depressiveness.^{[6]} Or just a tiny bit, for proponents of the Bridget Jones hypothesis.

This toy example assumes a very simple and exaggerated data-generating mechanism, but it captures a crucial feature of both exploratory and confirmatory factor analysis. These analyses do not necessarily uncover the latent factor structure that actually underlies the data. Another way to put this is to say “correlation does not imply causation.” As an input, factor analysis takes the associations between the items. As an output, it returns a structure that may have caused the item responses. But a pattern of correlations is always compatible with multiple causal stories, so different factor models (and alternatives, including network models) will fit the data equally well.

Illustrating a point with data tailored for the purpose always feels a bit like cheating, so let’s look at some real data. As mentioned in the introduction, Krasko, Schweitzer, and Luhmann (2021) developed a questionnaire that assesses two aspects of happiness goal orientations, happiness-related strivings (the propensity to move actively toward happiness) and happiness-related concerns (the propensity to worry about threats to happiness). Their article is great, it’s trying to make sense of previous contradictory findings and takes into account multiple possible model specifications. And Krasko et al. made the present re-analysis possible by openly sharing their data and code and providing excellent documentation. In other words, it’s the type of article that contributes to my researcher happiness goals.

A central substantive question in this literature is whether striving for happiness is good (or bad) for one’s happiness, and to demonstrate how the bright/dark side factor structure may affect conclusions about that, we will model multiple constructs simultaneously:

- the independent variable(s) of interest: happiness-goal orientations (5 items measuring strivings, 5 items measuring concerns)
- the dependent variable of interest: life satisfaction (5 items, the *Satisfaction With Life Scale*)
- the potential third-variable offender: neuroticism (3 items from the Big Five Inventory 2), which in personality lingo is the same as reverse-coded emotional stability.

First, we fit a model that includes the two happiness goal orientation factors (Fig. 6): happiness strivings (“bright side”) and happiness concerns (“dark side”). The model fit is okay-ish (definitely better than many model fits in the literature which I have seen described as good), so let’s not worry about that too much. The correlation between happiness-related concerns and happiness-related strivings is quite low (*r* = .14). In contrast, the correlation between happiness-related concerns and neuroticism is really strong (*r* = .81). As would be expected, there is a strong negative correlation between neuroticism and life satisfaction (*r* = -.66). And, in line with the findings reported by Krasko et al., strivings are positively associated with life satisfaction (*r* = .22), whereas concerns are negatively associated (*r* = -.46).

We may want to check whether happiness goal orientations predict life satisfaction beyond neuroticism, using this conceptualization of the factors. We can do that by taking the model from Fig. 6 but instead of allowing all factors to correlate, we include directed arrows that point from happiness concerns, happiness strivings, and neuroticism to life satisfaction.

In this analysis, neuroticism ends up with a very large negative coefficient (*b* = -.77, *p* = .001). Both happiness goal orientations have smaller coefficients but fail to reach statistical significance; concerns: *b* = -.15, *p* = .436, strivings: *b* = .12, *p* = .163. Thus, assuming that neuroticism is causally prior to happiness goal orientations (and to life satisfaction) and thus a confounder, one may conclude that the observed correlations—the negative one for happiness concerns and the positive one for happiness strivings—are at least partly, if not mostly, due to confounding.

As we have seen above, the correlation between happiness concerns and neuroticism is quite high (*r* = .81). So let’s consider a model in which there is actually only a single happiness goal orientation factor. Additionally, neuroticism is allowed to load onto all items that were previously part of the happiness concerns factor (Fig. 8). You may notice that this is a slightly different structure than the chocolate obsession/chocolate appreciation data-generating mechanism, in which the second factor loaded onto *all* items.^{[7]}If we wanted to fit that type of structure, the model would not be identified (unless we incorporated some additional constraints, such as the equality of certain loadings). But you don’t need the full-blown “second factor affects *every single item* either positively or negatively” structure to induce a bright/dark side structure, and maybe that’s the more realistic scenario to begin with: some but not all items are affected by a second factor. This has the added benefit that the correlations with other constructs are not just going to be mirror images of each other (which would start to look suspicious at some point), even if it is less pleasingly symmetrical. The model fit is now slightly worse but still very much in the same ballpark.

Neuroticism is still negatively correlated with life satisfaction (*r* = -.60) while the happiness goal orientation factor is positively correlated with life satisfaction (*r* = .26). Furthermore, there is a slight negative correlation between neuroticism and happiness goal orientation (*r* = -.18). If we look at the loadings, we can see that the happiness goal orientation factor loads more strongly onto the striving items (loadings between .50 and .65) than onto the concern items (loadings between .11 and .33). The latter are more strongly determined by neuroticism (loadings between .69 and .82). We may take this as a sign that maybe the happiness concerns items don’t really fit the construct of happiness goal orientation to begin with, but note that in this model, the meaning of neuroticism will have shifted because in the end, the content of the factor is determined by the indicators—and we have just added a bunch of happiness concerns to the construct of neuroticism.

Again we may want to check whether happiness goal orientation predicts life satisfaction beyond neuroticism, using this alternative single-factor conceptualization. So we re-draw the model with arrows pointing into life satisfaction. Again, we end up with a large negative coefficient for neuroticism (*b* = -.57, *p* < .001). Happiness goal orientation now has a positive coefficient (*b* = .16, *p* = .038) that may be considered significant, depending on your statistical creed. So here, assuming that neuroticism is causally prior to happiness goal orientation, one may conclude that not all of the association between happiness goal orientation and life satisfaction boils down to neuroticism.

Now we have two models which make different statements about (1) the structure of happiness goal orientation and about (2) the relationship between happiness goal orientations and life satisfaction:

- Bright/dark side: two types of happiness goal orientation, associations to life satisfaction could be mostly explained away by confounding with neuroticism
- Neuroticism all the way down: one type of happiness goal orientation but half of the items also happen to be affected by neuroticism, maybe there is a positive association with life satisfaction even after accounting for neuroticism. Admittedly, the latter claim hinges on weak statistical evidence, but *p* < .05 still counts for something in many articles.

I can’t tell you. If you are really anal about model fit, you may prefer the two-factor bright/dark side model. But the difference in fit seems quite small, and in any case, if you’re that anal about model fit, you might not like either model because the Chi-Square test rejects both of them (see Fig. 6, Fig. 8).^{[8]}Not that anybody cares about the Chi-Square test these days. If it keeps rejecting our models, it’s only fair that we reject it in return.

From the perspective of parsimony, we may prefer to add fewer “new” factors—if a single happiness goal orientation factor in combination with neuroticism can explain the data just as well, why add two? But then again, Krasko et al. (2021) started from the observation of contradictory findings in the literature concerning the associations between happiness goal orientation and well-being. Splitting the construct into two factors, one of which is strongly infused with neuroticism (eau de despair), potentially clarifies the situation. But even if that’s the case for happiness goal orientation, what about other bright/dark side constructs? What are they good for?

So we are in a situation in which we have a bunch of items that don’t really fit a one-factor model. If we split them up into two factors, it can end up two ways:

- bright/dark side model with two correlated factors that both reflect the focal construct of interest (e.g., happiness goal orientations), one being correlated with good things, the other one being correlated with bad things
- two-factor model in which we have one factor for the construct of interest (e.g., happiness goal orientations) and a second one for whatever else affects some or all items (e.g., neuroticism).

It’s probably fair to say that the bright/dark side model is the sexy one. The other one would probably often be described as “contamination bias,” because we did not set out to measure neuroticism in the first place. This sets up a situation in which we can expect the bright/dark side model to dominate the psychological literature over its competitor. In the worst case, there will be natural selection for bad science here—assuming that the bright/dark side thing is bad science. But is it? Are those models wrong? What does *wrong *even mean in the context of factor analysis? Heck, should it even be called factor analysis, and not assumor analysis instead?

Let’s say we take a realist position and actually believe that some latent factors exist (in some sense of the word) and our research goal is to uncover them.^{[9]}I think not many psychologists would claim they are factor realists *when asked about it*, but many articles nonetheless seem to imply some degree of realism. Which of the two competing models is the true one? Can we design a situation in which the two models make different predictions and then check which one fits better?

That’s where our original draft started to frazzle. I thought that maybe longitudinal or multi-rater studies could be helpful, but then I did some simulations and it seemed like that wasn’t really a way out (both models still made equivalent predictions in the scenarios I considered). Tailcalled brought up factor analysis applied to genomic data, and that does indeed seem to be a method used to discard certain factor-analytic models, although I’m not the best person to think this through for the bright/dark side problem.
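One reason simulations like that keep coming out equivalent: if the two models merely re-express the same two-dimensional factor space, they imply the exact same covariance matrix, so no amount of model fitting on the same items can tell them apart. Here is a minimal sketch of that equivalence (all loadings and the transformation are invented for illustration):

```python
import numpy as np

# Invented bright/dark model: items 0-3 load on "bright", items 4-7 on "dark",
# and the two factors correlate at -0.5.
L_bd = np.zeros((8, 2))
L_bd[:4, 0] = 0.7
L_bd[4:, 1] = 0.7
Phi = np.array([[1.0, -0.5], [-0.5, 1.0]])
Psi = np.diag(1 - (L_bd @ Phi * L_bd).sum(axis=1))  # unique variances

Sigma_bd = L_bd @ Phi @ L_bd.T + Psi  # model-implied covariance matrix

# Re-express the same factor space with new factors g = T f, e.g.,
# a bipolar "bright minus dark" factor and a general "average" factor.
T = np.array([[1.0, -1.0], [0.5, 0.5]])
L_alt = L_bd @ np.linalg.inv(T)
Phi_alt = T @ Phi @ T.T  # here the new factors even come out uncorrelated

Sigma_alt = L_alt @ Phi_alt @ L_alt.T + Psi
print(np.allclose(Sigma_bd, Sigma_alt))  # True: identical implied covariances
```

The loadings and factor correlations look completely different between the two parameterizations, yet every testable implication for the observed items is the same.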

One way forward that does seem promising is multi-construct studies. For a lot of the bright/dark side constructs, the “contaminating” factor seems to have the same flavor. For example, for those that are more “externalizing,” the distinction seems to be “I want something good for myself” (benign envy, narcissistic admiration) versus “I want something bad for others” (malicious envy, narcissistic rivalry), the latter maybe reflecting antagonism. For those that are more “internalizing,” the distinction seems to be “I try my best to get good things” (adaptive perfectionism, happiness strivings, acquisitive self-monitoring) versus “I try my best to avoid bad things because they are horrible” (maladaptive perfectionism, happiness concerns, protective self-monitoring). I hate to be that person, but maybe it was BIS/BAS—Behavioral Inhibition System, Behavioral Activation System—all along?

Let’s imagine you have N “substantive” constructs with a bright/dark side. You throw all items together, collect data, and run a factor analysis according to standard practices, including standard criteria for the number of extracted factors. I would be willing to bet that you would end up extracting N factors plus 1 or 2 (N substantive factors plus 1-2 bright/dark ingredients) rather than 2*N factors (N bright side and N dark side factors). In that case, it wouldn’t look very bright for the bright/dark side anymore.
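The bet can be sketched at the population level with a toy loading matrix (all numbers invented): N = 3 substantive factors with six items each, plus one shared “ingredient” factor that touches half of each construct’s items. Counting the eigenvalues of the model-implied correlation matrix that exceed 1 (the Kaiser criterion, used here purely as a rough stand-in for standard extraction rules) points to N + 1 factors, not 2N:

```python
import numpy as np

# Invented structure: 3 substantive constructs x 6 items; within each construct,
# 3 items additionally load on one shared "ingredient" factor (e.g., neuroticism).
n_constructs, items_per = 3, 6
n_items = n_constructs * items_per  # 18

Lambda = np.zeros((n_items, n_constructs + 1))
for k in range(n_constructs):
    rows = slice(k * items_per, (k + 1) * items_per)
    Lambda[rows, k] = [0.7, 0.7, 0.7, 0.6, 0.6, 0.6]           # own-construct loadings
    Lambda[k * items_per + 3:(k + 1) * items_per, -1] = 0.5    # shared factor, 3 items each

# Model-implied correlation matrix (orthogonal factors, unit item variances)
communality = (Lambda ** 2).sum(axis=1)
Sigma = Lambda @ Lambda.T + np.diag(1 - communality)

eigvals = np.linalg.eigvalsh(Sigma)[::-1]
n_big = int((eigvals > 1).sum())
print(n_big)  # 4, i.e., N + 1 -- not 2 * N = 6 separate bright/dark factors
```

In real data the counts will wobble with sample size and extraction criterion, but the point stands: the shared ingredient shows up once, not once per construct.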

Strictly speaking, the number of factors to extract is a matter of taste, even if it is based on (somewhat arbitrary) empirical criteria. There’s also nothing inherently empirical about extracting a small number of *orthogonal* factors. That mother nature shaves her orthogonally-jointed legs with Occam’s razor is just an assumption.^{[10]}I guess there are valid reasons why psychologists (or Meehl at least) have been struggling with factor realism since the dawn of advanced statistics. But this still seems a promising way forward to break things down into something that could be labeled ingredients, and many people would be willing to take these as a more realistic description of what’s going on under the hood. At the very least, this approach would result in more parsimonious descriptive taxonomies.

Rose et al. (2023) have a paper reporting such a multi-construct study focusing on dark traits. They combine items from quite a few measures (Five Factor Model, Antagonistic Triad Measure, Dirty Dozen, BIS/BAS,…) and end up with four factors (which they label antagonism, emotional stability, impulsivity, and agency). Those results seem quite neat to me, but I may be a bit biased because two of them (antagonism, emotional stability) appear like the worst offenders when it comes to bright/dark side constructs.

Bainbridge et al. (2022) go even further. They included 42 personality scales and showed that a large share of them could reasonably be considered as located within the Big Five, in the same manner that Big Five facets are located within the Big Five. Now that’s what I call an epic redemption arc for our five best frenemies! I can very much recommend taking a look at this study; at the very least, check out the very cool figures locating the scales within the Big Five. This work is not primarily concerned with bright/dark side constructs, but it provides a good discussion of what should be expected from proponents of new scales. They should try to locate them within the existing taxonomy, and if the overlap with existing measures is very high, they should prove that their scale has added benefits (in a much more rigorous manner than is currently done).

The multi-construct approach also seems a good way forward in scenarios in which the bright/dark factor appears to be *mostly *response bias which would be shared across pretty much all questionnaires. As a side note, I think the relationship between a response “bias” factor and a “substantive” contaminating factor isn’t straightforward:

- if you have a substantive bright/dark factor (say, antagonism), social desirability will probably “feed into it” because some people are more willing to report (or ascribe to others) antagonistic feelings and behaviors
- you may end up with separate factors for positively and negatively worded items, which screams “response bias.” But that response bias can still show correlations that appear to be substantively interesting (see e.g., positive and negative self-esteem)
- if your intended factor structure is a bit brittle, you can always mix in a healthy dose of response bias to support it. For example, grit supposedly consists of “consistency of interest” and “perseverance of effort.” In the Short Grit Scale, the items for the former are negatively phrased (“New ideas and projects sometimes distract me from previous ones”) and the items for the latter are positively phrased (“Setbacks don’t discourage me”).^{[11]}Now that’s what I call a consistent effort to get mother nature to confess.

But like I said before, maybe carving nature at its joints is too lofty a goal to begin with. Maybe even a proper taxonomy is beyond grasp. How about something humbler?

In psychological science, there’s “prediction”^{[12]}This mode of inference allows you to say “X independently predicts Y beyond other factors” and then wink, which everybody understands to imply causality. If pressed why you’re not aiming for causal inference instead, you may say something like “a causal understanding is a very ambitious goal; prediction is a stepping stone towards it.” I think that type of approach usually sets you up to fail on both fronts, at least in psychology (where predictive accuracy is often humble). You’re not doing causal inference well because you pretend to do something else with different rules; you’re not doing prediction well because you don’t *actually *want to predict anything. and prediction. I believe that prediction may be a worthwhile endeavor if it’s done well and if there’s actually a point in predicting the outcome.

Studies establishing new constructs or questionnaires, including studies on bright/dark side constructs, will often allude to the fact that the constructs under consideration can predict some outcome (often, but not always, another self-report questionnaire). I think it would be great if those studies actually made an argument why we would want to predict the outcome to begin with. For example, well-being surely is important. But what do we gain if we find out that it is predicted by X? We *can’t* conclude that we should increase X to make people happier—for that claim, we need causal inference. We *can* use that information to predict who is unhappy, but then what? Most personality researchers are not in the business of rolling out targeted intervention programs. And even if they were, the “predicted” outcome in those studies usually occurs at the same time as the predictors. Being able to predict present well-being from a multi-item self-report questionnaire isn’t really that helpful; we could just hand out a multi-item well-being questionnaire instead.^{[13]}There are exceptions, such as studies that look into the value of psychological constructs (often ability measures) for actually *pre*dicting occupational success. That literature is much more developed and tends to be more technically sophisticated than the average study trying to squeeze in one more self-report questionnaire. But I digress.

Let’s assume we *really* want to do prediction. What’s more useful, the bright/dark side model (e.g., chocolate obsession/chocolate appreciation) or the alternative model (e.g., chocolate liking + emotional stability)? If we’re thinking in terms of different rotations of the factor structure underlying the same items, it actually doesn’t matter at all—you get the same predictive bang for the buck. The coefficients will change but you’re not supposed to interpret them substantively anyway (recall you’re doing prediction, not explanation).
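That rotation leaves predictive performance untouched is just linear algebra: any invertible transformation of the factor scores spans the same column space, so least squares lands on the same fitted values. A minimal sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented factor scores for 500 people on two correlated factors,
# plus an outcome that depends on both.
F = rng.multivariate_normal([0, 0], [[1, 0.4], [0.4, 1]], size=500)
y = F @ np.array([0.5, -0.3]) + rng.normal(0, 1, 500)

# Any invertible "rotation" of the factors, e.g., sum and difference scores
T = np.array([[1.0, 1.0], [1.0, -1.0]])
F_rot = F @ T

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Coefficients differ between the two versions, predictive accuracy does not
print(np.isclose(r_squared(F, y), r_squared(F_rot, y)))  # True
```

The caveat from the text applies in reverse, too: because the coefficients do change under rotation, nothing about them should be read substantively when the goal is prediction.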

More interesting is the question whether the bright/dark side *questionnaire *is preferable over an alternative that takes the same amount of effort and time to collect. For example, if you can squeeze in eight items, what’s preferable for predictive purposes?

- four items to measure chocolate appreciation, four items to measure chocolate obsession
- four items to measure general chocolate liking, four items to measure emotional stability
- a single item to measure chocolate liking, seven items to measure emotional stability?

This is an empirical question and the answer will depend on

- the outcome to be predicted
- the actual questionnaires (maybe some of your measures happen to suck not because of the construct but because of the item phrasing) and
- other variables that are available for prediction

So the answer won’t be “bright/dark side construct yay” (or nay) but rather “for this specific predictive task, this particular questionnaire rather than that one, yay” (or nay). This answer won’t be very helpful if the predictive task exists only in the researchers’ imagination, never to be deployed in the real world.

I also suspect the bright/dark side constructs will not win very often when it comes to prediction. I have read quite a few studies in which the authors have to bend over backwards to demonstrate “incremental validity” (which is linked to psychology-style “prediction” rather than actual prediction). The number of additional predictors is usually kept low, measurement error is ignored, and even then, the added value is usually low and the statistical evidence often somewhat brittle.

But maybe when we say that constructs are useful for prediction, we mean something else—maybe we aren’t really thinking of quantitative predictions, but of something more casual. For example, we meet somebody who seems to have slightly narcissistic tendencies. Should we stay away? If they are mostly trying really hard to get others to admire them, maybe they’re fine or even fun to hang out with. If they are mostly trying really hard to put other people down, yikes.

Which leads us to a different research goal.

The thing is, there are a lot of bright/dark side constructs out in the literature, and people dig them. That’s in itself an interesting observation!

Considering the high prevalence, the phenomenon probably has two sides:

- dark side: mixing up two constructs is a good way to craft one’s own niche in the literature. A lot of known findings can reliably be rediscovered (“Oh look! The dark side of X correlates with all correlates of neuroticism! Isn’t that interesting?”) and that also makes the questionnaire attractive for other researchers (citation machine go brrrrr).
- bright side: people keep generating bright/dark side structures because when they try to generate items to capture a certain construct, they are very likely to think of behaviors and thoughts that *matter*. What matters? Things that are functional or dysfunctional one way or another. Maybe antagonism creeps in because if somebody is antagonistic, that’s highly relevant information for future cooperation, so we’re very sensitive to that. Maybe neuroticism creeps in because we want to know whether we or other people are struggling, which may indicate a need for help. Another way to put this is that social perception may indeed have a bright/dark side structure, and maybe in the end all of our personality models are really just models of social perception.

These two aren’t mutually exclusive.

Beyond the pervasiveness of bright/dark side constructs, they also seem to have a broader appeal to humans in general (myself included, on my weaker days). In case you’re inclined to attend parties, any of these constructs will make for a more exciting small-talk topic than factor rotation methods. Why are they so attractive?

One contributing factor may be that they introduce some degree of nuance. Envy, pride, and narcissism appear like bad things. Pointing out that people who are highly envious may also channel their wickedness into positive behaviors (like trying to emulate the achievements of the target of their envy) demonstrates that it’s not all black and white.

Except that, of course, the bright/dark side constructs then go on and slice things into good and bad. But maybe because it’s contained within a broader construct, behavior change seems more plausible—keep the trait, change the manifestation. Instead of putting others down, couldn’t you just try to lift yourself up instead? Instead of being afraid of unhappiness, couldn’t you just strive for greater happiness? Of course, if you mentally rotate the factors that’s about the same as saying “don’t be such a jerk” or “don’t be so neurotic.” But the narrative is a different one, and maybe the narrative does facilitate change.

Whether that’s a good justification to pile yet another questionnaire onto the literature will very much depend on whether we consider crafting narratives that feel right and potentially helpful a central goal of personality psychology. If you ever went through the APA PsycTests database alphabetically, as Ruben did, you may conclude that it’s just not worth it; we just have too much *stuff *already. Are we really fine with doing superficially satisfying scholarly sanctioned self-help if it leaves our scientific literature in shambles?

Now if you’ll excuse me, I’m going to revel in the consequences of the dark side of my scientific cynicism.

Footnotes

↑1 | Lastly, I will demonstrate that academic signposting (explicitly telling readers what comes next) is a bad writing habit that can only be justified if there is a mismatch between readers’ expectations and reality (“It’s more of an article than a blog post”). |
---|---|

↑2 | Ruben maintains that this is not the case and that there should be a small positive correlation between the two. Allowing for such a correlation would not change the general pattern, and our next example will include a correlation between the two underlying factors, so all the Bridget Jones lovers out there can mentally insert r = .2 here. |

↑3 | This corresponds to a situation in which you use three orthogonal factors (your three fingers) to explain the space around you. Given that the space around you has three dimensions to begin with, your hand model is perfect. Hands! Wow. |

↑4 | That the unrotated solution corresponds to the truth is only because of the data-generating mechanism we simulated. If it always worked like that, this blog post would be a lot shorter. |

↑5 | We should do this with new data; we could do this with the data underlying the EFA (to cite Anne: “CFA confirms the factor structure you found in an EFA in the same dataset? d’uh”). Luckily, it won’t make a difference for the bright/dark side thing because it is not a matter of sampling variability. |

↑6 | Or just a tiny bit, for proponents of the Bridget Jones hypothesis. |

↑7 | If we wanted to fit that type of structure, the model would not be identified (unless we incorporated some additional constraints, such as the equality of certain loadings). |

↑8 | Not that anybody cares about the Chi-Square test these days. If it keeps rejecting our models, it’s only fair that we reject it in return. |

↑9 | I think not many psychologists would claim they are factor realists when asked about it, but many articles nonetheless seem to imply some degree of realism. |

↑10 | I guess there are valid reasons why psychologists (or Meehl at least) have been struggling with factor realism since the dawn of advanced statistics. |

↑11 | Now that’s what I call a consistent effort to get mother nature to confess. |

↑12 | This mode of inference allows you to say “X independently predicts Y beyond other factors” and then wink, which everybody understands to imply causality. If pressed why you’re not aiming for causal inference instead, you may say something like “a causal understanding is a very ambitious goal; prediction is a stepping stone towards it.” I think that type of approach usually sets you up to fail on both fronts, at least in psychology (where predictive accuracy is often humble). You’re not doing causal inference well because you pretend to do something else with different rules; you’re not doing prediction well because you don’t actually want to predict anything. |

↑13 | There are exceptions, such as studies that look into the value of psychological constructs (often ability measures) for actually predicting occupational success. That literature is much more developed and tends to be more technically sophisticated than the average study trying to squeeze in one more self-report questionnaire. |

At one extreme, there are calls for samples that are either full-blown representative or that at least *to some extent* capture the diversity of the population of interest, which may actually often be *all* humans. Only this way, or so they say, are we able to arrive at generalizable insights about the human psyche. And frankly, it makes a lot of sense that claims about humans in general should be supported by data from humans in general.

At the other extreme, some say that psychological studies are actually getting at mechanisms that are fundamental to the human mind. We are all part of the same species, so it should not be a big problem if our sample mainly consists of College Freshmen at a Large Midwestern University. And frankly, it does seem like we have managed to identify some generalizable phenomena^{[2]}Such as motivated reasoning. even without ample samples.

What gives? As so often in life, this is one of the situations where the most boring sounding of all answers is correct: It depends. It depends on the goal of the analysis; it depends on the assumptions one is willing to make. Spelling out both analysis goals and underlying assumptions doesn’t appear to be psychologists’ strong suit, so it only makes sense that we’d also be confused about the need for/value of representative samples. So, will *you *need a representative sample? Let’s consider three different scenarios: (I) you want to estimate a prevalence, (II) you want to estimate a causal effect using an experiment, or (III) you want to estimate a causal effect using observational data.

Or, alternatively, some mean score, say, average happiness on a scale from 0 to 10. For these types of questions, I think most of us would immediately ask “average for whom?” because it is obvious that average happiness (or, say, vaccination rate) is a property of some population of interest. And, ideally, we would want to have a representative sample of that population to get the estimate right.

But what even is that, a representative sample? The term is used with different meanings within the non-scientific literature, within the broader scientific literature, and even within statistics; it can also simply be a way to give the sample a pat on the back (a phrase from Kruskal & Mosteller, 1979, that I really dig).^{[3]}This was brought to my attention by Stas Kolenikov and Raphael Nishimura *after* this post had been published, how embarrassing. The two of them plus Andrew Mercer provided some super helpful explanations and pointers to the literature, for which I am very grateful. Folks, listen to your local survey methodologist/statistician! As a side note, readers of this blog may also enjoy Nishimura’s pinned tweet. Here, I will use it to refer to a scenario in which each person in the population had the same chance of ending up in the sample, which in more precise terms is an equal probability sample. An equal probability sample is actually not what smart survey planners would usually go for,^{[4]}Stas brings up a study on American Jews for which he was the survey designer. For example, in America, Jews tend to live in some specific urban areas; sampling more heavily in those areas gives you better bang for the buck. but I think it provides a good abstraction to think about samples, so let’s stick with it for now.

For example, you simply send out study invitations to *everyone* and then there is a 1% chance for any person that they participate. Obviously, that’s not going to happen – even if your target population was, in fact, only College Freshmen at a Large Midwestern University. First of all, how do you even know who *everyone* is, much less contact them? Second, it is extremely unlikely that everybody participates with the same probability. For example, some people simply would never want to participate in your survey study or experiment, no matter what, and you (usually) can’t force them to change their minds.^{[5]}…institutional review boards these days, amirite? Don’t worry; it’s not you, it’s them! But you still have to deal with the aftermath in a productive manner.^{[6]}And the beforemath, but this is for a different blog post.

In a somewhat more realistic yet still benign scenario, some people are more likely to end up in your study than others – but you can explain that with the help of variables that you have actually collected. For example, women and people with higher educational attainment may be more likely to participate in your study. However, within groups defined by gender and educational attainment, it’s again just random who provides data – so all women with bachelor’s degrees end up in your sample with some probability *p*; all men with high school diplomas end up in your sample with some (lower) probability *q*. Within those groups, those who do respond are not systematically different from those who don’t; they are exchangeable.

This is great news because it means that you can accurately estimate the mean of the variable of interest for subgroups defined by gender and education. If you also happen to know the distribution of gender and education in the population of interest, you can furthermore combine these subgroup-specific values with appropriate weighting and end up with an unbiased estimate for the population. For this to work, of course, all groups need to be present in your data in the first place (this is referred to as positivity, see e.g. Mercer et al., 2017). If *everybody *in your study has a high school diploma but the population of interest contains a non-negligible number of people without one, that puts you in a bad spot. Thus, a few “atypical” sprinkles on your sample may often be helpful, even if it doesn’t live up to the ideal of representativeness.
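The reweighting logic can be sketched in a few lines (population size, group means, and response probabilities are all invented): because response depends only on education, per-group means combined with the known population shares recover the population mean that the raw sample mean misses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented population: two education groups with different mean happiness
# and different response probabilities (MAR: response depends only on group).
group = rng.integers(0, 2, size=200_000)          # 0 = low, 1 = high education
happiness = np.where(group == 1, 7.0, 5.0) + rng.normal(0, 1, group.size)
respond = rng.random(group.size) < np.where(group == 1, 0.30, 0.05)

pop_mean = happiness.mean()
naive = happiness[respond].mean()                  # biased toward the eager group

# Post-stratification: per-group means, weighted by known population shares
shares = np.bincount(group) / group.size
group_means = [happiness[respond & (group == g)].mean() for g in (0, 1)]
weighted = float(np.dot(shares, group_means))

# The naive estimate overshoots; the weighted one sits near the truth
print(round(pop_mean, 2), round(naive, 2), round(weighted, 2))
```

Note that this only works because both groups actually show up among the respondents (positivity) and because, within groups, responders and non-responders are exchangeable by construction.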

The most likely scenario is of course one in which you neither have a truly random sample of the population nor the necessary variables to render it random conditional on available information. These come in two variations with slightly different causal graphs.

First, imagine you want to estimate how many people are vaccinated against COVID-19 by asking people about their vaccination status. Now imagine that trusting the scientific establishment makes one more likely to get vaccinated, and also makes one more likely to participate in surveys conducted by scientists. We probably did not measure trust in the scientific establishment, and even if we did, we would additionally need to have a credible estimate of its distribution in the general population. But at least in principle, we can *imagine *a situation in which we had those pieces of information that allowed us to reweight the data. Second, start from the same scenario but now whether or not people participate in the study actually causally depends *on their vaccination status*, which is the outcome of interest. In this scenario, collecting additional variables won’t help, so referring to future studies offers little solace.

In either case, trying to credibly estimate the prevalence of vaccination will cause vexation. Depending on the stakes and the purpose of your research, you may still proceed, but any further step will hinge on assumptions and result in more uncertain claims. For example, you may be willing to make some assumptions about the direction of the bias introduced by nonrandom sampling. If people who trust the scientific establishment are more likely to participate and more likely to be vaccinated, your estimate may be a plausible upper bound.
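A quick simulation of the first scenario (mechanism and all numbers invented) illustrates the upper-bound reasoning: when a latent “trust” variable raises both the probability of being vaccinated and the probability of taking part in the survey, the survey estimate lands above the population rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented mechanism: latent trust in the scientific establishment raises
# both vaccination probability and survey participation probability.
trust = rng.normal(0, 1, 500_000)
vaccinated = rng.random(trust.size) < 1 / (1 + np.exp(-(0.2 + 1.0 * trust)))
participates = rng.random(trust.size) < 1 / (1 + np.exp(-(-2.0 + 1.0 * trust)))

pop_rate = vaccinated.mean()
survey_rate = vaccinated[participates].mean()

# The survey overestimates, making it a plausible upper bound for the truth
print(survey_rate > pop_rate)  # True
```

Flipping the sign of trust in the participation equation would flip the direction of the bias, which is exactly why the bound argument hinges on an assumption about the selection mechanism rather than on the data alone.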

You could also reason that actually, the selectivity of your sample isn’t a big issue for estimating the population mean because of the relationships of the involved variables. For example, let’s say you want to estimate the population average of the ratio between the length of the first and the second toe (1D:2D ratio) but your sample only includes people who are very eager to participate in very important research studies. One could argue that the selectivity doesn’t matter, because why would people who are eager to participate in studies vary systematically with respect to their 1D:2D ratio? Of course, the lack of an association is an assumption that may be disproven in future studies, so better be on your toes when taking this inferential risk.

Speaking of which, assuming that you have collected all the information that you need to render the sample “as good as random” is also just that, an assumption that may be wrong. This also holds for “nationally representative” panel and survey studies that are frequently used in the social sciences. These studies often invest a lot of effort into ensuring that samples are as good as possible, implementing complex sampling schemes and taking measures so that people don’t drop out,^{[7]}For example, interviewers of the German SOEP have a budget to buy small gifts such as flowers to keep respondents happy. And they say romance is dead. and then on top give you some weights to reweight the data to take into account sampling design and potential non-response bias. But those people aren’t magicians either, and so they have to rely on information they have (variables such as federal state, region, gender, and age). It is still possible that there are other factors that affect the probability of participation beyond that (at the very least, whether or not one has the patience to participate in a nationally representative panel study). So your sample may look representative on observable characteristics but could still be non-representative because of unobservables. Whether or not these will bias your estimate will depend on how they relate to the outcome of interest. Personally, I would often be willing to say “it’s the best type of data we can get, so this is our best bet” *unless *you are trying to estimate something like “liking of panel studies” in the general population.

A truly representative sample is more a useful abstraction than something you will actually ever work with,^{[8]} and the heavy lifting is done by assumptions about the data-generating process, that is: about how your sample was generated, and how the involved variables relate to the outcome of interest.

Another way to think about this, which you may find helpful^{[9]} for figuring out the specifics of your data situation, is in terms of missing data problems. If you (hypothetically) invite everyone to join your study and only some respond, the data of the non-respondents are missing. This missingness may be completely random (missing completely at random, MCAR), which results in a representative sample; it may be random conditional on information available to you (missing at random, MAR); or it may be missing in a non-random manner (not missing at random, NMAR; also known as missing not at random, MNAR).^{[10]} The last one maps onto the two vexing vaccination rate scenarios described above: there may be variables that render the missingness random but that you unfortunately did not measure; or, alternatively, the missingness directly depends on the variable of interest.
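To see the difference in action, here is a minimal simulation of the vaccination-survey situation (all numbers are made up): under MCAR, the respondents’ mean is an unbiased estimate of the population rate; under MNAR, where vaccination status itself drives who responds, it is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: true vaccination status, with a true rate of 70%.
n = 100_000
vaccinated = rng.random(n) < 0.70

# MCAR: everyone responds with the same probability (30%).
responds_mcar = rng.random(n) < 0.30
est_mcar = vaccinated[responds_mcar].mean()

# MNAR: response probability depends on the variable of interest itself --
# vaccinated people are twice as likely to answer a vaccination survey.
p_respond = np.where(vaccinated, 0.40, 0.20)
responds_mnar = rng.random(n) < p_respond
est_mnar = vaccinated[responds_mnar].mean()

print(f"true rate:     {vaccinated.mean():.3f}")
print(f"MCAR estimate: {est_mcar:.3f}")  # close to the truth
print(f"MNAR estimate: {est_mnar:.3f}")  # biased upward
```

No amount of extra data fixes the MNAR case here; only assumptions about (or measurements of) the response mechanism can.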

There is a very nice paper by Felix Thoemmes and Karthika Mohan discussing these problems with DAGs from a causal perspective, because all of this hinges on the causal mechanism that generated the data, and thus, eventually, on the assumptions that we are willing to make.

This was a lot of exposition for an inferential problem that we rarely encounter in psychology. Sometimes we do want to estimate prevalences and means (e.g., in psychiatric epidemiology, when generating norms for IQ tests), but usually, we care little about it and happily standardize everything away (which isn’t necessarily a good thing and maybe a topic for a future blog post). But a lot of the previous points generalize to the other scenarios, so let’s move on to causal effects.

You run an experiment to identify some causal effect, but your sample is not representative of the population of interest. The good news is that your experiment will give you the average causal effect of the manipulation you implemented *for the people in your sample*. The not-so-good news is that this won’t necessarily be the average causal effect in the general population. Just like average happiness may differ between populations of interest, so may the average effect of some intervention on happiness. One way to “get rid of” this problem is to simply *assume* that the effect is the same for everyone and call it a day.^{[11]}

A priori, this may not be the most satisfying solution given many psychologists’ insistence that the human psyche is so complex that everything (including interindividual differences) interacts with everything in mysterious ways. It’s even less satisfying once we additionally consider that, in non-linear models, everything may be interacting even if the coefficients don’t directly tell us that.

So effect homogeneity should not be the default assumption for all of psychology. It may still be a solid, defensible assumption for certain domains – for example, I’m personally quite willing to assume that certain fundamental psychophysics effects are very similar across people from the same population. For anything closer to the *social* spectrum of psychology, I’d be more skeptical. This is where the aforementioned study enters the stage: Coppock, Leeper, and Mullinix (2018) tried to quantify the amount of effect heterogeneity in a number of “standard social science survey experiments”^{[12]} along certain dimensions (e.g., age, gender, and education, but also partisanship). They conclude that there seems to be fairly little effect heterogeneity *for these types of experiments, along these dimensions*, which is good news for people conducting these types of studies in non-representative samples. Whether this generalizability generalizes to other types of experiments is, of course, another question.
The authors are quite careful with their language, saying explicitly that “[t]he response to this evidence should not be that any convenience sample can be used to study any treatment without concern about generalizability.” (It’s a good paper and short, so give it a read.)

But let’s assume we can’t assume that the effects are homogeneous. What now? The logic from Scenario I still applies. When everybody from the population had the same chance to end up in your sample (“representative”, equal probability sample, missing completely at random), you can simply do your usual analysis (e.g., a t-test) to estimate the average effect in your sample, and this will be an unbiased estimate of the average effect in the population of interest.

In the scenario in which your sample is selective but in ways that you can fully explain (e.g., women and highly educated individuals may once again be more likely to end up in your sample; missing at random conditional on available information), you can again essentially reweight the data to determine the average effect of interest in the target population.^{[13]} And what if you don’t understand the selectiveness or lack the information you need to generalize your estimate to the population (not missing not at random…not)? Again, you can try to make some more or less well-informed guesses, but those should be accompanied by appropriate justifications and a discussion of the necessary assumptions.
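As a bare-bones sketch of the reweighting logic (not the full multilevel approach from the paper; the group labels, population shares, and effect sizes are invented), suppose the treatment effect differs by education and your convenience sample flips the population’s education shares:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical effect heterogeneity: the treatment effect is 0.2 in the
# low-education group and 0.6 in the high-education group.
# Population shares: 70% low, 30% high -> true average effect = 0.32.
true_effects = {"low": 0.2, "high": 0.6}
pop_shares = {"low": 0.7, "high": 0.3}

# Convenience sample overrepresents the highly educated (shares flipped).
n = {"low": 3000, "high": 7000}
group_effects = {}
for g in ("low", "high"):
    treat = rng.random(n[g]) < 0.5  # randomized within the sample
    y = true_effects[g] * treat + rng.normal(0, 1, n[g])
    group_effects[g] = y[treat].mean() - y[~treat].mean()

# Naive estimate: weight group effects by their *sample* shares.
naive = sum(group_effects[g] * n[g] / 10_000 for g in n)

# Poststratified estimate: weight by the known *population* shares.
poststrat = sum(group_effects[g] * pop_shares[g] for g in n)

print(f"naive: {naive:.3f}, poststratified: {poststrat:.3f} (truth: 0.32)")
```

The reweighting only works because the selection here runs entirely through an observed variable (education) – the MAR assumption doing the heavy lifting.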

Of course, you may run an experiment and just say “well, I just want to estimate the average effect in that particular sample, isn’t that interesting enough?” As Borysław Paulewicz writes in this Twitter thread, at least you’re able to show that your intervention works, even if only for some people. One way to think about this is that you just declare your sample the population of interest (I’m interested in *precisely* those people). But given that nobody writes discussion sections about *precisely* those people (“These findings demonstrate that for John, Robert, Michael, Linda, Jessica, and Sarah, thinking of a recent painful episode in their lives leads to…”), what is more likely to happen is some sort of implicit generalization fueled by implicit assumptions. Surely, effect heterogeneity won’t be so large that the intervention has the opposite effect in large parts of the population? And even if there were some people for which it had a large negative effect, at least some would have ended up in your non-random sample by chance alone, right? Those assumptions may sometimes be justified (e.g., when we have good reasons, including prior studies, to suspect there won’t be much systematic heterogeneity), and maybe sometimes they aren’t.

Claiming that you don’t want to generalize at all may often be the path of least resistance, not least because of the following (generalizable, robust) empirical law: If there is a tempting yet not very well-supported conclusion that could be drawn from your study, readers will draw it *even if you tell them they shouldn’t*^{[14]}^{[15]} – whether you like it (because it makes your research more interesting while maintaining plausible deniability) or not (because you want to stay in control of the narrative). This is why I am somewhat skeptical about the utility of so-called “Constraints on Generality” statements and more in favor of spelling out the assumptions necessary to support generalizations.

But let’s say your question doesn’t lend itself to experiments and you end up trying to estimate a causal effect from observational data. This makes things more challenging because even with perfect data from the whole population, estimating causal effects from observational data requires a lot of assumptions. But the interesting thing that I want to highlight here is that these requirements can collide with non-representative samples in unfortunate ways. If you’ve been following this blog, you can already guess what’s coming next. It’s our old frenemy collider bias.

Collider bias can become an issue when participation in your study is causally affected by the variables whose causal relationship you’re interested in. To ~~regurgitate~~ resuscitate a previous example with some modification, let’s say you’re interested in how intelligence affects people’s willingness to keep doing some mindless repetitive task when told to do so. It’s hard to get the general population to participate in a study that involves mindless repetitive tasks, but psychology students are readily available.^{[16]} So, you invite them into your lab, let them fill out some intelligence test, and then tell them they have to cross out every single occurrence of the letter “p” in Dostoevsky’s “The Idiot.” Then you hand them the book and a pen and record how long it takes for them to say “what the hell?” and quit.

You observe that smarter people quit earlier, which lends itself to a certain narrative: smart people notice how stupid your study is and thus leave. But correlation does not equal causation, so you may start thinking about potential confounders (e.g., gender, age). However, in the described setup, you don’t only have to worry about common cause confounding (the vanilla flavor of confounding), you also have to worry about collider bias induced via sampling. Being smart increases people’s chance to study psychology (or really to study anything at all), but so does being conscientious (in particular in Germany, where you need to finish school with top grades to be admitted). This alone is sufficient to induce a spurious negative association between intelligence and conscientiousness in psychology students. So, your smarter psych drones may actually quit earlier not because they are clever, but because they are less conscientious – and your sampling induced that spurious association between smarts and grit(s) in the first place. Thus, the non-representative sample induces new threats to causal claims based on observational data, *as if there hadn’t been enough already*!^{[17]}
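A quick simulation makes the point (the admission rule and all numbers are made up): even when intelligence and conscientiousness are uncorrelated in the population, selecting people on something that both traits feed into induces a clearly negative correlation within the sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: the two traits are independent by construction.
n = 200_000
intelligence = rng.normal(0, 1, n)
conscientiousness = rng.normal(0, 1, n)

# Selection into the sample (admission to a psychology program):
# both traits raise the chance of getting in.
admitted = intelligence + conscientiousness + rng.normal(0, 1, n) > 1.5

r_population = np.corrcoef(intelligence, conscientiousness)[0, 1]
r_students = np.corrcoef(intelligence[admitted],
                         conscientiousness[admitted])[0, 1]

print(f"r in population:    {r_population:.2f}")  # ~0
print(f"r among 'students': {r_students:.2f}")    # clearly negative
```

Among the admitted, knowing that someone is *not* particularly smart tells you they probably got in by being conscientious – which is exactly the spurious trade-off the text describes.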

Actually, even if you are just interested in the correlation (e.g., between intelligence and task persistence), the correlation you observe in your sample won’t necessarily be the correlation in the population.^{[18]} Consider the case of online personality Aella, who happens to be smart and popular in rationalist circles, and a sex worker. Aella likes to do Twitter polls that ask for two things at once, say “are you a boy or a girl” and “what is your opinion on [societally shunned sexual act that doesn’t really harm anyone]?” This can be used to get at the correlation between gender and opinions on [societally shunned sexual act] *among Aella’s followers*. But it is likely quite uninformative about the correlation in the general population, because following Aella on Twitter likely causally depends on gender (potentially mediated by horniness) and on being into rationalist stuff. That means that Aella is a collider that introduces (or masks) all sorts of correlations between gender and attitudes on a wide variety of topics in ways that are hard to predict unless we know what causes people to follow Aella. That said, this needn’t invalidate all conclusions from her polls (it depends on the causal graph of the involved variables), and she is now also using other sources to recruit respondents for her forays into human sexuality. She is also doing many things right (like sharing her raw data, plotting and reporting means), and frankly nobody is going to include questions on obscure kinks in major household panels anytime soon, so let him/her who never fell for collider bias or who only uses representative samples cast the first stone.^{[19]}

What to do? You already know the MCAR/MAR/NMAR logic, and it applies here as well. If it is random who from the population of interest ended up in your sample and who didn’t (MCAR), there is no (additional) problem (beyond the usual headaches of causal inference). If we are willing to assume that we understand enough about the selectivity and if we collected the right variables, we may be able to fix the selectivity issue by modeling it (MAR). In economics, if I understood correctly, James Heckman received a Nobel Prize for his work on that topic; in psychology, corrections for range restriction have been developed.^{[20]} And if we don’t have a proper grasp on the selectivity, we have to tread carefully, as inference will heavily hinge on additional assumptions.

Of course, they aren’t. Non-representative samples can threaten both external validity (scenarios I–III) and internal validity (scenario III). To what extent those threats should worry you will depend on what you are trying to do and which assumptions you are willing to make (e.g., considering the selection mechanism, considering the amount of effect heterogeneity in the population). Going “forward”, this may affect how you plan your studies. Going “backward” – taking the data as is, with all its imperfections (#AllDataAreBeautiful) – this may affect how you analyze the data to arrive at the best possible answer. I feel like we as a field can do much better on *both* fronts.

Considering study planning, one conclusion could be that *sometimes*, a convenience sample just won’t scratch the inference itch, so maybe it would be wiser to pool resources for higher-quality sampling (to, e.g., at least get a couple of non-students so that you can try to project estimates onto the general population using assumptions)^{[21]} or to rely on existing national surveys (even if they don’t contain the 300-item personality questionnaire of your dreams/nightmares).

Considering data analysis, it feels to me that while psychologists are very happy to talk about interactions of all sorts, there is no broad understanding that effects may vary between individuals in less tractable ways, thus rendering effect estimates from individual studies a property of the respective sample. This, of course, leads to the broader issue that we often treat “the effect of X on Y” as some disembodied law of nature, rather than an abstraction that can be specified in precise terms which will vary depending on your study. More precision about what we are trying to estimate (i.e., estimands) won’t hurt, even if at some point it does turn out that *empirically*, effect heterogeneity is not that big of a deal. Maybe we felt the future all along! But until then, it’s just an assumption.

Footnotes

↑1, ↑2 | Such as motivated reasoning. |

↑3 | This was brought to my attention by Stas Kolenikov and Raphael Nishimura after this post had been published, how embarrassing. The two of them plus Andrew Mercer provided some super helpful explanations and pointers to the literature, for which I am very grateful. Folks, listen to your local survey methodologist/statistician! As a side note, readers of this blog may also enjoy Nishimura’s pinned tweet. |

↑4 | Stas brings up a study on American Jews for which he was the survey designer. For example, in America, Jews tend to live in some specific urban areas; sampling more heavily in those areas gives you better bang for the buck. |

↑5 | …institutional review boards these days, amirite? |

↑6 | And the beforemath, but this is for a different blog post. |

↑7 | For example, interviewers of the German SOEP have a budget to buy small gifts such as flowers to keep respondents happy. And they say romance is dead. |

↑8 | Except for edge cases, such as compulsory schooling or those magical Scandinavian countries that collect all sorts of registry data. Though this is cheating, because if the probability of inclusion in the sample is virtually 1, it’s no longer a sample, is it. |

↑9 | Or confusing, in which case just ignore this paragraph. |

↑10 | Whether Rubin was trolling when he came up with these NightMARish acronyms is unknown. Yet here we MAR. |

↑11 | Putting the “con” in convenience sample |

↑12 | I couldn’t find a table explaining all included experimental procedures, so I pulled up two examples to get an idea of the type of paradigm they looked at. In one experiment, the outcome of interest was support for the death penalty and the treatments involved three argument conditions – either respondents were asked right away (1), or they were presented with the information that sources say that it is unfair because most people who are executed are African American (2) or that it is unfair because too many innocent people are being executed (3). In another one, respondents were asked for their support of different bills, but the description of the bill either contained no cue (1), or mentioned that it was supported by either Obama (2) or by McCain (3). |

↑13 | There are different ways to do this; in Deffner et al. (2022) we suggest multilevel regression with poststratification as a principled approach. |

↑14 | This was once revealed to me in a dream. |

↑15 | A corollary to this is “The Law of Lakens’ guidelines” according to which, whenever you try to make the point that researchers should not follow certain guidelines per default, you will get cited as a source of said guidelines. Many thanks to Taym, who brought this to my attention. |

↑16 | To the same degree that course credit is available to them. |

↑17 | As a side note, even if you end up with a representative sample, collider bias via selection can still be an issue for causal inference because the target population may already be affected by it. For example, the following populations have been going through some sort of selection filter that may introduce all sorts of spurious correlations: women in STEM, people in management positions, people who survive to the age of 80, people who survive childhood, humans who have been born. |

↑18 | If you are doing an experiment but then want to investigate how non-manipulated third variables “moderate” the effect (even just in a non-causal sense, see here), all of the concerns in this section apply. |

↑19 | Never fell for collider bias –> Casting stone <– Only uses representative samples |

↑20 | I used to think these corrections were “causally agnostic”, but Brenton Wiernik informed me that these corrections are a lot more sophisticated than I had thought, and I absolutely trust Brenton on these issues (and so should you). |

↑21 | Ruben pointed out that we may be doing ourselves a disservice by only having catchy words for representative samples (expensive and potentially unattainable) and convenience samples (which may range from psych students at Harvard to MTurkers). I concur, all samples are inconvenient, but some are (more) useful (for generalization). Of course, the survey people are way ahead of us by not talking about representative samples at all, but instead providing more precise definitions of features of samples in relation to inferential goals. |

I am not very keen to join the stats wars, but if I *had to* join, I would rally under the banner of House Cause. That is the one framework I’d champion in a (randomised controlled) trial-by-combat if necessary: Authors should spell out their analysis goal (their estimands) as clearly as possible. In particular, they shouldn’t be weasley when it comes to causality:^{[1]} If what you are interested in looks like a causal effect, has the same implications as a causal effect, and quacks like a causal effect, then it probably *is* a causal effect. Furthermore, authors should explicitly spell out the assumptions under which the data can inform us about the theoretical estimand of interest.

In many cases, when the goal is to infer a causal effect and the data are observational, these assumptions will have to be quite strong. Usually, the more complex the estimand, the worse it gets. A total (i.e., totally vanilla) causal effect from observational data? That is going to involve the assumption that there are no unobserved confounders, which tends to be hard to defend, in psychology at least. An indirect (i.e., mediated) effect from observational data? That will actually be the combination of two causal effects, so now we need to make assumptions about a lack of confounding between *multiple pairs of variables*. Moderated mediation from observational data, in which we want to figure out how some ~~third~~ fourth variable affects those paths? That’s even worse, and you should feel worse.

Or maybe you don’t, in which case there is a recent article by us^{[2]} on the matter. Some time ago, the Journal of Media Psychology linked our article in a tweet in which they stated that they had been desk-rejecting a lot of manuscripts looking at mediation in cross-sectional data.

Quite a few people were unhappy about this tweet because they assumed it implied that all cross-sectional mediation analysis shall be desk-rejected henceforth, no matter the details. The intended message of the tweet was, of course, more nuanced. As it should be! After all, mediation analysis on the basis of cross-sectional, observational data is not impossible per se. It’s just that it is not very convincing in like 90% of cases.

But whenever people start to argue about mediation analysis, there is one particular defense that I find particularly intriguing and worth discussing:

Cross-sectional, observational mediation analysis is *not *about estimating or even inferring causal effects – that is impossible with observational data, by design. However, one may hypothesize a particular mechanism that implies an indirect effect. Now the correlations in the data may be compatible with such an indirect effect, and that should provide some evidence for the hypothesis. After all, my hypothesis predicts something (a correlational pattern), and then that prediction pans out.

This can be leveraged as a defense against anybody who generally criticizes causal inferences on the basis of observational data. Sure, correlation does not imply causation, but causation implies a specific pattern of correlations, and that must count for something. From that perspective, we don’t really need causal inference, do we? We just need a theory that makes predictions, and then we check whether the data align with them.^{[3]} And all those pesky assumptions somehow seem to have disappeared, or at least they appear less central.

From this angle, one may even think that the more complex the estimands, the better. If we hypothesize that there’s a particular causal effect of X on Y, that may just imply a simple correlation between the two. But if we are interested in a more complex causal chain, that implies a whole pattern of correlations. So our prediction becomes more specific, and if our data confirm these increasingly specific predictions, it just feels like we should become more confident in the underlying hypotheses. But before, we said that more complex estimands require more assumptions and are thus harder to defend. What gives?

[Trying to estimate a causal effect from observational data] and [Testing whether observational data are aligned with some (causal) hypothesis] may seem like different inference games with different rules. What ties the two together is something you may have noticed was missing in the previous part: Testing a hypothesis is actually not so much about trying to confirm it, but about trying to falsify it. Falsification is what makes hypothesis testing powerful in the first place – it’s not very impressive to find support for your hypothesis in a study that had no chance of falsifying it.^{[4]}

The missing ingredient to make hypothesis testing spicy is the notion of *test severity*, which tells you the extent to which an empirical observation (e.g., a correlation between X and Y) corroborates a hypothesis (e.g., X has a causal effect on Y). Roughly speaking, an empirical observation provides a severe test of a hypothesis if it is (1) very likely to occur if your hypothesis is true and (2) very unlikely to occur otherwise. Test severity can be formalized as p(observation|hypothesis, background knowledge about the world) divided by p(observation|background knowledge about the world). The higher the probability of the observation given your hypothesis plus any background knowledge about the world, p(observation|hypothesis, background knowledge), the higher the severity (“my theory said so!”). The higher the probability of the observation *given just your background knowledge*, p(observation|background knowledge), the lower the severity. A finding that makes you go “no shit, Sherlock” (high probability given background knowledge) is unlikely to provide a severe test for some novel and counterintuitive hypothesis about the world.

For example, let’s say your hypothesis is that subjective feelings of happiness improve health. Your empirical observation is that people who say they are happy also report that they are healthy. But everything that you know about the world *already tells you* that happiness and health should be positively correlated. So the denominator will be large and thus test severity will be low. You’ll need something stronger to support your claim about happiness and health.^{[5]}

Now that we have the concept of test severity, we can link back to the causal inference framework. If we want to use the observed correlation between happiness and health as an estimate for the causal effect of happiness on health, we have to assume that those two variables are not confounded and that there is no reverse causality with health affecting happiness. Those are, of course, unrealistic assumptions. And these unrealistic assumptions are what render the observed correlation a weak test of the causal effect of interest. Precisely because we know that these assumptions do not hold (for example, because of common causes), we already expect a correlation between the variables. So the denominator, p(observation|background knowledge) will be high and severity will be low.
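To illustrate with a toy simulation (the variable names and coefficients are invented): in a world where happiness has *zero* causal effect on health, a shared common cause alone already produces the “predicted” correlation. The observation is nearly guaranteed by background knowledge, so p(observation|background) is high and the test is weak.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical world in which the hypothesis is FALSE: happiness has zero
# causal effect on health, but both share a common cause.
n = 100_000
circumstances = rng.normal(0, 1, n)  # e.g., socioeconomic conditions
happiness = 0.6 * circumstances + rng.normal(0, 1, n)
health = 0.6 * circumstances + rng.normal(0, 1, n)  # no happiness term!

r = np.corrcoef(happiness, health)[0, 1]
print(f"observed correlation despite zero causal effect: {r:.2f}")
```

The same correlation shows up whether or not the hypothesized effect exists, which is exactly what it means for the observed correlation to be a non-severe test of it.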

Now, back to mediation – how does this play out there? When people criticize cross-sectional observational mediation analysis, it is often because at least one of the paths is trivially explained without the causal effect of interest. For example, a common pattern involves a psychological mediator and then some vaguely related psychological outcome, both assessed via self-report. Those variables are almost guaranteed to be correlated to some degree, because they may be affected by shared response biases, or by underlying personality traits. So usually, that part of the mediation chain is not subjected to a severe test, and without it, the whole mediation claim turns moot. Of course, often the first path (“independent” variable to mediator) suffers from the same issue, which doesn’t improve the situation: the severity of your mediation test ends up being twice nothing at all, which is still nothing at all.

To sum it up: the assumptions necessary to estimate a causal effect from data correspond to the severity of the test that the data provide for the hypothesized effect. If you have to make unrealistic assumptions (that violate your background knowledge), that means that the data provide a weak test of the effect (because said background knowledge already explains the association). If some data seem like a weak test because of an obvious alternative explanation (some background knowledge that “explains away” the observation without the need to invoke the effect of interest), the resulting estimate of the effect will rely on the assumption that the alternative source of association doesn’t exist. What makes or breaks the inference, regardless of whether you frame it as an attempt to estimate a causal effect or as an attempt to test some causal hypothesis, is the same: the other stuff that you know about the world. Curse you, background knowledge!

In the end, with some simplification, saying “those are unrealistic assumptions” is like saying “that’s a super weak test.”^{[6]} Maybe this correspondence seems trivial, but some researchers seem, at the same time, (1) very averse to assumptions (“Unconfoundedness? At this time of year? In this field of research?”) and (2) opportunistically open to weak tests (“Sure, it is no conclusive proof, I mean nothing is ever conclusive in science, but the data are compatible with this story”).

Now I don’t believe that weak tests are necessarily worthless, just like I don’t believe that causal estimates that rest on implausible assumptions are necessarily worthless. What seems important to me is that the matter is discussed transparently. And this is precisely why I find the assumptions framing a bit neater: explicitly stating that an analysis rests on the assumption that X and Y don’t share any common causes except for A, B and C makes it very easy for readers to consult their background knowledge and check whether they consider that plausible. So personally, I don’t believe that cross-sectional observational mediation analyses should be desk-rejected by default. But papers that don’t make any effort to be transparent about the underlying assumptions, or that try to sell a weak test as something more severe? Maybe those should be sent back by default.

Footnotes

↑1 | like those wimps of House Granger |

↑2 | The 75% CI and Paul |

↑3 | Indeed, I once heard somebody voice confusion about the whole notion of causal inference – what would it be good for? |

↑4 | This is what Deborah Mayo calls “bad evidence, no test” (BENT). |

↑5 | e.g., a shot of Stroh-Rum |

↑6 | I have only talked about the case in which the hypothesis is “effect X on Y exists.” That may be the most relevant case for psychology in its current state, but I believe the correspondence holds up more generally for e.g., hypotheses about the absence of effects, directed hypotheses. There are interesting implications for how to think about sensitivity analyses along the lines of “how large would confounding influences need to be to flip the sign of the estimate?” and for experimental studies in which questions are less about the presence of effects and more about the specific mechanisms. I have a truly marvelous discussion of these issues which this footnote is too narrow to contain. |

But marginal effects are one of those really valuable pieces of the puzzle that make other seemingly unconnected things fall into place. All those convoluted coding systems for categorical variables. Why people habitually center predictor variables before calculating their product term to investigate interactions. Why everything “interacts mechanically” in a logistic regression, and, quite generally, why you might have done interactions in nonlinear models the wrong way (although that sort of depends on what question you were trying to ask in the first place). I learned about the whole idea of marginal effects relatively late, and that made all those things more puzzling than necessary. If somebody had told me earlier, that would have been great.

But Vincent Arel-Bundock just dropped a big release of the R package *marginaleffects* (check out the really detailed documentation here), which allows you to calculate marginal effects for more than *60 classes of models* in R, and Andrew Heiss^{[2]} wrote a super helpful, comprehensive (“75-minute read”) blog post on marginalia.^{[3]} Still, they left a margin for me to write my own blog post – so let’s talk marginal effects. This piece will be as non-technical as possible (pinky promise). If you want to go into more depth, definitely check out Andrew’s blog post (which is still very accessible) and the *marginaleffects* documentation. You could even go and check out the chapter on it in the Stata manual (although there is a good chance that if you’re using Stata, you’re working in a field where marginal effects are more well-known and thus you don’t need this blog post).

This is oversimplifying things a bit, but think of statistics as a two-step process (Figure 1, Panel a).

**Figure 1**

*Add a Figure, they said. It will add value, they said.*

In the first step, you build and estimate a statistical model. In the second step, marginal effects.

Think of the model as a prediction machine (Figure 1, Panel b): you can plug in any combination of predictor variables and it will return the predicted outcome value *for that particular combination*. This model could essentially be anything, from a simple t-test to psychologists’ favorite, *the* ANOVA, to regression with all its beautiful variations (hierarchical/multilevel/mixed models, all non-linear variations such as logistic regression…). You then use this model to generate all sorts of predictions, contrast them for different scenarios (e.g.: [Y: Your predicted knowledge of marginal effects for X: You continue to read this blog post] minus [Y: Your predicted knowledge of marginal effects for X: You stop right now and go do something else]), and aggregate them if necessary to answer your questions. Those are your marginal effects.

Will this process provide the right answer? It depends, obviously. Your data may not be able to provide good answers – maybe they’re biased, or there just aren’t enough of them. Your model may not appropriately capture the data and cause issues. The coefficients of interest may not be causally identified, so that the resulting marginal effects really aren’t effects in the narrower sense at all, but some spurious mess (we will still call them effects for the purpose of this blog post). These are all issues you can read up on in some detail in the psych literature. But, importantly, even if all these issues are taken care of (big if), you still have to query your statistical model in a way that returns an answer to the question you are interested in. In many cases, you need to calculate a marginal effect. That step – asking your model a precise question – sometimes seems a bit neglected in psych, to the point that we turned it into the title for our interactions paper, “Precise Answers to Vague Questions”. But it’s also something that’s simply neglected during our training, so one might be forgiven for thinking that a regression analysis ends with regression coefficients, when they are often just an intermediate result.

**Figure 2**

*Me:* not exactly sure what question I am trying to answer

*Statistical model:* returns a precise result down to the third decimal

*Me:*

Let’s start with a very simple example. There is a runaway trolley barreling down the railway tracks at 10 meters per second (36 kilometers per hour, or 22.3694 miles per hour) on a level plane. For simplicity, let’s assume that no forces (such as air resistance, friction, people tied up on the tracks) are involved, so that the trolley just keeps moving at constant speed, away from you in a straight line. To predict Y (the distance between you and the trolley, in meters) from X (time elapsed since the trolley left where you are standing, in seconds), we can use the following equation: y(x) = 10*x. To “use” such an equation, you simply take a value of X that is of interest to you and plug it in. Here, you probably won’t even need a calculator, but once equations get more complex you probably will. If you want to make your middle school physics teacher a happy person, add units (x is measured in seconds, y is measured in meters, so the coefficient would be 10 meters per second). Figure 3 visualizes this model.

**Figure 3**

*The World’s Easiest Trolley Problem.*

Now we are done with Step 1, we have got our model. We only have one predictor and the model isn’t exactly complex, so there’s pretty much only one marginal effect that we can calculate: the effect of X (time elapsed) on Y (distance between you and the trolley). The answer is as trivial as it gets: it’s 10 meters per second, the speed with which the trolley is traveling. There are different ways to get that number. We could use Figure 3 and draw a slope triangle. We could plug in two values for X and take the difference: after 10 seconds, the trolley is at 10*10 = 100 meters, one second later it is at 10*11 = 110 meters, so it traveled 110 – 100 = 10 meters in that second – 10 meters per second. We could calculate the first derivative (it’s a constant, 10). And we can *directly* get that number from the model coefficient in front of the predictor. It says 10 right there!
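
All of these routes lead to the same number, and it’s easy to check. A minimal sketch – in Python rather than R, since nothing here requires a fitted model, just arithmetic:

```python
# Prediction machine: position of the trolley (in meters) after x seconds.
def y(x):
    return 10 * x

# Route 1: plug in two values of X and take the difference.
effect_discrete = y(11) - y(10)  # 110 - 100 = 10 meters in one second

# Route 2: the first derivative, approximated by an extremely small step.
h = 1e-6
effect_derivative = (y(10 + h) - y(10)) / h  # approximately 10

# Route 3: read it off the coefficient in front of x. It says 10 right there.
```

Whichever route you take, the marginal effect is 10 meters per second – a constant, because the model is linear.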

This may feel a bit trivial because frankly the model is trivial and I already told you the speed of the trolley. But if the only statistical model you are ever using is univariate linear regression–one predictor, draw a single straight line, no frills–that is pretty much all there is about marginal effects. It’s the slope of the regression line, which you can read off the single regression coefficient.

Unfortunately, when things get more complex, this equivalence–regression coefficient of a predictor == its marginal effect–often breaks down. In those situations, the regression coefficients still tell you how to calculate the outcome variable Y for any combination of predictor variables. But we can no longer take for granted that the coefficient of a predictor reflects the effect of the predictor on the outcome in a straightforward manner.

“Things getting more complex” involves all scenarios in which the effect of a variable may vary–for example, when the effect of a variable depends on another variable (aka interaction, or effect modification). Here, the question “what’s *the *effect of X on Y” is suddenly underspecified to begin with, because according to the model, the answer depends on the value of a third variable.

“Things getting more complex” also involves anything outside of the realm of linear regression, such as logistic or probit regression. The logic of both types of regression is very similar to linear regression, except that our (linear) regression equation does not model the actual outcome of interest (which can take on the values 0 and 1) directly, but instead a nonlinear transformation of its probability. Because of that transformation, “the effect of X on Y” again varies. For example, if X is a continuous variable, a 1-point increment from 0 to 1 may have a different effect than a 1-point increment from 1 to 2. In that sense, in such a nonlinear regression, every variable “interacts with itself.” And if multiple predictors are involved, the effect of any of them will *automatically* depend on all others, even if no interaction terms are included–a phenomenon that Simonsohn has labeled “mechanical interaction.”

But, to stay with the trolley, let’s look at the easiest possible more complex example.

Let’s imagine the trolley was speeding away from you *on an inclined plane*. Apart from that, there are again no additional forces involved.^{[4]} The trolley will now accelerate constantly, which means that its speed is changing, but at a constant rate. Let’s say we can now predict the position of the trolley with the following formula: y(x) = 0.25*x², see Figure 4.

**Figure 4**

*Accelerating trolley. Homework assignment: After how many seconds has it reached the speed of light?*

What’s the effect of X (time elapsed) on Y (distance between you and the trolley)? In other words, how fast is the trolley traveling? Duh, it depends. It starts at 0 meters per second and after one minute, it’s like, *much* faster. Thus, we already know that the coefficient 0.25 from the equation describing the position of the trolley can’t possibly be the answer. But we can still calculate the precise speed at any point in time. Again, there is a graphical solution for this (take Figure 4, draw a tangent line and calculate its slope). We could also still plug in different values for X and take the difference of the predicted Ys, except that if we want to determine the speed *at a single point in time*, the difference between the X values we plug in has to be like really really small, as in infinitesimally small. If you figure out how to do that, congratulations–you reinvented differential calculus. If you learned differential calculus in high school, things get a bit easier because you already know that you need to determine the first derivative of the equation above, and that there are straightforward rules to do so (and even online calculators to do the job). Here, the equation that gives us the speed of the trolley at any point in time is 0.5*x. In other words, the trolley gains 0.5 meters per second per second, see Figure 5.

**Figure 5**

*First derivative of Figure 4. See **derivatives of polynomials**, although one could also apply the inverse of **Tai’s method for the determination of total area under a curve**.*

Now, we can still answer all sorts of questions about the speed of the trolley at any point in time, or questions like “the average speed of the trolley in the first 60 seconds.” And how about “a new trolley is put on the incline every 20 seconds, what is the average speed of all trolleys on the track after 5 minutes”? That might sound like your physics teacher went nuts, but it’s also getting closer to the type of marginal effects we are going to calculate next. For all of these questions, you can no longer *directly *read the answers off the initial equation describing the position of the trolley, you have to do some manual math first. That is something you’d usually want to avoid if models get more complex, so now let’s move on from simple equations of motion to an actual statistical model in the wild.
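
Before we move on, here is what that “manual math” looks like in code. The numerical derivative stands in for the tangent line of Figure 4; the many-trolleys calculation assumes one particular reading of the question (the first trolley starts at second 0, the last one has just been placed):

```python
def y(x):
    # position (meters) of the trolley x seconds after its start
    return 0.25 * x**2

def speed(x, h=1e-6):
    # numerical first derivative: the instantaneous marginal effect of time
    return (y(x + h) - y(x - h)) / (2 * h)

print(speed(10))  # ~5.0, matching the analytical derivative 0.5*x

# average speed over the first 60 seconds: total distance / total time
print(y(60) / 60)  # 15.0

# trolleys started at t = 0, 20, ..., 300; after 5 minutes their "ages"
# are 300, 280, ..., 0 seconds, and each trolley's speed is 0.5 * age
ages = range(0, 301, 20)
mean_speed = sum(0.5 * a for a in ages) / len(ages)
print(mean_speed)  # 75.0
```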

As a side note, we are now going to switch from a continuous predictor (time in seconds) to a categorical predictor. This makes it easier to think things through, because we do not have to concern ourselves with derivatives. In his blog post, Andrew Heiss reserves the term “marginal effect” for the continuous case – where we consider derivatives, i.e., the instantaneous slope, i.e., the change in the outcome associated with an infinitesimally small change in the predictor – and calls the categorical case, where we consider the change associated with a discrete change in the predictor, a “conditional effect” or “group contrast.” The Stata manual, in contrast, does not make that distinction but notes that some people use “partial effect” to refer to the categorical case. You may adopt whatever terminology you like (just use what the cool kids in your field use). The terminology matters less than the general idea underlying all of those effects (Figure 6), and I really just care that you get the general idea: we look at what our model predicts happens to the outcome if we change a predictor one way or another.

**Figure 6**

*A finger pointing at the moon is not the moon. Unless you really enjoy getting into arguments on Twitter.*

Tamás Keller and Felix Elwert, two sociologists with whom I have collaborated, planned and conducted a large field experiment in which the seating charts of 182 school classrooms were randomized to investigate whether kids who sit next to each other are more likely to report that they are friends at the end of the school term.

I did the analyses, and they got a bit more complicated because (1) the outcome (mutual friendship) is dichotomous, so we need a non-linear model, (2) students are nested within classrooms, and (3) the unit of analysis is any pair of students within a classroom, so each observation is also nested within students. Now you don’t need to follow the technical details of this model to follow the rest of this blog post, but just in case you’re interested: We accounted for these things by (1) using a probit model (which models a latent continuous friendship propensity, and if that takes on a positive value, we predict the dyad will be friends), by (2) including classroom fixed effects^{[5]} (so every class may have its own unique baseline friendship propensity), and by (3) including student random effects (every dyad is nested in two students, so technically it’s a multi-membership model).

You can look up the details in the published article, but what matters here is: we got all sorts of things in the model, but it’s still just a prediction machine – one that tries to tell whether two people will be friends. Wouldn’t that be helpful in real life? Substantively, we are interested in what happens to friendship when a dichotomous variable, “deskmate”, switches from 0 (the two students are not seated next to each other) to 1 (the two students are seated next to each other). The probit regression coefficient of that dichotomous deskmate variable is 0.27. So, the effect of being seated next to each other on friendship is ~~42~~ 0.27. 0.27 of what, you wonder? Well, 0.27 of “latent continuous friendship propensity”, whatever that is. Now this coefficient is different from zero, which may render it publishable depending on the field – but to actually understand what we can learn from the model, we need to figure out what it predicts with respect to actual friendships.

*In principle*, we can manually compute what our estimated model says for every pair of students in the data. For this, we plug the corresponding values (their classroom fixed effect, their student random effects, the overall intercept) into the regression equation and set “deskmate” to 1. We thus calculate the predicted latent friendship propensity for a pair of students, which in a last step we translate into an actual prediction of the outcome (friendship, yes or no) by checking whether it exceeds the threshold implied by the link function. Now we know what the model predicts in case those students are seated next to each other.

Next, we repeat the same exercise but set “deskmate” to zero. Now we know what the model predicts in case those students are not seated next to each other.

Last but not least, to get the effect of sitting next to each other on friendship *for this particular pair of students*, we take the difference between the two predictions. Here, we get one of three possible outcomes: the students are only predicted to be friends when they are seated next to each other (1 – 0 = 1); sitting next to each other doesn’t make a difference (1 – 1 = 0, 0 – 0 = 0); or, at least hypothetically, sitting next to each other turns them into non-friends (0 – 1 = -1).
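
The three steps can be written down as a toy calculation. Everything below is invented for illustration – in particular, a single number stands in for the sum of the classroom fixed effect and the student random effects – but the predict-twice-and-subtract logic is exactly the one just described:

```python
# Invented stand-ins for the fitted probit model's ingredients.
intercept = -1.0
b_deskmate = 0.27

def predict_friends(u, deskmate):
    # u: combined classroom fixed effect + student random effects (invented)
    latent = intercept + u + b_deskmate * deskmate  # latent friendship propensity
    return 1 if latent > 0 else 0  # friends predicted if propensity is positive

u = 0.9  # a rather sociable pair of students
effect = predict_friends(u, deskmate=1) - predict_friends(u, deskmate=0)
print(effect)  # 1: this pair is only predicted to be friends as deskmates
```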

In the end, we have an estimate of the causal effect for one particular observation in the data. This estimate may look different for other pairs of students in the data, because this is a non-linear model–whether the deskmate variable “pushes” a particular pair of students over the threshold for a friendship or not will depend on, for example, whether those students are more or less likely to befriend others in general (i.e., on their random effects). So we’d have to repeat the math for all 24,962 pairs of students in the data who could have become friends. Except doing this manually would of course be horribly tedious, even more so if we can’t afford student research assistants.

But this is the 21st century and I already told you that your statistical model is nothing but a prediction machine. For example, in R, after estimating pretty much any model you can come up with, you can use predict() or some alternative function to generate predictions for old and new data within seconds. So we can copy our data, set “deskmate” to 1 for every observation, and generate predictions. Here, we could take a short break and summarize our predictions to answer the question “What does the model predict if every pair was deskmates?”; such a summary of predictions is called a margin (according to the Stata manual). But let’s move on, copy the data once again, set “deskmate” to 0 for every observation, and subtract the results.

We end up with estimates of the causal effects for every single observation. One very natural way to summarize the distribution of these causal effects is simply taking the arithmetic mean. In our deskmate paper, this mean turned out to be 0.07, which means that being seated next to each other increased students’ friendship probability by 7 percentage points on average (from 15% to 22%, those are the two margins involved in this prediction). This is the so-called average marginal effect (AME), the mother of all marginal effects.
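
In code, the whole copy-predict-subtract-average routine fits in a few lines. The snippet below fakes a fitted probit model with invented coefficients (in R, you would instead call predict() on the two modified copies of the real data, or let *marginaleffects* handle it); only the mechanics matter here:

```python
import math
import random

random.seed(1)

intercept, b_deskmate = -1.0, 0.27  # invented, not the paper's estimates

def phi(z):
    # standard normal CDF, i.e., the inverse link of the probit model
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# invented stand-ins for each pair's combined fixed and random effects
pairs = [random.gauss(0, 1) for _ in range(10_000)]

p1 = [phi(intercept + u + b_deskmate) for u in pairs]  # margin: all deskmates
p0 = [phi(intercept + u) for u in pairs]               # margin: no deskmates
ame = sum(a - b for a, b in zip(p1, p0)) / len(pairs)
print(round(ame, 3))  # the average marginal effect, on the probability scale
```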

A technical detail that is important in practice: So far, we have completely ignored any uncertainty in that estimate, and usually we want more than just a point estimate. In our deskmate paper, we used a Bayesian approach, which allowed us to draw from the posterior – a very neat, generalized way to quantify uncertainty. That said, if you don’t want to do this “manually” like a pleb (which I did for our deskmate paper), or if you prefer a frequentist approach, the *marginaleffects* package has your back either way – it can work with *brms* but also with pretty much anything else (including *lme4*, *nlme*, and *survey*).

This whole idea–predict values for certain combinations of predictors, contrast them, aggregate them–is extremely flexible. For example, instead of aggregating the individual causal effects across the whole sample, you could aggregate them for certain subgroups that may be of interest. Or you could use predictions to figure out what the average marginal effect would be like if all observations had a certain property. For example, in our paper, we looked at the average effect of being seated next to each other for different degrees of similarity between students, and we will return to that for the last example. You could even go wild and try to calculate average causal effects in a different population, which obviously only works well under certain assumptions but neatly links to the topic of generalizability and transportability.

Also, nobody forces you to aggregate across observations in the first place. You may want to calculate a marginal effect for a particularly representative observation, or maybe for your uncle Bob. You could even, please bear with me, generate some hypothetical “average observation”–an observation where all covariates are exactly at the sample mean–and estimate the causal effect *for this particular observation*. Now this particular observation may not actually exist in your data, and *the effect for this average observation*, called the marginal effect at the mean, may or may not be a good approximation of *the average effect across observations*, the average marginal effect. In some models, the two will coincide. Why would the marginal effect at the mean warrant special attention? Actually, its logic is closely tied to how psychologists sometimes calculate certain marginal effects, maybe even inadvertently, by implementing specific coding schemes.
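
A tiny invented example shows why the two can diverge in a non-linear model. Suppose (purely for illustration) a probit model P(Y = 1) = Φ(x + d) with a binary treatment d and a covariate x observed at -3, 0, and 3, so the sample mean of x is 0:

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

xs = [-3.0, 0.0, 3.0]  # invented covariate values; their mean is 0

# marginal effect at the mean: the d = 0 -> 1 contrast with x fixed at mean(xs)
mem = phi(0.0 + 1) - phi(0.0)

# average marginal effect: the same contrast per observation, then averaged
ame = sum(phi(x + 1) - phi(x) for x in xs) / len(xs)

print(round(mem, 2), round(ame, 2))  # 0.34 0.12
```

The hypothetical average observation sits right where the probit curve is steepest, so its effect is much bigger than the average of the individual effects – in a linear model, the two numbers would coincide.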

Imagine a linear regression in which you want to predict some continuous outcome from two variables, X_{1} and X_{2}, as well as their product, X_{1}*X_{2}. In such scenarios, it is routinely recommended that researchers center X_{1} and X_{2} prior to calculating the product term. You *might* have heard that this is helpful because it reduces multicollinearity, which is good, because multicollinearity is supposedly bad. Now leaving aside that collinearity isn’t a disease that needs curing, centering the predictors before calculating the product term won’t really change *anything* about your statistical model. It will still return precisely the same predictions for every single observation, and the uncertainty in those predictions will be unchanged, and so whatever marginal effect you calculate, the answer will be exactly the same.

But what does change is the magnitude of your regression coefficients, and how you can interpret them. Your regression equation looks something like this: Y = b_{0} + b_{1}*X_{1} + b_{2}*X_{2} + b_{3}*X_{1}*X_{2} + E. Without centering predictors, b_{1} will be the effect of a 1-unit change in X_{1} *when X_{2} equals 0*. Because if X_{2} equals 0, the product term b_{3}*X_{1}*X_{2} drops out of the equation, and b_{1} is the only coefficient left in front of X_{1}.

When we do center the predictors, b_{1} will still be the effect of a 1-unit change in X_{1} when X_{2} equals 0. But after centering, X_{2} will be zero when it’s at the sample mean. So now b_{1} actually turns into the marginal effect at the mean. That is, in many cases, much more interpretable than the marginal effect at X_{2} = 0. Of course, even without centering, we could have calculated the marginal effect at the mean after estimating the model and arrived at the same answer, albeit in a second step.
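
You can see the relabeling at work with a few lines of arithmetic (coefficients invented): the effect of a 1-unit change in X_{1} depends on where X_{2} sits, and centering merely changes which value of X_{2} gets to be called zero:

```python
# invented uncentered coefficients: Y = b0 + b1*X1 + b2*X2 + b3*X1*X2
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 1.5

def yhat(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

x2_mean = 4.0  # pretend this is the sample mean of X2

# effect of a 1-unit change in X1, at X2 = 0 versus X2 = its sample mean
at_zero = yhat(1, 0) - yhat(0, 0)              # = b1 = 2.0
at_mean = yhat(1, x2_mean) - yhat(0, x2_mean)  # = b1 + b3 * x2_mean = 8.0
print(at_zero, at_mean)
```

Refitting the model with centered predictors would leave every prediction untouched but report 8.0 as the new b_{1} – the marginal effect at the mean, computed in one step instead of two.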

Let’s imagine a different scenario in which we want to use a linear regression to predict some continuous outcome from two factors, A and B, with two levels each, as well as their interaction. Okay, to be honest, this is not exactly a different scenario–it’s a special case of the previous scenario, in which X_{1} and X_{2} are dichotomous. But, in psychology, there is a dedicated section of the methods training for experimental designs that is all about ANOVAs, and tends to use different letters. Actually, let’s just stick with X_{1} and X_{2}, so that we can keep b for the regression coefficients without getting all confused about it. So, let’s imagine the same scenario as above, but X_{1} and X_{2} can only take on two possible values, and those represent different experimental conditions–we have a traditional 2×2 design.

We could use dummy coding, so that both X_{1} and X_{2} can take on the values 0 and 1. Our regression equation, as before, looks like this: Y = b_{0} + b_{1}*X_{1} + b_{2}*X_{2} + b_{3}*X_{1}*X_{2} + E. Just like above, b_{1} will be the effect of a 1-unit change in X_{1} *when X_{2} equals 0*. Here, X_{2} = 0 is simply one of the two experimental conditions (the reference category), so b_{1} is a *simple effect*: the effect of X_{1} within that particular condition.

But instead, we could also use effect coding. For example, we could code both variables so that they can take on the values -0.5 and +0.5. Again, Y = b_{0} + b_{1}*X_{1} + b_{2}*X_{2} + b_{3}*X_{1}*X_{2} + E, and b_{1} will be the effect of a 1-unit change in X_{1} *when X_{2} equals 0*. But now X_{2} = 0 no longer corresponds to one of the conditions – it lies exactly halfway between them. So b_{1} turns into the effect of X_{1} averaged across both levels of X_{2}, which is the *main effect* in ANOVA terms.
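
With invented cell means for the four conditions of a saturated 2×2 model, the two interpretations of b_{1} can be computed directly from the group means:

```python
# invented cell means of a 2x2 design, keyed by (X1, X2) condition
m = {(0, 0): 10.0, (0, 1): 12.0, (1, 0): 14.0, (1, 1): 20.0}

# dummy coding: b1 is the simple effect of X1 within the X2 = 0 condition
b1_dummy = m[(1, 0)] - m[(0, 0)]  # 4.0

# effect coding (-0.5/+0.5): b1 averages the X1 effect over both X2 levels
b1_effect = ((m[(1, 0)] - m[(0, 0)]) + (m[(1, 1)] - m[(0, 1)])) / 2  # 6.0

print(b1_dummy, b1_effect)
```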

Those coding tricks are neat once you understand how they work. They are also quite limited in what they can achieve–you always have to recode your data to get a different marginal effect; the types of marginal effects you can get are limited; and anything you estimate will be on the same scale as your regression coefficients. The latter is no problem as long as you do linear regression, where the scale of the regression coefficients *is* the scale of the outcome (aka “identity link function”). But it becomes a pretty big topic whenever you use a non-linear model, so for our last example, let’s return to the deskmate experiment.

In our paper on whether deskmates become friends, we were also interested in whether being seated next to each other “worked” better for some pairs of students than for others. In particular, it seemed plausible that this would most likely lead to friendship among students who are similar (and thus likely to befriend each other anyway). The kids in the study were in 3rd to 8th grade, so one particularly salient dimension is gender. If you assign a girl to sit next to a boy (yuck!), is there still a chance they become friends?

To implement this statistically, we used the same model as above, but additionally included in our probit model a categorical variable indicating the gender composition of the dyad (three possible levels: both girls, one girl one boy, both boys) as well as its product with the binary deskmate variable. If we look at the model-implied probit regression coefficients for each group, it does not look like our hypothesis pans out. For two girls, the deskmate effect is 0.27, for mixed dyads it’s 0.40, and for two boys it’s 0.39. This may look like the mixed dyads actually have the biggest effect of the intervention, although if we take into account the uncertainty in those estimates, we might as well conclude that nothing happens at all with respect to effect moderation—the effects appear to be statistically indistinguishable.

But those regression coefficients need to be interpreted on the scale “latent continuous dimension underlying the probit model”, and that’s a bit abstract. So what does the model predict with respect to the actual outcome, (manifest) friendships? To answer this question, we can calculate average marginal effects for different types of dyads, by following the prediction procedure described above repeatedly, each time fixing the categorical variable indicating gender to a different value. Thus, we look at the average marginal effects as if our student pairs were all boys, all girls, or all mixed.

If we follow that procedure and look at the increase in friendship probability, it is actually tiny for mixed dyads, and much bigger for same-gender dyads. So, if we do seat a boy next to a girl, the effect on their friendship probability is, on average, tiny in terms of percentage points (see Figure 7, which also shows the margins). This pattern is somewhat predictable if you know the different baseline probabilities of a friendship, the effects according to the regression coefficients and how the probit link works, but then again everything is obvious once you know the answer.
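
The arithmetic behind that pattern is easy to reproduce with a toy probit calculation (all numbers invented, not the paper’s estimates): a similar effect on the latent scale translates into very different effects on the probability scale when the baselines differ:

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

latent_effect = 0.35  # invented deskmate effect on the latent scale
baselines = {"same-gender": -0.9, "mixed": -2.5}  # invented latent baselines

for group, base in baselines.items():
    # effect on the probability scale: contrast of the two predicted margins
    print(group, round(phi(base + latent_effect) - phi(base), 3))
```

Because the mixed dyads start out far in the flat tail of the probit curve, the same latent shift barely moves their predicted friendship probability.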

**Figure 7**

*Model-implied friendship probabilities for pairs of students depending on their gender and the resulting effects of sitting next to each other.*

So here we have a case where calculating marginal effects on the actual outcome scale results in qualitatively different conclusions than simply intently staring at regression coefficients. This is one example of interactions being scale dependent; this property can be interpreted quite broadly to think about, for example, ceiling and floor effects–for this additional plot twist, check out our manuscript on issues with interactions.

And that’s pretty much it. You pick your statistical model, estimate the parameters, use it to crank out predictions, contrast the predictions you want to contrast, aggregate whichever way you want to. This is a very flexible all-purpose workflow; it doesn’t even discriminate between Frequentists and Bayesians. One possible “downside” is that you suddenly have to consider which effect estimate is actually relevant for your particular context, but of course that’s not a bug, *it’s a feature*. I’d predict that the marginal effect of caring more about marginal effects would be an increase in the manifest number of average marginal effects in the literature.

That said, of course, I haven’t really told you anything about how to figure out which marginal effect you are interested in. Do you even want to know the average marginal effect? It definitely makes sense if you think about rolling out an actual intervention and want to gauge its impact overall. But in other scenarios, a group-specific marginal effect may make more sense. Or maybe you want to illustrate what your model says for a number of particularly interesting observations, say, somebody with a median income, or somebody who is among the bottom or top 5% of earners. Maybe there are even instances where, in a non-linear model, you *do* want to know what happens on the underlying latent scale rather than on the outcome scale. For example, Simonsohn seems to consider interactions on that latent scale substantively more interesting and calls them “conceptual interactions” (though there are definitely people from other fields who would disagree with that assessment). The somewhat unsatisfying take-home message here could be that we often don’t know precisely what we want to estimate to begin with and might have to figure that out first. Though in any case, I believe it can’t hurt to provide multiple estimates to give a fuller picture of what the model says.

More generally, I believe that marginal effects are a good step towards a more productive understanding of statistical models. At least in psychology, we tend to teach them as special-purpose tools–use a t-test if you want to compare two groups, ANOVA if you want to analyze factorial experiments, regression if you want to deal with continuous predictors^{[6]}–and students may consider themselves lucky if, at some point, somebody points out to them that these all belong to the same class of models. This makes stats appear much more arcane and fiddly than it needs to be. Instead, we could focus on teaching the general idea that statistical models are (fallible) prediction machines that can be queried in certain ways to answer certain questions.^{[7]} That seems like a useful framework to generate a general sense of familiarity, which we can then leverage to learn the nitty-gritty details in the next step, with less anxiety. In some contexts, it may still very much make sense to acquire a more mechanistic understanding of how particular models work, but this shouldn’t come at the expense of a more general understanding.

And now I’d recommend you get your hands dirty and estimate some marginal effects with the help of *marginaleffects* and some assistance from Andrew’s blog post.

Footnotes

↑1 | Can’t unsee Abbildung 2. |

↑2 | Who was so kind to comment on an earlier version of this blog post, for which I am very grateful. Of course, he bears no responsibility for any mistakes I made, let alone for any inappropriate joke. |

↑3 | Thus scooping the best English pun we could have come up with. Aber ich marg ihn trotzdem. (A German pun, roughly: “But I *marg* him anyway,” playing on *mag*, “like.”) |

↑4 | So don’t worry whether the trolley is accelerating toward anyone–in the vacuum of our thought experiment, nobody can hear them scream anyway. |

↑5 | If probit + fixed effects makes you wince (are you an economist by any chance?) we also did a variation that replaces classroom fixed effects with the size of the classroom. And a linear probability model, because we tried to make readers from all fields of the social sciences happy. |

↑6 | and use whatever is trending in your field for publishing in certain journals. |

↑7 | I think there are some parallels here to Richard McElreath’s framing of statistical models as Golems, though I think Richard tends to care much more about how the golem is built. |