Summer in Berlin – the perfect time and place to ~~explore the city, take a walk in the Görli, go skinny dipping in the Spree,~~ attend an overcrowded, overheated conference symposium on cross-lagged panel models (#noAircon). So that’s what I did three weeks ago at the European Conference on Personality.

The cross-lagged panel model debate in psychology provides the backdrop for this blog post; knowledge of its dark secrets is not necessary to follow along. But if you *do* want the CliffsNotes, read this footnote^{[1]} (SCP-47258, containment class: Keter).

What I realized during this panel is that personality psychologists (and, by extension, other subfields of psychology that like to do fancy modeling) very much approach statistical modeling from statistics. They run some analyses that return numbers. Then, they work their way toward an interpretation of those numbers.

In this blog post, I want to argue that this leads to a lot of confusion, and that there is another way – starting from the desired interpretation and then working towards a statistical analysis that can (hopefully) deliver.

## The incumbent: Statistics first #AllModelsAreBeautiful

For example, let’s consider longitudinal modeling. Researchers have two variables and are interested in their “interrelations” or “how they relate over time” or “how X and Y contribute to each other.” Then they mostly apply some out-of-the-box model which is usually referred to by acronym, such as the CLPM (cross-lagged panel model) or the RI-CLPM (random intercept cross-lagged panel model) that were the subject of the conference symposium mentioned above, or maybe the ARTS (autoregressive trait state model) or the STARTS (stable trait, autoregressive trait, and state model), or maybe the LGCM, or an LCSM. These are all part of the SEM family, but to spice things up, somebody may also pick an HLM/MLM/REM/MEM.^{[2]} You get the idea. Finally, all or most resulting coefficients are interpreted, one way or another, often following some template people acquired when learning about the model (“this number here means that”) combined with some freestyle variations.

Of course, not everybody works with longitudinal data, and so there are variations of approach. For example, researchers may set up some model to “decompose the variance” in some measures and then see how the variance components “are related.” Or they may work with self- and other-report data of some construct and somehow transform and aggregate these in various ways to derive some metric, and then correlate said metric with something else, and then try to interpret the resulting correlations between correlations. In a sense, there’s a lot of highly creative data analysis going on in personality research.

What these approaches have in common is that the starting point is the “method”, the statistical model or analysis that is applied. What method is chosen is determined partly by the data structure but also by tradition. For example, if you have nested data (of the non-longitudinal type), there is some default sentiment that you have to deal with that by means of multilevel modeling – multiple articles have tried to make the point that if you are not interested in the whole multilevel structure, maybe there are more minimalistic solutions, but that remains a fringe viewpoint. If, instead (or additionally), you have self- and other-reports, you have to “control for normative desirability” to derive “distinctive personality profiles” which you can then correlate to arrive at supposedly interpretable correlations. Certain trends come and go; for example, bifactor models were popular in certain literatures that are rather foreign to me, but they seem to have fallen out of favor. One may be forgiven for thinking that the whole CLPM debate is another example of such a trend – who knows what will come after the RI-CLPM.^{[3]}

This particular way of dealing with statistics leads to some peculiarities. Even substantive personality researchers will often specialize in certain model classes and become very adept at interpreting their moving parts (“oh, this variance term here means that…”). From the outside, this may give the impression that personality researchers have particularly well-developed methods skills, and I wouldn’t disagree with that.^{[4]} We also end up with a weird mix of conservative rigidity and postmodern nihilism. Considering the former, if you have a particular type of data, people will act as if one particular model was obviously the only right choice. Incidentally, this makes it very hard for non-psychologists to publish in personality journals; your fixed-effects model might get rejected because in this house we analyze longitudinal data differently. Considering the latter, there is still an underlying sentiment that “any model goes.” For example, if the CLPM results in confounded inferences, that’s your problem as the researcher who overinterpreted the model. The model did nothing wrong, it did precisely what it was supposed to do. What is wrong are your inferences. Classic beginner’s mistake. But surely somebody out there will have the right research question for the model you reported.
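To make the CLPM example concrete, here is a minimal simulation sketch in Python (NumPy). It assumes a deliberately simple data-generating process that is not taken from any of the cited papers: two constructs with correlated stable traits, wave-specific noise, and no cross-lagged effect whatsoever. All names and parameter values (`simulate_clpm_confounding`, `trait_corr`, the unit variances) are hypothetical choices for illustration.

```python
import numpy as np


def simulate_clpm_confounding(n=100_000, trait_corr=0.5, seed=1):
    """Return the X1 coefficient from a naive and an 'oracle' regression.

    Assumed data-generating process (illustration only):
    each person has stable traits for X and Y that are correlated,
    and every observed score is trait + independent noise.
    X never affects Y, so the true cross-lagged effect is zero.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, trait_corr], [trait_corr, 1.0]])
    traits = rng.multivariate_normal(np.zeros(2), cov, size=n)
    trait_x, trait_y = traits[:, 0], traits[:, 1]

    x1 = trait_x + rng.normal(size=n)  # X at wave 1
    y1 = trait_y + rng.normal(size=n)  # Y at wave 1
    y2 = trait_y + rng.normal(size=n)  # Y at wave 2

    def ols(y, *predictors):
        # Ordinary least squares with an intercept in the first column.
        design = np.column_stack([np.ones(n), *predictors])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coefs

    # CLPM-style regression: Y2 on Y1 and X1.
    naive_cross_lag = ols(y2, y1, x1)[2]
    # "Oracle" regression that also conditions on the (normally
    # unobserved) stable trait of Y.
    oracle_cross_lag = ols(y2, y1, x1, trait_y)[2]
    return naive_cross_lag, oracle_cross_lag


if __name__ == "__main__":
    naive, oracle = simulate_clpm_confounding()
    print(f"naive cross-lagged coefficient:  {naive:.3f}")
    print(f"oracle cross-lagged coefficient: {oracle:.3f}")
```

Under these assumptions, the naive coefficient of X1 comes out at roughly 0.13 even though X has no effect on Y, and it drops to about zero once the stable trait is conditioned on. The model computed exactly what it was asked to compute; whether that number answers anybody’s question is a separate matter.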

## The contender: Substance first #WhatIsYourEstimand

Now, here’s an alternative approach one could envision. One could start from a clearly defined analysis goal, such as “I am interested in the causal effect of X on Y in target population Z.” Or maybe one could simply be interested in the distribution of X in a certain population Z; or maybe even just in the (unconditional, bivariate) correlation between X and Y in Z. I am the last person to tell people what they should be interested in, but the first one to tell them that if they don’t tell me what they are trying to estimate in the first place, why even bother.

These analysis goals are so-called theoretical estimands, and the wonderful paper by Lundberg et al. (2021) explains that they should be described in precise terms that exist outside of any statistical model. It also illustrates how to do so. To supplement this approach, given how many researchers insist that they are interested in prediction rather than causal inference, I am willing to concede that one could also start from a clearly described scenario in which one actually wants to make predictions – predictions in the sense of predictions (e.g., trying to predict how satisfied two romantic partners will be after one year, based on their personality right now), not in the oh-it’s-only-a-correlation-hm-but-maybe-it-is-also-more-no-but-definitely-not-causal sense in which it is often used in psychology (e.g., this new questionnaire predicts 0.1% of the variance in subjective well-being above and beyond this 120-item Big Five questionnaire; for more ramblings on incremental validity see also section 3.2 in Rohrer, 2024).

Now, we still need to figure out how to actually learn something about the theoretical estimand of interest, or alternatively, how to best predict the outcome. Depending on the estimand and the available data, we may actually end up using a CLPM/RI-CLPM/ARTS/STARTS/LGCM/LCSM/SEM/HLM/MLM/REM/MEM after all. But if the analysis goal is causal inference, then quite likely we will realize that we additionally have to adjust for at least some covariates to reduce confounding. And we wouldn’t want to interpret every single coefficient that the model returns. In fact, many of the coefficients may be uninterpretable (interpreting them anyway is known as the Table 2 fallacy). But that’s not a bug; we don’t need to be bothered by it if the coefficient corresponding to our estimand of interest is interpretable (big if). So we might end up with a similar model, but approach it in a different spirit, and our interpretation of the results may be a lot more focused.
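The point about not interpreting every coefficient can be sketched with a small simulation in Python (NumPy). The data-generating process below is an assumption made up for illustration: a single confounder C affects both X and Y, and X affects Y. The function names (`ols`, `simulate_table2`) and all effect sizes are hypothetical.

```python
import numpy as np


def ols(y, *predictors):
    """Return OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def simulate_table2(n=200_000, seed=2):
    """Assumed causal structure: C -> X, C -> Y, X -> Y."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)                      # confounder
    x = 0.8 * c + rng.normal(size=n)            # exposure, partly caused by C
    y = 0.5 * x + 0.3 * c + rng.normal(size=n)  # outcome; effect of X is 0.5

    _, adjusted_x, adjusted_c = ols(y, x, c)
    return {
        "unadjusted_x": ols(y, x)[1],  # confounded by C
        "adjusted_x": adjusted_x,      # targets the effect of X on Y
        "adjusted_c": adjusted_c,      # only the direct effect of C
        "total_c": ols(y, c)[1],       # total effect of C (direct + via X)
    }


if __name__ == "__main__":
    for name, value in simulate_table2().items():
        print(f"{name}: {value:.3f}")
```

In this sketch, the adjusted coefficient of X recovers the assumed effect of 0.5, while the coefficient of C in the very same model (about 0.3) is not the total effect of C (about 0.7, because part of it runs through X). Both numbers appear side by side in the regression output, but only one of them corresponds to the estimand the model was built for.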

If instead the analysis goal is just estimating a correlation, we might as well end up just calculating a correlation. Researchers sometimes come up with a hierarchy from “spurious” correlations to “true”/“interpretable”/“robust” correlations. The latter are usually correlations after more or less successfully conditioning on confounders that may bias the correlation relative to some estimand of interest. But the estimand of interest isn’t spelled out, and so it seems like a convoluted attempt to tweak the concept of a correlation until it provides the answer to the unarticulated research question. But if you are really just interested in the correlation, all that statistical control and other contortions may be unnecessary.^{[5]} And if you are *really* interested in *actual* prediction, knock yourself out but do it properly.
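The caveat in footnote 5, that a correlation in a selective sample may be a biased estimator of the correlation in the population of interest, can also be sketched in a few lines of Python (NumPy). The selection rule used here (only people with above-average X end up in the sample) and all names and values are assumptions for illustration.

```python
import numpy as np


def correlation_under_selection(n=200_000, rho=0.3, seed=3):
    """Compare the correlation in a full population with the correlation
    in a subsample selected on X (assumed selection rule: X > 0)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal(np.zeros(2), cov, size=n)
    x, y = xy[:, 0], xy[:, 1]

    r_population = np.corrcoef(x, y)[0, 1]

    selected = x > 0  # only above-average X makes it into the sample
    r_selected = np.corrcoef(x[selected], y[selected])[0, 1]
    return r_population, r_selected


if __name__ == "__main__":
    r_pop, r_sel = correlation_under_selection()
    print(f"population correlation:      {r_pop:.3f}")
    print(f"selected-sample correlation: {r_sel:.3f}")
```

With these settings the correlation in the selected sample is noticeably smaller than the population value of 0.3. Whether that matters depends entirely on which population the correlation is supposed to describe, which is again a question about the estimand.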

Estimands aren’t magic; being upfront about the analysis goal does not guarantee valid inferences. To ensure that we can actually recover the theoretical estimand of interest from the data, assumptions are necessary.^{[6]} For example, if we want to learn about causal effects based on observational data, those assumptions will often have to be quite strong. Maybe they are prohibitively strong, which leaves us in the same spot as doing things the other way around – here are some numbers, we are not sure what to make of them. So:

## Why even bother with estimands?

We should bother because moving from statistics to interpretations is clearly confusing people. A lot. I would be willing to say that in personality research, people being confused about what they try to achieve with their often supposedly sophisticated statistical models is among the top sources of research waste. Years of PhD students’ lives are consumed learning and implementing arcane analyses that may not even be the best way to address the research questions at hand. Months of reviewers’ and editors’ lives are wasted trying to figure out what the hell the authors are even trying to do, often in a lengthy back-and-forth. There are just so many debates consuming the time and energy of researchers that could be resolved if everybody had a clearer idea of their estimand in the first place.^{[7]}

I know the whole estimand thing is a tough sell for multiple reasons.

First of all, if we explicitly acknowledge that there’s something out there in the world about which we want to make statements, it makes our findings potentially fallible. In contrast, mere statistics won’t let you down. A conditional association is what it is. Maybe we shouldn’t overinterpret the data and instead just let them speak for themselves? The bad news is that if you’re doing substantive research, substantive interpretations are going to happen either way – otherwise the whole exercise of running statistical analyses would be pointless. Being vague about the desired interpretation may maintain some degree of plausible deniability and potentially offload the error to others (“oh, you shouldn’t have interpreted our numbers like that!1”). But honestly, that’s just a cowardly way to do science.^{[8]} Life is too short for that.

Naturally, you may still feel bad about communicating your preferred interpretation with too much certainty. That’s good, actually. You can communicate this throughout your manuscript and use the limitations section to point out how mistaken assumptions may invalidate inferences.^{[9]} Just don’t try to squeeze that uncertainty into the analysis goal (“maybe we’re trying to do causal inference; maybe we’re trying to do prediction; maybe it’s a secret, more complex third thing”). Own your estimand (“we’re trying to make statements about this; here’s our answer; here’s how it could be wrong”).

Another part is that people will say “oh, but of course you need to pick the right model for the research question at hand, everybody knows that!” I admit the underlying insight may be trivial – if you are not clear about what you want to do, it’s really hard to do it well. But the devil is in the detail. I see a lot of researchers pay lip service to the notion that you need to pick the right model for your research goal. Then they will memorize which vaguely phrased research questions can supposedly be answered by which coefficient in what model. And then if you look into the literature, it’s predictably still a mess.^{[10]} So maybe having a clear research question and picking the right model for the task is not all that trivial. Here, the estimands framework enforces rigor because it provides a systematic way to spell out the analysis goal. And then in the next step, thinking about the necessary identifying assumptions enforces rigor about how to connect the research question to the statistical model.

Another reason why this is a tough sell is that it to some extent devalues researchers’ hard-earned stats skills. *Nobody* likes to see their skills devalued. If you have invested years into mastering the details of all moving parts of SEM, or HLM, or weird index variables derived from other variables (looking at you, profile correlation and Euclidean distance people), surely you would want to apply that some more rather than switching to an estimand angle. The estimand angle may sometimes result in the insight that some sort of (generalized?) linear model may be sufficient for the task; and maybe also that, for example, the choice of covariates (which connects to the identification assumptions) matters more than the choice of statistical analysis.

So, it’s important to point out that the skills that are needed to move from statistics to substance—the skills to figure out how certain model parameters can be interpreted, and when inferences go wrong—are of course valuable and useful within the estimands framework, that is, when moving from substance to statistics. The mapping still goes both ways, and in practice there will be some degree of back and forth. For example, one may realize that one’s statistical model behaves weirdly in certain circumstances, which in turn alerts us that we may have missed assumptions necessary to link the theoretical estimand to the empirical estimand. Really, committing to estimands does not change anything about the underlying statistics.

It’s just a different angle from which to approach things that may help clear up some confusion. And, beyond this, it also just makes sense to take the theoretical estimand as a guiding light if you’re a substantive researcher. If you’re a substantive researcher, you mainly do statistics to answer substantive questions, not because you love statistics so much.^{[11]} So let’s start from clearly defined theoretical estimands and move on from there.

Further readings:

- Lundberg et al. (2021): What is your estimand? Defining the target quantity connects statistical evidence to theory. This is an instant modern classic for the social sciences. Admittedly a bit much if you are completely new to causal inference, but in any case worth the effort.
- Auspurg & Brüderl (2021): Has the credibility of the social sciences been credibly destroyed? Reanalyzing the “Many Analysts, One Data Set” project. This provides a great illustration of how unclear estimands create confusion and connects it to meta-scientific discussions about so-called researcher degrees of freedom.
- Kahan et al. (2024): The estimands framework: a primer on the ICH E9(R1) addendum. If there’s any field in which estimands have become somewhat mainstream, it’s medical research/biostatistics/health research/epi/not-sure-how-to-call-it. This article nicely explains the intricacies of estimands within the “simple” context of medical trials (i.e., in a context where it may not be obvious that one could come up with different estimands).

Further ramblings on this blog:

- Mülltiverse Analysis. It has become “a thing” to run a lot of analyses in psychology (e.g., multiverse analysis, specification curve analysis). This raises questions about the underlying estimand, which I discuss in this post.
- Who would win, 100 duck-sized strategic ambiguities vs. 1 horse-sized structured abstract? In which I go all in and demand that we make it mandatory to spell out a clear estimand in the abstract.
- Causal Inference | Hypothesis Testing | All at Once. Maybe you don’t have an estimand because you’re just testing some empirical prediction of your theoretical model? Here, I argue that the same rules still apply.

Footnotes

↑1 | Starting point: Once upon a time, psychologists used the cross-lagged panel model (CLPM) to draw sort-of-causal-but-maybe-it’s-not-causal-more-research-is-needed inferences without a care. What the cross-lagged panel model does is essentially regress Y on both Y at an earlier point in time and X at an earlier time point. If the coefficient of X is significant, that’s a significant cross-lagged effect. Simultaneously, you regress X on both X at an earlier time point and Y at an earlier time point to test for cross-lagged effects into the other direction. Maybe you can already tell why it would be contentious to interpret those estimates causally – it seems a bold proposition that controlling for an earlier outcome would be sufficient to take care of all confounding. Cue vaguely threatening music. Enter Hamaker et al. (2015), whose “critique of the cross-lagged panel model” points out that these “effects” will be confounded if the two constructs have some degree of trait-like stability, and if their stable parts are correlated. This paper had a huge impact on the field (easter egg: make sure you check out footnote 1 to get some idea of how causality was treated in the psych methods literature, back in the day). In any case, Hamaker et al. say that correlated random intercepts should be included for both constructs. This accounts for potential confounding via the traits (and essentially results in “within-person” estimates, similarly to what you would get in a fixed-effects model). The carefree days are over. Once the random intercepts are included, fewer cross-lagged effects turn out to be significant. Orth et al. (2021) to the rescue: maybe it is fine to use the model without the random intercepts, as it supposedly tests “conceptually distinct psychological and developmental processes.” This is a tempting proposition because it implies both theoretical nuance and more significant findings. Further vindication for the original cross-lagged panel model is provided by Lüdtke and Robitzsch’ “critique of the random intercept cross-lagged panel model” (2021). Suffice it to say that this text will be subject to exegesis for the years to come. Meanwhile the editor of at least one major personality journal appears slightly exasperated because time is a flat circle (Lucas, 2023: Why the cross-lagged panel model is almost never the right choice). My own small contribution to this debate is pointing out that if people mainly use these models to draw causal inferences, maybe we should focus on causal inference (Rohrer & Murayama, 2023: These are not the effects you are looking for). It’s not like longitudinal data somehow magically solved causal inference. |

↑2 | These all refer to the same class of models, FML. |

↑3 | I disagree with the underlying sentiment though. If causal inference is the goal and longitudinal data are meant to improve the chances of successful identification, accounting for “the trait”—stable between-person differences—isn’t a question of fashion; it’s the sensible default. |

↑4 | The comparison group obviously matters. But in principle, I believe that personality researchers are on average quite technically competent; I also believe that we could make better use of those powers (hence, this blog post). |

↑5 | This is an oversimplification – the correlation in your possibly selective sample may be a biased estimator of the correlation in the population of interest; in that case you may still need to worry about and take into account third variables. But this is usually not why psychologists invoke third variables. Additionally, there may be concerns about measurement biases; I think those mostly can be rephrased as concerns about confounding (which is also reflected by the models that people usually use to tackle them, mostly under the implicit assumption that whatever biases the measurement is not correlated with some underlying substantive variables). |

↑6 | These should also be spelled out explicitly. But I have come around to believing that explicit estimands are the thing we have to tackle first to get anywhere. If I know your estimand but not your assumptions, I can figure them out on my own and there’s little room to argue about that. If I know your assumptions but not your estimand, I can maybe figure out your estimand. But reverse-engineering is tedious and error-prone, and the vagueness with which researchers articulate their research questions makes it frustrating – they may always claim that the estimand implied by their analyses and assumptions was not the one they had in mind. Peer review in a quantitative empirical science shouldn’t have to involve that much hermeneutics, and yet here we are. |

↑7 | Examples involve the back-and-forth regarding the whole purpose of the marshmallow test (prediction or explanation?), the endless debate about the age trajectory of happiness (what’s an age trajectory anyway?), and the major confusions that arise in Many Analysts projects. More details and references in section 4.1 here. |

↑8 | To me, it’s most clearly exemplified by how psychologists treat causal inference based on observational data. They will just try so hard to imply causality without ever owning their causal claims. |

↑9 | That results in much more interesting limitations sections than the usual nod to external validity (“Because this study took place in Luxembourg and only included psychology undergraduate students, findings may not generalize to the Global South”). |

↑10 | There’s a parallel phenomenon in the “theory crisis” discussion. People will point out that psychology often lacks rigorous theorizing, which is very true. But then sometimes you look at what those people consider serious theorizing, and it turns out that it’s mostly boxes containing a hodgepodge of variables, haphazardly connected by arrows based on either common sense or flimsy/confounded empirical studies; the central theoretical prediction being that “everything may be connected.” So if I hear somebody talk about how psychology needs more theory, I keep my guard up until I can confirm that this is not the type of theorizing they have in mind. |

↑11 | Unless that’s really your thing; this is a kink-shaming free space. |

“it’s medical research/biostatistics/health research/epi/not-sure-how-to-call-it”

I don’t think there’s much common ground between Epi (observational / non-randomized studies) and RCTs. The paper by Brennan Kahan is excellent but the document he refers to in ICH E9 (revision 1) is exclusively for RCTs.

You see this in the difference in terminology (ATT/ATE/ATU versus hypothetical, treatment policy, on treatment, etc.). The concept of the intercurrent event is absolutely crucial in the RCT world but a bit of an afterthought for the epi/causal inference people. It’s nevertheless helpful: for example, when people run simulation studies to demonstrate that their method is unbiased, it is only ever unbiased with respect to a specific estimand. And if you aren’t interested in said estimand but in another one, chances are your method is providing a biased estimate of what you are in fact interested in.

It’s almost as if the word estimand can mean whatever you want it to mean depending on the situation. So although undoubtedly helpful, and a good way of thinking (I agree with the general line of argumentation of the post), it is never likely to be a panacea.

Nevertheless, excellent post.

Thank you for the clarification! I was unsure myself where to put this; my assessment may be biased because I mostly know epi people who care a lot about causal inference (and thus also care about estimands across the board).

I think the RCT world angle may end up quite relevant for experimental psychologists (who unfortunately are often not that precise about estimands either; they just declare some mechanisms “confounds” and others not).

And I fully agree that it is never likely to be a panacea! Just a productive way to clarify some confusions, and then the concept of course at some point also runs into issues.