In some fields, researchers who end up with time series of two variables of interest (X and Y) like to analyze (reciprocal) lagged effects between them. Does X affect Y at a later point in time, and does Y affect X at a later point in time? These questions are usually addressed with some sort of panel model including lagged effects, most prominently (in psychology at least) the cross-lagged panel model (CLPM) or the random intercept cross-lagged panel model (RI-CLPM).[1]
Such models and the resulting inferences have drawn a fair share of flak. But let’s say you really do want to investigate those lagged effects. How can you do that well?
Tier 1: Yes, you probably do want those random intercepts
Cross-lagged panel model
If your analysis is some variation of a cross-lagged panel model, you should most likely include correlated random intercepts for both time series to separate “within-person processes from stable between-person differences” (Hamaker et al., 2015). Putting it that way almost feels like underselling the matter. If modeled appropriately, the random intercepts rule out that time-invariant confounding explains away whatever lagged effects you find (Rohrer & Murayama, 2023). So, for example, you no longer need to worry that your results simply reflect confounding by variables such as gender, stable personality traits, or whatever happened in people’s childhoods. Think of the random intercepts less as substantive variables than as a handy way to screen off the effects of such confounders; don’t interpret their correlation as any sort of “effect”. The flipside of this is that if you don’t include the random intercepts, you actually do need to worry that such variables confound your conclusions.[2]
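For reference, here is a compact way to write down the model (following Hamaker et al., 2015; notation slightly simplified): each observation is decomposed into a grand mean, a stable person-specific intercept, and a within-person deviation, and the lagged effects are defined on the deviations only.

$$x_{it} = \mu_{x,t} + \kappa_i + x^{*}_{it}, \qquad y_{it} = \mu_{y,t} + \omega_i + y^{*}_{it}$$

$$x^{*}_{it} = a_x\,x^{*}_{i,t-1} + b_{xy}\,y^{*}_{i,t-1} + u_{it}, \qquad y^{*}_{it} = a_y\,y^{*}_{i,t-1} + b_{yx}\,x^{*}_{i,t-1} + v_{it}$$

The random intercepts $\kappa_i$ and $\omega_i$ are allowed to covary, and it is this covariance that soaks up stable between-person confounding, which is also why it should not be interpreted as an effect in its own right.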
There’s some confusing side quest here in which personality psychologists suddenly claimed that maybe it’s okay to omit the random intercept because…reasons. See also footnote 1 in “Let’s do statistics the other way around.” This may be deeply confusing to non-personality psychologists, so once again I have relegated this to a footnote.[3]
Multilevel model
If you’re instead implementing some sort of multilevel model, the corresponding thing to do is—somewhat confusingly—to include fixed effects (i.e., fixed person-specific intercepts). Alternatively, one may use some within-between specification (e.g., including both the person’s mean and their deviation from the mean as predictors), or the Mundlak version of it (see Hamaker & Muthén, 2020, for an overview). All these things aim to achieve the same thing, namely once again removing stable between-person differences to ideally screen off time-invariant confounding.
Why won’t random intercepts do the job here? The difference is that in the cross-lagged panel model case, you can include a random intercept on both sides (one for X, one for Y) and allow these intercepts to correlate, which captures the correlations induced by time-invariant confounding. In contrast, in the case of a multilevel model, you’d only include random intercepts on one side (for the outcome Y), and these are by design assumed to be uncorrelated with your predictor X—so, they cannot properly capture the confounding.
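To make this concrete, here is a minimal sketch in Python using pandas and statsmodels, on simulated data with made-up column names, of the three specifications mentioned above (person fixed effects, within-between, and Mundlak). It is an illustration of the idea, not a recommendation of this exact setup.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data, purely for illustration: 200 people, 5 occasions each,
# with a stable person-level confounder that raises both the predictor and the outcome.
# For simplicity, the "lagged" predictor is simulated directly rather than constructed
# from a previous wave.
rng = np.random.default_rng(1)
n_people, n_occasions = 200, 5
person = np.repeat(np.arange(n_people), n_occasions)
confounder = np.repeat(rng.normal(size=n_people), n_occasions)
x_lag = confounder + rng.normal(size=n_people * n_occasions)
y = confounder + rng.normal(size=n_people * n_occasions)  # no true effect of x on y
df = pd.DataFrame({"person": person, "x_lag": x_lag, "y": y})

# Option 1: person fixed effects. A dummy per person removes all stable
# between-person differences (and thus time-invariant confounding).
fe = smf.ols("y ~ x_lag + C(person)", data=df).fit()

# Option 2: within-between specification. Split the predictor into the person
# mean (between part) and the deviation from that mean (within part).
df["x_between"] = df.groupby("person")["x_lag"].transform("mean")
df["x_within"] = df["x_lag"] - df["x_between"]
wb = smf.mixedlm("y ~ x_within + x_between", data=df, groups=df["person"]).fit()

# Option 3: Mundlak specification. Keep the raw predictor but add the person mean;
# the coefficient on x_lag is then the within-person effect.
mundlak = smf.mixedlm("y ~ x_lag + x_between", data=df, groups=df["person"]).fit()

# In all three, the within-person coefficient should hover around zero here,
# whereas a naive random-intercept model without the person mean would not.
print(wb.params)
```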
(What if you’re not interested in causal effects at all? We turn to that question at the bottom of the iceberg, Tier 4).
Tier 2: Consider time-varying confounding
Ideally, at this point you’ve taken care of time-invariant confounding. Now you need to worry about time-varying confounding. Third variables that change over time within people may induce spurious lagged effects: maybe people engaged in some activity that affects both X and Y, maybe they experienced a life event, or maybe their life circumstances changed. It could be something as trivial as whether or not they recently had a cup of coffee.[4] Unfortunately, longitudinal data cannot automatically take care of these influences, which was the topic of an earlier blog post (this post’s doppelgranger).
In my experience, these confounders are largely ignored in psychology. That’s unfortunate, because it means that inferences rest on the assumption that there’s essentially nothing else going on in people’s lives except for the two time series X and Y. The better approach here would be to proactively think about time-varying variables that may affect both X and Y, make sure they are included in the data collection, and then adjust for them when analyzing the data.
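If such a variable has been measured, the adjustment itself is mechanically simple: it enters the model as an additional time-varying covariate. Continuing the hypothetical sketch from Tier 1 (the coffee column is made up, and whether adjustment suffices of course depends on the causal structure):

```python
# Hypothetical: suppose df also contains a time-varying covariate measured at each
# occasion, say whether the person recently had coffee. Adjusting for it means adding
# it alongside the within/between components of the predictor of interest.
df["coffee"] = rng.integers(0, 2, size=len(df))  # made-up covariate, illustration only
adjusted = smf.mixedlm("y ~ x_within + x_between + coffee",
                       data=df, groups=df["person"]).fit()
```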
If relevant time-varying confounders have not been assessed, one should at least spell out how they may bias the conclusions. And, given that people seem quite unaware of the issue of time-varying confounding, authors should state explicitly that substantive interpretations of their findings rest on the assumption that there are no omitted time-varying confounders.
Tier 3: Picking the “correct” time lag
There’s been quite some discussion (in particular in the context of experience sampling methods) about how one needs to pick the “right” time lag to get at the “processes” of interest. I low-key dislike this framing because (1) people are horribly vague about what “processes” actually means and (2) it’s usually employed in an asymmetric manner—if no lagged effect is found (or the effect is “too weak”), then this may be because of the wrong time lag, but it’s never “you overestimated the effect because of the wrong time lag.” This, in turn, implies that the right time lag is whatever time lag results in a large estimated effect. And that time lag is treated as a feature of “the effect of X on Y” (or “the effect of Y on X”). Anne pointed out that this sounds a lot like people are actually doing (exploratory) phenomena detection (“When does the effect of X on Y peak?”), and that sounds about right.[5]
From a causal inference perspective, I think it’s easier to think about this clearly by abandoning the notion of “the” effect. “The effect of X on Y” is an underspecified estimand; we can’t express it in a single number. But we can try to quantify the effect of X on Y after one minute, or after an hour, or after a week, or after a month. All of these can be well-defined research questions; it’s just that, for theoretical reasons, we may be more interested in one of them than in the others. For example, usually we wouldn’t be interested in the effects of a painkiller on outcomes one year later.[6] We may also be interested in the effects over a range of time points—for example, we may want to estimate the effects of a painkiller from right after consumption up to 48h later.
Now, in psychology, which time lags would we be interested in? We have to determine that “based on theory”, except psychological theories are usually incredibly vague, so we are actually more likely to go by our commonsense understanding of the world. So, for example, for a psychological process that is supposed to play out over minutes, maybe analyzing annual panel data would not be particularly informative (Mulder et al. just dropped a preprint on the matter that looks promising).
There are multiple things that can go wrong here—the time lags could be too large (i.e., the data are not granular enough), but it could also be the case that the total duration of the study is not sufficient to detect effects (e.g., imagine a daily diary study over 5 days to evaluate how hormonal contraceptives influence women’s well-being in the long run). What can’t really happen here is that your data are too granular—you could always use a model with finer time lags to derive effects over longer time lags (see Gische et al., 2020). The fact that you could, in principle, also just throw away some data points should be sufficient to make the point that it’s never too much from the perspective of the data analyst.[7]
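Here is a small numpy illustration of why more granular data cannot hurt the analyst: in a simple discrete-time VAR(1)-type model of the within-person process, effects over longer lags are implied by powers of the lag-1 coefficient matrix (the specific numbers below are made up).

```python
import numpy as np

# Hypothetical lag-1 (say, one-day) coefficient matrix of the within-person process,
# rows/columns ordered as [x, y]; the off-diagonal entries are the cross-lagged effects.
Phi_day = np.array([[0.5, 0.1],
                    [0.2, 0.4]])

# Implied coefficient matrix over a lag of 7 days (one week): simply the 7th matrix power.
Phi_week = np.linalg.matrix_power(Phi_day, 7)
print(Phi_week)

# The reverse step -- recovering the daily matrix from weekly data alone -- is not
# possible without further assumptions; that is roughly what continuous-time models buy you.
```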
In practice, I think the best one can do here is to clearly communicate the time lag underlying one’s claims at central points throughout the manuscript (including the abstract), and to consider whether that time lag is actually informative with respect to improving our understanding of the world. Also, one should be careful when discussing discrepancies with previous findings; the very same process may result in very different estimates when analyzed at different time lags (see literature on continuous time modeling, e.g., Driver et al., 2017).
Lastly, all of the uncertainty here loops back to the point that maybe, just maybe, most researchers conducting these types of analyses are actually doing exploratory phenomena detection rather than hypothesis testing. Which is fine, if it’s communicated transparently and framed appropriately.
Tier 4: What if it’s only meant to be associations/prediction/Granger causality?
All of the above assumes that one is actually interested in causal inference and thus needs to worry about things such as confounding and “spurious” lagged effects. But what if you’re interested in less? What if you’re just interested in prediction?
The answer to that is that if you’re really interested in predicting an outcome, you should move on and build a predictive model for your specific use case, taking into account things such as predictive accuracy and out-of-sample performance. Note that this is unlikely to be something like an out-of-the-box cross-lagged panel model, which would imply that you’re literally trying to predict X and Y just from X and Y at one (or more) previous points in time. In most applied settings, one would hope that you have a couple more variables available. Notice also that in such scenarios, any individual lagged coefficient is not of particular interest. You may, however, want to compare the predictive performance of a model that predicts Y from past values of Y with that of a model that predicts Y from past values of Y and past values of X.
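For what it’s worth, here is a sketch of what taking that comparison seriously could look like: scikit-learn on simulated data, with all names made up. The point is the out-of-sample comparison itself, not the particular model; with real panel data you would also want to split by person or by time rather than by row.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Simulated person-occasion rows, purely for illustration: two lags of Y, two lags of X.
rng = np.random.default_rng(0)
n = 500
y_past = rng.normal(size=(n, 2))
x_past = rng.normal(size=(n, 2))
y_now = 0.5 * y_past[:, 0] + rng.normal(size=n)  # in this simulation, past X adds nothing

# Baseline model: predict Y from its own past only.
base = cross_val_score(Ridge(), y_past, y_now, cv=5, scoring="r2")

# Augmented model: predict Y from its own past plus the past of X.
aug = cross_val_score(Ridge(), np.hstack([y_past, x_past]), y_now, cv=5, scoring="r2")

print(f"past Y only:     mean out-of-sample R2 = {base.mean():.3f}")
print(f"past Y + past X: mean out-of-sample R2 = {aug.mean():.3f}")
```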
As far as I can tell, virtually nobody does that in psychology because not many people—in particular not those who look at lagged effects—are sincerely interested in prediction. In those instances, prediction is mostly a weasel word to imply causality without owning it (see Grosz et al., 2020).[8]
Many researchers seem to struggle with how to deal with causality. One straightforward way would be to spell out a clear causal research question and then conduct an analysis that returns an answer to that question, provided certain assumptions are met—such as the absence of unobserved confounders. That is, in a sense, very straightforward, but it is also clearly fallible, because the assumptions may well not hold in practice.
So, researchers instead opt for an analysis that ostensibly answers a merely predictive or associational research question. The answer to that question is to some extent infallible, in particular if we phrase it in terms of parameters in a statistical model (“in an RI-CLPM fitted to my data, what’s the cross-lagged coefficient of X on Y?”). And then, maybe, just maybe, that answer does somehow inform us about another research question that is actually causal.[9] Which raises the question: Well, what conditions must be met for the answer we provided to inform us about causality? It turns out that it informs us about causality if the assumptions are met that would justify a causal interpretation.
So, the choice here is between two scenarios:
- asking a clear research question that is actually of interest, and presenting an answer that is wrong unless certain articulable assumptions are met
- asking a vague research question, presenting an answer that is likely correct but also not really of interest, and then adding some speculation about a causal interpretation that is again wrong unless certain assumptions are met (see the first scenario), except now it’s less transparent what is actually going on.
Returning to the surface
The possibility of time-varying confounding may make it sound like it’s pretty much impossible to successfully estimate lagged effects. I concede that this may be the case for many pairs of variables. But in practice, my bar is actually quite a bit lower—I’d already be happy if people routinely paid more attention to the basics.
For example, some years ago, Rich Lucas and I looked into the evidence that subjective well-being has downstream effects on health (see preprint here). Now, obviously, doing that with the help of the usual lagged models is probably fairly doomed from the get-go because of time-varying confounders: when people are suddenly happier, that usually has some reason (such as changes in life circumstances), and that reason may well also affect their health. But when we looked at the longitudinal studies that were invoked as evidence, they actually fell short for much more basic reasons that are solvable in principle, such as no concern for confounders whatsoever in combination with no random intercepts, direct content overlap between the “predictor” and the “outcome”,[10] and chronically weak statistical evidence.
That makes me think that for many pairs of variables, a good analysis of lagged effects may have never been tried. Do these models rest on strong assumptions about a lack of unobserved confounding? Yes, absolutely. Are these assumptions likely to be met? Well, I guess it depends. But in any case, if such analyses are reported transparently and with some conceptual clarity, that’s probably a big step up.
Footnotes
↑1 But there’s a whole world out there and it’s all connected, see Usami, Murayama & Hamaker (2019).
↑2 Although one may be able to get rid of some of the bias within a cross-lagged panel model, in particular if e.g. lag-2 effects are included, see Murayama & Gfrörer (2024).
↑3 Some personality psychologists have convinced themselves that including random intercepts is actually wrong. If you’re interested in the effects of the stable trait, you don’t want to control it away. But unfortunately, if you’re really truly interested in the causal effects of the stable trait, longitudinal data are only really helpful insofar as you get a more precise estimate of the stable trait. So, you might as well just average all observations of X and correlate them with the average of Y. That is, of course, not very impressive from a causal inference perspective—which is to say, to identify the effects of stable traits, you will simply need a different causal inference approach. What other causal inference approach? I’m honestly not quite sure myself, but genetically informed designs may be one meaningful way forward, see Briley et al. (2018). In any case, longitudinal data won’t do the trick—it’s logically inconsistent to think they could fix your causal inference issue when the cause of interest is the thing that does not change over time.
↑4 Or the effects of time-invariant confounders may change over time, in which case one may want to think of time/development as the confounding influence.
↑5 Phenomena detection is a worthy research goal in its own right; although the optimal set-up and write-up for such a study would probably look slightly different than the standard lagged effects article.
↑6 Although if it’s something highly addictive, maybe it would be relevant how people are doing much later in time.
↑7 It can, of course, be too much from the perspective of the participant, and I do think that people worry too little about what happens if you ask people the same stuff over and over again. But that’s a topic for a different blog post.
↑8 People, including myself, have rambled about this for quite some time, see for example the section “Predictive Modeling” in this blog post or the section “Incremental validity: An answer in search of a research question” in Rohrer (2024).
↑9 More research is needed.
↑10 Including the prediction of pregnancy outcomes (health) from maternal feelings towards the pregnancy (subjective well-being?)—variables that are very clearly confounded by pretty much all pregnancy-related risk factors. Women who are worried about their pregnancies may often have good reasons to feel that way; at least to me that seems more parsimonious than “the power of positive thinking wards off preterm births.”