Reviewer notes: In a randomized experiment, the pre-post differences are not effect estimates

Reviewer notes are a new short format with brief explanations of basic ideas that might come in handy during (for example) the peer-review process. They are a great way to keep Julia from writing 10,000-word posts[1] and also a great tool for Julia to pester her co-bloggers to write more posts.

When it comes to causal inference, randomization may be the closest thing to magic available to us. But just being close is not quite the same as being magic, and there are some things randomization cannot do for you. In particular, it cannot justify a causal interpretation of the differences between a pretest score and a posttest score.

You have collected a baseline measure of Y (pretest), randomized X, implemented the corresponding intervention (e.g., treatment vs. control; experimental condition A vs. experimental condition B), and then collected another measure of Y (posttest). Now, what can you make of the differences between the pretest outcome and the posttest outcome? For example, how can you interpret an individual’s change score (Y_post minus Y_pre)? And what can you make of the average change score within the levels of X (e.g., the average within-group change, also referred to as within-arm change in clinical contexts)? Unfortunately, it turns out that in most circumstances, the answer is: not much.
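For concreteness, here is a minimal sketch in R of these quantities, using simulated data (the variable names, sample size, and pre-post drift are all made up for illustration):

    # Simulated pre-post data from a two-arm randomized design (illustration only)
    set.seed(1)
    n <- 100
    d <- data.frame(
      group = rep(c("control", "treatment"), each = n / 2),
      y_pre = rnorm(n, mean = 50, sd = 10)
    )
    d$y_post <- d$y_pre + rnorm(n, mean = 2, sd = 5)  # some pre-post drift for everyone

    # Individual change score: posttest minus pretest
    d$change <- d$y_post - d$y_pre

    # Average within-group (within-arm) change
    aggregate(change ~ group, data = d, FUN = mean)

Note that both groups show a positive average change of about 2 here even though no intervention did anything, which previews the point below.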

“Alternative” explanations of pre-post differences

The problem is that any pre-post difference can be “explained away” by a ton of factors that have nothing to do with the randomized intervention that happened between the two occasions. For one, pre- and posttest were assessed at different points in time, and a lot could have changed – the weather, the season, the point in the academic year, the global political situation – leading to systematic differences between the two time points. And even if nothing had happened in the world, your participants will likely fill out the same measures for the second time at the posttest, and simply having seen all the questions before can change things (“practice” effects, initial elevation bias, …).[2]

Regression to the mean

And then there’s regression to the mean. In the abstract, this means that when you have two variables (A, B) that are correlated – but not perfectly correlated, so r(A,B) < 1 – and you select a sample based on one of the variables (A), then in that selected sample the other variable (B) will take on less extreme values than the variable on which you selected (A). What does that have to do with pre-post differences? Samples are often selected on their pretest outcome (A). Sometimes that selection is quite direct (e.g., when some symptom-score cut-off must be met to be eligible for participation), sometimes rather indirect (e.g., when the outcome of interest is some personality variable Y, and the study is advertised as a study on voluntary personality change, so that people who are dissatisfied with their Y are more likely to sign up). Pre- (A) and posttest (B) outcomes will virtually always be correlated, but not perfectly. This means that if participants are selected based on their pretest outcomes (A), posttest outcomes (B) will be less extreme due to regression to the mean. So, for example, if you sign up people who don’t feel well to participate in your study, then just wait a couple of weeks and look at the posttest scores, it will look like you made them feel better. Except you didn’t do anything; you just selected a group based on a variable and observed it regress to the mean, which is not a substantive finding but a statistical necessity.
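To see that statistical necessity in action, here is a small simulation sketch in R (the noise settings are made up and imply a pre-post correlation of about .5): nothing happens between the two measurements, yet the selected group appears to “improve”.

    # Regression to the mean without any intervention (simulation sketch)
    set.seed(42)
    n      <- 100000
    true_y <- rnorm(n)            # each person's stable level of Y
    y_pre  <- true_y + rnorm(n)   # pretest  = stable level + measurement noise
    y_post <- true_y + rnorm(n)   # posttest = stable level + independent noise
    cor(y_pre, y_post)            # correlated, but r < 1 (about .5 here)

    # Select participants who "don't feel well" at pretest
    selected <- y_pre < -1
    mean(y_pre[selected])         # well below the population mean of 0
    mean(y_post[selected])        # about halfway back toward 0, although
                                  # nothing happened between the measurements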

The magic of randomization

The nice thing about randomized experiments with more than one group is that all of these alternative sources of pre-post differences – regression to the mean, practice effects, effects of historical time – should affect all of your groups. And so if you compare the groups, those factors cannot “explain away” group differences, which you can in turn attribute to your randomized intervention. That’s the whole beauty of randomization. But you have to compare the groups to reap its benefits. You don’t even need the pretest outcome for the magic to work; the comparison of the posttest outcomes between the groups should give you an unbiased estimate of the causal contrast between the different intervention assignments (e.g., of treatment vs. control). However, adjusting for the pretest outcome will give you a lot more statistical bang for the buck.[3]

You may still report or visualize pre-post differences to give a fuller picture of your data. However, don’t ever refer to them as “effects”, and don’t interpret them as such. What counts for claims about effects are not the changes within the groups but the differences between the groups.
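To make footnote 3 concrete, here is a minimal simulation sketch in R with made-up numbers (the true treatment effect is fixed at 3): all of the between-group analyses recover the effect without bias, the pretest-adjusted models are just more precise, and the change-score model with pretest adjustment returns exactly the same treatment estimate as the ANCOVA.

    # Between-group analyses of a randomized pre-post design (illustration only)
    set.seed(7)
    n      <- 200
    tx     <- rep(0:1, each = n / 2)                  # randomized assignment
    y_pre  <- rnorm(n, 50, 10)
    y_post <- 0.6 * y_pre + 3 * tx + rnorm(n, 20, 8)  # true effect = 3

    change <- y_post - y_pre

    coef(lm(y_post ~ tx))["tx"]          # posttest comparison: unbiased
    coef(lm(change ~ tx))["tx"]          # change-score comparison: unbiased, but
                                         #   less efficient than pretest adjustment
    coef(lm(y_post ~ tx + y_pre))["tx"]  # "ANCOVA": unbiased, most precise here
    coef(lm(change ~ tx + y_pre))["tx"]  # change score + pretest adjustment:
                                         #   identical tx estimate to the ANCOVA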

Who misinterprets pre-post differences as effect estimates?

Admittedly, the take-home message that you cannot interpret pre-post differences as causal effect estimates is fairly basic and should have been covered in most introductions to research methods, as part of the motivation for randomized experiments. That said, that introduction to research methods may have happened many years ago for many researchers, so the misinterpretation does make it into the published literature – and not just in psychology, as evidenced by this blog post on the topic by Darren Dahly: One simple trick that statisticians hate. To take a prominent example, the literature on the placebo effect is littered with people misinterpreting the pre-post difference in the control group as some sort of effect when it simply isn’t one. So, if you are now more aware of these problems than before you read this blog post, I’d say that’s pretty convincing causal evidence for the importance of basic statistical education.

Or is it?

Footnotes

1 but make her write ten 1,000-word posts instead
2 If you don’t believe me, force yourself to participate in one of those studies where you need to fill out the same 40 items over and over again. You can thank me later for that insightful experience.
3 Technically, you can calculate change scores and then compare them between the groups – that gives you an unbiased estimate of the treatment effect (if randomization has occurred; observational studies are an entirely different beast), it’s just less efficient. So, you’re not wrong on average, but you’re not giving the best possible answer either. The better strategy is to actually adjust for the pretest. Adjustment can be achieved in a standard regression analysis (“ANCOVA”, according to some deeply disturbing naming convention in psychology) or with a multilevel model, which is somehow more fashionable in psychology and does have its merits according to this blog post by Solomon Kurz (who was so kind as to provide some nuance in the comments of this blog post). Sometimes I also see people calculate change scores and use these as the outcome while also adjusting for the pretest. That is absolutely fine as well; it will return the same answer as the adjusted regression model for the posttest outcome. The procedure is just a bit redundant; “doppelt gemoppelt” (roughly, “needlessly doubled up”) in the language of love (German). All of this presumes that we are looking at Gaussian outcomes. Like Solomon says below, change scores for other types of data are probably a bad idea (unless you really think through what you are doing).

4 thoughts on “Reviewer notes: In a randomized experiment, the pre-post differences are not effect estimates”

  1. I agree with all the major points, and it looks like you were careful with how you made them. There is a caveat, however, that it is perfectly valid to analyze pre/post experimental data with a change score as the DV, particularly if you also include the prescore as a covariate. That would follow the basic linear model

    change ~ tx + pre.

    The beta coefficient for tx in that model would be identical (both in terms of point estimate and standard error) to the more widely used linear model

    post ~ tx + pre.

    But this approach is really only valid for Gaussian-type data; I wouldn’t try this with binary or ordinal or count data. I mention the caveat only to deter any future Reviewer #2s from telling a substantive researcher they’re wrong for analyzing a change score. In some contexts, change scores are fine, but they are easy to misinterpret, for sure.

  1. Hi Solomon! Thanks for chiming in. Yes, you’re absolutely right – if you do both change score and adjustment, you get precisely the same estimates. I’ve adjusted the phrasing in the corresponding footnote to prevent overeager Reviewer #2s from requiring adjustments that aren’t really necessary!

  2. Maybe it should be obvious, but for me it’s surprising that the word “control” pops up just once at the end of the post, while it seems central. The link in note 3 is about RCT after all. Am I missing something?
    Anyway, I like the new format, even if 10,000-word posts are nice too!

    1. Hi there! Ah, no, I don’t think you’re missing anything. You’re thinking of “control” in terms of control group, right? I think at the beginning of the post I’m speaking in more general terms (so the groups could also be experimental manipulation A versus experimental manipulation B), but then in the end it clearly shifts towards RCT where it’s always intervention vs control. I’ll try to clarify that throughout to make the mapping clearer! And thank you 🙂 I like the new format as well, it was Anne’s idea (hope she’ll write one herself lol).
