Correlation may not imply causation, but let’s just ignore that for a second. Correlations are standardized effect size metrics and as such have some quirks by design. These quirks are benign enough when you just calculate a single correlation coefficient and look at it, but things can get really messy once you calculate multiple correlations and compare them, or even go so far as to correlate the magnitude of correlations with a third variable. It turns out that there are whole lines of research in psychology centered on correlations between correlations, but let’s start with a single correlation, considered in isolation.
A single correlation
The Pearson correlation coefficient (known as just “the correlation” to its friends) is the covariance between two variables divided by the product of their standard deviations. It is the same as the standardized regression coefficient in a simple linear regression model (Y ~ X), which can be calculated by taking the [unstandardized regression coefficient], multiplying it by the [standard deviation of X], and dividing it by the [standard deviation of Y].[1] This comes in handy, as it helps reason through how the correlation will behave in various scenarios.
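As a quick numerical sanity check, here is a minimal simulation (Python with numpy; the specific numbers are arbitrary) confirming that the standardized slope equals the correlation, regardless of whether we regress Y on X or X on Y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(scale=2.0, size=n)                 # sd(X) = 2
y = 0.4 * x + rng.normal(scale=3.0, size=n)       # true slope 0.4, residual sd 3

r = np.corrcoef(x, y)[0, 1]

cov_xy = np.cov(x, y, ddof=1)[0, 1]
b_yx = cov_xy / np.var(x, ddof=1)  # unstandardized slope of Y ~ X
b_xy = cov_xy / np.var(y, ddof=1)  # unstandardized slope of X ~ Y

sd_x, sd_y = np.std(x, ddof=1), np.std(y, ddof=1)
print(r)                   # the correlation
print(b_yx * sd_x / sd_y)  # same number
print(b_xy * sd_y / sd_x)  # also the same number
```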
Let’s assume we have established that any correlation between X and Y occurs because X affects Y. That is usually obviously wrong, but it will make it easier to understand what happens in various simple scenarios. Now let’s imagine we find a correlation that we would classify as small. For example, let’s say we find only a small correlation between positive Sex Doll Ownership Attitudes and Satisfaction with Life.[2] By definition, that also means that the standardized regression coefficient is small, which means that [unstandardized regression coefficient] times [standard deviation of X] divided by [standard deviation of Y] is small.
There are various reasons why this number can end up small, and we are going to consider them one by one to make things easier.[3]
First, the smaller the unstandardized regression coefficient, the smaller the correlation. So the correlation could be small because on average, a 1-point change in sex doll ownership attitudes only leads to a fairly small point change in life satisfaction, with points referring to whatever unit X and Y have. That is, the correlation is small because the causal effect is small. This scenario is probably easy to imagine for many social scientists.
Second, the smaller the standard deviation of X, the smaller the correlation. So, the correlation could be small because sex doll ownership attitudes barely vary in the population. Maybe there’s actually widespread agreement in people’s reactions to items such as “I would never want one of my children dating somebody who owns a sex doll.”[4]
Third, the larger the standard deviation of Y, the smaller the correlation. So, the correlation could be small because life satisfaction varies a lot for reasons that have nothing to do with sex doll ownership attitudes. For example, health likely has a strong causal effect on life satisfaction and may vary quite a bit, so it will add a lot of variance to life satisfaction that cannot be attributed to sex doll ownership (according to the current state of scientific knowledge).
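To make the three moving parts concrete, here is a small simulation sketch (Python; all numbers arbitrary) that starts from a baseline and then shrinks the slope, shrinks the variance of X, or inflates the residual variance of Y. Each change alone pushes the correlation down, and in these particular settings by the same amount, which is exactly why you can’t tell the scenarios apart from the correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def corr(b, sd_x, sd_resid):
    """Correlation when Y = b*X + noise, for a given slope and variances."""
    x = rng.normal(scale=sd_x, size=n)
    y = b * x + rng.normal(scale=sd_resid, size=n)
    return np.corrcoef(x, y)[0, 1]

print(corr(b=1.0, sd_x=1.0, sd_resid=1.0))  # baseline: r ≈ .71
print(corr(b=0.2, sd_x=1.0, sd_resid=1.0))  # small causal effect: r ≈ .20
print(corr(b=1.0, sd_x=0.2, sd_resid=1.0))  # X barely varies: r ≈ .20
print(corr(b=1.0, sd_x=1.0, sd_resid=5.0))  # Y varies for other reasons: r ≈ .20
```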
Of course, all three parts jointly affect the correlation; they’re all mashed together. Which isn’t really a bug, but a design feature. An okay-ish one, for that matter, if you really don’t care about the units of X and Y – for example, if they are psychological self-report scales with completely arbitrary scaling.[5] There’s no law of nature according to which “very strongly agree” is a 5 and “strongly disagree” a 1.
It’s a less helpful feature if one or even both variables have a unit that is actually interpretable. For example, researchers have calculated correlations between birth order position (first-born versus later-born) and personality traits. I personally don’t know what a standard deviation of birth order coded in such a manner means; “How many SDs is your birth order position above or below the mean?” is not a common small talk topic. But I can easily make sense of “On average, first-borns score 0.2 SD higher on intelligence”, so a semi-standardized metric would probably be more helpful for communicative purposes. Likewise, I’ve seen people report correlations between personality variables and income, but I have no idea what to make of one SD of income.
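Such a semi-standardized coefficient just divides the unstandardized coefficient by the SD of Y, leaving X in its natural unit. A tiny sketch with simulated birth-order data (all numbers made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
first_born = rng.integers(0, 2, size=n)  # 1 = first-born, 0 = later-born
iq = 100 + 3 * first_born + rng.normal(scale=15, size=n)  # raw test points

b = np.polyfit(first_born, iq, 1)[0]  # unstandardized slope: ≈ 3 raw points
print(b / np.std(iq, ddof=1))         # semi-standardized: ≈ 0.2 SDs of Y
```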
Still, correlations may be perfectly fine for making sense of measures without any meaningful unit. However, this changes when researchers start to take these correlations and compare them between groups, or even start to correlate them with other variables. And, let’s be honest, that’s going to happen. Because researchers usually aren’t satisfied with just quantifying associations; they want to draw inferences that go further than that.

Many correlations
There’s a whole class of analysis approaches in psychology that essentially boil down to correlations between correlations, or group comparisons between correlations. For example, people may calculate profile correlations to quantify personality similarity, and then correlate this metric with other stuff.[6] Other examples were brought up on Bluesky: comparing network models between healthy and clinical samples; comparing correlations of BOLD signals from different brain regions; correlating patients’ and therapists’ ratings and then correlating that correlation with something else.
Consider the following simple situation: A researcher compares the correlation between family satisfaction and general life satisfaction between men and women and finds that the correlation is higher in women. But what does that mean, substantively? The correlations could differ because of any of the three moving parts, including their combination. The correlation may be higher in women because an additional point in family satisfaction has a stronger effect on their life satisfaction. The correlation may be higher in women because their assessment of family satisfaction in general varies more. Or the correlation may be higher in women because their residual variation in life satisfaction is lower; for example, maybe their life satisfaction is less strongly affected by what happened in the Roman Empire or whether they encountered someone who is wrong on the internet. We simply cannot know which of these apply if we only compare the correlations. So, at the very best, we have found something in need of an explanation.[7]
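To see how ambiguous the comparison is, here is a minimal simulation sketch (Python with numpy; all numbers made up) in which the causal effect of family satisfaction is identical for men and women, yet the correlations differ, simply because the residual variance of life satisfaction differs:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Same causal slope (0.5) in both groups; only the residual variance
# of life satisfaction differs.
fam_m = rng.normal(size=n)
fam_w = rng.normal(size=n)
life_m = 0.5 * fam_m + rng.normal(scale=2.0, size=n)  # more residual variance
life_w = 0.5 * fam_w + rng.normal(scale=1.0, size=n)  # less residual variance

print(np.corrcoef(fam_m, life_m)[0, 1])  # ≈ .24
print(np.corrcoef(fam_w, life_w)[0, 1])  # ≈ .45
```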
But instead, researchers often pretend that approaches that compare correlations directly tap into something substantively meaningful (“Is family satisfaction more important for men or for women?”). Or they claim that their theoretical reasoning implies some hypothesis for how the correlations in two groups differ, or even for similarities between correlation matrices between different sets of variables. Their theory may imply such things, but only indirectly. Correlations are not the stuff from which the world is built, and theories are usually concerned with how things affect each other, not with how they correlate. But how things affect each other in turn affects all sorts of things, including the magnitude of effects and variances, and these parts flow into the correlation.
If you start from the other side, an observed difference in correlation doesn’t tell you what’s different between the groups.[8] So even if your theory eventually implies a difference in correlations, comparing the correlations is simply a very blunt tool to test that theory. It would work much better to test whatever your theory predicts before mashing a bunch of stuff together.
For example, if your theory predicts that the strength of the effects varies, you may want to test for differences in the unstandardized regression coefficient (i.e., test an interaction) in a model that hopefully successfully causally identifies the effect of interest.[9] If, more exotically, your theory says that the correlation should change because some variance changes (say, more variability of X in one group, or more residual variability in Y in one group), that’s something you can directly test as well. If your theory just says “X is more important for Y in this group compared to that group”, or “the role of X for Y varies between groups,” your analysis aim may actually not yet be clearly specified and you may have to return to the drawing board to figure out what you mean by “more important” or “role.”
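A minimal sketch of what such an interaction test could look like (simulated data, statsmodels formula API; all variable names are hypothetical, and the model makes no claim to causal identification on its own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 2_000
gender = rng.integers(0, 2, size=n)  # toy coding: 0 = men, 1 = women
fam = rng.normal(size=n)
life = 0.5 * fam + 0.3 * gender * fam + rng.normal(size=n)

df = pd.DataFrame({"fam": fam, "gender": gender, "life": life})

# The fam:gender coefficient directly tests whether the unstandardized
# slope differs between groups -- no within-group standardization involved.
model = smf.ols("life ~ fam * gender", data=df).fit()
print(model.summary())
```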
Now, you may still end up in a situation in which you want to standardize a resulting metric to make it more interpretable, in particular if the original unit is arbitrary. For example, let’s say you found that the regression coefficients in the groups significantly differ; how can you express how much they differ? For that, ideally, you would actually use a population norm (e.g., the SD in the population; see Alsalti et al., 2024). Or, if that fails, just the SD based on your total sample. The important difference here is that you don’t standardize within groups but rather across everyone. So, you still get to standardize—but in the same manner across everyone, which ensures that you don’t mash all sorts of things together into your focal metric.
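A small sketch of that kind of across-everyone standardization (again simulated data; in practice you would substitute a population SD where available):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000
group = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
y = 0.5 * x + 0.3 * group * x + rng.normal(size=n)

# Standardize with total-sample SDs (or, better, population norms),
# not within each group, so both slopes end up in the same units.
x_z = (x - x.mean()) / x.std(ddof=1)
y_z = (y - y.mean()) / y.std(ddof=1)

for g in (0, 1):
    slope = np.polyfit(x_z[group == g], y_z[group == g], 1)[0]
    print(f"group {g}: {slope:.2f} total-sample SDs of Y per total-sample SD of X")
```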
P.S.
All of this still applies if you Fisher z-transform your correlations. And it also applies to R^2 (which, in the simplest case, is literally the square of the correlation).[10] David Hugh-Jones has just published a great post on why R^2 probably isn’t the most intuitive way to evaluate the predictive power of polygenic scores for educational attainment, making some points overlapping with this post (plus some additional good ones). Partial correlations of course behave in the same manner, just in a more complicated way.
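For the simplest case, a quick numerical check (Python, simulated data) that the R^2 of a simple linear fit is just the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=10_000)
y = 0.3 * x + rng.normal(size=10_000)

r = np.corrcoef(x, y)[0, 1]

# R^2 of the simple linear fit, computed as 1 - SS_res / SS_tot
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r_squared = 1 - resid.var() / y.var()

print(r**2, r_squared)  # identical up to floating-point noise
```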
Furthermore, this “mashing together” applies to other standardized and thus relative metrics as well. For example, in experiments, researchers routinely calculate Cohen’s d, which is semi-standardized; Cohen’s ds may vary between experiments because the effects vary and/or because the (unrelated) variability in the outcome varies (maybe because one of the experiments was conducted on a more heavily selected sample). To pick something less obvious, heritability—the share of the observed variability in a phenotype that can be explained by genetic differences—can go up if the effects of genes get stronger, if the genetic diversity increases, or if environmental differences, as another source of variance, are reduced. But that’s a topic for another blog post.
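To make the heritability example concrete, here is a toy simulation (a deliberately crude additive model, not a serious behavior-genetic analysis) in which the share of variance explained by genes goes up both when genetic effects get stronger and when environmental variance shrinks:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
genes = rng.normal(size=n)  # genetic differences between individuals

def heritability(effect_size, env_sd):
    """Share of phenotypic variance attributable to genes in a toy additive model."""
    genetic_part = effect_size * genes
    phenotype = genetic_part + rng.normal(scale=env_sd, size=n)
    return np.var(genetic_part) / np.var(phenotype)

print(heritability(effect_size=1.0, env_sd=1.0))  # baseline: ≈ .50
print(heritability(effect_size=2.0, env_sd=1.0))  # stronger genetic effects: ≈ .80
print(heritability(effect_size=1.0, env_sd=0.5))  # less environmental variance: ≈ .80
```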
P.P.S.
Almost forgot – what if you’re not willing to assume that any correlation between X and Y is causal? Then you can re-read the blog post and mentally substitute any mention of the effect of X on Y with “any unstandardized association due to X affecting Y; Y affecting X; both being jointly affected by third variables; and the data being selected on a joint outcome of X and Y.” You’re welcome!
Footnotes
↑1 We can swap X and Y here and arrive at the same number. While the unstandardized regression coefficient will be different, the standardized one will come out at the same value.
↑2 I chose this example because suspension of disbelief with respect to causal identification is futile anyway. Also, there are validated scales for both variables.
↑3 In reality, the three ingredients to the standardized regression coefficient are not independent from each other — changing either the unstandardized regression coefficient or the standard deviation of X would also change the standard deviation of Y and thus additionally indirectly affect the correlation. But as long as X only contributes a small part to Y, that’s not a big issue.
↑4 Another example would be an X that is a rare behavior in a given population, such as heroin consumption.
↑5 Ruben says it’s still a bad reason and that psychologists should use units they respect.
↑6 I’m guilty of that myself.
↑7 A student of mine once looked into this for her thesis, using the German SOEP. The correlation is indeed slightly higher among women, and this is due to two moving parts: more variance in family satisfaction and a larger unstandardized regression coefficient.
↑8 And a lack of difference in correlation doesn’t tell you that the groups are the same.
↑9 We actually dedicated a whole section of our paper on interactions (Rohrer & Arslan, 2021) to the difference between differences in correlations vs. differences in slopes.
↑10 R^2 can also be used to evaluate non-linear models, so you can end up with a high R^2 and a low (linear) correlation. Many thanks to Federico for that minor nit. But even then, it’s going to depend on the variances of the variables.
Thank you for a nice overview of the limits of correlations!
Regarding what you say about the different influences on the correlation, such as:
“changing either the unstandardized regression coefficient or the standard deviation of X would also change the standard deviation of Y and thus additionally indirectly affect the correlation. But as long as X only contributes a small part to Y, that’s not a big issue.”
I don’t think this is the simplest way to understand the different effects though, as we get this mixing. I prefer to think of the variation in Y as being made up from two components: the variation linearly associated with X, and the error variation – the variation not linearly associated with X (i.e., independent or at least uncorrelated with X). Higher values on the former increase the correlation, which can be due to a steeper slope, or increased variability of X. If the relationship is truly linear, and the error is independent of X, then doubling the slope will work out exactly the same for the correlation as doubling the standard deviation of X (while keeping the same linear relationship), as both double the standard deviation of the model derived from X (i.e., a + bX). Conversely, higher values on the variation not linearly associated with X will decrease the correlation. And if both parts increase or decrease by the same amount, then the correlation will remain the same.
I also think that psychological researchers probably rely too much on standardized effect sizes like correlations (and too little on unstandardized effect sizes). However, I think the above may also help illustrate when it can be more informative to examine some standardized correlation-like measure, such as: (1) When the point is to examine predictive power and the total variation in Y remains the same (so higher explained variation must mean lower unexplained variation), such as when comparing the predictive power of two different variables for the same outcome in the same sample. (2) When the only thing of interest is the proportion of the variation that can be and cannot be predicted by X, not their absolute magnitude. One might predict that it is similar across settings, for example. Say that for each $1 variation in spending that is explained by the predictor, a $9 variation remains unexplained in each of two different stores. Then that can suggest a similar “role” to the predictor in both stores, even if average spending differs by a factor of 10 between stores.
To self-plug: The above is also part of the reason why I think that, if one is to look at a correlation-like measure, then doing so with a measure that compares explained to unexplained variation is often the most natural. The correlation R does not do so directly (it is an increasing but non-linear function of the explained to unexplained variation). The explained variance R^2 does so for variances, but I prefer to look at standard deviations, as they measure variation on the same scale as Y, increasing interpretability, as above with the $1 and $9 variations, rather than ($1)^2 and ($9)^2 ones. I discuss this in the following paper: https://psycnet.apa.org/doi/10.1037/met0000681 (PsyArXiv here: https://doi.org/10.31234/osf.io/svuf8_v1)
Thanks for chiming in! I don’t think I disagree with anything that you write — in particular, you absolutely want to rely on a correlation-like standardized measure if the purpose is to figure out a predictive contribution. That said, when I look at the psychological literature, I see more cases of researchers pretending to answer predictive research questions when they are clearly interested in something else (such as a causal effect in a presumed data-generating mechanism) than genuine predictive research questions. Hence the framing of the blog post. But genuine predictive research questions do exist for sure, and then the correlation (or any fancier version that can accommodate non-linearities etc.) is the way to go. Thanks for sharing your article, looks super intriguing! I agree that the squared scale is…a bit odd and will check out your alternative suggestions.