Non-representative samples! What could possibly go wrong?

Earlier this year I saw that a study was making the rounds on Twitter under the catchphrase “Representative samples may be overrated.” Claims like this tend to spread like wildfire in certain parts of the psychological community, at least in part because, I believe, people genuinely want to figure out what’s the deal with representative samples (although there may also be other reasons).[1]

At one extreme, there are calls for samples that are either full-blown representative or that at least capture some of the diversity of the population of interest, which may often be all humans. Only this way, or so they say, are we able to arrive at generalizable insights about the human psyche. And frankly, it makes a lot of sense that claims about humans in general should be supported by data from humans in general.

Figure 1. The diversity of humankind in a bottle, according to DALL·E. For (m/w/d).

At the other extreme, some say that psychological studies are actually getting at mechanisms that are fundamental to the human mind. We are all part of the same species, so it should not be a big problem if our sample mainly consists of College Freshmen at a Large Midwestern University™. And frankly, it does seem like we have managed to identify some generalizable phenomena[2] even without ample samples.

What gives? As so often in life, this is one of the situations where the most boring-sounding of all answers is correct: It depends. It depends on the goal of the analysis; it depends on the assumptions one is willing to make. Spelling out both analysis goals and underlying assumptions doesn’t appear to be psychologists’ strong suit, so it only makes sense that we’d also be confused about the need for/value of representative samples. So, will you need a representative sample? Let’s consider three different scenarios: (I) you want to estimate a prevalence, (II) you want to estimate a causal effect using an experiment, or (III) you want to estimate a causal effect using observational data.

Scenario I: You want to estimate the prevalence of something

Or alternatively, some mean score, say, average happiness on a scale from 0 to 10. For these types of questions, I think most of us would immediately ask “average for whom?” because it is obvious that average happiness (or, say, vaccination rate) is a property of some population of interest. And, ideally, we would want to have a representative sample of that population to get the estimate right.

But what even is that, a representative sample? The term is used with different meanings within the non-scientific literature, within the broader scientific literature, and even within statistics; it can also simply be a way to give the sample a pat on the back (a phrase from Kruskal & Mosteller, 1979, that I really dig).[3] Here, I will use it to refer to a scenario in which each person in the population had the same chance of ending up in the sample, which in more precise terms is an equal probability sample. An equal probability sample is actually not what smart survey planners would usually go for,[4] but I think it provides a good abstraction to think about samples, so let’s stick with it for now.

For example, you simply send out study invitations to everyone, and then there is a 1% chance that any given person participates. Obviously, that’s not going to happen – even if your target population was, in fact, only College Freshmen at a Large Midwestern University. First of all, how do you even know who everyone is, much less contact them? Second, it is extremely unlikely that everybody participates with the same probability. For example, some people simply would never want to participate in your survey study or experiment, no matter what, and you (usually) can’t force them to change their minds.[5] Don’t worry; it’s not you, it’s them! But you still have to deal with the aftermath in a productive manner.[6]

Figure 2. Representative sample raffle drum. Pending IRB approval.

In a somewhat more realistic yet still benign scenario, some people are more likely to end up in your study than others – but you can explain that with the help of variables that you have actually collected. For example, women and people with higher educational attainment may be more likely to participate in your study. However, within groups defined by gender and educational attainment, it’s again just random who provides data – so all women with bachelor’s degrees end up in your sample with some probability p; all men with high school diplomas end up in your sample with some (lower) probability q. Within those groups, those who do respond are not systematically different from those who don’t; they are exchangeable.

This is great news because it means that you can accurately estimate the mean of the variable of interest for subgroups defined by gender and education. If you also happen to know the distribution of gender and education in the population of interest, you can furthermore combine these subgroup-specific values with appropriate weighting and end up with an unbiased estimate for the population. For this to work, of course, all groups need to be present in your data in the first place (this is referred to as positivity, see e.g. Mercer et al., 2017). If everybody in your study has a high school diploma but the population of interest contains a non-negligible number of people without one, that puts you in a bad spot. Thus, a few “atypical” sprinkles on your sample may often be helpful, even if it doesn’t live up to the ideal of representativeness. 
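To make that reweighting logic concrete, here is a minimal sketch in Python with entirely invented numbers (group shares, effect sizes, sample composition are all assumptions for illustration): estimate the mean within each gender-by-education cell of the selective sample, then average the cell means using the cells’ assumed shares in the population of interest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical selective sample: women and college-educated people are
# overrepresented relative to the population (all numbers invented).
sample = pd.DataFrame({
    "gender": rng.choice(["f", "m"], size=n, p=[0.7, 0.3]),
    "education": rng.choice(["college", "no_college"], size=n, p=[0.6, 0.4]),
})
# Happiness depends on gender and education (plus noise) -- but within
# each cell, respondents and non-respondents are exchangeable.
sample["happiness"] = (
    5
    + 0.5 * (sample["gender"] == "f")
    + 1.0 * (sample["education"] == "college")
    + rng.normal(0, 1, n)
)

# Assumed (invented) population shares of the gender x education cells.
population_shares = {
    ("f", "college"): 0.15, ("f", "no_college"): 0.35,
    ("m", "college"): 0.15, ("m", "no_college"): 0.35,
}

cell_means = sample.groupby(["gender", "education"])["happiness"].mean()
naive = sample["happiness"].mean()
poststratified = sum(
    share * cell_means[cell] for cell, share in population_shares.items()
)

print(f"naive sample mean:       {naive:.2f}")        # overshoots
print(f"poststratified estimate: {poststratified:.2f}")  # close to the true 5.55
```

With a selective but “explainable” sample like this, the naive sample mean overshoots the population value, while the poststratified estimate lands close to it – precisely because the cell means themselves are not distorted.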

The most likely scenario is of course one in which you neither have a truly random sample of the population nor the necessary variables to render it random conditional on available information. These come in two variations with slightly different causal graphs.

First, imagine you want to estimate how many people are vaccinated against COVID-19 by asking people about their vaccination status. Now imagine that trusting the scientific establishment makes one more likely to get vaccinated, and also makes one more likely to participate in surveys conducted by scientists. We probably did not measure trust in the scientific establishment, and even if we did, we would additionally need to have a credible estimate of its distribution in the general population. But at least in principle, we can imagine a situation in which we had those pieces of information that allowed us to reweight the data. Second, start from the same scenario but now whether or not people participate in the study actually causally depends on their vaccination status, which is the outcome of interest. In this scenario, collecting additional variables won’t help, so referring to future studies offers little solace.

In either case, trying to credibly estimate the prevalence of vaccination will cause vexation. Depending on the stakes and the purpose of your research, you may still proceed, but any further step will hinge on assumptions and result in more uncertain claims. For example, you may be willing to make some assumptions about the direction of the bias introduced by nonrandom sampling. If people who trust the scientific establishment are more likely to participate and more likely to be vaccinated, your estimate may be a plausible upper bound. 

You could also reason that actually, the selectivity of your sample isn’t a big issue for estimating the population mean because of the relationships of the involved variables. For example, let’s say you want to estimate the population average of the ratio between the length of the first and the second toe (1D:2D ratio) but your sample only includes people who are very eager to participate in very important research studies. One could argue that the selectivity doesn’t matter, because why would people who are eager to participate in studies vary systematically with respect to their 1D:2D ratio? Of course, the lack of an association is an assumption that may be disproven in future studies, so better be on your toes when taking this inferential risk.

Speaking of which, assuming that you have collected all the information that you need to render the sample “as good as random” is also just that, an assumption that may be wrong. This also holds for “nationally representative” panel and survey studies that are frequently used in the social sciences. These studies often invest a lot of effort into ensuring that samples are as good as possible, implementing complex sampling schemes and taking measures so that people don’t drop out,[7] and then, on top of that, give you weights to reweight the data to take sampling design and potential non-response bias into account. But those people aren’t magicians either, and so they have to rely on the information they have (variables such as federal state, region, gender, and age). It is still possible that there are other factors that affect the probability of participation beyond that (at the very least, whether or not one has the patience to participate in a nationally representative panel study). So your sample may look representative on observable characteristics but could still be non-representative because of unobservables. Whether or not these will bias your estimate will depend on how they relate to the outcome of interest. Personally, I would often be willing to say “it’s the best type of data we can get, so this is our best bet” unless you are trying to estimate something like “liking of panel studies” in the general population.

Figure 3. It’s always those pesky little unobservables, isn’t it.

A truly representative sample is more a useful abstraction than something you will actually ever work with,[8] and the heavy lifting is done by assumptions about the data-generating process, that is: about how your sample was generated, and how the involved variables relate to the outcome of interest.

Another way to think about this, which you may find helpful[9] for figuring out the specifics of your data situation, is in terms of missing data problems. If you (hypothetically) invite everyone to join your study and only some respond, the data of the non-respondents are missing. This missingness may be completely random (missing completely at random, MCAR), which results in a representative sample; it may be random conditional on information available to you (missing at random, MAR); or it may be missing in a non-random manner (not missing at random, NMAR; also known as missing not at random, MNAR).[10] The last one maps onto the two vexing vaccination rate scenarios described above: there may be variables that render the missingness random but that you unfortunately did not measure; or alternatively, the missingness directly depends on the variable of interest.
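A tiny simulation can illustrate the difference between the three mechanisms for the vaccination example (all coefficients and response probabilities are invented for the sketch): under MCAR the respondents’ vaccination rate tracks the true rate, under MAR it is off but could be repaired by reweighting on trust if trust were measured, and under MNAR no amount of reweighting on measured covariates will save you.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

trust = rng.normal(0, 1, n)                        # trust in science
p_vax = 1 / (1 + np.exp(-(0.5 + 1.5 * trust)))     # trust -> vaccination
vaccinated = rng.random(n) < p_vax
true_rate = vaccinated.mean()

def observed_rate(p_respond):
    """Vaccination rate among those who happen to respond."""
    responds = rng.random(n) < p_respond
    return vaccinated[responds].mean()

# MCAR: everyone responds with the same probability.
mcar = observed_rate(np.full(n, 0.1))

# MAR-ish: response depends on trust; fixable by reweighting if trust
# were measured and its population distribution known.
mar = observed_rate(1 / (1 + np.exp(-(-2 + 1.0 * trust))))

# MNAR: response depends on vaccination status itself.
mnar = observed_rate(np.where(vaccinated, 0.15, 0.05))

print(f"true rate {true_rate:.3f} | MCAR {mcar:.3f} | "
      f"MAR {mar:.3f} | MNAR {mnar:.3f}")
```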

There is a very nice paper by Felix Thoemmes and Karthika Mohan discussing these problems with DAGs from a causal perspective, because all of this hinges on the causal mechanism that generated the data, and thus, eventually, on the assumptions that we are willing to make.

This was a lot of exposition for an inferential problem that we rarely encounter in psychology. Sometimes we do want to estimate prevalences and means (e.g., in psychiatric epidemiology, when generating norms for IQ tests), but usually, we care little about it and happily standardize everything away (which isn’t necessarily a good thing and maybe a topic for a future blog post). But a lot of the previous points generalize to the other scenarios, so let’s move on to causal effects.

Scenario II: You want to estimate a causal effect, experimental edition

You run an experiment to identify some causal effect, but your sample is not representative of the population of interest. The good news is that your experiment will give you the average causal effect of the manipulation you implemented for the people in your sample. The not-so-good news is that this won’t necessarily be the average causal effect in the general population. Just like average happiness may differ between populations of interest, so may the average effect of some intervention on happiness. One way to “get rid of” this problem is to simply assume that the effect is the same for everyone and call it a day.[11]

Figure 4. The ghost of Paul Meehl according to Midjourney.

A priori, this may not be the most satisfying solution given many psychologists’ insistence that the human psyche is so complex that everything (including interindividual differences) interacts with everything in mysterious ways. It’s even less satisfying once we additionally consider that, in non-linear models, everything may be interacting even if the coefficients don’t directly tell us that.

So effect homogeneity should not be the default assumption for all of psychology. It may still be a solid, defensible assumption for certain domains–for example, I’m personally quite willing to assume that certain fundamental psychophysics effects are very similar across people from the same population. For anything closer to the social spectrum of psychology, I’d be more skeptical. This is where the aforementioned study enters the stage: Coppock, Leeper, and Mullinix (2018) tried to quantify the amount of effect heterogeneity in a number of “standard social science survey experiments”[12] along certain dimensions (e.g., age, gender, and education, but also partisanship). They conclude that there seems to be fairly little effect heterogeneity for these types of experiments, along these dimensions, which is good news for people conducting these types of studies in non-representative samples. Whether this generalizability generalizes to other types of experiments is, of course, another question. The authors are quite careful with their language, saying explicitly that “[t]he response to this evidence should not be that any convenience sample can be used to study any treatment without concern about generalizability.” (It’s a good paper and short, so give it a read.)

But let’s assume we can’t assume that the effects are homogeneous. What now? The logic from Scenario I still applies. When everybody from the population had the same chance to end up in your sample (“representative”, equal probability sample, missing completely at random), you can simply do your usual analysis (e.g., a t-test) to estimate the average effect in your sample, and this will be an unbiased estimate of the average effect in the population of interest.

Heeeey, MCARena!

In the scenario in which your sample is selective but in ways that you can fully explain (e.g., women and highly educated individuals may once again be more likely to end up in your sample, missing at random conditional on available information), you can again essentially reweight the data to determine the average effect of interest in the target population.[13] And what if you don’t understand the selectiveness or lack the information you need to generalize your estimate to the population (not missing at random)? Again, you can try to make some more or less well-informed guesses, but those should be accompanied by appropriate justifications and a discussion of the necessary assumptions.
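Here is a crude sketch of that reweighting idea for an experiment. This is not the multilevel regression and poststratification workflow suggested in Deffner et al. (2022), just simple stratum weighting with invented numbers: estimate the treatment effect within each education group, then combine the group-specific effects using assumed population shares.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2_000

# Hypothetical selective sample: 80% of participants are college-educated,
# although (by assumption) only 30% of the target population is.
college = rng.random(n) < 0.8
treated = rng.random(n) < 0.5                  # randomized treatment
true_effect = np.where(college, 0.2, 1.0)      # heterogeneous true effect
y = 5 + true_effect * treated + rng.normal(0, 1, n)

df = pd.DataFrame({
    "education": np.where(college, "college", "no_college"),
    "treated": treated,
    "y": y,
})

# Treatment effect within each education stratum (difference in means).
means = df.groupby(["education", "treated"])["y"].mean()
stratum_effects = {
    edu: means[(edu, True)] - means[(edu, False)]
    for edu in ["college", "no_college"]
}

# Effect in the sample vs. effect reweighted to the assumed population.
sample_effect = df.loc[df.treated, "y"].mean() - df.loc[~df.treated, "y"].mean()
population_shares = {"college": 0.30, "no_college": 0.70}
population_effect = sum(
    share * stratum_effects[edu] for edu, share in population_shares.items()
)

print(f"effect in the sample:                {sample_effect:.2f}")
print(f"effect reweighted to the population: {population_effect:.2f}")
```

Because the college-educated (who, by assumption, benefit less) are overrepresented, the raw sample effect understates the assumed population effect; the stratum-weighted estimate recovers it, provided the within-stratum effects carry over.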

Figure 5. A psychological experiment machine. Don’t ask.

Of course, you may run an experiment and just say “well I just want to estimate the average effect in that particular sample, isn’t that interesting enough?” As Borysław Paulewicz writes in this Twitter thread, at least you’re able to show that your intervention works, even if only for some people. One way to think about this is that you just declare your sample the population of interest (I’m interested in precisely those people). But given that nobody writes discussion sections about precisely those people (“These findings demonstrate that for John, Robert, Michael, Linda, Jessica, and Sarah, thinking of a recent painful episode in their lives leads to…”), what is more likely to happen is some sort of implicit generalization fueled by implicit assumptions. Surely, effect heterogeneity won’t be so large that the intervention has the opposite effect in large parts of the population? And even if there were some people for whom it had a large negative effect, at least some would have ended up in your non-random sample by chance alone, right? Those assumptions may sometimes be justified (e.g., when we have good reasons, including prior studies, that make us suspect there won’t be much systematic heterogeneity), and maybe sometimes they aren’t.

Claiming that you don’t want to generalize at all may often be the path of least resistance, not least because of the following (generalizable, robust) empirical law: If there is a tempting yet not very well-supported conclusion that could be drawn from your study, readers will draw it even if you tell them they shouldn’t.[14][15] This holds whether you like it (because it makes your research more interesting while maintaining plausible deniability) or not (because you want to stay in control of the narrative). This is why I am somewhat skeptical about the utility of so-called “Constraints on Generality” statements and more in favor of spelling out the assumptions necessary to support generalizations.

Scenario III: You want to estimate a causal effect, non-experimental edition

But let’s say your question doesn’t lend itself to experiments and you end up trying to estimate a causal effect from observational data. This makes things more challenging because even with perfect data from the whole population, estimating causal effects from observational data requires a lot of assumptions. But the interesting thing that I want to highlight here is that these requirements can collide with non-representative samples in unfortunate ways. If you’ve been following this blog, you can already guess what’s coming next. It’s our old frenemy collider bias.

Figure 6. Trust me, it is always collider bias, leider.

Collider bias can become an issue when participation in your study is causally affected by the variables whose causal relationship you’re interested in. To resuscitate a previous example with some modification, let’s say you’re interested in how intelligence affects people’s willingness to keep doing some mindless repetitive task when told to do so. It’s hard to get the general population to participate in a study that involves mindless repetitive tasks, but psychology students are readily available.[16] So, you invite them into your lab, let them fill out some intelligence test, and then tell them they have to cross out every single occurrence of the letter “p” in Dostoevsky’s “The Idiot.” Then you hand them the book and a pen and record how long it takes for them to say “what the hell?” and quit.

You observe that smarter people quit earlier, which lends itself to a certain narrative: smart people notice how stupid your study is and thus leave. But correlation does not equal causation, so you may start thinking about potential confounders (e.g., gender, age). However, in the described setup, you don’t only have to worry about common cause confounding (the vanilla flavor of confounding), you also have to worry about collider bias induced via sampling. Being smart increases people’s chances of studying psychology (or really of studying anything at all), but so does being conscientious (in particular in Germany, where you need to finish school with top grades to be admitted). This alone is sufficient to induce a spurious negative association between intelligence and conscientiousness in psychology students. So, your smarter psych drones may actually quit earlier not because they are clever, but because they are less conscientious – and your sampling induced that spurious association between smarts and grit(s) in the first place. Thus, the non-representative sample induces new threats to causal claims based on observational data, as if there hadn’t been enough already![17]
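A small simulation makes the mechanism tangible (the traits, the selection rule, and the threshold are all made up): intelligence and conscientiousness are uncorrelated in the simulated population, but once participation depends on both, they are negatively correlated among participants.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 100_000

# Independent traits in the population.
intelligence = rng.normal(0, 1, n)
conscientiousness = rng.normal(0, 1, n)

# Ending up in the study (our stand-in for "studies psychology and shows
# up at the lab") depends on both traits, plus some luck.
in_study = (intelligence + conscientiousness + rng.normal(0, 1, n)) > 1.5

r_pop = np.corrcoef(intelligence, conscientiousness)[0, 1]
r_sample = np.corrcoef(intelligence[in_study], conscientiousness[in_study])[0, 1]

print(f"correlation in the population:  {r_pop:.2f}")   # approximately zero
print(f"correlation among participants: {r_sample:.2f}")  # clearly negative
```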

Actually, even if you are just interested in the correlation (e.g., between intelligence and task persistence), the correlation you observe in your sample won’t necessarily be the correlation in the population.[18] Consider the case of online personality Aella, who happens to be smart and popular in rationalist circles, and a sex worker. Aella likes to do Twitter polls that ask for two things at once, say “are you a boy or a girl” and “what is your opinion on [societally shunned sexual act that doesn’t really harm anyone]?” This can be used to get at the correlation between gender and opinions on [societally shunned sexual act] among Aella’s followers. But it is likely quite uninformative about the correlation in the general population, because following Aella on Twitter likely causally depends on gender (potentially mediated by horniness) and on being into rationalist stuff. That means that Aella is a collider that introduces (or masks) all sorts of correlations between gender and attitudes on a wide variety of topics in ways that are hard to predict unless we know what causes people to follow Aella. That said, this needn’t invalidate all conclusions from her polls (it depends on the causal graph of the involved variables), and she is now also using other sources to recruit respondents for her forays into human sexuality. She is also doing many things right (like sharing her raw data, plotting and reporting means), and frankly nobody is going to include questions on obscure kinks in major household panels anytime soon, so let him/her who never fell for collider bias or who only uses representative samples cast the first stone.[19]

What to do? You already know the MCAR/MAR/NMAR logic, and it applies here as well. If it is random who from the population of interest ended up in your sample and who didn’t (MCAR), there is no (additional) problem (beyond the usual headaches of causal inference). If we are willing to assume that we understand enough about the selectivity and if we collected the right variables, we may be able to fix the selectivity issue by modeling it (MAR). In economics, if I understood correctly, James Heckman received a Nobel Prize for his work on that topic; in psychology, corrections for range restriction have been developed.[20] And if we don’t have a proper grasp on the selectivity, we have to tread carefully, as inference will heavily hinge on additional assumptions.

Figure 7. Participant from your hypothetical idiot study, just before he notices the futility of it all.

So are representative samples redundant at best?

Of course, they aren’t. Non-representative samples can threaten both external validity (scenarios I-III) and internal validity (scenario III). To what extent those threats should worry you will depend on what you are trying to do and which assumptions you are willing to make (e.g., considering the selection mechanism, considering the amount of effect heterogeneity in the population). Going “forward”, this may affect how you plan your studies. Going “backward” – taking the data as is, with all its imperfections (#AllDataAreBeautiful) – this may affect how you analyze the data to arrive at the best possible answer. I feel like we as a field can do much better on both fronts.

Considering study planning, one conclusion could be that sometimes, a convenience sample just won’t scratch the inference itch, so maybe it would be wiser to pool resources for higher-quality sampling (to, e.g., at least get a couple of non-students so that you can try to project estimates onto the general population using assumptions)[21] or to rely on existing national surveys (even if they don’t contain the 300-item personality questionnaire of your dreams/nightmares).

Considering data analysis, it feels to me that while psychologists are very happy to talk about interactions of all sorts, there is no broad understanding that effects may vary between individuals in less tractable ways, thus rendering effect estimates from individual studies a property of the respective sample. This, of course, leads to the broader issue that we often treat “the effect of X on Y” as some disembodied law of nature, rather than an abstraction that can be specified in precise terms which will vary depending on your study. More precision about what we are trying to estimate (i.e., estimands) won’t hurt, even if at some point it does turn out that empirically, effect heterogeneity is not that big of a deal. Maybe we felt the future all along! But until then, it’s just an assumption.

Footnotes

1, 2 Such as motivated reasoning.
3 This was brought to my attention by Stas Kolenikov and Raphael Nishimura after this post had been published, how embarrassing. The two of them plus Andrew Mercer provided some super helpful explanations and pointers to the literature, for which I am very grateful. Folks, listen to your local survey methodologist/statistician! As a side note, readers of this blog may also enjoy Nishimura’s pinned tweet.
4 Stas brings up a study on American Jews for which he was the survey designer. For example, in America, Jews tend to live in some specific urban areas; sampling more heavily in those areas gives you better bang for the buck.
5 …institutional review boards these days, amirite?
6 And the beforemath, but this is for a different blog post.
7 For example, interviewers of the German SOEP have a budget to buy small gifts such as flowers to keep respondents happy. And they say romance is dead.
8 Except for edge cases, such as compulsory schooling or those magical Scandinavian countries that collect all sorts of registry data. Though this is cheating, because if the probability of inclusion in the sample is virtually 1, it’s no longer a sample, is it.
9 Or confusing, in which case just ignore this paragraph.
10 Whether Rubin was trolling when he came up with these NightMARish acronyms is unknown. Yet here we MAR.
11 Putting the “con” in convenience sample
12 I couldn’t find a table explaining all included experimental procedures, so I pulled up two examples to get an idea of the type of paradigm they looked at. In one experiment, the outcome of interest was support for the death penalty and the treatments involved three argument conditions – either respondents were asked right away (1), or they were presented with the information that sources say that it is unfair because most people who are executed are African American (2) or that it is unfair because too many innocent people are being executed (3). In another one, respondents were asked for their support of different bills, but the description of the bill either contained no cue (1), or mentioned that it was supported by either Obama (2) or by McCain (3).
13 There are different ways to do this; in Deffner et al. (2022) we suggest multilevel regression with poststratification as a principled approach.
14 This was once revealed to me in a dream.
15 A corollary to this is “The Law of Lakens’ guidelines” according to which, whenever you try to make the point that researchers should not follow certain guidelines per default, you will get cited as a source of said guidelines. Many thanks to Taym, who brought this to my attention.
16 To the same degree that course credit is available to them.
17 As a side note, even if you end up with a representative sample, collider bias via selection can still be an issue for causal inference because the target population may already be affected by it. For example, the following populations have been going through some sort of selection filter that may introduce all sorts of spurious correlations: women in STEM, people in management positions, people who survive to the age of 80, people who survive childhood, humans who have been born.
18 If you are doing an experiment but then want to investigate how non-manipulated third variables “moderate” the effect (even just in a non-causal sense, see here), all of the concerns in this section apply.
19 Never fell for collider bias -> Casting stone <- Only uses representative samples
20 I used to think these corrections were “causally agnostic”, but Brenton Wiernik informed me that these corrections are a lot more sophisticated than I had thought, and I absolutely trust Brenton on these issues (and so should you).
21 Ruben pointed out that we may be doing ourselves a disservice by only having catchy words for representative samples (expensive and potentially unattainable) and convenience samples (which may range from psych students at Harvard to MTurkers). I concur, all samples are inconvenient, but some are (more) useful (for generalization). Of course, the survey people are way ahead of us by not talking about representative samples at all, but instead providing more precise definitions of features of samples in relation to inferential goals.