Overfitting vs. Open Data

In the debate around open data, I’m missing voices from people working with public data or scientific-use data (i.e. data that becomes available after an application). Whereas I understand and agree with the arguments for publicly sharing single-purpose data (a category whose borders I would not try to defend) and for sharing the subsets of data used for a published analysis, I think especially large datasets used by many researchers require a different approach.[1]

Readers of The 100% CI probably tend to agree that overfitting, p-hacking, and publication bias are the bane of today’s science, and that we should use open study planning (preregistration), open methods, open data, and open access to fight them. By publishing our data after analysis, we enable others to reproduce our results and test alternative explanations that we neglected.

However, when publishing data before analysis, as is the norm for new DFG-funded projects in Germany (e.g., TwinLife), I believe we gain too little transparency to justify the increased potential for overfitting and rushed analyses.

Namely, our confidence that the data was analysed and interpreted well is lowered…

  • if we cannot be sure that researchers did not have access to the data before planning their analyses (overfitting, aka the Texas sharpshooter fallacy)
  • if we have reason to believe that multiple analysts (I like to think of them as “gunslingers”), maybe even with preregistered analysis approaches, tackle the same problem, and only the quickest draw, not necessarily the best shot,[2] ends up being published (a rat race)

The first problem is well-known. Although a SIPS workgroup is currently drafting guidelines for preregistrations of analyses of existing data, I have to admit that this would not sway me much if I didn’t already trust the authors more than I trust myself (aka the Göttingen crapshooter fallacy).[3] In fact, I preregistered several analyses of existing data on the Open Science Framework, but in the end decided not to release them with the papers, because I did not think it would do any good.

As an example of the second problem, let me cite this series of tweets from Stuart Ritchie.

The UK Biobank, in case you are not familiar with it, is a tremendous wealth of medical, psychological, genetic, and neurological data on half a million British people. New additions to the data become available to all approved applicants at the same time. When a new batch “drops”, the applicants have pre-written their data cleaning and analysis code, and often even building blocks for manuscripts covering several possible scenarios for the results. Kind of like preregistrations. Science at its best, right?

Well, no. Due to the tremendous time pressure in this race to publish, I worry that inferior analyses are published first. Then, superior analyses (e.g., with more robustness checks, as in Stuart’s case) might not end up published, because the analysts decide to move on. Journals won’t publish a very similar “re-analysis”, independently conceived or not. Even if finished and published, a second analysis might not be cited as much as the first paper, because it makes a smaller splash.

With these problems in mind, here are some potential solutions that accept a little data secrecy in exchange for better analyses, i.e. less overfitting and fewer rat races.

Existing ideas

Some data holders freely release their data into the wild. This is normally seen as better than only releasing the data to those who apply, unless privacy concerns justify secrecy. Funders, like the DFG, increasingly agree that those who get grant money to collect useful data shall not hoard it. I agree with this in principle, but want to make the case that overfitting concerns also justify some secrecy, lest unfettered access degenerate into rat races to overfit.

Here’s a different idea:[4] data holders join efforts with the Registered Reports initiative, so that applications for data access can double as a Stage 1 Registered Report. This way, the analysts are assured of publication, scooping concerns are reduced, and peer review is independent of the sexiness of the results. RR reviewers could even pick from several proposals to answer the same question (of course only when one approach is superior to the others, not complementary to them). Although it’s unpleasant to hear your proposal was considered inferior, at least you would waste less time.

But this is still a primitive idea. It takes very experienced analysts to propose a sound set of analyses before seeing the data; even very[5] experienced analysts say things like “we learned most of what to look at after looking at the data”, and even with lots of experience it’s hard not to fool yourself. I think we can do better by elaborating on this approach.

Elaboration 1: Release training data only

The data holder releases only part of the data, a training set. A holdout is kept in a locked vault. The training data are made available to applicants (as above) or publicly. Scientists then try their best to extract the maximum amount of signal while preventing overfitting (using methods such as cross-validation). Then, upon acceptance of e.g. a Registered Report written on the basis of the training data, the data holder runs the reproducible scripts on the holdout. Now, with these results and subsequent discussion, the Registered Report is published.
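
To make this concrete, here is a minimal sketch of how that could look in R. The file names and the 70/30 split are made up for illustration; a data holder would obviously pick these to fit their setting.

    ## --- Data holder, before release -------------------------------
    set.seed(20170924)                       # make the split reproducible
    full_data <- readRDS("full_data.rds")    # hypothetical full dataset

    train_idx <- sample(nrow(full_data), size = round(0.7 * nrow(full_data)))
    training  <- full_data[train_idx, ]      # released to applicants or the public
    holdout   <- full_data[-train_idx, ]     # stays in the locked vault

    saveRDS(training, "training_release.rds")
    saveRDS(holdout,  "holdout_vault.rds")

    ## --- Analysts ---------------------------------------------------
    ## They develop analysis.R against training_release.rds only (e.g. with
    ## cross-validation inside the training set) and submit the script as
    ## part of a Stage 1 Registered Report.

    ## --- Data holder, after Stage 1 acceptance ----------------------
    ## The very same script is run once, pointed at the vaulted holdout;
    ## those results go into the Stage 2 report.
    data_file <- "holdout_vault.rds"         # analysis.R reads `data_file`
    source("analysis.R")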

The main drawback of this method is that splitting the data into a training set and a test set (plus the vaulted holdout) is wasteful compared with more efficient approaches such as leave-one-out cross-validation.

Elaboration 2: Release synthetic data only

[Image: A highly transparent solution?]

The data holder publicly releases synthetic data, but keeps the real dataset in a vault. Synthetic data mirror the associations and missingness patterns in the real data, allowing researchers to write code and obtain results that would also work with the real data.[6] However, the data can be randomly perturbed, so that privacy is preserved for individual data points, and so that highly overfitted analyses are less likely to replicate on the real data.

Then, upon acceptance of e.g. a Registered Report written on the basis of the synthetic data, the data holder runs the reproducible scripts on the real data. Now, with these results and subsequent discussion, the Registered Report is published. Alternatively, the researchers can submit changed code upon seeing the results, but all submissions of code to be run on the real data become part of the published record.

The main drawback of this method is that I don’t know of a package that generates good synthetic data for multilevel data. For the one package I know, synthpop,[7] there are no guarantees of how faithfully the synthetic data will reproduce the “true” associations found in the real data, so you might actually underfit to the real data. I released synthetic data for our most recent preprint, but synthpop could not reproduce the multilevel structure of the data, so it will not be particularly useful, except for people who want to prepare code and have me run it instead of agreeing to our fairly mild privacy conditions for data sharing.
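
That said, the basic synthpop workflow itself is short. Here is a minimal sketch of how a data holder might generate and release a synthetic copy (the file names are made up; in practice you would inspect and tune the synthesis method for each variable):

    library(synthpop)

    real_data <- readRDS("real_data.rds")     # stays in the vault

    ## Generate a synthetic dataset that mimics the observed distributions
    ## and associations via sequential conditional models.
    sds <- syn(real_data, seed = 2017)

    ## Eyeball how faithfully the synthesis reproduces the observed
    ## (marginal) distributions.
    compare(sds, real_data)

    ## Release only the synthetic copy.
    write.csv(sds$syn, "synthetic_release.csv", row.names = FALSE)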

Elaboration 3: Differential privacy

Analysts train their analyses on a training dataset. Then they send their analysis code to an automated service that runs it on the holdout in a vault. They see the results, but only through a differentially private mechanism. To quote the paper introducing this idea, “The intuition is that if we can learn about the data set in aggregate while provably learning very little about any individual data element, then we can control the information leaked and thus prevent overfitting.”

This way, the analysts can reuse the holdout. Changing your approach after seeing the data is not necessarily bad,[8] but it’s hard to know what is bad and what isn’t without replication. By ruling out overfitting by design, the above algorithm lets you rest easy, basically alerting you when you have exhausted the holdout’s signal mine and started mining noise.
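
For the curious, the core of this reusable-holdout mechanism (the “Thresholdout” algorithm in the paper) fits in a few lines. This is a simplified sketch in R, not the authors’ implementation: the threshold and noise scale are purely illustrative, and the full algorithm also keeps a budget of how often training and holdout estimates disagree, which is what eventually tells you that you have started mining noise.

    ## Laplace noise via inverse-CDF sampling
    rlaplace <- function(n, scale) {
      u <- runif(n, -0.5, 0.5)
      -scale * sign(u) * log(1 - 2 * abs(u))
    }

    ## Simplified Thresholdout: the analyst only ever sees the return value.
    ## train_est and holdout_est are the same statistic (e.g. a correlation
    ## or a classifier's accuracy) computed on the training set and on the
    ## vaulted holdout.
    thresholdout <- function(train_est, holdout_est,
                             threshold = 0.04, sigma = 0.01) {
      if (abs(train_est - holdout_est) > threshold + rlaplace(1, 4 * sigma)) {
        holdout_est + rlaplace(1, sigma)   # disagreement: return a noisy holdout value
      } else {
        train_est                          # agreement: reveal nothing beyond the training value
      }
    }

    ## Example: an association that looks strong in training but not in the holdout
    thresholdout(train_est = 0.30, holdout_est = 0.02)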

The challenges of this approach are that, as far as I know, no simple package or web service makes this algorithm easy for data holders to use (yet). Further, applying this to statistical results more complicated than “performance of a classifier”,[9] such as all the associations in an SEM, a GWAS, or the images from an fMRI analysis, may be difficult. On the other hand, it might force people to boil the test of their model down to a few simple, interpretable statistics.

I don’t think the first step, releasing training data, is strictly necessary, but I sure prefer to start with some plots, and I don’t think the algorithm could easily make all kinds of plots differentially private.

I’m aware that these solutions are not the perfect fit for every problem, but they seem like an improvement on the status quo. The cases I’m mainly interested in are those where institutions managing data access are already in place. I could imagine that data holders like the SOEP, TwinLife, the NLSY or the various biobanks could be convinced[10] to implement this for future waves of data. Still, nowadays small teams of individual investigators, like Sam Gosling and Jeff Potter or Michal Kosinski and David Stillwell, can also produce a wealth of data comparable in size to the SOEP. These data could also be shared as I outline above, but for this to be feasible for small teams, web services as in Elaboration 3[11] would have to be very easy to use.[12]

My strength is waning, and the “wave” was closer to fishy metaphors than I’m comfortable with, so I’ll end this here before I reely start going on about bobbers and bycatch again. I’m interested in herring any counterarguments people might have, or, alternatively, from data holders who are convinced by this.

Fin[13]

Footnotes

1 I outlined some of this reasoning in an earlier post in which I overwrought a fishing metaphor so heavily (seriously, Simine-Vazire-and-car-hoods-level overwrought) that you may be excused if you stopped reading halfway through because of a sudden appetite for trout. Since then, I have learned about synthetic data from Rogier Kievit and reusable holdouts implemented using differential privacy algorithms. Armed with these tools, I think I can make a good and fish-free case against open data for all.
2 I’m overworking metaphors again, aren’t I? Stop me, I have a problem. I have never caught a fish nor fired a gun.
3 Julia, who I trust more than I trust myself, tells me that the workgroup isn’t especially optimistic about this solution either, so they’re focusing more on transparently and completely reporting data documentation and analysis.
4 And here’s an appeal to authority: Registered Report guru and High Expectations Dad Meme incarnate Chris Chambers says he “likes it”.
5 very
6 If real data have curves, so do synthetic data.
7 Not to be confused with synth pop, a music genre that has produced songs perfectly capturing my feelings about synthpop.
8 I’m too lazy to make the case for this, so I hope you’ll simply allow my appeal to the authority of Andrew Gelman.
9 The authors have a machine learning background, where you often want to train a classifier. Apparently, this is not an inherent limitation, but I’d probably need a lot of time or preferably a statistics maven to implement this for my kind of work.
10 I’m not saying they will be less recalcitrant than journal editors though.
11 By the way, you’ll have to find a different name too, because that’s what Malte’s synth pop band is called.
12 Of course, if, given your workload, the only realistic alternative to sharing data publicly is burning it to a CD and keeping that somewhere in your office, then please share.
13 I have a problem.

7 thoughts on “Overfitting vs. Open Data”

  1. The real problem is surely with the model of scientific publishing, rather than data access. We need to change that to a system where reanalyses, replication and alternative approaches are the norm, and there isn’t a premium to being first to do a particular analysis. That would be better in lots of ways.

  2. I agree with that goal, but don’t see a concrete proposal that will fix all of science (and maybe some of human nature, when it comes to the preference for novelty) in the medium-term, so I think we need hacks like this.

  3. I tried for many hours to create a synthetic dataset that not only has the means, variability, and covariance structure of my actual data, but also the skewed distributions of the ordered categorical symptom data I work with. I didn’t find any way to do so properly: generating normal data is not an issue, but most data in clinical psych are not normal.

    Any thoughts / packages / help appreciated.

    1. Well, what did you try?
      I experimented only with synthpop, and it was too limited for my purposes, but it did have mechanisms to deal with categories (based on mice).
      So I’m fairly new to this myself. I think better, more user-friendly solutions are needed, but I really like the idea and I think even synthetic data that only reproduces means and simple covariances can be worthwhile for some purposes.

  4. I really appreciated this post, and I think it raises some really interesting issues that need exploration in open science. Some thoughts:

    (1) I’d give more prominence to releasing only training set data. If you’re working with large data sets, the “waste” of not having an extra few k cases is not preventing you from finding real relationships unless you’re really hunting around in subgroups. And if you’re hunting around in subgroups, that seems like the type of thing you should need to preregister and submit in writing before getting the full data.

    (2) I’d penalize the idea of synthetic or differentially private data more. I think your intuition that forcing folks to use simpler, more easily digested statistics to answer questions will improve inference is debatable. I think people tend to over-rely on linear, additive relationships in a world full of feedback loops leading to non-linearity. I could easily see this solution slowing down promising new approaches (like network science or non-linear machine learning algorithms) in favor of implausibly simple explanatory models. Who’s to say that it’s the correlation table between 10 variables that captures the most important information about their relationship?

    (3) We really need a mechanism to publish secondary analyses addressing the same questions. This doesn’t seem like a crazy proposition in some fields. It seems like econ and sociology publish differing accounts of a theoretical question using similar data sets. Does that sound right? Aren’t there examples of important re-analyses outside of psychology?

    1. 1) Exactly, but this may be a hard sell, because it’s inefficient. Still, this is the only solution I know of that is working in the wild (Kaggle).
      2) I may be naive about what these mechanisms can accomplish, but I think preventing overfitting is maybe more important than preventing “underfitting” (after all, models would still be checked against the real data). Still, some areas I’m familiar with could definitely make use of this and work with mainly linear, additive models (e.g. GWAS).
      3) Preprints are a mechanism, right? But a) as always, the real problem is incentivising it: people generally don’t care about a reanalysis unless you can convincingly debunk a high-profile claim, not if you just check it for robustness and confirm it; b) reanalyses done after seeing somebody else’s account may well overfit (i.e. you know they adjusted for X and Y, so, because you don’t want to believe the conclusion, you propose adjusting for Z, but if you hadn’t seen their work, you’d also have gone with X and Y). Not that I don’t think reanalyses are important (I dabble ;-)! But I’m just saying that parallel RRs for the same dataset would add value.

      1. This discussion actually gave me a good idea that we may want to talk about more elsewhere: you know how Rebecca Saxe and others have argued that first-year grad students/research methods classes should replicate an interesting finding as part of their training? Well, what if we had first-year quant students or early stats classes replicate interesting analyses as part of their training? I think people may do this informally, but as journals have more and more open data sets published with their articles, maybe we could have stats classes do things like replicate and extend primary analyses? Or do robustness checks (like in Julia’s new paper)? Seems like a way to incentivize reanalysis. Of course, then you still have to publish those somewhere to get academic credit…
