The 100% CI

A few months ago, I put a $3,000 bounty on my own research.

In a Red Team Challenge, my collaborators and I gave five bounty hunters financial incentives to spend three weeks searching for errors in a submission-ready manuscript. After scouring the materials, data, code, and manuscript, the Red Teamers reported 107 issues, several of which were deemed critical by a neutral arbiter. Now, instead of putting the final touches on my submission cover letter, I am back at the drawing board: fixing the fixable, designing a follow-up study to address the unfixable, and considering what role Red Teams can play in science more broadly. This reflection is the first post in a three-part blog series; the other two will come from Ruben Arslan (Red Team Challenge arbiter and co-organizer) and from Daniël Lakens and Leonid Tiokhin (Red Team Challenge co-organizers).

The Red Team Challenge was motivated by the belief that, all else equal, scientists should trust studies and theories that have been more critically evaluated (Lakens, 2020; Mayo, 2018). It was also motivated by recent reminders that researchers can improve their current error-detection methods. For example, after months of follow-up studies, the original Lancet study outlining the harms of hydroxychloroquine was retracted due to an inability to verify the veracity of the data and analysis procedures (Mehra, Desai, Ruschitzka, & Patel, 2020). In addition, a recent Psychological Science paper on religion, violence, and IQ was retracted due to major concerns about the validity of the measures (Clark et al., 2020). These are not just one-off examples. Systematic investigations show that traditional peer reviewers consistently miss deliberately placed errors (Baxt, Waeckerle, Berlin, & Callaham, 1998; Godlee et al., 1998). And those investigations do not include the types of errors that peer reviewers would only discover upon close inspection of the data, materials, and code.

Calls for reforming error-detection methods are not new (Agnew, 1999; Bornmann, 2011; Smith, 1997). However, the problem of inadequate error detection in science persists (Vazire, 2019, 2020; Lakens, 2020). The Red Team Challenge was a feasibility study of a case where external reviewers were financially incentivized to find errors. The observations it yielded are anecdotes, and it doesn’t take a Red Team to acknowledge the limitations of these anecdotes. Nevertheless, I believe that the results of this challenge are sufficiently interesting to facilitate discussion of how error-detection can be better incorporated into the research process.

The “Blue Team” Perspective

It is difficult to describe how it felt to place my research in “the crosshairs” of financially incentivized error detectors. Sometimes I was anxious about placing myself in such a vulnerable position. Sometimes I was humbled to see that my scientific practices had much room for improvement. But, most of the time, I was simply awestruck by the power of such a critical and thorough approach to detecting errors in science.

I feel that most aspects of the research being evaluated survived scrutiny. Nevertheless, the Red Team identified 107 instances where the work could be improved. Some of the issues may have been caught by traditional peer review (e.g., recommendations to expand the literature review, requests to more extensively describe modifications to the measures, questions about the appropriateness of the power simulation). Many, however, reflected a level of thoroughness I have not encountered in traditional peer review (e.g., recommendations of exclusions based on an examination of experimenter notes, identified cases of computational irreproducibility, thoughts about oddities that were discovered while exploring the data).

Most of the reported issues can be corrected easily, such as computationally irreproducible code. (This was ironic given my own work on computational reproducibility [Obels et al., 2019].) In other instances, the problems cannot be fixed because they stem from how we conducted the experiment. For example, the Red Team made a compelling argument that the unblinded nature of the study may have threatened the validity of a critical experimenter-delivered manipulation. In addition, the Red Team identified a previously unknown confound in the critical manipulation. Without more data, it is not clear whether these limitations undermine our conclusions, but we nevertheless think they are worth evaluating. As my PhD advisor and manuscript co-author, Jeff Larsen, once said: “Science isn’t about being right today. It’s about being less wrong tomorrow.” In that spirit, the pre-print now indicates that there are known limitations being examined in an ongoing follow-up study.

Acknowledging that you’re wrong is easy. Figuring out how wrong you are, and why, is not. In that domain, Red Teamers did a lot of heavy lifting on my behalf. They used their relatively fresh and unbiased perspectives to help identify which areas of the research we should be confident in (e.g., I’m now quite confident that there isn’t a devastating data processing error in the code) and which we should not (e.g., I am now less confident about the validity of our critical manipulation). Regardless of what the follow-up study yields, I believe that the Red Team feedback lends the work credibility. Scientists should trust studies and theories that have been more critically evaluated. And the Red Team approach seemed more critical and thorough than standard peer review.

Concluding Thoughts

A strong science requires a strong commitment to error-detection. No system will be perfect, but is our current approach really the best that we can do?

The Red Team Challenge provides one idea for how we might do things differently. At the same time, though, there are many other potential reforms to improve error detection in science, and we have little information about which reforms are feasible and effective. The Red Team Challenge may not end up being the best solution, but I hope it inspires a serious conversation about how to improve error-detection in science. After all, if we don’t begin to take quality-control seriously, why should we expect fellow scientists and the general public to take us seriously?

Stay tuned for Part 2 and Part 3 of this blog series. In Part 2, Ruben Arslan will discuss challenges in Red Team arbitration. In Part 3, Daniël Lakens and Leonid Tiokhin will discuss current and future implementations of Red Teams.

References

Agnew, B. (1999). NIH eyes sweeping reform of peer review. Science, 286(5442), 1074-1076.

Baxt, W. G., Waeckerle, J. F., Berlin, J. A., & Callaham, M. L. (1998). Who reviews the reviewers? Feasibility of using a fictitious manuscript to evaluate peer reviewer performance. Annals of Emergency Medicine, 32(3), 310-317.

Bornmann, L. (2011). Scientific peer review. Annual Review of Information Science and Technology, 45(1), 197-245.

Clark, C. J., Winegard, B. M., Beardslee, J., Baumeister, R. F., & Shariff, A. F. (2020). Declines in Religiosity Predict Increases in Violent Crime—but Not Among Countries With Relatively High Average IQ. Psychological Science, 31(2), 170-183.

Godlee, F., Gale, C. R., & Martyn, C. N. (1998). Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomized controlled trial. JAMA, 280(3), 237-240.

Lakens, D. (2020). Pandemic researchers — recruit your own best critics. Nature, 581, 121.

Mayo, D. G. (2018). Statistical inference as severe testing. Cambridge: Cambridge University Press.

Mehra, M. R., Desai, S. S., Ruschitzka, F., & Patel, A. N. (2020). Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. The Lancet.

Obels, P., Lakens, D., Coles, N. A., Gottfried, J., & Green, S. A. (2019). Analysis of Open Data and Computational Reproducibility in Registered Reports in Psychology. Advances in Methods and Practices in Psychological Science, 3(2), 229-237.

Smith, R. (1997). Peer review: reform or revolution? BMJ, 315, 759.

Vazire, S. (2019). A toast to the error detectors. Nature, 577, 9.

Vazire, S. (2020, June 25). Peer-Reviewed Scientific Journals Don’t Really Do Their Job. Retrieved June 29, 2020, from https://www.wired.com/story/peer-reviewed-scientific-journals-dont-really-do-their-job/

The Red Team Challenge (Part 1): Why I placed a bounty on my own research