In my scientific work I strive to be as open as possible. Unfortunately I work with data that I cannot de-identify well enough to share (aka weird sex diaries) and data that simply isn’t mine to share (aka the reproductive histories of all Swedish people since 1950). To compensate for this embarrassing lack of openness I’ve tried to devise other ways of being transparent about my work. After a few failures (nobody ever tuned in to my RStudio-speedrun-Twitch-channel), this is one that’s worked quite well for me in the past. I now think it might be interesting even for people who don’t have to compensate for anything.
Namely, I’ve tried to share not only my statistical code, but also to ensure that it is accessible by turning it into a browsable, reproducible website.
I did this because there’s a bunch of problems you can run into if you just share data and code:
- Incomplete directions or code. Maybe you have some private knowledge like “First you need to set the working directory to C:\DavemasterDave\Nature_paper\PNAS_paper\JPSP_paper\PAID_paper\data, load the statistical packages in the order passed down to us from our elder advisors, rename this variable, and pray to Meehl.”
- Inconsistent package versions. You used dplyr v0.5.0 instead of v0.4.9 to reproduce the analyses. Turns out they changed distinct() to mean “throw away all my data and don’t warn me about it”.
- Hassle. First you download the code from the supplement file. Then get the data from Dryad. Then put them in the same folder and install the following packages (oh by the way you’ll need to compile this one from source for which you’ll need R, Rstudio, Rtools, and OS X Xcode command line tools). Oh and to get the codebook you just have to scale Mount Doom. Just don’t awake my advisor and you’re almost there.
- Poor documentation of the code. What does the function transformData_???_profit do? What do the variables prc_stirn_b, fertile_fab mean? Is RJS_6R the one that’s reverse-coded or did you name it this way after reversing RJS_6? And the all-time classic: Are women sex == 2 or sex == 1 (or twice as sex as men)?
- It starts with the cleaned data. When I work on a project, I spend 80% of my time on data preparation, cleaning, checking. Yet when I download reproducible files, they often start with a file named cleaned_final_2_Sunday23July_BCM.csv. Something went missing there. And there might be mistakes in that missing bit. In fact it’s approximately 99% more likely that I made a mistake cleaning my data than that Paul Bürkner made a mistake writing brms. As psychological science gets more complex, we should share our pipelines for wrangling data.
- Last, but most aggravatingly: Loss. Turns out personal university pages aren’t particularly reliable places to store your data and code and neither is that backup you made on that USB stick which you lent to your co-worker who.. fuck.
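Two of the documentation problems above (which number means which sex, and whether RJS_6R is already reversed) can be defused in the cleaning script itself. Here is a minimal sketch in base R with made-up toy data. The coding 1 = male, 2 = female and the 1-5 response scale are assumptions for illustration, not the coding of any real dataset.

```r
# Hypothetical toy data: the values and their meaning are invented for illustration.
raw <- data.frame(
  sex   = c(1, 2, 2, 1),   # ASSUMED coding: 1 = male, 2 = female
  RJS_6 = c(1, 5, 3, 2)    # ASSUMED to be an item answered on a 1-5 scale
)

clean <- raw

# Store the meaning of the codes in the data, so nobody has to guess later.
clean$sex <- factor(raw$sex, levels = c(1, 2), labels = c("male", "female"))

# Reverse-code into a NEW column and keep the original untouched,
# so the "R" suffix unambiguously means "already reversed".
scale_min <- 1
scale_max <- 5
clean$RJS_6R <- (scale_min + scale_max) - raw$RJS_6

print(clean)
```

Because the recoding lives in the script that readers can see, the answer to “is it reversed yet?” is in the code rather than in somebody’s memory.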
Reproducible websites solve these problems. Side effects may include agonising uncertainty about whether you should release them and whether mistakes will be found (fun fact: nobody reads the supplement. I’ve been including a video of Rick Astley in every single one of my supplements and so far nobody noticed. Another fun fact: nobody reads blog posts longer than 1000 characters, so admitting this here poses no risk at all).
This is the stack that I use to make my reproducible websites.
- R & RStudio. RStudio is an integrated development environment. This is not strictly necessary, but as RStudio makes or maintains a lot of the packages necessary to achieve reproducibility, using their editor is a smart choice.
- Packrat. Packrat solves messes due to different R package versions. Unfortunately, it’s a bit immature, so I’d currently recommend activating it only in the final stage of your project. If you’re often working on many projects simultaneously and you know the grief it causes when a package update in one project breaks your code in another, it might be worth the hassle to keep it activated.
- Rmarkdown (knitr). Markdown is an easy-to-learn markup language. Rmarkdown lets you document your R code, put the graphs where they belong etc. It also now generates websites and bibliographies with little additional work.
- Git & Github. Git does distributed version control. Many scientists work alone, so Git may seem like overkill, but a) when you go open, you will never be alone again (cue sappy music) and b) the features Github offers (notably: hosting for your website through Github pages) make up for Git’s somewhat steep learning curve. RStudio provides a rustic visual interface for Git; I personally prefer SourceTree.
- Zenodo. They will permanently archive your website (and anything else you care to share) for free. If you make your stuff public, you get a citable DOI, which will lead to your document, even if Github and Zenodo should one day cease to exist. Zenodo can be integrated with Github, so that releases on Github are automatically uploaded to Zenodo.
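To give a rough idea of how the Packrat and Rmarkdown parts of this list fit together, here is a minimal sketch. It is not the exact setup of my starter project. The helper names `freeze_packages` and `build_site` are made up for this example, and the sketch assumes that the packrat and rmarkdown packages are installed and that the project folder contains an `index.Rmd` and a `_site.yml`.

```r
# Hypothetical helpers, names invented for this sketch.
# Assumes the packrat and rmarkdown packages are installed.

# Record the package versions the project currently uses.
freeze_packages <- function(project_dir = ".") {
  lockfile <- file.path(project_dir, "packrat", "packrat.lock")
  if (!file.exists(lockfile)) {
    # First use: set up a private library; init() also takes an initial snapshot.
    packrat::init(project = project_dir, enter = FALSE)
  } else {
    # Later: update the lockfile to the versions in use now.
    packrat::snapshot(project = project_dir)
  }
}

# Knit all .Rmd files in the folder into a browsable website.
# Assumes the folder holds an index.Rmd and a _site.yml.
build_site <- function(project_dir = ".") {
  rmarkdown::render_site(input = project_dir)
}
```

In the final stage of a project you might call `freeze_packages()` once and then `build_site()`. Somebody who later clones the project could call `packrat::restore()` to install the recorded versions.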
To make this stack work well together, there’s a few hurdles to clear. And let me be completely frank with you: You still need R, RStudio, and a working knowledge of Mount Doom’s geography. But your readers will only need a web browser to make sense of your work (okay, printing works too, but one of my co-authors once tried to print out all my Mplus outputs, deforesting a substantial part of Honduras in the process).
To make it easier, I’ve uploaded a starter RStudio project that you can fork on Github to start out with a configuration that worked well for me in the past. I’ve tried to paper over some of the rough edges that this stack still has and I added some initial structure and code, so you can adapt it.
With these projects I document my entire research process using Rmarkdown (e.g. loading & processing the raw data, wrangling it into the right shape, analysing it, making graphs).
But instead of sharing the raw scripts (which only make sense when run interactively with the data), I create a website where readers see the code together with the resulting graphs and other output.
I use the reports generated like this to make sense of my own results and to share extensive results with my co-authors. Some friends even write their complete manuscripts using Scholarly Markdown and Papaja, but for me this is more about not losing all the interesting details that can’t make it into the manuscript.
Here’s two recent projects where I’ve used this stack (or an earlier version):
- https://rubenarslan.github.io/paternal_age_fitness/ – the online supplement for this manuscript. This one is the largest I’ve made so far (~90 pages, >7000 graphs) and involved a complex setup including models run on a cluster, but documented offline. In this project I needed to repeat the same commands for many different samples, so I had to learn to compartmentalise my Markdown into components (more on that on another day).
- https://rubenarslan.github.io/generation_scotland_pedigree_gcta/ – the online supplement for this manuscript. This one is much simpler, just a single page with a lot of tabs. Here we presented a few models from our model selection procedure in the manuscript, but we wanted to show what results other choices would have produced and how different components influenced one another.
The stack should also work with some modifications for Shiny apps.
So, go fork and reproduce: https://github.com/rubenarslan/repro_web_stack/
(There’s some more boring instructions on the Github page, but it’s really simpler than all those software names make it sound).
PS.: If this isn’t nerdy enough for you, maybe Jon Zelner’s blog post series will be for you: he uses Makefiles, Docker and continuous integration like a proper programmer would.