FAIRly big: A framework for reproducible processing of large-scale data - a UK Biobank showcase
Adina Wagner, Laura Waite, Małgorzata Wierzba, Felix Hoffstaedter, Alexander Q. Waite, Benjamin Poldrack, Simon B. Eickhoff & Michael Hanke
Presenting author:
Adina Wagner
Datasets such as the Human Connectome Project, the Adolescent Brain Cognitive Development Study, or the UK Biobank project contain heterogeneous data that scales to millions of files and hundreds of terabytes. Such large-scale data presents unique research opportunities, but also challenges. Storage and computational demands strain hardware capabilities. As large-scale computations are more difficult to reproduce, comprehend, and verify, the trustworthiness of derivative data decreases. Moreover, restrictive software licenses as well as analytical flexibility can impede transparent analysis provenance in the first place, and privacy considerations prohibit the open distribution of sensitive data. Yet, with large-scale datasets, sharing and reusing derivatives emerges as the most viable way to extend previous work. It minimizes duplicate efforts to perform resource-heavy computations with considerable environmental impact, and it can open up research on large data to scholars who lack access to adequate computational resources. In such contexts, data should thus not only be as findable, accessible, interoperable, and reusable (FAIR) as possible, but also handled in a sustainable manner that prioritizes data sharing and reuse. Here, we present a DataLad-based framework to reproducibly process large-scale datasets and to share results and provenance as openly as the responsible use of sensitive data or legal requirements permit.

We built a scalable processing framework using open source software for version control, containerization, and workload management, and workflows from distributed software development. To test its scalability and portability, we performed voxel-based morphometry (VBM) on N = 42,715 participants of the UK Biobank Imaging dataset using the Computational Anatomy Toolbox (CAT) on a mid-sized computational cluster and a modular supercomputer. We used the machine-actionable provenance generated by the framework for automatic recomputations on infrastructure as small as a personal laptop, and performed structured investigations of variability between recomputations. To demonstrate the framework’s potential for open and transparent sharing, we created a fully open showcase with open data and open source software.
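To illustrate the style of provenance capture the framework relies on, the following Python sketch shows how a containerized pipeline run can be recorded with DataLad's Python API. The dataset URL, container name, pipeline call, and file paths are hypothetical placeholders, and the datalad-container extension is assumed to be installed:

    import datalad.api as dl
    from datalad.api import Dataset

    # Obtain the analysis dataset (hypothetical URL); it tracks input data,
    # pipeline code, and the software container in its version history.
    dl.clone(source="https://example.org/ukb-vbm-analysis",
             path="ukb-vbm-analysis")
    ds = Dataset("ukb-vbm-analysis")

    # Execute the containerized pipeline for one participant. containers_run()
    # stores the command, container, inputs, and outputs as a machine-actionable
    # provenance record attached to the resulting commit.
    ds.containers_run(
        "cat_standalone.sh inputs/sub-0001/anat/T1w.nii.gz derivatives/cat/sub-0001",
        container_name="cat",                      # hypothetical registered container
        inputs=["inputs/sub-0001/anat/T1w.nii.gz"],
        outputs=["derivatives/cat/sub-0001"],
        message="Run CAT VBM for sub-0001",
    )

In the actual workflow, calls of this kind are wrapped in job scripts and dispatched per participant by the cluster's workload manager, so that every result carries its own provenance record.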

The framework was portable across systems and their respective workload managers, and recomputation of results based on provenance records succeeded on different hardware. In an investigation of result variability between recomputations, we found high congruence, with more than 50% of all output files being identical across computations. The remaining variability was largely attributable to minor numerical differences. We complemented the framework implementation with open tutorials and materials available at github.com/psychoinformatics-de/fairly-big-processing-workflow.
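As a sketch of what such a provenance-driven recomputation could look like on a consumer's machine (the dataset path and commit reference are hypothetical), DataLad's rerun functionality re-executes a recorded computation from its provenance record:

    from datalad.api import Dataset

    # Open a local clone of the shared result dataset
    ds = Dataset("ukb-vbm-results")

    # Re-execute the computation recorded in the provenance of a given commit;
    # DataLad retrieves the declared inputs, reruns the recorded command, and
    # saves any differing outputs as a new commit that can be compared to the
    # original result.
    ds.rerun(revision="HEAD~1")  # hypothetical commit reference

Because outputs of a rerun are version-controlled next to the originals, comparing recomputations across hardware reduces to inspecting the dataset's history.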

Among the framework’s main features are the ability to generate machine-actionable provenance records that allow for automatic recomputation of results on arbitrary compute infrastructure, and the ability to mitigate storage constraints (Figure 2). With this, the framework facilitates large-scale analyses, enables independent consumers to verify or reproduce processing results based on machine-actionable records of computational provenance, and empowers data providers to create transparent derivatives that can be readily shared with appropriate audiences. With the increasing necessity for traceable reproducibility of large-scale computations, and the potential of their reusability for scientific progress across disciplines, we believe that the framework can foster open and transparent data processing and sharing at the largest scale.