A Data Steward Walks into a Bar: Experiences managing open and restricted brain imaging data

Laura K. Waite, Alexander Q. Waite, Michael Hanke

Presenting author: Laura K. Waite

Introduction
There is a push to develop international standards and tools to facilitate data sharing and management (Wiener et al., 2016), spurred by the growing effort to make research data more findable, accessible, interoperable, and reusable (FAIR) (Wilkinson et al., 2016) and by the enactment of regulations such as the GDPR (Regulation (EU) 2016/679, 2016). A variety of solutions now exist, but three factors together raise the barrier to responsible and effective data management: the difficulty of managing large-scale data, the ever-evolving constraints and requirements that come with sensitive data, and the high variability in how data providers elect to share their data. The technical skill required for these tasks is often high, and managing the bespoke and frequently changing requirements imposed by data providers is time-consuming. Having worked with numerous large-scale datasets from 15 different data providers as a data steward for an institute of 80-100 researchers, I present the research data management approach that we have taken and discuss how we have addressed some of these challenges.

Methods
Our institute uses DataLad (Halchenko et al., 2021) to manage its data. DataLad allows us to flexibly track and structure data, facilitate reproducibility, and support collaboration. All centrally controlled datasets are managed with DataLad and consumed by users on the institute’s mid-sized high-throughput computing (HTC) cluster. Content is version controlled, so we can track and integrate upstream additions, redactions, and fixes, as well as the clean-ups and transformations (e.g., conversion to BIDS) that we apply ourselves. Download protocols vary across data portals and can be difficult to decipher. We normalize the download process as much as possible by developing and using DataLad extensions, such as datalad-ukbiobank (Hanke et al., 2021) and datalad-xnat (Halchenko et al., 2021). Additionally, DataLad lets us structure data into modular units, which helps us manage storage demands and control permissions and access on a per-dataset basis.
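To illustrate how modular datasets keep storage demands in check, the following minimal sketch lists the DataLad command-line calls a user might run to obtain a single file. It only builds and prints the commands and does not execute them. The URL, the target directory, and all dataset and file paths are hypothetical placeholders, not the institute's actual layout.

```python
import shlex


def consume_plan(super_url, subdataset, file_path, target="institute-data"):
    """Return, in order, the datalad CLI calls for fetching one file.

    Every URL and path passed in is a hypothetical placeholder.
    """
    sub_path = f"{target}/{subdataset}"
    return [
        # Clone the lightweight superdataset; no file content is transferred.
        ["datalad", "clone", super_url, target],
        # Install a single subdataset without retrieving its file content.
        ["datalad", "get", "-n", "-d", target, sub_path],
        # Retrieve the content of one file on demand.
        ["datalad", "get", "-d", target, f"{sub_path}/{file_path}"],
    ]


plan = consume_plan(
    "https://gitlab.example.org/institute/data.git",  # hypothetical URL
    "original/studyA",                                # hypothetical subdataset
    "sub-01/anat/sub-01_T1w.nii.gz",                  # hypothetical file
)
for cmd in plan:
    print(shlex.join(cmd))
```

Because each step touches only the units it names, a user never has to hold more data locally than the analysis at hand requires.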

Results
Using DataLad, we built an institute superdataset that serves as the main entry point to all data holdings centrally managed for the institute. It comprises four main collections: original data (data as acquired from upstream), processed data (derivative data resulting from on-site (pre)processing), archived data (projects archived according to institute procedures), and containers (computational pipelines set up as DataLad container datasets). This superdataset is hosted on a GitLab instance, which improves discoverability, coordination, and issue tracking. Only the top-level collections are exposed there. The actual content and file information are stored on the institute’s bulk storage infrastructure, where data are password protected and access is limited in line with the various Data Use Agreements. User credentials are preseeded on the infrastructure, making authentication transparent. Through the superdataset, users can seamlessly obtain objects in any collection or dataset they are permitted to access.
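The four-collection layout described above could be set up along the following lines. This is a sketch under stated assumptions rather than the institute's actual procedure: the directory names are hypothetical stand-ins for the four collections, and the code only assembles the `datalad create` calls without running them.

```python
import shlex

# Hypothetical directory names for the four collections named in the text.
COLLECTIONS = ("original", "processed", "archived", "containers")


def layout_plan(root="institute"):
    """Return the datalad CLI calls that would create the nested layout."""
    # Create the superdataset that acts as the single entry point.
    commands = [["datalad", "create", root]]
    for name in COLLECTIONS:
        # Create each collection as a subdataset registered in the superdataset.
        commands.append(["datalad", "create", "-d", root, f"{root}/{name}"])
    return commands


layout = layout_plan()
for cmd in layout:
    print(shlex.join(cmd))
```

Keeping each collection as its own dataset means that permissions can be set per collection (or per dataset nested inside it) on the storage side, while the superdataset itself stays small enough to host on a service such as GitLab.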

Conclusions
Our approach successfully addresses many of the research data management hurdles that users face. However, much of this success is only possible because our institute has the resources to dedicate someone as a data steward. A few improvements could help put this within reach of everyone: (1) data providers can lower the effort required of data stewards by providing data structured according to international standards (e.g., BIDS), using version control to provide provenance-tracked data, and reducing and simplifying the overhead required to gain access to and download data; (2) data stewards can share their work and build solutions collaboratively. Continued efforts by data providers to promote the FAIR Data Principles while reducing the friction and effort for data consumers will improve the potential for more responsible and effective data stewardship.

References

Halchenko, Y. O., Meyer, K., Poldrack, B., Solanky, D. S., Wagner, A. S., Gors, J., MacFarlane, D., Pustina, D., Sochat, V., Ghosh, S. S., Mönch, C., Markiewicz, C. J., Waite, L., Shlyakhter, I., Vega, A. de la, Hayashi, S., Häusler, C. O., Poline, J.-B., Kadelka, T., … Hanke, M. (2021). DataLad: Distributed system for joint management of code, data, and their relationship. Journal of Open Source Software, 6(63), 3262. https://doi.org/10.21105/joss.03262

Halchenko, Y., Poldrack, B., Wagner, A., Waite, L., Hanke, M., Wodder, J. T., Raina, J., Heunis, S., Tsankeu, O., Szczepanik, M., Waite, A., & Portoles, O. (2021). DataLad XNAT extension (Version 0.2) [Computer software]. https://zenodo.org/record/5541844

Hanke, M., Waite, L. K., Poline, J.-B., & Hutton, A. (2021). DataLad UK Biobank extension (Version 0.3.3) [Computer software]. https://zenodo.org/record/4773629

Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), (2016) Official Journal L119, p. 1.

Wiener, M., Sommer, F. T., Ives, Z. G., Poldrack, R. A., & Litt, B. (2016). Enabling an Open Data Ecosystem for the Neurosciences. Neuron, 92(3), 617–621. https://doi.org/10.1016/j.neuron.2016.10.037

Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18