DataCat: Generate a user-friendly data browser from structured metadata using DataLad Catalog
Stephan Heunis, Christian Mönch, Benjamin Poldrack, and Michael Hanke.
Presenting author:
Stephan Heunis
Both machine- and human-readable metadata are essential ingredients for making our research data outputs Findable, Accessible, Interoperable and Reusable (FAIR). DataLad Catalog is a command-line tool that aids this process by ingesting machine-readable metadata from a DataLad dataset and generating a user-friendly, browser-based application with which to explore rich information about structured and linked datasets. Importantly, it puts this capability into the hands of any DataLad user, allowing decentralised metadata access and browsing, whereas current alternatives most likely depend on a centralised solution with access to the data.

DataLad's metadata capabilities (via DataLad-Metalad) support extracting structured metadata from DataLad datasets, subdatasets, files, and additional data types/formats (e.g. BIDS, YAML), as well as aggregating and exporting this information in a machine-readable format. These exported metadata objects can be fed into DataLad Catalog, a command-line interface with a Python API, which then translates the metadata into structured data rendered by the user interface.
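As a rough illustration of this translation step, the sketch below groups exported metadata records by dataset and version, keeping dataset-level and file-level entries apart. It is not DataLad Catalog's actual implementation: the function name `group_records`, the JSON Lines input, and the record fields (`type`, `dataset_id`, `dataset_version`, `extractor_name`, `extracted_metadata`, `path`) are assumptions made for the example.

```python
import json
from collections import defaultdict

# Hypothetical exported records, one JSON object per line (field names assumed).
EXAMPLE_JSONL = "\n".join([
    json.dumps({
        "type": "dataset",
        "dataset_id": "abc",
        "dataset_version": "v1",
        "extractor_name": "metalad_core",
        "extracted_metadata": {"name": "demo"},
    }),
    json.dumps({
        "type": "file",
        "dataset_id": "abc",
        "dataset_version": "v1",
        "path": "sub-01/anat.nii.gz",
        "extractor_name": "metalad_core",
        "extracted_metadata": {"contentbytesize": 1024},
    }),
])


def group_records(jsonl_text):
    """Group metadata records by (dataset_id, dataset_version)."""
    grouped = defaultdict(lambda: {"dataset": {}, "files": []})
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        key = (record["dataset_id"], record["dataset_version"])
        if record["type"] == "dataset":
            # Keep dataset-level metadata separate for each extractor.
            grouped[key]["dataset"][record["extractor_name"]] = record["extracted_metadata"]
        elif record["type"] == "file":
            grouped[key]["files"].append(
                {"path": record["path"], "metadata": record["extracted_metadata"]}
            )
    return dict(grouped)


catalog_entries = group_records(EXAMPLE_JSONL)
```

Keying the result on both identifier and version mirrors the idea that a catalog can hold several versions of the same dataset side by side.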

DataLad Catalog tailors the output data based on the particular metadata extractor used, and creates a file system structure for the catalog that allows seamless addition and removal of (versions of) a full dataset and its file tree. Importantly, these operations are done on DataLad datasets and extracted metadata, without requiring direct access to the data itself. These features make DataLad Catalog highly suitable for distributed contribution and maintenance of data catalogs.
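To make the idea of a version-aware file system structure concrete, here is a minimal sketch in which each dataset version gets its own directory, so that adding or removing a version touches only that directory. The layout (`metadata/<dataset_id>/<version>/dataset.json`) and the helper names are hypothetical and chosen for illustration; they are not taken from DataLad Catalog itself.

```python
import json
import shutil
from pathlib import Path


def version_dir(catalog_root, dataset_id, version):
    """Return the (assumed) directory holding one dataset version."""
    return Path(catalog_root) / "metadata" / dataset_id / version


def add_version(catalog_root, dataset_id, version, metadata):
    """Write the metadata of one dataset version into the catalog tree."""
    target = version_dir(catalog_root, dataset_id, version)
    target.mkdir(parents=True, exist_ok=True)
    (target / "dataset.json").write_text(json.dumps(metadata, indent=2))
    return target


def remove_version(catalog_root, dataset_id, version):
    """Delete one dataset version; drop the dataset directory if it is now empty."""
    target = version_dir(catalog_root, dataset_id, version)
    if target.exists():
        shutil.rmtree(target)
    parent = target.parent
    if parent.exists() and not any(parent.iterdir()):
        parent.rmdir()


def list_versions(catalog_root, dataset_id):
    """List the versions currently stored for a dataset."""
    dataset_dir = Path(catalog_root) / "metadata" / dataset_id
    if not dataset_dir.exists():
        return []
    return sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
```

Because only metadata files are written, none of these operations needs access to the underlying data, which is the property the paragraph above relies on for distributed maintenance.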

DataLad Catalog also generates all required assets of the user interface, which is based on VueJS (https://vuejs.org/) and BootstrapVue (https://bootstrap-vue.org/). The resulting catalog can be hosted as a standalone website or added to an existing website without requiring any infrastructure-specific build steps. The only requirement is a basic web server (e.g. GitHub Pages, https://pages.github.com/).
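Since a generated catalog consists only of static assets, any static file server should be able to host it. As a minimal sketch for local previewing, the snippet below serves a directory with Python's standard library alone; the helper name `make_static_server` and the directory name `my_catalog` are placeholders assumed for the example, not part of DataLad Catalog.

```python
import functools
import http.server


def make_static_server(directory, port=8000, host="127.0.0.1"):
    """Create an HTTP server that serves the files under `directory` as-is."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Example usage (blocks until interrupted with Ctrl+C):
#   server = make_static_server("my_catalog")
#   server.serve_forever()
```

No build step or server-side logic is involved, which is why the same directory could equally be published through a static hosting service.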

The browser interface supports: navigating directly to a globally unique dataset location, viewing dataset-specific metadata (e.g. id, version, description, authors, DOI), downloading the data with DataLad, exploring the data with Binder, browsing and filtering subdatasets, navigating the dataset-specific directory/file tree, and viewing related publications and funding information.

DataLad Catalog itself is a DataLad extension package, and the open-source code base can be accessed and contributed to at: https://github.com/datalad/datalad-catalog.

In summary: DataLad Catalog is an open-source Python package with a command line interface that allows seamless translation of DataLad-generated metadata into a user-friendly data browser, and brings the powerful functionality of decentralised metadata handling and data publishing into the hands of users.

References

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
Yaroslav O. Halchenko, Kyle Meyer, Benjamin Poldrack, et al. DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software, 6(63), 3262 (2021). https://doi.org/10.21105/joss.03262
Metalad, 2021. https://github.com/datalad/datalad-metalad