Metalad, a flexible Metadata system
Researchers have long recognized the necessity and the advantages of Research Data Management (RDM). To further improve the use of research data, the FAIR guiding principles for scientific data management were established. They aim to improve the Findability, Accessibility, Interoperability, and Reuse (hence FAIR) of digital assets.
Although being FAIR is a highly desirable property for research data, it is not easily implemented. To make research data FAIR, a multitude of different properties of the research data have to be captured or extracted and stored as metadata (i.e. data about data, here: data about research data). Depending on the nature of the research data, the respective metadata might vary widely in format, size, and the processes necessary to generate it. In fact, there is a multitude of metadata standards for a multitude of applications. Those standards are geared toward the nature of the primary data as well as toward the operations that they support. Given the large number of possible metadata formats and operations, it is a challenge to implement a system that supports this wide range of use cases.
For example, findability requires very different indexing methods depending on the research data itself, as well as on the search mechanisms that should be used. In addition, there might be other concerns that must be addressed, for example, privacy concerns with regard to research subject data. Individual, lab-specific requirements might further increase the number of metadata objects and metadata formats that are used. Add the wish to publish different metadata under different circumstances for different audiences, and you are faced with an even larger number of processes and formats.
We have developed Metalad to cope with the large number of vastly different requirements, the large variety of research data types, the multitude of metadata standards, and the large number of desired processes. Metalad is an extension to Datalad [datalad], a transparent data management system that works with every file-based data representation.
Metalad creates an overlay structure that is independent of, and transparent to, the data stored in a Datalad dataset. This overlay structure stores metadata. It allows the association of arbitrarily many metadata records with every dataset object. A dataset object is either a Datalad dataset or a file in a Datalad dataset. Metalad is agnostic to the metadata format, i.e. it can store metadata in any format. Stored metadata can be retrieved given a dataset object and a metadata format name. Metadata can be provided from any source, for example, from the output of an indexing process, from manually created files, or from publicly available data. Metalad also supports the automated extraction of metadata from objects in Datalad datasets. The automated extraction can be performed in different ways, either by executing Datalad-provided extractors or filters, or by executing external programs. Metadata can be transported independently of the primary data. Metadata can be filtered to provide a limited view, for example, to create a public search index for a dataset that does not contain confidential data. Metalad supports filter operations on metadata, i.e. processes that read one or more metadata records and create new metadata records based on them. Filters do not require the original data.
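The data model described above can be illustrated with a minimal in-memory sketch. It is not the Metalad API: the names DatasetObject, MetadataStore, apply_filter, and drop_confidential, the format names, and the example record are all hypothetical and chosen only for illustration. The sketch shows records keyed by dataset object and format name, and a filter that derives a public view from existing records without touching the primary data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; this is a toy model, not the Metalad API.


@dataclass(frozen=True)
class DatasetObject:
    dataset_id: str  # identifies the dataset
    path: str = ""   # empty path: the dataset itself; otherwise a file in it


@dataclass
class MetadataStore:
    # Keyed by (dataset object, format name); each key holds any number of records.
    _records: Dict[Tuple[DatasetObject, str], List[dict]] = field(default_factory=dict)

    def add(self, obj: DatasetObject, format_name: str, record: dict) -> None:
        self._records.setdefault((obj, format_name), []).append(record)

    def get(self, obj: DatasetObject, format_name: str) -> List[dict]:
        # Return a copy of the list so callers cannot modify the stored list.
        return list(self._records.get((obj, format_name), []))


def apply_filter(
    store: MetadataStore,
    obj: DatasetObject,
    source_format: str,
    target_format: str,
    func: Callable[[dict], dict],
) -> None:
    """Read records in source_format and store derived records in target_format."""
    for record in store.get(obj, source_format):
        store.add(obj, target_format, func(record))


def drop_confidential(record: dict) -> dict:
    # Assumed convention for this example: "subject_name" is the confidential key.
    return {k: v for k, v in record.items() if k != "subject_name"}


store = MetadataStore()
scan = DatasetObject("study-1", "sub-01/scan.nii")
store.add(scan, "lab_internal", {"modality": "MRI", "subject_name": "Jane Doe"})

# Derive a public view; only metadata is read, never the file itself.
apply_filter(store, scan, "lab_internal", "public_index", drop_confidential)
public_records = store.get(scan, "public_index")
```

The internal record stays in the store under its own format name, while the derived record is stored as a separate record under a different format name. This mirrors the idea that filters create new metadata records rather than modifying existing ones.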
In short: Metalad stores arbitrary metadata records. Metadata can be processed, extended, distributed, exported, or imported independently of the original data. It can be processed in different ways, either by Datalad extractors and filters or by external programs.
The following are just a few examples of operations that Metalad supports and of properties that Metalad provides:
- Data anonymization (e.g. k-anonymization)
- Search-index creation (e.g. Google Dataset Search entry creation)
- Result reproducibility (e.g. re-executing processing steps in new environments)
- Extensibility (e.g. processing of new metadata formats as they emerge)
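As an illustration of the anonymization item in the list above, the following sketch checks whether a set of metadata records is k-anonymous, i.e. whether every combination of quasi-identifier values occurs in at least k records. It does not show how Metalad implements anonymization: the function name is_k_anonymous, the field names, and the records are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def is_k_anonymous(
    records: Iterable[Mapping[str, object]],
    quasi_identifiers: Sequence[str],
    k: int,
) -> bool:
    """True if every quasi-identifier value combination occurs in at least k records."""
    counts = Counter(
        tuple(record.get(name) for name in quasi_identifiers) for record in records
    )
    return all(count >= k for count in counts.values())


# Hypothetical metadata records; "score" is not treated as a quasi-identifier.
records = [
    {"age_group": "20-29", "sex": "f", "score": 12},
    {"age_group": "20-29", "sex": "f", "score": 17},
    {"age_group": "30-39", "sex": "m", "score": 9},
]

ok_2 = is_k_anonymous(records, ["age_group", "sex"], k=2)  # one group has a single record
ok_1 = is_k_anonymous(records, ["age_group", "sex"], k=1)
```

A check of this kind reads only metadata records, so it could run as a filter without access to the original data.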
Metalad is a platform that allows the integration of all metadata-related processing under one roof. This platform is extensible and future-ready.
-----
[datalad] https://www.datalad.org