Provenance: how and why to track it

Hi, everyone!

Today I’m excited to start a discussion on the provenance of genetic data, which is essential for trustworthy and meaningful research.

Data quality is a challenge we all face. To ensure it, we first need detailed knowledge of where the data comes from, whether we’re merging data from many studies or working with a variety of genetic samples. I will list some questions we should ask ourselves and some available resources, but please comment with any other tools, resources, or discussions you may have. Let’s keep the discussion going!

Provenance refers to “the description of the origins of a piece of data and the process by which it arrived in a database” (Why and where: A characterization of data provenance) and encompasses the entire lifecycle of data, including sample collection, processing, and analysis steps.

Some questions we can ask ourselves are:

• What strategies should be employed to track and document the provenance of genetic data, including sample collection, processing, and analysis steps?

• How can we integrate genetic data from multiple sources while maintaining traceability and transparency?

• Are there tools or methods for automating the capture of data provenance and metadata in genetic analyses?

• How can we ensure that the data used in our analyses is reliable and well documented when dealing with large-scale genetic studies?
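As one possible starting point for the automation question above, here is a minimal Python sketch of a provenance record for a single processing step. It is only an illustration under my own assumptions, not an established tool or standard: the function names (`sha256_of`, `provenance_record`), the record fields, the file labels, and the `min_quality` parameter are all hypothetical. The idea is simply to store checksums of inputs and outputs, the parameters used, and a timestamp, so a result can later be traced back to the exact data it came from.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(step: str, inputs: dict, outputs: dict, params: dict) -> dict:
    """Build one provenance entry for a processing step.

    `inputs` and `outputs` map a label (for example a file name) to raw bytes.
    The field names below are illustrative, not taken from any standard.
    """
    return {
        "step": step,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "inputs": {name: sha256_of(data) for name, data in inputs.items()},
        "outputs": {name: sha256_of(data) for name, data in outputs.items()},
        "parameters": params,
    }


# Hypothetical example: a filtering step on a tiny in-memory variant table.
raw = b"chr1\t100\tA\tG\nchr1\t200\tC\tT\n"
filtered = b"chr1\t100\tA\tG\n"

record = provenance_record(
    step="filter_variants",
    inputs={"raw_variants.txt": raw},
    outputs={"filtered_variants.txt": filtered},
    params={"min_quality": 30},
)

# Serialize the record so it can be stored next to the output data.
print(json.dumps(record, indent=2))
```

In a real pipeline you would probably read the bytes from files and append each record to a log kept alongside the data. Workflow managers and formal provenance models cover far more than this, so treat it as a way to think about what needs to be captured rather than a recommendation.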

A recent article proposed a Common Provenance Model that is intended to later become an International Organization for Standardization (ISO) standard.

Also, here are some links if you are interested in specific models or workflows for provenance tracking.

So, can you share:

If you are part of a group generating data, how do you capture metadata and how do you document it? Are there any tools you use?

If you use existing data, how do you track its provenance and how do you include it in your workflow to ensure data quality?


Thank you, Paula, for sharing these resources about provenance and raising these important questions about metadata and provenance!

I recently attended a workshop series from the GO FAIR Foundation about metadata, particularly focusing on workflows that ensure human and machine readability of metadata. The GO FAIR website has a lot of useful guidance on making (meta)data findable, accessible, interoperable, and reusable. They suggest a workflow that heavily involves the CEDAR metadata center.

This article suggests tools and guiding principles for making metadata FAIR, and it points to the International Neuroinformatics Coordinating Facility (INCF)’s criteria checklist on the topic.

Another useful resource is the INCF’s taxonomy on open metadata.

I’d love to hear others’ experience with metadata and provenance in response to Paula’s question.


Thanks for sharing all these cool resources, Maya! I think a lot of this is also useful for @vdardov’s recent post about metadata.
