Provenance: how and why to track it

paularp · August 10, 2023, 1:18am

Hi, everyone!

Today I’m excited to start a discussion on provenance of genetic data, because it is essential for trustworthy and significant research.

Data quality is a challenge we all face. To ensure it, we must first know in detail where the data comes from, whether we’re merging data from many studies or working with a variety of genetic samples. I will list some questions we should ask ourselves and some available resources, but please comment with any other tools, resources, or discussions you may have. Let’s keep the discussion going!

Provenance refers to “the description of the origins of a piece of data and the process by which it arrived in a database” (Why and where: A characterization of data provenance) and encompasses the entire lifecycle of data, including sample collection, processing, and analysis steps.

Some questions we can ask ourselves are:

• What strategies should be employed to track and document the provenance of genetic data, including sample collection, processing, and analysis steps?

• How can we integrate genetic data from multiple sources while maintaining traceability and transparency?

• Are there tools or methods for automating the capture of data provenance and metadata in genetic analyses?

• How can we ensure that the data used in our analysis is reliable and well-documented, when dealing with large-scale genetic studies?
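As a starting point for the automation question above, here is a minimal sketch in Python of what capturing provenance alongside an analysis could look like. It is not taken from any of the resources in this thread and does not follow a particular standard: the `ProvenanceRecord` class, its field names, and the `start_record` helper are all hypothetical, chosen only for illustration. The idea is to fix a checksum of the input file at intake and then log each processing or analysis step with a timestamp.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class ProvenanceRecord:
    # Hypothetical field names, not drawn from any provenance standard.
    source: str      # where the data came from (study, biobank, repository)
    file_path: str   # location of the file when the record was created
    checksum: str    # SHA-256 of the file at intake
    steps: list = field(default_factory=list)

    def add_step(self, name, tool, parameters=None):
        """Append one processing or analysis step with a UTC timestamp."""
        self.steps.append({
            "name": name,
            "tool": tool,
            "parameters": parameters or {},
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self):
        """Serialize the record so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2)


def start_record(path, source):
    """Create a record for an input file, fixing its checksum at intake."""
    return ProvenanceRecord(
        source=source,
        file_path=str(path),
        checksum=sha256_of_file(path),
    )
```

A record like this could be written out as JSON next to each dataset, so that anyone reusing the data can recompute the checksum and see which steps were applied. Dedicated workflow managers and provenance models will capture much more than this; the sketch only shows the basic shape of the information.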

A recent article proposed a Common Provenance Model that aims to eventually become an International Organization for Standardization (ISO) standard.

Also, here are some links if you are interested in specific models or workflows for provenance tracking.

So, can you share:

If you are part of a group generating data, how do you capture metadata and how do you document it? Are there any tools you use?

If you employ available data, how do you track provenance and how do you include it in your workflow to ensure data quality?

maya.sanghvi · August 16, 2023, 6:09pm

Thank you, Paula, for sharing these resources about provenance and raising these important questions about metadata and provenance!

I recently attended a workshop series from the GO FAIR foundation about metadata, particularly focusing on workflows that ensure human and machine readability of metadata. The GO FAIR website has a lot of useful guidance on making (meta)data findable, accessible, interoperable, and reusable. They suggest a workflow that heavily involves the CEDAR metadata center.

This article suggests tools and guiding principles for making metadata FAIR, and points to the International Neuroinformatics Coordinating Facility (INCF)’s criteria checklist on the topic.

Another useful resource is the INCF’s taxonomy on open metadata.

I’d love to hear others’ experience with metadata and provenance in response to Paula’s question.

paularp · August 21, 2023, 10:41pm

Thanks for sharing all these cool resources, Maya! I think a lot of this is also useful for @vdardov’s recent post about metadata.
