Detecting sample swaps as a QC step in -omics datasets

Sample swaps, labelling errors, and misannotations can be a huge problem for large cohort studies. The expected number of swaps per study varies, but it has been estimated to be as high as 3% in genomics datasets, and the problem compounds as cohorts grow. I’m looking at you, genomics, metabolomics, transcriptomics, and proteomics!

We have some datasets with multiple -omics types from the same patient (e.g., WGS and RNA-seq data), and we have identified some sample swaps in our data using the CrosscheckFingerprints tool in GATK, which uses a set of SNPs grouped into linkage disequilibrium blocks and correlates the genotypes within those blocks to estimate sample relatedness. In the corresponding paper, the authors estimated that 1% of ENCODE had substantive mislabeling errors. The tool takes SAM/BAM or VCF files as input, so it assumes you have some version of read-based data.
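For anyone who wants to try it, here’s a minimal sketch of how an invocation can look (via the GATK4 wrapper; the BAM paths and the haplotype map file below are placeholders, and haplotype maps for common reference builds are distributed with the Picard/GATK resources):

```python
import subprocess

# Hedged sketch: fingerprint-check a WGS BAM against an RNA-seq BAM that are
# nominally from the same donor. All file paths here are placeholders.
subprocess.run(
    [
        "gatk", "CrosscheckFingerprints",
        "--INPUT", "donor1_wgs.bam",            # first input (SAM/BAM or VCF)
        "--SECOND_INPUT", "donor1_rnaseq.bam",  # compared against the first
        "--HAPLOTYPE_MAP", "hg38_haplotype_map.txt",  # LD-block SNP definitions
        "--CROSSCHECK_BY", "SAMPLE",            # aggregate fingerprints per sample
        "--OUTPUT", "donor1_crosscheck_metrics.txt",
    ],
    check=True,
)
```

In the output metrics, a strongly positive LOD score means the two fingerprints likely come from the same individual; a strongly negative LOD is the red flag for a swap.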

One thing of note: we used a different method as a QC step for the transcriptomics data in AMP PD, to check it against the WGS. That analysis is documented in Terra, so a similar QC step is incorporated into the release of that dataset.

What are your thoughts? Does this issue keep you up at night :scream: or do you trust others to address it when they release their data? Do you have other methods for detecting sample swaps that would be useful? Are there approaches for other -omics types, like proteomics, microbiome, or other data? (Anyone, or @vdardov @fbbriggs @hirotaka @mcbrumm @psaffie @peixott @mdeleeuw @mattk)

4 Likes

We have encountered some mislabeling in LARGE-PD, which is expected when working with hundreds or thousands of samples. To catch these problems, we follow a mini-protocol built around demographic information.

When genetic samples are sent to LARGE-PD, we request a table containing key information (sex, age, status, volume, concentration, etc.).

In addition to this table, we also require that every sample come from a participant enrolled in the LARGE-PD study before it is genotyped, which includes an interview with responses entered into REDCap.

We use TaqMan assays to determine the genetic sex of the samples and compare it with the information in the table and REDCap. We also cross-check REDCap with the table (sex, age, status, etc.). If any inconsistencies are found, we contact the cohort to verify the accuracy of the information.

While this system is not perfect (a swap between two people of the same sex would not be flagged by the TaqMan check), it’s our current solution to the problem.
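For what it’s worth, here is a minimal sketch of what that three-way consistency check can look like, assuming hypothetical CSV exports that share a `sample_id` column (all file and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical exports of the three sources of truth.
manifest = pd.read_csv("sample_manifest.csv")  # shipping table: sex, age, status, ...
redcap = pd.read_csv("redcap_export.csv")      # interview responses from REDCap
taqman = pd.read_csv("taqman_results.csv")     # assay-derived genetic sex

merged = (
    manifest
    .merge(redcap, on="sample_id", suffixes=("_manifest", "_redcap"))
    .merge(taqman, on="sample_id")
)

# Flag any sample where the three sources of sex information disagree.
flagged = merged[
    (merged["sex_manifest"] != merged["sex_redcap"])
    | (merged["sex_redcap"] != merged["genetic_sex"])
]
print(flagged[["sample_id", "sex_manifest", "sex_redcap", "genetic_sex"]])
```

Any flagged sample goes back to the cohort for verification, as described above.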

Fun fact: @paularp can share a case where one person was enrolled twice, which we discovered through genetic QC.

2 Likes

Hi @ehutchins:

For my PhD project, I’ve encountered significant challenges, starting with the DNA extraction itself, which had issues severe enough that the protocol had to be fixed before anything else; it couldn’t be overlooked. However, I understand that your question is more focused on sample swaps, labeling errors, and misannotations, which are critical risks for large cohort studies. That said, even after resolving the extraction problems and achieving the minimum DNA quantity and quality for sequencing, 23.5% of the 285 PD case samples still failed QC. Here’s the breakdown of the reasons for failure:

This is a really high (and quite disheartening) failure rate. As a solution, I’ve been reflecting on the importance of improving training at the local centers before samples are shipped to LARGE-PD.

I’m not sure if this fully answers your question, @ehutchins, but as part of the training and mentorship Task Force, I think this issue needs focused attention. In Chile, we’ve struggled with small but impactful errors, and they make a big difference in aggregate. Reinforcing training on sample handling and providing more context to the local teams could really help. In a way, this is more of a “mea culpa” from my side.

3 Likes

It certainly wasn’t ideal, but at least it served as proof that both the QC and the sample labeling were right, haha

2 Likes

Thank you for your response! That sounds like a great approach. I’m glad you mentioned using demographic data.

I have used transcriptomics and genomics data to determine the expected genetic sex of samples as well. It is definitely a useful tool and has caught swaps in the past, but it is far from perfect: a swap between two samples of the same genetic sex will still match.
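As a rough illustration of the RNA-seq version of that check, one can compare XIST expression against a few chrY-linked genes. The file names and the simple “whichever signal dominates” rule below are assumptions for the sketch; real thresholds should be calibrated on your own data:

```python
import pandas as pd

# Hypothetical inputs; both file names are placeholders.
expr = pd.read_csv("tpm_matrix.csv", index_col=0)              # genes x samples TPM
meta = pd.read_csv("sample_metadata.csv", index_col="sample_id")

xist = expr.loc["XIST"]                             # high in XX samples
chry = expr.loc[["RPS4Y1", "DDX3Y", "UTY"]].mean()  # expressed in XY samples

# Naive rule: whichever signal dominates wins; borderline samples need review.
inferred_sex = pd.Series(
    ["XX" if x > y else "XY" for x, y in zip(xist, chry)],
    index=expr.columns,
    name="inferred_sex",
)

# Samples whose inferred sex disagrees with the recorded sex are swap candidates.
reported = meta["reported_sex"].reindex(inferred_sex.index)
mismatches = inferred_sex[inferred_sex != reported]
print(mismatches)
```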

Also, I deal with sparse data, and I’ve found this method more difficult to apply to sparse datasets and/or extracellular RNA, where the informative sex-linked genes may be barely covered.

Depending on the project and the data available, I’ve used a combination of these approaches as part of the initial QC: matching against expected genetic demographics such as sex, cross-referencing genotypes against the VCFs, etc.
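And here’s a sketch of the VCF cross-referencing piece: simple genotype concordance at shared sites between a WGS call set and an RNA-seq call set from the same nominal donor. It assumes single-sample VCFs and uses pysam; the file names are placeholders, and for real WGS you’d restrict to a panel of common SNPs rather than indexing every site as done here:

```python
import pysam

# Placeholder paths: single-sample VCFs for one nominal donor.
wgs = pysam.VariantFile("donor1_wgs.vcf.gz")
rna = pysam.VariantFile("donor1_rnaseq.vcf.gz")
wgs_sample = list(wgs.header.samples)[0]
rna_sample = list(rna.header.samples)[0]

def gt_at(rec, sample):
    """Return the called genotype as a sorted tuple of allele strings, or None."""
    alleles = rec.samples[sample].alleles  # e.g. ('A', 'G'); contains None if uncalled
    if alleles is None or None in alleles:
        return None
    return tuple(sorted(alleles))

# Index the WGS genotypes by genomic position.
wgs_gts = {}
for rec in wgs:
    gt = gt_at(rec, wgs_sample)
    if gt is not None:
        wgs_gts[(rec.chrom, rec.pos)] = gt

# Compare RNA-seq calls at sites present in both call sets.
shared = concordant = 0
for rec in rna:
    gt = gt_at(rec, rna_sample)
    ref_gt = wgs_gts.get((rec.chrom, rec.pos))
    if gt is None or ref_gt is None:
        continue
    shared += 1
    concordant += gt == ref_gt

if shared:
    print(f"{concordant}/{shared} shared sites concordant ({concordant / shared:.1%})")
else:
    print("no shared sites found")
```

A genuine match should show high concordance across the shared sites, while a swap drops it toward what you’d expect between unrelated individuals.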

1 Like

This is so, so important! Thank you for giving a real-world example and being honest about your data. I agree that training and careful sample handling can do a lot on the lab side to reduce the frequency of these errors.

And on the informatics side, knowing that mistakes sometimes happen because we’re all human, we can incorporate QC tools into our study designs and analyses to catch as many as possible.

More philosophically, I think having ongoing discussions from the wet lab side to the dry lab side and everywhere in between helps bring attention to these issues. My hope is that we can all work together to address them, especially in large cohort studies, with a mix of design, training, sample handling, and good QC practices.

1 Like

Absolutely agree! These ‘Translational Gaps’ are exactly what @peixott and I have been discussing. Fostering ongoing dialogue among all ‘actors’ in the research journey—from bench to bedside—has the potential to truly transform the way we approach science. Sharing these pitfalls within the Data Community Innovators can be foundational for others. I personally hope we can tackle these challenges together!

2 Likes