Hello everyone,
As I mentioned in the previous post, today I will be less philosophical and more practical. We will focus on the most important step in genetic analysis: Quality Control (QC). In this post, I will discuss QC in genotyping data (sequencing data involves different QCs, which could be the topic of another post if there’s interest).
QC is the most crucial step in genetic analysis because an error at this stage can be catastrophic. Imagine a hypothetical scenario where you find a variant associated with your phenotype, showing a great Odds Ratio and an incredible p-value. You excitedly share this with your collaborators, only to discover that it’s a spurious result caused by a mistake in your QC protocol. So let’s dive into this post full of professional figures.
All the steps explained in this tutorial are based on the LARGE-PD QC pipeline, publicly available on our GitHub. In this text I will use the word "variant" rather than "SNP", because SNPs are only one subset of variants. Figure 1 shows some types of genetic variation.
Figure 1: Super professional figure. Here we represent some types of genetic variation, such as SNP (Single Nucleotide Polymorphism), CNV (Copy Number Variation), and Indel. Other types of variation were omitted for visual clarity.
Genome Reference
The Human Genome Project was an international scientific research initiative that began in 1990 and was completed in 2003, involving several research groups and companies. You can learn more about the project on the NIH timeline (https://www.genome.gov/human-genome-project/timeline). One of its major achievements was the creation of a complete human genome reference. Since then, several updated versions have been released: hg16 (NCBI Build 34), hg17 (NCBI Build 35), hg18 (NCBI Build 36.1), hg19 (GRCh37), and hg38 (GRCh38). More recently, the T2T-CHM13 (also called hs1) assembly was published. The main difference between these releases is that newer ones have fewer sequencing gaps than the previous ones.
One major challenge with the genome reference is that not every dataset or tool uses the same build. Some tools are based on hg38, while others rely on hg19. So, when working with a genetic dataset, the first questions you should ask are: "Which genome reference is my data aligned to?" and "Which human genome reference does my analysis require?"
The most common tool for converting between builds is LiftOver. According to the LiftOver How To (http://genomewiki.cse.ucsc.edu/index.php/LiftOver_Howto), “Creating a LiftOver file is very similar to a whole-genome alignment. A LiftOver file is a chain file, where, for each region in the genome, the alignments of the best/longest syntenic regions are used to translate features from one version of a genome to another.” Because of this, when using LiftOver to change the build, sometimes variants may switch chromosomes.
The Neurobooster Array, used by GP2, was originally built on hg19. Illumina used LiftOver to convert it to hg38, and when we compared two annotation tables from Illumina (one in hg19 and one in hg38), our team at Mata Lab identified around 5,000 variants that had changed chromosomes. Therefore, your first QC step should be to ensure that (i) your samples are aligned with the correct human genome reference for your analysis, and (ii) if you used LiftOver, check whether the variants remain on the same chromosome before and after the conversion.
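A quick sanity check after LiftOver is to compare the chromosome assigned to each variant in the two builds. Here is a minimal sketch, assuming you have extracted the variant ID and chromosome from the hg19 and hg38 annotation tables into two tab-separated files (the file names below are hypothetical):

```bash
# Flag variants whose chromosome changed between builds.
# Assumes two tab-separated files with columns: variant_id, chromosome.
awk 'NR==FNR {chr19[$1]=$2; next}
     ($1 in chr19) && chr19[$1] != $2 {print $1, chr19[$1], $2}' \
    variants_hg19.txt variants_hg38.txt > changed_chromosome.txt

# Variants listed here deserve a closer look (or removal)
wc -l changed_chromosome.txt
```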
Chromosome Zero
If you are performing genotype calling and exporting your data to PED/MAP using Illumina GenomeStudio or the Illumina Array Analysis Platform (IAAP), you may encounter some variants listed as being on "chromosome zero." Chromosome zero is how Illumina flags variants that do not pass its internal QC, which is based on the Illumina GenCall and GenTrain scores. These variants should simply be removed.
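Removing them is a one-liner in PLINK. A minimal sketch, assuming PED/MAP files with the hypothetical prefix "mydata":

```bash
# Drop everything Illumina placed on "chromosome zero" and convert to binary format
plink --file mydata --not-chr 0 --make-bed --out mydata_no_chr0
```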
Sex Discrepancy
If you have a covariate file and variants from the X chromosome in your dataset, you can perform a sex discrepancy analysis. This analysis helps identify potential sample swaps during the wet lab process.
The first step in this analysis is to separate the PAR (Pseudoautosomal Region) from chromosome X in your dataset. Afterward, you calculate the F-statistic for chromosome X using PLINK --check-sex. For non-admixed samples, you can use PLINK's default values (females with F < 0.2, males with F > 0.8). However, for admixed samples, it is recommended to use females with F < 0.4 and males with F > 0.9.
For association studies, if the genetic sex differs from the annotated sex, it’s best to remove the sample because other covariates could also be incorrect. For projects that do not involve covariates (e.g., historical or demographic studies), you can use the --impute-sex option to assign the genetic sex in your dataset.
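Putting those steps together, a minimal sketch might look like this (the "mydata" prefix is hypothetical; I assume hg38 coordinates, the admixed-friendly thresholds discussed above, and the association-study scenario where flagged samples are removed):

```bash
# 1) Split the pseudoautosomal regions off chromosome X
plink --bfile mydata --split-x hg38 no-fail --make-bed --out mydata_splitx

# 2) Compute the X-chromosome F statistic (females F < 0.4, males F > 0.9)
plink --bfile mydata_splitx --check-sex 0.4 0.9 --out sexcheck

# 3) Remove samples flagged as PROBLEM (column 5 of the .sexcheck report)
awk '$5 == "PROBLEM" {print $1, $2}' sexcheck.sexcheck > sex_discrepancy.txt
plink --bfile mydata_splitx --remove sex_discrepancy.txt --make-bed --out mydata_sexchecked
```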
Missing data
When discussing missing data, we can refer to either sample-level or variant-level missingness. A high rate of individual missing data (a sample with many non-genotyped variants) can be associated with issues in library preparation or poor DNA quality. On the other hand, a high rate of variant missing data may indicate problems with the probe, the array, or the genotyping machine.
Our recommendation is to first remove samples with high individual missing data, followed by variants with high missing data. The most common cutoff is 10% missing data, though some analyses use a more stringent 5% threshold.
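In PLINK this is two commands, run in that order. A minimal sketch with the common 10% cutoffs (hypothetical prefix "mydata"):

```bash
# Remove samples with >10% missing genotypes first...
plink --bfile mydata --mind 0.1 --make-bed --out mydata_mind
# ...then remove variants with >10% missing genotypes
plink --bfile mydata_mind --geno 0.1 --make-bed --out mydata_mind_geno
```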
Redundant variants
The Illumina genotyping array has a notable issue: it doesn’t always genotype variants on the “+” strand. This problem is compounded by the fact that Illumina doesn’t always use the human genome reference as the standard for genotyping. For example, several variants on the Neurobooster array use other references, like the GBA1 RefSeqGene, which is on the “-” strand. In short, you may not know whether your genotypes are on the “-” or “+” strand, yet many analyses assume that the variants are on the “+” strand.
We refer to variants with complementary reference and alternative alleles (A/T and C/G) as “redundant variants.” The challenge with these variants is that both possible nucleotides can exist on either the “+” or “-” strand. For example, in an A>T variant where the genotype is TT, it’s unclear whether the genotype is TT on the “+” strand or TT on the “-” strand (which corresponds to AA on the “+” strand) (Figure 3).
Figure 3: The redundant variant problem
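One practical way to handle these variants is simply to drop them, since they cannot be resolved without reliable strand information. A minimal sketch, assuming binary PLINK files with the hypothetical prefix "mydata" (columns 5 and 6 of the .bim file are the two alleles):

```bash
# List A/T and C/G ("redundant", strand-ambiguous) variants from the .bim file
awk '($5=="A" && $6=="T") || ($5=="T" && $6=="A") || \
     ($5=="C" && $6=="G") || ($5=="G" && $6=="C") {print $2}' \
    mydata.bim > redundant_variants.txt

# Exclude them
plink --bfile mydata --exclude redundant_variants.txt --make-bed --out mydata_noredundant
```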
Duplicate variants
You don't want the same variant appearing multiple times in your dataset. At this stage, you'll want to remove the duplicates. There are several ways to do this: (i) using genotype-calling information and removing the copies with lower Illumina GenCall (IGC) and GenTrain scores (though this method is rarely used), (ii) randomly keeping one copy, or (iii) keeping the copy with the lower missing data rate (Figure 4).
Figure 4: In this step, you aim to remove the variant with the higher amount of missing data, which in this case is represented by the one on the right that is missing a leg.
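As a minimal sketch of option (ii), keeping the first copy of each duplicated position and dropping the rest (hypothetical prefix "mydata"; note that this ID-based exclusion assumes the duplicated copies have distinct variant IDs):

```bash
# List duplicate variants (same position and alleles), omitting the first copy of each group
plink --bfile mydata --list-duplicate-vars ids-only suppress-first --out dups
# Exclude the remaining copies
plink --bfile mydata --exclude dups.dupvar --make-bed --out mydata_nodup

# PLINK 2 offers a one-step alternative (duplicates matched by ID):
# plink2 --bfile mydata --rm-dup force-first --make-bed --out mydata_nodup
```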
Heterozygosity
High and low rates of heterozygosity can indicate sample contamination or inbreeding within the dataset. Heterozygosity is usually assessed using F-statistics. However, in admixed populations, elevated heterozygosity can result from variants originating from different parental populations. Therefore, removing samples solely based on F-statistics may lead to unnecessary sample loss.
In admixed populations, the better approach is to calculate the heterozygosity rate and remove samples that deviate more than 3 standard deviations from the mean.
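A minimal sketch of this mean ± 3 SD approach, using PLINK's --het report (hypothetical prefix "mydata"; the .het columns used are O(HOM) and N(NM), and I assume individual IDs are unique):

```bash
# Per-sample heterozygosity statistics
plink --bfile mydata --het --out het

# Heterozygosity rate = (N(NM) - O(HOM)) / N(NM); flag samples beyond mean +/- 3 SD
awk 'NR>1 && $5>0 {rate[$2]=($5-$3)/$5; fid[$2]=$1; sum+=($5-$3)/$5; n++}
     END {
       mean=sum/n;
       for (id in rate) ss+=(rate[id]-mean)^2;
       sd=sqrt(ss/n);
       for (id in rate)
         if (rate[id] > mean+3*sd || rate[id] < mean-3*sd) print fid[id], id;
     }' het.het > het_outliers.txt

plink --bfile mydata --remove het_outliers.txt --make-bed --out mydata_het
```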
Hardy-Weinberg Equilibrium
Yes, he is back. The most important concept in population genetics needs to be mentioned here. If a variant deviates significantly from HWE, this may indicate a problem in the genotyping process, and we need to remove it.
But you may ask: wouldn't a variant that is truly associated with the disease also deviate from HWE among the cases?
Congratulations, you are absolutely right! That's why we apply Hardy-Weinberg Equilibrium (HWE) filtering twice. The first round is done on controls, with a p-value cutoff of usually p < 1e-6, and the second round on cases, using a much lower p-value cutoff (usually p < 1e-10), so that only extreme deviations are removed there. With this approach, we can remove variants that deviate from HWE without discarding those that might be crucial in the pathophysiology of Parkinson's disease.
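A minimal sketch of this two-pass filter using PLINK's --hardy report (hypothetical prefix "mydata", assuming a case/control phenotype in the .fam file; with such a phenotype, the .hwe file has separate AFF and UNAFF rows per variant, with TEST in column 3 and P in column 9):

```bash
# HWE exact test, reported separately for ALL / AFF (cases) / UNAFF (controls)
plink --bfile mydata --hardy --out hwe

# Controls: remove variants with p < 1e-6; cases: only extreme deviations, p < 1e-10
awk '$3=="UNAFF" && $9!="NA" && $9 < 1e-6  {print $2}' hwe.hwe >  hwe_fail.txt
awk '$3=="AFF"   && $9!="NA" && $9 < 1e-10 {print $2}' hwe.hwe >> hwe_fail.txt

plink --bfile mydata --exclude hwe_fail.txt --make-bed --out mydata_hwe
```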
Relationship control
Disclaimer: the author of this post has published a tool for relationship control.
Relationships among samples can pose a problem for analyses that assume data independence. Kehdy et al. (I am one of the et al.) observed this in an unsupervised admixture analysis. Admixture analysis is a tool that infers individual ancestry, and one of its key assumptions is that the samples are independent. This assumption did not hold for the Bambuí cohort, an admixed population that also contains many related individuals.
At K=7, the admixture analysis identified biogeographical clusters driven by the Bambuí population (Figure 5). Upon closer examination of this dataset, we found that these clusters were associated with two families from Bambuí. After removing the related samples, the analysis proceeded as expected.
Figure 5: Familiar structure in Bambuí consistently identified by REAP, ADMIXTURE and Principal Component Analysis (PCA). The ADMIXTURE (K=7) identifies ancestry clusters (brown and black) that match a set of relatives identified by REAP kinship analysis and by our network approach. Individuals from the black cluster were also identified by the second component of the PCA (red points) performed only for the Bambuí cohort. Figure (and legend) extracted from Supplementary Information from Kehdy et al. 2015
In this work, we developed a new methodology called NAToRA (GitHub). NAToRA is a network-based approach that aims to remove the minimum number of samples necessary to obtain a dataset with no relationships exceeding a specified cutoff (an NP-complete problem). In the context of LARGE-PD, we use second-degree relationships as the cutoff, meaning we remove duplicate samples/twins (zero-degree), first-degree, and second-degree relatives.
Our methodology offers two modes: the optimal mode, which guarantees the minimum sample size, and the heuristic mode, which does not guarantee the best result but allows for the application of NAToRA to larger datasets. Both modes outperform existing methods, including KING and PLINK.
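If you just want something quick before reaching for NAToRA, PLINK 2 ships a greedy pruner that implements the same idea, without the minimum-removal guarantee. A minimal sketch, using the usual second-degree kinship cutoff of ~0.0884 and the hypothetical prefix "mydata":

```bash
# Greedily drop samples until no pair has KING kinship above second degree (~0.0884).
# This is NOT NAToRA's optimal/heuristic approach, just PLINK 2's built-in alternative.
plink2 --bfile mydata --king-cutoff 0.0884 --make-bed --out mydata_unrelated
```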
After performing this QC, you have a beautiful dataset ready for your genetic analyses. Other pipelines include more or fewer QC steps (for example, XWAS pipelines add less common filters, such as checks for problematic probe sites) that I will not cover here.
In the next post, I will talk about phasing or ancestry analysis. See you there!