Tips for Fox Insight's whole-genome genetic data

Fox Insight offers comprehensive single nucleotide polymorphism (SNP) genotyping data. To access them, first register through FoxDEN (Tier 1 access), then apply for Tier 2 access to the genetic data (see here: Genetic Data Request Process).

The data are in .bgen format (see here for more: Tips on how to convert genetic data from BGEN to another format?).

Sample Size
As of now there are ~10,700 subjects with genetic data; the majority (10,250) are individuals with PD, which is a fantastic sample size.

Genotyping Platform
The subjects were genotyped across five different Illumina platforms, reflecting the chip versions used by 23andMe’s research enterprise, which generated the data.
v2: 2.0% on Illumina 550Qv1 Custom v2 (~550K SNPs, used starting in 2009)
v3: 5.1% on Illumina OmniEXv3 Custom v3 (~950K SNPs)
v4: 10.2% on Illumina Custom v4 (~600K SNPs, used starting in 2013)
v5.2: 0.6% on Illumina GSA Custom v5.2 (~690K SNPs, used starting in 2017; overlaps well with v5)
v5: 82.2% on Illumina GSA Custom v5 (~670K SNPs, in use as of 2022)

There is meaningful overlap across the platforms, but since >90% of subjects were genotyped on v4 or v5, these two platforms are a good starting point. v4 is the Illumina High-Throughput Screening iSelect panel and v5 (lumping in v5.2) is the Illumina Global Screening Array HD. See this Venn diagram (pre-QC, when duplicate SNPs were still present; see next section). For one genetic investigation, instead of restricting imputation to the overlap between v4 and v5, we imputed the v4 data and the v5 data separately; after standard imputation QC, we were able to analyze >7.5 million SNPs shared between v4 and v5. So despite what the Venn diagram suggests, there is potential for comprehensive large-scale genetic analyses. It is likely that v2 and v3 can similarly be imputed individually and merged with v4 and v5. If you proceed with this approach, include an indicator variable for platform as a covariate in your models as a best practice.
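To make the platform covariate concrete, here is a minimal Python sketch of one way to build indicator (dummy) variables from a per-subject platform label. The function name, the column naming, and the choice of v5 as the reference level are illustrative assumptions, not part of the Fox Insight data; folding v5.2 into v5 simply mirrors the lumping described above.

```python
def platform_indicators(platforms, reference="v5", merge=None):
    """Build 0/1 indicator columns for genotyping platform.

    platforms: one platform label per subject, e.g. "v4", "v5", "v5.2".
    reference: level left out of the indicators (hypothetical choice).
    merge:     labels to collapse before coding; by default v5.2 -> v5.
    """
    if merge is None:
        merge = {"v5.2": "v5"}
    collapsed = [merge.get(p, p) for p in platforms]
    # One indicator per non-reference platform actually present.
    levels = sorted(set(collapsed) - {reference})
    rows = [
        {f"platform_{lvl}": int(p == lvl) for lvl in levels}
        for p in collapsed
    ]
    return levels, rows


levels, rows = platform_indicators(["v4", "v5", "v5.2", "v4"])
for row in rows:
    print(row)
```

The resulting columns can be added alongside the usual covariates (e.g. age, sex, ancestry principal components) in whatever association software you use.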

[Image: Venn diagram of SNP overlap across genotyping platforms, pre-QC]

Duplicate SNPs
This may have changed, but in a data version we downloaded in 2022, there was extensive SNP duplication (the majority of duplicates having the same rsID, position, and alleles). In the raw v4 and v5 data, 1.4 million SNPs were present in duplicate, 2,500 were present 4x, ~400 were present 6x, and 60 were present 8x. For most, the duplicate was an empty line that would be removed with basic QC (i.e., missingness >10%). There were ~3,000 SNPs that had replicate data with discordant allele calls; these we dropped entirely. So as a best practice, apply practical QC to the raw genotype calls.
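The duplicate handling described above can be sketched in plain Python. This is an illustration under assumptions, not the pipeline we actually ran: the record layout (dicts with `rsid`, `chrom`, `pos`, `alleles`, `calls`) and the function names are hypothetical, and in practice you would do this with your genetics toolkit of choice. It drops copies with missingness >10%, then removes every copy of a SNP whose remaining replicates disagree.

```python
from collections import defaultdict
from itertools import combinations


def missingness(calls):
    """Fraction of missing (None) genotype calls."""
    return sum(c is None for c in calls) / len(calls)


def concordant(a, b):
    """True if two replicates agree wherever both have a call."""
    return all(x == y for x, y in zip(a, b)
               if x is not None and y is not None)


def dedupe_snps(records, max_missing=0.10):
    """Keep one copy per SNP; drop discordant SNPs entirely."""
    groups = defaultdict(list)
    for rec in records:
        # Basic QC first: this removes the "empty line" duplicates.
        if missingness(rec["calls"]) <= max_missing:
            key = (rec["chrom"], rec["pos"], rec["alleles"])
            groups[key].append(rec)
    kept = []
    for copies in groups.values():
        pairs = combinations(copies, 2)
        if all(concordant(a["calls"], b["calls"]) for a, b in pairs):
            kept.append(copies[0])
    return kept


# Toy data: 10 subjects per SNP (made-up values for illustration).
full = ["AA", "AG", "GG", "AA", "AG", "GG", "AA", "AG", "GG", "AA"]
flipped = ["GG"] + full[1:]
records = [
    {"rsid": "rs1", "chrom": "1", "pos": 100, "alleles": "A/G", "calls": full},
    {"rsid": "rs1", "chrom": "1", "pos": 100, "alleles": "A/G", "calls": [None] * 10},
    {"rsid": "rs2", "chrom": "1", "pos": 200, "alleles": "A/G", "calls": full},
    {"rsid": "rs2", "chrom": "1", "pos": 200, "alleles": "A/G", "calls": flipped},
    {"rsid": "rs3", "chrom": "2", "pos": 300, "alleles": "A/G", "calls": full},
]
kept_ids = [r["rsid"] for r in dedupe_snps(records)]
print(kept_ids)
```

Here rs1 survives as a single copy (its duplicate was empty), rs2 is dropped in full because its replicates disagree, and rs3 is untouched.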


Hi, I’m working on the Fox Insight genetic data. Have you found the best way to convert BGEN files to PLINK format? I encountered challenges in that there are multiallelic variants in BGEN, and PLINK does not support the import of data with multiallelic variants.

@mlai3 Alas, we did not retain multi-allelic variants for that very same reason, but I do believe you can convert a BGEN file with multiallelic variants to VCF using bgenix followed by bcftools.
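As a rough, untested sketch of that route, the Python helper below only assembles the shell commands as strings; it does not run them. The file names are placeholders, and it assumes bgenix's `-index` and `-vcf` options and `bcftools norm -m -any` (which splits multiallelic sites into biallelic records) behave as documented for your installed versions, so please check both tools' help output first.

```python
import shlex


def bgen_to_vcf_commands(bgen_path, out_vcf_gz):
    """Return shell commands (as strings) for BGEN -> split VCF.

    Nothing is executed here; the strings are meant for inspection.
    """
    # bgenix needs an index next to the BGEN file before querying it.
    index_cmd = ["bgenix", "-g", bgen_path, "-index"]
    # Stream the whole file to stdout as VCF.
    to_vcf_cmd = ["bgenix", "-g", bgen_path, "-vcf"]
    # Read VCF from stdin ("-"), split multiallelic sites,
    # and write bgzip-compressed output.
    split_cmd = ["bcftools", "norm", "-m", "-any",
                 "-Oz", "-o", out_vcf_gz, "-"]
    pipeline = f"{shlex.join(to_vcf_cmd)} | {shlex.join(split_cmd)}"
    return [shlex.join(index_cmd), pipeline]


commands = bgen_to_vcf_commands("chr22.bgen", "chr22.split.vcf.gz")
for cmd in commands:
    print(cmd)
```

Once the multiallelic sites are split, the resulting VCF should be easier to import into PLINK than the original BGEN.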