For many datasets, depending on the platform or the imputation server used, or when working with historical summary statistics from various sources (e.g. the GWAS catalog), the variant information may be incomplete or out of date. At the least, chromosome and base-pair location will be available, but possibly not for the most recent hg build. So there are a few best practices to follow:
-
Confirm the build of the genetic data. If it is not in the build of choice, then use LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert the coordinates. This is particularly important if you will be conducting a meta-analysis of summary statistics generated on different builds.
-
Annotate SNPs once the build is confirmed. There are several tools that can map a lot of very cool information to each variant, e.g. whether it's genic and its function in the gene; whether it's in a regulatory region (e.g. CpG island, miRNA target site); whether it's associated with other phenotypes (e.g. ClinVar, GWAS catalog); and various other annotations. A few examples of tools: a. SNPnexus (SNP Annotation Tool) - a web application; b. ANNOVAR (ANNOVAR Documentation) - a Perl program; and c. AnnoGen (GitHub - shengqh/annogen) - a Python program. I was a huge fan of SeattleSeq, a web application, but it seems it is no longer available/supported as of a few months ago.
-
Regulatory potential can be succinctly evaluated using RegulomeDB, which scores a variant based on evidence from ChIP-seq data (if the SNP is in a transcription factor [TF] binding site), impact on chromatin states, findings from DNase-seq experiments, location within TF motifs, expression QTLs (eQTL - if the SNP is associated with the expression of a gene in a specific tissue), and chromatin accessibility QTLs (caQTL - if the SNP is associated with nucleosome packing/positioning in a specific tissue). The RegulomeDB score ranges from 1a to 7, with 1a having the strongest evidence and 7 having no regulatory evidence.
-
Gene expression potential can be more extensively investigated using various user-friendly web-application tools. While SNP annotation reports on whether that one SNP is a possible eQTL (expression quantitative trait locus), it is quite likely that the SNP is in linkage disequilibrium (LD) with other nearby variants that are eQTLs for similar or different genes across various tissues. Two of my favorite tools are:
a. FIVEx (https://fivex.sph.umich.edu/), which can easily search for a SNP of interest and all SNPs in LD up to a distance of +/- 1Mb. It includes a lot of different data sets (e.g. ROSMAP, TwinsUK, GTEx, FUSION, etc.) - hence, it has been my go-to eQTL tool. There is also an option to look for splicing QTLs - fewer data sets, but something novel.
b. LDlink (https://ldlink.nih.gov/) is hosted by NIH and has a lot of features that can be explored. LDexpress is similar to FIVEx but only for GTEx. LDhap evaluates population-specific haplotype frequencies - this is an under-utilized angle in SNP interpretation, as a haplotype may well be the causal structure rather than a single variant. LDassoc examines LD structure in specific populations, which can be converted into a heatmap with LDmatrix. LDpair can look for correlated pairs of variants. LDtrait can look for a list of variants in LD with variants of interest and extract their results from the GWAS catalog. There are also a few other tools, e.g. LDproxy - but as you can see, it is a great one-stop interface from which a lot of functionality for a SNP or SNPs could be hypothesized.
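For the build-confirmation step above, a small sketch of preparing input for LiftOver may help. Summary statistics usually carry 1-based positions, while the BED format that LiftOver accepts uses a 0-based start and an exclusive end, so an off-by-one slip is easy to make. The function name, the SNP labels, and the positions below are all made up for illustration; this is not part of any tool's API.

```python
def to_bed_line(chrom, pos, name):
    """Convert a 1-based SNP position to a BED line (0-based start, exclusive end)."""
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # UCSC-style chromosome naming
    pos = int(pos)
    if pos < 1:
        raise ValueError(f"position must be >= 1, got {pos}")
    # A single-base variant at 1-based position p spans [p - 1, p) in BED
    return f"{chrom}\t{pos - 1}\t{pos}\t{name}"


# Hypothetical rows: (chromosome, 1-based position, SNP label)
rows = [("1", 1000, "snpA"), ("chrX", 2500, "snpB")]
bed_lines = [to_bed_line(*row) for row in rows]
print("\n".join(bed_lines))
```

The resulting lines can be pasted or uploaded into the LiftOver form; the fourth column keeps the SNP label attached to each converted coordinate so results can be merged back.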
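For the RegulomeDB step above, here is a minimal sketch of ordering variants from strongest to weakest regulatory evidence. It assumes each score is a number with an optional letter suffix (as the 1a to 7 range suggests); the SNP labels and their scores are invented for the example.

```python
def regulome_sort_key(score):
    """Map a score such as '1a', '2b' or '7' to a sortable (number, letter) tuple."""
    score = score.strip().lower()
    digits = "".join(ch for ch in score if ch.isdigit())
    letters = "".join(ch for ch in score if ch.isalpha())
    if not digits:
        raise ValueError(f"unrecognized score: {score!r}")
    return (int(digits), letters)


# Hypothetical variants and scores, for illustration only
scores = {"snpA": "4", "snpB": "1a", "snpC": "7", "snpD": "2b", "snpE": "1f"}

# Lower number (then earlier letter) means stronger evidence, so sort ascending
ranked = sorted(scores, key=lambda snp: regulome_sort_key(scores[snp]))
print(ranked)
```

A plain string sort would also work for single-digit scores, but splitting number and letter makes the intent explicit and fails loudly on malformed values.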
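LDlink can also be queried programmatically, which is handy for more than a handful of SNPs. The sketch below only builds a request URL for LDproxy and does not send it. The endpoint path and the parameter names (`var`, `pop`, `r2_d`, `token`) are written from memory of the LDlink API documentation and should be treated as assumptions to verify there; `rs0000` and `YOUR_TOKEN` are placeholders, and a personal token has to be requested from LDlink.

```python
from urllib.parse import urlencode

# Assumed REST root for LDlink; check the API access page before relying on it
BASE = "https://ldlink.nih.gov/LDlinkRest"


def ldproxy_url(variant, population, token, metric="r2"):
    """Build (but do not send) an LDproxy query URL for one variant."""
    params = {
        "var": variant,     # query variant, e.g. an rsID
        "pop": population,  # 1000 Genomes population code, e.g. CEU
        "r2_d": metric,     # LD measure to report: "r2" or "d"
        "token": token,     # personal API token
    }
    return f"{BASE}/ldproxy?{urlencode(params)}"


url = ldproxy_url("rs0000", "CEU", "YOUR_TOKEN")
print(url)
```

Keeping URL construction separate from the network call makes it easy to inspect the request before spending API quota.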
There are always new tools popping up - do you have any suggestions? @paularp @ehutchins @vdardov @gdp22 @Vidash @danieltds @psaffie