The human genome has over 150 million known variants cataloged in population databases. Single nucleotide variants (SNVs) are the most common, with projects like the 1000 Genomes Project identifying about 84.7 million of them.
These variants shape traits, disease risk, and how people respond to drugs. To make sense of them, scientists use a process called annotation: determining the functional consequences of each variant, such as its impact on gene function or its association with disease.
Annotation links variants to genes and describes their effects on proteins, which helps predict whether a variant might cause a disease or change how the body works.
Handling this much genetic data takes dedicated tools, and several exist. Personally, I have only used two:
• Ensembl Variant Effect Predictor (VEP): provides annotation and filtering of genomic variants. It predicts molecular consequences using gene sets and reports phenotype associations, allele frequencies, and deleteriousness predictions. VEP is accessible via command-line, API, and a web interface.
• ANNOVAR: provides fast and easy variant annotation, including gene-based, region-based, and filter-based annotations. ANNOVAR is a command-line tool.
As a very short example, let’s annotate variant rs41432647 at chr17:4631740 (GRCh38.p14).
1. First, let’s try VEP
As output, we get something like this, and you can always use VEP's filters and options to get more or less information.
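If you prefer to script the lookup, here is a minimal sketch that queries the Ensembl REST API's `vep/:species/id/:id` endpoint for the same rsID. This is not necessarily how the result above was produced. The helper names (`vep_url`, `fetch_vep`, `summarize`) are my own, and the fields I pull out (`transcript_consequences`, `gene_symbol`, `transcript_id`, `consequence_terms`) are the ones I would expect in the JSON response, so treat the parsing as an assumption to check against a real reply.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def vep_url(rsid, species="human"):
    """Build the REST URL for a VEP lookup by variant identifier."""
    return f"{ENSEMBL_REST}/vep/{species}/id/{rsid}"


def fetch_vep(rsid):
    """Query the Ensembl REST API and return the decoded JSON (needs network access)."""
    request = urllib.request.Request(
        vep_url(rsid), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize(records):
    """Collect (gene, transcript, consequence terms) for each transcript consequence."""
    rows = []
    for record in records:
        for tc in record.get("transcript_consequences", []):
            rows.append(
                (
                    tc.get("gene_symbol"),
                    tc.get("transcript_id"),
                    ",".join(tc.get("consequence_terms", [])),
                )
            )
    return rows

# Example call (commented out because it requires network access):
# for row in summarize(fetch_vep("rs41432647")):
#     print(*row, sep="\t")
```

The REST API is handy for a handful of variants. For whole VCF files, the command-line VEP is the better fit.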
2. Now, ANNOVAR
To use ANNOVAR, we must format our variant as a VCF file, in something like this format:
##fileformat=VCFv4.2
##source=Variant-to-VCF
#CHROM POS ID REF ALT QUAL FILTER INFO
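If you need to build that file yourself, here is a small sketch that writes a single-variant, tab-delimited VCF and gzips it. The function names are hypothetical. The REF and ALT alleles are deliberately left as parameters because they are not given in this post, so look them up (for example in dbSNP) before calling it. QUAL, FILTER, and INFO are filled with `.` as placeholders for missing values.

```python
import gzip


def single_variant_vcf(chrom, pos, variant_id, ref, alt, source="Variant-to-VCF"):
    """Return the text of a minimal VCF holding one variant (columns are tab-separated)."""
    header = [
        "##fileformat=VCFv4.2",
        f"##source={source}",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # QUAL, FILTER and INFO are unknown here, so use the VCF missing-value symbol "."
    record = "\t".join([chrom, str(pos), variant_id, ref, alt, ".", ".", "."])
    return "\n".join(header + [record]) + "\n"


def write_vcf_gz(path, vcf_text):
    """Write the VCF text to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(vcf_text)
```

The main thing to watch is the tabs: VCF columns must be separated by tabs, not spaces, which is easy to lose when copying from a web page.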
# Positional arguments: input VCF file (gzipped), then the directory containing ANNOVAR databases
# -buildver   genome build version
# -out        output prefix and directory
# -remove     remove intermediate files after annotation
# -protocol   databases used for annotation
# -operation  type of operation for each database (g = gene-based, f = filter-based)
# --nopolish  skip polishing of the protein notation for indels
# -nastring   string to represent missing values in the output
# -vcfinput   specify that the input is a VCF file
perl annovar/table_annovar.pl \
  data/rs41432647.vcf.gz \
  annovar/humandb/ \
  -buildver hg38 \
  -out Annovar_results/rs41432647.annovar \
  -remove \
  -protocol refGene,clinvar_20140902,dbnsfp47a \
  -operation g,f,f \
  --nopolish \
  -nastring . \
  -vcfinput
And once again, by filtering the output (which can be super long), you can pull out the columns you care about:
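One way to do that filtering is a few lines of Python over the tab-delimited table that `table_annovar.pl` writes. This is only a sketch: `select_columns` is my own helper, and the column names in the usage note are what I would expect from the refGene, ClinVar, and dbNSFP protocols. Check the header of your own output, since the names depend on the database versions.

```python
import csv


def select_columns(handle, wanted):
    """Read a tab-delimited ANNOVAR table and keep only the wanted columns.

    Columns that are not present in the header are silently skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    present = [name for name in wanted if name in header]
    return [{name: row[name] for name in present} for row in reader]


def load_selected(path, wanted):
    """Open the annotation table at `path` and return the selected columns per variant."""
    with open(path, newline="", encoding="utf-8") as handle:
        return select_columns(handle, wanted)
```

With the command above, the table should land at `Annovar_results/rs41432647.annovar.hg38_multianno.txt`, so a call might look like `load_selected(path, ["Func.refGene", "Gene.refGene", "Polyphen2_HDIV_pred", "ClinPred_score", "CADD_phred"])`, with those column names being assumptions on my part.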
So, in summary, what have we learned about our variant?
• Impact: The variant occurs in an exonic or protein-coding region
• ClinVar Status: No associated clinical significance reported
• Predictions:
  • PolyPhen-2 predicts the variant to be damaging (D) under the HDIV model.
  • ClinPred also predicts the variant to be damaging (D).
  • PolyPhen score: 0.958, indicating the variant is likely damaging to protein function.
  • ClinPred score: 0.998, further supporting the likelihood of a deleterious effect.
• CADD Scores:
  • PHRED: 24.2 (higher scores suggest greater potential deleteriousness).
  • RAW: 4.074235.
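To put the CADD PHRED value in context: the PHRED-scaled score is a rank on a log scale, so 10 means the top 10% of all possible single-nucleotide substitutions, 20 the top 1%, and 30 the top 0.1%. A tiny sketch of that conversion (the function name is my own):

```python
def cadd_top_fraction(phred):
    """Fraction of possible substitutions that rank at least this high, from the PHRED scaling."""
    return 10 ** (-phred / 10)


# For the variant in this post, PHRED 24.2 works out to roughly the top 0.4%.
print(f"PHRED 24.2 -> top {cadd_top_fraction(24.2):.2%} of possible substitutions")
```

This only tells you how the variant ranks relative to other possible substitutions. It is not a probability that the variant is pathogenic.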
Please let me know which other tools you have used, and what annotation has been useful for in your work!