Exomes on the go – adventures with wANNOVAR

Going cloud. This post is about my most recent discovery when I was trying to modernize some of the bioinformatics tools that I had on my laptop. My experience with variant annotation is a good example of the latest trend in bioinformatics: replacing precise, but difficult-to-use tools by web-based convenience – I didn’t need to install anything after all. This is a brief journey into the world of variant annotation, taking advantage of my new favorite tool, wANNOVAR and applying it to the Epi4K dataset.

A VCF file with the de novo mutations found in the Epi4K publication

A VCF file with the de novo mutations found in the Epi4K publication

Change of mind. If you read some of my earlier bioinformatics posts, you will find that I have always advocated for the “I believe it because it runs on my laptop” paradigm. Genetics is currently transforming from a relatively data-poor science to big data science. This comes with the need to use tools that help us understand this complex landscape, the need to use tools that mine and filter genomic data. The reason why I am writing this post is my most recent discovery: it’s good to understand these tools and how they work. However, you don’t need to do this all by yourself. There are already all kinds of tools out there to help you.

ANNOVAR. Every genomic scientist has her or his favorite annotation tool. I have always been a fan of ANNOVAR, as this program runs beautifully on all platform and gives you a data format that is easy to interpret. Basically, ANNOVAR does something really simple: it takes a genomic variant and annotates this variant against various genetic databases. You end up with table of variants that includes genomic annotation, frequency in control databases, conservation and functional prediction tools. As my laptop version of ANNOVAR was almost two years old, I wanted to update it to include the most recent control databases such as ExAC and novel functional prediction tools such as the CADD score. I then realized that the web-based version of ANNOVAR was much faster than any update that I could run on my laptop.

How to use it. I thought that I would walk you through wANNOVAR step-by-step. As an example, I have used the Epi4K de novo mutations, which is interesting as the publication of this dataset predated the release of the ExAC database. ExAC is a large dataset of variants from more than 60,000 exomes at the Broad Institute, which is currently the largest control database that we have access to. Admittedly, ExAC is not a true control database as some of the individuals included have neurological disorders (namely schizophrenia, bipolar disorder, and Tourette syndrome). However, especially when it comes to rare population variants, it is the only database that holds large datasets of various ethnicities. What you need to start with is a VCF file. I have attached my example of the VCF file that I have used for the analysis. VCF files require an 8 column format. However, the columns after the REF and ALT column are quality measures that don’t impact on the annotation – in our file, I have put in fixed numbers. If you want to run a variant through wANNOVAR, take this file and simply replace the hg19 coordinates as well as the reference allele and variant allele. Then save this file as a txt file and upload it into wANNOVAR (link). Wait for 5-10 min. Then download the csv file from genome_annotation. If you’re working on a PC, it should automatically open in Excel.

How to interpret it. First of all, wANNOVAR managed to annotate all variants correctly – having something work the first time around is usually something that’s unpredictable in the bioinformatics sphere, so this is already a first important step. The Excel sheet can be found here (link). In columns A-J, you can find information about the annotation, columns K-AA provide us with the frequency in various databases including 1000 Genomes, the Exome Variant Server, and ExAC. Finally, columns AQ-BS report on the functional annotation. For example, take column BM, the phredded CADD score, a sophisticated measure of in silico variant pathogenicity. Scores above ~20 are usually considered likely pathogenic. Let’s now come back to my initial question – what does ExAC do to the Epi4K de novo mutations?

Epi4K post ExAC. Let’s filter by column Q and set it to non-empty. This results in 65 de novo mutations that are also seen in ExAC. Even when taking into account that ExAC is not a “clean” control database as some patients with schizophrenia are included, roughly 20% (65/329) of the de novo mutations fail one important criterion that is sometimes applied to judge pathogenicity: they are not entirely absent in controls. At this point, it’s not clear whether this is due to mutations hotspots or if these are recurrent mutations in patients with undiagnosed neurodevelopmental disorders. However, none of the de novo mutations seen in ExAC is in any of the genes that we would consider high-yield candidates. The only exception may be a de novo mutation in the TRIO gene, which is one of the genes found in various neurodevelopmental disorders. Given that this particular variant is also seen in 1000 Genomes, it is unlikely that this is an individual with a neurological disease. In contrast, there might be variants in this gene that occur de novo and that may be tolerated in controls. Other de novo mutations found in controls include mutations in CREBBP, GPR98, SMURF1, and TTN.

This is what you need to know. With the files attached to this blog post, you should be able to upload a variant into wANNOVAR and annotate it. Programs like wANNOVAR provide us with the possibility to generate variant annotation without being dependent on a bioinformatics server and without needing more sophisticated programming skills. Given that it made my life much easier of the last two weeks, I thought that I wanted to share this information with you. It is interesting to see that re-annotating exome data that is only two years old exposes some de novo mutations that were initially already thought to be non-pathogenic as variants also seen in control databases. However, it also add more credibility to the Epi4K pathogenic variants in genes like SCN1A or STXBP1 that are entirely absent in control databases.

Ingo Helbig is a child neurologist and epilepsy genetics researcher working at the Children’s Hospital of Philadelphia (CHOP), USA.

Twitter