Here is why CADD has become the preferred variant annotation tool

Variant annotation. In both clinical practice and research projects, we’re often faced with the question of whether a given variant is benign or pathogenic. In silico prediction tools are designed to help with this decision-making process. However, there are so many of them that it is often hard to assess which tool works best. In a 2014 publication in Nature Genetics, the CADD score was introduced as a comprehensive tool that aims to take the results of many known prediction tools into account. Follow me on a journey that takes us to hyperplanes, support vector machines and every possible variant in the human genome.

What the CADD score does. This is a prediction about the Epi4K de novo mutations. On the left, several functional annotation tools including SIFT and Polyphen are color coded from tolerated to damaging. The entire table was then sorted by the CADD score. It can be seen that at the top, there is more red (damaging) than green (tolerated) but this distinction does not hold up for each and every prediction tool. The CADD score aggregates the information of all these predictors. A link to the full table can be found here. The right hand side figure is modified from a presentation entitled “From A Gentle Introduction to Support Vector Machines in Biomedicine” by Alexander Statnikov (New York University), Douglas Hardin (Vanderbilt University), Isabelle Guyon (ClopiNet), Constantin F. Aliferis (New York University) available at http://www.med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf. We understand that this presentation is not copyrighted and we have approached the author to clarify that this permission is correct.

Variant annotation. Not every variant in the human genome is the same when it comes to assessing the role of genetic variation in disease. For example, when we encounter a truncating SCN1A mutation (stop mutation) in a child with fever-associated epilepsy, we don’t really need much more information before calling this variant disease-causing. However, for most genetic variants, the situation is less clear, especially when it comes to missense mutations that result in an exchange of amino acids rather than a truncation of the protein. As functional data is only available for a small minority of variants, we have to rely on prediction tools to help us decide whether such a variant is damaging or benign. Tools that are commonly used for this purpose include PolyPhen, SIFT, and GERP. However, there are always situations when these tools don’t agree. In order to address this problem, Kircher and collaborators developed Combined Annotation-Dependent Depletion (CADD), a method that integrates the information from many different functional annotations and condenses it into a single score.

Support vector machine. After reading the publication by Kircher and collaborators, I first needed to sit back and google some basic principles of machine learning. The authors use a support vector machine (SVM), which is an algorithm for classifying complex data. Basically, when you have two groups, an SVM looks for the boundary that best separates them. In technical terms, it is a kernel machine: it can transform the data into a higher-dimensional space where the two groups can easily be separated by a hyperplane (don’t ask me how long it took to write this last sentence). The learned boundary can then be used to classify future data. For their study, Kircher and collaborators trained the SVM to distinguish existing genomic variation from simulated genomic variation.
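To make the hyperplane idea a bit more concrete, here is a minimal toy sketch in Python. It is not the model or the data from the CADD publication. The seven data points, the feature map x -> (x, x^2) and the function names are all made up for illustration. The training loop is a margin perceptron, a simplified relative of an SVM: it finds a hyperplane that separates the two groups with a safety margin, but unlike a real SVM it does not search for the widest possible margin. The point is only to show that two groups that cannot be split by a single threshold in one dimension become separable by a straight line once the data is lifted into a higher-dimensional space.

```python
def feature_map(x):
    """Lift a 1-D value into 2-D: x -> (x, x^2). Hypothetical toy mapping."""
    return (x, x * x)


def score(w, b, features):
    """Signed position of a point relative to the hyperplane w . f + b = 0."""
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def train_margin_perceptron(samples, labels, lr=0.1, max_epochs=1000):
    """Update w and b whenever a point has a functional margin below 1.

    Stops as soon as one full pass over the data needs no update.
    This is a simplified stand-in for an SVM solver (no margin maximization).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for feats, y in zip(samples, labels):
            if y * score(w, b, feats) < 1.0:
                w = [wi + lr * y * fi for wi, fi in zip(w, feats)]
                b += lr * y
                updated = True
        if not updated:
            break
    return w, b


def predict(w, b, x):
    """Classify a raw 1-D value as +1 or -1 using the learned hyperplane."""
    return 1 if score(w, b, feature_map(x)) >= 0 else -1


# Invented toy data: the "outer" group (+1) surrounds the "inner" group (-1),
# so no single cut-off on x can separate them in one dimension.
xs = [-3.0, -2.0, 2.0, 3.0, -0.5, 0.0, 0.5]
ys = [1, 1, 1, 1, -1, -1, -1]

w, b = train_margin_perceptron([feature_map(x) for x in xs], ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions == ys)
```

In the lifted space the x^2 coordinate does the work: the inner points sit low, the outer points sit high, and a line between them is enough. A kernel achieves the same effect without ever writing out the extra coordinates explicitly.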

The basic idea. The concept behind their training set was to compare genetic variation that is tolerated in the human species with simulated data. This simulated data should represent possible de novo mutations in humans that may have occurred, but that were not fixed in the human population, possibly because they are disease-causing. Based on these two groups, they generated annotations with more than 60 functional prediction tools and trained their algorithm. They then generated scores for virtually every possible genomic variant, allowing us to use a full set of annotations for every new variant that we may encounter in genetic studies. This helps us compensate for the gaps that some of the individual annotation tools have.

How good is CADD? You can think about CADD as a “meta-annotation” tool that uses information from many of the functional annotation tools that are around (Figure, link to table here). I did a little test run with the Epi4K data and sorted the de novo mutations that were not found in ExAC by the CADD score. A scaled CADD score of 20 means that a variant is amongst the top 1% of the most deleterious variants in the human genome. A scaled CADD score of 30 means that the variant is in the top 0.1% and so forth. When we look at the Epi4K data, the top 10 de novo missense mutations are in CNTN5, ANKRD12, STXBP1 (2x), ASXL1, SCN2A, DHDDS, TRRAP, SMG9, and GABRB3. In contrast, the least deleterious de novo missense mutations are in PRR19, ZSWIM8, FCGR2B, STPG2, ACOT4, UNC5CL, HDAC4, CHIA, HNRNPH1, and ZNF467. Notice something? The group of de novo mutations predicted to be highly deleterious is highly enriched for epilepsy genes, while there are far fewer epilepsy genes in the second group. CADD has basically helped us sort causative mutations from non-causative mutations. Kircher and collaborators tested their score on cohorts of patients with autism and intellectual disability and came to a similar conclusion. CADD outperforms many of the other annotation tools.
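The "20 means top 1%, 30 means top 0.1%" rule follows from a PHRED-like scaling, in which the scaled score is -10 * log10(rank / total) and rank 1 is the most deleterious variant. The short Python sketch below works through that arithmetic. The function names and the example numbers are my own and only illustrate the formula. The real CADD scores are ranked against all possible single nucleotide variants in the genome, not against a handful of made-up values.

```python
import math


def phred_scale(rank, total):
    """PHRED-like score for a variant at a given rank (1 = most deleterious)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


def top_fraction(scaled_score):
    """Invert the scaling: which top fraction does a scaled score correspond to?"""
    return 10.0 ** (-scaled_score / 10.0)


def scale_raw_scores(raw_scores):
    """Rank raw scores (higher = more deleterious) and PHRED-scale each one.

    Ties are broken by input order, which is a simplification.
    """
    order = sorted(range(len(raw_scores)),
                   key=lambda i: raw_scores[i], reverse=True)
    scaled = [0.0] * len(raw_scores)
    for rank, idx in enumerate(order, start=1):
        scaled[idx] = phred_scale(rank, len(raw_scores))
    return scaled


# Rank 10 out of 1000 is the top 1%, which gives a scaled score of about 20.
print(round(phred_scale(10, 1000), 2))
# A scaled score of 30 corresponds to roughly the top 0.1%.
print(round(top_fraction(30) * 100, 4), "percent")
# Invented raw scores, only to show the ranking step.
print([round(s, 2) for s in scale_raw_scores([0.5, 3.0, 1.0, 2.0])])
```

One consequence of this scaling is that the scaled score is relative: it tells you where a variant sits in the ranking of all scored variants, not how damaging it is on an absolute scale.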

This is what you need to know. Combined Annotation-Dependent Depletion (CADD) is a novel functional annotation tool that allows for an unbiased annotation of a large number of possible variants in the human genome. In contrast to other annotation tools, CADD integrates data from existing tools in an innovative way. This method compensates for the incompleteness and bias of many existing methods and gives us a one-stop approach to variant annotation.

Ingo Helbig is a child neurologist and epilepsy genetics researcher working at the Children’s Hospital of Philadelphia (CHOP), USA.