The Zero ExAC problem

Evidence and absence. There is a time before and a time after ExAC, the gigantic variant repository based on more than 60,000 exomes sequenced at the Broad Institute. ExAC was released in October 2014 and suddenly gave the community access to variant data from roughly ten times more individuals than previous resources. But what happens when you check variants that were previously considered pathogenic and find them at low frequency in ExAC? Welcome to the Zero ExAC problem, which gives us a taste of the complications that epilepsy gene variant interpretation will face in the future.

The Zero ExAC problem, exemplified by the EFHC1 R221H variant. When looking at smaller cohorts largely of European ancestry, this variant is absent, which is one of the prerequisites for calling this variant pathogenic. However, when compared to the full ESP6500 and ExAC datasets, we can see that this variant is actually present at a comparable frequency in individuals of African ancestry.

Databases. Before talking about the complications of databases, let us first look at an easy-to-use tool for variant visualization. Let me introduce you to DIVAS, a web-based interface that helps you query specific variants in a total of 150,000 individuals. In contrast to other databases that require you to upload data in a specific format, DIVAS allows you to simply enter the gene and the variant and press submit. Within a few seconds, you get an overview of how common a variant is in any of the published cohorts.

Pathogenic variants. Let’s be conservative in our variant interpretation first. Any variant that we consider disease-causing for a genetic epilepsy due to a de novo mutation should not be in ExAC. Given the huge number of rare variants in the human genome, a variant found in ExAC is far more likely to be a false positive than a disease-causing variant with incomplete penetrance or a variant that may also predispose to schizophrenia or Tourette Syndrome. My personal guess would be that a variant found in ExAC, even at a low frequency, is 10-100x more likely to be a false positive finding than a causative variant that somehow made it into ExAC. My second guess would be that this will be an issue for many allegedly pathogenic variants that were derived from gene panels, reported prior to 2014, and not followed up in parents.

Precedents. How much bigger can it get, and what are the risks of adding more and more data and simply regarding them as controls? Things can get complicated. The Toronto-based Database of Genomic Variants (DGV) offers an instructive example of how a large collection of heterogeneous data can be difficult to interpret. DGV is a large database for copy number variants that includes both old studies using BAC arrays and more recent studies that call microdeletions from next generation sequencing data. There is a consensus in the community that it is extremely difficult to use this database as a control database, as many findings in DGV may not be validated. You may find that your candidate gene for epilepsy has been found to be deleted in some control individuals, but you don’t know the certainty of this finding. For example, the 1q21.1 microdeletion, one of the more common microdeletions, had been masked by a common CNV whose boundaries had been overcalled. Similar phenomena can be expected once we move from 100K exomes to 1M exomes in public databases.

The ultra-rare variants. But could there be pathogenic variants that occur at very low frequencies in the general population? The answer is yes, at least theoretically. But making this assumption comes with a few important caveats. First, we would need to realize that these variants would be risk factors, not causative mutations. We have to switch paradigms and be completely honest with ourselves that we have a tendency to drift back into “monogenic disease mode” and overcall the relevance of these findings. Second, when we look at microdeletions, there are some good examples of how some of the more common microdeletions with a frequency of ~1% may be associated with disease with a high effect size (odds ratio), but also be found in some unaffected individuals in the population. Finally, we have to be honest again that for ultra-rare variants, we will likely not have sufficient statistical power to show that these variants are truly associated. And association is all that really matters when it comes to rare variants. This means that even though an ultra-rare SCN1A, CACNA1A, or KCNQ2 variant may seem to be disease-contributing, we have no way of proving this.
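To make the power argument a little more tangible, here is a minimal Python sketch of how one might estimate the power of a case-control comparison for a single ultra-rare variant. Everything in it is an assumption for illustration: the cohort sizes, the carrier frequency, the odds ratios, and the choice of a one-sided Fisher's exact test are mine and do not come from ExAC or any published study.

```python
from math import comb


def binom_pmf_terms(n, p, tail=1e-12):
    """Yield (k, P[X = k]) for X ~ Binomial(n, p) until the remaining mass is below `tail`."""
    pmf = (1.0 - p) ** n
    cumulative = 0.0
    for k in range(n + 1):
        yield k, pmf
        cumulative += pmf
        if cumulative >= 1.0 - tail:
            return
        # Recurrence from P[X = k] to P[X = k + 1]; avoids huge binomial coefficients.
        pmf *= (n - k) / (k + 1) * p / (1.0 - p)


def fisher_one_sided(a, b, n_cases, n_controls):
    """P-value for seeing at least `a` carriers among cases, given a + b carriers overall."""
    total = n_cases + n_controls
    carriers = a + b
    upper = min(carriers, n_cases)
    numerator = sum(
        comb(carriers, x) * comb(total - carriers, n_cases - x)
        for x in range(a, upper + 1)
    )
    return numerator / comb(total, n_cases)


def power(n_cases, n_controls, control_freq, odds_ratio, alpha=0.05):
    """Probability that the one-sided Fisher test is significant at level `alpha`."""
    # Convert the assumed odds ratio into a carrier frequency in cases.
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    case_freq = odds / (1.0 + odds)
    result = 0.0
    for b, prob_b in binom_pmf_terms(n_controls, control_freq):
        for a, prob_a in binom_pmf_terms(n_cases, case_freq):
            if fisher_one_sided(a, b, n_cases, n_controls) <= alpha:
                result += prob_a * prob_b
    return result


if __name__ == "__main__":
    # Hypothetical scenario: 500 patients, 60,000 controls, 1 carrier per 100,000 controls.
    for odds_ratio in (1.0, 2.0, 20.0):
        estimate = power(500, 60000, 1e-5, odds_ratio)
        print(f"odds ratio {odds_ratio:>4}: power = {estimate:.3f}")
```

Under these made-up numbers, most patient cohorts are expected to contain no carrier at all, which is the practical meaning of "insufficient power". A realistic analysis would also need a much stricter significance threshold than the 0.05 used here to account for the number of variants tested.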

This is what you need to know. We have a Zero ExAC problem. Take 10 of the variants without parental data that you thought of as disease-causing in 2014. Run them through a large database like ExAC. You will likely find that several variants drop out, and you will need to reconsider how these variants should be interpreted in 2015.
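The exercise above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes that population-level allele counts have already been exported into a local table, and the variant names, population labels, and counts below are invented placeholders rather than real ExAC entries. The helper names `max_population_frequency` and `screen` are likewise hypothetical.

```python
# Hypothetical reference table: variant -> population -> (allele count, alleles genotyped).
# All variant names and counts are invented for illustration only.
REFERENCE = {
    "GENE_A:p.Arg100His": {"european": (0, 66000), "african": (12, 10000)},
    "GENE_B:p.Gly50Ser": {"european": (1, 66000), "african": (0, 10000)},
}


def max_population_frequency(counts):
    """Return (population, frequency) for the population with the highest allele frequency."""
    best = None
    for population, (allele_count, allele_number) in counts.items():
        if allele_number == 0:
            continue  # no genotyped alleles, so no frequency can be computed
        frequency = allele_count / allele_number
        if best is None or frequency > best[1]:
            best = (population, frequency)
    return best


def screen(candidates, reference):
    """Return {variant: (population, frequency)} for candidates seen in any population."""
    flagged = {}
    for variant in candidates:
        counts = reference.get(variant)
        if counts is None:
            continue  # variant is absent from the reference table
        best = max_population_frequency(counts)
        if best is not None and best[1] > 0:
            flagged[variant] = best
    return flagged


if __name__ == "__main__":
    candidates = ["GENE_A:p.Arg100His", "GENE_B:p.Gly50Ser", "GENE_C:p.Leu10Pro"]
    for variant, (population, frequency) in screen(candidates, REFERENCE).items():
        print(f"{variant}: seen in {population} at frequency {frequency:.5f}")
```

Taking the maximum across populations rather than the pooled frequency is a deliberate choice: a variant that looks absent in cohorts of largely European ancestry can still be present in another ancestry group, which is the situation described for EFHC1 R221H.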

Ingo Helbig is a child neurologist and epilepsy genetics researcher working at the Children’s Hospital of Philadelphia (CHOP), USA.
