Are you fully covered? My experience with a phenomenon I shall call the exome fallacy began in 2011. While browsing the exomes of a few patients with epileptic encephalopathies, we wanted to have a quick look at whether the exome data would let us exclude mutations in the epilepsy gene SCN1A in our patients. As some of you might already guess, the words “exome” and “exclude” don’t go well together. We learned the hard way that each individual exome covers certain parts of the gene quite well, but if you expect your exome data to have sufficient quality to cover an entire gene in several individuals, you end up disappointed. And there is even more to the exome fallacy than you might think…
Coverage. I will use the term coverage quite often in this post, but what exactly does it mean? The building blocks of exome data are the so-called reads, short sequences determined by the next-generation sequencing (NGS) machine. NGS technologies can produce millions of these reads, which cover various parts of the target sequence. An exome, i.e. a panel of a large number of exons in the human genome, might be such a target. Once the reads leave the sequencer, there is no more wet lab involved until researchers decide to validate findings with PCR. The fitting of all reads to the target sequence (alignment), the identification of genetic variants that differ from the human reference genome (calling) and any further interpretation of these variants are purely computer-based. A particular base pair in the human genome is usually included in many sequence reads. The number of reads stretching across a given base pair is referred to as coverage, a quality criterion for NGS data. While a meaningful interpretation can be achieved with 5x or 10x coverage in some special cases, many studies use a 20x cut-off. For NGS data used for diagnostics, even higher coverage (100x) is sometimes required. Last but not least, coverage and read lengths might differ between NGS platforms, complicating matters even further.
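To make the definition concrete, here is a minimal Python sketch that counts how many reads stretch across each base pair of a small target. It assumes reads are already aligned and are given as half-open (start, end) intervals; the function names and the toy numbers are made up for illustration, and real pipelines compute this from alignment files with dedicated tools.

```python
def per_base_coverage(reads, target_start, target_end):
    """Count the reads spanning each position of the half-open target [start, end)."""
    depth = [0] * (target_end - target_start)
    for read_start, read_end in reads:
        # Clip the read to the target so off-target bases are ignored.
        lo = max(read_start, target_start)
        hi = min(read_end, target_end)
        for pos in range(lo, hi):
            depth[pos - target_start] += 1
    return depth


def fraction_at_or_above(depth, cutoff):
    """Fraction of target positions whose coverage reaches the cut-off."""
    if not depth:
        return 0.0
    return sum(d >= cutoff for d in depth) / len(depth)


# Toy example: three aligned reads over a 10 bp target (hypothetical numbers).
toy_reads = [(0, 5), (2, 8), (4, 10)]
toy_depth = per_base_coverage(toy_reads, 0, 10)
print(toy_depth)                           # [1, 1, 2, 2, 3, 2, 2, 2, 1, 1]
print(fraction_at_or_above(toy_depth, 2))  # 0.6
```

Even in this tiny example the edges of the target are covered less well than the middle, and the share of "sufficiently covered" bases depends entirely on the cut-off you pick.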
The 68% solution. Exome data can turn out to be quite heterogeneous once researchers ask very specific questions. Figure 1 shows such an example, which was presented by Bobby Koeleman from UMC Utrecht at the EuroEPINOMICS meeting of the 1000 exomes. In this case, a family with three affected children was investigated for mutations in a very specific region that had been identified with linkage analysis. We would usually think that we have a good chance of finding the causative mutation with exome data. However, the exomes don’t keep their promise. If we compare the exons in the region of interest, only about two thirds of the targeted base pairs have sufficient coverage in all three patients. This means that roughly every third exonic mutation in this region would simply be missed.
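The drop from "well covered in each patient" to "well covered in all patients" is easy to reproduce with a small Python sketch. The per-base depths below are invented toy values, not the data from Figure 1, and the function name and the 20x default are assumptions chosen for illustration.

```python
def jointly_covered_fraction(depths_by_sample, cutoff=20):
    """Fraction of target positions at or above the cut-off in every sample."""
    samples = list(depths_by_sample.values())
    if not samples:
        return 0.0
    n_positions = len(samples[0])
    if any(len(depth) != n_positions for depth in samples):
        raise ValueError("all samples must cover the same target positions")
    if n_positions == 0:
        return 0.0
    jointly_ok = sum(
        all(depth[i] >= cutoff for depth in samples) for i in range(n_positions)
    )
    return jointly_ok / n_positions


# Hypothetical per-base depths for three siblings over the same six positions.
toy_depths = {
    "patient_1": [30, 25, 22, 8, 40, 21],
    "patient_2": [28, 19, 35, 30, 33, 20],
    "patient_3": [21, 27, 24, 26, 5, 45],
}

for name, depth in toy_depths.items():
    # Each sibling on their own looks fine: 5 of 6 positions reach 20x.
    print(name, sum(d >= 20 for d in depth) / len(depth))

print(jointly_covered_fraction(toy_depths))  # 0.5
```

Because each exome drops out at a different position, the jointly covered fraction is lower than any single patient's fraction, which is exactly what hurts in a family study.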
Exomes are screening tools. This example tells a very clear and simple story. Exomes are tools for screening sequence, not sequencing tools in the conventional sense. Exomes can never exclude mutations in a particular gene unless we have thoroughly checked the coverage across the complete coding region. We simply don’t know about false negatives, as there is no gold standard to compare to. In addition, exomes are rife with false positive findings.
The “Me, Me, Me” genes. Some genes show up in every exome. There are parts of the human genome that are highly polymorphic (i.e. there are many variants), and the genes in these regions tend to show up in places where they might blur the otherwise clear look at potentially causative variants. There are lists of these genes available online, but as a brief guideline it is sufficient if you simply ignore all genes starting with MUC… or USP…
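As a rough sketch of that guideline in Python, the snippet below drops candidate variants whose gene symbol starts with one of the two prefixes. The record layout (a dict with a "gene" key), the function name and the variant labels are assumptions for illustration only; a curated list of frequently hit genes would be the more careful choice, since a prefix filter could also throw away a gene you care about.

```python
NOISY_PREFIXES = ("MUC", "USP")  # the crude rule of thumb, not a curated list


def drop_noisy_genes(variants, prefixes=NOISY_PREFIXES):
    """Keep only variants whose gene symbol does not start with a noisy prefix."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return [v for v in variants if not v["gene"].upper().startswith(prefixes)]


# Hypothetical candidate list; the variant labels are placeholders.
candidates = [
    {"gene": "SCN1A", "variant": "variant_1"},
    {"gene": "MUC16", "variant": "variant_2"},
    {"gene": "USP17L2", "variant": "variant_3"},
    {"gene": "KCNQ2", "variant": "variant_4"},
]

kept = drop_noisy_genes(candidates)
print([v["gene"] for v in kept])  # ['SCN1A', 'KCNQ2']
```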
The unknown unknowns. It wasn’t the intent of this post to discourage exome sequencing; quite the contrary. However, exome data should be handled with caution. Even though some published papers make interpretation of exome data look like a breeze, we deal with many unknowns. Simply imagine that many possible high-impact candidate genes might be hiding out in exome data at 19x coverage while you use a 20x cut-off. We simply don’t know what we don’t know.