Genome meets phenome to find novel recessive diseases

N=1. Even though many recessive disorders have been identified through next-generation sequencing, there is a major conceptual problem when it comes to interpreting the results of these studies. Recessive disorders are very rare, and it is often difficult to assess whether a given variant is truly disease-causing or simply an innocent bystander. A recent study in Nature Genetics introduces a concept for identifying recessive disorders that rise above the overall genomic noise, and it finds four novel recessive disorders. In addition, the authors strengthened their approach with a statistical analysis of disease phenotypes.

Study by Akawi and collaborators. The authors analyzed family exome data in 4,125 families and identified genes with more bi-allelic variants than expected by chance. Bi-allelic variants are variants that are either homozygous or compound heterozygous. For the analysis, the authors focused on loss-of-function mutations. In addition, they analyzed the relatedness of the phenotypes by comparing Human Phenotype Ontology (HPO) terms. This resulted in a combined p-value that was genome-wide significant for KIAA0586 and HACE1 and was suggestive for PRMT7, CSTB, COL9A3, and MMP21. CSTB and COL9A3 are already known disease genes for Unverricht-Lundborg disease and Stickler Syndrome, respectively.

Genomic noise. There are two different ways to look at the results that modern massively parallel sequencing technologies give you. One way to think about next-generation sequencing (NGS) is that these technologies have a decent chance of finding a causative gene if the respective mutation is located in one of the exons that are covered. Therefore, if you have a patient with a presumed recessive disorder, you have a good chance of finding the gene that you are looking for. The other way of looking at NGS is the signal-to-noise problem. We have so many rare variants in our genomes that we will invariably have a problem identifying the causative gene if all we have is genomic information. This is particularly true for novel genes. Finding a homozygous truncation mutation in a promising candidate gene may be suggestive, but we don’t really have a good sense of what we would expect by chance.

Real estate. For de novo dominant disorders, there are clever strategies to compare the number of observed de novo mutations to the number expected by chance and thereby implicate genes statistically. For recessive disorders, this paradigm has not really worked so far, largely because the sample sizes available to date have been too small to power a similar analysis. In the recent publication by Akawi and collaborators, the entire DDD data set comprising 4,125 families was analyzed, and the authors established a paradigm that allowed them to compare the number of expected versus observed families with homozygous or compound heterozygous mutations. Using such a comparison, they were able to analyze recessive disorders statistically, relying on genomic data alone.
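
To make the expected-versus-observed idea concrete, here is a minimal sketch. It assumes random mating, so that the chance of a bi-allelic genotype is roughly the square of the cumulative frequency of rare loss-of-function alleles in a gene, and it uses a Poisson tail to score the excess. The allele frequency, the function names, and the Poisson model are illustrative assumptions and not the authors' actual model, which is not described in detail here.

```python
from math import exp, factorial

def expected_biallelic_families(n_families, lof_allele_freq):
    # Assumption: random mating, so P(bi-allelic) is about q^2 per family.
    return n_families * lof_allele_freq ** 2

def poisson_tail(k_observed, expected):
    # P(X >= k_observed) for X ~ Poisson(expected), computed via the complement.
    below = sum(exp(-expected) * expected ** i / factorial(i)
                for i in range(k_observed))
    return max(0.0, 1.0 - below)

# Hypothetical numbers: 4,125 families and a cumulative LoF allele frequency of 0.2%.
expected = expected_biallelic_families(4125, 0.002)
p_value = poisson_tail(4, expected)
print(f"expected families: {expected:.4f}, P(>= 4 observed): {p_value:.2e}")
```

With these made-up inputs, far less than one family is expected by chance, so seeing several families for the same gene would be very unlikely under the null. Families with related parents would need a higher expectation than the simple q^2 term used here.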

HPO. In contrast to the p-values that we can generate for genetic associations, the inclusion of phenotypic data into genetic studies has long been an issue. To put this more bluntly, phenotypes are our big problem, especially when it comes to neurodevelopmental disorders. In order to overcome this problem, Akawi and collaborators used the Human Phenotype Ontology (HPO), an emerging standard phenotyping classification tool that we had also used in the initial EuroEPINOMICS-RES project; we were involved in developing the epilepsy terminology of this ontology. The HPO is structured as an ontology, which means that phenotypic terms at different levels of precision and detail can still be represented within the same system. A mechanism like this is ideal for human genetics, as it allows us to compare phenotypes without assessing each feature for each patient in the entire cohort. Akawi and collaborators used the similarity between phenotypic terms to develop a method that would allow them to estimate the likelihood of randomly finding two or more patients with the same level of similarity.
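
The idea of scoring phenotypic similarity within an ontology and asking how often random patients look that similar can be sketched as follows. This is a toy illustration under stated assumptions: the term names and the tiny hierarchy are invented and are not real HPO identifiers, and a simple Jaccard overlap of terms plus their ancestors stands in for whatever similarity measure the authors actually used.

```python
import random
from itertools import combinations

# Toy ontology (child -> parents). Names are illustrative, not real HPO terms.
PARENTS = {
    "phenotype": [],
    "brain_abnormality": ["phenotype"],
    "cerebellar_abnormality": ["brain_abnormality"],
    "vermis_hypoplasia": ["cerebellar_abnormality"],
    "corpus_callosum_hypoplasia": ["brain_abnormality"],
    "seizures": ["phenotype"],
    "skeletal_abnormality": ["phenotype"],
}

def with_ancestors(term):
    # Collect the term itself plus every ancestor up to the root.
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def similarity(terms_a, terms_b):
    # Jaccard overlap of the two patients' ancestor-expanded term sets.
    a = set().union(*(with_ancestors(t) for t in terms_a))
    b = set().union(*(with_ancestors(t) for t in terms_b))
    return len(a & b) / len(a | b)

def mean_pairwise_similarity(patients):
    pairs = list(combinations(patients, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def permutation_p_value(group, cohort, n_perm=2000, seed=0):
    # How often is a random group of the same size at least as similar?
    rng = random.Random(seed)
    observed = mean_pairwise_similarity(group)
    hits = sum(
        mean_pairwise_similarity(rng.sample(cohort, len(group))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Hypothetical cohort: each patient is a list of toy phenotype terms.
cohort = [
    ["vermis_hypoplasia"], ["vermis_hypoplasia", "seizures"],
    ["seizures"], ["skeletal_abnormality"],
    ["corpus_callosum_hypoplasia"], ["skeletal_abnormality", "seizures"],
]
print(permutation_p_value(cohort[:2], cohort))
```

Because terms are expanded to their ancestors, a patient annotated with a precise term and one annotated with its broader parent still share part of their profile, which is the property of an ontology that the paragraph above describes.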

Novel genes. Equipped with these two methods and an additional tool to combine p-values, the authors set out to find novel recessive disease genes, mainly focusing on loss-of-function mutations. The authors identified four novel genes and two previously known genes that were either genome-wide significant or showed at least a suggestive level of evidence. The most promising genes were KIAA0586, HACE1, PRMT7, and MMP21. The two known genes were COL9A3, a gene previously implicated in Stickler Syndrome, and CSTB, the gene for Unverricht-Lundborg disease, a progressive myoclonus epilepsy (PME). Recessive mutations in KIAA0586 were identified in 6 families with Joubert Syndrome, a genetic brain malformation that affects the cerebellar vermis. Recessive HACE1 mutations were found in 4 families with a neurodevelopmental condition characterized by brain atrophy and hypoplasia of the corpus callosum. At this point, it is worth noting the difference in phenotypes between the two diseases. While hypoplasia/agenesis of the cerebellar vermis is highly specific, overall atrophy and corpus callosum hypoplasia are relatively non-specific findings that can be found in many genetic pediatric neurological conditions. These examples demonstrate that the phenotype analysis performed by the authors works for both distinct and less specific phenotypic features. For the two families with CSTB mutations, it is worth noting that the known mutations for Unverricht-Lundborg disease are expansions of a dodecamer repeat that would not be found by exome sequencing. The mutations picked up by Akawi and collaborators are loss-of-function mutations, suggesting that there may be other genetic mechanisms by which CSTB can cause neurodevelopmental disorders. There is little detail about the phenotypes in the current publication, and reviewing the phenotypes further would be an interesting project for the future.
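
The step of combining a genotype-based and a phenotype-based p-value into one number can be sketched with Fisher's method. The combination method used in the study is not named here, so Fisher's method is an assumption chosen because it is a common option; it also assumes that the two p-values are independent. The function name and the example p-values are hypothetical.

```python
import math

def fisher_combine(p_values):
    # Fisher's statistic: X = -2 * sum(ln p_i), chi-squared with 2k d.f. under the null.
    # Each p-value must lie in (0, 1].
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    # Closed-form chi-squared survival function for even degrees of freedom (2k).
    return math.exp(-half_x) * sum(half_x ** j / math.factorial(j) for j in range(k))

# Hypothetical per-gene p-values from the genotype and phenotype analyses.
genotype_p = 1e-4
phenotype_p = 0.01
print(f"combined p-value: {fisher_combine([genotype_p, phenotype_p]):.2e}")
```

Two moderately small p-values can combine into a clearly smaller one, which is how a gene with only a few families may still stand out once the phenotypic similarity of those families is taken into account.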

This is what you need to know. The current study by Akawi and collaborators is the first attempt to apply a statistical approach to recessive disorders that helps us distinguish causative mutations from genomic noise. The sample size of more than 4,000 families is sufficiently large to find evidence for six genes, suggesting that we would need similar numbers to identify novel recessive epilepsies. Until then, we need additional genetic evidence such as segregation analysis or functional studies to implicate genes for recessive epilepsies. However, with larger samples, we may be able to find additional genes that we simply don’t see yet.

Ingo Helbig is a child neurologist and epilepsy genetics researcher working at the Children’s Hospital of Philadelphia (CHOP), USA.
