Copy numbers. When we think about genetic causes of neurodevelopmental disorders and the epilepsies, we typically discuss single genes and de novo variants. Over the last few years, exome and genome data of hundreds of thousands of people have been analyzed, creating large-scale resources to understand genetic variation in health and disease. However, there has been one resource that has always been larger by at least an order of magnitude: information on copy number variation derived from SNP arrays and array CGH. Now, a recent publication pulls all the existing information together and performs a meta-analysis of rare copy-number variants in nearly one million people. Here is what this study tells us about neurodevelopmental genes and how we can use mismatches between CNV and exome data to answer old questions and find novel genes.
A brief disclaimer. The publication by Collins and collaborators is so rich and dense that it is virtually impossible to provide a good overview. As with the Kaplanis 2020 publication, it is a “leave the gun, take the cannelloni” publication where you can skip the main text and go straight to the Supplement. However, before jumping into the main topic of this post, here is a brief attempt at summarizing their study.
A CNV map. In brief, Collins and collaborators performed a meta-analysis of existing CNV resources, comparing almost 300K individuals with various disorders and 500K controls. They examined this enormous dataset in a variety of ways. The authors were able to discover hundreds of associations between rare copy number variants (CNVs) and phenotypes, including many new associations. In addition, Collins and collaborators demonstrated that CNVs and exome data largely point towards the same dosage-sensitive genes, i.e. genes where either the lack of one copy (haploinsufficiency) or additional copies (triplosensitivity) cause disease. Finally, their study creates a novel metric, which is the topic of this blog post. This metric, pHaplo, was calculated for all autosomal genes and reflects the probability that a given gene is haploinsufficient. For example, SCN1A, SCN2A, SCN8A, and STXBP1 have a pHaplo of >0.95, indicating that these genes are highly likely to be haploinsufficient.
pLI and DEPDC5. Let’s jump right into the main topic of this post. When we evaluate candidate genes, we often look at the top right corner in the gnomAD browser where the pLI metric is displayed. The pLI measurement reflects the probability of loss-of-function intolerance. For example, for SCN1A, we would expect almost 90 protein-truncating variants (PTVs) in gnomAD based on gene size. However, there are only two PTVs in gnomAD, only 2% of what would be expected. This is an indication that loss-of-function variation is not tolerated in the general population, which allows for the conclusion that such a gene is likely disease-causing. This metric typically works well: almost all epilepsy genes with haploinsufficiency as the known mechanism have a “red dot” in gnomAD, indicating a very high pLI. However, there have always been some strange exceptions. For example, what about DEPDC5?
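The arithmetic behind the SCN1A example can be written out as a small sketch. The function name is made up for illustration, and the expected count of 90 is the rounded figure from the text rather than an exact gnomAD value. The actual pLI comes from a more involved statistical model; this ratio is only the intuition behind it.

```python
def observed_expected_ratio(observed: int, expected: float) -> float:
    """Return the observed/expected ratio of protein-truncating variants.

    A ratio far below 1 means far fewer PTVs are seen in the population
    than gene size alone would predict.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


# SCN1A numbers as quoted in the text: 2 observed PTVs, roughly 90 expected.
scn1a_oe = observed_expected_ratio(observed=2, expected=90)
print(f"SCN1A observed/expected PTV ratio: {scn1a_oe:.1%}")
```

A ratio of about 2% is what puts SCN1A firmly in the loss-of-function-intolerant group.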
DEPDC5. The DEPDC5 gene is one of the most common genes for non-lesional focal epilepsies. However, if we didn’t know about the gene yet, we would be somewhat confused by gnomAD. The common rule that haploinsufficient epilepsy genes have a high pLI does not hold true for DEPDC5. While there are fewer PTVs than expected, the difference is not significant. Plus, the frequency of PTVs in gnomAD would suggest that there are more than 40,000 unaffected individuals in the US with stop, frameshift, or splice variants in DEPDC5. This obviously does not seem right.
Explaining the DEPDC5 mystery. There are several potential explanations for this discrepancy. For example, sub-clinical phenotypes of focal epilepsies may be common and frequently missed – accordingly, these individuals will find themselves in large population databases. Alternatively, the population PTVs reported in gnomAD may be artifacts or the PTVs affect isoforms that are biologically less important (both unlikely as they are spread throughout the gene). A third explanation would simply be statistical noise – gnomAD is not large enough to capture the true gene constraint in DEPDC5 beyond random fluctuation. Either way, this story seems somewhat unfinished.
pHaplo vs pLI. Now let us add in the new metric developed by Collins and collaborators. As a reminder, this gene dosage score was developed based on a resource that is almost 10x larger than gnomAD. DEPDC5 has a pHaplo of 0.927, well above the 0.86 cutoff that Collins and collaborators suggest. This suggests that DEPDC5 is a haploinsufficient gene. Given the repeatedly demonstrated disease-causing role of DEPDC5, this is one example where pHaplo appears to be superior to pLI. When comparing both metrics (Figure 1 inset), you can easily see that the correlation between the two is far from perfect. There is a large number of genes with a high pHaplo and a low pLI, including several known disease genes such as ATM, KCNA1, NKX2-1, and COL4A2. And we can use this mismatch to potentially identify novel conditions that would have been obscured by the gnomAD pLI.
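The idea of a "mismatch gene" can be sketched as a simple filter over a table of per-gene scores. The pHaplo cutoff of 0.86 is the one quoted above. The pLI cutoff of 0.9 is an assumption based on common usage and is not stated in this post. The gene labels and scores are invented placeholders, not values from Collins et al. or gnomAD.

```python
from typing import NamedTuple

PHAPLO_CUTOFF = 0.86  # cutoff quoted in the text
PLI_CUTOFF = 0.9      # assumption: commonly used convention, not from the text


class GeneScores(NamedTuple):
    gene: str
    phaplo: float
    pli: float


def find_mismatch_genes(scores, phaplo_cutoff=PHAPLO_CUTOFF, pli_cutoff=PLI_CUTOFF):
    """Return genes with a high pHaplo but a low pLI."""
    return [
        s.gene
        for s in scores
        if s.phaplo >= phaplo_cutoff and s.pli < pli_cutoff
    ]


# Hypothetical placeholder scores for illustration only.
example_scores = [
    GeneScores("GENE_A", phaplo=0.93, pli=0.10),  # high pHaplo, low pLI: mismatch
    GeneScores("GENE_B", phaplo=0.97, pli=0.99),  # both metrics agree: constrained
    GeneScores("GENE_C", phaplo=0.40, pli=0.05),  # both metrics agree: tolerant
]

mismatch = find_mismatch_genes(example_scores)
print(mismatch)
```

Only genes where the two metrics disagree in this one direction are kept. Genes with a high pLI and a low pHaplo would need a separate filter.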
De novo variants. In the larger panel in the figure above, I have queried the Kaplanis et al. dataset for de novo protein-truncating variants in these mismatch genes, i.e. genes with a low pLI but a high pHaplo. Again, these genes would not have been considered good candidates a week ago, but could now be considered given the new information on gene dosage provided by Collins and collaborators. There are several genes with more than three de novo PTVs. For example, LRP1B, SALL3, and GAK are expressed in the central nervous system (CNS). Given the fact that there are several individuals reported with de novo protein-truncating variants in these genes, it might be worthwhile trying to understand whether these conditions represent novel neurodevelopmental disorders. The most common gene on this list, KIDINS220, is a known cause of a complex phenotype characterized by spastic paraplegia, intellectual disability, nystagmus, and obesity. The pHaplo analysis provides additional evidence and certainty towards the role of KIDINS220 in human disease.
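The query described here amounts to counting de novo PTVs per gene within the mismatch set and keeping the genes with more than three. Below is a minimal sketch of that logic, assuming the de novo calls are available as a flat list of gene symbols, one entry per variant. The function name, the gene labels, and the counts are hypothetical and do not reflect the Kaplanis et al. data.

```python
from collections import Counter


def recurrent_denovo_genes(denovo_ptv_genes, mismatch_genes, min_count=4):
    """Count de novo PTVs per mismatch gene and keep recurrent ones.

    min_count=4 corresponds to "more than three" de novo PTVs.
    The result is ordered from most to least frequent.
    """
    counts = Counter(g for g in denovo_ptv_genes if g in mismatch_genes)
    return {gene: n for gene, n in counts.most_common() if n >= min_count}


# Hypothetical input: one gene symbol per observed de novo PTV.
example_calls = ["GENE_A"] * 5 + ["GENE_D"] * 4 + ["GENE_E"] * 2 + ["GENE_B"] * 6
example_mismatch = {"GENE_A", "GENE_D", "GENE_E"}  # GENE_B is not a mismatch gene

recurrent = recurrent_denovo_genes(example_calls, example_mismatch)
print(recurrent)
```

A raw count like this ignores gene size and mutation rate, so it is a starting point for follow-up rather than statistical evidence on its own.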
What you need to know. The publication by Collins and collaborators provides a much-needed resource on gene dosage based on the information available from human copy number variants (CNVs) in almost one million individuals. In addition to providing the definitive, authoritative overview of the CNV landscape in human disease and showing that a GWAS approach based on rare CNVs can supplement association testing, the authors have developed novel ways to assess each human gene for haploinsufficiency. I jumped straight into the supplement of this paper, and in this blog post, I compared the pHaplo metric developed by Collins and collaborators to the gnomAD pLI. Analyzing “mismatch genes” identified several known disease genes such as DEPDC5 and several new candidate genes.