Decoding rare disease through 77,000 genomes

Genome sequencing. Despite continual progress in understanding the genetic etiology of human disease, more than half of rare disorders remain unsolved. Resolving the remaining etiologies in rare disease is a major focus of ongoing efforts in the field, including a shift towards standardized analysis of genome sequencing data from large patient cohorts. In a recent study, Greene and collaborators aimed to identify associations between genes and rare disease subgroups, leveraging the genomes of 77,539 people, including 29,741 probands. Here is a brief review of their publication in the context of etiological resolution in rare disease.

Figure 1. Overview of specific disease phenotypes within the Neurology and Neurodevelopmental disorders rare disease class included in Greene et al. 2023. Genetic associations identified in the study for neurodevelopmental disorders are shown to the right, each with a posterior probability of association (the probability that the SNV is truly associated with a phenotype) of >0.95. Of the identified genetic etiologies for neurodevelopmental disorders, only genes with pLI > 0.90 are shown in the graph (recreated and modified from the Supplementary data in Greene et al. 2023).

Discovery rate. In their study, Greene and collaborators analyzed the genomes of 77,539 people, focusing on SNVs and indels in coding regions of the genome. They identified 260 associations with a high probability of true association with a rare disease phenotype, of which 241 had been previously published. The study included 269 rare disease classes, with Neurological disorders unsurprisingly representing the largest subgroup. This disease class was further stratified into numerous case sets, including Neurodevelopmental disorders, Inherited Epilepsy Syndromes, and Motor Disorders of the CNS. To give an idea of the neurogenetics population represented in the study, 5,529 individuals were grouped under the specific phenotype of intellectual disability within Neurodevelopmental disorders, a group that is likely highly heterogeneous, and there were almost three times more people with hereditary ataxia than people with epileptic encephalopathies. What did the study find?

Disease associations. Forty-nine percent of associations within Neurological disorders were with the Neurodevelopmental disorders subgroup and included common genes such as ARID1B, ANKRD11, and DDX3X, in addition to genes commonly implicated in genetic epilepsies such as SCN2A and SCN8A (Figure 1). Only three genes were implicated in the broader Inherited Epilepsy Syndromes disease class: SCN1A, CHD2, and DEPDC5. Of the 19 previously unrecognized associations identified in the study, only three involved disease classes within Neurological disorders, namely LRRC7 in intellectual disability, ARPC3 in Charcot-Marie-Tooth disease, and RAB3A in hereditary ataxia. Interestingly, WWOX, a gene implicated in developmental and epileptic encephalopathy (DEE) and autosomal recessive spinocerebellar ataxia (SCAR12), was associated with gastrointestinal disorders. Greene and collaborators prioritized three of the 19 etiologies for validation based on evidence from population genetics metrics, co-segregation of variants, and manual review of the literature for a biological link to the underlying disease pathophysiology. This led them to uncover novel associations implicating loss-of-function variants in ERG in primary lymphoedema, PMEPA1 in thoracic aneurysm disease, and GPR156 in a recessive congenital hearing impairment.

Validation bottleneck. The remaining associations were not validated, including the association between intellectual disability and LRRC7. The gene LRRC7 encodes a protein in postsynaptic densities in the brain, and three nonsynonymous de novo SNVs in LRRC7 absent from population databases had previously been reported in a study of >30,000 individuals with neurodevelopmental disorders. Nevertheless, for all identified candidate genes, establishing gene-disease validity under a formal, unbiased framework, such as that of the Clinical Genome Resource (ClinGen), will remain critical and will entail downstream efforts to support an association with a disease phenotype. However, the true promise and strength of this study in the context of etiological discovery seem to lie in the development of a database for large-scale genomic data, which the authors termed the “Rareservoir.”

Relational databases. Our lab typically emphasizes the concept of the phenotypic bottleneck, reflecting our ability to handle and analyze large-scale genomic data with relative ease in contrast to large-scale phenotype data. However, the analysis of data from whole genome sequencing (WGS) of tens of thousands of people poses challenges, especially as the genotypes are stored in files that are terabytes in size, roughly the combined storage of more than 300 computers with 32 GB each. To overcome this challenge, Greene and collaborators developed a relational database only 5.5 GB in size, a reduced, compressed representation of genotypes corresponding only to rare variants. The rationale behind their approach is that, given that the minor allele frequency of rare variants with a large effect size is typically <0.1% due to negative selection, analysis of this end of the genetic corridor, comprising just 1% of all genotypes, should enable etiological discovery. In brief, Greene and collaborators provide a framework that accelerates analysis of large-scale genomic data by selecting for rare variants.
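To make the idea of a rare-variant relational database more concrete, here is a minimal sketch in Python using the built-in sqlite3 module. It is not the Rareservoir schema or the authors' code. The table names, column names, gene labels, sample IDs, and allele frequencies are all made up for illustration, and the only element taken from the text above is the 0.1% minor allele frequency cutoff. The sketch shows how discarding common variants before loading keeps the genotype table small while still allowing gene-level queries.

```python
import sqlite3

# 0.1% cutoff for rare variants, as mentioned in the text above.
MAF_THRESHOLD = 0.001


def build_rare_variant_db(variants, genotypes, maf_threshold=MAF_THRESHOLD):
    """Load only rare variants and their non-reference genotypes.

    variants:  iterable of (variant_id, gene, maf) tuples  (hypothetical layout)
    genotypes: iterable of (sample_id, variant_id, alt_allele_count) tuples
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE variant (
            variant_id INTEGER PRIMARY KEY,
            gene       TEXT NOT NULL,
            maf        REAL NOT NULL
        );
        CREATE TABLE genotype (
            sample_id        TEXT    NOT NULL,
            variant_id       INTEGER NOT NULL REFERENCES variant(variant_id),
            alt_allele_count INTEGER NOT NULL
        );
        """
    )
    # Keep only variants below the frequency cutoff.
    rare = [v for v in variants if v[2] < maf_threshold]
    rare_ids = {v[0] for v in rare}
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?)", rare)
    # Keep only genotypes that carry at least one alternate allele
    # of a rare variant; everything else is dropped.
    kept = [g for g in genotypes if g[1] in rare_ids and g[2] > 0]
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?)", kept)
    conn.commit()
    return conn


def carriers_of_gene(conn, gene):
    """Return sorted sample IDs carrying any rare variant in the given gene."""
    rows = conn.execute(
        """
        SELECT DISTINCT g.sample_id
        FROM genotype AS g
        JOIN variant AS v ON v.variant_id = g.variant_id
        WHERE v.gene = ?
        ORDER BY g.sample_id
        """,
        (gene,),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up toy data: GENE_A and GENE_B are placeholders, not real findings.
variants = [(1, "GENE_A", 0.0002), (2, "GENE_A", 0.15), (3, "GENE_B", 0.0005)]
genotypes = [
    ("S1", 1, 1),  # rare variant, kept
    ("S1", 2, 2),  # common variant, dropped
    ("S2", 2, 1),  # common variant, dropped
    ("S2", 1, 0),  # homozygous reference, dropped
    ("S3", 3, 1),  # rare variant, kept
]
conn = build_rare_variant_db(variants, genotypes)
print(carriers_of_gene(conn, "GENE_A"))  # ['S1']
```

In this toy example, five genotype records shrink to two once common variants and reference calls are removed, which is the same kind of reduction, on a vastly smaller scale, that the text describes for rare variants in a large cohort.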

Genomes. WGS provides coverage across the entire genome, and while it is recognized that up to 10% of coding regions are not sufficiently covered through exome sequencing, WGS potentially leads to only a marginal increase in diagnostic yield over exomes when investigating exonic regions. The analysis of non-coding and regulatory regions that affect gene expression may facilitate further etiological resolution. However, this comes with its own challenges and was not a focus of their study. Furthermore, while Greene and collaborators focused on monogenic models, it is possible that the genetic architecture of a subset of rare diseases has polygenic contributions, which may explain the phenotypic heterogeneity in clinical presentations within disorders caused by variants in the same gene. These approaches remain promising avenues for future studies and rely on computational analysis of large-scale genomic data.

This is what you need to know. Whole genome sequencing is emerging as a standard technology for investigating the etiology of rare disease in both the diagnostic and research contexts. However, the analysis of large-scale genomic data requires a standardized and computable framework. In a recent study, Greene and collaborators developed a novel database to facilitate the analysis of rare variants from >70,000 genomes. Their advance enabled the discovery of novel disease associations across rare diseases, providing insight into the biological mechanisms of these disorders and avenues for developing new therapies and precision medicine approaches for affected individuals.

Julie Xian is a Data Scientist in the Helbig Lab at Children’s Hospital of Philadelphia (CHOP).