Narrowing the phenotype gap through vector embedding

Sparse data. Matching the growing body of genomic datasets with their associated clinical data is difficult for a variety of reasons. Most importantly, while genomic data are standardized and can be generated at scale, clinical data are often unstructured and sparse, making it difficult to represent a phenotype fully in any abbreviated format. In prior blog posts, we have frequently discussed the Human Phenotype Ontology (HPO), a standardized dictionary to which all phenotypic features can be mapped and linked. But these data also quickly become large, and the question of how best to handle them remains open. In a recent publication, we translated more than 53M patient notes into HPO terms and explored the utility of vector embedding, a method that currently forms the basis of many AI-based applications. Here is a brief summary of how these technologies can help us better understand phenotypes. Continue reading
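To illustrate the general idea behind vector embedding of phenotypes, here is a minimal sketch: each HPO term is mapped to a numeric vector, a patient is represented as the average of their term vectors, and patients can then be compared by cosine similarity. The HPO IDs are real, but the three-dimensional vectors are made up for illustration; real embedding models learn hundreds of dimensions from term co-occurrence, and this is not the method from the publication itself.

```python
from math import sqrt

# Toy embeddings for a few HPO terms. These 3-dimensional vectors are
# hypothetical illustrations only; learned embeddings are much larger.
EMBEDDINGS = {
    "HP:0001250": [0.9, 0.1, 0.0],   # Seizure
    "HP:0002373": [0.8, 0.2, 0.1],   # Febrile seizure
    "HP:0001263": [0.1, 0.9, 0.2],   # Global developmental delay
}

def patient_vector(terms):
    """Represent a patient as the mean of their HPO term embeddings."""
    vecs = [EMBEDDINGS[t] for t in terms]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    """Cosine similarity between two patient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two patients with overlapping seizure phenotypes end up closer in the
# embedding space than a patient with a purely developmental profile.
p1 = patient_vector(["HP:0001250", "HP:0002373"])
p2 = patient_vector(["HP:0001263"])
print(round(cosine(p1, p1), 2))  # a profile compared to itself -> 1.0
print(cosine(p1, p2))
```

The appeal of this representation is that it turns an unstructured list of clinical terms into a fixed-length vector, which is exactly the format that downstream AI methods expect.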

Phenotypes are like water – Rare Disease Day 2023

Phases. Today is Rare Disease Day. I would like to use this opportunity to explain some of the phenotype science that is critical for rare diseases. In contrast to common disorders, rare diseases face an unusual challenge. Once identified, the overall rareness of these conditions poses the question of where phenotypes begin and where they end. For rare genetic disorders, is the phenotype of the first individual identified with a rare disease characteristic, or is there a larger spectrum that we should be aware of? Enter the various approaches to phenotype science that aim to decipher the full depth of clinical features associated with rare diseases. To understand the various approaches to rare disease phenotypes, I would like to suggest a somewhat unusual analogy: phenotypes are like water.

Continue reading

Claude Shannon and the U-shaped Information Content of developmental phenotypes

Spüre die Welt. This is the second post in our “phenotypic atomism” series, in which we try to explain how we can gauge the amount of information that phenotypes provide. However, let me start by going very far back. As a graduation gift, my high school teachers gave me a book that set me on the path of becoming a neuroscientist – The User Illusion by Tor Nørretranders, a book that has a more poetic title in its German translation (“Perceive the world”). This book examined the inner workings of human consciousness and explored how our human brains process information. Now, more than 20 years later, I am encountering the idea of measuring information again when trying to understand what phenotypic information is meaningful and how we can assess this. This is a blog post on how we can describe the value of phenotypic information, the importance of time, and how we slowly chip away at the mystery of developmental phenotypes. To put it differently: “Show me the longitudinal information content (IC) for absence seizures – it is going to be U-shaped and you have 60 min.” Continue reading
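The Shannon information content mentioned above has a simple definition: the rarer a finding is in a cohort, the more information it carries when present. A minimal sketch, with hypothetical annotation frequencies that stand in for real cohort data:

```python
from math import log2

# Hypothetical frequencies: the fraction of individuals in a cohort
# annotated with each HPO term (illustrative numbers, not real data)
freq = {
    "HP:0001250": 0.60,  # Seizure - common in an epilepsy cohort, low IC
    "HP:0002121": 0.05,  # Absence seizure - rarer, therefore higher IC
}

def information_content(p):
    """Shannon information content in bits: IC = -log2(p)."""
    return -log2(p)

for term, p in freq.items():
    print(term, round(information_content(p), 2))
```

The “U-shaped” longitudinal IC in the post title follows from the same formula applied at each age: a phenotype such as absence seizures is rare in infancy and in adulthood (high IC at both ends) but common in mid-childhood (low IC in the middle).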

Phenotypic atomism – understanding outcomes by rethinking clinical information

Natural History. Over the last few years, there has been a renewed interest in outcomes and natural history studies in genetic epilepsies. If one of the main goals of epilepsy genetics is to improve the lives of individuals with epilepsy by identifying and targeting underlying genetic etiologies, it is critically important to have a clear idea of how we define and measure the symptoms and outcomes that characterize each disorder over a lifetime. However, our detection of underlying genomic alterations far outpaces what we know about clinical features in most conditions – outcomes such as seizure remission or presence of intellectual disability are not easily accessible for large groups of individuals with rare diseases. In this blog post, I try to address the phenotypic bottleneck from a slightly different angle, focusing on how we think about phenotypes in the first place. Continue reading

SCN2A – a neurodevelopmental disorder digitized through 10,860 phenotypic annotations

HPO. SCN2A-related disorders represent one of the most common causes of neurodevelopmental disorders and developmental and epileptic encephalopathies (DEE). However, while a genetic diagnosis is easily made through high-throughput genetic testing, SCN2A-related disorders have such a broad phenotypic range that understanding the full scale of the clinical features has traditionally been difficult. In our recent study, we used a harmonized framework for phenotypes based on the Human Phenotype Ontology (HPO) to systematically curate phenotypic annotations in all individuals reported in the literature and followed at our center, a total of 413 unrelated individuals. Mapping phenotypic data onto 10,860 annotations spanning 562 unique concepts and applying some of the computational tools we have developed over the last three years, we were able to delineate the phenotypic range in unprecedented detail. SCN2A is now the first DEE with all available data systematically curated and harmonized in a computable format, allowing for entirely novel insights. Continue reading
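One reason the number of annotations (10,860) far exceeds the number of unique concepts (562) is that HPO annotations are typically propagated up the ontology: an individual annotated with a specific term implicitly carries all of its broader ancestor terms, which makes individuals with different levels of detail comparable. A minimal sketch of this propagation step, using a tiny hypothetical slice of the HPO parent structure rather than the real ontology:

```python
# Hypothetical parent links (a tiny slice of HPO, for illustration only)
PARENTS = {
    "HP:0002373": ["HP:0001250"],   # Febrile seizure -> Seizure
    "HP:0001250": ["HP:0012638"],   # Seizure -> Abnormal nervous system physiology
    "HP:0012638": [],
}

def with_ancestors(terms):
    """Return the annotated terms plus all of their ontology ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, []))
    return seen

# An individual annotated only with "Febrile seizure" also counts as
# having "Seizure" after propagation - this is what makes cohort-wide
# frequency counts in a computable format possible.
print(sorted(with_ancestors({"HP:0002373"})))
```
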

Entering the phenotype era – HPO-based similarity, big data, and the genetic epilepsies

Semantic similarity. The phenotype era in the epilepsies has now officially started. While it is possible for us to generate and analyze genetic data in the epilepsies at scale, phenotyping typically remains a manual, non-scalable task. This contrast has resulted in a significant imbalance where it is often easier to obtain genomic data than clinical data. However, it is often not the lack of clinical data that causes this problem, but our limited ability to handle it. Clinical data are often unstructured, incomplete, and multi-dimensional, making them difficult to analyze meaningfully. Today, our publication on analyzing more than 31,000 phenotypic terms in 846 patient-parent trios with developmental and epileptic encephalopathies (DEE) appeared online. We developed a range of new concepts and techniques to analyze phenotypic information at scale, identified previously unknown patterns, and were bold enough to challenge the prevailing paradigms on how statistical evidence for disease causation is generated. Continue reading
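One widely used flavor of HPO-based semantic similarity is the Resnik measure: the similarity of two terms is the information content of their most informative common ancestor. A minimal sketch under stated assumptions: the mini-ontology and term frequencies below are hypothetical illustrations, not real HPO data, and this is only one of several similarity measures rather than the specific method of the publication.

```python
from math import log2

# Hypothetical mini-ontology: term -> set of direct parents
PARENTS = {
    "HP:0002373": {"HP:0001250"},   # Febrile seizure -> Seizure
    "HP:0002121": {"HP:0001250"},   # Absence seizure -> Seizure
    "HP:0001250": {"HP:0000118"},   # Seizure -> Phenotypic abnormality
    "HP:0000118": set(),            # root-level term
}

# Hypothetical annotation frequencies in a cohort (illustrative only)
FREQ = {"HP:0002373": 0.10, "HP:0002121": 0.05,
        "HP:0001250": 0.60, "HP:0000118": 1.00}

def ancestors(term):
    """A term together with all of its ancestors."""
    out = {term}
    for parent in PARENTS[term]:
        out |= ancestors(parent)
    return out

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(-log2(FREQ[c]) for c in common)

# Two seizure subtypes share the informative ancestor "Seizure", so their
# similarity is the IC of that term; the root term contributes nothing.
print(round(resnik("HP:0002373", "HP:0002121"), 2))
```

Scores like these can then be aggregated across all term pairs of two individuals, which is what makes patient-to-patient phenotypic similarity computable at cohort scale.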

Big data, ontologies, and the phenotypic bottleneck in epilepsy research

Unconnected data. Within the field of biomedicine, large datasets are increasingly emerging. These include the genomic, imaging, and EEG datasets that we are somewhat familiar with, but also many large unstructured datasets, including data from biomonitors, wearables, and electronic medical records (EMRs). The abundance of these datasets makes the promise of precision medicine tangible: individualized treatment that is based on data, synthesizing available information across various domains for medical decision-making. In a recent review in the New England Journal of Medicine, Haendel and collaborators discuss the biomedical field's need to focus on developing terminologies and ontologies, such as the Human Phenotype Ontology (HPO), that help put data into context. This review is a perfect segue to introduce our group's increasing focus on computational phenotypes, aimed at overcoming the phenotypic bottleneck in epilepsy genetics. Continue reading