Big data, ontologies, and the phenotypic bottleneck in epilepsy research

Unconnected data. Large datasets are increasingly emerging within biomedicine. These include the genomic, imaging, and EEG datasets that we are somewhat familiar with, but also many large unstructured datasets, including data from biomonitors, wearables, and electronic medical records (EMR). The abundance of these datasets appears to make the promise of precision medicine tangible: an individualized treatment that is based on data and synthesizes available information across various domains for medical decision-making. In a recent review in the New England Journal of Medicine, Haendel and collaborators discuss the need in the biomedical field to focus on the development of terminologies and ontologies, such as the Human Phenotype Ontology (HPO), that help put data into context. This review is a perfect segue to introduce the increasing focus on computational phenotypes within our group, which aims to overcome the phenotypic bottleneck in epilepsy genetics.

Infographics. One reason why I picked Humboldt for this blog post was his uncanny sense for seemingly modern-looking infographics. This chart portrays all the major mountains and elevations on earth. Humboldt created this infographic (or “plate” as it was called back then) by sorting mountains by geographical latitude and coloring them by continent. The lower chart does the same for Europe, detailing some of the smaller mountains. If you’re looking for Mount Everest, you won’t find it on this chart. Mount Everest, reaching 29,029 feet (8,848 meters) above sea level, was only discovered and accurately measured in 1856. The highest mountain on this chart is Dhaulagiri, the seventh highest mountain in the world at 26,795 feet (8,167 meters) above sea level, and the highest mountain lying entirely within Nepal. [figure modified from Plate 6 in “Atlas zu Alex. v. Humboldt’s Kosmos in zweiundvierzig Tafeln mit erläuterndem Texte” by Traugott Brumme, 1851, in the public domain]

Background. I have been meaning to write a blog post on data science ever since our lab moved into the Roberts Center for Pediatric Research at CHOP. My first attempt to put my thoughts together was entitled “the natural science of data exploration” and was firmly rejected by our internal editorial board as being too contorted and abstract (I completely agree in retrospect). It now lives amongst the many other blog posts that I had written, but never published. Humboldt’s chart on the elevations around the globe is the only segment of this unpublished post that made it into this final version, an example of how data visualization allows us to connect dots and express concepts that pure numbers never could – which brings me to the review by Haendel, Chute, and Robinson in the New England Journal of Medicine.

The other data. When we look at the recent breakthroughs in gene discovery, novel findings were made possible through large datasets, such as the discovery of the DNM1 or CACNA1E gene in the epilepsy field. However, it is not the data per se that allowed for these discoveries, but the fact that genetic information could be transformed into digital data and that this information was both accessible and interpretable. While this may seem trivial with respect to genetic data, accessibility and interpretability become a major issue when we analyze datasets that lack the clearly defined structure of genomic data, where every base pair has its defined place. As Haendel and collaborators put it: “Data without interpretation are facts without understanding”. Data from sources such as clinical notes, laboratory tests, medication, survey instruments, and medical devices may be available, but without a “backbone” to interpret these data, they remain meaningless. Standardized terminologies and ontologies can provide such a backbone.

Ontologies. The review by Haendel, Chute, and Robinson puts a main focus on ontologies such as the Human Phenotype Ontology (HPO). Ever since we contributed to the first generation of HPO terms for seizures during the early stages of EuroEPINOMICS, working with this type of data has been one of our ambitions – and the review by Haendel and collaborators now provides an in-depth description of what ontologies are. Basically, in order to be used meaningfully, standardized data requires two things: structure and semantics. Structure is often the easy part. For example, when we collect clinical information about a new genetic epilepsy syndrome in an Excel table asking for seizure types and the presence or absence of intellectual disability and movement disorders, we provide a structure for a dataset. Semantics, however, is more complex, as it refers to meaning and the relationships between meanings. This is typically something that requires our expertise and pattern recognition. For example, if I have three sub-groups of patients, group A with absence seizures, group B with myoclonic seizures, and group C with a hyperkinetic movement disorder, most of us would automatically conclude that A and B are more closely related than A and C or B and C. This is because we recognize that both myoclonic seizures and absence seizures are generalized seizures and therefore belong to the same higher-level concept. However, this categorization, which is obvious to us, is lost on a computer or algorithm when we do not provide information on how these concepts are connected. Ontologies provide such a connection and allow us to perform “computable semantics” by telling us how the various data elements are connected.
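To make the idea of “computable semantics” a bit more concrete, here is a minimal sketch in Python of the three-group example above. The hierarchy is a toy that I made up for illustration; the term names and their arrangement are assumptions and do not mirror the actual HPO structure or identifiers. The function `term_distance` is also hypothetical: it simply counts the steps between two terms via their closest shared parent concept.

```python
# Toy "is-a" hierarchy: child term -> parent term.
# Illustrative only; this is NOT the real HPO tree.
PARENT = {
    "absence seizure": "generalized seizure",
    "myoclonic seizure": "generalized seizure",
    "generalized seizure": "seizure",
    "seizure": "neurological abnormality",
    "hyperkinetic movement disorder": "movement disorder",
    "movement disorder": "neurological abnormality",
}


def path_to_root(term):
    """Return the term followed by all of its ancestors, up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def term_distance(a, b):
    """Count the edges between a and b via their closest common ancestor."""
    path_b = path_to_root(b)
    steps_in_b = {term: steps for steps, term in enumerate(path_b)}
    for steps_in_a, term in enumerate(path_to_root(a)):
        if term in steps_in_b:
            return steps_in_a + steps_in_b[term]
    raise ValueError("terms share no common ancestor")


group_a = "absence seizure"
group_b = "myoclonic seizure"
group_c = "hyperkinetic movement disorder"

print(term_distance(group_a, group_b))  # A vs. B
print(term_distance(group_a, group_c))  # A vs. C
print(term_distance(group_b, group_c))  # B vs. C
```

In this toy hierarchy, A and B meet at “generalized seizure” after one step each, whereas either of them only meets C at the root, so the algorithm arrives at the same ordering that we would reach intuitively.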

The Human Phenotype Ontology. The Human Phenotype Ontology (HPO) currently has more than 13,500 terms for phenotypic features and is the emerging framework for working with computable phenotypic data. We have used HPO in EuroEPINOMICS, and this terminology is also used by many genomic diagnostic laboratories. In addition to allowing us to describe phenotypic features at different levels of detail (“seizure”, “generalized seizure”, “absence seizure”), it provides us with a hierarchy of these terms, e.g. the term for absence seizure is a child term of generalized seizure. This allows us to walk up and down the hierarchy and to connect every term in one patient with every term in another. We can now ask how far apart two phenotypes are, ideally translating our clinical sense of phenotypic relatedness or distance into a computable format. In 2012, we presented a poster at the ECE in London that used our EuroEPINOMICS data for such an analysis, showing that a group of 8 patients with self-resolving infantile spasms can be identified from a group of 171 patients with developmental and epileptic encephalopathies based on a similarity analysis that uses HPO terms (link). More recently, HPO data derived from EMR data has been used to identify patients with rare monogenic diseases in a large patient cohort at Vanderbilt.
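As a rough illustration of how walking up the hierarchy can turn into a patient-to-patient similarity, here is a small Python sketch. It is not the method used in our poster or in the Vanderbilt study. It assumes a made-up miniature hierarchy (the term names are illustrative, not real HPO entries) and uses a plain Jaccard index on each patient's terms plus all their ancestors, which is just one simple option among many. The names `PARENTS`, `with_ancestors`, and `patient_similarity` are hypothetical.

```python
# Toy hierarchy: child term -> list of parent terms.
# A list is used because a term in an ontology may have several parents.
# Illustrative only; this is NOT the real HPO.
PARENTS = {
    "absence seizure": ["generalized seizure"],
    "myoclonic seizure": ["generalized seizure"],
    "generalized seizure": ["seizure"],
    "focal seizure": ["seizure"],
    "seizure": [],
}


def with_ancestors(terms):
    """Return the given terms together with every ancestor term."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS.get(term, []))
    return closed


def patient_similarity(terms_a, terms_b):
    """Jaccard index of two patients' ancestor-expanded term sets."""
    a = with_ancestors(terms_a)
    b = with_ancestors(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


patient_1 = {"absence seizure"}
patient_2 = {"myoclonic seizure"}
patient_3 = {"focal seizure"}

print(patient_similarity(patient_1, patient_2))
print(patient_similarity(patient_1, patient_3))
```

Without the expansion step, patients 1 and 2 would share no terms at all. Once the parent terms are added, they overlap on “generalized seizure” and “seizure”, while patients 1 and 3 only overlap on “seizure”, so the first pair scores higher.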

Our own work. As a child neurologist, I am intrigued by the clinical presentations of patients with genetic epilepsies. While we can perform exome and genome sequencing on an industrial scale, our ability to derive phenotypic data in a format that can be used for large-scale analyses is often limited to manual chart review – this is the phenotypic bottleneck. Developing strategies to overcome the phenotypic bottleneck in childhood epilepsies by working with computational phenotypes has become a major focus of our group at CHOP. Currently, we connect genomic data with information extracted from electronic health records and develop methods for integrating epilepsy-specific HPO data into genetic studies, translating large datasets into a standardized language that we can use for our computational efforts. The aim of this work is to better understand how genes and clinical presentations are related and to identify patterns that we would not appreciate with manual analysis, which will help us identify disease mechanisms and develop new therapeutic avenues.

Ingo Helbig is a child neurologist and epilepsy genetics researcher working at the Children’s Hospital of Philadelphia (CHOP), USA.