Narrowing the phenotype gap through vector embedding

Sparse data. Matching the growing body of genomic datasets with associated clinical data is difficult for a variety of reasons. Most importantly, while genomic data are standardized and can be generated at scale, clinical data are often unstructured and sparse, making it difficult to represent a phenotype fully in any abbreviated format. In our prior blog posts, we have frequently discussed the Human Phenotype Ontology (HPO), a standardized dictionary to which all phenotypic features can be mapped and linked. But these data also quickly become large, and the question of how best to handle them remains. In a recent publication, we translated more than 53 million patient notes using HPO and explored the utility of vector embedding, a method that currently forms the basis of many AI-based applications. Here is a brief summary of how these technologies can help us better understand phenotypes.

Figure 1. More than 53 million patient notes were translated to HPO using the cTAKES algorithm, creating more than 15,000 distinct phenotypic categories. In the publication by Daniali and collaborators, we examined whether we can borrow so-called vector embedding techniques to make these data easier to work with. We mapped the 15,000 categories into a 128-dimensional vector space, which preserved the essential relationships between phenotypic terms related to neurological disorders. These relationships were in line with the scoring of 1,300 comparisons by domain experts.

HPO analysis. Our traditional approach to handling phenotype data is to map them to landscapes and to consider phenotypic features, including severity and age, as distinct building blocks of an individual’s natural history. This framework, which we internally refer to as “phenotypic atomism”, has been the concept through which we try to make sense of the complex clinical landscape that we encounter in individuals with neurodevelopmental disorders (NDD). Data harmonization, i.e. making sure that all clinical terms have the same standard format, is a major step in making clinical information accessible. In this framework, we typically pay particular attention to making sure that the data remain close to the language we use clinically.

Representation learning. While building phenotype maps and landscapes is intuitive, there are also major downsides to it. First, it is slow and computationally intensive. Second, it doesn’t actually simplify the datasets, which makes them more difficult to handle. For example, if we start out with more than 2,000 phenotypic categories, we might end up having more rather than fewer categories after harmonizing our data. Other fields of data science deal with much more complex datasets where the number of categories is even higher. Accordingly, data scientists have put quite some thought into strategies to map data to simpler formats. One of these strategies is called vector embedding. Large amounts of clinical data that are typically binary (a feature is either present or absent) are mapped to a vector, basically a list of numbers that can take on a range of values.
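To illustrate why binary data are hard to compare directly, here is a minimal Python sketch. It is not taken from the publication: the three terms are just examples, and the helper names one_hot and dot are made up for illustration. When each term is its own yes/no category, every pair of distinct terms looks equally unrelated.

```python
# Illustrative sketch only: three example terms, each encoded as its own
# binary (one-hot) category. Helper names are hypothetical.

def one_hot(index, size):
    """Return a binary vector with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def dot(u, v):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(u, v))

terms = ["absence seizure", "myoclonic seizure", "ataxia"]
binary = {term: one_hot(i, len(terms)) for i, term in enumerate(terms)}

# Both overlaps are 0: the binary encoding carries no notion that two
# seizure types are closer to each other than to ataxia.
overlap_seizures = dot(binary["absence seizure"], binary["myoclonic seizure"])
overlap_ataxia = dot(binary["absence seizure"], binary["ataxia"])
print(overlap_seizures, overlap_ataxia)
```

A dense embedding replaces these 0/1 lists with shorter lists of real numbers, so that related terms can end up with similar values.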

Why vectors? You might ask yourself what all this discussion about vectors might be good for. In brief, it’s easier to do math with them, which is needed for many applications afterwards. Here is an example from an even more complex field: linguistics. If we want to figure out how closely two sentences are related in meaning, we can do this mentally. However, a computer program will have trouble finding the same relation. This task becomes easier if we embed data in vectors. There are a host of methods to calculate distance or similarity for vectors, and the question about related meaning basically becomes a mathematical equation. Think of vector embedding for phenotypes in the same way that computer algorithms handle complex language analysis. Representing terms as vectors reduces dimensionality and therefore makes comparison between terms easier.
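One common way to turn "how related are these two terms?" into a calculation is cosine similarity, which measures the angle between two vectors. The sketch below is an illustration under stated assumptions, not the method used in the publication: the three-dimensional vectors are invented toy values (real embeddings here have 128 dimensions), and the function name is our own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy values invented for illustration; they are NOT real HPO embeddings.
toy_embeddings = {
    "absence seizure": [0.9, 0.8, 0.1],
    "myoclonic seizure": [0.8, 0.9, 0.2],
    "ataxia": [0.1, 0.2, 0.9],
}

sim_seizures = cosine_similarity(
    toy_embeddings["absence seizure"], toy_embeddings["myoclonic seizure"]
)
sim_ataxia = cosine_similarity(
    toy_embeddings["absence seizure"], toy_embeddings["ataxia"]
)

# With these toy numbers, the two seizure terms score higher than
# the seizure/ataxia pair.
print(sim_seizures > sim_ataxia)
```

Other distance measures, such as Euclidean distance, would work in the same way; cosine similarity is simply a frequent choice for embeddings.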

EMR. In our publication by Daniali and collaborators in Artificial Intelligence in Medicine, we explored the use of vector embedding techniques. We translated more than 53 million patient notes to HPO using the cTAKES software, identifying more than 15,000 distinct phenotypic categories. We then embedded these phenotypes in a vector with 128 dimensions, basically cutting down the dimensionality of the dataset by a factor of more than one hundred (15,000 divided by 128 is roughly 117). This first step provided meaningful results. As seen in Figure 1, closely related neurological terms were close together. This means that our HPO dataset could actually be represented in a much less complex “space” than the 15,000 categories that we initially had. When taking into account the actual frequencies of the phenotypes, the mapping became even more accurate.

Litmus test. At this point, we had transformed the entire HPO and frequencies based on 53 million patient notes into a vector, an abstract mathematical construct. But how useful is this and how much was lost? We tried a simple experiment to find out how much of the relationship had been preserved. We provided domain experts with a phenotypic term A and two additional terms B and C. We then asked them to assess whether A is more closely related to B or to C. This gave us a baseline on how experts assess relatedness between phenotypes. For example, most experts would judge that “absence seizures” is more closely related to “myoclonic seizures” than to “ataxia”. We did this experiment for 1,300 combinations of terms and then assessed the same relationships in the vector embeddings. In brief, were the vectors as good as our human domain experts?
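The A-versus-B-or-C comparison can be scored mechanically once every term has a vector. The following Python sketch shows one way this might look; it assumes cosine similarity as the relatedness measure and uses invented toy vectors and toy judgments, so the function names, numbers, and choice of similarity measure are all assumptions rather than the procedure from the publication.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def embedding_choice(embeddings, a, b, c):
    """Return whichever of b or c is more similar to a in the embedding."""
    sim_b = cosine_similarity(embeddings[a], embeddings[b])
    sim_c = cosine_similarity(embeddings[a], embeddings[c])
    return b if sim_b >= sim_c else c

def agreement(embeddings, judgments):
    """Fraction of (A, B, C, expert_choice) items where the embedding agrees."""
    hits = sum(
        1 for a, b, c, expert in judgments
        if embedding_choice(embeddings, a, b, c) == expert
    )
    return hits / len(judgments)

# Toy vectors and toy judgments, invented for illustration only.
toy_embeddings = {
    "absence seizure": [0.9, 0.8, 0.1],
    "myoclonic seizure": [0.8, 0.9, 0.2],
    "ataxia": [0.1, 0.2, 0.9],
}
toy_judgments = [
    ("absence seizure", "myoclonic seizure", "ataxia", "myoclonic seizure"),
    ("myoclonic seizure", "ataxia", "absence seizure", "absence seizure"),
]

print(agreement(toy_embeddings, toy_judgments))
```

The same agreement score can be computed for several embedding methods side by side, which makes it straightforward to compare them against the expert baseline.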

Results. We found that the embedding method that incorporates term frequencies fared the best and was the closest to what our human domain experts provided as a gold standard, reaching an accuracy that was nearly identical to the rater assessments. This indicates that we can use these methods reliably for handling large clinical datasets, drastically cutting down on the dimensionality while maintaining the critical relationships within the dataset. This provides strong support for the notion that future analysis of complex phenotypic data can not only be accomplished using sophisticated data harmonization but can also take advantage of some of the more recent developments in data science.

What you need to know. Clinical datasets are incredibly complex, and methods to help handle and analyze these datasets are urgently needed. Borrowing techniques from other fields of data science, we explored the utility of vector embedding in our publication by Daniali and collaborators. We found that we can accurately map more than 15,000 HPO terms into a vector of 128 numbers and that adding information about the frequency of the clinical terms improves the accuracy of the mapping. Techniques such as vector embedding will play an increasingly important role when handling large clinical datasets in the future.

Ingo Helbig is a child neurologist and epilepsy genetics researcher working at the Children’s Hospital of Philadelphia (CHOP), USA.