Narrowing the phenotype gap through vector embedding

Sparse data. Matching the growing body of genomic datasets with associated clinical data is difficult for a variety of reasons. Most importantly, while genomic data are standardized and can be generated at scale, clinical data are often unstructured and sparse, making it difficult to represent a phenotype fully in any abbreviated format. In our prior blog posts, we have frequently discussed the Human Phenotype Ontology (HPO), a standardized dictionary to which all phenotypic features can be mapped and linked. But these data also quickly become large, and the question of how best to handle them remains. In a recent publication, we translated more than 53M patient notes using HPO and explored the utility of vector embedding, a method that currently forms the basis of many AI-based applications. Here is a brief summary of how these technologies can help us better understand phenotypes.

Entering the phenotype era – HPO-based similarity, big data, and the genetic epilepsies

Semantic similarity. The phenotype era in the epilepsies has now officially started. While it is possible for us to generate and analyze genetic data in the epilepsies at scale, phenotyping typically remains a manual, non-scalable task. This contrast has resulted in a significant imbalance where it is often easier to obtain genomic data than clinical data. However, it is often not the lack of clinical data that causes this problem, but our limited ability to handle them. Clinical data are often unstructured, incomplete, and multi-dimensional, which makes it difficult to analyze this information meaningfully. Today, our publication analyzing more than 31,000 phenotypic terms in 846 patient-parent trios with developmental and epileptic encephalopathies (DEE) appeared online. We developed a range of new concepts and techniques to analyze phenotypic information at scale, identified previously unknown patterns, and were bold enough to challenge the prevailing paradigms on how statistical evidence for disease causation is generated.