Big data, ontologies, and the phenotypic bottleneck in epilepsy research

Unconnected data. Within biomedicine, large datasets are emerging at an increasing pace. These include the genomic, imaging, and EEG datasets that we are somewhat familiar with, but also many large unstructured datasets, including data from biomonitors, wearables, and the electronic medical record (EMR). The abundance of these datasets appears to make the promise of precision medicine tangible – individualized treatment that is based on data, synthesizing available information across various domains for medical decision-making. In a recent review in the New England Journal of Medicine, Haendel and collaborators discuss the need in the biomedical field to focus on the development of terminologies and ontologies, such as the Human Phenotype Ontology (HPO), that help put data into context. This review is a perfect segue to introduce our group's increasing focus on computational phenotypes as a way to overcome the phenotypic bottleneck in epilepsy genetics. A toy sketch of what this can look like in practice follows below. Continue reading
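As a toy illustration of what a computable phenotype can look like, the sketch below hard-codes a small, simplified slice of HPO-style parent relationships and propagates patient annotations up to ancestor terms, so that patients coded at different levels of detail become comparable. The term IDs and parent links are illustrative assumptions for this example, not taken from an actual ontology release.

```python
# Toy parent map: a simplified, illustrative slice of HPO-style structure.
# Term IDs and parent links are assumptions for illustration only.
PARENTS = {
    "HP:0002069": {"HP:0001250"},   # (illustrative) generalized tonic-clonic seizure -> Seizure
    "HP:0001250": {"HP:0000118"},   # (illustrative) Seizure -> Phenotypic abnormality
}

def ancestors(term):
    """Return the term plus all of its ancestors in the toy ontology."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Two patients annotated at different levels of detail...
patient_a = {"HP:0002069"}   # coded with a specific seizure type
patient_b = {"HP:0001250"}   # coded only with the broad "Seizure" term

# ...become comparable once annotations are propagated to ancestor terms.
profile_a = set().union(*(ancestors(t) for t in patient_a))
profile_b = set().union(*(ancestors(t) for t in patient_b))
print(profile_a & profile_b)  # shared context of the two phenotypes
```

The point is not the particular terms, but that an ontology turns free-text clinical descriptions into structured annotations that can be compared, aggregated, and analyzed alongside genomic data.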

Launching the Epilepsy Genetics Initiative – Go EGI!

Launch. This week, the Epilepsy Genetics Initiative (EGI) was launched. EGI was founded by Citizens United for Research in Epilepsy (CURE) and represents a large database for diagnostic and research exomes that will guarantee regular re-analysis of exome data, which is particularly relevant for the large number of exomes that we currently think are negative. Here is a brief blog post on why all exomes should eventually find their way into EGI. Continue reading

Navigating the epilepsiome – live from Tübingen

2D. I am writing this post during our EuroEPINOMICS meeting in Tübingen, listening to presentations from CoGIE, the EuroEPINOMICS project working on IGE/GGE and Rolandic Epilepsies, and RES, the project on rare epilepsies. At some point during the afternoon, I made my selection for the best graph of today's presentations – an overview of the conservation space of epilepsy genes. Continue reading

To do: read ENCODE papers

ENCODE will change the way we analyse genomes. The comparison of long non-coding RNAs and transcription factor binding sites will require more CPU time. Anything else? I don't know – I am only writing this because Ingo asked me to. It will take time to study the 30+ papers, sift through the data, and discuss them with colleagues. Only then can something like the understanding we hear so much about emerge, and I am sure it will in journal clubs around the globe in the coming weeks. But smaller things might already be interesting.

Continue reading

Big data now, scientific revolutions later

Sequence databases are not the only repositories seeing exponential growth. The internet helps companies collect information on an unprecedented scale, which has spurred the development of new software solutions. "Big data" is the term that stuck, and it has breathed new life into data analysis. Widespread coverage ensued, including a series of blog posts published by the New York Times. Data produced by sequencing is big: current hard drives are too slow for raw data acquisition in modern sequencers, and we have to ship the disks because we lack the bandwidth to transmit the data via the internet. But we process the data only once, and a couple of years from now they can be reproduced with ease.
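To put the bandwidth argument into rough numbers, here is a back-of-the-envelope sketch. The run size, uplink speed, and shipping time below are assumptions chosen for illustration, not measured values.

```python
# Back-of-the-envelope comparison: network transfer vs. shipping a disk.
# All numbers are illustrative assumptions, not measurements.

raw_run_tb = 1.0        # assumed size of one raw sequencing run, in terabytes
uplink_mbit_s = 100     # assumed sustained upload bandwidth, in Mbit/s
shipping_hours = 24     # assumed overnight courier

run_bits = raw_run_tb * 8e12                               # terabytes -> bits
transfer_hours = run_bits / (uplink_mbit_s * 1e6) / 3600   # seconds -> hours

print(f"Network transfer: ~{transfer_hours:.0f} hours")    # ~22 hours for 1 TB at 100 Mbit/s
print(f"Courier:          ~{shipping_hours} hours")
```

The courier takes roughly a day no matter how many disks fit in the box, while network transfer time grows linearly with data volume – which is why shipping disks still wins for raw sequencing output.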

Large-scale data collection is once again hailed as the next big thing and spiced with calls for a revolution in science. In 2008, Wired even announced the end of theory. The last time I checked, though, experimental scientists still make good use of hypotheses and targeted experiments under the scientific method. A TEDMED12 presentation by Atul Butte, bioinformatician at Stanford, is symptomatic in its revolutionary language and caused concern for Florian Markowetz, bioinformatician at the Cancer Center in Cambridge, UK (and a Facebook friend of mine). Florian complains and explains that quantitative changes in the data do not lead to a new quality of science, and he calls for better theories and model development. He is right, although the issue of data acquisition and source material would have deserved more attention (but what can you expect from a mathematician).

Big data. The part of the data we care about in biology is quite moderate, but note that the computing resources of the BGI are in the same league as those of the Large Hadron Collider.

We don't know what to expect from, say, exome sequencing for a particular disease, and the only way to find out is to do the experiment, look at the data, come up with guesstimates, and confirm your findings in the next round. Current data gathering and analysis projects in the life sciences won't be classified as big data by the next wave of scientists anyway. They are mere community technology exploration projects using ad hoc solutions.