Predictions. December 17th is the day that I consider my anniversary in epilepsy genetics. Exactly eight years ago, I was still a student in my final med school year and went to Australia for a job interview. We took a road trip over the weekend and on the evening of the 17th, I was reading Nigel Tan’s review on epilepsy genes, aptly entitled The truth is out there, while sitting in a rock pool at Red Johanna Beach, a surfing beach on the Great Ocean Road south of Melbourne. Looking back, I think this was one of the few publications that helped me make sense of all the literature on epilepsy association studies. While shivering in the waters of the Bass Strait, which are always a little bit too cool, I thought that I would like to be able to write something like this. Today, sitting in the cozy warmth of our apartment in Kiel, I have finished reading Nate Silver’s book The Signal and the Noise, a book about making sense of data and predictions. Eight years later, are we any closer to the truth that is out there?
Signal and noise. If you wonder who actually won the 2012 presidential election in the US, it’s not Obama but a statistician named Nate Silver, who runs a blog called FiveThirtyEight, named after the 538 votes in the electoral college. Nate Silver became something of a celebrity after the election because he had predicted an Obama victory through a systematic analysis of all available polls. His Bayesian framework took into account the past accuracy of each polling firm, the size of the poll and the past polling data of the particular state. Before he turned to predicting US elections, Nate Silver worked on a statistical framework for baseball and had a brief career as an online poker player, which, as I learned this week, involves luck and some sophisticated statistical thinking. His book The Signal and the Noise is about the history of prediction and how we handle predictions today. What about our predictions in epilepsy genetics?
The truth is out there. Making predictions about the course of epilepsy genetics brings me back to Nigel Tan’s 2004 review. Nigel basically made the case that despite more than 50 association studies, no convincing gene for complex epilepsies had been found. He and his coauthors highlighted several reasons, including the small sample size of all studies and evidence of several biases. In retrospect, knowing what we know about association studies in 2012, sample size has been the main limiting factor. The human genome has turned out to be much more variable than we had initially thought. In 2012, merely uttering the phrase association study evokes the knee-jerk reaction of mentioning the p = 10^-8 significance level. Did you ever wonder why this is the case?
The prior. At least part of the reason might be that we are overconfident in our estimates regarding candidate genes in general. When we designate a candidate gene, we assign a certain level of confidence to it. In Bayesian terms, this is called a “prior”. And geneticists, climate researchers, stock market brokers, political commentators and members of many other professions systematically tend to be overconfident in their predictions. What we have learned is that the “identity” of a candidate gene makes us overestimate its role. Just because GABRG2 causes a familial epilepsy syndrome, there is no good reason to believe that we can also detect it as a risk gene for epilepsies in general with the methods we have at hand. It might make sense conceptually, but the signal of the candidate drowns in the genomic noise of all other genes. There is so much variability on the genomic level that the little benefit this gene gets from being the candidate gene is virtually annihilated. Therefore, we need to adjust our prior, also taking into account all the odd artefacts that modern technology can give us.
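To make the role of the prior concrete, here is a minimal sketch of Bayes' rule applied to a "significant" association. All numbers are hypothetical and chosen for illustration only: the two priors, the assumed power of 0.8 and the function name are assumptions, not values from any study.

```python
def posterior_true_association(prior, alpha, power):
    """Probability that a significant result reflects a true association.

    prior: our belief, before the study, that the gene is a real risk gene
    alpha: significance threshold (chance of a false positive)
    power: chance of detecting the association if it is real
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical values, for illustration only.
POWER = 0.8
overconfident_prior = 0.1   # "this candidate gene must be involved"
sceptical_prior = 1e-4      # the candidate is just one gene among thousands

# Overconfident prior, lenient threshold: the finding looks believable.
p_overconfident = posterior_true_association(overconfident_prior, 0.05, POWER)

# Sceptical prior, same lenient threshold: almost certainly a false positive.
p_sceptical = posterior_true_association(sceptical_prior, 0.05, POWER)

# Sceptical prior, stringent threshold: the finding becomes credible again.
p_stringent = posterior_true_association(sceptical_prior, 1e-8, POWER)

print(f"prior 0.1,  alpha 0.05: {p_overconfident:.3f}")
print(f"prior 1e-4, alpha 0.05: {p_sceptical:.4f}")
print(f"prior 1e-4, alpha 1e-8: {p_stringent:.4f}")
```

Under these assumed numbers, a realistic prior combined with a lenient threshold leaves almost no chance that the finding is real, which is one way to read the insistence on very small p-values.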
OMICS unfolding. I like to tell the joke that OMICS basically means “every artefact that you could possibly imagine”. When we deal with large-scale data, results will always be significant, but in most cases they will be significant for the wrong reason. I was once fooled by gene expression data of a gene called SMTB1, a gene that showed highly significant differences in gene expression in two cohorts of patients versus controls. However, I realised a little too late that this gene was upregulated in one cohort while being downregulated in the other cohort. Basically, both cohorts cancelled each other out. It was a lack of imagination on my part. I simply had no idea that things like this could exist, an unknown unknown. There is no good way to prepare for unknown unknowns; in a research context you simply stumble onto them. For example, the fact that de novo mutations are frequent events and that you cannot distinguish patients from controls based on the de novo mutation load alone has been a strange surprise. Previously, we would have assumed that a de novo stop mutation in any given gene is strong evidence for a role of this gene in the patient’s disease. 2012 has taught us that this is not necessarily true, and this is something that we didn’t have a concept for.
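The expression artefact described above can be caught with a simple direction-of-effect check before two cohorts are counted as a replication. The following is a minimal sketch; the cohort names and log2 fold changes are hypothetical and do not come from the original data.

```python
from statistics import mean


def same_direction(effects):
    """Return True if every cohort's effect has the same sign."""
    return all(e > 0 for e in effects) or all(e < 0 for e in effects)


# Hypothetical log2 fold changes (patients versus controls), one per cohort.
# Each could be highly significant on its own.
cohort_effects = {"cohort_A": 1.2, "cohort_B": -1.1}

effects = list(cohort_effects.values())
replicated = same_direction(effects)   # opposite signs, so not a replication
pooled_effect = mean(effects)          # close to zero: the cohorts cancel out

print(f"same direction in all cohorts: {replicated}")
print(f"pooled effect: {pooled_effect:.2f}")
```

Two significant p-values say nothing about direction, so the sign of the effect has to be compared explicitly.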
Predictions. This post should remind us to occasionally check how good we were in the past at predicting the outcome of a study or an experiment, and where our predictions were completely off the mark. For example, a few years ago, researchers strongly believed that exome sequencing would reliably identify the cause of disease in virtually every patient with epilepsy. The method was new, expensive and exciting, and it covered much more of the genome than anything before. In 2012, however, we realise that this analysis might not be that easy. The percentage of patients with various neurodevelopmental disorders explained through exome sequencing is 5-10%, and there is little reason to believe that we might have a much higher success rate in many forms of epilepsy. What has happened? Basically, in our initial assessment, we failed to account for the fact that the variability of the genome drowns much of the possible signal from causative genes. We can only see these genes if they pop up repeatedly or if they are genes that we already know. Everything else might be right before our eyes, but we cannot see it. One lesson from this is that we should be much more careful in estimating the power of future novel technologies. Additional large-scale data in and of itself is useless unless we have a very robust framework to interpret it.