ENCODE will change the way we analyse genomes. The comparison of long non-coding RNAs and transcription factor binding sites will require more CPU time. Anything else? I don't know; I am only writing this because Ingo asked me to. It'll take time to study the 30+ papers, sift through the data and discuss it with colleagues. Only then can something like the understanding we hear so much about emerge, and I am sure it will, in journal clubs around the globe over the next weeks. But smaller things might already be interesting.
A literature collection for epilepsy genetics
We have set up a library in Mendeley to collect relevant papers for EuroEPINOMICS. Keeping up to date feels like a menial task, yet no simple solution delivers good results. You can't read all the tables of contents by mail (or RSS), and publishers have a tendency to spam those e-mails, but most likely you're subscribed to the essentials anyway. The more anxious among us employ automated search services at the NCBI or use the good old PubCrawler to follow their field of research, a particular gene, or the output of their nemeses or mentors.
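For the scripting-inclined, the automated searches at the NCBI can also be run from a few lines of code against the E-utilities. Below is a minimal sketch using Biopython's Entrez module; the query term, date range and e-mail address are placeholders, not our actual saved search.

```python
# Minimal sketch of an automated PubMed search via the NCBI E-utilities,
# using Biopython's Entrez module. Query, dates and e-mail are placeholders.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

query = '("epilepsy"[Title/Abstract]) AND ("genetics"[Title/Abstract])'

# Find PubMed IDs for articles published in the given window.
handle = Entrez.esearch(db="pubmed", term=query, retmax=50,
                        datetype="pdat",
                        mindate="2012/07/01", maxdate="2012/07/31")
record = Entrez.read(handle)
handle.close()

ids = record["IdList"]
if ids:
    # Fetch short summaries (title, journal, date) for the hits.
    handle = Entrez.esummary(db="pubmed", id=",".join(ids))
    summaries = Entrez.read(handle)
    handle.close()
    for doc in summaries:
        print(doc["PubDate"], doc["Source"], "-", doc["Title"])
```

Wrapped in a weekly cron job and pointed at a shared library, something like this covers most of what the commercial alerting services do.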
Continue reading
Will the relevant SNPs please stand up
The flood of variants. Every re-sequencing of a genome leads to many more variants than can be validated with functional assays. Many strategies exist to select candidate variants. Filtering on hard criteria might remove all variants, so efforts focus on re-ranking the list so that the most promising variants appear on top. A recent review in Nature Reviews Genetics wants to give users a hand with the bioinformatics tools available. As a bioinformatician, I find a number of important points missing.
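The difference between hard filtering and re-ranking is easy to illustrate with a toy example. In the sketch below the field names, thresholds and score weights are invented for illustration and are not taken from the review; the point is simply that strict filters may leave nothing, while a combined score keeps every variant but pushes the most promising ones to the top.

```python
# Toy illustration of hard filtering versus re-ranking of variants.
# Field names, thresholds and weights are invented for illustration only.
variants = [
    {"id": "var1", "maf": 0.0001, "damaging_score": 0.95, "conserved": True},
    {"id": "var2", "maf": 0.02,   "damaging_score": 0.80, "conserved": True},
    {"id": "var3", "maf": 0.0005, "damaging_score": 0.40, "conserved": False},
]

# Hard filtering: every criterion must hold; strict thresholds may leave nothing.
filtered = [v for v in variants
            if v["maf"] < 0.001 and v["damaging_score"] > 0.9 and v["conserved"]]

# Re-ranking: combine the same criteria into a score so the most promising
# variants rise to the top instead of being discarded outright.
def score(v):
    rarity = 1.0 - min(v["maf"] / 0.01, 1.0)   # rarer is better
    return 0.5 * v["damaging_score"] + 0.3 * rarity + 0.2 * v["conserved"]

ranked = sorted(variants, key=score, reverse=True)

print("filtered:", [v["id"] for v in filtered])
print("ranked:  ", [v["id"] for v in ranked])
```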
Conferences on Twitter
Bioinformatics at large. The ISMB conference is a big event, and summarizing seven parallel sessions requires channels beyond physical presence. Luckily, there is the Internet. A sufficient number of scientists report on the current sessions via social media tools like Twitter. In previous years, the conference was covered on FriendFeed, but the slow demise of that platform made this no longer possible. Continue reading
Predicting the effects of mutations

From left to right: Sean Mooney, Rachel Karchin, Andre Frank, Shamil Sunyaev, Emidio Capriotti.
How well can we predict the effects of mutations that change the protein sequence? Framed by the ISMB conference, the largest bioinformatics conference, the SNP special interest group met on Saturday the 14th, 2012, moderated by Emidio Capriotti and Yana Bromberg, to discuss the current state of the art. Here's a summary with links and a set of tools to try if you study variants.
Gene birth in yeast and human
The completion of the human genome came with few surprises, it seems in retrospect. One was the observation that the human genome apparently contained far fewer genes than expected. I never understood the fuss that was made about the total gene number, as referring to an absolute number of genes is a gross simplification of the facts. How many genes we have depends heavily on what we call a gene and how we identify genes in the human genome. When talking about the total number of genes, how can you leave out all the problems of probabilistic algorithms for gene identification and the difficulties of proving gene expression and coding potential experimentally? Continue reading
Old friends
Functional interactions between two genes can be predicted from their conserved proximity in the genomes of distant species. The observation can be used to build large-scale networks for bacterial species, e.g. in the STRING database, but there is little evidence for such conservation in larger eukaryotes such as animals. Metazoan gene order is scrambled after short periods of evolutionary time, and few interactions can be found beyond the conserved Hox gene clusters.
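To give a flavour of how such predictions work, here is a toy sketch that counts in how many species two genes remain immediate neighbours and compares that with a rough expectation under random gene order. The gene orders and names are made up; a real analysis needs orthologue mapping across assemblies and proper significance testing.

```python
# Toy sketch: score conserved adjacency of gene pairs across species.
# Gene orders are invented; this only illustrates the idea, not the real pipeline.
from collections import Counter

# One chromosome per species; orthologue names are shared across species.
genomes = {
    "species_A": ["g1", "g2", "g3", "g4", "g5"],
    "species_B": ["g5", "g1", "g2", "g4", "g3"],
    "species_C": ["g2", "g1", "g4", "g3", "g5"],
}

def adjacent_pairs(order):
    """Unordered pairs of immediate neighbours along a chromosome."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

counts = Counter()
for order in genomes.values():
    counts.update(adjacent_pairs(order))

n_species = len(genomes)
n_genes = 5
# Chance that two given genes are neighbours in a random linear gene order.
p_adjacent = 2.0 / n_genes

for pair, seen in counts.most_common():
    expected = p_adjacent * n_species
    if seen > expected:
        print(sorted(pair), f"adjacent in {seen}/{n_species} species "
                            f"(expected ~{expected:.1f})")
```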
Gene-gene pairs in metazoan genomes. Irimia et al. now predict 600 gene-gene interactions in human, and more in other species, from an analysis of conservation across 17 metazoan genomes, and demonstrate their validity with a variety of large-scale experiments. In brief, some gene pairs are more conserved than expected, suggesting a functional relationship. Not all gene pairs are adjacent – longer-range interactions are also studied. It's funny to read such a seemingly simple analysis in 2012, as so many people will have tried similar lines of research after the observations by Adachi and Lieber on bidirectional promoters, i.e. promoters that affect gene expression in both the upstream and downstream direction. The small number of available metazoan genomes might have been a cause for the late discovery. Or am I expecting science to move too fast?

Adachi and Lieber found that bidirectional gene pairs are conserved in higher eukaryotes and suggested the now accepted explanation that a single promoter drives the expression of both genes.
Location, location, location. The number of new interactions identified by Irimia et al. is small, but the experimental data lined up supposedly point towards a high proportion of true positives. The identified genes might not be of direct interest to epilepsy genetics, as they are primarily involved in basic cellular functions. But the observation that conservation is strong for a few gene pairs hopefully offers a glimpse of what shapes the genetic architecture, suggesting that other neighbouring genes in humans might have positional effects. A recent publication by Campbell et al. provides an interesting example for epilepsy research and suggests cis-regulatory effects between epilepsy genes in the chromosomal region 9q34, including STXBP1 and SPTAN1. I wonder what role non-coding RNAs play in the cases presented by Irimia et al., a question the paper does not touch upon.
Big data now, scientific revolutions later
Sequence databases are not the only repositories that see exponential growth. The internet helps companies collect information at unprecedented scale, which has spurred the development of new software solutions. "Big data" is the term that stuck and has breathed new life into data analysis. Widespread coverage ensued, including a series of blog posts published by the New York Times. Data produced by sequencing is big: current hard drives are too slow for raw data acquisition in modern sequencers, and we have to ship the disks because we lack the bandwidth to transmit the data via the internet. But we process the data only once, and a couple of years from now they can be reproduced with ease.
Large-scale data collection is once again hailed as the next big thing and spiced with calls for a revolution in science. In 2008, Wired even announced the end of theory. Experimental scientists still make good use of hypotheses and targeted experiments under the scientific method, the last time I checked. A TEDMED12 presentation by Atul Butte, bioinformatician at Stanford, is symptomatic in its revolutionary language and caused concern for Florian Markowetz, bioinformatician at the Cancer Center in Cambridge, UK (and a Facebook friend of mine). Florian complains and explains that the quantitative change in the data does not lead to a new quality of science, and calls for better theories and model development. He's right, although the issue of data acquisition and source material would have deserved more attention (what can you expect from a mathematician).

The part of the data we care about in biology is quite moderate, but note that the computing resources of the BGI are in the same league as those of the Large Hadron Collider.
We don't know what to expect from, e.g., exome sequencing for a particular disease, and the only way to find out is to do the experiment, look at the data, come up with guesstimates and confirm your findings in the next round. Current data gathering and analysis projects in the life sciences won't be classified as big data by the next wave of scientists anyway. They are mere community technology exploration projects using ad hoc solutions.
No use in studying gene-gene and gene-environment effects in complex diseases?
Genome-wide association studies (GWAS) have improved our insight into the genetics of complex diseases but have fallen short of initial expectations, leaving the majority of the heritability unexplained. Interactions of genes with the environment and with each other receive a fair share of the blame for the lack of progress despite the widespread efforts. The large number of possible interactions, however, still limits progress in this field. A dedicated and growing group of computer scientists and geneticists now studies gene-gene effects in the hope of shedding light on complex diseases. Initial results were hopeful, even in the field of epilepsy genetics.
Now, a group of Harvard-based biostatisticians has presented simulations for breast cancer, type 2 diabetes and rheumatoid arthritis that include gene-gene and gene-environment effects. Their interpretation makes for bleak reading: little predictive power is gained by including the additional dependencies, which means that all the CPU time currently consumed for their analysis is only warming the planet and the hearts of computer scientists.
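To get an intuition for this kind of result, one can simulate a disease with a genuine gene-environment interaction baked in and check how much discrimination the interaction term adds to a risk model. The sketch below is my own toy example with made-up effect sizes and sample sizes, not the Aschard et al. setup; in toy runs like this the gain in AUC from the interaction term is typically small.

```python
# Toy simulation: how much predictive power does an interaction term add?
# Effect sizes and sample sizes are invented; this is not the published model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20000

# Two risk SNPs (allele counts 0/1/2) and one binary environmental exposure.
g1 = rng.binomial(2, 0.3, n)
g2 = rng.binomial(2, 0.3, n)
env = rng.binomial(1, 0.4, n)

# True model: modest main effects plus a gene-environment interaction.
logit = -2.0 + 0.25 * g1 + 0.25 * g2 + 0.4 * env + 0.5 * g1 * env
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_main = np.column_stack([g1, g2, env])
X_full = np.column_stack([g1, g2, env, g1 * env])

for name, X in [("main effects only", X_main), ("with interaction", X_full)]:
    model = LogisticRegression().fit(X, y)
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```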

The large number of cases of diabetes and many other widespread complex diseases is not easily explained. The Aschard study suggests that this will remain so for the immediate future despite the progress in sequencing technology.
Negative predictions from experts about their own domain usually provoke a backlash. The study could probably be attacked on the grounds that the authors selected a large number of parameters, some probably from little more than thin air. But the geneticists on Twitter remained silent. Is this acceptance already? Maybe the critics still lie exhausted from attacking Vogelstein's negative predictions of a couple of months ago.
If the statistical model and parameter choices find widespread acceptance, it would mean that it is virtually impossible to explain many complex diseases to a sufficient degree from genetics alone. As individual studies of the interaction of two SNPs are difficult enough, many cases of complex disease will remain unexplained. Despite all the efforts, we would be left almost as much in the dark as before we had high-throughput sequencing facilities.
The river of frequent rare variants
The flow of exome sequencing papers amounts to a small river. In the most recent work of general interest, researchers captained by Bamshad and Akey from the University of Washington sifted through 63.4 terabases of exome sequence for rare mutations. Next to whole-genome sequencing, the current output will feel like a trickle, but their census yields about 300 mutations per genome that current methods for function prediction consider important, as well as a strong dependence on ancestry. Nature News has a fairly dry summary of the results.

By David Stanley on Flickr under a CC-BY licence
Everyone's genome is awash with rare variants. The fact alone won't surprise anyone in the field, but it will hopefully drown out the claims that these variants are mere problems of current sequencing technology, such as statistical artefacts.
This plethora of rare variants will make the interpretation of results from exome sequencing studies challenging, but it also indicates that large consortia like EuroEPINOMICS are necessary to navigate the stream.