De novo. Three months ago, I performed an exome de novo analysis in a patient-parent trio. From my iPad, in a hotel room in Paris. When I got home a few days later, I was excited to tell my students that the analysis worked. They looked at me slightly confused: “What’s the big deal? We had the analysis complete already a week or so ago.” Last year at this time, I was proud that our lab had established a fully functional de novo analysis pipeline. Suddenly, it’s not a big deal anymore. What happened? Let me tell you about Varbank.
Varbank. Varbank is the brainchild of Holger Thiele at the Cologne Center for Genomics (CCG). It’s an exome database that can be accessed through a web browser and offers basic annotation and filtering functionality. For de novo trio studies, Varbank uses the same algorithm that we also used in the RES consortium: DeNovoGear, the program developed at the Sanger Institute. Therefore, getting used to the Varbank functionality wasn’t all that complicated for us. Varbank is accessible on request for researchers in the field for a limited number of exomes sequenced at CCG and at other centers. It’s a good tool to host exomes for a small to medium-size research group like our group in Kiel. Nevertheless, it took us some time to fully embrace Varbank. Why did it take us so long?
1. Hubris. Well, when you first send off your research exomes, you feel that you almost hold the answer in your hands – this wonderful new technology will surely discover the genetic cause of the disease in the patient that you have included in your study. In most cases, your genotyping center already provides you with basic annotation that you can filter down using available software on your desktop computer such as Excel – the desktopization of genomic data. Just use the exome and get the answer. Afterwards, you don’t need it anymore… The truth is that things are not that simple. In less than 30% of exomes, you will get a definite answer with regard to the causative gene. You may want to look at the data again, explore the coverage in critical target regions, and tell true rare variants from in-house artefacts. Basically, in more than 70% of exomes, you need a second look. This is where an accessible database comes in handy.
2. Inexperience. The second reason why it took us so long to use Varbank was our own inexperience. Even when you realize that you need to have a “second look”, you might still think that you can do it yourself. A combination of variant files and Excel tables will do, right? Well, the truth probably is that these projects will get pushed further down on your to-do list and you end up never looking at the dataset again. There is just this one more formatting step that I would do if I had time, but there is always something else coming up.
3. Being competitive. Why work with others when you can have all the glory for yourself? You will find the gene, publish it, and it will have your name attached to it. No need for sharing, no need for involving others. Make sure that you don’t tell anybody the name of “your novel gene”; they might take it from you and publish before you. Not that we ever had this mentality, but you sometimes feel hints of this “competitive feeling” coming up here and there. The truth is that you probably won’t be able to make sense of exome data by yourself anymore. Additional cases with mutations in the same gene will be needed to demonstrate that the gene is causal, even in clear-cut familial cases. One-family-one-gene studies are increasingly difficult to sell and might actually miss the point. Accumulating and assessing available data might be the better option in the long run, a practice that we have implemented in the RES consortium and which is also used in many other research networks.
My Varbank wishlist. Using Varbank makes your luggage lighter – you don’t need to carry around multiple hard drives with exome data anymore; you have access to your data “in the cloud”. Here is my wishlist for additional features for the future. This is just something to think about for any exome database that’s out there.
More functional annotation. For some datasets, we actually exported Varbank data and ran it through a standard ANNOVAR pipeline again. Not that Varbank is deficient in any way, but additional functional annotation, such as discrete SIFT and PolyPhen categories (“pathogenic”), is sometimes a bit more helpful for us.
Automated trio analysis with recessive/compounds. Press the button and you get the 1-2 de novo mutations, 4-5 genes with homozygous mutations, and 6-8 genes with compound heterozygous mutations. Currently, it is still quite labor-intensive to identify the genes with compound heterozygous mutations. As we will increasingly turn to trio analysis, this might be worth considering.
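To make the wish a bit more concrete, here is a minimal sketch of what such a "press the button" trio classification could look like. It is not how Varbank or DeNovoGear work internally (DeNovoGear uses a probabilistic model); it is a naive genotype-based filter. The `Variant` record, the `classify_trio` function, and the gene and variant names are all hypothetical, and genotypes are assumed to be already called and encoded as alternate-allele counts (0, 1, or 2).

```python
from collections import defaultdict
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical record: genotypes are alternate-allele counts (0, 1, 2)."""
    gene: str
    variant_id: str
    child: int
    mother: int
    father: int


def classify_trio(variants):
    """Naively sort trio variants into de novo, homozygous, and compound het."""
    de_novo = []
    homozygous = []
    inherited_hets = defaultdict(list)

    for v in variants:
        if v.child >= 1 and v.mother == 0 and v.father == 0:
            # Present in the child, absent in both parents
            de_novo.append(v)
        elif v.child == 2 and v.mother == 1 and v.father == 1:
            # One copy inherited from each heterozygous parent
            homozygous.append(v)
        elif v.child == 1 and (v.mother, v.father) in ((1, 0), (0, 1)):
            # Heterozygous in the child, inherited from exactly one parent
            inherited_hets[v.gene].append(v)

    # A gene is a compound heterozygous candidate if it carries at least one
    # maternally inherited and at least one paternally inherited variant.
    compound_het = {}
    for gene, hets in inherited_hets.items():
        maternal = [v for v in hets if v.mother == 1]
        paternal = [v for v in hets if v.father == 1]
        if maternal and paternal:
            compound_het[gene] = (maternal, paternal)

    return de_novo, homozygous, compound_het


# Made-up example data, for illustration only
example = [
    Variant("GENE_A", "var1", child=1, mother=0, father=0),
    Variant("GENE_B", "var2", child=2, mother=1, father=1),
    Variant("GENE_C", "var3", child=1, mother=1, father=0),
    Variant("GENE_C", "var4", child=1, mother=0, father=1),
    Variant("GENE_D", "var5", child=1, mother=1, father=0),
]
de_novo, homozygous, compound_het = classify_trio(example)
```

In practice, the labor-intensive part is everything this sketch leaves out: filtering on frequency and quality before classification, and checking that the two variants in a compound heterozygous gene are plausible in the first place.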
myVarbank. Despite the cloud character of Varbank, a customizable version that can be installed on your local network might be nice, possibly with some link to the main database with different levels of access rights. Many research groups would probably profit from a slick and fast exome data management tool that doesn’t require many programming skills. Wouldn’t it be nice to have all your local GATK/BAM files organized to have a “quick look” whether there is a mutation in a given gene in your datasets?
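The “quick look” idea boils down to an index from gene to samples. The following is only a rough sketch under simple assumptions: variant calls have already been extracted and annotated with a gene name, and they are available as plain `(sample_id, gene, variant_id)` tuples. The function names `build_gene_index` and `quick_look` and the sample data are hypothetical and not part of Varbank or GATK.

```python
from collections import defaultdict


def build_gene_index(calls):
    """Index annotated calls as gene -> sample -> list of variant IDs.

    `calls` is an iterable of (sample_id, gene, variant_id) tuples.
    """
    index = defaultdict(lambda: defaultdict(list))
    for sample_id, gene, variant_id in calls:
        index[gene][sample_id].append(variant_id)
    return index


def quick_look(index, gene):
    """Return {sample_id: [variant_id, ...]} for one gene, or {} if none."""
    return {sample: list(ids) for sample, ids in index.get(gene, {}).items()}


# Made-up example data, for illustration only
calls = [
    ("patient_01", "GENE_A", "var1"),
    ("patient_02", "GENE_A", "var2"),
    ("patient_02", "GENE_B", "var3"),
]
index = build_gene_index(calls)
hits = quick_look(index, "GENE_A")
```

A real tool would keep such an index in a database rather than in memory, but the lookup itself stays this simple.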
Go exome databases. Managing exomes to allow for flexible data mining is one of the next big steps. I personally fell in love with Varbank when I realized how easy it is for us to outsource the trio analysis, giving us more time for other projects. In the near future, we will likely see various formats of exome databases that will serve different needs – from the hands-on database like Varbank to larger databases with thousands of exomes such as the Epilepsy Genetics Initiative (EGI).