End of the year. The final weeks of the year are always a time when my curiosity for bioinformatics takes over. Four years ago, I was trying to teach myself sufficient command line and bioinformatics to run Denovogear on my computer. Now the field has moved on, from command line language to solutions that aim at bringing data closer to researchers. I hijacked a platform that was initially built for cancer research, CHOP’s Cavatica platform that was developed with Seven Bridges Genomics. In the same way as a few years ago, I started out with a simple question: Can I take an exome completely apart and then re-analyze it to find the SCN1A mutation in DRA1?
Democratization of sequencing. Exome sequencing has become a standard tool in human genetics and is part of the big transformation in biomedical sciences that is transforming biology from a previously data poor field of science to something that it more akin to astrophysics – we have so much data around that we often don’t know where to look. Over time, data analysis and data availability become more important that data generation. However, while many scientists in the field are comfortable with working with the results of exome sequencing studies, only very few are able to actually perform such an analysis from scratch. And this is not for the lack of trying, but for the lack of available platforms. Yes, the steps to assemble an exome are very complex and I don’t even come close to understanding them. But then, I am using a word processor to write this post without knowing the inner workings of Microsoft Office and without a PhD in computer science. The bottom line is that you need to have a tool that works for you at your disposal and engages you in a way that make you become more competent. Having such tools that are intuitive to use and ease you into the topic gently have the potential to expand the pool of people working with exome data significantly. How much crowd intelligence will you be able to generate if you lead people to the data?
Cavatica. After submitting a letter of interest for a foundation grant proposing a neuroscience open data platform to stimulate what we called a pediatric neuroscience data sharing economy, I felt that I should try myself. The Cavatica platform was launched earlier this year as a collaboration between the Children’s Brain Tumor Tissue Consortium (CBTTC) and Seven Bridges – the same organization that built one of the NCI’s Cancer Genomics Cloud pilots. Cancer data had basically become too big to be housed at any single institution and moving the data into the cloud was the next natural step forward. In neuroscience, we are still dealing with data silos and are not quite there yet, even though the direction towards joint, cloud-based datasets is obvious. Cavatica has several aspects that made it very appealing to run the first epilepsy exome in the cloud. First, with Yuankun Zhu and Bo Zhang from CBTTC/D3b, I had great support on the ground who were able to help me with easy questions (“Where do I find the data tab again?”) and not-so easy questions (“How do I link the platform to the S3 bucket where the data is stored?”). However, I am somewhat proud to say that after a few hiccups, I didn’t need that much help, mainly due to the second feature that made Cavatica useful to me. Cavatica uses both a command line and graphical workflow interface (screenshot). For this post, I’ve used the graphical one. You can take workflows apart, tie them together and tinker with them. Cavatica is cloud-based, so I could log on from work, home, or on my phone from the train to check on the process of my workflows. So I tried my luck with DRA1.
DRA1. The DRA1 trio was our first presumed SCN1A-negative Dravet trio that we recruited within our EuroEPINOMICS project. In fact, it was one of the many trio exomes that made up the story of the missed SCN1A mutations in the end – patients who turned out to be positive for mutations in SCN1A even though no mutations had been found with Sanger sequencing. DRA1 is usually my litmus test for any new exome pipeline that we’re approached for. If the pipeline manages to find the SCN1A mutation, the basic work-flow is present. The difference with Cavatica was that I had to do it myself. Ramakrishnan (Ramki) Rajagopalan had already put the exomes into an S3 bucket, so the data was already in the system. But it was completely raw data, so-called fastq files, the rawest dataset for exome sequencing data you can imagine, the basic data that jumps off the sequencer. I plugged the fastq files into one of the standard Cavatica workflows for exome sequencing. It took me some time to learn the system, but finally everything was configured appropriately to move this pipeline from fastq to BAM to VCF files to merged VCF files. VCF files are variant files and this is usually where my “territory” begins. It’s of course not the ideal pipeline that I would imagine for trio exomes. Joint calling of BAM files is not established yet and for the annotation beyond the VCF files, I had to move to wANNOVAR, which has always been my personal preference for annotation. But don’t take this as a complaint; I used basic building blocks that were mainly developed for a different purpose and applied them to the question that I had. And it worked (admittedly, this post would not exist otherwise).
Exomes in the cloud. It is extremely satisfying to sit at my desktop computer, press a button, and thereby make a cloud-based platform crunch gigabytes of data to deliver you a single mutation. When I plugged my variant files into wANNOVAR, our SCN1A variant was the top variant in the phenotype analysis and one of the few variants in the de novo analysis. Again, the system is not laid out yet for a proper de novo analysis, but we have all the building blocks in place. As I had mentioned during our meeting in Prague, it is my goal to re-analyze the entire RES dataset using such a platform, using our swarm intelligence and shared, cloud-based workflows. We’re not quite there yet, but platforms like Cavatica are our next step forward.