The Excel contagion

The tale of 16,000 genes. For a recent analysis, I wanted to compile all the gene names of variants that were found in 12 of our EuroEPINOMICS research patients. Since I was planning to do some statistical analysis as well, I used the R package for this, my personal favourite for all kinds of statistics. I also have  weak spot for Minitab and never got along with SPSS, but that is a different story. After I filtered and sorted the genes alphabetically, the following picture made me smile and gave me a reason to write a bit about role of Microsoft Excel for exome analysis…

This is not a calendar. Microsoft Excel took gene names (i.e. Sept-10) and treated them as dates. Even though I wasn’t fully aware that Excel was ever used in this analysis, this contamination proves a contact with Excel at some point.

This is not the EuroEPINOMICS birthday calendar. This is what happened to some of the gene names on my list. What has happened? At some point, files containing these names were opened in Excel. And Excel automatically reformats names like “SEPT10” (Septin 10) into 10-Sept, interpreting this as a date, not a name. All files that have touched Excel at some point carry this “contamination”. This is of course not a problem when looking at two genes, but something might fall through the cracks when you work with hundreds or thousands of them. This problem with Excel is long known and so prevalent that a 2004 article in BMC Bioinformatics was dedicated to it.

The primary program for exome analysis. It’s Excel again. After Microsoft extended the limitations for row numbers to >1M in the 2007 version from 60K in the 2003 version, Excel became a useful tool for bioinformatics. Much of the filtering for exome data can be done through Excel and it is often satisfying to see how variant lists shrink down by applying filter after filter. Likewise, for GWAS results, we can sort by p-value, quality measurements etc. Excel provides a very intuitive and familiar interface that makes data interpretable. There are, however, downsides.

The curse of dimensionality. Excel data is two-dimensional by default and data must be in this format. However, meaningful genetic information does not necessarily adhere to this. For example, the analysis of copy number variations in a large cohort can be forced into such a format, but is very difficult to interpret. Data visualisation using programs like CNVineta or the figures used in the paper by Cooper et al. on the genetic landscape of intellectual disability, provide a much better feel of what is actually hidden in the data. Also, when we look at exomes to find rare causative variants, the data is not two-dimensional, but actually three-dimensional. We are interested in variants on the chromosomal string (rows, dimension 1) and their properties (columns, dimension 2), but we need to compare this with other exomes (dimension 3), either of additional patients to find shared genes or with controls. For controls, we can compress this information again into one of the columns (dimension 2), but we should be aware that this is simply a substitute for what we are actually looking at.

Data is beautiful. One of my aims of this post is to remind us that data sometimes needs to be visualised. And Excel with its two-dimensional layout is simply a form of data visualisation that appeals to us. However, data visualisation doesn’t stop there. There are many interesting ways to display complex data, starting with Hans Rosling’s Gapminder to David McCandless’ “Data is beautiful”. We should keep in mind that with increasing data complexity, ranging from exomes, genomes to someday metagenomes, we need to visualize and break the curse of two-dimensionality.

Ingo Helbig

Child Neurology Fellow and epilepsy genetics researcher at the Children’s Hospital of Philadelphia (CHOP), USA and Department of Neuropediatrics, Kiel, Germany

Facebook Twitter