There is no escape from big data – it haunts you on the web and in social media, enables self-driving cars and supposedly revolutionizes health care. Big Data in Healthcare is a meeting in Luxembourg today and tomorrow that brings together a colorful mix of people from all domains and should convince the last of us that this is by no means just an issue for IT specialists.
How big is big? The European Bioinformatics Institute expects to have an exabyte of data available by 2020, as Niklas Blomberg from Elixir Europe pointed out. That’s a million of the 1 TB hard drives you can buy at the electronics shop around the corner. The real problem is not having too few USB ports on your computer (if computers in 2020 still have USB ports), but that network speeds have not grown enough to transfer this much data to your analyst of choice. Sending data across the internet in 2015 is not working well. A number of research groups are down to shipping hard drives; others are using cloud-based solutions that allow the analysts to bring the software to the data.
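To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes decimal (SI) units and a hypothetical, perfectly sustained 10 Gbit/s link; that link speed is my own assumption for illustration, not a figure quoted at the meeting.

```python
# Back-of-the-envelope arithmetic for moving an exabyte.
# Assumption: decimal SI units (1 TB = 10**12 bytes, 1 EB = 10**18 bytes).
EXABYTE = 10**18
TERABYTE = 10**12
SECONDS_PER_DAY = 86_400

# Number of 1 TB drives needed to hold one exabyte.
drives = EXABYTE // TERABYTE


def transfer_days(num_bytes: int, gbit_per_s: float) -> float:
    """Days needed to move num_bytes over a link of gbit_per_s,
    assuming the link is fully used with no protocol overhead."""
    bytes_per_second = gbit_per_s * 1e9 / 8
    return num_bytes / bytes_per_second / SECONDS_PER_DAY


# Hypothetical link speed, chosen only for illustration.
days_at_10g = transfer_days(EXABYTE, 10)

print(f"{drives:,} drives of 1 TB each")
print(f"{days_at_10g:,.0f} days at 10 Gbit/s")
```

Under these idealized assumptions the transfer would take on the order of 25 years, which is why shipping drives or moving the software to the data starts to look reasonable.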
The cloud. Obviously, Carlos Conde from Amazon has little understanding of why not everyone is using their web services for genomic data analysis and other large-scale data sets. Many researchers would like to use cloud services but are uncertain whether they are allowed to in the first place. Regulations are not keeping up. Advocates of open access demand that information be open to other researchers, whereas data protection-aware people would like patient data to be accessible only under highly specific and detailed regulations that are in place before any research is conducted. The matter was actively discussed at the conference and we’re a long way from a consensus.
Sharing. Clinical data cannot be shared Openly – capital O as in “Open Access,” as Andrew Hufton from the Nature data journal Scientific Data put it – but other forms of data sharing are supported by a variety of stakeholders: funding bodies, researchers and publishers. The regulatory perspective is to have what is called controlled access, nicely summarized earlier by Berta Knoppers representing the Global Alliance for Genomics and Health, which struck me as a very balanced approach to a global consensus.
A third perspective comes from patient organizations such as the Dravet Foundation Spain, represented by Ana Mingorance, who strongly believes that data should be collected and stored essentially by the patients or their parents themselves.
Security. Not all is well. Paulo Esteves Veríssimo, a security researcher at the University of Luxembourg, pointed out the many identity breaches that security experts have demonstrated over the years. I am not sure whether the solution he and other researchers proposed can actually be put into use for genomics in the clinic, but in terms of raising awareness, he certainly excelled. A decade ago, a record reduced to birthdate, zip code and gender was considered sufficiently anonymized. Today, anybody who has basic training in health care data protection knows otherwise. The computer scientists who develop the secure, easy-to-use, registered-access genomics data platform will be famous and held in the highest regard.
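Why birthdate, zip code and gender are not enough can be shown with a small sketch. It counts how many records are unique on that combination, which is the idea behind k-anonymity. This is my own illustration rather than anything presented at the meeting, and the records below are entirely made up.

```python
from collections import Counter

# Hypothetical, invented records: (birthdate, zip code, gender).
# Names are already removed, yet the combination may still be unique.
records = [
    ("1980-03-14", "00001", "F"),
    ("1980-03-14", "00001", "F"),
    ("1975-11-02", "00001", "M"),
    ("1992-07-30", "00002", "F"),
    ("1992-07-30", "00002", "M"),
]


def unique_fraction(rows):
    """Fraction of records whose quasi-identifier combination occurs once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


def k_anonymity(rows):
    """Size of the smallest group sharing the same quasi-identifiers.
    A value of 1 means at least one person can be singled out."""
    return min(Counter(rows).values())


print(f"unique on all three fields: {unique_fraction(records):.0%}")
print(f"k-anonymity of the data set: k = {k_anonymity(records)}")
```

In this toy data set three of the five records are unique, so anyone who knows a person's birthdate, zip code and gender from another source could link them to their record.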