Parasitic. In dramatic language somewhat reminiscent of the current US primaries, the New England Journal of Medicine warned of an emerging class of researchers referred to as research parasites: researchers who had nothing to do with an initial study or its design, but re-analyze its data, possibly for their own purposes. The NEJM editorial was accompanied by a call for coordinated, collaborative research rather than analyzing data without working with the researchers who initially generated it. Let’s discuss whether genetics is currently under threat from research parasite infestation and whether this may actually be a good thing.
Scenario. Imagine the following scenario. Group A has found a particular gene and Group B is interested in combining data to find out whether the initial findings still hold when data is joined together. Group A is currently about to submit grant proposals based on this initial finding and much of their reputation hinges upon this initial discovery. Group B’s dataset is much larger and it would be expected that Group A’s discovery may be refuted in this study. In fact, refuting the initial study is the purpose of Group B’s research, which is a novel idea given that the initial publication of Group A had a very high impact. Now Group B asks for the raw data of the initial study in order to make the analysis pipelines consistent. Would you consider it likely that Group A willingly shares the full raw data or would you rather assume that Group A may choose not to collaborate at this point, pursuing their own research interests?
Editorial. In their editorial in NEJM, Longo and Drazen use the example of a study on colon cancer to emphasize how collaboration in the scientific world could work. Start with a novel idea, identify possible collaborators, work together to test the new hypothesis, and report the data with shared authorship. This way, all relevant data from both sources can be integrated. They raise concerns mainly about the re-interpretation of clinical studies, when researchers not involved in the generation of the data may be unaware of some of the definitions of the parameters, which may then lead to faulty interpretations of the data. Scientists who simply re-analyze the data, but are not part of the initial data collection and analysis, may prefer this type of analysis to engaging in clinical studies, becoming research parasites rather than symbiotes. Longo and Drazen did not themselves call these scientists research parasites, but they referred to the fact that other researchers have used this term. The basic notion of their editorial is that, while beautiful from the 10,000-foot perspective, data sharing may be tricky when it comes to these details.
Data science. Translated to the genetic perspective, the discussion about parasites versus symbiotes is a discussion about re-analyzing raw genetic data, possibly in combination with other genetic data that may not be completely compatible. It is a somewhat less dramatic discussion, as the NEJM editorial is targeted mainly at clinical studies, where careful “parasiting” may result in the false interpretation of a clinical study, possibly questioning the effect of medical interventions that were carefully assessed in the initial study. For genetics, the question is whether we as scientists are under an obligation to share our data and whether this data may be used without our direct approval for studies that we are not involved in. In fact, many genetic studies are already required to deposit their data into dbGaP or other archives, making the data accessible to other researchers. While many of these researchers would be happy to collaborate, some may not necessarily want to, for manifold reasons. Should these researchers still be allowed to obtain data access?
There will be sharing. Eventually, genetic data will be shared and made available, which is the overall trend in the field. There is general consensus in the community that isolated data in firewalled centers will need to be connected, and ideally this will occur in a collaborative way. But should there be more? An obligation to share or even an automatic mechanism that will allow others to access your data? There are many reasons why genetic data cannot be exchanged, ranging from formal issues like the consent process to commercial interests of the center that generates or houses the data. However, most genetic studies are funded through public funds and we may consider it a general obligation to make data accessible in the broadest sense for the scientific community. It may be that we disagree with some of the interpretations of our data when it is analyzed by others. However, in this case, I would count on the fact that science with ongoing peer review is eventually self-correcting. Re-analyzing existing data, even by scientists not connected with the initial study, is part of this process. And eventually, we may refer to research parasites as data scientists.