Cold fusion – joining exome datasets to identify autism genes

Mergers and acquisitions. Inevitably, genetic research in neurodevelopmental disorders is moving towards joint analyses of large datasets. While the methodology of meta-analysis is well established for genome-wide association studies, the joint analysis of exome datasets comes with many question marks. Now, a recent paper in PLOS Genetics pioneers the field of joint exome data analysis for association studies in autism. This paper highlights some unexpected facets of rare variant analysis.

What keeps you awake at 3AM? While my infant son was struggling to go back to sleep, I decided to take a look at the recent paper by Liu and colleagues. First of all, a compliment to the authors, as they managed to write a very readable paper despite the many technicalities and statistical methods involved. If they managed to keep a sleep-deprived dad interested in reading their paper at 3AM on his iPhone, that’s already something. And the statement “rare variants map in ancestry space” almost has a poetic ring to it (see below).

Association, but the results don’t matter. The study by Liu and colleagues is an association study on two merged autism exome datasets and two control datasets. In total, they looked at ~1000 cases and ~1000 controls. Even though their study did not result in any novel findings, they discuss many of the technical difficulties when joining two exome datasets. In particular, their study demonstrates that (1) a joint analysis can be performed on the level of variant files, (2) mega-analysis beats meta-analysis and (3) population stratification for rare variants can be sufficiently addressed. We will discuss these three main points below.

Principal Component Analysis (PCA). As I have struggled recently with the concept and methodology of PCA for a different project, I wanted to use a 5-min-do-it-yourself example to demonstrate what this method does. Imagine you have two light sources in a room that are captured by four cameras. If we only know the signal received by the cameras, it is not immediately clear whether we are dealing with the signal from one, two or more sources. When looking at the principal components of the signal received by the four cameras, we find that there are basically only two principal components that capture much of the variation in the signal, suggesting that we are dealing with two independent sources rather than one source. Also, the frequency of the source signal can almost be reconstructed through the first two principal components. In genetic studies, the principal components may correspond to artefacts, batch effects or population stratification. After PCA, the data can be corrected for the main principal components on a group level.
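For readers who prefer to see this in code, here is a minimal sketch of the two-lights-four-cameras idea in Python with NumPy. It is not the tutorial's own code: the flicker frequencies, the mixing matrix and the noise level are all assumptions chosen for illustration. Each camera records a different weighted mix of the two sources, and PCA is done via a singular value decomposition of the centred recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical light sources flickering at different frequencies (assumed values).
t = np.linspace(0.0, 10.0, 1000)
sources = np.vstack([
    np.sin(2 * np.pi * 0.5 * t),
    np.sin(2 * np.pi * 1.3 * t),
])  # shape (2, 1000)

# Each of the four cameras sees a different linear mix of the two sources (assumed weights).
mixing = np.array([
    [1.0, 0.2],
    [0.8, 0.5],
    [0.3, 0.9],
    [0.1, 1.0],
])
noise = 0.05 * rng.standard_normal((4, t.size))
recordings = mixing @ sources + noise  # shape (4, 1000)

# PCA: centre each camera's signal, then decompose with the SVD.
centred = recordings - recordings.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# Fraction of the total variance captured by each principal component.
explained = s**2 / np.sum(s**2)
print(np.round(explained, 4))

# Rows of vt are the time courses of the components; the first two carry the source signals.
component_time_courses = vt[:2]
```

With these assumed settings, the first two components should account for nearly all of the variance and the remaining two should be little more than camera noise, which is the hint that two independent sources are present.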

Joining of datasets on the variant file level. There are various calling algorithms for exome data including GATK and samtools. For two different datasets generated at two centers under different conditions, it is difficult to assess whether results can be compared. Liu and colleagues now demonstrate that they can be. Both datasets, analysed with different calling algorithms, can be adjusted to a similar variant distribution using simple filtering. Patients do not have to be re-sequenced and you don’t have to go back to the initial, computationally intensive calling algorithms. That is good news for anybody attempting such a joint analysis in the future.

Mega-analysis beats meta-analysis. There are two ways to combine two association studies. The joint analysis can be performed on the level of the raw data (mega-analysis) or at the level of group statistics (meta-analysis). For genome-wide association studies, the methodology of meta-analysis is well established and accepted. However, for rare variants, this question was so far unresolved. Liu and colleagues now demonstrate that the mega-analysis (i.e. going back to the raw data) has certain advantages over the meta-analysis (joining p-values) for rare variants. For future combined studies, this means that raw data rather than summary data will need to be analyzed, which might be a data sharing issue.
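To make the distinction concrete, here is a toy sketch in Python using only the standard library. It is not the method used by Liu and colleagues: the carrier counts are invented, the per-gene test is a simple one-sided Fisher's exact test on carrier counts, and the p-values are joined with Fisher's combined probability method. The helper names `one_sided_exact_p` and `combine_p_fisher` are hypothetical. Note also that naive pooling ignores study as a covariate, which a real mega-analysis would have to handle.

```python
import math

def one_sided_exact_p(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact test: P(at least this many carriers among cases)."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    numerator = sum(
        math.comb(carriers, x) * math.comb(total - carriers, n_cases - x)
        for x in range(case_carriers, upper + 1)
    )
    return numerator / math.comb(total, n_cases)

def combine_p_fisher(p_values):
    """Fisher's combined probability test (chi-square with 2k degrees of freedom)."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form upper tail of a chi-square distribution with an even number of df.
    return math.exp(-half_stat) * sum(
        half_stat**i / math.factorial(i) for i in range(k)
    )

# Invented carrier counts for one gene:
# (case carriers, number of cases, control carriers, number of controls).
study_a = (12, 500, 5, 500)
study_b = (9, 500, 4, 500)

# Meta-analysis: test each study separately, then join the p-values.
p_a = one_sided_exact_p(*study_a)
p_b = one_sided_exact_p(*study_b)
p_meta = combine_p_fisher([p_a, p_b])

# Mega-analysis: pool the individual-level counts and test once.
pooled = tuple(a + b for a, b in zip(study_a, study_b))
p_mega = one_sided_exact_p(*pooled)

print(f"study A: {p_a:.4f}  study B: {p_b:.4f}")
print(f"meta: {p_meta:.4f}  mega: {p_mega:.4f}")
```

The two routes use the same underlying observations but need not give the same answer, and only the mega-analysis requires access to the individual-level data.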

Rare variants map in ancestry space. The most beautiful part of the paper by Liu and colleagues revolves around the analysis of population substructure for rare variants. If case and control samples are from different ethnicities, population stratification might result in false positive results, as in the case of the North-South gradient for variants in the lactase gene. For common variants in genome-wide association studies, the method of Principal Component Analysis (PCA) is well established and was pioneered by Price and colleagues. In brief, Principal Component Analysis reduces the dimensionality of data and generates components that capture a large part of the variation in the sample. For common genetic variants, these components correspond to ancestry or ethnicity. However, for rare variants this question regarding population stratification has remained open. There have been legitimate concerns that issues surrounding population stratification might actually be more relevant for rare variants. Liu and colleagues now show that the same method (PCA) used for common variants can be used for rare variants. In fact, Liu and colleagues find that using PCA to assess population stratification through common variants is sufficient to correct for the stratification of rare variants. As common variants hint at the “ancestry” of a given population, the PCA of common variants generates a so-called ancestry space. The authors show that the rare variants in their analysis cluster in ancestry space, i.e. they spread along the same components as common variants. This means that the established method of PCA for common variants can be used as a proxy for rare variants.
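The idea of correcting a rare-variant signal with common-variant components can be sketched with simulated data. Everything below is an assumption made for illustration: two ancestry groups, invented allele frequencies, a rare-variant burden that differs by group but has no effect on disease, and a plain least-squares fit standing in for the more elaborate association tests used in real exome studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group, n_common = 500, 500

# Two hypothetical ancestry groups with different allele frequencies at common variants.
freq_a = rng.uniform(0.1, 0.9, n_common)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_common), 0.05, 0.95)
group = np.repeat([0, 1], n_per_group)
freqs = np.where(group[:, None] == 0, freq_a, freq_b)
genotypes = rng.binomial(2, freqs).astype(float)  # shape (1000, 500)

# Confounding by design: group 1 carries more rare variants and contains more cases,
# but the rare-variant burden has no causal effect on case status.
burden = rng.poisson(np.where(group == 0, 1.0, 3.0)).astype(float)
status = rng.binomial(1, np.where(group == 0, 0.3, 0.7)).astype(float)

# Ancestry space: leading principal component of the standardised common variants.
z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
u, s, _ = np.linalg.svd(z, full_matrices=False)
pc1 = u[:, 0] * s[0]

def burden_effect(covariates):
    """Least-squares coefficient of burden on case status, given extra covariates."""
    design = np.column_stack([np.ones_like(burden), burden] + covariates)
    coef, *_ = np.linalg.lstsq(design, status, rcond=None)
    return coef[1]

naive = burden_effect([])        # ignores ancestry
adjusted = burden_effect([pc1])  # includes the first common-variant component

print(f"naive effect: {naive:.3f}  PC-adjusted effect: {adjusted:.3f}")
```

In this simulation the unadjusted fit picks up a spurious burden effect that comes purely from ancestry, and including the first component from common variants should shrink it towards zero. It is only a cartoon of the result reported in the paper, not a reproduction of it.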

Lessons for EuroEPINOMICS. Sooner or later, exome data will have to be used for a case-control study, possibly combining different datasets. The study by Liu and colleagues suggests that such a combined analysis is possible, but comes with several caveats. Their study did not reveal any new autism genes, suggesting that a joint analysis requires large datasets. There are different genetic association tests for rare variants and ideally, the joint analysis would involve a re-analysis of variant files, i.e. individual genotypes will need to be joined in a mega-analysis. In addition, Liu and colleagues suggest that issues such as population stratification are important, but can be sufficiently addressed.

The example in the figure is taken from a tutorial by Emily Mankin at the University of Colorado. It provides an easy-to-follow demonstration of PCA that can be run on your own computer using R.

Ingo Helbig

Child Neurology Fellow and epilepsy genetics researcher at the Children’s Hospital of Philadelphia (CHOP), USA and Department of Neuropediatrics, Kiel, Germany
