Gamification. Genetic epidemiology is probably one of the driest and most boring fields of genome science that you will encounter. However, there are some basic questions that keep on puzzling me. One of them is about rare variants: if we think of rare variants that are present in patients, but also in controls, how could a combination of rare variants ever fully explain a disease? What are the rules, what are the conditions for such a situation? I happened to play around with R yesterday, and caught by a wave of gaming spirit, I wanted to try and see. I created a virtual population with 1 million people where disease risk is almost fully explained by 300 rare variants – a little genetic SimCity that I will call Rare Variant Island. Follow me through some of the adventures in our new empire and see what happens.
Why? You might be wondering why I am taking time to think about a virtual population with rare variants. My interest in rare variants started when we were working on microdeletions like the 15q13.3 or 16p13.11 microdeletions. Back then, I was looking in an almost jealous way at the researchers who had the luxury of working on monogenic variants. At the time, we didn’t expect that there actually is a large contribution of de novo mutations in severe epilepsies, and all we had were rare genetic risk factors. Talking about risk factors rather than causative genes implies that the same variants can be found in unaffected individuals. What I asked myself back then were the following questions. If everything is just a risk factor, how do we cross the border from being unaffected to affected? How many variants do we need? How do rare variants “behave” in a population? That’s when I started to think about how we could simulate a population with rare variants.
Rules. Here are the rules that our sims, the inhabitants of Rare Variant Island, live by. Our island has one million inhabitants. One percent of the population is affected (10,000 individuals). There are 300 independent genes in the genomes of our simulated population that confer risk for disease. Every gene has rare risk variants that are present in roughly 1% of the affected individuals – this is quite close to a typical scenario: 1% of the population affected, rare risk variants present in 1% of the population.
How much risk? Given that I am the builder of Rare Variant Island, I can define how much risk each variant contributes. Let’s say that each risk variant in each gene doubles the risk for disease. We can take this risk and then distribute the number of risk alleles between affected and unaffected individuals. Taken together, when looking at the population-attributable risk, 95% of the overall disease risk in our simulated population is explained by the variants in the 300 genes, which is why I chose the number of genes in the first place. Let’s randomly distribute risk variants across affected and unaffected individuals and roll the dice for 300 x 1,000,000 genotypes.
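For readers who want to roll the dice themselves, here is a minimal sketch of such a simulation in plain Python. It is not the original R code. The function names, the scaled-down population of 100,000, the carrier frequencies (1% in affected, 0.5% in unaffected, which gives a relative risk close to 2 because the disease is rare) and the choice to hand out a fixed number of carriers per gene are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_island(n_people=100_000, prevalence=0.01, n_genes=300,
                    freq_affected=0.01, freq_unaffected=0.005, seed=1):
    """Return per-person variant counts for affected and unaffected sims.

    All defaults are illustrative assumptions, not the post's exact setup.
    """
    rng = random.Random(seed)
    n_affected = round(n_people * prevalence)
    n_unaffected = n_people - n_affected
    load_affected = [0] * n_affected
    load_unaffected = [0] * n_unaffected
    # Fixed number of carriers per gene, placed at random within each group.
    carriers_aff = round(freq_affected * n_affected)
    carriers_unaff = round(freq_unaffected * n_unaffected)
    for _ in range(n_genes):
        for person in rng.sample(range(n_affected), carriers_aff):
            load_affected[person] += 1
        for person in rng.sample(range(n_unaffected), carriers_unaff):
            load_unaffected[person] += 1
    return load_affected, load_unaffected, carriers_aff, carriers_unaff


def relative_risk(carriers_aff, carriers_unaff, n_affected, n_unaffected):
    """Risk of disease in carriers divided by risk in non-carriers."""
    risk_carrier = carriers_aff / (carriers_aff + carriers_unaff)
    noncarriers_aff = n_affected - carriers_aff
    noncarriers_unaff = n_unaffected - carriers_unaff
    risk_noncarrier = noncarriers_aff / (noncarriers_aff + noncarriers_unaff)
    return risk_carrier / risk_noncarrier


aff, unaff, ca, cu = simulate_island()
rr = relative_risk(ca, cu, len(aff), len(unaff))
print(f"per-gene relative risk: {rr:.2f}")
print("variant load in affected:", sorted(Counter(aff).items()))
print("variant load in unaffected:", sorted(Counter(unaff).items()))
```

Because the number of carriers per gene is fixed in this sketch, the per-gene relative risk comes out the same for every gene. Drawing each genotype independently would add gene-to-gene variability around that value.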
Distributing the risk factors. Voilà, that’s not too bad. With a little bit of variability, our rare variants are risk factors with a relative risk of 2 in our population. We have a population in which 95% of the risk of disease is explained by rare variants in 300 genes, and each variant in each gene is rare and present in approximately one percent of the population. With 5% that we cannot explain – it may be environment or other factors – our genetic architecture is nice and tidy. The disease is pretty much explained by rare variants on the population level. Now let’s see what these variants are doing in the population and how they are distributed in unaffected and affected individuals.
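As a quick sanity check on the 95% figure, here is one way such a number can be derived. The post does not say which formula was used, so this is an assumption: it applies Levin's population-attributable fraction to a single gene (carrier frequency 1% in the population, relative risk 2) and then combines 300 independent genes multiplicatively. The function names are made up for this example.

```python
def attributable_fraction(carrier_freq: float, relative_risk: float) -> float:
    """Levin's population-attributable fraction for a single risk factor."""
    excess = carrier_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def combined_attributable_fraction(single_paf: float, n_genes: int) -> float:
    """Combine n independent, equally strong factors: 1 - (1 - PAF)^n."""
    return 1.0 - (1.0 - single_paf) ** n_genes


# Assumed inputs: 1% carrier frequency per gene, relative risk 2, 300 genes.
per_gene = attributable_fraction(carrier_freq=0.01, relative_risk=2.0)
combined = combined_attributable_fraction(per_gene, n_genes=300)
print(f"per gene: {per_gene:.4f}, all 300 genes: {combined:.3f}")
```

Under these assumptions each gene accounts for just under 1% of the risk and all 300 genes together for roughly 95%. The result is sensitive to the assumed carrier frequency: with 0.5% instead of 1%, the combined figure drops to roughly 78%.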
Numbers. The absolute numbers of affected and unaffected individuals are interesting. Disease is rare in our simulated population (1%). Therefore, even with increasing “density” of risk variants, there are always many more unaffected individuals carrying a certain number of rare variants than affected individuals. The highest load of risk variants is 11 – no individual has more variants than this, which is due to the random distribution of genotypes in our population. But even for individuals with 7 risk factors, the number of unaffected individuals carrying this many risk variants is more than three times the number of affected individuals. This means that when we look at the burden of genetic risk factors, even among the top 5%, you’re way more likely to pick an unaffected individual than an affected individual when you choose randomly based on genotype. We’ll come back to this in a later post when we’ll look at how well genotype predicts phenotype on Rare Variant Island. For now, let’s observe that there doesn’t seem to be a good separation for most individuals.
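The same pattern can be worked out without simulating anything. The sketch below computes the expected number of affected and unaffected individuals for each variant load from a binomial distribution over 300 independent genes. The carrier frequencies (1% per gene in affected, 0.5% in unaffected) are an assumption, chosen as the reading of the island's rules that gives a relative risk of about 2; the function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k carrier genes out of n independent genes."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_counts(k, n_genes=300, n_affected=10_000, n_unaffected=990_000,
                    freq_affected=0.01, freq_unaffected=0.005):
    """Expected number of affected and unaffected sims carrying k variants.

    The two carrier frequencies are assumptions, not values from the post.
    """
    aff = n_affected * binom_pmf(k, n_genes, freq_affected)
    unaff = n_unaffected * binom_pmf(k, n_genes, freq_unaffected)
    return aff, unaff


for k in range(12):
    aff_k, unaff_k = expected_counts(k)
    share = aff_k / (aff_k + unaff_k)
    print(f"{k:2d} variants: {aff_k:8.1f} affected, "
          f"{unaff_k:9.1f} unaffected, {share:.1%} affected")
```

Under these assumptions, the expected number of unaffected individuals with 7 variants is a bit more than three times the expected number of affected individuals, and fewer than a handful of individuals are expected at 11 variants.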
Variant distribution. To repeat our last observation: there is not a good separation between affected and unaffected individuals on Rare Variant Island when it comes to the number of risk variants. Let’s hold this thought for a moment. This shouldn’t really happen, right? We are used to this phenomenon in real life when we observe that all variants other than clear monogenic variants hardly explain a significant proportion of the disease. But things on Rare Variant Island should be different – this simulation is built in a way that these variants theoretically should explain 95% of the liability to the disease. Why is this not working?
Rethinking causality. The reason why we struggle to understand what is happening on Rare Variant Island is a flaw in our understanding of risk factors. We want black and white causality; we are longing for a yes or no answer. We’re happy to accept risk factors, but often have the implicit assumption that these risk factors will eventually add up to a black and white answer. This is our hope for complex genetics: risk factors will somehow interact so that a combination of well-defined risk factors adds up and clearly discriminates between health and disease. The only problem with this assumption is that it is not necessarily true, as seen on Rare Variant Island. Risk factors remain risk factors and we have a hard time telling whether an individual with a certain number of risk variants is affected or not. This situation is similar to common variants in schizophrenia, where it is assumed that common genetic variants explain a significant proportion of the population risk, but are pretty much meaningless when you try to break this down to the individual. It’s almost a scary thought: a combination of risk factors rules the population risk, but is powerless when applied to the individual.
What you need to know. Our little trip to Rare Variant Island was meant to be a thought experiment. What does a population look like where almost the entire risk for disease is governed by rare risk variants? Our first observation is that things don’t add up – multiple risk variants don’t add up to certainty. This is something that we don’t really want to hear, as we implicitly assume that studying rare variants will help us come up with a concept of variant burden or variant load that will help us separate clearly between health and disease. Let’s revisit some other aspects of Rare Variant Island later – for now, our simulated population is stored away safely on my computer’s hard drive.