In retrospect, the completion of the human genome came with few surprises. One was the observation that the human genome apparently had far fewer genes than expected. I never understood the fuss that was made about the total gene number, as quoting an absolute number of genes is a gross simplification of the facts. How many genes we have depends heavily on what we call a gene and how we identify genes in the human genome. When talking about the total number of genes, how can you leave out all the problems of probabilistic algorithms for gene identification and the difficulties of proving gene expression and coding potential experimentally?
What’s a gene anyway? In 2012, we still cannot identify protein-coding genes with 100% certainty across the genome, although the genome is reportedly finished and recent updates are only minor. A couple of years back, Mark Gerstein and colleagues attempted a definition of a gene in the light of the ENCODE project, after witnessing conserved non-coding regions and an abundance of RNA expression everywhere.
Now a new study in yeasts inspects the spectrum of potentially coding regions. A functional genomics all-star team assembled around Marc Vidal proposes an evolutionary model for how new protein-coding genes arise. This model goes beyond a straightforward inspection of known genes in yeast. Most new genes that are born from non-coding regions will lose their coding potential and be removed from the genome. In that light, the term gene birth feels like a bad metaphor, as most of these genes die early and never have any impact on the species. However, some small open reading frames (ORFs) may be used in evolution to create novel genes, and at least in yeast, almost 2000 of these expressed open reading frames, which may serve as proto-genes, can be identified.
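To make the point about definitions concrete, here is a toy sketch of how the number of "genes" changes with the minimum ORF length you are willing to accept. It is not the method used in the yeast study; the function name `find_orfs`, the toy sequence and the length cutoffs are all made up for illustration, and it only scans the forward strand for ATG-to-stop stretches.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons):
    """Return (start, end) pairs of forward-strand ORFs.

    An ORF here is simply ATG ... stop in one reading frame, with at
    least `min_codons` codons counted from ATG up to (not including)
    the stop codon. This is an illustrative simplification.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return orfs


# Hypothetical toy sequence: a 2-codon ORF followed by a 5-codon ORF.
toy = "ATGAAATAG" + "ATGGCCGCCGCCGCCTGA"

# Arbitrary cutoffs, chosen only to show the effect of the threshold.
for cutoff in (1, 3, 10):
    print(cutoff, len(find_orfs(toy, cutoff)))
```

The same sequence yields two, one or zero ORFs depending on the cutoff, which is the whole problem in miniature: a strict length threshold throws away exactly the small ORFs that might be proto-genes.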
Unknown unknowns. We do not know to what extent small ORFs in the human genome, which may or may not be expressed at low levels, contribute to disease. For sure, they are difficult to track, both experimentally and with bioinformatics methods. In the light of complex diseases and missing heritability, I wonder whether we shouldn’t pay more attention to such aspects. Now that we’re analyzing exomes and genomes in the EuroEPINOMICS consortium, it might be worth considering whether non-conserved genes could also contribute to the genetic architecture of the epilepsies. Often enough, these positions are filtered out in pipelines that focus on likely detrimental effects.
These things are easy to blog about but will require a substantial amount of work and new ideas. And from recent exome studies on autism spectrum disorder, we know that the data won’t be simple to begin with.