Disparity in Genomic Databases

Genomics. The use and importance of genomics in clinical research and practice have grown rapidly as the cost of acquiring human genomic sequences has steadily decreased. Genetic variation can be inherited, acquired, or present at birth. Among inherited variants, the evolutionary history of humans accounts for much of the genetic variation seen across different groups. Genomic research can help identify genomic loci or variants that are potentially associated with human diseases and, in turn, enable the development of precision medicine. However, accounting for the normal spectrum of human genetic variation is critical, and the currently available tools are significantly limited in their ability to do so for a diverse range of human populations.

Figure 1. Genome-wide association studies (GWAS) search the genome for small variations that occur more frequently in people with a particular disease or trait than in people without that disease or trait. This figure from Fatumo et al. (2022) depicts the distribution of individuals in GWAS over the years. Most of these studies have been conducted in European individuals, even though Europeans make up a smaller share of the global population than people of East Asian or South Asian descent.

Database disparity. Since the Human Genome Project was completed in 2003, scientists around the world have worked to populate sequence and variant databases. When a genomic test is carried out, rare changes found in the DNA can only be interpreted by comparing them with thousands of previous test results. Among the most widely utilized databases is the Genome Aggregation Database (gnomAD), which laboratories use extensively for data sharing.

Unfortunately, these databases are heavily skewed towards data derived from individuals of European descent. One reason is that genomic studies are often expensive and require significant funding and resources, which are typically more readily available in Western countries. While research studies are increasingly required to deposit their data into accessible databases, the participants in most research are still predominantly of Western European ancestry. Historical mistrust also remains a significant barrier to increased research participation from underrepresented communities. Without approaches that address the root causes of this disparity, genomic advances will mainly improve diagnosis and precision treatment for populations with Western European ancestry.

Individuals are expected to benefit most from genomic research conducted in people with an ancestral background similar to their own. Over many generations, the genetic makeup of a population changes in response to factors such as climate, infectious diseases, and diet, so variants that are common in one population may be rare or absent in another. Genetic testing for individuals from underrepresented populations often yields results of uncertain significance due to insufficient comparable data in reference databases. Skewed data can therefore misrepresent minority populations, which makes variant interpretation scientifically challenging and is inequitable to the populations that are underrepresented.
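To make the sample-size problem concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method used by any particular database: the function name and the two cohort sizes are hypothetical. It asks how common a variant could still be in a population if it was never observed among a given number of sequenced individuals from that population, using a one-sided exact binomial bound.

```python
def max_plausible_allele_frequency(n_individuals: int, confidence: float = 0.95) -> float:
    """Upper bound on a variant's allele frequency when it was NOT observed
    in n_individuals diploid individuals (hypothetical helper).

    The chance of seeing zero copies among N alleles at frequency p is (1 - p)^N.
    Setting that equal to alpha = 1 - confidence and solving for p gives the bound.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    n_alleles = 2 * n_individuals  # two copies of each autosomal locus per person
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_alleles)


# Hypothetical reference cohort sizes, chosen only for illustration.
well_represented = 50_000
underrepresented = 500

bound_large = max_plausible_allele_frequency(well_represented)
bound_small = max_plausible_allele_frequency(underrepresented)

print(f"Unseen in {well_represented} people: frequency could still be up to {bound_large:.6f}")
print(f"Unseen in {underrepresented} people: frequency could still be up to {bound_small:.6f}")
```

With the same evidence (the variant was never seen), the bound for the smaller cohort is about a hundred times looser. A variant that looks convincingly rare against a large reference set could therefore still be a fairly common, harmless variant in a population with few reference samples, which is one way results of uncertain significance arise.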

Addressing genomic disparity. Considering the pace of change in the 20 years since the National Institutes of Health (NIH) mandated inclusion of diverse populations, it seems unlikely that this imbalance will correct itself. The National Human Genome Research Institute (NHGRI), part of the NIH, has launched several initiatives to address the observed disparity in genomic research.

Through these initiatives, people belonging to different demographic categories are being recruited for biomedical research. Large-scale biobanking initiatives have the capacity to yield extensive genetic data, so it is imperative to focus on increasing diversity during recruitment. The African Genome Variation Project was an initiative that performed genetic sequencing on various sub-populations in Africa. The GenomeAsia100K Consortium was established in 2019 to understand the genetic diversity within people of Asian descent.

Efforts to improve the technology that maximizes the utility of genomic data are predominantly led by researchers and depend largely on the sharing of genomic data. Genomic data generated through NIH-funded studies must be contributed to a data repository to enable widespread access. However, not all genomic research is NIH-funded, and there is no hard requirement to deposit data from other studies into existing repositories. Non-NIH-funded studies are increasingly asked to submit a data sharing plan that includes planned submission to repositories prior to publication, but this requirement is self-regulated and not enforced by any regulatory authority.

Recommendations. We have discussed the challenges in recruiting minority populations into research. It is fundamentally unethical to continue to conduct genomic research without adequate representation. To ensure equity and justice for individuals of different ancestries, it is important to collect datasets with wide representation. Conducting research in the world's diverse cities can help obtain truly representative control samples from different populations, so that studies are compared against controls of matching ancestry. Large-scale biobanking initiatives targeting inclusion of underserved populations are currently ongoing. These studies have the potential to generate diverse genomic datasets and can thereby serve as a strategy to reduce the Western bias in genomics.

Persistent outreach and education programs are needed to help rebuild minority communities' trust in research. Dedicated biomedical funding for researchers who target inclusion of diverse populations in genomics has also been shown to have a positive impact. To avoid misdiagnosis and increase genomic yield, GWAS need to be conducted in each population so that findings are not assumed to generalize from one population to another. Study reviewers should ensure inclusivity in study design and should also be cognizant that research in underrepresented communities often takes longer and costs more.

Conclusions. Scientific research in health and medicine has advanced at a rapid pace in the past few decades, owing to improvements in technology. Technology has also enabled researchers to demystify the human genome, and this has led to the generation of massive genomic datasets. Though these are exciting times to be in science, ethical challenges arise when communities are inadequately represented in genomic databases and the resulting advances benefit only a fraction of the global population. Failure to address this issue could hold back the future of genomic research. It is, therefore, crucial to tackle these disparities at the grassroots level and progress towards equity in genomic research.

Priya Vaidiswaran

Priya is a clinical research program manager within the Helbig lab.