Catalog of Humanity—Variation in 2,500 Genomes Ready for Perusal

Once upon a time, sequencing a single genome was something to brag about. Those days are gone. In the October 1 Nature, a consortium of researchers publishes its final report on the 1000 Genomes Project. The consortium sequenced the full genomes of 2,504 people, from 26 different populations across the Americas, Eurasia, and Africa. This took seven years. In total, the project identified 88 million variants in the human genome. The typical person’s DNA was dotted with 4 to 5 million of them. Single-nucleotide polymorphisms (SNPs) contributed the bulk of these variants, but your average Joe also carries a couple of thousand structural variants, such as deletions and insertions. In a separate report in the same issue of Nature, a subset of authors catalog those structural variations, reporting complex rearrangements geneticists had not seen before.

Casting a Wide Net. Project scientists collected DNA from people in 26 different populations across five continents. [Courtesy of The 1000 Genomes Project Consortium, Nature.]

The main paper is by corresponding authors Adam Auton of the Albert Einstein College of Medicine in New York, Gonçalo Abecasis of the University of Michigan in Ann Arbor, and hundreds of co-authors. They report variants found in diverse ethnicities, from Finns to Puerto Ricans to the Yoruba of Nigeria. They estimate that their data set accounts for more than 99 percent of SNPs and 85 percent of larger variants that are present at a frequency of at least 1 percent in the populations studied. Very rare variants, exclusive to certain small populations or families, might not appear in this data set.

As previous studies have indicated, people in Africa exhibited the most variable genomes. Because Homo sapiens arose there, the continent houses the oldest human populations, the ones that have had the most time to accumulate genetic variation. Scientists believe other populations underwent bottlenecks in genetic diversity as small founder groups emigrated from Africa carrying only a smidgen of the total population diversity with them. Since their departure from Africa, these younger ethnic groups have not had time to build up the same range of diversity that African populations possess.

Compared with SNPs, structural variants have been harder for geneticists to identify because genome sequencing relies on piecing together short, overlapping stretches of sequence. Repeat sequences and deletions are therefore easy to miss. To address this, researchers led by co-corresponding authors Evan Eichler of the University of Washington in Seattle and Jan Korbel of the European Molecular Biology Laboratory in Heidelberg, Germany, combined the 1000 Genomes sequences with data from other types of analyses, such as long-read sequencing, to characterize these kinds of variants. Besides simple deletions, insertions, and inversions, they also discovered more complex patterns of genomic shuffling, some new to science. For example, they saw places where multiple deletions occurred in a row, or spots where genes were duplicated, then inverted. They discovered about 240 genes that were missing in the genomes of many study participants, which led the authors to deem those genes potentially “dispensable.”

While the 1000 Genomes project only just wrapped up, scientists are already making regular use of its catalog of human variation. The authors published an interim data set, composed of 1,092 genomes from 14 countries, in 2012 (see Nov 2012 news; Nov 2010 news).

In the neurodegenerative disease field, scientists tap 1000 Genomes most often in GWAS interpretation, in a process called imputation. GWAS typically use SNP microarray chips, which identify some, but not nearly all, of the places where the human genome can vary by a single nucleotide. Then, scientists turn to data sets like 1000 Genomes to predict, or impute, what other variants are likely co-inherited with the SNPs flagged in the GWAS. Having full sequences of more individuals in the final data set will make imputation more accurate, said Mark Cookson of the National Institute on Aging in Bethesda, Maryland, who did not participate in the project. That will be particularly helpful in finding expression quantitative trait loci (eQTLs), he added. EQTLs are non-coding genetic sequences that regulate gene expression and are believed to contain many of the functional variants linked to diseases.
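To make the idea of imputation concrete, here is a minimal toy sketch. It is not how production imputation software works (those tools rely on statistical models of phased haplotypes); it simply copies alleles from reference haplotypes that agree with the study sample at every site the array typed. The panel, the function name `impute`, and the 0/1 allele coding are all hypothetical, chosen for illustration only.

```python
from collections import Counter

# Hypothetical reference panel: each string is one phased haplotype,
# one character per site (0 = reference allele, 1 = alternate allele).
REFERENCE_PANEL = [
    "00100",
    "00100",
    "11011",
    "11011",
    "11001",
]

def impute(observed, panel):
    """Fill untyped sites (None) using panel haplotypes that match
    the observed alleles at every typed site."""
    typed = {i: a for i, a in enumerate(observed) if a is not None}
    matches = [h for h in panel if all(h[i] == a for i, a in typed.items())]
    if not matches:
        return list(observed)  # nothing to copy from; leave the gaps
    result = []
    for i, allele in enumerate(observed):
        if allele is not None:
            result.append(allele)  # keep what the array measured
        else:
            # Take the most frequent allele among matching haplotypes.
            counts = Counter(h[i] for h in matches)
            result.append(counts.most_common(1)[0][0])
    return result

# The array typed only sites 0 and 3; sites 1, 2 and 4 are imputed.
study_haplotype = ["1", None, None, "1", None]
imputed = impute(study_haplotype, REFERENCE_PANEL)
print("".join(imputed))
```

In this toy, a larger panel means more haplotypes to match against, which is the intuition behind Cookson's point that more fully sequenced individuals should make imputation more accurate.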

Ewan Birney of the European Bioinformatics Institute in Cambridge, England, and Nicole Soranzo of the University of Cambridge also highlighted the importance of 1000 Genomes in GWAS imputation in a commentary accompanying the Nature paper. “Because genotyping arrays are cheap, the ability to infer variation allows researchers to focus on increasing sample sizes—a crucial next step in improving our understanding of the genetics of diseases,” they wrote. They called the 1000 Genomes Project a foundation for the future of human population genetics.

In a sense, 1000 Genomes provides a control population so scientists studying a particular disease or population do not have to build their own. In that regard, the variety of people sampled by the 1000 Genomes group offers an advantage, Cookson said. For example, when researchers want to focus on a particular ethnic group (Europeans, say), they can compare their genotypes to the 1000 Genomes data to identify, and discard, any samples from people who have a different ancestry.

The new structural-variants catalog can help scientists studying diseases inherited in families, just as SNP databases already do. Doctors who suspect a mutation, inversion, or deletion is to blame for an age-related neurodegenerative disease, for example, can check the 1000 Genomes data to find out if that variant commonly occurs. If it does, it might be innocuous, explain the authors. Without that context, a structural variation found in an afflicted family might be mistaken for the cause of the disease. The papers do not reveal the age or health of the people who donated the DNA samples.
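The frequency check described above comes down to simple arithmetic, sketched here under stated assumptions. The genotype lists are made up, the function names are hypothetical, and the 1 percent cutoff is only an illustrative threshold borrowed from the frequency the consortium used when estimating how complete its catalog is. The sketch also assumes a diploid, autosomal site, so each person contributes two chromosomes.

```python
def allele_frequency(genotypes):
    """Fraction of sampled chromosomes carrying the variant.

    `genotypes` holds, per individual, the number of copies of the
    variant allele: 0, 1 or 2 (diploid autosomal site assumed).
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    return sum(genotypes) / (2 * len(genotypes))

def is_common(genotypes, threshold=0.01):
    """True if the variant reaches the (illustrative) frequency cutoff."""
    return allele_frequency(genotypes) >= threshold

# Hypothetical variant A: 3 heterozygotes and 1 homozygote among 100 people.
variant_a = [0] * 96 + [1] * 3 + [2]
# Hypothetical variant B: a single heterozygote among 1,000 people.
variant_b = [0] * 999 + [1]

print(allele_frequency(variant_a), is_common(variant_a))
print(allele_frequency(variant_b), is_common(variant_b))
```

A variant like the first, common in the reference data, is a weaker candidate for causing a rare familial disease than one like the second, though frequency alone does not settle the question.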

Scientists can freely access the genotype data at the 1000 Genomes website. Researchers who spot something interesting in those A’s, T’s, G’s and C’s can purchase from the Coriell Institute for Medical Research in Camden, New Jersey, the original DNA samples for more detailed study. Also available are immortalized cell lines generated from the blood cells of people who donated their genomes.

References:

1. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015 Oct 1. 526. [Pubmed]

2. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH, Konkel MK, Malhotra A, Stütz AM, Shi X, Casale FP, Chen J, Hormozdiari F, Dayama G, Chen K, Malig M, Chaisson MJ, Walter K, Meiers S, Kashin S, Garrison E, Auton A, Lam HY, Mu XJ, Alkan C, Antaki D, Bae T, Cerveira E, Chines P, Chong Z, Clarke L, Dal E, Ding L, Emery S, Fan X, Gujral M, Kahveci F, Kidd JM, Kong Y, Lameijer EW, McCarthy S, Flicek P, Gibbs RA, Marth G, Mason CE, Menelaou A, Muzny DM, Nelson BJ, Noor A, Parrish NF, Pendleton M, Quitadamo A, Raeder B, Schadt EE, Romanovitch M, Schlattl A, Sebra R, Shabalin AA, Untergasser A, Walker JA, Wang M, Yu F, Zhang C, Zhang J, Zheng-Bradley X, Zhou W, Zichner T, Sebat J, Batzer MA, McCarroll SA, The 1000 Genomes Project Consortium, Mills RE, Gerstein MB, Bashir A, Stegle O, Devine SE, Lee C, Eichler EE, Korbel JO. An integrated map of structural variation in 2,504 human genomes. Nature. 2015 Oct 1. 526. [Pubmed]

3. Birney E, Soranzo N. The end of the start for population sequencing. Nature. 2015 Oct 1. 526. [Pubmed]


To view commentaries, primary articles and linked stories, go to the original posting on Alzforum.org here.

Copyright © 1996–2018 Biomedical Research Forum, LLC. All Rights Reserved.
