The American Society of Human Genetics is holding its annual conference from 18th to 22nd October in Vancouver, Canada. The Platform and Poster Abstracts are now available online. The research presented at this meeting gives a taste of some of the publications and developments to come in the next year or so. A number of abstracts are of particular interest to genetic genealogists. I note especially that AncestryDNA are presenting a number of interesting posters which hint at some new tools that might be on the way. I've highlighted below my picks from the conference programme.
23andMe will also be at the ASHG meeting. They have published a list of the abstracts for their presentations and posters on their blog, though none of the content is of direct interest to genetic genealogists.
Ultra-fine structural inference and population assignment using IBD network clustering and classifiers accurately assign sub-continental origins represented in a large admixed U.S. cohort.
E. Han, R. Curtis, P. Carbonetto, K. Noto, J. Byrnes, Y. Wang, J. Granka, A. Kermany, K. Rand, E. Elyashiv, H. Guturu, N. Myres, E. Hong, C. Ball, K. Chahine. Ancestry.com DNA, LLC, San Francisco, CA.
Motivation & Objectives: Identifying the geographic origin of individuals using genetic data has broad application in forensics, human disease and evolution. There have been multiple methods proposed to achieve this goal, such as Principal Component Analysis (PCA), Spatial Ancestry Analysis (SPA) and Geographic Population Structure (GPS). However, most methods suffer from decreased prediction accuracy outside Europe and do not apply to the US population comprised of admixed immigrants. In this study, we describe a new method and demonstrate its accuracy in predicting geographic origins in the US post-European colonization or internationally for single origin and admixed samples. Methods: We use a database of over 1.5 million consented genotype samples collected from the US and internationally, along with samples from public databases such as POBI. We build a genetic network by estimating the amount of identity-by-descent (IBD) sharing between all individuals. By iteratively applying the Louvain method for community detection, we find a hierarchy of genetic clusters in the network. Leveraging user-generated pedigrees going back 6-8 generations, we annotate each cluster with birth locations that are enriched in historical time periods. The birth locations of these clusters are generally specific to locations in the US or internationally, allowing for concise geographical interpretation. Although community detection results assign samples to only one cluster, we use machine learning classification to assign samples to multiple clusters. Given this classification and enriched birth locations, we identify the likely geographic origins of each sample. Results: Our results include over 300 stable clusters, each comprised of more than 1000 samples. Some clusters correspond to narrow geographical regions, such as people descended from southern West Virginia in the 19th century, and others to broader groups, such as European Jews from Poland.
By using the associated pedigrees, we demonstrate the accuracy of these predictions: over 95% of the assigned individuals have at least one known ancestor born in the enriched region defined by most clusters. Conclusion: By utilizing large-scale genetic data with associated pedigrees, we have developed the first method for predicting the geographic origin of individuals within the US or internationally with high accuracy. This approach can be used for ultra fine scale genetic ancestry mapping in any population.
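For readers wondering what "community detection" on an IBD network means in practice: the Louvain method works by greedily moving samples between clusters to increase a score called modularity. The sketch below is not Louvain itself and is certainly not AncestryDNA's pipeline. It only computes the modularity score for a made-up network of six people, where the edge weights are hypothetical amounts of shared IBD in centimorgans and all names are my own invention for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a weighted, undirected graph.

    edges: dict mapping (u, v) pairs to a weight (hypothetical shared IBD in cM).
    community: dict mapping each node to a cluster label.
    """
    strength = defaultdict(float)  # weighted degree of each node
    internal = defaultdict(float)  # edge weight falling inside each community
    total = 0.0                    # m, the sum of all edge weights
    for (u, v), w in edges.items():
        strength[u] += w
        strength[v] += w
        total += w
        if community[u] == community[v]:
            internal[community[u]] += w

    comm_strength = defaultdict(float)
    for node, s in strength.items():
        comm_strength[community[node]] += s

    # Q = sum over communities of (internal / m) - (community strength / 2m)^2
    return sum(
        internal[c] / total - (comm_strength[c] / (2 * total)) ** 2
        for c in comm_strength
    )


# Toy network (invented data): two tightly related trios joined by one weak link.
ibd_edges = {
    ("A", "B"): 30.0, ("B", "C"): 30.0, ("A", "C"): 30.0,
    ("D", "E"): 30.0, ("E", "F"): 30.0, ("D", "F"): 30.0,
    ("C", "D"): 5.0,
}
two_clusters = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}
one_cluster = {name: 1 for name in "ABCDEF"}

q_two = modularity(ibd_edges, two_clusters)
q_one = modularity(ibd_edges, one_cluster)
print(f"two clusters: Q = {q_two:.3f}; one cluster: Q = {q_one:.3f}")
```

Splitting the toy network into its two trios scores higher than lumping everyone together, which is the signal a method like Louvain follows when it carves a large network into clusters.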
A massively scalable phenotyping approach using social media for genetic studies.
J. Yuan1,2, A. Gordon1, D. Speyer1,2, D. Zielinski1, R. Aufrichtig1, J. Pickrell1,3, Y. Erlich1,2. 1) New York Genome Center, New York, NY; 2) Computer Science, Columbia University, New York, NY; 3) Biological Sciences, Columbia University, New York, NY.
While DNA sequencing is largely a tractable problem, massive phenotyping is still a challenge, especially for Internet-based studies. Traditional methods, such as physical exams, scale poorly for large numbers of individuals. Questionnaires are easier to collect, but administering lengthy or frequent questionnaires creates a negative experience for participants, leading to lower completion rates. Electronic health records are a great resource for phenotypes, but they exhibit large heterogeneity when collected from various resources and are subject to an array of confidentiality restrictions that complicate their collection. Recent studies have highlighted the value of obtaining digital phenotypes by interpreting the interactions of users with digital outlets as a reflection of underlying traits. In particular, these studies have shown that social media data enables the collection of various phenotypes including big five personality traits, sexual orientation, sleeping patterns, and even heart rate from regular user videos. The ubiquity of the data and its ease of collection through standard APIs enable a new methodology for large scale phenotypic collection. Here, we report our ongoing efforts to enable participants to donate their social-media data along with their genomes in order to understand the genetics of digital phenotypes. In our previous work, we developed DNA.Land (https://dna.land), an online platform where users may register and securely contribute their Direct to Consumer genomic data, as well as receive reports of ancestry and shared relatives with other DNA.Land users. Since our launch in ASHG2015, we have obtained over 20,000 users, many of whom have been eager to share personal information such as family history. We are now building a new component in DNA.Land in which users can contribute their Facebook data for scientific studies. 
We will present our IBM Watson-based system to predict traits from social media data and will describe the type of information DNA.Land users will receive. In addition, we will discuss the particular challenges in collecting this data with respect to both computational efforts and privacy concerns. Our approach is applicable for other types of large scale efforts such as the Precision Medicine Initiative and can easily scale to millions of people.
Insights into the geographical distribution of genetic admixture of unrelated volunteer donors and recipients of stem-cell transplants.
A. Madbouly 1, K. Besse 1, Y. Wang 2, J. Byrnes 2, C. Ball 2, N. Myres 2, M. Maiers 1. 1) Bioinformatic Research, National Marrow Donor Program, Minneapolis, MN; 2) Ancestry.com, San Francisco, CA, USA.
Genetic ancestry of self-described groups may vary across geographic locations in the US, a phenomenon documented anecdotally but not thoroughly explored in the literature. We studied the genetic ancestry of 995 HLA matched donor/recipient (DR) pairs from the Be The Match® registry with a focus on regional ancestry differences among ethnic groups. We hypothesized that, along with historical events, donor/transplant center distribution and socioeconomic factors might influence the geographical spread of some genetic admixtures. We genotyped 995 DR pairs on the Illumina OmniExpress chip with approximately 730,000 SNPs. Self-reported race and ethnicity was collected for donors at the time of registry recruitment. Recipients’ race and ethnicity was recorded at the transplant hospital once at the time of diagnosis and again after transplant. The majority of the study cohort (94%) self-identified as European Caucasian (CAU). The rest identified as Hispanic (HIS) (3.5%), African-American (1%) and Asian or Pacific Islander (1.5%). Address zip code information was available for 99% of recipients but only 59% of donors. Genetic ancestry was estimated by applying the AncestryDNA ethnicity estimator pipeline, which provides a vector of 26 admixtures. Some admixtures were combined for the analysis due to small counts and minimal impact such as detailed African (AFR) admixtures. We then mapped the geographical distribution of European (EUR) and non-EUR genetic admixtures for self-reported CAU and non-CAU individuals, optimizing geographical regions for subject privacy. The main self-reported race groups showed average proportions of AFR and EUR admixtures compatible with Bryc and colleagues (2015). However, our results revealed larger Amerindian admixture in self-reported HIS, especially among recipients. When stratifying regionally, systematic differences emerged in admixture distribution among similar race groups mostly interpretable by historic events. 
Separating donors and recipients suggested possible additional influences, such as donor and transplant center geographical spread. Importantly, we observed differences in the distribution of non-majority admixtures such as increased AFR admixture in self-reported CAU donors (but not recipients) in some southern states suggesting a possible socioeconomic link. This work has the potential of guiding stem-cell donor registry strategies on volunteer donor recruitment and donor and transplant center planning.
Geographic and historic changes in runs of homozygosity among more than 1,000,000 individuals sheds light into the recent demographic history of US population.
A. Kermany, C. Ball, J. Byrnes, P. Carbonetto, K. Chahine, R. Curtis, E. Elyashiv, J. Granka, H. Guturu, E. Han, E. Hong, N. Myres, K. Noto, K. Rand, Y. Wang. Ancestry.com DNA, LLC, San Francisco, CA.
Runs of Homozygosity (ROH) are indicators of segments of chromosomes identical by descent between parental haplotypes. Distribution of such runs along the chromosome contains information regarding the demographic history of the population under study, in particular it reveals trends in consanguinity. In this study, we analyze the distribution of runs of homozygosity – chromosomal locations, number of runs and lengths of runs - as well as estimated inbreeding coefficient (F) among more than 1,000,000 consented AncestryDNA customers. We report on observed variations in distribution of ROH based on geographic origins - inferred from the available pedigree data – admixture proportions as well as birth year cohort. In particular, we present our results on variations in the distribution of ROH within 19 communities within the US population - identified based on analysing a network of genetic matches in the database - and investigate differences in patterns of ROH between each group and comment on the inferred demographic history within each group.
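To make the idea of a run of homozygosity concrete, here is a minimal sketch that scans one person's genotypes for stretches of consecutive homozygous SNPs and then expresses the total run length as a fraction of the region covered. The abstract does not say how ROH or the inbreeding coefficient F were computed, so everything here is an assumption on my part: the function names, the minimum-SNP threshold, the toy genotypes and the use of the simple "fraction of genome in ROH" estimate. Real tools use sliding windows and tolerate occasional heterozygous or missing calls.

```python
def find_roh(genotypes, positions, min_snps=3):
    """Return (start_bp, end_bp) for each run of >= min_snps homozygous calls.

    genotypes: alt-allele counts per SNP (0 or 2 = homozygous, 1 = heterozygous).
    positions: base-pair coordinates of the same SNPs, in ascending order.
    """
    runs = []
    start = None  # index where the current homozygous stretch began
    for i, g in enumerate(genotypes):
        if g != 1:
            if start is None:
                start = i
        else:
            # A heterozygous call ends the current stretch.
            if start is not None and i - start >= min_snps:
                runs.append((positions[start], positions[i - 1]))
            start = None
    # Close a stretch that reaches the end of the data.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((positions[start], positions[-1]))
    return runs


def f_roh(runs, region_length):
    """Fraction of the region covered by runs of homozygosity."""
    return sum(end - start for start, end in runs) / region_length


# Invented toy data: 12 SNPs spaced 100 bp apart in a 1,200 bp region.
toy_genotypes = [0, 2, 2, 0, 1, 2, 1, 0, 0, 0, 2, 1]
toy_positions = [100 * (i + 1) for i in range(len(toy_genotypes))]

toy_runs = find_roh(toy_genotypes, toy_positions)
toy_f = f_roh(toy_runs, region_length=1200)
print(toy_runs, toy_f)
```

In real data the runs of interest are typically hundreds of kilobases or longer, and long runs point to recent shared ancestry between a person's parents.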
Y-chromosomal sequencing and screening reveal both stability and migrations in North Eurasian populations.
O. Balanovsky 1,2, V. Zaporozhchenko 2,1, A. Agdzhoyan 1,2, I. Alborova 5, M. Kuznetsova 2, V. Urasin 3, M. Zhabagin 4, M. Chukhryaeva 2,1, Kh. Mustafin 5, C. Tyler-Smith 6, E. Balanovska 2. 1) Vavilov Institute of General Genetics, Moscow, Russian Federation; 2) Research Centre for Medical Genetics, Moscow, Russia; 3) YFull service, Moscow, Russia; 4) National Laboratory Astana, Nazarbayev University, Astana, Republic of Kazakhstan; 5) Moscow Institute of Physics and Technology (State University), Moscow, Russia; 6) The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom.
Y-chromosomal markers exhibit the highest interpopulation diversity in the genome and thus form one of the most informative tools for tracing population history. However, their information value depends on discovering SNPs which subdivide haplogroups with broad geographic distribution into branches revealing fine population structure. Progress in such discoveries has recently moved from a slow linear phase to a rapid exponential phase due to NGS. We applied this approach to the Y-chromosomal pool of North Eurasian populations and concentrated on haplogroups C, G1, G2, N1b, N1c, and R1b. We sequenced 181 Y-chromosomes (capturing 11 Mb from each sample), developed the NGSConv software for calling Y-chromosomal SNPs, and identified roughly 2,500 SNPs, most of which were new. Then we constructed phylogenetic trees and dated dozens of their branches using our estimates of the mutation rate. The last – but not the least – step included screening branch-defining SNPs in the entire Biobank of indigenous North Eurasian populations (led by prof. Elena Balanovska), which includes 26,000 samples from 260 populations. This screening resulted in frequency distribution maps of 29 branches of haplogroups R1b and C, thus increasing the phylogenetic resolution by an order of magnitude compared to the two initial haplogroups. For haplogroup R1b, we identified a previously unstudied “eastern” branch, R1b-GG400, found in East Europeans and West Asians and forming a brother clade to the “western” branch R1b-L51 found in West Europeans. The ancient samples from the Yamnaya archaeological culture are located on this eastern branch, showing that the paternal descendants of the Yamnaya population – in contrast to the published autosomal findings - still live in the Pontic steppe and were not an important source of paternal lineages in present-day West Europeans. 
For haplogroup C-M217 - the predominant paternal component in Central Asians - we found signals of simultaneous expansion in two independent branches. Both expansion times and gene geographic maps of the expanded lineages indicated the emergence of the Mongol Empire as the likely trigger. We conclude that simply discovering new SNPs is not enough, but in combination with screening for the branch-defining SNPs in large biobanks of indigenous populations, it allows comprehensive reconstruction of male population history. The study was supported by the Russian Science Foundation grant 14-14-00827 to OB.
Admixture inference of African Americans and Latinos in the United States through time.
M.L. Spear 1, D.G. Torgerson 2, R.D. Hernandez 1,3,4. 1) Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; 2) Department of Medicine, University of California, San Francisco, San Francisco, CA; 3) California Institute for Quantitative Biosciences (QB3), University of California, San Francisco, San Francisco, CA; 4) Institute for Human Genetics, University of California, San Francisco, CA.
The study of admixed populations has provided important insights into medical genetics and population history. The genomes of admixed individuals are mosaics of segments originating from different ancestral populations. At the genome-wide level, the proportion of one’s genome deriving from each ancestral population is referred to as “global ancestry proportions”. However, modern statistical methods enable inference of the ancestry at individual SNPs within a genome, “local ancestry”, which allow us to reconstruct the mosaic pattern of ancestry tracts across an individual’s genome. Local ancestry inference is critical for the analysis of admixed genomes and has been widely studied in the fields of medical genetics and human demographic history. Local ancestry tracts can be used to infer migration histories but the question remains how these histories have shaped ancestry proportions over time, particularly in the United States, a “melting pot” country that has faced changing societal norms over the past century. It has yet to be determined how the length distribution of ancestry tracts in admixed individuals has changed over decades as well as how the variation in ancestry proportions across chromosomes and individuals may differ. Thus, we estimated local ancestry for 4,600 Latinos and 2,100 African Americans from the Genetic Epidemiology Research on Adult Health and Aging (GERA) dataset using RFMix. With these local ancestry tracts, we used TRACTS to compare the observed length of the ancestry tracts to predictions of different demographic models of migration scenarios. Individuals were grouped by 5-year birth year categories, and comparisons were made between the demographic models generated from each birth year category. Overall, the local ancestry tracts of African Americans and Latinos from the United States have provided insights into the change in complexity of their genetic structure throughout the 20th century.
Fine-scale population structure in France: Loire River as genetic barrier.
C. Dina 1,2, J. Giemza 1, M. Karakachoff 1,2, F. Simonet 1,2, K. Rouault 3, E. Charpentier 1,2, S. Lecointe 1,2, P. Lindenbaum 1, J. Violleau 1,2, H. Le Marec 1,2, C. Férec 3, S. Chatel 1,2, S. Hercberg 4, P. Galan 4, J-J. Schott 1,2, E. Génin 3, R. Redon 1,2. 1) Thorax Inst, INSERM-CNRS, Nantes, France; 2) CHU Nantes, Nantes University; 3) Inserm UMR 1078, CHRU Brest, University Bretagne Occidentale, EFS, Brest France; 4) Université Paris 13, Equipe de Recherche en Epidémiologie Nutritionnelle, Centre de Recherche en Epidémiologie et Statistiques, Inserm (U1153), Inra (U1125), Cnam, COMUE Sorbonne Paris Cité, F-93017, Bobigny, France.
Background The genetic structure of human populations varies throughout the world, being influenced by migration, admixture, natural selection and genetic drift. Human population structure was first investigated at broad scales, between and within continents. Currently researchers focus on finer scales, examining genetic structure within countries. Characterising such genetic variation is of interest as it provides insight into demographic history and informs research on disease association studies, especially on rare variants. We here explored the genetic structure of a population living on the French territory (hereafter called French population), first across the whole territory and then in the western part, where interesting stratification was identified.
Methods and Results We genotyped genome-wide: 2276 individuals with known department of origin from the French population (SU.VI.MAX study) using an Illumina chip; and 456 individuals (PREGO study) from the Atlantic coast of Western France, from Finistère to Vendée, with at least three of their grandparents born within a distance of 15 kilometres, using the Axiom CEU chip. With the EEMS software we visualised areas with low effective migration rates - the migration barriers - which match geographical features, with a particularly strong barrier on the lower course of the Loire in Western France. We then focused on the PREGO study, where Principal Components analysis revealed that individuals from the same departments form clusters. In both datasets we observed a high correlation between geographical position and components (p-value < 2e-16). Many independent methods support the hypothesis that the Loire River is a genetic barrier. The two groups of individuals, from north or south of the Loire, are well differentiated along the PC1 axis. ADMIXTURE estimated different ancestry proportions for the two groups. The first split of the hierarchical clustering returned by fineSTRUCTURE, and of the one based on normalized counts of identity-by-descent segments, is between north and south of the Loire.
Conclusion We here report genetic stratification at the level of continental French territory. The migration pattern is following the geographical structure. A specific pattern is noticed around the Loire River. We confirm both evidence for isolation by distance and existence of a genetic barrier, the Loire River. The discovered fine-scale population structure may have consequences in association analyses, especially for rare variants which tend to be geographically clustered.
Maps of effective migration as a summary of human genetic diversity.
B. Peter, D. Petkova, M. Stephens, J. Novembre. University of Chicago, Chicago, IL.
A dominant pattern of genetic diversity in humans is that geographically proximal populations are generally more genetically similar to one another; however, there are exceptions to this rule. Persistent geographical features such as mountains, oceans, or deserts, have allowed excess genetic differences to accumulate in some regions more than others. Conversely, historical migrations and population movements have led to cases where exceptional levels of similarity persist across large geographic distances. To provide more insight into how genetic differentiation is distributed geographically in humans, we examine the fine-scale genetic structure of humans. We produce maps that represent the spatial structure of human genetic diversity using a recently developed, spatially explicit method (EEMS, Estimation of Effective Migration Surfaces). We apply EEMS on global, continental, and sub-continental scales, analyzing genetic data from 8,740 individuals from 469 geographically localized populations, obtained from 24 different source studies. In addition to the major, well-known barriers such as the Sahara, Himalayas and Mediterranean, we detect barriers that correlate with historic language group boundaries (boundaries of Slavic and Bantu speakers with their neighbors), mountain ranges (Zagros, Caucasus, Ural) and marine features (English Channel, Adriatic Sea, Wallace line). We also identify regions showing high connectivity despite having geographic separation (Britain and Scandinavia, Iceland and Denmark, among the Lesser Sunda Islands). Simultaneously, we find that levels of diversity vary more smoothly, decreasing gradually with distance from Africa. Overall, our results suggest that diversity patterns are consistent and primarily shaped by the signature of the Out-of-Africa expansion, but that migration rates are strongly influenced by geography and local events.
The African Genome Resource Project: Patrilineal and matrilineal inheritance through the Y chromosome and the mitochondrial genome.
F. Abascal, D. Gurdasani, T. Carstensen, M. Pollard, C. Pomilla, M. Sandhu on behalf of AGR investigators. Human Genetics, Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.
Background The Y chromosome and the mitochondrial genome are inherited from the paternal and maternal lines, respectively. The lack of recombination in the mitochondrial genome and in large part of the Y chromosome leads to evolution almost in isolation from the autosomal genome. As a result, the Y chromosome and the mitochondrial genome offer a unique perspective on human demographic processes. Y chromosome (Y-) and mitochondrial (mt-) haplogroups can be very informative about human origins, migrations and admixture, as well as about potential sex biases during these processes. Further characterisation of the diversity of Y- and mt-haplogroups within Africa is essential to understand human history. Here, we present the mitochondrial and Y chromosome diversity among ~5000 individuals from the African Genome Resource panel.
Methods We predicted the mt- and Y-haplogroups for 4,990 individuals and 2,399 males, respectively, representing diverse ethno-linguistic groups from Ethiopia, Uganda, South Africa, Egypt, and 5 African populations sequenced within the 1000 Genomes project. Mitochondrial and Y haplogroups were predicted with Haplogrep and YFitter, respectively. We called the mitochondrial genome and the Y chromosome for each sample and reconstructed their phylogenetic relationships with FastML.
Results We found evidence for Eurasian admixture among several sub-Saharan populations. Eurasian mt haplogroups appeared in 23% of the Ethiopians and 0.8% of the Ugandans. No Eurasian mt haplogroups were detected for the Zulu and Nama. We identified 13% Ethiopians, 0.5% Ugandan, and 43% Nama/Khoe-Sans with Eurasian Y haplogroups. Eurasian admixture is prevalent in Ethiopia but it is not distributed homogeneously. Whereas the Gumuz show no Eurasian haplogroups, the Amhara show the highest frequencies. Within the Nama/Khoe-San there is not a single Eurasian mitochondrial haplogroup but up to 43% of Eurasian Y haplogroups, revealing a strong sex bias (p=1e-12). Consistent with previous reports, the oldest haplogroups are found in highest frequencies within the Khoe-Sans.
Conclusions We present the largest panel of mt and Y chromosome sequences across Africa, including highly diverse Khoe-San populations from South-Africa. Our findings suggest substantial variation in Y chromosome and mt haplogroups across Africa, and provide evidence for extensive Eurasian admixture among several populations across Africa.
Whole-genome sequence analyses provide new insights into the demographic history and local adaptation of African populations.
S. Fan 1, D.E. Kelly 1, M.H. Beltrame 1, M.E.B. Hansen 1, S. Mallick 2,3,4, T. Nyambo 5, S. Omar 6, D. Meskel 7, G. Belay 7, A. Froment 8, N. Patterson 3, D. Reich 2,3,4, S.A. Tishkoff 1,9. 1) Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA; 2) Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; 3) Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; 4) Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 02115, USA; 5) Department of Biochemistry, Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania; 6) Kenya Medical Research Institute, Center for Biotechnology Research and Development, Nairobi, Kenya; 7) Department of Biology, Addis Ababa University, Addis Ababa, Ethiopia; 8) UMR 208, IRD-MNHN, Musée de l'Homme, Paris, France; 9) Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA.
Africa is the origin of modern humans within the past 200,000 years. There are more than 2,000 ethnolinguistic groups in Africa, which encompass around one-third of the world’s languages. To infer the complex demographic history of African populations and adaptation to diverse environments, we sequenced the genomes of 94 individuals from 44 indigenous African populations using high coverage Illumina sequencing technology. Phylogenetic analysis confirms that the San lineage is basal to all other modern human population lineages. The location of other African populations in the phylogenetic tree correlates with geographical location, with the exception of the Central Africa rainforest hunter-gatherer (RHG) populations, who group with Southern African populations. We characterize ancient African population structure by inferring the effective population size and divergence time between populations. A common population bottleneck for all African populations was observed at ~200 thousand years ago (kya), corresponding with paleobiological evidence for modern human origins. Since then, the San and RHG populations have maintained the largest effective population size compared to other populations prior to 10 kya. Using MSMC analysis, we infer that the San population split from the RHG and the East African Khoesan-speaking Hadza and Sandawe hunter-gatherers within the past 66-82 kya, suggesting these populations could have originated from a historically more widespread population of hunter-gatherers. By contrast, the San diverged from all non-Khoesan speaking populations ~100-120 kya. The divergence times of Niger-Kordofanian, Nilo-Saharan and Afroasiatic speaking populations were within the past ~22 to 41 kya. In the RHG populations, the oldest divergence was found between Eastern and Western RHG at ~36-51 kya; the time of divergence of the western RHG populations was inferred to be ~12-18 kya.
Based on the ADMIXTURE analysis, Niger-Kordofanian and RHG populations were pooled for analyses of natural selection. We observed signatures of positive selection at genes involved in muscle development, bone synthesis, reproduction, immune function, energy metabolism, cell signaling, and neural development.
This work is supported by NIH grants 1R01DK104339-01, 1R01GM113657-01, and DP1 ES022577-04 to SAT. The sequencing was funded by the Simons Foundation (SFARI 280376) and the U.S. National Science Foundation (BCS-1032255) grants to DR.
The Genome Diversity in Africa Project: A deep catalogue of genetic diversity across Africa.
D. Gurdasani 1,2, J.P. Martinez 1, M.O. Pollard 1,2, T. Carstensen 1,2, C. Pomilla 1,2, GDAP Investigators 1,2 . 1) Wellcome Trust Sanger Institute, Cambridge, Cambridgeshire, United Kingdom; 2) Department of Medicine, University of Cambridge, Cambridge.
While recent efforts have greatly extended our understanding of genetic diversity in Africa, current sequence panels are limited in their capture of African genetic variation. Deeper sequencing with sampling of diverse indigenous populations is needed to capture diverse haplotypes across Africa. The Genome Diversity in Africa Project (GDAP) aims to characterise diversity from representative populations across all of Africa, including from several indigenous hunter-gatherer populations across the region. This would provide an important global resource to understand human genetic diversity and provide insight into population history and migrations across Africa in recent times. The project has completed sequencing of 575 samples across 23 populations in Africa, including populations from the Gambia, Ghana, Morocco, South Africa, Sudan, Chad, Kenya, South Africa, Uganda, Egypt and Ethiopia. Here, we present preliminary results from the project on 133 samples from 5 ethno-linguistic groups from Morocco, Ghana (Ashanti), Nigeria (Igbo), Kenya (Kalenjin) and South Africa (Zulu) sequenced on the Hiseq X platform (30x).
Methods Reads were mapped to the GRCh38 reference. Following quality control, variant sites were called using HaplotypeCaller v3.5 for each sample to generate gVCFs. GenotypeGVCFs was run across all samples for joint calling. VCFs were filtered using VQSR calibrated on DP, QD, FS, SOR, ReadPosRankSum and MQRankSum annotations. A tranche sensitivity threshold of 99.5% was applied for filtering of SNPs and 99% for indels. Only sites called in >90% of individuals were included.
Results We identified 25.1M SNPs and 2.9M indels among 133 individuals in the GDAP pilot phase, with 25% and 47% of SNPs and indels being novel (not in dbSNP141), respectively. A large proportion of variants per population were private, varying from 12-18%, being greatest among the Kalenjin and Zulu. We found the highest level of heterozygosity and genetic variation among the Zulu, consistent with reported Khoe-San admixture in this group.
Conclusions We present the pilot phase of the Genome Diversity in Africa Project, identifying a high level of diversity across 5 populations from Africa. Inclusion of indigenous population groups, such as the Hadza, Twa Pygmies, and Ju/’hoansi in the next phase will materially advance the understanding of genetic diversity across African populations, and provide an invaluable resource to researchers worldwide.
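The last filtering step in the methods, keeping only sites called in more than 90% of individuals, is simple enough to sketch. This is only an illustration of the idea and not the GDAP pipeline: the site names, the genotype encoding (None for a missing call) and the function names are all hypothetical, and a real analysis would apply the filter to VCF files with dedicated tools.

```python
def call_rate(genotypes):
    """Fraction of individuals with a non-missing call at one site.

    genotypes: one entry per individual; None marks a missing genotype.
    """
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def filter_sites(sites, min_call_rate=0.9):
    """Keep only sites called in strictly more than min_call_rate of individuals."""
    return {
        site: gts for site, gts in sites.items()
        if call_rate(gts) > min_call_rate
    }


# Invented toy data: three sites genotyped in 20 individuals.
toy_sites = {
    "site_A": [0] * 20,               # called in 20/20 individuals
    "site_B": [1] * 19 + [None],      # called in 19/20 individuals
    "site_C": [2] * 18 + [None] * 2,  # called in 18/20, exactly 90%
}

kept = filter_sites(toy_sites)
print(sorted(kept))
```

Because the abstract says ">90%", the sketch treats the threshold as strict, so a site called in exactly 90% of individuals is dropped. That reading is my assumption.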
High-coverage sequencing of the Human Genome Diversity Project (HGDP-CEPH) Panel.
S. McCarthy 1, A. Bergström 1, Y. Xue 1, Q. Ayub 1, S. Mallick 2,3,4, M. Sandhu 1, D. Reich 2,3,4, R. Durbin 1, C. Tyler-Smith 1. 1) Wellcome Trust Sanger Institute, Cambridge, United Kingdom; 2) Department of Genetics, Harvard Medical School, Boston, MA; 3) Broad Institute of Harvard and MIT, Cambridge, MA; 4) Howard Hughes Medical Institute, Boston, MA.
We discuss the completion of high coverage (>30x), whole-genome sequencing of all 952 core individuals in the Human Genome Diversity Panel (HGDP-CEPH), with the results being made available as an open access population data resource. This widely used panel contains samples from 52 populations spanning Africa, the Middle East, Europe, Asia, Oceania and the Americas, and previous genotype data from these samples have been an important reference resource for human genetic diversity. As seen in the 1000 Genomes Project, having fully open access data, unencumbered by managed access restrictions and other hurdles, is an invaluable driver for democratized data analysis and methods development. Building on previous sequencing efforts by the Simons Genome Diversity Project, we have completed sequencing of the panel and are making the data available via the ENA and the 1000 Genomes Project data management successor, the International Genome Sample Resource (IGSR) (www.internationalgenome.org). All data has moved to the new GRCh38 reference and we present preliminary results on the call set derived from this data. We have GATK HaplotypeCaller and fermikit primary calls, are making mpileup and freebayes calls, and will present an integrated call set that has been computationally phased, together with initial population genetic analyses. A small number of samples are being experimentally phased using 10X Genomics technology which will allow evaluation of phasing accuracy, and also unbiased use of haplotype-based analyses such as MSMC.
Fine-scale identity-by-descent and birth records in Finland provide insights into recent population history.
A.R. Martin 1,2, S. Kerminen 3, A.S. Havulinna 4, A. Sarin 3, A. Palotie 1,2,3, V. Salomaa 4, S. Ripatti 3, M. Pirinen 3, M.J. Daly 1,2. 1) Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA; 2) Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA; 3) Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland; 4) National Institute for Health and Welfare (THL), Helsinki, Finland.
Finland provides unique opportunities to investigate both population and medical genomics because of its adoption of unprecedented uniformity in national electronic health records, concerted coordination of research centers across the country, detailed historical records, as well as recent population bottlenecks that drove specific disease alleles to high frequency. We investigate recent population history (up to ~50 generations ago), particularly relevant to rare, disease-conferring alleles, using identity-by-descent (IBD) haplotype sharing in >10,000 Finns. We compare IBD sharing in Finland to nearby Scandinavian countries with considerably different population histories, including >8,000 Swedes and >30,000 Danes. We find drastically more sharing on average in Finns, including many long tracts. By leveraging fine-scale birth record data, we find a non-linear decay of pairwise IBD sharing with increasing distance across Finland. This arises from pockets of excess IBD sharing; e.g. pairs of individuals from northeast Finland share on average several-fold more of their genome IBD than pairs from southwest regions containing the major cities of Turku and Helsinki. We demonstrate inference of recent migration patterns from IBD sharing patterns. For example, high IBD sharing in northeast Finland radiates from north to south rather than to the west, indicating that migration is restricted near the Russian border. We also investigate recent effective population size changes across regions of Finland and find evidence supporting the distinction between early and late settlement areas. However, our results indicate a more continuous flow of migration than previously posited, with a minimum Ne occurring ~12 generations ago in the northernmost Lapland region and moving further back in time to the south, with a bottleneck detectable in the early settlement area ~40 generations ago.
Lastly, we leverage IBD sharing for genetic disease mapping and show that rare, functional haplotypes show more significant association via IBD mapping than single variants with linear mixed effect models.
Y-chromosomal composition of mediaeval and contemporary populations in Norway and adjacent Scandinavian countries: Y-STR haplotypes and the rare Y-haplogroup Q.
B. Berger 1, S. Willuweit 2, H. Niederstätter 1, P. Kralj 1, L. Roewer 2, W. Parson 1,3. 1) Institute of Legal Medicine, Medical University of Innsbruck, Innsbruck, Austria; 2) Department of Forensic Genetics, Institute of Legal Medicine and Forensic Sciences, Charité-Universitätsmedizin, Berlin, 13353, Germany; 3) Forensic Science Program, The Pennsylvania State University, PA, USA.
In the framework of the project “Immigration and mobility in mediaeval and post-mediaeval Norway” molecular genetic analyses were performed on 97 pre-modern human remains including genetic sexing and Y-chromosomal DNA typing. All samples were subjected to molecular genetic analyses of the sex using “Genderplex” consisting of two different regions of the amelogenin gene, SRY and four X-STR loci. Sex assignment was possible for 90% of the extracted remains (n=87). Of these, 49 (56.3%) gave a genetically male result. All of these DNA extracts were subjected to Y-STR analysis using Yfiler Plus PCR Amplification Kit (Thermo Fisher Scientific) and/or PowerPlex Y23 System (Promega). At least partial Y-STR profiles were obtained from all samples. A detailed comparison between mediaeval/post-mediaeval and contemporary Y-chromosomes was performed by searching the obtained haplotypes (HTs) in the Y Chromosome Haplotype Reference Database (YHRD: https://yhrd.org) comprising 154,329 haplotypes from 991 populations in 129 countries at the time of query (Release 50). YHRD searches of the pre-modern haplotypes yielded full matches plus neighbor-matches differing at only one allele from the query HT. Matches are presented with geographical and ancestry information of the contemporary HTs. For samples without direct YHRD-matches, this information is provided through their neighbor HTs. AMOVA was performed using the YHRD online tool on pairwise RST values to create the corresponding MDS plots. The pre-modern HTs were grouped according to mediaeval and post-mediaeval origin and compared to contemporary Scandinavian (Norwegian, Swedish and Danish), Northwest European and Northeast European populations. Both pre-modern populations showed small genetic distances to contemporary Scandinavians and larger distances to Northeast Europeans with Northwest European populations in between.
As expected, an initial assessment of the Y-chromosomal haplogroups (HGs) showed that most of the samples were attributable to the main European HGs I1, R1a and R1b. However, one of the HTs seemed to be associated with HG-Q which is rare in Europe and hitherto little evaluated in this region. Network analysis was applied for detecting similar HTs in contemporary samples from Norway and adjacent Northern European countries stored in the YHRD. The outcomes of this survey should initiate a detailed SNP based HG-assessment of HG-Q candidate samples.
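The distinction the abstract draws between full matches and neighbor-matches (haplotypes differing from the query at only one allele) can be sketched in a few lines. This is an illustration only and not the YHRD search itself: the dictionary layout, the function names and the decision to compare only loci typed in both profiles (to cope with partial profiles) are all assumptions.

```python
from typing import Dict, List

# Hypothetical layout: locus name -> allele, kept as strings because some
# Y-STR alleles are not plain integers (e.g. "17.2").
Haplotype = Dict[str, str]


def allele_differences(a: Haplotype, b: Haplotype) -> int:
    """Count loci typed in both haplotypes where the alleles differ.

    Returns -1 when the two profiles share no typed loci, so that an
    empty comparison is never mistaken for a full match.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return -1
    return sum(a[locus] != b[locus] for locus in shared)


def classify_matches(
    query: Haplotype, database: Dict[str, Haplotype]
) -> Dict[str, List[str]]:
    """Split database entries into full matches and one-allele neighbors."""
    result: Dict[str, List[str]] = {"full": [], "neighbor": []}
    for name, haplotype in database.items():
        diff = allele_differences(query, haplotype)
        if diff == 0:
            result["full"].append(name)
        elif diff == 1:
            result["neighbor"].append(name)
    return result
```

Counting differing loci treats a one-repeat and a three-repeat difference alike; whether the database applies a stricter notion of "neighbor" is not stated in the abstract.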
Evidence for detailed historical European population structure from large-scale, diverse genetic polymorphism data.
P. Carbonetto 1, J. Byrnes 1, J.M. Granka 1, Y. Wang 1, K. Noto 1, E. Han 1, A.R. Kermany 1, K.A. Rand 1, E. Elyashiv 1, H. Guturu 1, N.M. Myres 2, E.L. Hong 1, R.E. Curtis 2, K.G. Chahine 2, C.A. Ball 1. 1) Ancestry.com DNA, LLC, San Francisco, CA; 2) Ancestry.com DNA, LLC, Lehi, UT.
Despite the recent surge of interest in ancient genomes, we show that there is still much to be elucidated about human demography from contemporary genomes. Here, we demonstrate the use of genealogical data to generate demographic insights from analysis of a large-scale, heterogeneous genetic data set. Specifically, we show that an unsupervised ADMIXTURE analysis of genotypes from 131,293 primarily US-born individuals, followed by a simple statistical analysis of the 3 million pedigree records linked to these genotype samples, yields novel insights into European genetic diversity. In contrast to principal component analysis (PCA), which is the most widely used approach to investigating European genetic diversity, we use ADMIXTURE to infer genetically differentiated source populations reflecting more distant historical time periods. Unsurprisingly, among European-origin individuals, admixture is pervasive. Despite this, our ADMIXTURE analysis with K = 12 ancestral populations identifies 5 stable, genetically differentiated groups within Europe (with putative historical counterparts in parentheses): Ashkenazi Jewish, Irish (Celts), Eastern Europeans (Slavs), Scandinavians (Nordics) and Iberians, featuring Basques and Sardinians. The genealogical data also allow us to provide a detailed portrait of the genetic composition of contemporary peoples across North America (e.g., Iberians in Cuba), and other parts of the world. This work suggests the potential for drawing more detailed connections between present-day and ancient genetic variation by leveraging large, heterogeneous genetic data sets.
Genomic insights into the population structure and history of the Irish Travellers.
E.H. Gilbert 1, S. Carmi 2, S. Ennis 3, J.F. Wilson 4,5, G.L. Cavalleri 1. 1) Molecular and Cellular Therapeutics, Royal College of Surgeons in Ireland, Dublin, Leinster, Ireland; 2) Braun School of Public Health, The Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel; 3) School of Medicine and Medical Science, University College Dublin, Dublin, Ireland; 4) Centre for Global Health Research, Usher Institute for Population Health Sciences and Informatics, University of Edinburgh, Teviot Place, Edinburgh, Scotland; 5) MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, Scotland.
Aims: The Irish Travellers are a population with a history of nomadism. Consanguineous unions are common, and as a population they are socially and genetically isolated from the surrounding, “settled” Irish population. A previous low-resolution genetic analysis suggested a common Irish origin between the settled and the Traveller populations. What is not known, however, is the extent of population structure within the Irish Traveller population, the time of divergence from the general Irish population, and the extent of autozygosity.
Methods: We recruited Irish Travellers from across Ireland and the UK. To be included, a participant had to have at least three grandparents with a surname associated with the Irish Travellers. DNA was extracted from saliva samples, and genotypes were generated using the Illumina OmniExpress SNP genotyping platform. With this data, we investigated population structure using fineSTRUCTURE, quantified the levels of autozygosity with PLINK, and estimated a time of divergence using a method based on Identity by Descent (IBD) segment identification.
Results: We merged, cleaned, and analysed data from 42 Irish Travellers, 2232 settled Irish, 2039 British, 143 Roma Gypsies, and 931 individuals from 57 world-wide populations. We confirm an Irish origin for the Irish Travellers, demonstrate evidence for population substructure within the population, confirm high levels of autozygosity consistent with a consanguineous population, and for the first time provide estimates for a date of divergence between the Irish Travellers and settled Irish.
Conclusion: Our findings have implications for disease mapping within Ireland, as well as on the social history of the Irish Traveller population.
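The abstract does not say which summary of autozygosity was reported. One widely used measure is F_ROH, the fraction of the autosomal genome lying in runs of homozygosity (ROH), which can be computed from the segments a tool such as PLINK detects. The sketch below assumes the segments are already available as non-overlapping (start, end) base-pair pairs; the function name and the genome length supplied by the caller are illustrative assumptions, not values from the study.

```python
from typing import Iterable, Tuple


def f_roh(
    roh_segments_bp: Iterable[Tuple[int, int]], autosome_length_bp: int
) -> float:
    """Fraction of the autosomal genome covered by runs of homozygosity.

    roh_segments_bp: non-overlapping (start, end) positions in base pairs,
        pooled across all autosomes of one individual (assumed input).
    autosome_length_bp: total autosomal length covered by the SNP array,
        supplied by the caller (an assumption, not a fixed constant).
    """
    if autosome_length_bp <= 0:
        raise ValueError("autosome_length_bp must be positive")
    total = 0
    for start, end in roh_segments_bp:
        if end < start:
            raise ValueError("segment end precedes start")
        total += end - start
    return total / autosome_length_bp
```

Higher values indicate more of the genome inherited identically from both parents, which is the pattern expected in a population where consanguineous unions are common.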
Personal ancestry inference at the finest scale reveals more sub-structure in the UK.
D. Lawson, G. Weyenberg. Integrative Epidemiology Unit, University of Bristol, Bristol, United Kingdom.
Chromosome Painting has revealed genetic differences within the UK at a very fine scale [1], with structured genetic variation within a single county in some cases (such as Cornwall & South Wales). However, in that work, it was not possible to genetically distinguish much of England, which appeared as a single homogeneous group. Here, we describe an extension to the fineSTRUCTURE [2] clustering that can further distinguish ancestry even within England; for example, identifying regions such as Norfolk, the Midlands and the South as genetically distinct. The approach works by using the known county locations to craft genetic features to use in unsupervised clustering. Specifically, we group individuals by their geographic sampling location into reference donor populations. This forms an ancestry profile - which can be viewed as a careful choice of feature vector - that still allows unsupervised genetic clustering for all individuals. Further, we describe how this approach allows individuals to be described as an admixture of the inferred geographical clusters. This allows ancestral information to be recovered for individuals who are not purely represented by a single geographical location. This also allows us to characterise the genetic relationship between the inferred clusters, several of which represent drift that is most strongly represented by a particular geographical region (including Cornwall, Wales, Scotland and the North of England) and others of which represent characteristic admixture proportions between these ancestral drifted populations. Beyond improving resolution, this approach facilitates personal genomics because individuals can be represented in terms of the fixed reference panel. We demonstrate the utility of the approach by describing the ancestry of the UK10K participants in terms of the new, high resolution POBI clusters.
Previously, a similar analysis [3] without geographical information inferred little population structure in the UK from these samples, but now we have a rich representation of their population structure, including an assessment of admixture from outside the UK. This highlights the value in high quality fine-scale geographic sampling, which could now facilitate this level of ancestry identification for many other countries.
[1] Leslie et al 2015, Nature 519:309–314. [2] Lawson et al 2012, PLoS Genet. 8:e1002453. [3] UK10K Consortium 2015, Nature 526:82–90.
Chromosome painting for arbitrary sample collections.
G. Weyenberg, D. Lawson. Integrative Epidemiology Unit, University of Bristol, Bristol, United Kingdom.
Haplotype-based methods have been demonstrated to be capable of detecting fine scale structure within human populations, to the point of distinguishing genetic variation at the sub-county level in the South West of England [1]. However, the aforementioned method implements an all-against-all analysis of sampled individuals, which is not suited to all applications, including personal genomics where samples are obtained individually or in small batches. Here, we describe an extension of the fineSTRUCTURE [2] method to allow for painting of individual samples against a panel of pre-calculated reference haplotype clusters, making the method computationally feasible for on-demand analysis of individuals. The choice of the reference panel also allows the user to tailor the analysis to emphasise targeted features of the data. For example, in the context of a personal ancestry imputation, panels may be constructed to focus on global-, continental-, or national-scale genetic features, and the low computational cost of painting an individual against a pre-computed panel makes sample-level exploratory analysis feasible. Another application of the panel-based painting is to use high-quality reference data to impute unknown geographical labels to samples where such information is either unavailable, or was collected at an undesirable resolution. To demonstrate the latter application, we analysed several populations with suspected Northern-European ancestry (including the HapMap CEU and ASW populations, and the UK10K dataset) with respect to panels of Europeans and the high-resolution People of the British Isles (POBI) samples. These individuals are characterised in terms of an admixture of inferred clusters in the reference populations. Whilst many individuals were best described as a complex admixture that likely occurred over many generations, many others had a clear signal of geographically distinct ancestry.
[1] Leslie et al 2015, Nature 519:309–314. [2] Lawson et al 2012, PLoS Genet. 8:e1002453.
Local ancestry patterns inferred from one million genomes recapitulate fine-scale population history.
Y. Wang 1, K. Noto 1, J. Byrnes 1, R.E. Curtis 2, E. Han 1, E. Elyashiv 1, H. Guturu 1, P. Carbonetto 1, A.R. Kermany 1, J.M. Granka 1, K.A. Rand 1, N.M. Myres 2, E.L. Hong 1, C.A. Ball 1, K.G. Chahine 2. 1) Ancestry.com DNA, LLC, San Francisco, CA; 2) Ancestry.com DNA, LLC, Lehi, UT.
In a country of immigrants, population structure is shaped by a long, ongoing history of immigration, followed by subsequent admixture and migration. All these events have left their footprints in the genomic landscape of current residents and make it possible for geneticists to reconstruct population history from genomic data. However, deciphering the signature of these forces requires accurate inference of genomic tracts that one individual inherits from ancestors of different origins. Previously, several methods have been developed for inferring local ancestry with varying levels of success. Unfortunately, none of these methods can be feasibly applied to a data set of one million genomes. Recently, our team presented Polly, a novel algorithm for estimating genome-wide ancestry proportions in admixed samples. Polly, built on a modified version of the BEAGLE haplotype model, relies on this model to achieve two things: first, to account for phasing uncertainty, and second, to provide a measure of distance between a query haplotype and a reference haplotype. Using haplotype models learned from hundreds of thousands of haplotypes and subsequently annotated with over eight thousand single-origin reference individuals, Polly performs ultra-fast inference of both global and local ancestry. In this study, we evaluate Polly's accuracy in predicting local ancestry using simulated admixed samples with known genomic composition. We assess the assignment accuracy, the switching pattern and the tract length distribution. Using cross-validation experiments, we confirm that Polly makes highly accurate local ancestry estimates even at the subcontinental level. We further use Polly to analyze one million genomes from the United States and discover distinct local ancestry patterns among different ethnic groups and communities, especially among African Americans and Latino Americans. We map local ancestry estimates to individuals’ geographic locations.
Our results illustrate clear population structure arising from immigration routes, assortative mating and isolation by distance. We also find evidence that supports large scale domestic migration events, as exemplified by the Great Migration of African Americans following the abolition of slavery. Finally, we attempt to date known historical events from ancestry tract length distributions. Overall, our analysis demonstrates the power of combining local ancestry analysis with big data in studying fine-scale population history.
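The abstract does not describe how events were dated from tract lengths. As a rough illustration of why tract lengths carry timing information, the sketch below uses a textbook simplification rather than anything attributed to the authors: under a single pulse of admixture T generations ago, tracts from a source contributing proportion m are approximately exponentially distributed with rate (1 - m) * T per Morgan, so T can be estimated from the mean tract length. The function name and inputs are hypothetical.

```python
from typing import Sequence


def estimate_generations_since_admixture(
    tract_lengths_cm: Sequence[float], source_proportion: float
) -> float:
    """Estimate generations since a single admixture pulse.

    Assumes (as a simplification) exponentially distributed tract lengths
    with rate (1 - m) * T per Morgan, where m is the admixture proportion
    of the source whose tracts are measured.

    tract_lengths_cm: lengths of that source's ancestry tracts in cM.
    source_proportion: m, with 0 <= m < 1.
    """
    if not tract_lengths_cm:
        raise ValueError("need at least one tract")
    if not 0.0 <= source_proportion < 1.0:
        raise ValueError("source_proportion must be in [0, 1)")
    mean_morgans = sum(tract_lengths_cm) / len(tract_lengths_cm) / 100.0
    if mean_morgans <= 0:
        raise ValueError("tract lengths must be positive")
    # Maximum-likelihood rate of an exponential is 1 / mean.
    rate_per_morgan = 1.0 / mean_morgans
    return rate_per_morgan / (1.0 - source_proportion)
```

The intuition is that recombination breaks ancestry tracts up a little more every generation, so shorter tracts point to older admixture. Real data involve continuous or repeated migration, truncated short tracts and inference error, all of which this sketch ignores.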