The American Society of Human Genetics is holding its annual conference from 18th to 22nd October in Vancouver, Canada. The Platform and Poster Abstracts are
now available online. The research presented at this meeting gives a taste of some of the publications and developments to come in the next year or so. A number of the abstracts are of particular interest to genetic genealogists. Notably, AncestryDNA are presenting several posters which hint at some new tools that might be on the way. I've highlighted my picks from the conference programme below.
23andMe will also be at the ASHG meeting. They have published a
list of the abstracts for their presentations and posters on their blog, though none of the content is of direct interest to genetic genealogists.
Platform Abstracts
74
Ultra-fine structural inference and population assignment
using IBD network clustering and classifiers accurately assign sub-continental
origins represented in a large admixed U.S. cohort.
E. Han, R. Curtis, P.
Carbonetto, K. Noto, J. Byrnes, Y. Wang, J. Granka, A. Kermany, K. Rand,
E. Elyashiv, H. Guturu, N. Myres, E. Hong, C. Ball, K. Chahine. Ancestry.com
DNA, LLC,
San Francisco, CA.
Motivation & Objectives: Identifying the geographic
origin of individuals using genetic data has broad application in forensics,
human disease and evolution. There have been multiple methods proposed to
achieve this goal, such as Principal Component Analysis (PCA), Spatial Ancestry
Analysis (SPA) and Geographic Population Structure (GPS). However, most methods
suffer from decreased prediction accuracy outside Europe and do not apply to
the US population comprised of admixed immigrants. In this study, we describe a
new method and demonstrate its accuracy in predicting geographic origins in the
US post-European colonization or internationally for single origin and admixed
samples.
Methods: We use a database of over 1.5 million consented genotype
samples collected from the US and internationally, along with samples from
public databases such as POBI. We build a genetic network by estimating the
amount of identity-by-descent (IBD) sharing between all individuals. By
iteratively applying the Louvain method for community detection, we find a
hierarchy of genetic clusters in the network. Leveraging user-generated pedigrees
going back 6-8 generations, we annotate each cluster with birth locations that
are enriched in historical time periods. The birth locations of these clusters
are generally specific to locations in the US or internationally, allowing for concise
geographical interpretation. Although community detection results assign
samples to only one cluster, we use machine learning classification to assign
samples to multiple clusters. Given this classification and enriched birth
locations, we identify the likely geographic origins of each sample.
Results:
Our results include over 300 stable clusters, each comprised of more than 1000
samples. Some clusters correspond to narrow geographical regions, such as
people descended from southern West Virginia in the 19th century, and others to
broader groups, such as European Jews from Poland. By using the associated
pedigrees, we demonstrate the accuracy of these predictions: over 95% of the
assigned individuals have at least one known ancestor born in the enriched
region defined by most clusters.
Conclusion: By utilizing large-scale genetic
data with associated pedigrees, we have developed the first method for
predicting the geographic origin of individuals within the US or
internationally with high accuracy. This approach can be used for ultra fine
scale genetic ancestry mapping in any population.
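The Louvain method mentioned in the Methods works by greedily maximising a score called modularity, which rewards partitions whose clusters share more IBD internally than chance would predict. The sketch below is not Ancestry's pipeline and does not implement Louvain itself; it only computes weighted modularity for a toy network, to show why a sensible two-cluster split scores higher than lumping everyone together. The sample names and the shared-centimorgan weights are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted Newman modularity of a partition.

    edges: iterable of (node_u, node_v, weight), each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    total = 0.0                      # total edge weight in the network
    internal = defaultdict(float)    # weight of edges inside each community
    degree = defaultdict(float)      # summed weighted degree of each community
    for u, v, w in edges:
        total += w
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    if total == 0:
        return 0.0
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )


# Hypothetical IBD network: two tight groups joined by one weak link.
toy_edges = [
    ("A", "B", 10.0), ("A", "C", 10.0), ("B", "C", 10.0),
    ("D", "E", 10.0), ("D", "F", 10.0), ("E", "F", 10.0),
    ("C", "D", 1.0),
]
split = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}
lumped = {name: 1 for name in split}

q_split = modularity(toy_edges, split)
q_lumped = modularity(toy_edges, lumped)
print(f"two clusters: {q_split:.3f}, one cluster: {q_lumped:.3f}")
```

A full Louvain run would repeatedly move samples between clusters to increase this score and then merge clusters into super-nodes, which is what produces the hierarchy of clusters the abstract describes.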
251
A massively scalable phenotyping approach using social media
for genetic studies.
J. Yuan1,2, A. Gordon1, D. Speyer1,2, D. Zielinski1, R.
Aufrichtig1, J. Pickrell1,3, Y. Erlich1,2. 1) New York Genome Center, New York,
NY; 2) Computer Science, Columbia University, New York, NY; 3) Biological
Sciences, Columbia University, New York, NY.
While DNA sequencing is largely a
tractable problem, massive phenotyping is still a challenge, especially for
Internet-based studies. Traditional methods, such as physical exams, scale
poorly for large numbers of individuals. Questionnaires are easier to collect,
but administering lengthy or frequent questionnaires creates a negative experience
for participants, leading to lower completion rates. Electronic health records
are a great resource for phenotypes, but they exhibit large heterogeneity when
collected from various resources and are subject to an array of confidentiality
restrictions that complicate their collection. Recent studies have highlighted
the value of obtaining digital phenotypes by interpreting the interactions of
users with digital outlets as a reflection of underlying traits. In particular,
these studies have shown that social media data enables the collection of
various phenotypes including big five personality traits, sexual orientation,
sleeping patterns, and even heart rate from regular user videos. The ubiquity
of the data and its ease of collection through standard APIs enable a new
methodology for large scale phenotypic collection. Here, we report our ongoing
efforts to enable participants to donate their social-media data along with
their genomes in order to understand the genetics of digital phenotypes. In our
previous work, we developed DNA.Land (https://dna.land), an online platform
where users may register and securely contribute their Direct to Consumer
genomic data, as well as receive reports of ancestry and shared relatives with
other DNA.Land users. Since our launch at ASHG 2015, we have obtained over
20,000 users, many of whom have been eager to share personal information such
as family history. We are now building a new component in DNA.Land in which
users can contribute their Facebook data for scientific studies. We will
present our IBM Watson-based system to predict traits from social media data
and will describe the type of information DNA.Land users will receive. In
addition, we will discuss the particular challenges in collecting this data with
respect to both computational efforts and privacy concerns. Our approach is
applicable for other types of large scale efforts such as the Precision
Medicine Initiative and can easily scale to millions of people.
Poster Abstracts
1039W
Insights into the geographical distribution of genetic
admixture of unrelated volunteer donors and recipients of stem-cell
transplants.
A. Madbouly 1, K. Besse 1, Y. Wang 2, J. Byrnes 2, C. Ball 2, N.
Myres 2, M. Maiers 1. 1) Bioinformatic Research, National Marrow Donor Program,
Minneapolis, MN; 2) Ancestry.com, San Francisco, CA, USA.
Genetic ancestry of self-described groups may vary across
geographic locations in the US, a phenomenon documented anecdotally but not
thoroughly explored in the literature. We studied the genetic ancestry of 995
HLA matched donor/recipient (DR) pairs from the Be The Match® registry with a
focus on regional ancestry differences among ethnic groups. We hypothesized
that, along with historical events, donor/transplant center distribution and
socioeconomic factors might influence the geographical spread of some genetic
admixtures. We genotyped 995 DR pairs on the Illumina OmniExpress chip with
approximately 730,000 SNPs. Self-reported race and ethnicity was collected for
donors at the time of registry recruitment. Recipients’ race and ethnicity was
recorded at the transplant hospital once at the time of diagnosis and again
after transplant. The majority of the study cohort (94%) self-identified as
European Caucasian (CAU). The rest identified as Hispanic (HIS) (3.5%),
African-American (1%) and Asian or Pacific Islander (1.5%). Address zip code
information was available for 99% of recipients but only 59% of donors. Genetic
ancestry was estimated by applying the AncestryDNA ethnicity estimator
pipeline, which provides a vector of 26 admixtures. Some admixtures were
combined for the analysis due to small counts and minimal impact such as
detailed African (AFR) admixtures. We then mapped the geographical distribution
of European (EUR) and non-EUR genetic admixtures for self-reported CAU and non-CAU
individuals, optimizing geographical regions for subject privacy. The main
self-reported race groups showed average proportions of AFR and EUR admixtures compatible with Bryc and colleagues (2015).
However, our results revealed larger Amerindian admixture in self-reported HIS,
especially among recipients. When stratifying regionally, systematic differences
emerged in admixture distribution among similar race groups mostly
interpretable by historic events. Separating donors and recipients suggested
possible additional influences, such as donor and transplant center geographical
spread. Importantly, we observed differences in the distribution of
non-majority admixtures such as increased AFR admixture in self-reported CAU
donors (but not recipients) in some southern states suggesting a possible
socioeconomic link. This work has the potential of guiding stem-cell donor registry
strategies on volunteer donor recruitment and donor and transplant center
planning.
1067T
Geographic and historic changes in runs of homozygosity
among more than 1,000,000 individuals sheds light into the recent demographic
history of US population.
A. Kermany, C. Ball, J. Byrnes, P. Carbonetto, K.
Chahine, R. Curtis, E. Elyashiv, J. Granka, H. Guturu, E. Han, E. Hong, N.
Myres, K. Noto, K. Rand, Y. Wang. Ancestry.com DNA, LLC, San Francisco, CA.
Runs of Homozygosity (ROH) are indicators of segments of
chromosomes identical by descent between parental haplotypes. Distribution of
such runs along the chromosome contains information regarding the demographic
history of the population under study, in particular it reveals trends in
consanguinity. In this study, we analyze the distribution of runs of
homozygosity – chromosomal locations, number of runs and lengths of runs - as
well as estimated inbreeding coefficient (F) among more than 1,000,000
consented AncestryDNA customers. We report on observed variations in
distribution of ROH based on geographic origins - inferred from the available
pedigree data – admixture proportions as well as birth year cohort. In particular,
we present our results on variations in the distribution of ROH within 19
communities within the US population - identified based on analysing a network
of genetic matches in the database - and investigate differences in patterns of
ROH between each group and comment on the inferred demographic history within
each group.
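For readers unfamiliar with runs of homozygosity, the idea can be sketched in a few lines. This is a deliberately simplified illustration and not the method used in the poster: it treats a run as an unbroken stretch of homozygous SNP calls, measures length in SNPs rather than base pairs, and makes no allowance for genotyping errors. The genotype coding (0 and 2 homozygous, 1 heterozygous), the function names, and the minimum run length are all assumptions made for the example.

```python
def find_roh(genotypes, min_snps=5):
    """Return (start, end) index pairs of homozygous runs, end exclusive.

    genotypes: list of alternate-allele counts per SNP along one chromosome,
    where 0 and 2 are homozygous and 1 is heterozygous (assumed coding).
    """
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        if g != 1:
            if start is None:
                start = i            # a new homozygous stretch begins
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None             # a heterozygous call breaks the run
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs


def fraction_in_roh(genotypes, min_snps=5):
    """Share of typed SNPs that fall inside a run, a crude stand-in for F."""
    if not genotypes:
        return 0.0
    covered = sum(end - start for start, end in find_roh(genotypes, min_snps))
    return covered / len(genotypes)


# Hypothetical chromosome with two long runs and one short stretch that is ignored.
toy_genotypes = [0, 2, 2, 0, 0, 2, 1, 0, 2, 1, 2, 2, 2, 0, 0, 0]
print(find_roh(toy_genotypes), fraction_in_roh(toy_genotypes))
```

Long runs point to recent shared ancestry between a person's parents, while many short runs reflect older population history, which is why the distribution of run lengths is informative about consanguinity trends.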
1070T
Y-chromosomal sequencing and screening reveal both stability
and migrations in North Eurasian populations.
O. Balanovsky 1,2, V.
Zaporozhchenko 2,1, A. Agdzhoyan 1,2, I. Alborova 5, M. Kuznetsova 2, V. Urasin
3, M. Zhabagin 4, M. Chukhryaeva 2,1, Kh. Mustafin 5, C. Tyler-Smith 6, E.
Balanovska 2 . 1) Vavilov Institute of General Genetics, Moscow, Russian
Federation; 2) Research Centre for Medical Genetics, Moscow, Russia; 3) YFull
service, Moscow, Russia; 4) National Laboratory Astana, Nazarbayev University,
Astana, Republic of Kazakhstan; 5) Moscow Institute of Physics and Technology
(State University), Moscow, Russia; 6) The Wellcome Trust Sanger Institute,
Wellcome Trust Genome Campus, Hinxton, United Kingdom.
Y-chromosomal markers exhibit the highest interpopulation
diversity in the genome and thus form one of the most informative tools for
tracing population history. However, their information value depends on
discovering SNPs which subdivide haplogroups with broad geographic distribution
into branches revealing fine population structure. Progress in such discoveries
has recently moved from a slow linear phase to a rapid exponential phase due to
NGS. We applied this approach to the Y-chromosomal pool of North Eurasian populations
and concentrated on haplogroups C, G1, G2, N1b, N1c, and R1b. We sequenced 181
Y-chromosomes (capturing 11 Mb from each sample), developed the NGSConv
software for calling Y-chromosomal SNPs, and identified roughly 2,500 SNPs,
most of which were new. Then we constructed phylogenetic trees and dated dozens
of their branches using our estimates of the mutation rate. The last – but not
the least – step included screening branch-defining SNPs in the entire Biobank
of indigenous North Eurasian populations (led by prof. Elena Balanovska), which
includes 26,000 samples from 260 populations. This screening resulted in
frequency distribution maps of 29 branches of haplogroups R1b and C, thus increasing
the phylogenetic resolution by an order of magnitude compared to the two
initial haplogroups. For haplogroup R1b, we identified a previously unstudied
“eastern” branch, R1b-GG400, found in East Europeans and West Asians and
forming a brother clade to the “western” branch R1b-L51 found in West
Europeans. The ancient samples from the Yamnaya archaeological culture are
located on this eastern branch, showing that the paternal descendants of the
Yamnaya population – in contrast to the published autosomal findings - still
live in the Pontic steppe and were not an important source of paternal lineages in
present-day West Europeans. For haplogroup C-M217 - the predominant paternal
component in Central Asians - we found signals of simultaneous expansion in two
independent branches. Both expansion times and gene geographic maps of the expanded
lineages indicated the emergence of the Mongol Empire as the likely trigger. We
conclude that simply discovering new SNPs is not enough, but in combination with
screening for the branch-defining SNPs in large biobanks of indigenous
populations, it allows comprehensive reconstruction of male population history.
The study was supported by the Russian Science Foundation grant 14-14-00827 to
OB.
1079T
Admixture inference of African Americans and Latinos in the
United States through time.
M.L. Spear 1, D.G. Torgerson 2, R.D. Hernandez 1,3,4.
1) Department of Bioengineering and Therapeutic Sciences, University of
California, San Francisco, San Francisco, CA; 2) Department of Medicine,
University of California, San Francisco, San Francisco, CA; 3) California
Institute for Quantitative Biosciences (QB3), University of California, San
Francisco, San Francisco, CA; 4) Institute for Human Genetics, University of
California, San Francisco, CA.
The study of admixed populations has provided important
insights into medical genetics and population history. The genomes of admixed
individuals are mosaics of segments originating from different ancestral
populations. At the genome-wide level, the proportion of one’s genome deriving
from each ancestral population is referred to as “global ancestry proportions”.
However, modern statistical methods enable inference of the ancestry at
individual SNPs within a genome, “local ancestry”, which allow us to
reconstruct the mosaic pattern of ancestry tracts across an individual’s
genome. Local ancestry inference is critical for the analysis of admixed
genomes and has been widely studied in the fields of medical genetics and human
demographic history. Local ancestry tracts can be used to infer migration
histories but the question remains how these histories have shaped ancestry
proportions over time, particularly in the United States, a “melting pot”
country that has faced changing societal norms over the past century. It has
yet to be determined how the length distribution of ancestry tracts in admixed
individuals has changed over decades as well as how the variation in ancestry
proportions across chromosomes and individuals may differ. Thus, we estimated
local ancestry for 4,600 Latinos and 2,100 African Americans from the Genetic
Epidemiology Research on Adult Health and Aging (GERA) dataset using RFMix. With
these local ancestry tracts, we used TRACTS to compare the observed length of
the ancestry tracts to predictions of different demographic models of migration
scenarios. Individuals were grouped by 5-year birth year categories, and
comparisons were made between the demographic models generated from each birth
year category. Overall, the local ancestry tracts of African Americans and
Latinos from the United States have provided insights into the change in
complexity of their genetic structure throughout the 20th century.
1095F
Fine-scale population structure in France: Loire River as
genetic barrier.
C. Dina 1,2, J. Giemza 1, M. Karakachoff 1,2, F. Simonet 1,2,
K. Rouault 3, E. Charpentier 1,2, S. Lecointe 1,2, P. Lindenbaum 1, J. Violleau
1,2, H. Le Marec 1,2, C. Férec 3, S. Chatel 1,2, S. Hercberg 4, P. Galan 4,
J-J. Schott 1,2, E. Génin 3, R. Redon 1,2. 1) Thorax Inst, INSERM-CNRS, Nantes,
France; 2) CHU Nantes, Nantes University; 3) Inserm UMR 1078, CHRU Brest, University
Bretagne Occidentale, EFS, Brest France; 4) Université Paris 13, Equipe de
Recherche en Epidémiologie Nutritionnelle, Centre de Recherche en Epidémiologie
et Statistiques, Inserm (U1153), Inra (U1125), Cnam, COMUE Sorbonne Paris Cité,
F-93017, Bobigny, France.
Background The genetic structure of human populations varies
throughout the world, being influenced by migration, admixture, natural
selection and genetic drift. Human population structure has first been
investigated at broad scales, between and within continents. Currently researchers
focus on finer scales, examining genetic structure within countries. Characterising
such genetic variation is of interest as it provides insight into demographical
history and informs research on disease association studies, especially on rare
variants. We here explored the genetic structure of a population living on the French
territory (hereafter called French population) both on the whole territory and
then on Western part where interesting stratification was identified.
Methods and
Results We genotyped two cohorts genome-wide: 2276 individuals with known department of
origin from the French population (SU.VI.MAX study), using an Illumina chip; and 456
individuals (PREGO study) from the Atlantic coast of Western France, from Finistère to
Vendée, with at least three of their grandparents born within a 15-kilometre
distance, using the Axiom CEU chip. With EEMS software we visualised areas with low
effective migration rates - the migration barriers, which match with
geographical features, with particularly strong barrier on the lower course of
Loire in Western France. We then focused on the PREGO study and Principal
Components analysis revealed that individuals from the same departments form
clusters. In both datasets we observed a high correlation between geographical
position and components (p-value < 2e-16). Many independent methods support
the hypothesis that Loire River is a genetic barrier. The two groups of
individuals, from north or south of Loire, are well differentiated along PC1
axis. ADMIXTURE estimated different ancestry proportions for the two groups.
The first split of hierarchical clustering returned by fineSTRUCTURE, and the
one based on normalized counts of identity-by-descent segments is between north
and south of Loire.
Conclusion We here report genetic stratification at the
level of continental French territory. The migration pattern is following the
geographical structure. A specific pattern is noticed around the Loire River.
We confirm both evidence for isolation by distance and existence of a genetic
barrier, the Loire River. The discovered fine-scale population structure may
have consequences in association analyses, especially for rare variants which
tend to be geographically clustered.
1106T
Identification and characterization of common
haplotypes found in a database of one million human genomes.
H. Guturu 1 , K.
Noto 1 , J. Byrnes 1 , S. Song 1 , P. Carbonetto 1 , R.E. Curtis 2 , E.
Elyashiv 1 , J.M. Granka 1 , E. Han 1 , E.L. Hong 1 , A.R. Kermany 1 , N.M.
Myres 2 , K.A. Rand 1 , Y. Wang 1 , C.A. Ball 1 , K.G. Chahine 2 . 1)
Ancestry.com DNA, LLC, San Francisco, CA; 2) Ancestry.com DNA, LLC, Lehi, UT.
Introduction: A common DNA-based method to detect relatives
and ancestors (“cousins”) is to identify and match shared portions of
chromosomes (haplotype blocks) between an individual and their potential
relatives. Identifying and matching the shared haplotype blocks is challenging
due to the non-uniform halving of genetic information that takes place during
the meiosis events of each generation. As the number of generations increases,
the average size of matching haplotype blocks shrinks, due to successive
chromosomal recombination. Additionally, genetic drift, flow and selection
establish population structure that skews the distribution of frequency and
size of some haplotype blocks. We aim to characterize haplotype blocks based on
their frequency profiles and link haplotypes to ancestral communities
(“genetic ethnicities”) and more recent admixed communities.
Methods: Using a
novel haplotype block matching algorithm, we identify haplotype blocks that
occur frequently in a database of over one million samples genotyped by
Ancestry.com DNA, LLC. We review the frequency profiles of each haplotype, and
associate them with metadata inferred from global and local estimated admixture
("genetic ethnicity") as well as aggregated family history data from
public family trees associated with some of the genotypes.
Results: Common SNP windows have been characterized as
identifying signatures of the gamut from ethnicities to more recent admixed
communities resulting from migration. Further, we show that these signals of
ethnic populations and communities can be used to improve the accuracy of
identifying distant “cousin” matches by correcting for matches that are
predominately generated due to more ancient signals of ancestry.
Conclusion: By
linking common haplotype blocks to ancestral groups of varying age of origin,
we can improve the accuracy of ancestor identification for the desired task –
ancient haplotype blocks for ethnicity admixture detection to more recent
haplotype blocks that reflect recent cousins. Additionally, our
characterization of haplotype blocks by ancestral groups reveals interesting
candidates for further study and interpretation of their functional
implications in various ethnic and community groups.
1107F
Maps of effective migration as a summary of human genetic
diversity.
B. Peter, D. Petkova, M. Stephens, J. Novembre. University of
Chicago, Chicago, IL.
A dominant pattern of genetic diversity in humans is that
geographically proximal populations are generally more genetically similar to
one another; however, there are exceptions to this rule. Persistent
geographical features such as mountains, oceans, or deserts, have allowed
excess genetic differences to accumulate in some regions more than others.
Conversely, historical migrations and population movements have led to cases
where exceptional levels of similarity persist across large geographic
distances. To provide more insight into how genetic differentiation is
distributed geographically in humans, we examine the fine-scale genetic
structure of humans. We produce maps that represent the spatial structure of
human genetic diversity using a recently developed, spatially explicit method
(EEMS, Estimation of Effective Migration Surfaces). We apply EEMS on global,
continental, and sub-continental scales, analyzing genetic data from 8,740
individuals from 469 geographically localized populations, obtained from 24
different source studies. In addition to the major, well-known barriers such as
the Sahara, Himalayas and Mediterranean, we detect barriers that correlate with
historic language group boundaries (boundaries of Slavic and Bantu speakers
with their neighbors), mountain ranges (Zagros, Caucasus, Ural) and marine
features (English Channel, Adriatic Sea, Wallace line). We also identify
regions showing high connectivity despite having geographic separation (Britain
and Scandinavia, Iceland and Denmark, among the Lesser Sunda Islands).
Simultaneously, we find that levels of diversity vary more smoothly, decreasing
gradually with distance from Africa. Overall, our results suggest that
diversity patterns are consistent and primarily shaped by the signature of the
Out-of-Africa expansion, but that migration rates are strongly influenced by
geography and local events.
1113F
The African Genome Resource Project: Patrilineal and
matrilineal inheritance through the Y chromosome and the mitochondrial genome.
F. Abascal, D. Gurdasani, T. Carstensen, M. Pollard, C.
Pomilla, M. Sandhu on behalf of AGR investigators. Human Genetics, Wellcome
Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.
Background The Y chromosome and the mitochondrial genome are
inherited from the paternal and maternal lines, respectively. The lack of
recombination in the mitochondrial genome and in a large part of the Y chromosome
leads to evolution almost in isolation from the autosomal genome. As a result, the
Y chromosome and the mitochondrial genome offer a unique perspective on human
demographic processes. Y chromosome (Y-) and mitochondrial (mt-) haplogroups
can be very informative about human origins, migrations and admixture, as well
as about potential sex biases during these processes. Further characterisation
of the diversity of Y- and mt-haplogroups within Africa is essential to
understand human history. Here, we present the mitochondrial and Y chromosome
diversity among ~5000 individuals from the African Genome Resource panel.
Methods We predicted the mt- and Y-haplogroups for 4,990 individuals and 2,399
males, respectively, representing diverse ethno-linguistic groups from
Ethiopia, Uganda, South Africa, Egypt, and 5 African populations sequenced
within the 1000 Genomes project. Mitochondrial and Y haplogroups were predicted
with Haplogrep and YFitter, respectively. We called the mitochondrial genome
and the Y chromosome for each sample and reconstructed their phylogenetic
relationships with FastML.
Results We found evidence for Eurasian admixture among
several populations across sub-Saharan populations. Eurasian mt haplogroups
appeared in 23% of the Ethiopians and 0.8% of the Ugandans. No Eurasian mt
haplogroups were detected for the Zulu and Nama. We identified 13% Ethiopians,
0.5% Ugandan, and 43% Nama/Khoe-Sans with Eurasian Y haplogroups. Eurasian
admixture is prevalent in Ethiopia but it is not distributed homogeneously.
Whereas the Gumuz show no Eurasian haplogroups, the Amhara show the highest
frequencies. Within the Nama/Khoe-San there is not a single Eurasian mitochondrial
haplogroup but up to 43% of Eurasian Y haplogroups, revealing a strong sex bias
(p=1e-12). Consistent with previous reports, the oldest haplogroups are found
in highest frequencies within the Khoe-Sans.
Conclusions We present the largest panel of mt and Y
chromosome sequences across Africa, including highly diverse Khoe-San populations
from South-Africa. Our findings suggest substantial variation in Y chromosome
and mt haplogroups across Africa, and provide evidence for extensive Eurasian
admixture among several populations across Africa.
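The sex-bias claim in this abstract boils down to comparing two proportions: how often a Eurasian haplogroup turns up on the Y chromosome versus in the mitochondrial genome. The abstract does not say which test produced its p-value or give the underlying counts, so the sketch below is only an assumption about how such a comparison could be made, using a two-sided Fisher's exact test written with the standard library. The counts in the example are invented and will not reproduce the reported figure.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is
    no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tolerance = 1 + 1e-9             # guards against floating-point ties
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )


# Invented counts: rows are Y chromosome and mtDNA, columns are
# Eurasian and non-Eurasian haplogroups.
p_toy = fisher_exact_two_sided(17, 23, 0, 40)
print(f"p = {p_toy:.2e}")
```

The more lopsided the table and the larger the sample, the smaller the p-value, so a result as extreme as the one reported implies both a stark contrast and a reasonably large number of individuals.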
1114W
Whole-genome sequence analyses provide new insights into the
demographic history and local adaptation of African populations.
S. Fan 1, D.E.
Kelly 1, M.H. Beltrame 1, M.E.B. Hansen 1, S. Mallick 2,3,4, T. Nyambo 5, S.
Omar 6, D. Meskel 7, G. Belay 7, A. Froment 8, N. Patterson 3, D. Reich 2,3,4,
S.A. Tishkoff 1,9 . 1) Department of Genetics, University of Pennsylvania,
Philadelphia, PA 19104, USA; 2) Department of Genetics, Harvard Medical School,
Boston, MA 02115, USA; 3) Broad Institute of Harvard and MIT, Cambridge, MA
02142, USA; 4) Howard Hughes Medical Institute, Harvard Medical School, Boston,
MA 02115, USA; 5) Department of Biochemistry, Muhimbili University of Health
and Allied Sciences, Dar es Salaam, Tanzania; 6) Kenya Medical Research
Institute, Center for Biotechnology Research and Development, Nairobi, Kenya;
7) Department of Biology, Addis Ababa University, Addis Ababa, Ethiopia; 8) UMR
208, IRD-MNHN, Musée de l'Homme, Paris, France; 9) Department of Biology, University
of Pennsylvania, Philadelphia, PA 19104, USA.
Africa is the origin of modern humans within the past
200,000 years. There are more than 2,000 ethnolinguistic groups in Africa,
which encompass around one-third of the world’s languages. To infer the complex
demographic history of African populations and adaptation to diverse
environments, we sequenced the genomes of 94 individuals from 44 indigenous
African populations using high coverage Illumina sequencing technology. Phylogenetic
analysis confirms that the San lineage is basal to all other modern human
population lineages. The location of other African populations in the
phylogenetic tree correlates with geographical location, with the exception of
the Central Africa rainforest hunter-gatherer (RHG) populations, who group with
Southern African populations. We characterize ancient African population
structure by inferring the effective population size and divergence time
between populations. A common population bottleneck for all African populations
was observed at ~200 thousand years ago (kya), corresponding with
paleobiological evidence for modern human origins. Since then, the San and RHG
populations have maintained the largest effective population size compared to
other populations prior to 10 kya. Using MSMC analysis, we infer that the San
population split from the RHG and the East African Khoesan-speaking Hadza and
Sandawe hunter-gatherers within the past 66-82 kya, suggesting these
populations could have originated from a historically more widespread
population of hunter-gatherers. By contrast, the San diverged from all
non-Khoesan speaking populations ~100-120 kya. The divergence times of
Niger-Kordofanian, Nilo-Saharan and Afroasiatic speaking populations were
within the past ~22 to 41 kya. In the RHG populations, the oldest divergence
was found between Eastern and Western RHG at ~36-51 kya; the time of divergence
of the western RHG populations was inferred to be ~12-18 kya. Based on the
ADMIXTURE analysis, Niger-Kordofanian and RHG populations were pooled for
analyses of natural selection. We observed signatures of positive selection at
genes involved in muscle development, bone synthesis, reproduction, immune
function, energy metabolism, cell signaling, and neural development.
This work
is supported by NIH grants 1R01DK104339-01, 1R01GM113657-01, and DP1
ES022577-04 to SAT. The sequencing was funded by the Simons Foundation (SFARI
280376) and the U.S. National Science Foundation (BCS-1032255) grants to DR.
1127T
The Genome Diversity in Africa Project: A deep catalogue of
genetic diversity across Africa.
D. Gurdasani 1,2, J.P. Martinez 1, M.O.
Pollard 1,2, T. Carstensen 1,2, C. Pomilla 1,2, GDAP Investigators 1,2 . 1)
Wellcome Trust Sanger Institute, Cambridge, Cambridgeshire, United Kingdom; 2)
Department of Medicine, University of Cambridge, Cambridge.
Background
While recent efforts have greatly extended our
understanding of genetic diversity in Africa, current sequence panels are
limited in their capture of African genetic variation. Deeper sequencing with
sampling of diverse indigenous populations is needed to capture diverse
haplotypes across Africa. The Genome Diversity in Africa Project (GDAP) aims to
characterise diversity from representative populations across all of Africa,
including from several indigenous hunter-gatherer populations across the
region. This would provide an important global resource to understand human
genetic diversity and provide insight into population history and migrations
across Africa in recent times. The project has completed sequencing of 575
samples across 23 populations in Africa, including populations from the Gambia,
Ghana, Morocco, South Africa, Sudan, Chad, Kenya, South Africa, Uganda, Egypt
and Ethiopia. Here, we present preliminary results from the project on 133
samples from 5 ethno-linguistic groups from Morocco, Ghana (Ashanti), Nigeria
(Igbo), Kenya (Kalenjin) and South Africa (Zulu) sequenced on the Hiseq X
platform (30x).
Methods
Reads were mapped to the GRCh38 reference. Following
quality control, variant sites were called using HaplotypeCaller v3.5 for each
sample to generate gVCFs. GenotypeGVCFs was run across all samples for joint calling.
VCFs were filtered using VQSR calibrated on DP, QD, FS, SOR, ReadPosRankSum
and MQRankSum annotations. A tranche sensitivity threshold of 99.5% was applied
for filtering of SNPs and 99% for indels. Only sites called in >90% of
individuals were included.
Results
We identified 25.1M SNPs and 2.9M indels
among 133 individuals in the GDAP pilot phase, with 25% and 47% of SNPs and
indels being novel (not in dbSNP141), respectively. A large proportion of
variants per population were private, varying from 12-18%, being greatest among
the Kalenjin and Zulu. We found the highest level of heterozygosity and genetic
variation among the Zulu, consistent with reported Khoe-San admixture in this
group.
Conclusions
We present the pilot phase of the Genome Diversity in Africa
Project, identifying a high level of diversity across 5 populations from
Africa. Inclusion of indigenous population groups, such as the Hadza, Twa
Pygmies, and Ju/’hoansi in the next phase will materially advance the
understanding of genetic diversity across African populations, and provide an
invaluable resource to researchers worldwide.
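The final filtering step in the Methods above (keeping only sites called in more than 90% of individuals) can be sketched in a few lines. This is an illustration only, not the GDAP pipeline: the function names, the use of `None` for a missing call and the toy genotypes are all assumptions made for the example.

```python
from typing import Optional, Sequence

# Hypothetical encoding: alt-allele count (0, 1, 2), or None for a missing call.
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at one site."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def passes_call_rate(genotypes: Sequence[Genotype], threshold: float = 0.90) -> bool:
    """Keep a site only if it is called in strictly more than `threshold` of samples."""
    return call_rate(genotypes) > threshold


# Toy data (invented): ten individuals at two sites.
site_a = [0, 1, 2, 0, 0, 1, 0, 0, 0, None]  # 9 of 10 called: exactly 90%, so dropped
site_b = [0, 1, 2, 0, 0, 1, 0, 0, 0, 0]     # 10 of 10 called: kept

kept_sites = [name for name, site in (("a", site_a), ("b", site_b)) if passes_call_rate(site)]
```

The strict inequality mirrors the ">90%" wording of the abstract; a pipeline that keeps sites at exactly 90% would use `>=` instead.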
1133T
High-coverage sequencing of the Human Genome Diversity
Project (HGDP-CEPH) Panel.
S. McCarthy 1, A. Bergström 1, Y. Xue 1, Q.
Ayub 1, S. Mallick 2,3,4, M. Sandhu 1, D. Reich 2,3,4, R. Durbin 1, C.
Tyler-Smith 1 . 1) Wellcome Trust Sanger Institute, Cambridge, United Kingdom;
2) Department of Genetics, Harvard Medical School, Boston, MA; 3) Broad
Institute of Harvard and MIT, Cambridge, MA; 4) Howard Hughes Medical
Institute, Boston, MA.
We discuss the completion of high coverage (>30x),
whole-genome sequencing of all 952 core individuals in the Human Genome
Diversity Panel (HGDP-CEPH), with the results being made available as an open
access population data resource. This widely used panel contains samples from
52 populations spanning Africa, the Middle East, Europe, Asia, Oceania and the Americas,
and previous genotype data from these samples have been an important reference
resource for human genetic diversity. As seen in the 1000 Genomes Project,
having fully open access data, unencumbered by managed access restrictions and
other hurdles, is an invaluable driver for democratized data analysis and
methods development. Building on previous sequencing efforts by the Simons
Genome Diversity Project, we have completed sequencing of the panel and are
making the data available via the ENA and the 1000 Genomes Project data
management successor, the International Genome Sample Resource (IGSR)
(www.internationalgenome.org). All data has moved to the new GRCh38 reference
and we present preliminary results on the call set derived from this data. We
have GATK HaplotypeCaller and fermikit primary calls, are making mpileup and
freebayes calls, and will present an integrated call set that has been
computationally phased, together with initial population genetic analyses. A
small number of samples are being experimentally phased using 10X Genomics
technology which will allow evaluation of phasing accuracy, and also unbiased
use of haplotype-based analyses such as MSMC.
1136T
Fine-scale identity-by-descent and birth records in Finland
provide insights into recent population history.
A.R. Martin 1,2, S. Kerminen 3,
A.S. Havulinna 4, A. Sarin 3, A. Palotie 1,2,3, V. Salomaa 4, S. Ripatti 3, M.
Pirinen 3, M.J. Daly 1,2 . 1) Analytic and Translational Genetics Unit,
Massachusetts General Hospital, Boston, MA, USA; 2) Medical and Population
Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA; 3) Institute
for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki,
Finland; 4) National Institute for Health and Welfare (THL), Helsinki, Finland.
Finland provides unique opportunities to investigate both population and medical
genomics because of its adoption of unprecedented uniformity in national
electronic health records, concerted coordination of research centers across
the country, detailed historical records, as well as recent population bottlenecks
that drove specific disease alleles to high frequency. We investigate recent
population history (up to ~50 generations ago), particularly relevant to rare,
disease-conferring alleles, using identity-by-descent (IBD) haplotype sharing
in >10,000 Finns. We compare IBD sharing in Finland to nearby Scandinavian countries
with considerably different population histories, including >8,000 Swedes
and >30,000 Danes. We find drastically more sharing on average in Finns,
including many long tracts. By leveraging fine-scale birth record data, we find
a non-linear decay of pairwise IBD sharing with increasing distance across Finland.
This arises from pockets of excess IBD sharing; e.g. pairs of individuals from
northeast Finland share on average several-fold more of their genome IBD than
pairs from southwest regions containing the major cities of Turku and Helsinki.
We demonstrate inference of recent migration patterns from IBD sharing
patterns. For example, high IBD sharing in northeast Finland radiates from
north to south rather than to the west, indicating that migration is restricted
near the Russian border. We also investigate recent effective population size
changes across regions of Finland and find evidence supporting the distinction
between early and late settlement areas. However, our results indicate a more
continuous flow of migration than previously posited, with a minimum Ne
occurring ~12 generations ago in the northernmost Lapland region and moving
further back in time to the south, with a bottleneck detectable in the early
settlement area ~40 generations ago. Lastly, we leverage IBD sharing for
genetic disease mapping and show that rare, functional haplotypes show more
significant association via IBD mapping than single variants with linear mixed
effect models.
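The decay of pairwise IBD sharing with distance described above can be explored by binning pairs of individuals by the distance between their birthplaces and averaging the IBD shared in each bin. The sketch below assumes each pair has already been reduced to a (distance in km, total IBD in cM) tuple; the function name, bin width and numbers are invented for illustration and are not taken from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_ibd_by_distance(
    pairs: Iterable[Tuple[float, float]], bin_km: float = 100.0
) -> Dict[int, float]:
    """Average IBD sharing per distance bin.

    pairs: (distance_km, shared_ibd_cm) for each pair of individuals.
    Returns {bin_index: mean shared IBD}, where bin_index = floor(distance / bin_km).
    """
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for distance_km, ibd_cm in pairs:
        bin_index = int(distance_km // bin_km)
        totals[bin_index] += ibd_cm
        counts[bin_index] += 1
    return {b: totals[b] / counts[b] for b in sorted(totals)}


# Toy pairs (invented): sharing falls off quickly, then flattens out.
toy_pairs = [(10, 40.0), (60, 20.0), (150, 12.0), (180, 8.0), (420, 3.0)]
decay = mean_ibd_by_distance(toy_pairs)
```

Computing the same table separately for each region would be one simple way to look for the pockets of excess sharing mentioned in the abstract.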
1147W
Y-chromosomal composition of mediaeval and contemporary
populations in Norway and adjacent Scandinavian countries: Y-STR haplotypes and
the rare Y-haplogroup Q.
B. Berger 1, S. Willuweit 2, H. Niederstätter 1, P. Kralj
1, L. Roewer 2, W. Parson 1,3. 1) Institute of Legal Medicine, Medical
University of Innsbruck, Innsbruck, Austria; 2) Department of Forensic
Genetics, Institute of Legal Medicine and Forensic Sciences,
Charité-Universitätsmedizin, Berlin, 13353, Germany; 3) Forensic Science
Program, The Pennsylvania State University, PA, USA.
In the framework of the project “Immigration and mobility in
mediaeval and post-mediaeval Norway” molecular genetic analyses were performed
on 97 pre-modern human remains including genetic sexing and Y-chromosomal DNA
typing. All samples were subjected to molecular genetic analyses of the sex
using “Genderplex” consisting of two different regions of the amelogenin gene,
SRY and four X-STR loci. Sex assignment was possible for 90% of the extracted
remains (n=87). Of these, 49 (56.3%) yielded a genetically male
result. All of these DNA extracts were subjected to Y-STR analysis using the Yfiler
Plus PCR Amplification Kit (Thermo Fisher Scientific) and/or PowerPlex Y23
System (Promega). At least partial Y-STR profiles were obtained from all
samples. A detailed comparison between mediaeval/post-mediaeval and contemporary
Y-chromosomes was performed by searching the obtained haplotypes (HTs) in the Y
Chromosome Haplotype Reference Database (YHRD: https://yhrd.org) comprising
154,329 haplotypes from 991 populations in 129 countries at the time of query
(Release 50). YHRD searches of the pre-modern haplotypes yielded full matches plus
neighbor-matches differing at only one allele from the query HT. Matches are
presented with geographical and ancestry information of the contemporary HTs.
For samples without direct YHRD-matches, this information is provided through
their neighbor HTs. AMOVA was performed using the YHRD online tool on pairwise
RST values to create the corresponding MDS plots. The pre-modern HTs were
grouped according to mediaeval and post-mediaeval origin and compared to
contemporary Scandinavian (Norwegian, Swedish and Danish),
Northwest European, and Northeast European populations. Both pre-modern
populations showed small genetic distances to contemporary Scandinavians and
larger distances to Northeast Europeans with Northwest European populations in
between. As expected, an initial assessment of the Y-chromosomal haplogroups
(HGs) showed that most of the samples were attributable to the main European
HGs I1, R1a and R1b. However, one of the HTs seemed to be associated with HG-Q
which is rare in Europe and hitherto little evaluated in this region. Network analysis
was applied for detecting similar HTs in contemporary samples from Norway and
adjacent Northern European countries stored in the YHRD. The outcomes of this
survey should initiate a detailed SNP based HG-assessment of HG-Q candidate
samples.
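The distinction between full matches and neighbour matches in the database search above can be illustrated with a small sketch. It assumes a neighbour is a haplotype that differs from the query at exactly one locus, which is one reading of "differing at only one allele"; it does not reproduce the YHRD search itself. The locus names are real Y-STR markers, but the allele values and sample IDs are invented.

```python
from typing import Dict, List, Tuple

Haplotype = Dict[str, int]  # locus name -> allele (repeat count)


def differing_loci(query: Haplotype, candidate: Haplotype) -> int:
    """Count loci at which two haplotypes typed at the same loci carry different alleles."""
    return sum(query[locus] != candidate[locus] for locus in query)


def classify_matches(
    query: Haplotype, database: Dict[str, Haplotype]
) -> Tuple[List[str], List[str]]:
    """Split database entries into full matches and one-locus neighbours of the query."""
    full_matches: List[str] = []
    neighbours: List[str] = []
    for sample_id, candidate in database.items():
        if candidate.keys() != query.keys():
            continue  # simplification: skip haplotypes typed at a different locus set
        n_diff = differing_loci(query, candidate)
        if n_diff == 0:
            full_matches.append(sample_id)
        elif n_diff == 1:
            neighbours.append(sample_id)
    return full_matches, neighbours


# Toy data (invented values).
query = {"DYS19": 14, "DYS390": 23, "DYS391": 10}
database = {
    "ref_1": {"DYS19": 14, "DYS390": 23, "DYS391": 10},  # identical: full match
    "ref_2": {"DYS19": 14, "DYS390": 24, "DYS391": 10},  # one locus differs: neighbour
    "ref_3": {"DYS19": 15, "DYS390": 24, "DYS391": 10},  # two loci differ: ignored
}
full_matches, neighbours = classify_matches(query, database)
```

Partial profiles, which the abstract says were obtained for some remains, would need a rule for untyped loci; the sketch sidesteps that by requiring identical locus sets.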
1149F
Evidence for detailed historical European population
structure from large-scale, diverse genetic polymorphism data.
P. Carbonetto 1,
J. Byrnes 1, J.M. Granka 1, Y. Wang 1, K. Noto 1, E. Han 1, A.R. Kermany 1,
K.A. Rand 1, E. Elyashiv 1, H. Guturu 1, N.M. Myres 2, E.L. Hong 1, R.E. Curtis
2, K.G. Chahine 2, C.A. Ball 1. 1) Ancestry.com DNA, LLC, San Francisco, CA; 2)
Ancestry.com DNA, LLC, Lehi, UT.
Despite the recent surge of interest in ancient genomes, we
show that there is still much to be elucidated about human demography from
contemporary genomes. Here, we demonstrate the use of genealogical data to
generate demographic insights from analysis of a large-scale, heterogeneous
genetic data set. Specifically, we show that an unsupervised ADMIXTURE analysis
of genotypes from 131,293 primarily US-born individuals, followed by a simple statistical
analysis of the 3 million pedigree records linked to these genotype samples,
yields novel insights into European genetic diversity. In contrast to principal
component analysis (PCA), which is the most widely used approach to
investigating European genetic diversity, we use ADMIXTURE to infer genetically
differentiated source populations reflecting more distant historical time
periods. Unsurprisingly, among European-origin individuals, admixture is
pervasive. Despite this, our ADMIXTURE analysis with K = 12 ancestral populations
identifies 5 stable, genetically differentiated groups within Europe (with
putative historical counterparts in parentheses): Ashkenazi Jewish, Irish (Celts),
Eastern Europeans (Slavs), Scandinavians (Nordics) and Iberians, featuring
Basques and Sardinians. The genealogical data also allow us to provide a
detailed portrait of the genetic composition of contemporary peoples across
North America (e.g., Iberians in Cuba), and other parts of the world. This work
suggests the potential for drawing more detailed connections between
present-day and ancient genetic variation by leveraging large, heterogeneous genetic
data sets.
1153W
Genomic insights into the population structure and history
of the Irish Travellers.
E.H. Gilbert 1, S. Carmi 2, S. Ennis 3, J.F. Wilson
4,5, G.L. Cavalleri 1. 1) Molecular and Cellular Therapeutics, Royal College of
Surgeons in Ireland, Dublin, Leinster, Ireland; 2) Braun School of Public
Health, The Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem,
Israel; 3) School of Medicine and Medical Science, University College Dublin,
Dublin, Ireland; 4) Centre for Global Health Research, Usher Institute for
Population Health Sciences and Informatics, University of Edinburgh, Teviot
Place, Edinburgh, Scotland; 5) MRC Human Genetics Unit, Institute of Genetics
and Molecular Medicine, University of Edinburgh, Western General Hospital,
Crewe Road, Edinburgh, Scotland.
Aims: The Irish Travellers are a population with a history
of nomadism. Consanguineous unions are common, and as a population they are
socially and genetically isolated from the surrounding, “settled” Irish
population. A previous low-resolution genetic analysis suggested a common Irish
origin between the settled and the Traveller populations. What is not known,
however, is the extent of population structure within the Irish Traveller
population, the time of divergence from the general Irish population, and the
extent of autozygosity.
Methods: We recruited Irish Travellers from across Ireland
and the UK. To be included a participant had to have had at least three
grandparents with a surname associated with the Irish Travellers. DNA was
extracted from saliva samples, and genotypes were generated using the Illumina
OmniExpress SNP genotyping platform. With this data, we investigated population
structure using fineStructure, quantified the levels of autozygosity with
PLINK, and estimated a time of divergence using a method based on Identity by
Descent (IBD) segment identification.
Results: We merged, cleaned, and analysed data from 42 Irish
Travellers, 2232 settled Irish, 2039 British, 143 Roma Gypsies, and 931 individuals
from 57 world-wide populations. We confirm an Irish origin for the Irish
Travellers, demonstrate evidence for population substructure within the population,
confirm high levels of autozygosity consistent with a consanguineous population,
and for the first time provide estimates for a date of divergence between the
Irish Travellers and settled Irish.
Conclusion: Our findings have implications for disease
mapping within Ireland, as well as on the social history of the Irish Traveller
population.
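The abstract above does not say how autozygosity was summarised once PLINK had identified runs of homozygosity (ROH). One widely used summary is F_ROH, the fraction of the autosomal genome covered by ROH. The sketch below shows that calculation under the assumption that segment lengths are already available in kilobases; the function name and all numbers are invented, and the autosome length is deliberately left as a parameter rather than hard-coded.

```python
from typing import Iterable


def f_roh(roh_lengths_kb: Iterable[float], autosome_length_kb: float) -> float:
    """Fraction of the autosomal genome covered by runs of homozygosity.

    roh_lengths_kb: length of each ROH segment for one individual, in kb.
    autosome_length_kb: total autosomal length covered by the genotyping array, in kb.
    """
    if autosome_length_kb <= 0:
        raise ValueError("autosome_length_kb must be positive")
    return sum(roh_lengths_kb) / autosome_length_kb


# Toy individual (invented): two ROH segments on a 400 Mb toy genome.
toy_f = f_roh([1000.0, 3000.0], autosome_length_kb=400_000.0)
```

Higher values are expected in populations where consanguineous unions are common, which is the pattern the abstract reports for the Irish Travellers.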
1162W
Personal ancestry inference at the finest scale reveals
more sub-structure in the UK.
D. Lawson, G. Weyenberg. Integrative Epidemiology
Unit, University of Bristol, Bristol, United Kingdom.
Chromosome Painting has revealed genetic differences within
the UK at a very fine scale [1], with structured genetic variation within a
single county in some cases (such as Cornwall & South Wales). However, in
that work, it was not possible to genetically distinguish much of England,
which appeared as a single homogeneous group. Here, we describe an extension to
the FineSTRUCTURE [2] clustering that can further distinguish ancestry even
within England; for example, identifying regions such as Norfolk, the Midlands
and the South as genetically distinct. The approach works by using the known county
locations to craft genetic features to use in unsupervised clustering. Specifically,
we group individuals by their geographic sampling location into reference donor
populations. This forms an ancestry profile - which can be viewed as a careful
choice of feature vector - that still allows unsupervised genetic clustering
for all individuals. Further, we describe how this approach allows individuals
to be described as an admixture of the inferred geographical clusters. This
allows ancestral information to be recovered for individuals who are not purely
represented by a single geographical location. This also allows us to
characterise the genetic relationship between the inferred clusters, several of
which represent drift that is most strongly represented by a particular geographical
region (including Cornwall, Wales, Scotland and the North of England) and
others of which represent characteristic admixture proportions between these
ancestral drifted populations. Beyond improving resolution, this approach
facilitates personal genomics because individuals can be represented in terms
of the fixed reference panel. We demonstrate the utility of the approach by
describing the ancestry of the UK10K participants in terms of the new, high
resolution POBI clusters. Previously, a similar analysis [3] without
geographical information inferred little population structure in the UK from
these samples, but now we have a rich representation of their population structure,
including an assessment of admixture from outside the UK. This highlights the
value in high quality fine-scale geographic sampling, which could now
facilitate this level of ancestry identification for many other countries.
[1] Leslie
et al 2015, Nature 519:309–314 [2] Lawson et al 2012, PLoS Genet. 8:e1002453
[3] UK10K Consortium 2015, Nature 526:82-90.
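The "ancestry profile" idea in the abstract above (grouping donors by sampling location so that each individual is summarised by how much they copy from each region) can be sketched as a simple aggregation. This assumes the painting step has already produced, for one recipient, the amount of genome copied from each donor; the function name, donor IDs and values are invented, and the real method involves considerably more than this normalisation.

```python
from collections import defaultdict
from typing import Dict


def ancestry_profile(
    copied_from_donor: Dict[str, float], donor_region: Dict[str, str]
) -> Dict[str, float]:
    """Collapse per-donor copying amounts into per-region proportions summing to 1."""
    region_totals: Dict[str, float] = defaultdict(float)
    for donor, amount in copied_from_donor.items():
        region_totals[donor_region[donor]] += amount
    total = sum(region_totals.values())
    if total == 0:
        raise ValueError("recipient copies nothing from any donor")
    return {region: amount / total for region, amount in region_totals.items()}


# Toy painting output (invented): cM copied by one recipient from three donors.
copied = {"donor_1": 30.0, "donor_2": 10.0, "donor_3": 60.0}
regions = {"donor_1": "Cornwall", "donor_2": "Cornwall", "donor_3": "Norfolk"}
profile = ancestry_profile(copied, regions)
```

The resulting vector is what the abstract calls a feature vector: every individual is described on the same fixed set of regions, so profiles can be compared or clustered directly.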
1174W
Chromosome painting for arbitrary sample collections.
G.
Weyenberg, D. Lawson. Integrative Epidemiology Unit, University of
Bristol, Bristol, United Kingdom.
Haplotype-based methods have been demonstrated to be capable
of detecting fine scale structure within human populations—to the point of
distinguishing genetic variation at the sub-county level in the South West of
England [1]. However, the aforementioned method implements an all-against-all
analysis of sampled individuals, which is not suited to all applications,
including personal genomics where samples are obtained individually or in small
batches. Here, we describe an extension of the FineSTRUCTURE [2] method to
allow for painting of individual samples against a panel of pre-calculated
reference haplotype clusters, making the method computationally feasible for
on-demand analysis of individuals. The choice of the reference panel also
allows the user to tailor the analysis to emphasise targeted features of the
data. For example, in the context of a personal ancestry imputation, panels may
be constructed to focus on global-, continental-, or national-scale genetic
features, and the low computational cost of painting an individual against a
pre-computed panel makes sample-level exploratory analysis feasible. Another
application of the panel-based painting is to use high-quality reference data
to impute unknown geographical labels to samples where such information is
either unavailable, or was collected at an undesirable resolution. To
demonstrate the latter application, we analysed several populations with suspected
Northern-European ancestry (including the HapMap CEU and ASW populations, and
the UK10K dataset) with respect to panels of Europeans and the high-resolution
People of the British Isles (POBI) samples. These individuals are characterised
in terms of an admixture of inferred clusters in the reference populations.
Whilst many individuals were best described as a complex admixture that likely occurred
over many generations, many others had a clear signal of geographically distinct
ancestry.
[1] Leslie et al 2015, Nature 519:309–314 [2] Lawson et al 2012, PLoS
Genet. 8:e1002453.
1178T
Local ancestry patterns inferred from one million genomes
recapitulate fine-scale population history.
Y. Wang 1, K. Noto 1, J. Byrnes 1,
R.E. Curtis 2, E. Han 1, E. Elyashiv 1, H. Guturu 1, P. Carbonetto 1, A.R. Kermany
1, J.M. Granka 1, K.A. Rand 1, N.M. Myres 2, E.L. Hong 1, C.A. Ball 1, K.G.
Chahine 2 . 1) Ancestry.com DNA, LLC, San Francisco, CA; 2) Ancestry.com DNA,
LLC, Lehi, UT.
In a country of immigrants, population structure is shaped
by a long, ongoing history of immigration, followed by subsequent admixture and
migration. All these events have left their footprints in the genomic landscape
of current residents and make it possible for geneticists to reconstruct
population history from genomic data. However, deciphering the signature of
these forces requires accurate inference of genomic tracts that one individual
inherits from ancestors of different origins. Previously, several methods have
been developed for inferring local ancestry with varying levels of success.
Unfortunately, none of these methods can be feasibly applied to a data set of
one million genomes. Recently, our team presented Polly, a novel algorithm for estimating
genome-wide ancestry proportions in admixed samples. Polly, built on a modified
version of the BEAGLE haplotype model, relies on this model to achieve two
things: First, to account for phasing uncertainty, and second, to provide a
measure of distance between a query haplotype and a reference haplotype. Using
haplotype models learned from hundreds of thousands of haplotypes and
subsequently annotated with over eight thousand single-origin reference
individuals, Polly performs ultra-fast inference of both global and local ancestry.
In this study, we evaluate Polly's accuracy in predicting local ancestry using
simulated admixed samples with known genomic composition. We assess the
assignment accuracy, the switching pattern and the tract length distribution.
Using a cross-validation experiment, we confirm that Polly makes highly accurate
local ancestry estimates even at the subcontinental level. We further use Polly
to analyze one million genomes from the United States and discover distinct local
ancestry patterns among different ethnic groups and communities, especially
among African Americans and Latino Americans. We map local ancestry estimates
to individuals’ geographic locations. Our results illustrate clear population
structure arising from immigration routes, assortative mating and isolation by
distance. We also find evidence that supports large scale domestic migration
events, as exemplified by the Great Migration of African Americans following
the abolition of slavery. Finally, we attempt to date known historical events
from ancestry tract length distributions. Overall, our analysis demonstrates
the power of combining local ancestry analysis with big data in studying fine-scale
population history.
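Two of the evaluation measures named in the abstract above, the switching pattern and the tract length distribution, are easy to illustrate once local ancestry has been called. The sketch assumes one ancestry label per fixed-width genomic window along a single haplotype; that representation, the function names and the window size are assumptions for the example and say nothing about how Polly works internally.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def ancestry_tracts(labels: Sequence[str], window_cm: float = 1.0) -> List[Tuple[str, float]]:
    """Merge consecutive windows with the same label into (ancestry, length_cM) tracts."""
    return [
        (ancestry, sum(1 for _ in run) * window_cm)
        for ancestry, run in groupby(labels)
    ]


def count_switches(labels: Sequence[str]) -> int:
    """Number of ancestry changes along the haplotype."""
    return max(len(ancestry_tracts(labels)) - 1, 0)


# Toy haplotype (invented): six 1 cM windows.
labels = ["AFR", "AFR", "AFR", "EUR", "EUR", "AFR"]
tracts = ancestry_tracts(labels)
switches = count_switches(labels)
```

Because recombination breaks tracts up over successive generations, shorter tracts generally point to older admixture, which is the intuition behind dating events from tract length distributions.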