Pile-ups can occur for a number of different reasons:
- Lack of phasing. Phasing is the process of sorting the DNA letters (the As, Cs, Ts and Gs) onto the paternal and maternal chromosomes. AncestryDNA and MyHeritage now use phased matching, which means that they phase our genotypes before trying to identify shared sections of DNA. 23andMe and Family Tree DNA use a process of half-identical matching. Our DNA is not phased but instead the algorithms zigzag backwards and forwards across two columns of unsorted DNA letters looking for consecutive runs of matching SNPs. Half-identical matching works well at identifying large shared segments of DNA but is less successful on smaller segments, and particularly segments under about 10 centiMorgans (cMs) in size. If a match does not survive phasing, it is a false match.
- SNP-poor regions. The autosomal DNA tests used for genetic genealogy provide information on between 630,000 and 700,000 genetic markers known as SNPs (single nucleotide polymorphisms) which are scattered across the genome. These SNPs are only a tiny fraction of the three billion letters which make up the human genome, but the SNPs are specially selected for being the most informative about variations within and between populations. When trying to identify shared regions of the genome the companies are looking for long runs of consecutive SNPs that are the same (identical by state or IBS) in two individuals. Segments which pass the companies' matching thresholds are declared to be identical by descent (IBD) and are possibly indicative of shared ancestry in a genealogical timeframe. Some companies will also apply additional algorithms to filter out known problematic regions which are unlikely to be IBD. However, because not all of our SNPs are being tested, the length of a segment can be falsely inflated. One hypothesis is that lots of small segments can become conflated into longer segments. (1) This problem is particularly likely to occur in sections of the genome which have poor coverage on the chips. (2)
- Excess IBD. This is a term used to describe sections of the genome which are known to be widely shared in humans or in certain populations. Such regions often offer some type of evolutionary advantage. For an overview of known excess IBD regions see the section on excess IBD sharing in the ISOGG Wiki article on IBD. In addition to looking at the size of a shared segment, some IBD detection algorithms will, therefore, also take into account the frequency of the segment. (3) The more people who share a segment, the older it is likely to be. AncestryDNA apply their proprietary Timber algorithm to phased segments and they downweight the cM count for segments that are widely shared in their database. (4)
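To make the zigzag idea in the first bullet more concrete, here is a minimal Python sketch of half-identical matching on unphased genotypes. It is an illustration only and not the algorithm used by any of the companies. The function names, the `min_snps` parameter and the toy genotypes are all made up for the example. Real matching works with cM and SNP-count thresholds, usually tolerates some mismatches, and has to deal with no-calls, none of which is modelled here.

```python
def half_identical(g1, g2):
    """Return True if two unphased genotypes share at least one allele.

    Each genotype is a two-letter string such as "AG". The order of the
    letters carries no information because the data is unphased.
    Only opposite homozygotes (e.g. "AA" vs "GG") fail this test.
    No-calls are not handled in this toy version.
    """
    return bool(set(g1) & set(g2))


def half_identical_runs(person_a, person_b, min_snps=3):
    """Find runs of consecutive half-identical SNPs.

    Returns a list of (start_index, end_index) pairs, inclusive, for
    every run that is at least min_snps long. min_snps is a hypothetical
    stand-in for the companies' real matching thresholds.
    """
    runs = []
    start = None
    n = min(len(person_a), len(person_b))
    for i in range(n):
        if half_identical(person_a[i], person_b[i]):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None  # the run is broken by an opposite homozygote
    # Close a run that is still open at the end of the data
    if start is not None and n - start >= min_snps:
        runs.append((start, n - 1))
    return runs


# Toy data: six SNPs for two people (invented for illustration)
a = ["AG", "CC", "TT", "AG", "GG", "CT"]
b = ["AA", "CT", "CT", "GG", "AA", "CC"]
print(half_identical_runs(a, b, min_snps=3))
```

In the toy data the first four SNPs form a run because at each position the two people have at least one letter in common, but nothing in the code checks that the shared letters all sit on the same parental chromosome. That is why short runs found in this way can turn out to be false once the data is phased.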
Following on from our discussion in the All Genetic Genealogy Facebook group, Dan Edwards has been working on an exciting tool to provide a new way of visualising pile-ups. It's possible that the tool will eventually be made available on the web but for the moment it is a bespoke service. Dan has been experimenting on some of my data. He has produced for me some charts showing the distribution of shared segments across my 22 autosomes and on the X-chromosome. Dan has kindly given me permission to share my charts which are reproduced below.
The charts are based on my Family Finder chromosome browser data from Family Tree DNA. FTDNA updated their match thresholds in May 2016, but they are still the only company that continues to include small segments under 6 cMs when inferring a relationship. It is generally accepted by genetic genealogists that the use of such small segments is problematical. (6)
The problem with small segments can be clearly seen in the charts below. Rather than being distributed evenly across my genome, the smaller shared segments form huge spires and skyscrapers. As the segment size increases the pile-ups are greatly reduced, but there are still some parts of my genome which have some quite sizeable pile-ups on segments over 10 cMs in size. Chromosomes 9, 14, 18 and 19, in particular, seem to have a few problem areas which it is probably best for me to avoid. As more matches come in, these spires and skyscrapers can be expected to grow even more. Remember too that FTDNA only reports "matches" on small segments if the match thresholds have already been met. If small segments were reported for every person in the database down to 1 cM, it's likely that the spires would be even more pronounced.
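For anyone who wants to experiment with their own segment data, the basic idea of counting a pile-up can be sketched in a few lines of Python. This is not Dan's tool, and I don't know how his charts are calculated. It is a rough sketch that assumes each shared segment has already been read into a tuple of chromosome, start position, end position and cM, and it simply counts how many segments overlap each one-megabase bin. The function name, the `bin_size` and `min_cm` parameters and the sample segments are all invented for the example.

```python
from collections import Counter


def pileup_depth(segments, bin_size=1_000_000, min_cm=0.0):
    """Count how many shared segments overlap each bin of the genome.

    segments: iterable of (chromosome, start_bp, end_bp, cm) tuples.
    bin_size: width of each bin in base pairs (an arbitrary choice).
    min_cm:   segments smaller than this are ignored.
    Returns a Counter keyed by (chromosome, bin_index).
    """
    depth = Counter()
    for chrom, start, end, cm in segments:
        if cm < min_cm:
            continue  # skip segments below the size threshold
        first_bin = start // bin_size
        last_bin = end // bin_size
        for b in range(first_bin, last_bin + 1):
            depth[(chrom, b)] += 1
    return depth


# Invented segments on chromosome 9, for illustration only
segments = [
    ("9", 1_000_000, 3_500_000, 2.1),
    ("9", 2_000_000, 4_000_000, 3.4),
    ("9", 2_500_000, 2_900_000, 1.2),
    ("9", 2_200_000, 30_000_000, 12.5),
]

all_sizes = pileup_depth(segments)
over_10_cm = pileup_depth(segments, min_cm=10)
print(all_sizes[("9", 2)], over_10_cm[("9", 2)])
```

Raising `min_cm` and comparing the counts for the same bin gives a crude way of seeing how much of a spire is made up of small segments.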
If Dan is able to develop his tool further and make it more widely available it will be interesting to see how other people's pile-ups compare with mine. I hope that we might also be able to identify a reason for some of the pile-ups. In the meantime I hope you enjoy looking at my pictures.
Footnotes
(1) See: Chiang CWK, Ralph P, Novembre J (2016). Conflation of short identity-by-descent segments bias their inferred length distribution. G3 Genes Genomes Genetics 6: 1287.
(2) For a useful overview of SNP coverage on the chips used by AncestryDNA and 23andMe see Rebekah Canada's series of articles on the subject of exploring microarray chips.
(3) For a good overview of the methodology of IBD detection see Browning and Browning (2012): Identity by descent between distant relatives: detection and applications (Annual Review of Genetics 2012; 46: 617-33). The authors state: "The key idea behind IBD segment detection is haplotype frequency. If the frequency of a shared haplotype is very small, the haplotype is unlikely to be observed twice in independently sampled individuals, so one can infer the presence of an IBD segment. This criterion can be applied in several ways. The first is length of sharing, which is a proxy for frequency. If two densely genotyped haplotypes are identical at all or most (allowing for some genotyping error) assayed alleles over a very large segment of a chromosome, then the haplotypes are likely to be identical by descent across the whole segment. The second is direct use of haplotype frequency: Shared haplotypes with estimated frequency below some threshold are determined to be identical by descent. The third makes use of a population genetics model to infer probability of IBD. Given the frequency of the shared haplotype and a probability model for the IBD process along the chromosome, one can estimate the probability that the individuals are identical by descent at any position on the segment."
(4) For a good explanation of how the AncestryDNA algorithm works read the blog post by Julie Granka on Filtering DNA matches at AncestryDNA with Timber. Take a look in particular at the figure in that blog post. Although the majority of phased segments filtered out by Timber are smaller segments under 15 cMs, note that it also downweights some larger segments up to 50 cMs in size.
(5) Peter Alefounder has developed a tool known as the Geneal Segment Stacker but I've not yet had time to play around with it. There are further details in this thread in the ISOGG Facebook group.
(6) For an excellent summary on the current state of our knowledge on the subject of small segments see the blog post A small segment round up by Blaine Bettinger.
Further reading
- Chromosome pile-ups in genetic genealogy: examples from FTDNA and 23andMe. Genealogy and Genomics, 31 January 2015.