Monday, 22 January 2018

Small segments and pile-ups - a visualisation

We've recently been discussing the problem of pile-ups in the All Genetic Genealogy group on Facebook. Pile-up is a term used in genetic genealogy to describe multiple shared autosomal DNA segments that are stacked on top of each other on the same part of the genome. The presence of a pile-up should be treated as a warning sign. For a shared segment to have genealogical significance we would expect it to be shared only with descendants of the common ancestral couple. If we share a segment with hundreds or thousands of people it is extremely unlikely that we share that section of DNA by virtue of a recent genealogical relationship within the last ten generations or so; it is much more likely to indicate a false match or a more distant relationship.

Pile-ups can occur for a number of different reasons:
  • Lack of phasing. Phasing is the process of sorting the DNA letters (the As, Cs, Ts and Gs) onto the paternal and maternal chromosomes. AncestryDNA and MyHeritage now use phased matching, which means that they phase our genotypes before trying to identify shared sections of DNA. 23andMe and Family Tree DNA use a process of half-identical matching. Our DNA is not phased; instead the algorithms zigzag backwards and forwards across two columns of unsorted DNA letters looking for consecutive runs of matching SNPs. Half-identical matching works well at identifying large shared segments of DNA but is less successful on smaller segments, particularly segments under about 10 centiMorgans (cMs) in size. If a match does not survive phasing it is a false match.
  • SNP-poor regions. The autosomal DNA tests used for genetic genealogy provide information on between 630,000 and 700,000 genetic markers known as SNPs (single nucleotide polymorphisms) which are scattered across the genome. These SNPs are only a tiny fraction of the three billion letters which make up the human genome, but the SNPs are specially selected for being the most informative about variations within and between populations. When trying to identify shared regions of the genome the companies are looking for long runs of consecutive SNPs that are the same (identical by state or IBS) in two individuals. Segments which pass the companies' matching thresholds are declared to be identical by descent (IBD) and are possibly indicative of shared ancestry in a genealogical timeframe. Some companies will also apply additional algorithms to filter out known problematic regions which are unlikely to be IBD. However, because not all of our SNPs are being tested, the length of a segment can be falsely inflated. One hypothesis is that lots of small segments can become conflated into longer segments. (1) This problem is particularly likely to occur in sections of the genome which have poor coverage on the chips. (2) 
  • Excess IBD. This is a term used to describe sections of the genome which are known to be widely shared in humans or in certain populations. Such regions often offer some type of evolutionary advantage. For an overview of known excess IBD regions see the section on excess IBD sharing in the ISOGG Wiki article on IBD. In addition to looking at the size of a shared segment, some IBD detection algorithms will, therefore, also take into account the frequency of the segment. (3) The more people who share a segment, the older it is likely to be. AncestryDNA apply their proprietary Timber algorithm to phased segments and they downweight the cM count for segments that are widely shared in their database. (4)
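
The half-identical matching described in the first bullet point can be illustrated with a toy example. This is only a minimal sketch and not the algorithm used by any company: the function names, the representation of genotypes as two-letter strings and the `min_snps` threshold are all assumptions made for illustration. Real matching works on hundreds of thousands of SNPs, applies cM as well as SNP-count thresholds, and has to cope with no-calls and genotyping errors.

```python
from typing import List, Tuple


def half_identical(g1: str, g2: str) -> bool:
    """True if two unphased genotypes (e.g. 'AG' and 'GG') share at least one allele."""
    return bool(set(g1) & set(g2))


def half_identical_runs(
    person_a: List[str], person_b: List[str], min_snps: int
) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of consecutive
    half-identical SNPs that is at least min_snps long (hypothetical threshold)."""
    runs = []
    start = None  # index where the current run began, or None if not in a run
    n = min(len(person_a), len(person_b))
    for i in range(n):
        if half_identical(person_a[i], person_b[i]):
            if start is None:
                start = i
        else:
            # Opposite homozygotes (e.g. 'GG' vs 'AA') break the run.
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and n - start >= min_snps:
        runs.append((start, n - 1))
    return runs


# Toy data: seven SNPs for two people, genotypes unphased.
a = ["AA", "AG", "CC", "CT", "GG", "TT", "AG"]
b = ["AG", "GG", "CT", "TT", "AA", "CT", "AA"]
print(half_identical_runs(a, b, min_snps=3))
```

Because the check only asks whether at least one letter matches at each position, the matching letter can come from the maternal chromosome at one SNP and the paternal chromosome at the next. That is the zigzag effect, and it is why short runs found this way may not survive phasing.
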
Each individual has their own personal pile-ups. It can be instructive to map out your pile-ups so that you are aware of your own danger zones. I've previously used Don Worth's ADSA (autosomal DNA segment analyser) tool which is available from DNAGedcom to look at my pile-ups. I've also used the matching segment search at GEDmatch (this tool is available to Tier 1 subscribers). (5) These tools are very useful for identifying problems in specific regions but they make it difficult to get a good idea of the bigger picture.

Following on from our discussion in the All Genetic Genealogy Facebook group, Dan Edwards has been working on an exciting tool to provide a new way of visualising pile-ups. It's possible that the tool will eventually be made available on the web but for the moment it is a bespoke service. Dan has been experimenting on some of my data. He has produced for me some charts showing the distribution of shared segments across my 22 autosomes and on the X-chromosome. Dan has kindly given me permission to share my charts which are reproduced below.

The charts are based on my Family Finder chromosome browser data from Family Tree DNA. FTDNA updated their match thresholds in May 2016, but they are still the only company that continues to include small segments under 6 cMs when inferring a relationship. It is generally accepted by genetic genealogists that the use of such small segments is problematical. (6)

The problem with small segments can be clearly seen in the charts below. Rather than being distributed evenly across my genome, the smaller shared segments form huge spires and skyscrapers. As the segment size increases the pile-ups are greatly reduced, but there are still some parts of my genome which have some quite sizeable pile-ups on segments over 10 cMs in size. Chromosomes 9, 14, 18 and 19, in particular, seem to have a few problem areas which it is probably best for me to avoid. As more matches come in, these spires and skyscrapers can be expected to grow even more. Remember too that FTDNA only reports "matches" on small segments if the match thresholds have already been met. If shared segments down to 1 cM were reported for everyone in the database it's likely that the spires would be even more pronounced.
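
For readers who want to experiment with their own data, the idea of counting how many shared segments are stacked over each stretch of a chromosome can be sketched as follows. This is not Dan's tool or method, which I have not seen in code form; it is a minimal illustration, and the `Segment` fields, the `pile_up_depth` function and the `min_cm` parameter are names I have made up. It assumes you have already extracted the chromosome, start position, end position and cM value for each shared segment from your chromosome browser download.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    chromosome: str
    start: int   # start position in base pairs
    end: int     # end position in base pairs
    cm: float    # segment length in centiMorgans


def pile_up_depth(
    segments: Iterable[Segment], min_cm: float = 0.0
) -> Dict[str, List[Tuple[int, int, int]]]:
    """For each chromosome return (start, end, depth) intervals, where depth
    is the number of segments of at least min_cm covering that interval."""
    events = defaultdict(list)
    for seg in segments:
        if seg.cm >= min_cm:
            events[seg.chromosome].append((seg.start, 1))   # a segment opens
            events[seg.chromosome].append((seg.end, -1))    # a segment closes
    profile = {}
    for chrom, evs in events.items():
        # Sort by position; at the same position closings (-1) sort before openings (+1).
        evs.sort()
        intervals = []
        depth = 0
        prev = None
        for pos, delta in evs:
            if prev is not None and pos > prev and depth > 0:
                intervals.append((prev, pos, depth))
            depth += delta
            prev = pos
        profile[chrom] = intervals
    return profile


# Made-up segments purely for illustration.
segments = [
    Segment("9", 100, 500, 12.0),
    Segment("9", 300, 700, 3.0),
    Segment("9", 400, 600, 2.5),
    Segment("14", 50, 80, 1.5),
]
all_sizes = pile_up_depth(segments)
over_10cm = pile_up_depth(segments, min_cm=10.0)
print(max(all_sizes["9"], key=lambda interval: interval[2]))
print(over_10cm)
```

Running the function once with no threshold and again with `min_cm=10.0` mirrors the comparison described above: the stack over a region shrinks as the smaller segments are excluded. Plotting depth against position for each chromosome would give a picture of the spires.
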

If Dan is able to develop his tool further and make it more widely available it will be interesting to see how other people's pile-ups compare with mine. I hope that we might also be able to identify a reason for some of the pile-ups. In the meantime I hope you enjoy looking at my pictures.

[Charts: distribution of shared segments across the 22 autosomes and the X-chromosome.]

Footnotes

(1) See: Chiang CWK, Ralph P, Novembre J (2016). Conflation of short identity-by-descent segments bias their inferred length distribution. G3 Genes Genomes Genetics 6: 1287.

(2) For a useful overview of SNP coverage on the chips used by AncestryDNA and 23andMe see Rebekah Canada's series of articles on the subject of exploring microarray chips.

(3) For a good overview of the methodology of IBD detection see Browning and Browning (2012):  Identity by descent between distant relatives: detection and applications (Annual Review of Genetics 2012; 46: 617-33). The authors state: "The key idea behind IBD segment detection is haplotype frequency. If the frequency of a shared haplotype is very small, the haplotype is unlikely to be observed twice in independently sampled individuals, so one can infer the presence of an IBD segment. This criterion can be applied in several ways. The first is length of sharing, which is a proxy for frequency. If two densely genotyped haplotypes are identical at all or most (allowing for some genotyping error) assayed alleles over a very large segment of a chromosome, then the haplotypes are likely to be identical by descent across the whole segment. The second is direct use of haplotype frequency: Shared haplotypes with estimated frequency below some threshold are determined to be identical by descent. The third makes use of a population genetics model to infer probability of IBD. Given the frequency of the shared haplotype and a probability model for the IBD process along the chromosome, one can estimate the probability that the individuals are identical by descent at any position on the segment."

(4) For a good explanation of how the AncestryDNA algorithm works read the blog post by Julie Granka on Filtering DNA matches at AncestryDNA with Timber. Take a look in particular at the figure in that blog post. Although the majority of phased segments filtered out by Timber are smaller segments under 15 cMs, note that it also downweights some larger segments up to 50 cMs in size.

(5) Peter Alefounder has developed a tool known as the Geneal Segment Stacker but I've not yet had time to play around with it. There are further details in this thread in the ISOGG Facebook group.

(6) For an excellent summary on the current state of our knowledge on the subject of small segments see the blog post A small segment round up by Blaine Bettinger.

5 comments:

Dan Edwards said...

While ideally a graphing tool like this can be made widely available, high resolution coverage calculations are computationally intensive, and thus it being freely available is unlikely.

For now, using currently available tools for tracking shared segments can give visual clues of pile-up areas. For example, the ADSA tool at DNAGedcom: http://www.dnagedcom.com/adsa/sample.html . Using such a tool one loses the resolution along the chromosome coordinate, but prominent pile-ups can still be identified.

Hopefully soon I'll write up more at my own blog.

Linda Jonas said...

Great post! Thank you!

Debbie Kennett said...

Thank you Dan and Linda. I'm very much looking forward to reading Dan's blog post.

I'm sure people would be interested in paying for a bespoke service. I wonder if there's a possibility of taking advantage of computational capacity in the cloud.

E. Wolfson said...

Thanks for this article. I'm still trying to digest it all. Perhaps someone can help me with a problem. I'm trying to locate a particular unknown grandparent for someone (NPE). I've tested many first and second cousins on all other sides, since no match jumped out initially. I'm seeing six or seven matches on just GEDmatch, mind you...which appear to share a 20-40cM segment on Chromosome 7 with this person but don't appear to significantly match anyone from the other (known) sides of their family. I did find them by doing the kits that match both kits function with a 40cM segment match, so perhaps that is the reason, but I didn't expect such a large segment overlap there. Some of them share one or two or even three other segments over 10cM but we're talking about an Ashkenazi (endogamous) population. How much should I be barking up the tree with the 35-40cM matches for this particular segment? Are there any IBC or pile-up segments that "large"? Somewhere I read Timber might remove segments with a measurement that large. Not sure I'll get notified without my name so if anyone has some thoughts, feel free to email me at ewolfson at yahoo dot com. Thanks for your time.

Debbie Kennett said...

E Wolfson, Here's the blog post from AncestryDNA about the Timber algorithm:

https://blogs.ancestry.com/ancestry/2015/6/8/filtering-dna-matches-at-ancestrydna-with-timber/

As you can see, Timber does remove some very sizeable segments. The fact that many of you share a single 40 cM segment suggests that this is some sort of pile-up and is therefore not genealogically relevant.