Showing posts with label Small segments. Show all posts

Monday, 22 January 2018

Small segments and pile-ups - a visualisation

We've recently been discussing the problem of pile-ups in the All Genetic Genealogy group on Facebook. A pile-up is a term used in genetic genealogy to describe multiple shared autosomal DNA segments that are stacked up on top of each other on the same part of the genome. The presence of a pile-up should be considered as a warning sign. For any shared segment to have genealogical significance we would expect it to be shared only with descendants of the common ancestral couple. If we share a segment with hundreds or thousands of people it is extremely unlikely that we will share that section of DNA by virtue of a recent genealogical relationship within the last ten generations or so, and it is much more likely to be indicative of a false match or a more distant relationship.

Pile-ups can occur for a number of different reasons:
  • Lack of phasing. Phasing is the process of sorting the DNA letters (the As, Cs, Ts and Gs) onto the paternal and maternal chromosomes. AncestryDNA and MyHeritage now use phased matching, which means that they phase our genotypes before trying to identify shared sections of DNA. 23andMe and Family Tree DNA use a process of half-identical matching. Our DNA is not phased but instead the algorithms zigzag backwards and forwards across two columns of unsorted DNA letters looking for consecutive runs of matching SNPs. Half-identical matching works well at identifying large shared segments of DNA but is less successful on smaller segments, and particularly segments under about 10 centiMorgans (cMs) in size. If a match does not survive phasing it is a false match.
  • SNP-poor regions. The autosomal DNA tests used for genetic genealogy provide information on between 630,000 and 700,000 genetic markers known as SNPs (single nucleotide polymorphisms) which are scattered across the genome. These SNPs are only a tiny fraction of the three billion letters which make up the human genome, but the SNPs are specially selected for being the most informative about variations within and between populations. When trying to identify shared regions of the genome the companies are looking for long runs of consecutive SNPs that are the same (identical by state or IBS) in two individuals. Segments which pass the companies' matching thresholds are declared to be identical by descent (IBD) and are possibly indicative of shared ancestry in a genealogical timeframe. Some companies will also apply additional algorithms to filter out known problematic regions which are unlikely to be IBD. However, because not all of our SNPs are being tested, the length of a segment can be falsely inflated. One hypothesis is that lots of small segments can become conflated into longer segments. (1) This problem is particularly likely to occur in sections of the genome which have poor coverage on the chips. (2) 
  • Excess IBD. This is a term used to describe sections of the genome which are known to be widely shared in humans or in certain populations. Such regions often offer some type of evolutionary advantage. For an overview of known excess IBD regions see the section on excess IBD sharing in the ISOGG Wiki article on IBD. In addition to looking at the size of a shared segment, some IBD detection algorithms will, therefore, also take into account the frequency of the segment. (3) The more people who share a segment, the older it is likely to be. AncestryDNA apply their proprietary Timber algorithm to phased segments and they downweight the cM count for segments that are widely shared in their database. (4)
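To make the idea of half-identical matching more concrete, here is a minimal sketch in Python. It is not the algorithm used by any of the testing companies. The function names, the toy genotypes and the `min_snps` threshold are all assumptions made for illustration. Real matching works on hundreds of thousands of SNPs, applies thresholds in centiMorgans and makes allowances for genotyping errors and no-calls.

```python
def half_identical(g1, g2):
    """True if two unphased genotypes (e.g. "AG" and "GG") share at least one allele."""
    return bool(set(g1) & set(g2))


def half_identical_runs(person_a, person_b, min_snps=3):
    """Return (start, end) index pairs for runs of consecutive half-identical SNPs.

    min_snps is a toy threshold (an assumption); real thresholds are far larger.
    """
    runs = []
    start = None
    n = min(len(person_a), len(person_b))
    for i in range(n):
        if half_identical(person_a[i], person_b[i]):
            if start is None:
                start = i  # a new run begins here
        else:
            # Opposite homozygotes (e.g. "GG" vs "AA") break the run.
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_snps:
        runs.append((start, n - 1))
    return runs


# Hypothetical unphased genotypes for two people at seven consecutive SNPs.
a = ["AG", "CC", "TT", "AG", "GG", "CT", "AA"]
b = ["GG", "CT", "TT", "AA", "AA", "CC", "AG"]
print(half_identical_runs(a, b))
```

The sketch only asks whether the two people share at least one letter at each SNP. It never checks that the shared letters sit on the same parental chromosome, which is why a run found in this way might not survive phasing.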
Each individual has their own personal pile-ups. It can be instructive to map out your pile-ups so that you are aware of your own danger zones. I've previously used Don Worth's ADSA (autosomal DNA segment analyser) tool which is available from DNAGedcom to look at my pile-ups. I've also used the matching segment search at GEDmatch (this tool is available to Tier 1 subscribers). (5) These tools are very useful for identifying problems in specific regions but it's difficult to get a good idea of the bigger picture.

Following on from our discussion in the All Genetic Genealogy Facebook group, Dan Edwards has been working on an exciting tool to provide a new way of visualising pile-ups. It's possible that the tool will eventually be made available on the web but for the moment it is a bespoke service. Dan has been experimenting on some of my data. He has produced for me some charts showing the distribution of shared segments across my 22 autosomes and on the X-chromosome. Dan has kindly given me permission to share my charts which are reproduced below.

The charts are based on my Family Finder chromosome browser data from Family Tree DNA. FTDNA updated their match thresholds in May 2016, but they are still the only company that continues to include small segments under 6 cMs when inferring a relationship. It is generally accepted by genetic genealogists that the use of such small segments is problematical. (6)

The problem with small segments can be clearly seen in the charts below. Rather than being distributed evenly across my genome, the smaller shared segments form huge spires and skyscrapers. As the segment size increases the pile-ups are greatly reduced, but there are still some parts of my genome which have some quite sizeable pile-ups on segments over 10 cMs in size. Chromosomes 9, 14, 18 and 19, in particular, seem to have a few problem areas which it is probably best for me to avoid. As more matches come in, these spires and skyscrapers can be expected to grow even more. Remember too that FTDNA only reports "matches" on small segments if the match thresholds have already been met. If small segments were reported for everyone in the database down to 1 cM it's likely that the spires would be even more pronounced.
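For anyone who wants to try counting their own pile-ups, the basic idea can be sketched in a few lines of Python. This is not Dan's tool, and I don't know how his charts are calculated. The function name, the bin size of one million base pairs and the example segments are all assumptions for illustration. The segments would need to be read in from your own chromosome browser data.

```python
from collections import Counter


def pile_up_depth(segments, min_cm=0.0, bin_size=1_000_000):
    """Count how many shared segments overlap each fixed-size bin of each chromosome.

    segments: iterable of (chromosome, start_bp, end_bp, centimorgans) tuples.
    min_cm:   ignore segments smaller than this many centiMorgans.
    bin_size: width of each bin in base pairs (an arbitrary choice for this sketch).
    """
    depth = Counter()
    for chrom, start, end, cm in segments:
        if cm < min_cm:
            continue
        # Add one to every bin that the segment touches.
        for b in range(start // bin_size, end // bin_size + 1):
            depth[(chrom, b)] += 1
    return depth


# Hypothetical segments, not taken from my real data.
segments = [
    ("9", 1_200_000, 3_900_000, 2.1),
    ("9", 2_000_000, 4_500_000, 3.4),
    ("9", 2_500_000, 12_000_000, 11.8),
    ("14", 500_000, 1_500_000, 1.6),
]

all_depth = pile_up_depth(segments)
large_depth = pile_up_depth(segments, min_cm=10)
print("Tallest pile-up, all segments:", max(all_depth.values()))
print("Tallest pile-up, 10 cM and over:", max(large_depth.values()))
```

Running the count at several different values of `min_cm` gives a rough numerical version of the comparison described above, where the spires shrink as the minimum segment size goes up.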

If Dan is able to develop his tool further and make it more widely available it will be interesting to see how other people's pile-ups compare with mine. I hope that we might also be able to identify a reason for some of the pile-ups. In the meantime I hope you enjoy looking at my pictures.


Footnotes

(1) See: Chiang CWK, Ralph P, Novembre J (2016). Conflation of short identity-by-descent segments bias their inferred length distribution. G3 Genes Genomes Genetics 6: 1287.

(2) For a useful overview of SNP coverage on the chips used by AncestryDNA and 23andMe see Rebekah Canada's series of articles on the subject of exploring microarray chips.

(3) For a good overview of the methodology of IBD detection see Browning and Browning (2012):  Identity by descent between distant relatives: detection and applications (Annual Review of Genetics 2012; 46: 617-33). The authors state: "The key idea behind IBD segment detection is haplotype frequency. If the frequency of a shared haplotype is very small, the haplotype is unlikely to be observed twice in independently sampled individuals, so one can infer the presence of an IBD segment. This criterion can be applied in several ways. The first is length of sharing, which is a proxy for frequency. If two densely genotyped haplotypes are identical at all or most (allowing for some genotyping error) assayed alleles over a very large segment of a chromosome, then the haplotypes are likely to be identical by descent across the whole segment. The second is direct use of haplotype frequency: Shared haplotypes with estimated frequency below some threshold are determined to be identical by descent. The third makes use of a population genetics model to infer probability of IBD. Given the frequency of the shared haplotype and a probability model for the IBD process along the chromosome, one can estimate the probability that the individuals are identical by descent at any position on the segment."

(4) For a good explanation of how the AncestryDNA algorithm works read the blog post by Julie Granka on Filtering DNA matches at AncestryDNA with Timber. Take a look in particular at the figure in that blog post. Although the majority of phased segments filtered out by Timber are smaller segments under 15 cMs, note that it also downweights some larger segments up to 50 cMs in size.

(5) Peter Alefounder has developed a tool known as the Geneal Segment Stacker but I've not yet had time to play around with it. There are further details in this thread in the ISOGG Facebook group.

(6) For an excellent summary on the current state of our knowledge on the subject of small segments see the blog post A small segment round up by Blaine Bettinger.

Further reading

Sunday, 26 April 2015

Tracking DNA segments through time and space

One of the exciting aspects of autosomal DNA testing is that it gives us the opportunity of assigning segments of DNA to specific ancestors and then tracking the inheritance of those segments over time. To date the only match where I've been able to make the genealogical connection other than with immediate members of my family is with a fourth cousin, Mr K, who lives in Canada. I wrote previously about this match in my article on My first autosomal DNA success story. That match was very easy to identify because all my ancestors are in the UK and all Mr K's ancestors are in Canada and there was only one possible line where we could connect. It also helped that one of the shared surnames - Cruwys - is very rare. We can therefore state with confidence that the segment we share has been inherited from our ancestors William Cruwys (1793-1846) and Margaret Eastmond (1792-1874) who married in Rose Ash, Devon, on 18th July 1814.

I'd already tested my parents but one of my sons has now also taken the Family Finder test which gives me the opportunity to explore the inheritance patterns of these shared DNA segments in more detail.

In the screenshot below I've compared my dad with Mr K. They are third cousins once removed. They share three large segments in common: 20.12 centiMorgans on chromosome 1; 23.33 centiMorgans on chromosome 3 and 17.12 centiMorgans on chromosome 11.
Next I used the chromosome browser to compare myself with Mr K. Mr K and I are fourth cousins. You can see that two of the segments that my dad inherited have not been passed on to me and I only share a single segment on chromosome 11 with Mr K. This segment is 16.62 cMs and has been passed on virtually intact from my father to me.
The next screenshot shows a comparison of my son with Mr K. They are fourth cousins once removed. You can see that my son has inherited exactly the same segment as me. This segment measures 16.85 cMs and appears to be slightly larger in my son than it is in me, which is perhaps something to do with the rounding that FTDNA uses.
For all the above screenshots I've set the threshold at 5 cMs. Family Tree DNA are the only company who provide segment data under the 5 cM threshold. There has been much debate in the genetic genealogy community on the subject of small segments under 5 cMs, but there is a consensus that the vast majority of the small segments generated by the current chip tests are false positives and are nothing more than random noise. However, now that I have tested three generations of my family and we also have a match with a known cousin, I thought I would take the opportunity to do a comparison of the small segments to satisfy my own curiosity.

The screenshot below is taken from the perspective of my son, and I've set the threshold to 1 cM. The chromosome browser shows the segments my son shares in common with me, his maternal grandfather and his cousin Mr K. The segments shared with Mr K are shown in orange. The segments he shares with his grandfather are shown in green. The blue segments are shared in common with me. A child receives 50% of his DNA from his mother so my son matches me across the entire length of each chromosome. (Note that chromosomes come in pairs - we receive one set of chromosomes from our mother and one set of chromosomes from our father. However, the chromosome browser shows matches on a single chromosome and is unable to identify whether the match is on a maternal or a paternal chromosome.) There are 13 small segments that my son appears to share with Mr K. However, ten of these segments are seemingly shared by me, my son and Mr K but are not shared with my father. Clearly this is a biological impossibility because if a segment is identical by descent (IBD) then by definition it must have been passed on from a parent to a child and it couldn't possibly skip a generation. There are tiny segments on chromosome 6, chromosome 10 and chromosome 16 that are shared by all of us and these segments are therefore more likely to be IBD.

There have been suggestions that the process of triangulation (identifying three or more people who all match each other on the same segment of a chromosome) confirms that the segments are "real" or in other words that they are identical by descent (IBD). In this case the 13 small segments all triangulate with three people - me, my son and Mr K. However, when my dad is added to the mix we can see that the triangulation process breaks down. If the small segments were IBD then my dad should match on all of these small segments.
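The generation check described above can be sketched in Python. This is a minimal illustration and not a feature of any company's chromosome browser. The function names, the `min_fraction` cut-off and the segment positions are all hypothetical. The check only makes sense when the relationship to the cousin is known to run through the older relative who has been tested.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in base pairs of the overlap between two intervals (0 if none)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def skips_generation(child_seg, older_segs, min_fraction=0.5):
    """True if a child-cousin segment is not substantially covered by any
    segment that the older relative shares with the same cousin.

    Segments are (chromosome, start_bp, end_bp) tuples.
    min_fraction is an arbitrary cut-off chosen for this sketch.
    """
    chrom, start, end = child_seg
    length = end - start
    for o_chrom, o_start, o_end in older_segs:
        if o_chrom == chrom and overlap(start, end, o_start, o_end) >= min_fraction * length:
            return False
    return True


# Hypothetical positions, not my family's real data.
grandfather_vs_cousin = [
    ("1", 10_000_000, 30_000_000),
    ("6", 5_000_000, 6_000_000),
    ("11", 60_000_000, 78_000_000),
]
son_vs_cousin = [
    ("2", 40_000_000, 41_000_000),
    ("6", 5_100_000, 5_900_000),
    ("11", 60_500_000, 78_000_000),
]

suspect = [s for s in son_vs_cousin if skips_generation(s, grandfather_vs_cousin)]
print(suspect)
```

Any segment left in `suspect` appears in the younger generation without being present in the older one, so it cannot have been inherited through that line and is best treated as a false match.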

In the future when whole genome sequencing becomes the norm it should be possible to use small segments for genealogical matching purposes but with the limitations of the current technology extreme caution should be used when drawing conclusions about matches on small segments.

Further reading
The ISOGG Wiki article on identical by descent has further information on this subject.


There have been a number of blog posts that have dealt with the subject of small segments and they are all linked on the ISOGG Wiki page. I particularly recommend reading the following:


- Genealogy and autosomal DNA matches: common errors in “proving” an ancestor, and the allure of easy gateway ancestors by "Our Puzzling Past"

- Chromosome Pile-Ups in Genetic Genealogy: Examples from 23andMe and FTDNA by "Our Puzzling Past"



- What a difference a phase makes by Ann Turner (a guest post on Blaine Bettinger's blog)

Disclosure
I received a free DNA test from Family Tree DNA in compensation for speaking at Who Do You Think You Are? Live in 2014. I chose to have a Family Finder test which I used to test my son.

© 2015 Debbie Kennett