- AncestryDNA's cutting edge gets even sharper (accessible from the "learn more" link in the above box)
- The science behind a more precise DNA matching algorithm – a blog post by Anna Swayne on the AncestryDNA Tech Roots blog
- AncestryDNA Matching White Paper – essential reading if you want all the very technical details about the update.
The biggest change is an improvement in the phasing process. Phasing is the process of sorting out the DNA letters – the As, Cs, Ts and Gs – and placing them on the maternal and paternal chromosomes. Phasing is important for reducing false positive and false negative matches. AncestryDNA are now using a reference panel of more than 300,000 genotypes for their phasing. Previously they were using a "window" system for IBD (identical by descent) detection, which broke large segments into too many small pieces. Now they are using a SNP-based system, which provides more realistic results with fewer segments. Phasing can be done with reference panels with a high degree of accuracy – the error rate of Ancestry's Underdog phasing engine is less than 1%. The accuracy will increase as the reference panel grows in size.
The matching threshold has also been changed. Two people must now share a minimum of 6 cMs whereas the old threshold was 5 cMs. AncestryDNA have produced a revised table of confidence scores based on a new understanding of the amount of DNA shared between different relations.
Contrast the above scores with the old version of the chart below which, to my mind, was always overly optimistic, especially about the matches on segments under 20 cMs, the vast majority of which are actually shared with very distant cousins. (For more on this subject watch Dr Doug Speed's lecture "Who's your cousin? Using DNA to determine relatedness", which he presented at Who Do You Think You Are? Live this year.)
Comparing matches before and after
I thought it would be an interesting exercise to compare my matches before and after the update. Unlike Family Tree DNA and 23andMe, AncestryDNA do not provide a facility for customers to download their match list. Fortunately Rob Warthen from DNAGedcom has provided a tool known as the DNAGedcom Client, which allows us to download all our data from Ancestry, including details of the shared cM count and the number of shared segments. I downloaded my list of matches on 19th April. I ran the DNAGedcom Client again on 4th May, and I've compared the two datasets to see how many matches I've gained and lost.
Here is a comparison of the number of matches I had before and after the update:
| Date | Matches | 4th cousins | Distant cousins | Shaky leaf hints | Circles | NADs |
There was a marginal increase in the number of matches, but a close analysis of these matches provides a different perspective. I actually lost 1169 (34%) of my matches. However, this is more than made up for by the fact that I have gained 1178 new matches.
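The lost and gained counts come down to comparing two sets of match identifiers. The following is a minimal Python sketch of that comparison, not the DNAGedcom Client's own method or the Excel formula mentioned at the end of this post. The function name and the toy identifiers are invented for illustration, and it assumes each match carries an identifier that stays the same between the two downloads.

```python
def compare_match_lists(before_ids, after_ids):
    """Compare two downloads of a match list, given as iterables of match IDs."""
    before, after = set(before_ids), set(after_ids)
    lost = before - after        # present before the update, absent afterwards
    gained = after - before      # new since the update
    retained = before & after    # present in both downloads
    pct_lost = 100 * len(lost) / len(before) if before else 0.0
    return {
        "lost": lost,
        "gained": gained,
        "retained": retained,
        "pct_lost": pct_lost,
        "net_change": len(after) - len(before),
    }


# Toy data only: these identifiers are invented, not real matches.
result = compare_match_lists(["m1", "m2", "m3", "m4"], ["m3", "m4", "m5"])
print(len(result["lost"]), len(result["gained"]), result["pct_lost"])
```

The net change is simply gained minus lost, so with the figures quoted above, 1178 gained and 1169 lost give a net gain of only 9 matches, even though around a third of the list has turned over.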
This is a breakdown of the size of the segments I share with my matches before and after the update:
| Date | < 6 cMs | 6-6.99 cMs | 7-9.9 cMs | 10-10.9 cMs | >15 cMs | Matches |
I thought it would be interesting to do a further breakdown of my matches who were predicted to be fourth cousins or closer. Note that what AncestryDNA describe as a fourth cousin can in fact be anything from a fourth to a sixth cousin.
| Relationship before | Relationship after | cMs before | cMs after | Segments before | Segments after |
|---|---|---|---|---|---|
| 3rd cousin | 3rd cousin | 109.71 | 117.197 | 8 | 5 |
| 3rd cousin | 3rd cousin | 84.358 | 98.3169 | 4 | 4 |
| 4th cousin | 4th cousin | 52.361 | 61.0445 | 4 | 3 |
| 4th cousin | 4th cousin | 23.947 | 30.4435 | 4 | 3 |
| 4th cousin | 4th cousin | 23.695 | 29.4889 | 1 | 1 |
| 4th cousin | 4th cousin | 21.776 | 27.2691 | 1 | 1 |
| 4th cousin | 4th cousin | 24.098 | 25.2915 | 2 | 2 |
| 4th cousin | 4th cousin | 22.325 | 24.4259 | 1 | 1 |
| Distant cousin | 4th cousin | 11.364 | 23.8771 | 3 | 2 |
| 4th cousin | 4th cousin | 20.065 | 23.8634 | 1 | 1 |
| 4th cousin | 4th cousin | 20.704 | 23.1975 | 2 | 2 |
| 4th cousin | 4th cousin | 17.159 | 22.9076 | 2 | 2 |
| Distant cousin | 4th cousin | 13.204 | 22.8962 | 2 | 1 |
| Distant cousin | 4th cousin | 13.653 | 22.2734 | 3 | 2 |
| 4th cousin | 4th cousin | 18.581 | 21.1956 | 1 | 1 |
| 4th cousin | 4th cousin | 22.604 | 20.9644 | 1 | 1 |
| 4th cousin | 4th cousin | 20.896 | 20.8483 | 2 | 1 |
| 4th cousin | 4th cousin | 18.033 | 20.6506 | 1 | 1 |
| 4th cousin | Distant cousin | 24.068 | 13.0545 | 1 | 1 |
| 4th cousin | Distant cousin | 19.149 | 19.3702 | 2 | 1 |
| 4th cousin | Distant cousin | 18.629 | 18.2796 | 1 | 1 |
| 4th cousin | Distant cousin | 18.474 | 18.6107 | 1 | 1 |
As can be seen, for most of the matches that have been retained there has been a marginal increase in the cM count. Four matches have been downgraded from fourth cousins to distant cousins. Ten of my previous fourth cousins (35%) have disappeared from my match list completely. It may be that these matches were filtered out because of the improved phasing. Another possibility is that these segments were in SNP-poor regions. Ancestry explain in their white paper that matches in these regions are unreliable. To counteract this problem they "discount these matches by reducing their total length (in cM)".
These matches are no great loss. All these fourth cousins were in America and it was impossible to find any sort of genealogical relationship, despite the fact that some of these matches had huge and very detailed trees. I'd rather suspected that these matches must be very distant, if they were legitimate at all. I already have far more matches than I know what to do with and I can still only find the genealogical connection with two of my matches at AncestryDNA. I would much rather have fewer and more accurate matches.
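The before-and-after table of retained matches can also be tallied programmatically. The Python sketch below simply counts rows from that table; the variable names and the relationship ranking are my own for illustration and are not part of any AncestryDNA or DNAGedcom tool.

```python
# Rows copied from the table of retained matches:
# (relationship before, relationship after, cM before, cM after,
#  segments before, segments after)
RETAINED = [
    ("3rd cousin", "3rd cousin", 109.71, 117.197, 8, 5),
    ("3rd cousin", "3rd cousin", 84.358, 98.3169, 4, 4),
    ("4th cousin", "4th cousin", 52.361, 61.0445, 4, 3),
    ("4th cousin", "4th cousin", 23.947, 30.4435, 4, 3),
    ("4th cousin", "4th cousin", 23.695, 29.4889, 1, 1),
    ("4th cousin", "4th cousin", 21.776, 27.2691, 1, 1),
    ("4th cousin", "4th cousin", 24.098, 25.2915, 2, 2),
    ("4th cousin", "4th cousin", 22.325, 24.4259, 1, 1),
    ("Distant cousin", "4th cousin", 11.364, 23.8771, 3, 2),
    ("4th cousin", "4th cousin", 20.065, 23.8634, 1, 1),
    ("4th cousin", "4th cousin", 20.704, 23.1975, 2, 2),
    ("4th cousin", "4th cousin", 17.159, 22.9076, 2, 2),
    ("Distant cousin", "4th cousin", 13.204, 22.8962, 2, 1),
    ("Distant cousin", "4th cousin", 13.653, 22.2734, 3, 2),
    ("4th cousin", "4th cousin", 18.581, 21.1956, 1, 1),
    ("4th cousin", "4th cousin", 22.604, 20.9644, 1, 1),
    ("4th cousin", "4th cousin", 20.896, 20.8483, 2, 1),
    ("4th cousin", "4th cousin", 18.033, 20.6506, 1, 1),
    ("4th cousin", "Distant cousin", 24.068, 13.0545, 1, 1),
    ("4th cousin", "Distant cousin", 19.149, 19.3702, 2, 1),
    ("4th cousin", "Distant cousin", 18.629, 18.2796, 1, 1),
    ("4th cousin", "Distant cousin", 18.474, 18.6107, 1, 1),
]

# Higher number = closer predicted relationship (illustrative ranking only).
CLOSENESS = {"Distant cousin": 1, "4th cousin": 2, "3rd cousin": 3}


def summarise(rows):
    """Count how the retained matches changed between the two downloads."""
    return {
        "cm_increased": sum(1 for r in rows if r[3] > r[2]),
        "cm_decreased": sum(1 for r in rows if r[3] < r[2]),
        "upgraded": sum(1 for r in rows if CLOSENESS[r[1]] > CLOSENESS[r[0]]),
        "downgraded": sum(1 for r in rows if CLOSENESS[r[1]] < CLOSENESS[r[0]]),
        "fewer_segments": sum(1 for r in rows if r[5] < r[4]),
        "more_segments": sum(1 for r in rows if r[5] > r[4]),
    }


print(summarise(RETAINED))
```

On these 22 rows the shared cM total rises for 18 matches and falls for 4, and the segment count drops for 8 matches and never increases, which fits the move to a system that produces fewer segments.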
It's important to remember that we are all pioneers in this field, the tests are in their infancy and we still have much to learn.
At Family Tree DNA, 23andMe and GEDmatch we are used to working with unphased data, which can produce false positive matches, particularly on smaller segments under 15 cMs. Ancestry are the only company who filter out the high-frequency matches which are not of genealogical relevance, though 23andMe do screen out some matches in known problem areas. IBD segments with high rates of matching are likely to be less useful for detecting relationships in a recent genealogical timeframe.
Without phasing and without frequency filters it is much easier for people to find false coincidental matches, but we all need to be very careful about jumping to conclusions, especially with more distant relationships, where it is so much more difficult to detect recent IBD with the currently available tests.
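A toy example may help to show why unphased data can produce coincidental matches. The Python sketch below uses an invented four-SNP window and made-up haplotypes. It is a deliberate simplification, not the algorithm used by any testing company: real comparisons run over many thousands of SNPs, measure length in cMs and allow for genotyping errors.

```python
def to_genotype(haplotypes):
    """Collapse a (maternal, paternal) pair of haplotype strings into unphased
    genotypes: one unordered pair of letters per SNP, e.g. "AG"."""
    maternal, paternal = haplotypes
    return ["".join(sorted(pair)) for pair in zip(maternal, paternal)]


def half_identical(genotypes_a, genotypes_b):
    """Unphased comparison: True if the two people share at least one letter
    at every SNP in the window."""
    return all(set(a) & set(b) for a, b in zip(genotypes_a, genotypes_b))


def shares_haplotype(haplotypes_a, haplotypes_b):
    """Phased comparison: True if one whole chromosome copy of A is identical
    to one whole chromosome copy of B across the window."""
    return any(ha == hb for ha in haplotypes_a for hb in haplotypes_b)


# Invented data; each tuple is (maternal haplotype, paternal haplotype).
person_a = ("AGAG", "GAGA")
person_b = ("AAAA", "GGGG")  # shares no whole haplotype with A
person_c = ("AGAG", "AAAA")  # carries the same haplotype as A's maternal copy

# Unphased, A and B look like a match at every SNP...
print(half_identical(to_genotype(person_a), to_genotype(person_b)))  # True
# ...but no single chromosome copy is actually shared.
print(shares_haplotype(person_a, person_b))  # False
# A and C share a complete haplotype, as a genuine IBD segment would.
print(shares_haplotype(person_a, person_c))  # True
```

The unphased comparison is satisfied by zigzagging between the maternal and paternal copies, which is why short segments found this way need to be treated with caution.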
This is the second time that AncestryDNA have updated their algorithms. Family Tree DNA have already changed their algorithms once, which resulted in some lost matches. We should all expect to see further changes to the companies' matching algorithms in the future as they strive to improve the technology and produce more accurate results.
Thumbs up; AncestryDNA improves genetic matching technology - a review by Diahan Southard, 9 May 2016.
Thanks to Don Worth in the ISOGG Facebook group for sharing his Excel formula for calculating the number of lost matches.
© 2016 Debbie Kennett