Friday, 6 May 2016

AncestryDNA's updated matching algorithms - a before and after analysis

AncestryDNA rolled out their long-awaited new matching algorithms on Tuesday this week. This message will now greet you when you log into your AncestryDNA account.
Ancestry have provided a number of resources to describe the changes, all of which merit a close reading:
AncestryDNA have been able to make these improvements because they have such a massive autosomal DNA database. They have now tested nearly two million people. Their scientists have been able to exploit the power of this large database to provide new insights into relatedness and to improve the detection of genealogically relevant IBD segments.

The biggest change is an improvement in the phasing process. Phasing is the process of sorting out the DNA letters (the As, Cs, Ts and Gs) and placing them on the maternal and paternal chromosomes. Phasing is important for ruling out false positive and false negative matches. AncestryDNA are now using a reference panel of more than 300,000 genotypes for their phasing. Previously they were using a "window" system for IBD detection which broke the large segments into too many small pieces. Now they are using a SNP-based system which provides more realistic results with fewer segments. Phasing can be done with reference panels with a high degree of accuracy: the error rate of Ancestry's Underdog phasing engine is less than 1%. The accuracy will increase as the reference panel grows in size.
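To see why phasing matters for false positives, here is a toy sketch in Python. It is not Ancestry's Underdog method. The six-SNP haplotypes and the function names are invented purely for illustration. The point is that two people can look like a match at every SNP when only unphased genotypes are compared, yet share no complete haplotype once the letters have been assigned to individual chromosomes.

```python
from itertools import product


def genotypes(haplotypes):
    """Collapse a pair of haplotypes into unordered per-SNP genotypes."""
    return [frozenset(pair) for pair in zip(*haplotypes)]


def unphased_compatible(geno_a, geno_b):
    """Half-identical test: at every SNP the two people share at least one letter."""
    return all(a & b for a, b in zip(geno_a, geno_b))


def phased_match(haps_a, haps_b):
    """True only if one whole haplotype of A equals one whole haplotype of B."""
    return any(ha == hb for ha, hb in product(haps_a, haps_b))


# Hypothetical (maternal, paternal) haplotypes over six SNPs.
person_a = ("ATATCG", "GCCGTA")
person_b = ("ACAGCA", "GTCTTG")

# Unphased, the region looks like a match; phased, no haplotype is shared.
looks_like_match = unphased_compatible(genotypes(person_a), genotypes(person_b))
really_matches = phased_match(person_a, person_b)
print(looks_like_match, really_matches)
```

In this made-up region both people are heterozygous at every SNP, so an unphased comparison can never rule the match out. Real matching works on far longer runs of SNPs, which is why the problem mainly affects small segments.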

The matching threshold has also been changed. Two people must now share a minimum of 6 cMs whereas the old threshold was 5 cMs. AncestryDNA have produced a revised table of confidence scores based on a new understanding of the amount of DNA shared between different relations.


Contrast the above scores with the old version of the chart below which, to my mind, was always overly optimistic, especially about the matches on segments under 20 cMs, the vast majority of which are actually shared with very distant cousins. (For more on this subject watch Dr Doug Speed's lecture "Who's your cousin? Using DNA to determine relatedness", which he presented at Who Do You Think You Are? Live this year.)


Comparing matches before and after
I thought it would be an interesting exercise to compare my matches before and after the update. Unlike Family Tree DNA and 23andMe, AncestryDNA do not provide a facility for customers to download their match list. Fortunately Rob Warthen from DNAGedcom has provided a tool known as the DNAGedcom Client, which allows us to download all our data from Ancestry, including details of the shared cM count and the number of shared segments. I downloaded my list of matches on 19th April. I ran the DNAGedcom Client again on 4th May, and I've compared the two datasets to see how many matches I've gained and lost.

Here is a comparison of the number of matches I had before and after the update:

Date     | Matches | 4th cousins | Distant cousins | Shaky leaf hints | Circles | NADs
4 May    | 3423    | 18          | 3405            | 1                | 0       | 0
19 April | 3414    | 28          | 3386            | 1                | 0       | 0

There was a marginal increase in the number of matches, but a close analysis of these matches provides a different perspective. I actually lost 1169 (34%) of my matches. However, this is more than made up for by the fact that I have gained 1178 new matches.
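For anyone who wants to repeat the comparison without a spreadsheet, here is a minimal Python sketch of the same idea using set arithmetic. It is not the Excel formula mentioned in the acknowledgements. The column name `match_id` is a placeholder of mine, so check the header row of your own export. The last few lines simply check that the figures quoted above are consistent with each other.

```python
import csv


def load_match_ids(path, id_column="match_id"):
    """Read one exported match list and return the set of match identifiers.

    The column name is a placeholder; use whatever your export calls it.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}


def compare_match_lists(before, after):
    """Split two match lists into lost, gained and retained matches."""
    before, after = set(before), set(after)
    return {
        "lost": before - after,
        "gained": after - before,
        "retained": before & after,
    }


# Toy identifiers, purely for illustration.
result = compare_match_lists({"m1", "m2", "m3"}, {"m2", "m3", "m4", "m5"})
print({name: sorted(ids) for name, ids in result.items()})

# Consistency check on the figures quoted in this post.
before_total, lost, gained = 3414, 1169, 1178
after_total = before_total - lost + gained
lost_share = round(100 * lost / before_total)
print(after_total, lost_share)
```

With real files you would pass `load_match_ids("before.csv")` and `load_match_ids("after.csv")` to `compare_match_lists`; both file names are hypothetical.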

This is a breakdown of the size of the segments I share with my matches before and after the update:

Date     | < 6 cMs | 6-6.99 cMs | 7-9.9 cMs | 10-10.9 cMs | >15 cMs | Matches
4 May    | 0       | 1518       | 1456      | 381         | 68      | 3423
19 April | 1737    | 704        | 719       | 214         | 40      | 3414
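A breakdown like this can be produced by sorting each match's total shared cM into bins. The sketch below is one way to do it in Python. The bin edges are my reading of the column headings above, and the example values are invented, so treat both as assumptions.

```python
from bisect import bisect_right
from collections import Counter

# Assumed lower edges of the bins, in cM.
EDGES = [6, 7, 10, 15]
LABELS = ["< 6", "6 to < 7", "7 to < 10", "10 to < 15", ">= 15"]


def bin_label(shared_cm):
    """Return the label of the bin that a shared-cM total falls into."""
    return LABELS[bisect_right(EDGES, shared_cm)]


def tally(shared_cms):
    """Count matches per bin, keeping the bins in ascending order."""
    counts = Counter(bin_label(cm) for cm in shared_cms)
    return {label: counts.get(label, 0) for label in LABELS}


# Invented shared-cM totals, purely for illustration.
example = [5.2, 6.4, 6.9, 8.1, 12.5, 23.9]
summary = tally(example)
for label, count in summary.items():
    print(f"{label:>10} cM: {count}")
```

A value that sits exactly on an edge, such as 6.0 cM, is counted in the higher bin, which mirrors the new 6 cM minimum for a match.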

I thought it would be interesting to do a further breakdown of my matches who were predicted to be fourth cousins or closer. Note that what AncestryDNA describe as a fourth cousin can in fact be anything from a fourth to a sixth cousin.

Relationship before | Relationship after | cMs before | cMs after | Segments before | Segments after
3rd cousin          | 3rd cousin         | 109.71     | 117.197   | 8               | 5
3rd cousin          | 3rd cousin         | 84.358     | 98.3169   | 4               | 4
4th cousin          | 4th cousin         | 52.361     | 61.0445   | 4               | 3
4th cousin          | 4th cousin         | 23.947     | 30.4435   | 4               | 3
4th cousin          | 4th cousin         | 23.695     | 29.4889   | 1               | 1
4th cousin          | 4th cousin         | 21.776     | 27.2691   | 1               | 1
4th cousin          | 4th cousin         | 24.098     | 25.2915   | 2               | 2
4th cousin          | 4th cousin         | 22.325     | 24.4259   | 1               | 1
Distant cousin      | 4th cousin         | 11.364     | 23.8771   | 3               | 2
4th cousin          | 4th cousin         | 20.065     | 23.8634   | 1               | 1
4th cousin          | 4th cousin         | 20.704     | 23.1975   | 2               | 2
4th cousin          | 4th cousin         | 17.159     | 22.9076   | 2               | 2
Distant cousin      | 4th cousin         | 13.204     | 22.8962   | 2               | 1
Distant cousin      | 4th cousin         | 13.653     | 22.2734   | 3               | 2
4th cousin          | 4th cousin         | 18.581     | 21.1956   | 1               | 1
4th cousin          | 4th cousin         | 22.604     | 20.9644   | 1               | 1
4th cousin          | 4th cousin         | 20.896     | 20.8483   | 2               | 1
4th cousin          | 4th cousin         | 18.033     | 20.6506   | 1               | 1
4th cousin          | Distant cousin     | 24.068     | 13.0545   | 1               | 1
4th cousin          | Distant cousin     | 19.149     | 19.3702   | 2               | 1
4th cousin          | Distant cousin     | 18.629     | 18.2796   | 1               | 1
4th cousin          | Distant cousin     | 18.474     | 18.6107   | 1               | 1
4th cousin          | None               | 24.538     | -         | 1               | -
4th cousin          | None               | 23.731     | -         | 1               | -
4th cousin          | None               | 23.099     | -         | 1               | -
4th cousin          | None               | 21.978     | -         | 1               | -
4th cousin          | None               | 21.285     | -         | 1               | -
4th cousin          | None               | 21.093     | -         | 1               | -
4th cousin          | None               | 20.878     | -         | 1               | -
4th cousin          | None               | 20.172     | -         | 1               | -
4th cousin          | None               | 19.937     | -         | 1               | -
4th cousin          | None               | 18.071     | -         | 1               | -

As can be seen, for the matches that have been retained there has been a marginal increase in the cM count. Four matches have been downgraded from fourth cousins to distant cousins. Ten of my previous fourth cousins (35%) have disappeared from my match list completely. It may be that these matches were filtered out because of the improved phasing. Another possibility is that these segments were in SNP-poor regions. Ancestry explain in their white paper that matches in these regions are unreliable. To counteract this problem they "discount these matches by reducing their total length (in cM)". These matches are no great loss. All these fourth cousins were in America and it was impossible to find any sort of genealogical relationship despite the fact that some of these matches had huge and very detailed trees. I'd rather suspected that these matches must be very distant, if they were legitimate at all. I already have far more matches than I know what to do with and I can still only find the genealogical connection with two of my matches at AncestryDNA. I would much rather have fewer and more accurate matches.

Conclusion
It's important to remember that we are all pioneers in this field, the tests are in their infancy and we still have much to learn.

At Family Tree DNA, 23andMe and GedMatch we are used to working with unphased data which can produce false positive matches, particularly on smaller segments under 15 cMs. Ancestry are the only company who filter out the high-frequency matches which are not of genealogical relevance, though 23andMe do screen out some matches in known problem areas. IBD segments with high rates of matching are likely to be less useful for detecting relationships in a recent genealogical timeframe.

Without phasing and without frequency filters it is much easier for people to find false coincidental matches, but we all need to be very careful about jumping to conclusions, especially with more distant relationships, where it is so much more difficult to detect recent IBD with the currently available tests.

This is the second time that AncestryDNA have updated their algorithms. Family Tree DNA have already changed their algorithms once, which resulted in some lost matches. We should all expect to see further changes to the companies' matching algorithms in the future as they strive to improve the technology and produce more accurate results.

Further reading
Thumbs up; AncestryDNA improves genetic matching technology - a review by Diahan Southard, 9 May 2016.

Acknowledgements
Thanks to Don Worth in the ISOGG Facebook group for sharing his Excel formula for calculating the number of lost matches.

© 2016 Debbie Kennett

8 comments:

Jason said...

There's a lot here that casts AncestryDNA in a very flattering light. I'm not convinced they've earned that.

Debbie Kennett said...

AncestryDNA have done a very good job with their update and have written an excellent White Paper. They are doing far more than any of the other companies to detect IBD. What specifically is it that you disagree with?

Curmudgeon said...

I'm not sure if I've misunderstood these sentences "I already have far more matches than I know what to do with and I can still only find the genealogical connection with two of my matches at AncestryDNA. I would much rather have fewer and more accurate matches."

I've tested a number of my relatives, daughter, sister, aunt and uncle, some 4th cousins, and all of us have thousands of matches and hundreds of them have identifiable genealogical connections, common ancestors with us. I use the memo field to show the MRCA names and relationships and I'm adding segment size as I review after this latest update. I'll be comparing segment sizes from MRCAs to see if that tells me anything. We just gained new matches to seventh/eighth GGPs for some of us. I pretty much ignore matches with locked or no trees, unless they pop up in a search for certain surnames.

Debbie Kennett said...

I'm in the UK and all my ancestry is from the UK. About 95% of my matches are in the US. If my matches do have UK ancestry they generally don't know where in the UK their ancestors are from. They generally just have vague pins for Ireland, England or London. In such circumstances it is impossible to find any genealogical connection, and no amount of research is ever going to change that. The situation will no doubt improve as more people from the UK test.

rhoshundred said...

I'm just getting into DNA and find that I'm beginning to get more matches closer to home, as in 10 miles away or at my paternal ancestors' birthplace. These matches unfortunately lead nowhere. I find that if I use Gedmatch with Ancestry DNA, my results improve. Recently I have had success in Australia and Newfoundland; the latter led to the match finding an unknown cousin and myself discovering that a friend was a distant cousin. Unfortunately not everyone bothers to answer messages left when they're contacted initially.

Debbie Kennett said...

Rhoshundred, A lot of the autosomal DNA matches, especially on the smaller segments are very distant. Don't forget that GedMatch is using unphased data. This can introduce false positive matches. Unfortunately there are a lot of people who don't answer messages. However, I have sometimes had responses months or even years later.

Unknown said...

Debbie, where on the Ancestry site are the tables of confidence scores? I don't see them in the DNA Circles white paper and I can't find them in the blog. I wish Ancestry would post all its information in one place so that it was easily found.

Thanks,

Barton

Debbie Kennett said...

Barton

You'll find the confidence scores by logging into your account and clicking through to see your list of matches. In the top right-hand corner of the match page click on the small question mark and that opens up the help menu. The confidence scores are in the section on "What does the match confidence score mean?" Ancestry have produced some very good help material but they've done their very best to hide it so that no one ever reads it! It's worth reading through all the content in the help menu. I think they've done a good job of explaining some difficult concepts.