Cruwys news: AncestryDNA's updated matching algorithms

Friday, 6 May 2016

AncestryDNA's updated matching algorithms - a before and after analysis

AncestryDNA rolled out their long-awaited new matching algorithms on Tuesday this week. This message will now greet you when you log into your AncestryDNA account.

Ancestry have provided a number of resources to describe the changes, all of which merit a close reading:

AncestryDNA's cutting edge gets even sharper (accessible from the "learn more" link in the above box)
The science behind a more precise DNA matching algorithm – a blog post by Anna Swayne on the AncestryDNA Tech Roots blog
AncestryDNA Matching White Paper – essential reading if you want all the very technical details about the update.

AncestryDNA have been able to make these improvements because they have such a massive autosomal DNA database. They have now tested nearly two million people. Their scientists have been able to exploit the power of this large database to provide new insights into relatedness and to improve the detection of genealogically relevant IBD segments.

The biggest change is an improvement in the phasing process. Phasing is the process of sorting out the DNA letters – the As, Cs, Ts and Gs – and placing them on the maternal and paternal chromosomes. Phasing is important for ruling out false positive and false negative matches. AncestryDNA are now using a reference panel of more than 300,000 genotypes for their phasing. Previously they were using a "window" system for IBD detection which broke the large segments into too many small pieces. Now they are using a SNP-based system which provides more realistic results with fewer segments. Phasing can be done with reference panels with a high degree of accuracy – the error rate of Ancestry's Underdog phasing engine is less than 1%. The accuracy will increase as the reference panel grows in size.

The matching threshold has also been changed. Two people must now share a minimum of 6 cMs whereas the old threshold was 5 cMs. AncestryDNA have produced a revised table of confidence scores based on a new understanding of the amount of DNA shared between different relations.

Contrast the above scores with the old version of the chart below which, to my mind, was always overly optimistic, especially about the matches on segments under 20 cMs, the vast majority of which are actually shared with very distant cousins. (For more on this subject watch Dr Doug Speed's lecture "Who's your cousin? Using DNA to determine relatedness", which he presented at Who Do You Think You Are? Live this year.)

Comparing matches before and after
I thought it would be an interesting exercise to compare my matches before and after the update. Unlike Family Tree DNA and 23andMe, AncestryDNA do not provide a facility for customers to download their match list. Fortunately Rob Warthen from DNAGedcom has provided a tool known as the DNAGedcom Client, which allows us to download all our data from Ancestry, including details of the shared cM count and the number of shared segments. I downloaded my list of matches on 19th April. I ran the DNAGedcom Client again on 4th May, and I've compared the two datasets to see how many matches I've gained and lost.

Here is a comparison of the number of matches I had before and after the update:

Date	Matches	4th cousins	Distant cousins	Shaky leaf hints	Circles	NADs
4 May	3423	18	3405	1	0	0
19 April	3414	28	3386	1	0	0

There was a marginal increase in the number of matches, but a close analysis of these matches provides a different perspective. I actually lost 1169 (34%) of my matches. However, this is more than made up for by the fact that I have gained 1178 new matches.
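For anyone who wants to repeat this comparison, the lost and gained counts boil down to a set difference between the two downloads. What follows is a minimal Python sketch, not the spreadsheet method used for this post. The column name `match_id` and the tiny sample data are made up for illustration; a real DNAGedcom export will have its own column names.

```python
import csv
import io


def match_ids(csv_text, id_column="match_id"):
    """Return the set of match identifiers found in one exported match list."""
    return {row[id_column] for row in csv.DictReader(io.StringIO(csv_text))}


def compare_match_lists(before_ids, after_ids):
    """Split two sets of match IDs into lost, gained and retained matches."""
    lost = before_ids - after_ids
    gained = after_ids - before_ids
    retained = before_ids & after_ids
    return lost, gained, retained


# Tiny made-up exports standing in for the two downloads (hypothetical data).
before_csv = "match_id,shared_cm\nA,24.5\nB,11.3\nC,5.4\nD,7.9\n"
after_csv = "match_id,shared_cm\nA,25.1\nB,23.8\nE,6.2\n"

before_ids = match_ids(before_csv)
after_ids = match_ids(after_csv)
lost, gained, retained = compare_match_lists(before_ids, after_ids)

# Percentage lost is measured against the size of the earlier list.
pct_lost = 100 * len(lost) / len(before_ids)
print(f"lost {len(lost)} ({pct_lost:.0f}%), gained {len(gained)}, retained {len(retained)}")
```

The same arithmetic applies to the figures above: 3414 - 1169 + 1178 = 3423, and 1169 out of 3414 is roughly 34%.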

This is a breakdown of the size of the segments I share with my matches before and after the update:

Date	< 6 cMs	6-6.99 cMs	7-9.9 cMs	10-14.9 cMs	>15 cMs	Matches
4 May	0	1518	1456	381	68	3423
19 April	1737	704	719	214	40	3414
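A breakdown like this can be produced by sorting each match's total shared cM into bands. The sketch below is one way to do it in Python, assuming bands with edges at 6, 7, 10 and 15 cM and assuming that a value sitting exactly on an edge belongs to the higher band. The function names and the sample values are illustrative only.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges of each band in cM; a value equal to an edge falls in the higher band.
EDGES = [6.0, 7.0, 10.0, 15.0]
LABELS = ["under 6", "6 to under 7", "7 to under 10", "10 to under 15", "15 and over"]


def band(shared_cm):
    """Return the label of the band that a shared cM total falls into."""
    return LABELS[bisect_right(EDGES, shared_cm)]


def tally(shared_cm_values):
    """Count how many matches fall into each band, keeping the band order."""
    counts = Counter(band(value) for value in shared_cm_values)
    return {label: counts.get(label, 0) for label in LABELS}


# Illustrative values only, not a real match list.
sample = [5.2, 6.0, 6.99, 7.0, 9.9, 13.0545, 15.0, 117.197]
for label, count in tally(sample).items():
    print(f"{label}: {count}")
```

Running the same tally on the before and after downloads gives one row of counts per date, and the counts in each row should add up to the total number of matches.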

I thought it would be interesting to do a further breakdown of my matches who were predicted to be fourth cousins or closer. Note that what AncestryDNA describe as a fourth cousin can in fact be anything from a fourth to a sixth cousin.

Relationship before	Relationship after	cMs before	cMs after	Segments before	Segments after
3rd cousin	3rd cousin	109.71	117.197	8	5
3rd cousin	3rd cousin	84.358	98.3169	4	4
4th cousin	4th cousin	52.361	61.0445	4	3
4th cousin	4th cousin	23.947	30.4435	4	3
4th cousin	4th cousin	23.695	29.4889	1	1
4th cousin	4th cousin	21.776	27.2691	1	1
4th cousin	4th cousin	24.098	25.2915	2	2
4th cousin	4th cousin	22.325	24.4259	1	1
Distant cousin	4th cousin	11.364	23.8771	3	2
4th cousin	4th cousin	20.065	23.8634	1	1
4th cousin	4th cousin	20.704	23.1975	2	2
4th cousin	4th cousin	17.159	22.9076	2	2
Distant cousin	4th cousin	13.204	22.8962	2	1
Distant cousin	4th cousin	13.653	22.2734	3	2
4th cousin	4th cousin	18.581	21.1956	1	1
4th cousin	4th cousin	22.604	20.9644	1	1
4th cousin	4th cousin	20.896	20.8483	2	1
4th cousin	4th cousin	18.033	20.6506	1	1
4th cousin	Distant cousin	24.068	13.0545	1	1
4th cousin	Distant cousin	19.149	19.3702	2	1
4th cousin	Distant cousin	18.629	18.2796	1	1
4th cousin	Distant cousin	18.474	18.6107	1	1
4th cousin	None	24.538		1
4th cousin	None	23.731		1
4th cousin	None	23.099		1
4th cousin	None	21.978		1
4th cousin	None	21.285		1
4th cousin	None	21.093		1
4th cousin	None	20.878		1
4th cousin	None	20.172		1
4th cousin	None	19.937		1
4th cousin	None	18.071		1

As can be seen, for the matches that have been retained there has been a marginal increase in the cM count. Four matches have been downgraded from fourth cousins to distant cousins. Ten of my previous fourth cousins (35%) have disappeared from my match list completely. It may be that these matches were filtered out because of the improved phasing. Another possibility is that these segments were in SNP-poor regions. Ancestry explain in their white paper that matches in these regions are unreliable. To counteract this problem they "discount these matches by reducing their total length (in cM)". These matches are no great loss. All these fourth cousins were in America and it was impossible to find any sort of genealogical relationship despite the fact that some of these matches had huge and very detailed trees. I'd rather suspected that these matches must be very distant, if they were legitimate at all. I already have far more matches than I know what to do with and I can still only find the genealogical connection with two of my matches at AncestryDNA. I would much rather have fewer and more accurate matches.
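The categories used in this section (retained, downgraded, disappeared) can also be assigned automatically once the two downloads are joined on the match. Here is a rough Python sketch using five rows taken from the table above. The labels "upgraded", "unchanged" and "dropped" and the ranking of relationships are my own shorthand for the purposes of the example, not Ancestry's terminology.

```python
from collections import Counter

# Higher rank means a closer predicted relationship (assumed ordering).
RANK = {"Distant cousin": 1, "4th cousin": 2, "3rd cousin": 3}


def classify(before, after):
    """Describe how one match's predicted relationship changed."""
    if after is None:
        return "dropped"
    if RANK[after] > RANK[before]:
        return "upgraded"
    if RANK[after] < RANK[before]:
        return "downgraded"
    return "unchanged"


# (relationship before, relationship after, cM before, cM after)
rows = [
    ("3rd cousin", "3rd cousin", 109.71, 117.197),
    ("4th cousin", "4th cousin", 22.604, 20.9644),
    ("Distant cousin", "4th cousin", 11.364, 23.8771),
    ("4th cousin", "Distant cousin", 24.068, 13.0545),
    ("4th cousin", None, 24.538, None),
]

summary = Counter(classify(before, after) for before, after, _, _ in rows)
print(dict(summary))

# Change in shared cM for the matches that survived the update.
for before, after, cm_before, cm_after in rows:
    if cm_after is not None:
        print(f"{before} -> {after}: {cm_after - cm_before:+.2f} cM")
```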

Conclusion
It's important to remember that we are all pioneers in this field, the tests are in their infancy and we still have much to learn.

At Family Tree DNA, 23andMe and GedMatch we are used to working with unphased data which can produce false positive matches, particularly on smaller segments under 15 cMs. Ancestry are the only company who filter out the high-frequency matches which are not of genealogical relevance, though 23andMe do screen out some matches in known problem areas. IBD segments with high rates of matching are likely to be less useful for detecting relationships in a recent genealogical timeframe.

Without phasing and without frequency filters it is much easier for people to find false coincidental matches, but we all need to be very careful about jumping to conclusions, especially with more distant relationships, where it is so much more difficult to detect recent IBD with the currently available tests.
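To see why unphased data can throw up coincidental matches, consider a toy example in Python. It is a deliberately simplified illustration and not the algorithm used by any of the companies: each person is just a pair of short made-up haplotypes, an "unphased" match only requires the two people to share at least one letter at every position, and a "phased" match requires one whole haplotype to be identical.

```python
def genotypes(person):
    """Collapse a person's two haplotypes into unordered letter pairs per SNP."""
    hap1, hap2 = person
    return [frozenset(pair) for pair in zip(hap1, hap2)]


def unphased_match(person_a, person_b):
    """True if the two people share at least one letter at every SNP."""
    return all(ga & gb for ga, gb in zip(genotypes(person_a), genotypes(person_b)))


def phased_match(person_a, person_b):
    """True if one complete haplotype of A is identical to one of B."""
    return any(hap_a == hap_b for hap_a in person_a for hap_b in person_b)


# Made-up haplotypes (maternal, paternal) over six SNPs.
person_a = ("AAAAAA", "GGGGGG")
person_b = ("AGAGAG", "GAGAGA")  # shares no whole haplotype with person_a
person_c = ("AAAAAA", "GAGGAG")  # carries the same first haplotype as person_a

print(unphased_match(person_a, person_b), phased_match(person_a, person_b))
print(unphased_match(person_a, person_c), phased_match(person_a, person_c))
```

Person B looks like a match to person A on the unphased comparison because the match can zigzag between the maternal and paternal letters, yet neither of B's chromosomes actually matches either of A's. Real matching works with thousands of SNPs, genetic distances in cM and allowances for errors, so this only shows the principle.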

This is the second time that AncestryDNA have updated their algorithms. Family Tree DNA have already changed their algorithms once, which resulted in some lost matches. We should all expect to see further changes to the companies' matching algorithms in the future as they strive to improve the technology and produce more accurate results.

Further reading
Thumbs up; AncestryDNA improves genetic matching technology - a review by Diahan Southard, 9 May 2016.

Acknowledgements
Thanks to Don Worth in the ISOGG Facebook group for sharing his Excel formula for calculating the number of lost matches.

© 2016 Debbie Kennett

8 comments:

Jason said...: There's a lot here that casts AncestryDNA in a very flattering light. I'm not convinced they've earned that.; 6 May 2016 at 21:09
Debbie Kennett said...: AncestryDNA have done a very good job with their update and have written an excellent White Paper. They are doing far more than any of the other companies to detect IBD. What specifically is it that you disagree with?; 6 May 2016 at 21:27
Curmudgeon said...: I'm not sure if I've misunderstood these sentences "I already have far more matches than I know what to do with and I can still only find the genealogical connection with two of my matches at AncestryDNA. I would much rather have fewer and more accurate matches."

I've tested a number of my relatives, daughter, sister, aunt and uncle, some 4th cousins, and all of us have thousands of matches and hundreds of them have identifiable genealogical connections, common ancestors with us. I use the memo field to show the MRCA names and relationships and I'm adding segment size as I review after this latest update. I'll be comparing segment sizes from MRCAs to see if that tells me anything. We just gained new matches to seventh/eighth GGPs for some of us. I pretty much ignore matches with locked or no trees, unless they pop up in a search for certain surnames.; 7 May 2016 at 08:27
Debbie Kennett said...: I'm in the UK and all my ancestry is from the UK. About 95% of my matches are in the US. If my matches do have UK ancestry they generally don't know where in the UK their ancestors are from. They generally just have vague pins for Ireland, England or London. In such circumstances it is impossible to find any genealogical situation, and no amount of research is ever going to change that. The situation will no doubt improve as more people from the UK test.; 7 May 2016 at 11:56
rhoshundred said...: I'm just getting into DNA and find that I'm beginning to get more matches closer to home, as in 10 miles away or at my paternal ancestors' birthplace. These matches unfortunately lead nowhere. I find that if I use Gedmatch with Ancestry DNA, my results improve. Recently I have had success in Australia and Newfoundland, the latter led to the match finding an unknown cousin and myself discovering that a friend was a distant cousin. Unfortunately not everyone bothers to answer messages left when they're contacted initially.; 9 May 2016 at 20:50
Debbie Kennett said...: Rhoshundred, A lot of the autosomal DNA matches, especially on the smaller segments are very distant. Don't forget that GedMatch is using unphased data. This can introduce false positive matches. Unfortunately there are a lot of people who don't answer messages. However, I have sometimes had responses months or even years later.; 10 May 2016 at 00:13
Unknown said...: Debbie, where on the Ancestry site are the tables of confidence scores? I don't see them in the DNA Circles white paper and I can't find them in the blog. I wish Ancestry would post all its information in one place so that it was easily found.

Thanks,

Barton; 11 May 2016 at 17:37
Debbie Kennett said...: Barton

You'll find the confidence scores by logging into your account and clicking through to see your list of matches. In the top right-hand corner of the match page click on the small question mark and that opens up the help menu. The confidence scores are in the section on "What does the match confidence score mean?" Ancestry have produced some very good help material but they've done their very best to hide it so that no one ever reads it! It's worth reading through all the content in the help menu. I think they've done a good job of explaining some difficult concepts.; 11 May 2016 at 17:56
