Tuesday, 1 September 2020

The AncestryDNA matching updates have now been completed

I wrote back in mid-July that AncestryDNA would be updating their matching algorithms to provide information on the length of the longest segment and a more accurate tally of the number of matching segments. AncestryDNA also announced that they would no longer be reporting matches that shared a total of less than 8 cM after the application of the Timber algorithm. These changes were rolled out gradually in August, with the small matches finally disappearing shortly before midnight last night UK time.

I made a note yesterday afternoon before the small matches disappeared of the number of matches at AncestryDNA for me, my mum and my dad. I've done a before and after comparison along with a comparison of the number of matches, where available, at the other testing companies. The number of 4th cousin or closer matches at AncestryDNA remains unchanged.


I've lost 66% of my matches at AncestryDNA but in reality this is no great loss as so many of these small matches are false matches which don't match either of my parents. Even when the person does match one of my parents I often find that the documentary link is on the wrong side, for example, the person has a DNA match with my dad but I've identified a genealogical link on my mum's side. Even if these small matches are valid, they are far more likely to trace back 10, 20 or 30 generations rather than fall within a useful genealogical timeframe. There are currently no tools which can determine the age of a single segment match and tell us whether we are matching a fifth cousin rather than a twentieth cousin. It's impossible to work with such small DNA matches when probably 95% or more of them are either false matches or very old matches. With whole genome sequencing we will probably have the ability to make these distinctions but that is currently a long way off.  

AncestryDNA previously set a much lower threshold for matching than 23andMe, FamilyTreeDNA and MyHeritage so this update now brings them more into line with the other companies. Ancestry have by far the largest database with over 18 million people tested so it's not surprising that my family have far more matches there than at any other company even after the purge. I was surprised to find that my transfer kit at MyHeritage had nearly 2000 more matches than the test I did directly with the company when I ordered their Health and Ancestry test. 23andMe restrict the number of matches to 2000 and this total includes people in the database who have not opted in to relative matching. However, they have just launched a new invite-only subscription Premium Membership which will provide new health reports as well as additional ancestry features such as the ability to view four times more DNA Relatives. For details see this page on the 23andMe website though you will need to be logged into your 23andMe account to view the page. If this trial is successful we may well see other companies offering access to a more extensive match list for a fee though I suspect that for the vast majority of AncestryDNA users a list of 10,000 or more matches is more than they can realistically handle.  

I've not yet had much time to look at the new information about the number of matching segments and the length of the longest segment. However, if a match only shares a single segment we can now get an idea of how Ancestry's Timber algorithm works because it is applied after the longest segment has been identified. Timber has the effect of downweighting regions where there are large numbers of matches. Matches are only likely to be genealogically relevant if they fall in a region which is shared with just a few cousins in your family rather than being shared with large numbers of people in the general population. Timber is only applied to matches sharing 90 cM or less. For full details see the updated AncestryDNA Matching White Paper.

Unless you're from an endogamous population you'll probably find that Timber has had little or no effect on most of your matches. For my match below, the longest segment size is identical to the total cM shared.
In other cases I am finding minor discrepancies in the matches, sometimes of just one or two cM or, as in the case below, a small reduction of just 0.1 cM.
However, I have found one single segment match where there was a sizeable discrepancy.
This match lives in Canada and has ancestry from Scotland, Wales, Norway and Newfoundland. I can see that the match is on my maternal side. For my mum the match has been similarly reduced in size from 56 cM to 38 cM. My mum has no known ancestry from Scotland, Wales or Norway and I'm not aware of any maternal ancestors or relatives who emigrated to Newfoundland. It seems unlikely that I will be able to document a connection and the fact that the match has been so drastically reduced is probably a red flag that this match should be treated with caution. I've clicked through to look at quite a few more matches but have not found any others with quite such a big discrepancy though I've found a few matches where there is a difference of 5 or 10 cM.
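For anyone who wants to run the same check across a list of single-segment matches, the comparison can be sketched in a few lines of Python. This is only an illustration: the function names and the 25% cut-off are assumptions made for the example and are not anything Ancestry publish. It relies on the point made above that, for a single-segment match, the longest segment stands in for the amount shared before Timber was applied.

```python
def timber_reduction(longest_segment_cm: float, total_shared_cm: float) -> float:
    """Fraction by which the reported total falls short of the longest segment.

    Only meaningful for single-segment matches, where the longest segment
    stands in for the amount shared before Timber was applied.
    """
    if longest_segment_cm <= 0:
        raise ValueError("longest segment must be positive")
    shortfall = max(longest_segment_cm - total_shared_cm, 0.0)
    return shortfall / longest_segment_cm


def looks_downweighted(longest_segment_cm: float, total_shared_cm: float,
                       cutoff: float = 0.25) -> bool:
    """True when the reduction is at or above the cutoff (an arbitrary choice)."""
    return timber_reduction(longest_segment_cm, total_shared_cm) >= cutoff


# The maternal match described above: longest segment 56 cM, reported total 38 cM.
print(f"{timber_reduction(56, 38):.0%}")  # 32%
print(looks_downweighted(56, 38))         # True
```

A match reduced by only a fraction of a cM would fall well below any sensible cut-off, so only the handful of heavily reduced matches would be flagged for a closer look.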

It will be interesting to see if people are able to make use of the longest segment data. I think it might be helpful, as in the example above, in highlighting matches that appear to fall into problem areas and which are likely to be less useful for genealogical purposes. See for example this very interesting blog post from Kalani Mondoy where he has shown how useful the longest segment data has been for him to distinguish between his genuine Hawaiian matches and the very distant Maori matches which are indicative of shared ancestry from about a thousand years ago before the two populations split.

I would hope that AncestryDNA will eventually be able to use the longest segment data to refine the matches for people with ancestry from endogamous populations. I have access to a British Ashkenazi Jewish account at Ancestry where the individual previously had 224,377 matches. After the match reduction there are still 169,928 matches remaining. There is clearly great scope for improving the matching for these populations.
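To put those two figures in proportion, here is a quick worked calculation in Python. The function name is made up for the example. The only inputs are the before and after counts quoted above.

```python
def fraction_removed(before: int, after: int) -> float:
    """Share of the original match list that disappeared, as a fraction."""
    if before <= 0:
        raise ValueError("the starting count must be positive")
    return (before - after) / before


# The British Ashkenazi Jewish account: 224,377 matches before, 169,928 after.
print(f"{fraction_removed(224_377, 169_928):.1%}")  # 24.3%
```

So this account lost only about a quarter of its matches, compared with the 66% that disappeared from my own list.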


The reduction in matches at AncestryDNA proved surprisingly controversial with some people, particularly those of African American heritage, arguing passionately for their retention. See, for example, this blog post by Fonte Felipe. However, the reduction has taken place and the decision is not likely to be reversed. We need to focus on what we can do and not what we can't do. How are you making use of the new segment data and the information about the longest segment? What tips do you have for making the most of your AncestryDNA matches? Do let me know what you think.

Update 3rd September 2020
AncestryDNA confirmed today in a conference call that they will soon be showing us the unweighted pre-Timber total cM shared. This will allow us to see at first hand how much of an effect, if any, the Timber algorithm is having on our matches. No exact date has been promised but the information is expected to be added to our accounts in the next two weeks or so.

Further reading

Tuesday, 21 July 2020

Malicious phishing attempt at MyHeritage

Following on from the recent security breach at GEDmatch, there has now been a malicious phishing attempt at MyHeritage which is possibly linked to the GEDmatch breach. Thanks to the prompt actions by MyHeritage staff the threat appears to have been averted but make sure you watch out for fake e-mails purporting to come from the company. 

It is quite possible that the other genetic genealogy companies will similarly be targeted for phishing attacks so be alert and look out for any suspicious e-mails and check the reply field to ensure that the e-mail is legitimate.

You can read about the MyHeritage incident in their blog post:
 
Security alert: malicious phishing attempt detected, possibly connected to GEDmatch breach

My post on the GEDmatch security breach has been updated several times since I published it on Sunday so do check back if you want to keep on top of all the developments.



Sunday, 19 July 2020

Major privacy breach at GEDmatch

There has been a major privacy breach at GEDmatch, the third-party genetic genealogy website which has become well known in the last two years because of its use by law enforcement agencies in the US to solve cold cases. A member of the Genetic Genealogy Ireland Facebook group posted a message at lunchtime today (13.38 UK time) to advise that the site had been compromised and that people were receiving what appeared to be fake matches with suspicious e-mail addresses. (This Facebook post has now been deleted.) Some users were reporting that they were receiving unusually large numbers of new matches, all sharing unexpectedly high amounts of DNA. In another group, one user reported receiving over 3000 matches, all of which shared over 700 cM. A match in this range would normally indicate a very close relationship such as a first cousin or closer.

Later on this afternoon (14.54 UK time) a user posted in the Genetic Genealogy Tips and Techniques group on Facebook that all his kits on GEDmatch were now publicly accessible and all marked as available to the police. This included not just standard kits but also phased kits and Lazarus kits, which are by default always marked as research kits and are not normally available for matching. I checked my own account at GEDmatch and found that all my kits had been changed without my consent to allow police access. This included two phased research kits which were never intended to be made public. I initially found that I was unable to change the settings on any of the kits. The site was up and down for a short while this afternoon before I was finally able to log in and restore my preferred access settings.

Since then GEDmatch has been offline with a message that the site is down for maintenance.
Many other people have also reported that their kits have been affected and that the settings have been changed to allow police access without their consent. Graham Coop shared on Twitter this afternoon a screenshot of his accounts showing how they had all been changed to allow police access.


It therefore appears that the entire database has been changed to make all kits available for police access. This also means that the law enforcement kits, which are normally uploaded as research kits so that they do not appear in match lists, have been compromised. Anyone logging onto the website during this period would have seen those kits and might have been able to save a screenshot with the kit numbers. Allowing unauthorised access to law enforcement kits could potentially have serious consequences and could compromise an investigation.

This is clearly a matter of great concern. There are well over 1.2 million profiles on GEDmatch but only around 200,000 kits had been opted in to law enforcement matching. This means that the DNA profiles and e-mail addresses of probably around a million people have been exposed, including all the law enforcement kits. It is unlikely anyone would have been able to do anything with the matches during the period when the website was compromised because so many spurious matches were being produced. It is the exposure of the e-mail addresses and kit numbers which is likely to be of the most concern.

According to a report on the Tech in the City website the original privacy settings were restored before the site was taken down, though I don't know exactly when this happened as it's not clear which timezone the author is reporting from.

As GEDmatch operates in the European Union and has many EU customers, they are obliged to comply with the EU's General Data Protection Regulation (GDPR). Because of the serious nature of this breach it seems likely that they will have to report the matter to the appropriate regulatory authority in the EU. I don't know which authority they have registered with but the Information Commissioner's Office in the UK has information on how such data breaches should be reported. If a company or organisation has not protected the security of its customers then enforcement action can be taken and the company can be fined.

GEDmatch have since advised that they are aware of the issues and are responding. According to a post in the GEDmatch User Group on Facebook GEDmatch are "doing research right now to confirm what is happening. They are leaving the site down until they can clearly confirm what is going on." They are expected to make a formal statement later. It appears that this was an inadvertent update that went wrong. There appears to be no evidence that the site was hacked.

In the meantime it is pointless to speculate about what might have happened and we will need to wait until further information is available. I will update this page if I receive any further news.

Update
Just after publishing this blog post I discovered (22.51 UK time) that GEDmatch is back up and running and my kits all have the correct access levels.

23.09 UK time: The following message has been posted on the GEDmatch Facebook page.

Update 21 July 2020
GEDmatch have announced on their Facebook page that they experienced a security breach on Sunday which was orchestrated through a sophisticated attack on one of their servers via an existing user account. The site was functioning briefly yesterday but reports started coming in late last night that people were once again receiving lots of unexpectedly high matches with a low SNP overlap in their match lists. I was able to briefly log into my account at 1.00 am and found that the kit I checked had lots of matches with users with words like "imputed" and "partial" in the names. My highest match was at the first cousin level with a user from the Chinese company Gese DNA. The site has now been taken down and GEDmatch are working with a cybersecurity company to implement new security measures. Here is a screenshot of the message from GEDmatch. I've removed the contact details from the post but these are available in the full version of the message in the Facebook group.


It is good that GEDmatch are being transparent about the problems and this may turn out for the best in the long run if the security of the database is improved. The site was down for at least three hours and although they say that no data was downloaded in that time it would have been possible to take screenshots of match lists from many different accounts. Once you have a kit number you then essentially have access to that individual's account. It is also a cascading effect because you can click on all the matches of the matches as well. This essentially means that all the kit numbers have been compromised because no one will know which kits were affected. All the kit numbers will need to be changed. Ideally it would be better if GEDmatch did not reveal kit numbers in the match lists. It will be interesting to see what happens but I rather suspect the site will be down for a long time.

Further update 21 July 2020
5.00 pm 
From the GEDmatch Facebook page: "GEDmatch will remain offline for 2 to 3 days as we further enhance security protocols. Thank you for your patience. We apologize for the inconvenience this has caused."

Update 22 July 2020
MyHeritage advised late last night of a security alert involving a malicious phishing attempt that was possibly related to the GEDmatch breach. For full details see the MyHeritage blog post:


The further reading section of this blog post has been updated to include an informative blog post from Leah Larkin explaining why we were seeing the mystery matches at GEDmatch sharing unusually high amounts of DNA. I have also included an official statement from Verogen which was published on their blog on 20th July, a further blog post from Leah Larkin which includes a timeline of the events and an article from Peter Aldhous of Buzzfeed News.
 
An e-mail has been sent out by Verogen to all GEDmatch users informing them of the breach. My e-mail arrived at 8.40 am. It may take time for a bulk e-mail to reach all 1.2 million or more users. If you haven't received the e-mail check your spam folder. I've copied the text below in case you haven't received it.

Dear GEDmatch member,

On the morning of July 19, GEDmatch experienced a security breach orchestrated through a sophisticated attack on one of our servers via an existing user account. We became aware of the situation a short time later and immediately took the site down. As a result of this breach, all user permissions were reset, making all profiles visible to all users. This was the case for approximately 3 hours. During this time, users who did not opt-in for law enforcement matching were available for law enforcement matching, and, conversely, all law enforcement profiles were made visible to GEDmatch users.

On Monday, July 20, as we continued to investigate the incident and work on a permanent solution to safeguard against threats of this nature, we discovered that the site was still vulnerable and made the decision to take the site down until such time that we can be absolutely sure that user data is protected against potential attacks. It was later confirmed that GEDmatch was the target of a second breach in which all user permissions were set to opt-out of law enforcement matching.

We can assure you that your DNA information was not compromised, as GEDmatch does not store raw DNA files on the site. When you upload your data, the information is encoded, and the raw file deleted. This is one of the ways we protect our users’ most sensitive information.

Further, we are working with a leading cybersecurity firm to conduct a comprehensive forensic review and help us implement the best possible security measures. We expect the site will be up within the next day or two.

We have reported the unauthorized access to the appropriate authorities and continue to work toward identifying the individuals responsible for this criminal act.

Today, we were informed that MyHeritage customers who are also GEDmatch users were the target of a phishing scam. Please remember to exercise caution when opening emails and clicking links. Never provide sensitive information via email. If an email seems suspicious, contact the company in question directly through the phone number or email address listed on their website, not via a reply to the suspicious email. You can reach GEDmatch at  xxxx or xxxxx [email address and telephone number removed]. At this time, we have no evidence to suggest the phishing scam is a result of the GEDmatch security breach this week. We are continuing to investigate the incident.

Please be assured that we take these matters very seriously. Our Number 1 responsibility is to protect the data of our users. We know we have not lived up to this responsibility this week, and we are working hard to regain your trust. We apologize for the concern and frustration this situation has caused.

Sincerely,

Brett Williams
CEO, Verogen Inc.

For a French translation of this e-mail see the post in the Facebook group France ADN - Généalogie Génétique (ISOGG).

Update 25th July 2020
There is a notice on the GEDmatch Facebook page suggesting that the site will be back online today though at 11.35 am UK time the site was still down.

The site was restored in the afternoon of 25th July and no further issues have been reported to date.

Tuesday, 14 July 2020

Some updates to AncestryDNA's matching system and a database update


Ancestry announced at a conference call today that there are some changes in the pipeline in terms of how our matches are reported. There will be three main changes:

1) Ancestry will provide a more accurate report on the number of segments shared with your matches. The updated matching algorithm may reduce the estimated number of segments you share with some of your DNA matches. However, it won't change the estimated total amount of shared DNA (measured in centimorgans/cM) or the predicted relationship to your matches.

2) Ancestry will report the length of the largest shared segment. This is particularly important for people who are descended from endogamous populations. Knowing the length of the longest segment you and a DNA match have in common can help determine if you’re actually related. The longer the segment, the more likely you’re related. Segment length is also the easiest way to evaluate the difference between multiple matches that all show the same estimated relationship.

3) The matches will be re-calibrated to remove false matches so that the reported matches are more likely to be related through a recent common ancestor. Once the update is implemented, only matches which share 8 cM or more will be reported. Ancestry estimate that this will remove about two thirds of the false matches. All matches that fall below the new threshold will disappear from your match list with the exception of matches you have messaged, matches where you've added a note and matches you have added to a group by using the system of coloured dots. Starred matches will also be retained as they are considered part of a group. If you save a match below 8 cM, your match will also have it saved without any additional action needed. Any matches sharing less than 8 cM in total will no longer appear as common ancestor hints or in the ThruLines feature and this change may affect the number of ThruLines you see. If you want to save these matches you'll need to make sure you add them to one of your groups or add a note. Note that it is only the total cM shared after the application of the Timber algorithm that is affected so you could still have matches which share some individual segments that are smaller than 8 cM so long as the sum total of all the segments is at least 8 cM.
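The retention rule described in point 3 can be summarised as a small Python sketch. The `Match` class, its field names and the `is_retained` function are all made up for illustration and are not Ancestry's own code or data model. The example matches are invented too.

```python
from dataclasses import dataclass

THRESHOLD_CM = 8.0  # new minimum total shared cM, measured after Timber


@dataclass
class Match:
    total_cm: float         # total shared cM after Timber
    messaged: bool = False  # you have messaged this match
    has_note: bool = False  # you have added a note
    in_group: bool = False  # coloured dot groups, including starred matches


def is_retained(match: Match, threshold: float = THRESHOLD_CM) -> bool:
    """Apply the rule as announced: keep matches at or above the threshold,
    plus any smaller match that has been messaged, noted or grouped."""
    if match.total_cm >= threshold:
        return True
    return match.messaged or match.has_note or match.in_group


# Invented example matches, not real data.
examples = [Match(7.9), Match(7.9, has_note=True), Match(8.0), Match(6.2, in_group=True)]
print([is_retained(m) for m in examples])  # [False, True, True, True]
```

Note that the test is on the total shared cM and not on individual segments, which is why a match made up of several segments under 8 cM can still survive.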

On site messaging will start to appear on the site in the next few days (this messaging is now live) to alert users to the updated matching system and a new matching white paper will be available later this week. (The White paper has now been published and can be accessed here.) We can expect to see the new matching system rolled out in early August.

The increase in the match threshold will mean that many matches will disappear from our match lists. However, in practice, this is not going to have any effect on our genealogical research as these small matches have proved to be so unreliable that they are impossible to work with. The last time I analysed my matches at AncestryDNA and compared them with my parents' match lists I found that 54% of my matches in the 6-7 cM range did not match either of my parents and were therefore probably false positives. (1) Clearly if there is over a 50% chance that a match will be false we cannot reliably assign these matches to a common ancestor, even if we can identify one in our shared family trees. Even if the match is real, the chances are still very low that it will be a reflection of a recent genealogical relationship and it is far more likely to be the result of very distant sharing. (2)
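If you have tested both parents and can export or copy your match lists, the parent and child comparison described above comes down to a simple set calculation. The sketch below is only an illustration in Python. The function name is made up, the match identifiers are invented, and it assumes each match can be recognised by the same identifier in all three lists.

```python
def unmatched_fraction(child: set[str], mother: set[str], father: set[str]) -> float:
    """Fraction of the child's matches that appear in neither parent's list.

    A match that is missing from both parents' lists is probably a false
    positive, since a genuine segment must have come from one parent.
    """
    if not child:
        raise ValueError("the child's match list is empty")
    unmatched = child - (mother | father)
    return len(unmatched) / len(child)


# Invented identifiers for matches in a chosen cM range, not real data.
child_matches = {"A", "B", "C", "D"}
mother_matches = {"A", "X"}
father_matches = {"B", "Y"}
print(f"{unmatched_fraction(child_matches, mother_matches, father_matches):.0%}")  # 50%
```

Restricting the three sets to a single range, such as 6-7 cM, gives a figure for that range alone.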

I currently have over 32,000 matches at AncestryDNA which is far more than I can ever possibly cope with. However, if you really are desperate to go through your matches and check the 6 and 7 cM matches before they disappear you can use the filter under Shared DNA to set a custom cM range to identify these matches.

In other news AncestryDNA's corporate page has been updated to show that they have now tested 18 million people. AncestryDNA now have by far the largest genetic genealogy database in the world. 23andMe is the next largest with a database of 12 million people. MyHeritage have 4 million people in their database, while FamilyTreeDNA have tested over two million people. (3)

The lockdown seems to have encouraged a renewed interest in family history so we can also look forward to receiving many more matches in the months and years to come.

Update 4th August
The update has been delayed and will now be rolled out in stages. You will find full details, including FAQs, when you log into your AncestryDNA account.



Ancestry is now displaying decimal places for all matches sharing under 10 cM. All matches sharing under 8 cM will be removed at the end of August. This includes matches in the 7.5 to 7.9 cM range which were previously rounded up to 8 cM.

Further reading
Footnotes
1. See my blog post Comparing parent and child matches at AncestryDNA from August 2017 for the full details of this analysis.
2. See the ISOGG Wiki page on identity by descent which includes a chart from a 2015 paper by Doug Speed and David Balding providing the distribution of different-sized segments by generation.
3. FamilyTreeDNA do not publish details of the size of their autosomal DNA database. The two million figure for the number of people tested is taken from the FAQs on their home page. In the section headed "Who is FamilyTreeDNA?" they say: "Over 2 million people have tested with FamilyTreeDNA, resulting in the most comprehensive DNA matching database in the industry." FTDNA used to publish daily updates on the number of Y-DNA and mtDNA records in the database on their "Why choose FamilyTreeDNA?" page. However, the figures on this page have not been updated since July 2019. Martin McDowell did an analysis in February 2020 based on FTDNA kit numbers in which he estimated that FTDNA's autosomal DNA database was approaching two million. See the blog post "How big is the FamilyTreeDNA database" on the Genetic Genealogy Ireland website.

Updates
This page was updated on 15 July 2020 to include a third footnote to clarify information about the size of the FamilyTreeDNA database. It was updated on 16 July to include a link to the updated AncestryDNA white paper and a further reading list. It was also updated to clarify that starred matches will not be retained. The page was updated on 17 July to include a link to blog posts from Blaine Bettinger and Leah Larkin. The page was updated on 19 July following the receipt of an e-mail from AncestryDNA which clarified that starred matches would be retained after all and that any matches you save will also be automatically saved on your match's account. Additional information from Ancestry about the changes in the reporting of segments was added to the numbered points 1 and 2. A link to Judy Russell's blog post was added on 28 July.