Friday 15 November 2013

A confusion of SNPs

This article is for experienced genetic genealogists and requires a reasonable understanding of SNPs and haplogroups.

The launch of the new Big Y test from Family Tree DNA has brought to light the difficulties in comparing the offerings of the different testing companies. We have a chart in the ISOGG Wiki which compares the various Y-SNP tests on the market, but it is clear that we are not always comparing apples with apples. One of the major difficulties relates to the claims by the companies about the number of Y-SNPs on their chip. A SNP is a change or a mutation in the DNA alphabet at a single position on the Y-chromosome (e.g., a C changing to a T). There are around 59 million base pairs in the Y-chromosome. However, surprising as it might be in this genomic era, there are still large sections of the Y-chromosome that have not yet been explored. Build 37, the current build of the human genome reference sequence, has only mapped out the positions of around 25 million base pairs, less than half of the Y-chromosome.[1] The discovery of new SNPs is therefore limited to the parts of the Y-chromosome that can be sequenced using current technology. These areas represent just over 40% of the Y-chromosome. In theory, therefore, a SNP could be found on any one of the 25 million bases that can be sequenced.
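The "just over 40%" figure follows from simple arithmetic on the round numbers quoted above. A minimal sketch, assuming nothing beyond those two approximate figures (the constant and function names are mine, chosen for illustration):

```python
# Round figures quoted in the article; both are approximations.
Y_TOTAL_BP = 59_000_000    # approximate length of the Y-chromosome
Y_MAPPED_BP = 25_000_000   # approximate positions mapped in Build 37


def fraction_covered(covered_bp: int, total_bp: int = Y_TOTAL_BP) -> float:
    """Return the share of the Y-chromosome represented by covered_bp."""
    return covered_bp / total_bp


mapped_share = fraction_covered(Y_MAPPED_BP)
print(f"Mapped share of the Y-chromosome: {mapped_share:.1%}")
# prints "Mapped share of the Y-chromosome: 42.4%"
```

The same function can be reused for any coverage claim expressed in base pairs, which makes the companies' figures easier to put side by side.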

The exact number of SNPs on the Y-chromosome is not yet known. There is no central resource listing all known SNPs because there is fierce competition and the companies are keen to keep knowledge of the SNPs that they have discovered from their competitors for as long as possible. We therefore have some SNPs that are in the public domain, some unpublished SNPs that are known only to Family Tree DNA/the Genographic Project, some SNPs that are known only to Full Genomes Corporation and some SNPs that are known only to BritainsDNA. To make matters worse, all three companies use different naming systems for their SNPs. Full Genomes SNPs are prefixed by the letters FG, and BritainsDNA SNPs bear the prefix S. I understand from the reports from the Family Tree DNA 2013 Conference that the Genographic Project will be publishing a paper some time in the New Year with the new 2014 Y-SNP tree. It therefore remains to be seen what naming system they will use for their SNPs. There will undoubtedly be considerable overlap in the SNPs offered by the different testing companies, but until they release their data or until we have comparative results available we will not be able to work out which SNPs are equivalent (synonymous), in other words which SNPs occur at the same position but have been given different names by different companies. For example, U106 and S21 are alternative names for a single SNP which defines one of the major branches of the R1b haplogroup.
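Once genome reference positions are available, working out which names are synonymous is essentially a grouping exercise. The sketch below is only an illustration of that idea: the record layout and the function name are my own assumptions, and the coordinates and bases are placeholders, not real positions.

```python
from collections import defaultdict
from typing import NamedTuple


class SnpCall(NamedTuple):
    name: str        # company-specific SNP name
    position: int    # genome reference position (same build for every record)
    ancestral: str   # ancestral base
    derived: str     # derived (mutated) base


def group_synonyms(calls):
    """Group SNP names that describe the same mutation at the same position."""
    by_site = defaultdict(set)
    for call in calls:
        key = (call.position, call.ancestral, call.derived)
        by_site[key].add(call.name)
    # Keep only the sites that carry more than one name.
    return {site: sorted(names) for site, names in by_site.items() if len(names) > 1}


# Placeholder coordinates and bases for illustration only; NOT real positions.
calls = [
    SnpCall("U106", 1_000_001, "C", "T"),
    SnpCall("S21", 1_000_001, "C", "T"),
    SnpCall("FG-EXAMPLE", 2_000_002, "G", "A"),   # hypothetical name
    SnpCall("S-EXAMPLE", 3_000_003, "A", "G"),    # hypothetical name
]

synonyms = group_synonyms(calls)
print(synonyms)
# prints {(1000001, 'C', 'T'): ['S21', 'U106']}
```

The comparison only works if every company reports positions against the same reference build, which is exactly the information that is currently being withheld.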

The problem is well illustrated by the recent developments in R1b-M222, a subclade which predominates in Ireland and Scotland, and is seen in many of the surnames that are associated with the clans reputed to descend from the semi-legendary Irish historical figure Niall of the Nine Hostages.[2] According to the early results from the Chromo 2 testing at BritainsDNA, 27 new SNPs have been discovered downstream of M222.[3] Yet at the Family Tree DNA Conference last weekend Miguel Vilar from the Genographic Project advised that they have identified 22 SNPs below M222. Do any of the Geno 2.0 SNPs correspond with the SNPs found by BritainsDNA? The answer is we simply do not know. Neither company releases the full raw data that will allow the participant to determine the genome reference position of the SNPs for which he has tested positive, so the results from the two companies cannot be compared. Few results are in any case available at present from the Chromo 2 testing. The Genographic Project are presenting the results of their Gathering the Mayo Genes Project at a public event in Castlebar on Sunday, so it may be that further information will be forthcoming then.

So where can we find out about SNPs and their position on the Y-DNA haplotree? By far the most important source is the Y-SNP tree maintained by ISOGG, the International Society of Genetic Genealogy. The tree was launched on 10th April 2006. By the end of that year there were 436 SNPs on the tree. By September 2013 there were 3610 SNPs on the ISOGG tree. According to Roberta Estes' report from Day 2 of the FTDNA conference the new 2014 Y-SNP tree, which will be published by the Genographic Project in 2014, will have 6200 SNPs and 1000 branches.[4] This is close to double the size of the existing tree and will represent a significant workload for the team of volunteer project administrators who maintain the tree.

However, the ISOGG tree only documents the SNPs whose precise location on the Y-haplotree is known, in other words SNPs that define particular branches of the human family tree on the Y-line. There are thousands more known SNPs. For these SNPs we know that a mutation has been found on the Y-chromosome at the position in question but we do not know if it has any phylogenetic significance, that is, if these SNPs define branches on the Y-tree or if they are unique to the individual.

ISOGG have a SNP index that lists not just the SNPs that are on the haplotree but also those which "are or have been under active investigation and consideration for addition to the Y Haplotree." ISOGG further state that the "SNPs listed here are less than 10% of the currently known SNPs". To supplement the SNP index ISOGG member David Reynolds maintains the ISOGG SNP Compendium Spreadsheet. This was last updated about a month ago and contains a list of 47,680 SNPs which have yet to be added to the ISOGG tree and the SNP index. A small minority of these SNPs are alternative names for previously known SNPs that are already on the tree (for example, some S series SNPs correspond with some of the Z series SNPs that have already been placed on the tree). Most of the rest are SNPs whose position on the Y-chromosome is known but where we do not as yet know where they belong on the Y-tree. David Reynolds reported back in September that he had about another 5000 SNPs to process. He is "curating and combining duplicates" as he goes along so it is a time-consuming process.

There are no doubt many more SNPs that are being published in scientific papers and I don't know if anyone in the genetic genealogy community is currently keeping track of these. In one recent paper uploaded to the ArXiv preprint server two Chinese researchers discovered 25,000 new phylogenetically relevant SNPs.[5]

Let's now have a look at the offerings of the various testing companies in the light of these numbers. I'm discussing the companies in chronological order based on the dates when their tests were launched. Some companies offer chip-based SNP tests. These tests can only test for previously known SNPs, but the companies can customise the chips to include their own proprietary SNPs for investigation. The new gold standard tests are those which use next-generation sequencing technology. These have the potential to discover thousands of new SNPs.

The Geno 2.0 test from the Genographic Project
The Geno 2.0 test from the Genographic Project was launched in July 2012 and was the first chip test to come on the market with a comprehensive panel of Y-SNPs. The Genographic Consortium published a paper earlier this year with all the technical details of their new GenoChip.[6] The supplementary data tell us that the Genographic Project started with "a raw SNP candidate database of approximately 27,500 SNPs" though some of these were duplicates. The original target was to produce a chip with 15,000 SNPs but according to the paper the chip includes around 12,000 SNPs. Customers can download a CSV file with a list of the SNPs. There were 12,059 SNPs in the most recent file that I downloaded for one of my project members. The Genographic Project do not currently provide the genome reference positions of the SNPs on their chip, and it seems likely that this information is being withheld pending publication of the 2014 tree.

The Chromo 2 test from BritainsDNA/ScotlandsDNA
The Chromo 2 test from BritainsDNA/ScotlandsDNA was launched in June 2013. It uses a customised Illumina chip which is advertised as "covering over 15,000 Y chromosome markers, carefully selected to be most informative, and as free from duplication as possible". Only a limited number of results have been released from this test so far, but a flood of results is expected in the next couple of weeks. Customers receive an Excel spreadsheet with a list of all the markers that have been tested. In the one spreadsheet that I've seen there was a list of 14,184 SNPs. Of these, 8,682 SNPs had the S prefix. On the current ISOGG 2013 Y-SNP index the S series SNPs stop at S530. In the list of SNPs that I saw there were 8,385 S series SNPs with numbers higher than S530. Many of these SNPs will probably define new branches on the Y-tree but many more could simply be alternative names for currently known SNPs. We do know that the BritainsDNA chip includes SNPs found in the Genomes of the Netherlands Project, and also many SNPs that are likely to be informative for people of British descent. However, BritainsDNA, in common with the Genographic Project, do not publish the genome reference positions of their SNPs. Unless they provide ISOGG with the positions of their SNPs we will have no way of knowing where they fit on the tree and which of their SNPs correspond with those identified by other testing companies.
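Counts like these can be reproduced from any list of marker names. The following is a rough sketch only: it assumes S series names have the simple form of the letter S followed by digits, the function names are hypothetical, and the sample list is made up rather than taken from a real results spreadsheet.

```python
import re

# Assumed naming pattern: "S" followed only by digits (e.g. S21, S530).
S_SERIES = re.compile(r"^S(\d+)$")


def s_series_number(name):
    """Return the numeric part of an S-series SNP name, or None otherwise."""
    match = S_SERIES.match(name.strip())
    return int(match.group(1)) if match else None


def count_s_series(names, last_indexed=530):
    """Count S-series SNPs, and those numbered above last_indexed."""
    numbers = [n for n in map(s_series_number, names) if n is not None]
    above = [n for n in numbers if n > last_indexed]
    return len(numbers), len(above)


# Made-up sample; a real list would be read from the customer's spreadsheet.
sample = ["S21", "S530", "S531", "S8000", "M222", "Z301", "SRY10831"]
total_s, new_s = count_s_series(sample)
print(total_s, new_s)
# prints "4 2"
```

Names with suffixes or other variations would need a looser pattern, so the counts from a real file should be treated as approximate.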

Full Genomes
Full Genomes is a new start-up company which made a quiet entry onto the market some time towards the end of 2012. They only began advertising their services publicly towards the end of March 2013.[7] They currently offer the most comprehensive Y-DNA test on the market, covering about 20 to 25 million base pairs representing around 42% of the Y-chromosome. Full Genomes claim to cover 47,000 of the known SNPs on the ISOGG tree and in the ISOGG SNP Compendium. This is after removing "ambiguous results, and synonyms from consideration".[8] Around 14 million of the base pairs are reported to be within mappable regions. However, their test is also uncovering many new private SNPs which have not as yet been made public, and the number of new SNPs discovered can be expected to rise as more and more people get tested. At present each testee in one of the common haplogroups can probably expect to find between 25 and 40 private high-quality SNPs. Full Genomes make the raw data available in a BAM file so that customers will have access to the genome reference positions and can check the ISOGG tree for alternative SNP names as and when new SNPs are placed on the tree.

The Big Y test from Family Tree DNA
The new Big Y test from Family Tree DNA was launched at the weekend at Family Tree DNA's Conference. I've provided preliminary details in a previous blog post. As this is a new test, no results are yet available, and proper comparisons with the other available tests cannot be done. The FTDNA FAQs tell us that the test covers "at least 10 million base-pairs of reliably mapped positions of non-recombining Y-Chromosome", though the exact number of base pairs sequenced has not been disclosed. One conference attendee who spoke to the FTDNA staff was told that "the number of bp [base pairs] analysed will be at least 10 million, but could in some samples go up to 12 million".[9] FTDNA claim that their test provides more coverage "than any Y-DNA test on the market". However, the test is clearly not quite so comprehensive as the Full Genomes test, but it does have the virtue of being considerably cheaper, which will make testing multiple people within a single subclade a feasible proposition. Confusingly, FTDNA claim that the test will cover "nearly 25,000 known SNPs placing you deep on the haplotree". I can only think they've taken their figure of 25,000 known SNPs from the research into the Geno 2.0 chip and that they are seemingly unaware of the ISOGG SNP Compendium which, as discussed above, lists over 47,000 SNPs. If they are covering over 10 million base pairs then they will surely test most of the SNPs in the Compendium. Fortunately FTDNA have confirmed that they will make the raw data in the form of BAM files available to their customers so we will eventually be able to make comparisons.

Is next generation sequencing SNP testing for you?
Next generation sequencing is clearly becoming the gold standard for SNP testing. The Genographic Project have announced that they will be introducing a new test within the next seven to 12 months and I would imagine that their new test will use next generation sequencing. No doubt a rival new NGS test is in the works from BritainsDNA too.

The new next generation sequencing Y-SNP tests do have the potential in the long run to be genealogically relevant. There is supposedly a new SNP roughly every one and a half generations. In other words, if there's no SNP found in a son then there will more than likely be a SNP in the grandson. One day the SNPs will effectively allow us to draw complete trees for Y-lines within a genealogical timeframe. As with any DNA test, a full Y-chromosome SNP test is only useful if you can compare your results with large numbers of other people so that we can work out the chronological order of the more recent SNPs and establish which ones are unique to specific lineages. With the Full Genomes test people in the common haplogroups are reportedly getting between 25 and 40 private SNPs. I imagine the numbers will be pretty similar for the Big Y test from FTDNA.

The numbers of people taking these tests are still relatively small, probably in the hundreds rather than the thousands. Even at $495 a time large-scale testing within a surname project is not going to be a practical proposition. However, if low-hanging SNPs are found that are specific to particular surname lineages then, if these SNPs are added to the a la carte menu, people could test for these single SNPs at $39 a time. STR markers can be used in combination with SNPs to predict who will be positive for which SNP, but ideally you need to be tested to at least 67 markers to make a confident prediction.

The potential problem is that FTDNA are only likely to want to invest money developing single SNPs if there are a reasonable number of people who would be willing to pay for such a test. The more recent the SNPs the fewer people will share them and consequently there will be less chance of the custom SNP tests being developed. FTDNA also only currently have the capacity to offer an additional 2000 custom SNPs. However, they have indicated that they will be re-introducing some form of static deep clade test, probably in the first quarter of 2014, which will be at a much more affordable price. SNPs found in the first phase of the Big Y testing will be candidates for inclusion on these chips so there is possibly some incentive for selected representatives of the various subclades to be tested to ensure that the key new SNPs are included in these tests. Full Genomes have also indicated that they hope to offer single SNPs, and a more economically priced SNP test, but it remains to be seen what they will offer. At the current prices NGS full Y testing is really only for people who wish to contribute to our scientific knowledge and to help delineate all the branches on the Y-tree. No doubt the costs will come down in time. Perhaps in five years or ten years the full Y test will be the norm but we're not there yet.

If you are interested in SNP testing the choice of testing company will be down to the individual and will depend on your budget and your objectives. The ISOGG SNP Testing Chart in the ISOGG Wiki provides a comparison between all the testing companies and is updated as new information becomes available. There will inevitably be new products coming onto the market in the next year with each new test appearing to have a slight advantage over its competitors until the next big thing comes along. I strongly recommend that you join the relevant haplogroup project. The group administrators are all very knowledgeable and will be able to offer good advice. There is a list of Y-DNA haplogroups in the ISOGG Wiki. Most of the projects have associated mailing lists which are currently buzzing with activity, and these will often be the best source of information and commentary.

The SNP tsunami 

The large number of SNPs that will be generated in what has been described as the SNP tsunami will represent a significant challenge for the haplogroup project admins and the citizen scientists who are trying to interpret these data. The new 2014 SNP tree from the Genographic Project, with a mere 6000 or so SNPs, will be something of an irrelevance, and by the time it is published it will be massively out of date, though it will at least lay the foundations for a new nomenclature. The volunteers who maintain the ISOGG tree will have their work cut out to keep up with the new developments. One of the team, David Dowell, has already commented: "It is clear that our processes need to be reorganized and streamlined if we are going to be able to continue to serve the genetic genealogy community and researchers in related disciplines in a timely basis."[10]

It seems likely that the current confusion will prevail for several months. As one poster on the U106 list has commented, the now infamous quote by Donald Rumsfeld is a very good summary of the current SNP situation:

"There are known knowns; there are things we know that we know.
 There are known unknowns; that is to say, there are things that we now know we don't know.
 But there are also unknown unknowns – there are things we do not know we don't know."[11]

There will be confusion, there will be chaos and there will be competition in the coming months, but from this confusion, chaos and competition many important new discoveries will emerge. I predict that as far as Y-chromosome research is concerned 2014 will be the Year of the SNP.

Updates
Vince Tilroe advises in a comment on Roberta Estes' blog that the 1.5 Y-SNPs per generation was based on the hypothetical presumption that "the entire 60 megabases [60 million bases] of the Y-chromosome could be sequenced. This is not the case by any means, and consequently a more realistic expectation should be closer to 1 Y-SNP per every 4 to 6 generations". Preliminary results from the Full Genomes testing suggest that there is around one Y-SNP every 3 to 4 generations.
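Vince Tilroe's correction can be read as a proportional scaling of the full-chromosome rate by the share of the chromosome that is actually sequenced. The sketch below works through that arithmetic. It assumes that mutations are spread evenly along the chromosome, which is a simplification of my own, and it uses the 10 million and 14 million base pair coverage figures mentioned in this article purely as examples.

```python
FULL_Y_BP = 60_000_000     # hypothetical fully sequenced Y-chromosome (from the quote)
SNPS_PER_GEN_FULL = 1.5    # rate quoted for the full 60 megabases


def generations_per_snp(covered_bp, full_bp=FULL_Y_BP, full_rate=SNPS_PER_GEN_FULL):
    """Expected generations between SNPs when only covered_bp is sequenced.

    Assumes mutations are spread evenly along the chromosome.
    """
    rate = full_rate * covered_bp / full_bp   # expected SNPs per generation
    return 1 / rate


for label, covered in [("10 million bp", 10_000_000), ("14 million bp", 14_000_000)]:
    gens = generations_per_snp(covered)
    print(f"{label}: about one SNP every {gens:.1f} generations")
# prints:
# 10 million bp: about one SNP every 4.0 generations
# 14 million bp: about one SNP every 2.9 generations
```

Under that assumption the figures land in the same region as the estimates quoted above, but real results will depend on which regions are covered and how reliably SNPs are called.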

Jim Wilson, the Chief Scientist from BritainsDNA, has provided a list of equivalent SNP names for some of the SNPs on the Chromo 2 chip. He has also advised that in due course he will be sharing the genome co-ordinates to allow comparisons with comprehensive Y-chromosome sequences. See CeCe Moore's blog post A list of alternate names for the Y-SNPs from BritainsDNA's Chromo 2 test for further details.

See also
- A simplified Y-tree and a common standard for Y-DNA haplogroup and SNP nomenclature
- The Y-chromosome sequence interpretation service from YFull
- YSEQ.net - a new company offering a single SNP testing service

References and notes
1. For further information see the ISOGG Wiki article on the Y-chromosome:  www.isogg.org/wiki/Y_chromosome
2. Moore LT, McEvoy B, Cape E et al. A Y-chromosome signature of hegemony in Gaelic Ireland. American Journal of Human Genetics 2006 78(2): 334–338. Note, however, that this study only used 59 low-resolution STR haplotypes, and many people disagree with the conclusions, both in age and origins.
3. Paterson A. Message posted on the DNA R1b1c7 list. 25 October 2013.
4. Estes R. 2013 Family Tree DNA Conference Day 2. DNAeXplained blog, 12 November 2013.
5. Wang C-C, Li H. Discovery of phylogenetic relevant Y-chromosome variants in 1000 Genomes Project data. ArXiv preprint server. Submitted 24 October 2013.
6. Elhaik E, Greenspan E, Staats S et al. The GenoChip: a new tool for genetic anthropology. Genome Biology and Evolution 2013; 5(5): 1021-31.
7. See the thread entitled Full Y chromosome sequencing: Phase III Pilot on the Anthrogenica Forum.
8. Magoon G. Message posted in the R1b-U106 mailing list, 11 November 2013.
9. See the comment thread in the private ISOGG Facebook group at https://www.facebook.com/groups/isogg/permalink/10152015234637922/.
10. Dowell D. ISOGG group gears up for SNP tsunami. Dr D Digs Up Ancestors blog, 13 November 2013.
11. For the background to the quote see the entry for Donald Rumsfeld at Wikiquote: https://en.wikiquote.org/wiki/Donald_Rumsfeld.

© 2013 Debbie Kennett

10 comments:

Kelly said...

Debbie,
An excellent article as always. I am going to link to this in my beginner guide. Just like the controversy about Haplogroup naming conventions I hope that ISOGG can agree to advocate for some standardization before the full tsunami hits. Each lab names their newly discovered SNPs only later to find that they have just rediscovered an already discovered SNP. I would like to put forward that we standardize by NOT using all the names but at least for ISOGG use the earliest registered SNP assignment. It's going to be challenging enough sorting the tree and its branches and branchlets without having 3 or 4 names for each SNP. I totally "get" why the companies may not want to do this but I hope the ISOGG community is loud in its demand for "A standard."
Kelly

Debbie Kennett said...

Thanks Kelly. I think it can be very difficult to work out who discovered which SNP first as they are often found independently at around the same time. I would have thought that with such a flood of data the time has come to stop giving SNPs different prefixes. There ought to be a system whereby new SNPs can be reported to ISOGG who then assign a name. However, it's really up to the ISOGG team and they have some significant challenges ahead of them. There may be all sorts of other issues involved of which we are unaware. Did you see my earlier blog post about the Phylotree team's simplified tree and their attempts to set a common standard:

http://cruwys.blogspot.co.uk/2013/11/a-simplified-y-tree-and-common-standard.html

It would help if everyone would at least agree to use the same names for the main branches of the tree.

Anonymous said...

Your facts about M222 are in need of some correction. Dr. James Wilson has found 27 (not 16) new SNPs under M222 by private sequencing of Scots. None of these match SNPs on Geno 2.0 which is where all the 'new' ones illustrated at the recent talk in Houston can be found. This has all been discussed in great detail on the M222 forum, see in particular this post:
http://archiver.rootsweb.ancestry.com/th/read/DNA-R1B1C7/2013-10/1382696731

plus admittedly so many more your eyes will glaze over :-)

Debbie Kennett said...

Many thanks for pointing this out. I'd been trying to find a reference for the BritainsDNA M222 SNPs and hadn't been able to locate where I'd originally seen the figure. I've now updated my article and included Sandy Patterson's post as a reference. How do we know that the Chromo 2 and Geno 2.0 SNPs don't correspond given that neither company have given out the genome reference positions and the Genographic Project haven't yet made their data public? I believe Geno 2.0 accounts are going to be updated this weekend. Whatever the case this is a major breakthrough for M222, and we can expect to see the picture replicated across all the subclades.

Kelly said...

Debbie, Thanks I had. I have just posted a new Lesson to my Beginners Guide including links to your blog post and that of Roberta Estes. It is aimed at the confused average YDNA tester whose eyes are glazing over at the words Big Y or Complete Y. Happy to include other links or revise. https://sites.google.com/site/wheatonsurname/beginners-guide-to-genetic-genealogy/lesson-15-the-future-y-is-here
Kelly

Debbie Kennett said...

Thanks Kelly. That's an excellent article. I think it's always very helpful to have lots of different perspectives, and especially so when we're entering uncharted territory.

Anonymous said...

FGC $1250 tests 23 mega bp; results for 14 mega bp.

BIG Y $495/$695 tests 18 mega bp; results for 10 mega bp, and may be as high as 12 mega bp for some.

Is it really worth the price difference between BIG Y and the "Comprehensive Y-Chromosome Sequencing" from FGC? Everyone will need to make their own call, but from what I see, FGC is far from a full Y and probably why they are calling it a comprehensive Y in a lot of their marketing. Personally, I do not see 4 mega bp and the other differences worth the price difference. Several bloggers made a good point about the questionable usefulness of having all the new STRs if very few people have them.

I want to wait and see what the lower coverage FGC product is going to offer and cost, but am not sure losing the $200 discount would be worth it.

I agree with a friend who said the smart play is one, two, or all three companies allow transfers. There would be gaps if BritainsDNA did not test a region and transferred BIG Y that had the region or FGC did not test a region Chromo 2 covered.

Debbie Kennett said...

These are very valid points. The difficult part is that people are having to make a decision now before full comparisons are available.

It remains to be seen if it will ever be possible to sequence the full Y-chromosome. Perhaps we will have to wait for third-generation or fourth-generation sequencing technology to become available. Even if we were to sequence the full Y we won't know if the data will be informative.

I would suggest that people hold off ordering Chromo 2 and Geno 2 for now, and wait for the newer tests to come out some time next year which will almost certainly offer better coverage at a better price. The new deep clade tests from FTDNA might well prove to be a more economical alternative for those who already have their STR results.

James K. said...

It would help if people stopped calling Full Genomes "Full Y;" they call it Comprehensive Y-Chromosome Sequencing if you check out the website and order form. I understand they can't test the full Y, but they shouldn't market the test like they are testing the full Y. Being able to map 14 million base pairs out of 23 million base pairs is a far cry from being full.

It sounds like a good comparison of 14 million base pairs at Full Genomes and 10 million base pairs at Family Tree DNA. The big difference is expected STRs from Full Genome's Comprehensive Y. Are those extra STRs worth the much larger price? If I could compare them to a database of other testers, possibly. I haven't tested with Family Tree DNA's 111 marker and it has a good sized database. Do I believe somebody who raised the question of STRs becoming an obsolete measure? I would better offer spending the large amount to have a full genome test done, but I don't have $5,000 to do it.

Debbie Kennett said...

James

I think you're right that people should stick to the proper name and call the test the Comprehensive Y instead of a Full Y. I'm sure that with time more comprehensive tests will become available. STRs are only helpful when you have lots of results to compare them with. Comprehensive Y testing is only ever going to be a niche market at current prices. STRs will still be needed for many years. Even with FTDNA's massive database not that many people have gone up to 111 markers. They don't provide statistics for 111 markers but you can see the figures for the lower testing levels:

http://www.familytreedna.com/why-ftdna.aspx

A 67-marker haplotype can often be used to predict the subclade and you can then just test for the single SNP to confirm the subclade. I don't know but I would have thought the FG and Big Y NGS tests are optimised for the Y-chromosome. Perhaps the results wouldn't be so good if the Y-chromosome were to be thrown in as part of a whole genome test. The Y presents particular problems because of its repetitive nature and the way it recombines with itself. It also crosses over with the X at its tips.