Friday, 25 April 2014

The new 2014 Y-DNA haplotree has arrived!

Today saw the launch of Family Tree DNA's new 2014 Y-DNA haplotree which has been created in partnership with National Geographic's Genographic Project. If you've tested with Family Tree DNA you will find the new tree by going to your personal page and clicking on "haplotree and SNPs". Below is a screenshot of the upper portion of the tree for haplogroup R:


The tree is too large to fit into a single screenshot. Here is the relevant portion of the tree for my dad, who is R-Z12, a sub-branch of U106.


Note that this is very much an interim tree. It is based on SNPs tested with the Genographic Project's Geno 2.0 chip, and the cut-off date for inclusion of SNPs is November 2013. The new tree does not include all the thousands of new SNPs identified from testing with Big Y, Full Genomes and Chromo 2. The tree will eventually be much more comprehensive, but FTDNA are being careful about the data they use from other sources and are insisting that SNPs are only added from published data and raw data that they have personally verified rather than from interpreted data. They have promised that at least one update will be released this year. The FTDNA Learning Center will eventually be updated with information about the new haplotree. If you have questions about a particular SNP that is in the wrong place on the tree, or if you spot any other errors, you should send an e-mail to the FTDNA help desk with Y-Tree in the subject line.

FTDNA are now recommending SNPs for people to test. I've only had a chance to look briefly at the SNP recommendations for a few project members. It is apparent that in some cases the SNPs that are recommended for testing are not appropriate. SNPs are only recommended if they pass certain percentage thresholds and there might well be a more appropriate downstream SNP that would be more suitable. If you are interested in ordering single SNP testing, make sure you join the appropriate Y-DNA haplogroup project and seek advice from the project administrators. If not, you could end up wasting money ordering unnecessary SNPs.

The following information has been provided by Family Tree DNA.

FAST FACTS
• Created in partnership with National Geographic’s Genographic Project
• Used GenoChip containing ~10,000 previously unclassified Y-SNPs
• Some of those SNPs came from Walk Through the Y and the 1000 Genomes Project
• Used first 50,000 high-quality male Geno 2.0 samples
• Verified positions from 2010 YCC by Sanger sequencing additional anonymous samples
• Filled in data on rare haplogroups using later Geno 2.0 samples

Statistics
• Expanded from approximately 400 to over 1200 terminal branches
• Increased from around 850 SNPs to over 6200 SNPs
• Cut-off date for inclusion for most haplogroups was November 2013

Total number of SNPs broken down by haplogroup:
A 406
B 69
BT 8
C 371
CT 64
D 208
DE 16
E 1028
F 90
G 401
H 18
I 455
IJ 29
IJK 2
J 707
K 11
K(xLT) 1
L 129
LT 12
M 17
N 168
NO 16
O 936
P 81
Q 198
R 724
S 5
T 148
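
As a quick sanity check, the per-haplogroup counts above can be tallied to see whether they agree with the "over 6200 SNPs" figure. This is only a minimal sketch: the dictionary name is my own, and the values are simply copied from the list above.

```python
# SNP counts per haplogroup, copied from the list above.
snp_counts = {
    "A": 406, "B": 69, "BT": 8, "C": 371, "CT": 64, "D": 208, "DE": 16,
    "E": 1028, "F": 90, "G": 401, "H": 18, "I": 455, "IJ": 29, "IJK": 2,
    "J": 707, "K": 11, "K(xLT)": 1, "L": 129, "LT": 12, "M": 17, "N": 168,
    "NO": 16, "O": 936, "P": 81, "Q": 198, "R": 724, "S": 5, "T": 148,
}

# Total number of SNPs across all listed haplogroups.
total = sum(snp_counts.values())

# Haplogroup with the most SNPs on the new tree.
largest = max(snp_counts, key=snp_counts.get)

print(total)    # 6318
print(largest)  # E
```

The total of 6318 is consistent with the "over 6200" statistic, and haplogroup E has the largest share.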

myFTDNA Interface
• Existing customers receive free update to predictions and confirmed branches based on existing SNP test results.
• Haplogroup badge updated if new terminal branch is available
• Updated haplotree design displays new SNPs and branches for your haplogroup
• Branch names now listed in shorthand using terminal SNPs
• For SNPs with more than one name, in most cases the original name for SNP was used, with synonymous SNPs listed when you click "More…"
• No longer using SNP names with .1, .2, .3 suffixes. Back-end programming will place SNP in correct haplogroup using available data.
• SNPs recommended for additional testing are pre-populated in the cart for your convenience. Just click to remove those you don’t want to test.
• SNPs recommended for additional testing are based on 37-marker haplogroup origins data where possible, 25- or 12-marker data where 37 markers weren't available.
• Once you've tested additional SNPs, that information will be used to automatically recommend additional SNPs for you if they’re available.
• If you remove those prepopulated SNPs from the cart, but want to re-add them, just refresh your page or close the page and return.
• Only one SNP per branch can be ordered at one time – synonymous SNPs can possibly [be] ordered from the Advanced Orders section on the Upgrade Order page.
• Tests taken have moved to the bottom of the haplogroup page.

Coming attractions
• Group Administrator Pages will have longhand removed.
• At least one update to the tree to be released this year.
• Update will include: data from Big Y, relevant publications, other companies' tests from raw data.
• We'll set up a system for those who have tested with other big data companies to contribute their raw data file to future versions of the tree.
• We're committed to releasing at least one update per year.
• The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks. At that time, all Geno 2.0 participants’ results will be updated accordingly and accessible via the Genographic Project website.

BACKGROUND
Family Tree DNA created the 2014 Y-DNA Haplotree in partnership with the National Geographic Genographic Project using the proprietary GenoChip. Launched publicly in late 2012, the chip tests approximately 10,000 Y-DNA SNPs that had not, at the time, been phylogenetically classified.

The team used the first 50,000 male samples with the highest quality results to determine SNP positions. Using only tests with the highest possible “call rate” meant more available data, since those samples had the highest percentage of SNPs that produced results, or “calls.”
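
To make the idea of a "call rate" concrete, here is a small Python sketch that ranks samples by the fraction of SNPs that produced a result. It is not FTDNA's code: the function names, the data layout and the toy results are all assumptions made for illustration.

```python
def call_rate(calls):
    """Fraction of SNPs in a sample that produced a result (a "call").

    `calls` maps SNP name -> "+", "-", or None for a no-call
    (a hypothetical layout chosen for this sketch).
    """
    if not calls:
        return 0.0
    called = sum(1 for result in calls.values() if result is not None)
    return called / len(calls)


def top_samples(samples, n):
    """Return the ids of the n samples with the highest call rate."""
    ranked = sorted(samples, key=lambda sid: call_rate(samples[sid]), reverse=True)
    return ranked[:n]


# Hypothetical toy data: three samples tested on four SNPs.
samples = {
    "s1": {"M269": "+", "U106": "+", "Z12": None, "P312": "-"},
    "s2": {"M269": "+", "U106": "+", "Z12": "+", "P312": "-"},
    "s3": {"M269": None, "U106": None, "Z12": "+", "P312": "-"},
}

print(top_samples(samples, 2))  # ['s2', 's1']
```

Keeping only the samples with the highest call rate means that each retained sample contributes a result for nearly every SNP, which is the point made in the paragraph above.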

In some cases, SNPs that were on the 2010 Y-DNA Haplotree didn’t work well on the GenoChip, so the team used Sanger sequencing on anonymous samples to test those SNPs and to confirm ambiguous locations.

For example, if it wasn’t clear if a clade was a brother (parallel) clade, or a downstream clade, they tested for it.
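
The brother-versus-downstream question can be sketched in Python as a comparison of which samples are positive for each clade's SNP. This is an illustrative simplification, not the team's actual method: it assumes every sample was tested for both SNPs, and the function name and sample ids are hypothetical.

```python
def relationship(positive_a, positive_b):
    """Classify clade B relative to clade A from sets of positive sample ids."""
    a, b = set(positive_a), set(positive_b)
    if b and b < a:
        # Every B-positive sample is also A-positive, and A has extra samples.
        return "downstream"
    if a and b and a.isdisjoint(b):
        # No sample is positive for both SNPs.
        return "parallel"
    # Identical sets, partial overlap or missing data: more testing is needed.
    return "ambiguous"


# Hypothetical sample ids.
clade_a = {"s1", "s2", "s3", "s4"}
clade_b = {"s2", "s3"}
clade_c = {"s5", "s6"}

print(relationship(clade_a, clade_b))  # downstream
print(relationship(clade_a, clade_c))  # parallel
```

When the result is "ambiguous", for example because too few samples produced a call for one of the SNPs, that is the situation in which additional Sanger sequencing would be needed.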

The scope of the project was limited to SNPs already on the GenoChip, so that the tree would be based on the most data available at the time; the cutoff for inclusion was about November 2013.

Where data were clearly missing or underrepresented, the team curated additional data from the chip where it was available in later samples. For example, there were very few Haplogroup M samples in the original dataset of 50,000, so to ensure coverage, the team went through eligible Geno 2.0 samples submitted after November, 2013, to pull additional Haplogroup M data. That additional research was not necessary on, for example, the robust Haplogroup R dataset, for which they had a significant number of samples.

Family Tree DNA, again in partnership with the Genographic Project, is committed to releasing at least one update to the tree this year. The next iteration will be more comprehensive, including data from external sources such as known Sanger data, Big Y testing, and publications. If the team gets direct access to raw data from other large companies’ tests, then that information will be included as well. We are also committed to at least one update per year in the future.

Known SNPs will not intentionally be renamed. Their original names will be used since they represent the original discoverers of the SNP. If there are two names, one will be chosen to be displayed and the additional name will be available in the additional data, but the team is taking care not to make synonymous SNPs seem as if they are two separate SNPs. Some examples of that may exist initially, but as more SNPs are vetted, and as the team learns more, those examples will be removed.

In addition, positions or markers within STRs, as they are discovered, or large insertion/deletion events inside homopolymers may also be curated from additional data because the event cannot be accurately proven. A homopolymer is a sequence of identical bases, such as AAAAAAAAA or TTTTTTTTT. In such cases it’s impossible to tell which of the bases is the insertion, or if/where one was deleted. With technology such as Next Generation Sequencing, trying to get SNPs in regions such as STRs or homopolymers doesn’t make sense because we’re discovering non-ambiguous SNPs that define the same branches, so we can use the non-ambiguous SNPs instead.

Some SNPs from the 2010 tree have been intentionally removed. In some cases, those were SNPs for which the team never saw a positive result, so while it may be a legitimate SNP, even haplogroup defining, it was outside of the current scope of the tree. In other cases, the SNP was found in so many locations that it could cause the orientation of the tree to be drawn in more than one way. If the SNP could legitimately be positioned in more than one haplogroup, the team deemed that SNP to not be haplogroup defining, but rather a highly polymorphic location.

To that end, SNPs no longer have .1, .2, or .3 designations. For example, J-L147.1 is simply J-L147, and I-L147.2 is simply I-L147. Those SNPs are positioned in the same place, but back-end programming will assign the appropriate haplogroup using other available information such as additional SNPs tested or haplogroup origins listed. If other SNPs have been tested and can unambiguously prove the location of the multi-locus SNP for the sample, then that data is used. If not, matching haplogroup origin information is used.
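
The fallback logic described above might look something like the following Python sketch. FTDNA has not published its back-end code, so everything here is an assumption for illustration: the function names, the `PLACEMENTS` table and its layout are hypothetical, and M304 and M170 are used simply as well-known defining SNPs for haplogroups J and I.

```python
def strip_suffix(snp):
    """Drop a trailing .1/.2/.3 style suffix: 'L147.1' -> 'L147'."""
    base, dot, suffix = snp.rpartition(".")
    return base if dot and suffix.isdigit() else snp


# Hypothetical table: multi-locus SNP -> haplogroup -> SNPs that confirm it.
PLACEMENTS = {
    "L147": {"J": {"M304"}, "I": {"M170"}},
}


def assign_haplogroup(snp, other_positive_snps, origin_haplogroup=None):
    """Place a multi-locus SNP using tested SNPs first, then haplogroup origins."""
    candidates = PLACEMENTS.get(strip_suffix(snp), {})
    # Step 1: other SNPs the sample is positive for, if they point to one haplogroup.
    matches = [hg for hg, markers in candidates.items()
               if markers & set(other_positive_snps)]
    if len(matches) == 1:
        return matches[0]
    # Step 2: otherwise fall back on matching haplogroup origin information.
    if origin_haplogroup in candidates:
        return origin_haplogroup
    return None  # not enough information to place the SNP


print(assign_haplogroup("L147.2", {"M170"}))                   # I
print(assign_haplogroup("L147", set(), origin_haplogroup="J"))  # J
```

The order of the two steps mirrors the text: direct SNP evidence takes priority, and the haplogroup origins of a sample's matches are only consulted when that evidence is missing or inconclusive.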

We will also move to shorthand haplogroup designations exclusively. Since we’re committing to at least one iteration of the tree per year, using longhand that could change with each update would be too confusing.  For example, Haplogroup O used to have three branches: O1, O2, and O3. A SNP was discovered that combined O1 and O2, so they became O1a and O1b.

There are over 1200 branches on the 2014 Y Haplogroup tree, as compared to about 400 on the 2010 tree. Those branches contain over 6200 SNPs, so we’ve chosen to display select SNPs as “active” with an adjacent “More” button to show the synonymous SNPs if you choose.

The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks.  At that time, all Geno 2.0 participants’ results will be updated accordingly and will be accessible via the Genographic Project website.

QUOTES
Elliot Greenspan has provided the following quotes in conversation with Janine Cloud, Family Tree DNA's GAP Liaison and Events Co-ordinator:

"I want it to be the most accurate tree it can be, but I also want it to be interesting. That's the key. Historical relevance is what we're to discover. Anthropological relevance. It's not just who has the largest tree, it's who can make the most sense out of what you have [that] is important."

"This year we're committing to launching another tree. This tree will be more comprehensive, utilizing data from external sources: known Sanger data, as well as data such as Big Y, and if we have direct access to the raw data to make the proof (from large companies, such as the Chromo2) or a publication, or something of that nature. That is our intention that it be added into the data."

"We’re definitely committed to update at least once per year. Our intention is to use data from other sources, as well as any SNPs we can, but it must be well-vetted. NGS and SNP technology inherently has errors. You must curate for those errors otherwise you’re just putting slop out to customers. There are some SNPs that may bind to the X chromosome that you didn’t know. There are some low coverages that you didn’t know."

"With technology such as this [next-generation sequencing] you're able to overcome the urge to test only what you’re likely to be positive for, and instead use the shotgun method and test everything. This allows us to make the discovery that SNPs are not nearly as stable as we thought, and they have a larger potential use in that sense."

"Not only does the raw data need to be vetted but it needs to make sense. Using Geno 2.0, I only accepted samples that had the highest call rate, not just because it was the best quality but because it was the most data. I don't want to be looking at data where I'm missing potential information A, or I may become confused by potential information B. That is something that will bog us down. When you’re looking at large data sets, I’d much rather throw out 20% of them because they’re going to take 90% of the time than to do my best to get one extra SNP on the tree or one extra branch modified, that is not worth all of our time and effort. What is, is figuring out what the broader scope of people are, because that is how you break down origins. Figuring one single branch for one group of three people is not truly interesting until it's 50 people, because 50 people is a population. Three people may be a family unit. You have to have enough people to determine relevance. That's why using large datasets and using complete datasets are very, very important."

Update 27 April 2014
A recording of the Family Tree DNA webinar presented by Elise Friedman on the launch of the new 2014 haplotree is now available online and can be accessed here (free registration required).

Related blog posts
- A confusion of SNPs

14 comments:

Brian Swann said...

Did you get pre-information to do this, Debbie? Otherwise it is pretty impressive to capture this in less than 25 minutes after the Webinar finished.

Debbie Kennett said...

Brian, Most of the information I used in the blog post was provided direct by FTDNA. The bloggers all received the information from FTDNA last night, and I just had to refine my introductory paragraphs during the webinar.

Unknown said...

Yes some SNPs being suggested are not only inappropriate but are SNPs you might already have tested for.

Debbie Kennett said...

CJ, I know there is supposed to be a problem with the transfer of Geno 2.0 results. Only the positive SNP results got transferred to FTDNA. I understand they are now supposed to be rectifying the problem and arranging for the negative results to be transferred too.

Mark D said...

Based on all the negative comments I've read on the several forums, it appears FTDNA should have waited until the Big Y results could be included. Another 6-8 months' wait would not have created as much heartburn as this release has. The Haplotree had not been updated for years anyway. FTDNA could also have worked more closely with project administrators in touching up the final product. I'm a member of the DF27 project, administered by some very competent people, and FTDNA did not even include DF27 on the tree. Major oversight!

Anonymous said...

FTDNA should also have included branches for which they have sold tests yielding positive results, instead of removing those results from users' pages. - Bill H

Debbie Kennett said...

Mark, I suspect this will all get sorted out in the long run. It could take a very long time to analyse the Big Y results, so I think it's better that they've made a start, even if the tree is already out of date. I predicted this problem several months ago! At least the groundwork has been laid for the new tree, and I presume now that it is up and running it will be easier to update it in the future. I did hear someone say that the next update might actually happen in June. I do think it would have been best if the haplogroup project admins had been involved. However, at present we don't have the full picture as the dataset of 50,000 Geno 2.0 results that were used has not been made publicly available. Only perhaps around 20% of Genographic testers transfer their results to FTDNA. I presume DF27 is a newish discovery which is why it's not yet on the tree. It should be remembered that this update was a major technical challenge with a database of almost half a million Y-DNA results. There will inevitably be errors, but you often only find the errors when you go ahead and publish and the work is available for review by the genetic genealogy community.

Debbie Kennett said...

Bill, If you have a branch that is not being identified then I think you need to write to the FTDNA help desk and ask them to rectify it.

Lynn David said...

It sure surprised me. Especially when going to the I2a project and seeing people who were previously listed as I2a3 (I-L233) now listed as G1a1b. At least I am still I-L233. Hoping the Big-Y tests do help to clear things up a bit. Strangely enough while still I-L233, I have a further SNP P27.1_2. The admins of the I2a project are now listing those of us who are I-L233 as I2a1b which seems strange because the Geno2/ISOGG tree would have us as I2a1c. Seems a shakeout is still to come. Unless I'm making my usual uneducated guesses!

Debbie Kennett said...

Lynn, I've just raised this question in the ISOGG Facebook group. I suspect this is a convergence problem. I have a project member who is similarly affected. He was previously predicted to be I2a and is now predicted to be G-L201 (G1a1b). FTDNA are now using 37 markers rather than 12 markers to make the haplogroup predictions which probably explains the change. My project member has 37-marker matches with people whose terminal SNP is listed as P37 (G1a1b), which appears to be synonymous with L201, and one person whose terminal SNP is L233 (I2a1b). P37 seems to be one of those SNPs that occurs in multiple haplogroups. I suspect the only way to solve the puzzle is to get some SNP testing, but I would wait and get advice first from the haplogroup project admins.

max said...

the most useful thing everyone can do is to exercise patience - i like everyone else wants to know from which exotic or unusual place i descend - and even when we get the answer the chances are science will later on change the answer - it does with most other things - Max

Debbie Kennett said...

Thanks Max. Those are very wise words. I think everyone is underestimating the complexities of implementing these changes in a massive database of almost half a million Y-DNA records. I'm sure all the glitches will be resolved in good time.

Jeepguy2014 said...

Since the Big-Y my cousin has changed from R-M269 to R-P297. I cannot seem to find this particular hap group. Is this a new one? He also went to a different shorthand name R1b1a1a1a1a3a1 which I can't seem to find either. Anybody know?

Steve

Debbie Kennett said...

Jeepguy

You can locate the position of P297 on the ISOGG haplotree. Here's the link to the haplogroup R section of the tree:

http://www.isogg.org/tree/ISOGG_HapgrpR.html

You can also look up any SNP in the SNP index:

http://www.isogg.org/tree/ISOGG_YDNA_SNP_Index.html

As you'll see from the tree, P297 is upstream of M269 so I don't understand where the longhand name came from. These longhand names are in the process of being changed and replaced with the shorthand names on project websites.

You should make sure your cousin joins the R1b and subclades gateway project:

https://www.familytreedna.com/public/r1b/

They will be able to check his haplogroup prediction and advise on further SNP testing to determine the subclade.