Cruwys news: Y-SNP tree


Friday, 22 August 2014

clarifY DNA - a new Y-SNP analysis service

clarifY DNA is a new Y-DNA analysis service from Chris Morley, a well-respected citizen scientist in the genetic genealogy community who is best known for his Geno 2.0 subclade predictor and his experimental Geno 2.0 trees. The methodology is outlined in his white paper "An experimental computer-generated Y-chromosomal phylogeny, leveraging public Geno 2.0 results and the current ISOGG tree". The new service is a natural development from the Geno 2.0 tool and allows users to receive a computer-generated phylogeny based on next-generation sequencing results. The service is currently restricted to an analysis of Big Y VCF/BED files, but there are plans to add the Full Genomes test (from a text file output) and the Chromo2 test from BritainsDNA in due course. The analysis currently costs $30, which includes the initial analysis and a subscription providing further updates at least until the end of 2014.

It is first of all necessary to register for an account. Once your payment has been approved and you've uploaded your files the automated report can be generated. The reports are manually checked before being uploaded to the website and I understand the turnaround is usually within 24 hours though is often much quicker. Once the report is ready you can download the PDF file from the phylogenetic reports menu.

Here is the tree generated from my dad's Big Y files.

The tree is very clear and easy to understand. It builds on the good work of the ISOGG Y-SNP tree but also provides a more provisional perspective. clarifY DNA communicates which aspects are accepted, which aspects are provisional, and which aspects are most in need of further investigation. The tree is also a vast improvement on the current Family Tree DNA haplotree. The FTDNA tree was produced in partnership with the Genographic Project but the cut-off date was November 2013 and the tree does not include any of the new SNPs identified from testing with Big Y, Full Genomes and Chromo 2. The FTDNA tree still shows my dad's most downstream SNP as Z12 (a branch of R1b-U106), yet he had already tested positive for Z12 prior to taking the Big Y test.

According to the clarifY DNA analysis my dad has 18 private SNPs (all the SNPs highlighted in orange on line 14), which is the same number of private SNPs identified by the U106 project team. For genealogical purposes it is of course these private SNPs which are of the most interest. In the long term, as more people get tested, we should in theory be able to establish precisely where all these private SNPs are positioned on the tree, and we will then have the complete branching process of our Cruwys/Cruse/Cruise tree right down to the last few hundred years.

The report includes some of the technical details about how the algorithm works which I've reproduced here for reference:

The contents of this report were produced by a computer algorithm. This report will be frequently re-generated as more information becomes available. The pilot-scale implementation of this algorithm is able to process a dataset of over 4000 Big Y kits (over 400 real and 3600 simulated) in one run.

clarifY DNA’s automation capabilities analyse large Y-SNP datasets with great speed, great accuracy and great comprehensiveness. These facets are critical for: helping a testing company’s customers make informed SNP-ordering decisions; uniting customers and/or research participants with their most meaningful patrilineal matches; and, overall, scientific progress, customer satisfaction and further growth.

All in all, clarifY DNA’s software is the key to truly realising the “Y Tree” in “Family Tree”.
The phylogenetic algorithm employed here was initially developed in June 2013 for Geno 2.0 data; see http://ytree.morleydna.com/experimental-phylogeny for similar reports (from an earlier version of the phylogenetic algorithm) leveraging public Geno 2.0 data. While this report represents a large advance over existing Y-DNA trees, please treat some aspects of this report as experimental and preliminary; some enhancements specific to next-generation sequencing have not been exhaustively tested, and there are several discrepancies over the definitions of high-level SNPs.

The service also provides the option to contact your closest "genetic neighbours" on your branch of the Y-tree. You can opt to make your kit number and e-mail address available to your neighbours or you can choose to remain anonymous. If you opt not to reveal your email address, your matches can still send you a message, routed through clarifYDNA.com, and it is then up to you to decide whether or not to reply (thereby revealing your email address).

All in all this looks like a very promising new service which provides cutting edge haplogroup analysis in a report which distils the pertinent information into an easy to understand phylogenetic tree. The value of the service will grow as more users contribute their data, and I understand that further enhancements are in the pipeline. clarifY DNA will be of particular benefit to people who have taken the Big Y test but who do not have the advantage of participating in a haplogroup project with administrators and team members who are actively involved in the interpretation and analysis of Big Y results. Even if you have received a detailed analysis from your project admins the service is worthwhile for the clarity of the presentation of the tree which helps to put your results in context.

Disclosure: I was given a complimentary analysis of my dad's Big Y data to enable me to write this review.

Friday, 25 April 2014

The new 2014 Y-DNA haplotree has arrived!

Today saw the launch of Family Tree DNA's new 2014 Y-DNA haplotree which has been created in partnership with National Geographic's Genographic Project. If you've tested with Family Tree DNA you will find the new tree by going to your personal page and clicking on "haplotree and SNPs". Below is a screenshot of the upper portion of the tree for haplogroup R:

The tree is too large to fit into a single screenshot. Here is the relevant portion of the tree for my dad who is R-Z12, a sub-branch of U106.

Note that this is very much an interim tree. It is based on SNPs tested with the Genographic Project's Geno 2.0 chip, and the cut-off date for inclusion of SNPs is November 2013. The new tree does not include all the thousands of new SNPs identified from testing with Big Y, Full Genomes and Chromo 2. The tree will eventually be much more comprehensive but FTDNA are being careful about the data they use from other sources and are insisting that SNPs are only added from published data and raw data that they have personally verified rather than from interpreted data. They have promised that at least one update will be released this year. The FTDNA Learning Center will eventually be updated with information about the new haplotree. If you have questions about a particular SNP that is in the wrong place on the tree or if you spot any other errors you should send an e-mail to the FTDNA help desk with Y-Tree in the subject line.

FTDNA are now recommending SNPs for people to test. I've only had a chance to look briefly at the SNP recommendations for a few project members. It is apparent that in some cases the SNPs that are recommended for testing are not appropriate. SNPs are only recommended if they pass certain percentage thresholds and there might well be a more appropriate downstream SNP that would be more suitable. If you are interested in ordering single SNP testing, make sure you join the appropriate Y-DNA haplogroup project and seek advice from the project administrators. If not, you could end up wasting money ordering unnecessary SNPs.

The following information has been provided by Family Tree DNA.

FAST FACTS
• Created in partnership with National Geographic’s Genographic Project
• Used GenoChip containing ~10,000 previously unclassified Y-SNPs
• Some of those SNPs came from Walk Through the Y and the 1000 Genome Project
• Used first 50,000 high-quality male Geno 2.0 samples
• Verified positions from 2010 YCC by Sanger sequencing additional anonymous samples
• Filled in data on rare haplogroups using later Geno 2.0 samples

Statistics
• Expanded from approximately 400 to over 1200 terminal branches
• Increased from around 850 SNPs to over 6200 SNPs
• Cut-off date for inclusion for most haplogroups was November 2013

Total number of SNPs broken down by haplogroup:
A 406
B 69
BT 8
C 371
CT 64
D 208
DE 16
E 1028
F 90
G 401
H 18
I 455
IJ 29
IJK 2
J 707
K 11
K(xLT) 1
L 129
LT 12
M 17
N 168
NO 16
O 936
P 81
Q 198
R 724
S 5
T 148

myFTDNA Interface
• Existing customers receive free update to predictions and confirmed branches based on existing SNP test results.
• Haplogroup badge updated if new terminal branch is available
• Updated haplotree design displays new SNPs and branches for your haplogroup
• Branch names now listed in shorthand using terminal SNPs
• For SNPs with more than one name, in most cases the original name for SNP was used, with synonymous SNPs listed when you click "More…"
• No longer using SNP names with .1, .2, .3 suffixes. Back-end programming will place SNP in correct haplogroup using available data.
• SNPs recommended for additional testing are pre-populated in the cart for your convenience. Just click to remove those you don’t want to test.
• SNPs recommended for additional testing are based on 37-marker haplogroup origins data where possible, 25- or 12-marker data where 37 markers weren't available.
• Once you've tested additional SNPs, that information will be used to automatically recommend additional SNPs for you if they’re available.
• If you remove those prepopulated SNPs from the cart, but want to re-add them, just refresh your page or close the page and return.
• Only one SNP per branch can be ordered at one time – synonymous SNPs can possibly [be] ordered from the Advanced Orders section on the Upgrade Order page.
• Tests taken have moved to the bottom of the haplogroup page.

Coming attractions
• Group Administrator Pages will have longhand removed.
• At least one update to the tree to be released this year.
• Update will include: data from Big Y, relevant publications, other companies' tests from raw data.
• We'll set up a system for those who have tested with other big data companies to contribute their raw data file to future versions of the tree.
• We're committed to releasing at least one update per year.
• The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks. At that time, all Geno 2.0 participants’ results will be updated accordingly and accessible via the Genographic Project website.

BACKGROUND
Family Tree DNA created the 2014 Y-DNA Haplotree in partnership with the National Geographic Genographic Project using the proprietary GenoChip. Launched publicly in late 2012, the chip tests approximately 10,000 Y-DNA SNPs that had not, at the time, been phylogenetically classified.

The team used the first 50,000 male samples with the highest quality results to determine SNP positions. Using only tests with the highest possible “call rate” meant more available data, since those samples had the highest percentage of SNPs that produced results, or “calls.”
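
The selection by call rate described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory layout (each kit maps SNP names to a genotype, or to None for a no-call) and made-up genotypes; it is not FTDNA's actual pipeline.

```python
from typing import Dict, List, Optional

# Hypothetical layout: SNP name -> genotype letter, or None for a no-call.
Sample = Dict[str, Optional[str]]


def call_rate(sample: Sample) -> float:
    """Fraction of tested SNPs that produced a result (a "call")."""
    if not sample:
        return 0.0
    called = sum(1 for genotype in sample.values() if genotype is not None)
    return called / len(sample)


def top_samples(samples: Dict[str, Sample], n: int) -> List[str]:
    """Return the IDs of the n samples with the highest call rate."""
    ranked = sorted(samples, key=lambda kit: call_rate(samples[kit]), reverse=True)
    return ranked[:n]


# Made-up genotypes, for illustration only.
samples = {
    "kit1": {"M269": "T", "U106": "A", "Z12": None},
    "kit2": {"M269": "T", "U106": "A", "Z12": "G"},
    "kit3": {"M269": None, "U106": None, "Z12": "G"},
}
print(top_samples(samples, 2))  # ['kit2', 'kit1']
```

Ranking on call rate alone keeps the samples with the most usable data, which matches the reasoning given in the text, though the real selection may well have used additional quality criteria.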

In some cases, SNPs that were on the 2010 Y-DNA Haplotree didn’t work well on the GenoChip, so the team used Sanger sequencing on anonymous samples to test those SNPs and to confirm ambiguous locations.

For example, if it wasn’t clear if a clade was a brother (parallel) clade, or a downstream clade, they tested for it.

The scope of the project did not include going farther than SNPs currently on the GenoChip in order to base the tree on the most data available at the time, with the cutoff for inclusion being about November of 2013.

Where data were clearly missing or underrepresented, the team curated additional data from the chip where it was available in later samples. For example, there were very few Haplogroup M samples in the original dataset of 50,000, so to ensure coverage, the team went through eligible Geno 2.0 samples submitted after November, 2013, to pull additional Haplogroup M data. That additional research was not necessary on, for example, the robust Haplogroup R dataset, for which they had a significant number of samples.

Family Tree DNA, again in partnership with the Genographic Project, is committed to releasing at least one update to the tree this year. The next iteration will be more comprehensive, including data from external sources such as known Sanger data, Big Y testing, and publications. If the team gets direct access to raw data from other large companies’ tests, then that information will be included as well. We are also committed to at least one update per year in the future.

Known SNPs will not intentionally be renamed. Their original names will be used since they represent the original discoverers of the SNP. If there are two names, one will be chosen to be displayed and the additional name will be available in the additional data, but the team is taking care not to make synonymous SNPs seem as if they are two separate SNPs. Some examples of that may exist initially, but as more SNPs are vetted, and as the team learns more, those examples will be removed.

In addition, positions or markers within STRs, as they are discovered, or large insertion/deletion events inside homopolymers, potentially may also be curated from additional data because the event cannot accurately be proven. A homopolymer is a sequence of identical bases, such as AAAAAAAAA or TTTTTTTTT. In such cases it’s impossible to tell which of the bases the insertion is, or if/where one was deleted. With technology such as Next Generation Sequencing, trying to get SNPs in regions such as STRs or homopolymers doesn’t make sense because we’re discovering non-ambiguous SNPs that define the same branches, so we can use the non-ambiguous SNPs instead.

Some SNPs from the 2010 tree have been intentionally removed. In some cases, those were SNPs for which the team never saw a positive result, so while it may be a legitimate SNP, even haplogroup defining, it was outside of the current scope of the tree. In other cases, the SNP was found in so many locations that it could cause the orientation of the tree to be drawn in more than one way. If the SNP could legitimately be positioned in more than one haplogroup, the team deemed that SNP to not be haplogroup defining, but rather a highly polymorphic location.

To that end, SNPs no longer have .1, .2, or .3 designations. For example, J-L147.1 is simply J-L147, and I-147.2 is simply I-147. Those SNPs are positioned in the same place, but back-end programming will assign the appropriate haplogroup using other available information such as additional SNPs tested or haplogroup origins listed. If other SNPs have been tested and can unambiguously prove the location of the multi-locus SNP for the sample, then that data is used. If not, matching haplogroup origin information is used.
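
As a rough illustration of the renaming described above, the sketch below strips a .1/.2/.3 suffix and then builds a shorthand label only when the tester's haplogroup is already known from other evidence. The function names and the candidate list are hypothetical; FTDNA's back-end logic has not been published.

```python
import re
from typing import Iterable, Optional

# Matches a trailing ".1", ".2", ... suffix on a SNP name such as "L147.1".
_LOCUS_SUFFIX = re.compile(r"\.\d+$")


def strip_locus_suffix(snp: str) -> str:
    """Turn 'L147.1' into 'L147'; names without a suffix are returned unchanged."""
    return _LOCUS_SUFFIX.sub("", snp)


def shorthand_label(
    snp: str,
    candidate_haplogroups: Iterable[str],
    known_haplogroup: Optional[str],
) -> Optional[str]:
    """Build a label such as 'J-L147' for a SNP that occurs in several haplogroups.

    candidate_haplogroups: haplogroups in which the SNP is known to occur.
    known_haplogroup: the tester's haplogroup from other SNPs or haplogroup
        origins (an assumed input), or None if nothing else is known.
    Returns None when the placement cannot be resolved.
    """
    if known_haplogroup in set(candidate_haplogroups):
        return f"{known_haplogroup}-{strip_locus_suffix(snp)}"
    return None


print(shorthand_label("L147.1", ["J", "I"], "J"))   # J-L147
print(shorthand_label("L147.1", ["J", "I"], None))  # None
```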

We will also move to shorthand haplogroup designations exclusively. Since we’re committing to at least one iteration of the tree per year, using longhand that could change with each update would be too confusing. For example, Haplogroup O used to have three branches: O1, O2, and O3. A SNP was discovered that combined O1 and O2, so they became O1a and O1b.

There are over 1200 branches on the 2014 Y Haplogroup tree, as compared to about 400 on the 2010 tree. Those branches contain over 6200 SNPs, so we’ve chosen to display select SNPs as “active” with an adjacent “More” button to show the synonymous SNPs if you choose.

The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks. At that time, all Geno 2.0 participants’ results will be updated accordingly and will be accessible via the Genographic Project website.

QUOTES
Elliot Greenspan has provided the following quotes in conversation with Janine Cloud, Family Tree DNA's GAP Liaison and Events Co-ordinator:

"I want it to be the most accurate tree it can be, but I also want it to be interesting. That's the key. Historical relevance is what we're to discover. Anthropological relevance. It's not just who has the largest tree, it's who can make the most sense out of what you have [that] is important."

"This year we're committing to launching another tree. This tree will be more comprehensive, utilizing data from external sources: known Sanger data, as well as data such as Big Y, and if we have direct access to the raw data to make the proof (from large companies, such as the Chromo2) or a publication, or something of that nature. That is our intention that it be added into the data."

"We’re definitely committed to update at least once per year. Our intention is to use data from other sources, as well as any SNPs we can, but it must be well-vetted. NGS and SNP technology inherently has errors. You must curate for those errors otherwise you’re just putting slop out to customers. There are some SNPs that may bind to the X chromosome that you didn’t know. There are some low coverages that you didn’t know."

"With technology such as this [next-generation sequencing] you're able to overcome the urge to test only what you’re likely to be positive for, and instead use the shotgun method and test everything. This allows us to make the discovery that SNPs are not nearly as stable as we thought, and they have a larger potential use in that sense."

"Not only does the raw data need to be vetted but it needs to make sense. Using Geno 2.0, I only accepted samples that had the highest call rate, not just because it was the best quality but because it was the most data. I don't want to be looking at data where I'm missing potential information A, or I may become confused by potential information B. That is something that will bog us down. When you’re looking at large data sets, I’d much rather throw out 20% of them because they’re going to take 90% of the time than to do my best to get one extra SNP on the tree or one extra branch modified, that is not worth all of our time and effort. What is, is figuring out what the broader scope of people are, because that is how you break down origins. Figuring one single branch for one group of three people is not truly interesting until it's 50 people, because 50 people is a population. Three people may be a family unit. You have to have enough people to determine relevance. That's why using large datasets and using complete datasets are very, very important."

Update 27 April 2014
A recording of the Family Tree DNA webinar presented by Elise Friedman on the launch of the new 2014 haplotree is now available online and can be accessed here (free registration required).

Related blog posts
- A confusion of SNPs

Wednesday, 23 April 2014

The 2014 Y-DNA haplotree and special offers for DNA day

Family Tree DNA group administrators have received notification that the long-awaited Y-DNA 2014 haplotree is to be launched on Friday 25th April to coincide with DNA Day. The date also coincides with something known in America as National Arbor Day, which I'd never previously heard of but which is rather aptly related to the planting and nurturing of trees, albeit real ones rather than those constructed from DNA. Starting on Friday the 37-marker Y-DNA test will also be on sale for a limited time. Here are the details:

National DNA Day, celebrated on April 25, commemorates the completion of the Human Genome Project in April 2003 and the publication of the discovery of DNA's double helix on April 25, 1953.

Since 1970, the U.S. has observed National Arbor Day, dedicated to the planting and nurturing of trees, on the last Friday in April.

This year National Arbor Day falls on National DNA Day, so what better opportunity for Family Tree DNA to release the long-awaited 2014 Y-DNA Haplotree!

We wanted you, the group administrators who have done so much to contribute to the success of the company, to know before we release the news to the entire Y database and the genetic genealogy community.

In addition to expanding the tree from 400 to 1000 terminal branches, the Haplotree page will have an updated, fresh design.

Our engineering team will begin to push the code that will update the database prior to the official release of the tree, so you'll see some changes in terminal SNPs and haplogroups for those who have done additional testing.

To help with the transition, our Webinar Coordinator, Elise Friedman will host a live webinar on DNA Day for a demonstration of the new tree and more details about this landmark update on Friday, April 25, 2014 @ 12pm Central (5pm UTC).

To register, click here: http://bit.ly/1dGbbbx

A recording of this webinar will be posted to the Webinars page of our Learning Center within 24-48 hours after the live event: https://www.familytreedna.com/learn/ftdna/webinars

***********************************************************************

And because we know you're going to ask...we will have a DNA Day sale that suits the occasion!

Y-DNA SNPs will be 20% off from April 25 - 29. In addition, the Y-DNA 37 test will be 20% off the retail price.

The sale officially begins at 12.01am Houston time on 25th April and ends at 1.59 pm on 29th April. If you are ordering a Y-DNA test make sure you order through a surname project or a geographical project to benefit from the additional project discount. As always I would be very happy to welcome new members to my Cruwys/Cruse/Cruise/Crew(es) DNA Project and my Devon DNA Project.

Thomas Krahn's company YSEQ has also announced a price reduction. Single SNPs are reduced to $25 with immediate effect through until Father's Day on 15th June 2014. For further details about YSEQ see my previous blog post YSEQ.net - a new company offering a single SNP testing service.

Family Tree DNA last updated their Y-DNA haplotree back in 2010. There have been a huge number of changes since then so the new tree will be most welcome. However, with the tsunami of new SNPs now being identified from the Big Y, Full Genomes and Chromo 2 tests, the 2014 tree is already going to be very out of date as soon as it is published. To understand the problem read my previous blog post on a confusion of SNPs. I presume the new tree will also see the full implementation of the shorthand naming system. For example, the format R-Z12 will be used instead of the unwieldy longhand version which, according to the current ISOGG Y-SNP tree, is R1b1a2a1a1c2b2a1a1a1. I would also hope that the new tree will have the facility built in to allow more frequent updates in the future. Let's wait and see what Friday brings. Here's hoping for a smooth transition.

Saturday, 1 March 2014

The BIG Y roll out – the SNP tsunami is on its way!

The genetic genealogy community has been eagerly anticipating the arrival of the so-called SNP tsunami for several months and it now seems that the first waves are starting to appear on the horizon. I was one of a select few genetic genealogists and bloggers who was invited to participate late on Thursday afternoon (UK time) in a private webinar led by Dr David Mittelman, Family Tree DNA’s Chief Scientific Officer, in preparation for the rollout of the first results from FTDNA’s next-generation sequencing BIG Y test.¹ During the webinar we were given a sneak preview of some sample results from the test and we had the opportunity to ask lots of questions. I don't know what it says about me and my enthusiasm for Y-SNP testing but I seemed to be the one asking most of the questions! I am very excited about the implications of comprehensive Y-chromosome sequencing. These tests will not only allow us to define the exact branching within each haplogroup but will also reach right down into genealogical time and will eventually make it possible to delineate recent branches of the Y-line and identify the common ancestor almost down to the exact generation.

Background
There are almost 60 million base pairs in the Y-chromosome but about half of it is full of repeating complexities which have yet to be deciphered. There are only around 20 million or so bases which are good candidates for sequencing.²,³ The BIG Y test was designed to provide the most information at the most affordable price. The intention is also to provide information in the most clear and easy-to-use way.

There seems to have been some confusion about how much of the Y-chromosome is sequenced for the BIG Y test so I asked Dr Mittelman for clarification. He advised that the test sequences around 13.5 million bases on the Y-chromosome and provides results for between 11.5 and 12.5 million positions. It is not possible to give a precise figure because NGS results vary from person to person. This is an improvement on the spec that was advertised when the pre-sale was announced in November when a figure of 10 million bases was quoted.

When the BIG Y pre-sale was announced the coverage was advertised as 60x (the number refers to the number of times the Illumina machines read the sequence – the more reads the better). The information on the BIG Y FAQ page has since been updated and the coverage is now being advertised as “55x to 80x average coverage”.
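
To put the coverage figure in context, average coverage is conventionally the total number of sequenced bases that align to the target divided by the size of the target. The sketch below applies that definition to the 13.5 million bases quoted above; the read count and read length are invented purely to make the arithmetic come out at 60x and are not FTDNA figures.

```python
def average_coverage(read_count: int, read_length: int, target_bases: int) -> float:
    """Average depth = total aligned bases / number of bases in the target region."""
    return read_count * read_length / target_bases


# Assumed inputs: 8.1 million aligned reads of 100 bases each (hypothetical),
# spread over the roughly 13.5 million targeted bases mentioned in the text.
print(average_coverage(8_100_000, 100, 13_500_000))  # 60.0
```

An average says nothing about how evenly the reads are spread, which is one reason the number of positions with usable results varies from person to person.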

The roll out
The BIG Y tests have been processed in the order in which they have been received, but some people had to supply new DNA samples so their tests will take longer. The first 100 results were released on Thursday 27th February, and there will be a gradual roll out of results running through to the end of March. We had been expecting all the BIG Y results to be released on the same day but it now appears that the anticipated tsunami will be more of a steady trickle of waves – a slow-motion tsunami⁴ – rather than one giant flood of data. The following message is now being displayed on the personal pages of people who are awaiting their results:

"We expect that all samples ordered during the initial sale (last November & December) will be delivered by March 28th. We are processing samples in first come first serve order. If a sample doesn't pass quality control, we will place it in the next set of results to be processed as long as we have enough DNA sample. If we require an additional sample, we will send a new test kit and place the new sample in the first set to be processed when it is returned."

My dad is one of the people awaiting his results but I did not place the order until the very end of the pre-sale period so his results will probably be amongst the last to be processed. Along with other people who have ordered the BIG Y test I received an e-mail this morning from Nir Leibovich, FTDNA's Chief Business Officer, apologising for the delay. He advised: "The entire FTDNA team has been working very hard over the last few months with high determination and many late nights. Launching a new product is always a challenge with many moving parts, some more predictable than others. Unfortunately we ran into some surprises beyond our control when one of our suppliers ran out of certain reagents we needed for running the Big Y product... We hope you will let the wonderful product we produced make up for delays that were needed to refine it! We have updated expected results dates on customer pages and will work around the clock to beat them." [Click here to read the full text of the e-mail.]

How many BIG Y tests have been ordered?

I asked if we could be given an idea of the number of BIG Y tests ordered. Although a precise figure was not revealed we were told that there had been "thousands" of orders and that "FTDNA have more Y than anyone else". I know that large numbers of orders have gone through some of the haplogroup projects. There have been 149 orders in the R1b-U106 Project alone and around 340 orders in the R1b-L21 Project. If you have ordered the BIG Y test do make sure you join the relevant haplogroup project so that the very helpful and knowledgeable volunteer admins can help you to understand your results. There is a list of Y-DNA haplogroup projects in the ISOGG Wiki:

www.isogg.org/wiki/Y-DNA_haplogroup_projects.

What is reported
Screenshots of the user interface and explanations of the various features can be seen on the BIG Y page in the FTDNA Learning Center:

www.familytreedna.com/learn/user-guide/other-test-results/big-y-page

FTDNA have a big internal SNP database with details of 36,562 known SNPs. Customers will be given a list of their results for all the SNPs in the database. They will be told whether they are ancestral or derived for each position, whether or not the SNP is on the tree, the genome reference co-ordinates, their genotype (their DNA letters) and the confidence rating.

There are three confidence levels for the SNP calls. High confidence means that all the reads essentially agree. Medium confidence means that the information looks good but it has to be manually curated. Low confidence indicates noisy data.

NGS coverage varies from person to person but it is expected that results will be provided for between 25,000 and 35,000 known SNPs per person. The amount of overlap with the tests from Full Genomes, Geno 2.0 and Chromo 2 is not yet known, but it is expected that the BIG Y will cover 90% of the SNPs in the Geno 2 and Chromo 2 tests. There are a handful of people in the genetic genealogy community who have tested with all four companies. Some people have also taken the Walk Through the Y test, the previous SNP discovery test from FTDNA which utilised Sanger sequencing. Once the BIG Y results have all been released and compared with the other tests the haplogroup project admins will be able to provide better information on the overlap between all the tests.

Customers will also be given a separate list of novel variants. These are defined as variants which differ from the reference sequence and which are not seen in the FTDNA SNP database. Thankfully the genome reference co-ordinates will be provided which will allow comparisons with SNPs identified in tests from other providers (with the exception of BritainsDNA who have not released the co-ordinates for their new S series SNPs [see my update from 4th March below]). Dr Mittelman does not yet know how many novel SNPs to expect per person. There is currently no function to compare novel variants in the database, but the test is very much a work in progress and he is open to suggestions for new ideas.
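
The definition of a novel variant given above amounts to a set difference keyed on reference co-ordinates. Here is a minimal sketch of that idea, using invented co-ordinates and a hypothetical Variant record rather than anything taken from the FTDNA database.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    position: int  # co-ordinate on the Y-chromosome reference sequence
    ref: str       # base in the reference sequence
    alt: str       # base observed in the tester


def split_known_and_novel(
    variants: Iterable[Variant], known: Set[Variant]
) -> Tuple[List[Variant], List[Variant]]:
    """Separate a tester's variants into those already in the database and novel ones."""
    known_hits: List[Variant] = []
    novel: List[Variant] = []
    for variant in variants:
        (known_hits if variant in known else novel).append(variant)
    return known_hits, novel


# Invented positions, for illustration only.
database = {Variant(1_000_100, "G", "A"), Variant(2_000_200, "C", "T")}
tester = [Variant(1_000_100, "G", "A"), Variant(3_000_300, "T", "C")]

known_hits, novel = split_known_and_novel(tester, database)
print(novel)  # [Variant(position=3000300, ref='T', alt='C')]
```

Because the key is the co-ordinate plus the two bases, the same comparison works against a list of SNPs from any other provider that publishes co-ordinates on the same reference sequence.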

Information will not as yet be provided on INDELS (insertions and deletions), but experienced users will be able to extract the information from the raw data.

File formats
Two types of files will be provided: a VCF file and a BED file. These files are not currently available but should be ready for download some time next week.

The VCF (variant call format) file will consist of a list of all the variants identified, tagged by confidence and location. This is essentially a file showing all your differences from the reference sequence. For an explanation of the file format see the paper by Danecek et al (2011).⁵ A sample VCF file can be found in the 1000 Genomes Wiki:

www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
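
For readers who want to look inside a VCF file, here is a minimal Python sketch that reads the fixed tab-separated columns defined by the format (CHROM, POS, ID, REF, ALT, QUAL, FILTER). The sample lines are invented, and how the BIG Y files will encode their confidence levels is not yet known, so the FILTER values shown are only an assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int      # 1-based position on the reference sequence
    snp_id: str   # "." when the variant has no name
    ref: str
    alt: str
    qual: str
    filt: str     # e.g. "PASS"; other values here are made up


def parse_vcf(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Yield one record per data line, skipping '#' header lines and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, snp_id, ref, alt, qual, filt = line.split("\t")[:7]
        yield VcfRecord(chrom, int(pos), snp_id, ref, alt, qual, filt)


# Invented example lines, not real BIG Y output.
SAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrY\t1234567\t.\tG\tA\t99\tPASS\t.",
    "chrY\t7654321\t.\tC\tT\t12\tLowQual\t.",
]

records = list(parse_vcf(SAMPLE_VCF))
for record in records:
    print(record.pos, record.ref, ">", record.alt, record.filt)
```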

The BED file is a text file listing the ranges of positions on the Y-chromosome where information is available and for which it was possible to make confident calls. This file will cover all the positions that passed quality control. A useful guide to BED files can be found here:

http://genome.ucsc.edu/FAQ/FAQformat.html#format1
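
The sketch below shows one way the ranges in a BED file might be used: counting how many positions were covered and checking whether a particular position falls inside a covered range. BED start co-ordinates are 0-based and end co-ordinates are exclusive, whereas VCF positions are 1-based, so the check has to allow for that. The ranges are invented and are assumed not to overlap.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, 0-based start, exclusive end)


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Read the first three BED columns, skipping blank, comment and header lines."""
    intervals: List[Interval] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Total number of covered positions, assuming the ranges do not overlap."""
    return sum(end - start for _, start, end in intervals)


def is_covered(intervals: Iterable[Interval], chrom: str, pos_1based: int) -> bool:
    """True if a 1-based position (as used in VCF files) lies inside any range."""
    return any(c == chrom and start < pos_1based <= end for c, start, end in intervals)


# Invented ranges, for illustration only.
SAMPLE_BED = ["chrY\t100\t200", "chrY\t300\t350"]
intervals = parse_bed(SAMPLE_BED)
print(covered_bases(intervals))            # 150
print(is_covered(intervals, "chrY", 200))  # True
print(is_covered(intervals, "chrY", 201))  # False
```

Read together with the VCF file, this presumably lets a position that is inside a covered range but absent from the list of variants be treated as matching the reference, while a position outside every range is simply a no-call.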

Information about the VCF and BED file formats will be added to the BIG Y Learning Center page in due course.

The raw data files in the form of BAM/FASTQ files will also be made available in due course but a decision needs to be made on the best way to provide the data. I imagine that the data will almost certainly be made available in the cloud, perhaps taking advantage of the new Google Genomics service, or another similar application.

Single SNP testing
The value of a DNA test is in the comparison process and the BIG Y test is no exception. It is hoped that large numbers of new SNPs will be discovered, many of which will be in a genealogical time frame. Ideally a paired testing strategy should be adopted with two very distantly related men from the same subclade taking the test. If novel SNPs are found which identify particular family groups then in theory it should be possible to order single SNPs. Single SNPs can be ordered either direct from Family Tree DNA or from Thomas Krahn’s new company YSEQ. The two companies offer a complementary range of SNPs. Single SNPs cost $35 each from YSEQ and $39 each from FTDNA. However, I suspect that if you are able to identify a SNP in the last two hundred years or so that is only likely to be shared by half a dozen men it will not be cost-effective for any company to offer a single SNP test. Much will also depend on the number of new SNPs identified in a given tree. It might well turn out to be more economical for a surname project to club together and pay for BIG Y tests for project members representing branches of the tree that are of particular interest.

There were some misleading reports emanating from the FTDNA group administrators' conference in Houston last November which suggested that FTDNA had an upper limit of 2000 on the number of new SNPs on offer. Dr Mittelman clarified that there is no limit on the number of new SNPs that can be ordered. There is a limit on the number of SNPs that can be tested at one time on the lab deck and that limit is 2000. FTDNA can in theory calibrate for use as many SNPs as they can order and design but it’s a question of managing the time.

SNP validation
I asked whether it was necessary for SNPs identified through next-generation sequencing to be validated using Sanger sequencing. Dr Mittelman advised that with high-confidence SNPs the data is very clean and validation is not necessary. Sanger sequencing might be needed for medium- and low-confidence calls where there are flags and not a lot of data. He also advised that next-generation sequencing is being used to validate the SNPs on the new Geno chip.

Poznik et al (2013) (supplementary data) did in fact validate their NGS SNPs using Sanger sequencing and found a concordance rate of 99.92% with just one discordant genotype.²

White paper
Dr Mittelman advised that once all the data has been through quality control FTDNA will then produce a white paper which will provide information on some of the technical details of the test. The paper will cover performance metrics, value proposition, etc, and they also hope to look at mutation rates, something which is of great interest to the genetic genealogy community and a subject of considerable debate and disagreement! The paper should be out in the next four to six weeks or so.

The new Y-tree
BIG Y data is currently being released using the now very out-of-date and somewhat irrelevant 2010 Y-tree. Bennett Greenspan, the Chief Executive Officer of Family Tree DNA, advised in the webinar that they have had teams of people working on the new tree in collaboration with the Genographic Project. The new tree will be fully integrated with Geno 2.0. The tree needs to be ready from both the technical point of view and the graphical interface, and it seems that it is the latter which is proving more problematic. The tree is not dependent on the release of a scientific paper. Bennett advised that it might be ready in the “next several weeks”. When the new tree is finally launched, SNPs from the BIG Y will be automatically mapped on the new tree.

Third-party tools
FTDNA want to encourage people to use third party tools to get more out of their results and to come up with new ways to analyse the data. I have previously written about YFULL, a Russian company which provides a very nice Y-chromosome interpretation service. See my review from November 2013. The service is currently free if you agree to let them have your sequence, but it is expected that they will charge a fee at some point. The Full Genomes Corporation have also indicated that they might be able to analyse BIG Y data though no announcement has yet been made. With the increasing availability of Y-chromosome sequencing data no doubt other tools and analytical services will appear in the future.

Additional questions
After the webinar had finished I realised that there were still some questions that I hadn't asked and David Mittelman kindly provided me with some answers by e-mail.

Q: Are there any plans to provide results for Y-STRs?
A: Big Y does span STRs but that was not the intent of the product. So you can go to the VCF files or the raw data and you will see insertions and deletions at STRs, however, we do not plan to add this to the web page. I would much rather recommend our established and proven STR tests.

Q: Does the BIG Y raw data also include the full mtDNA genome?
A: No, it is comprehensive sequencing of the accessible parts of the Y chromosome. We, as you know, offer full mitochondrial sequencing as a separate product.

Q: Will a list of positive SNP results be posted on the Project SNP pages?
A: Yes, if they are on the tree

Preliminary analysis of BIG Y results
The initial results from the first batch of BIG Y tests were producing an unexpectedly high number of novel variants. Vince Tilroe has analysed some of these results and reports as follows on the U106 mailing list:

It looks like many of the novel variants shared by many Big-Y testees may belong to a particular subclade below R-L20, the haplogroup to which the primary source of the anonymous male donors belongs to, whose sequences were used to build the ChrY reference assembly, and many of those may even be exclusively private to him. Greg Magoon had filtered them out from the 1KGP and FGC reports, but YFull had assigned "Y" identifiers to some of them.

I've compared novel variants from six Big-Y returns belonging to haplogroup R-L51 and below, and have so far identified 56 "novel variants" shared between at least two of them so far, but individual samples only had between 43 and 48 of those. This pretty much cuts the typical true novel variant count in half, leaving a count that is more in line to what was expected for this process.

Charles Moore, the U106 admin, has since received confirmation from another group that many of the novel variants are ancestral shared novel SNPs.
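
The kind of comparison Vince describes can be sketched in a few lines of Python: collect the novel variants reported for each kit, count how many kits carry each one, and separate those seen in at least two kits from those that remain private. The kit names and positions below are invented, and this is only an illustration of the idea, not the method he actually used.

```python
from collections import Counter
from typing import Dict, Set


def shared_variants(samples: Dict[str, Set[int]], min_samples: int = 2) -> Set[int]:
    """Return variants reported in at least `min_samples` different kits."""
    counts = Counter()
    for variants in samples.values():
        counts.update(set(variants))  # each kit counts a variant once
    return {v for v, n in counts.items() if n >= min_samples}


def private_variants(samples: Dict[str, Set[int]], shared: Set[int]) -> Dict[str, Set[int]]:
    """Return, per kit, the variants left once the shared ones are removed."""
    return {kit: set(variants) - shared for kit, variants in samples.items()}


# Hypothetical novel-variant positions for three made-up kits.
samples = {
    "kitA": {1001, 1002, 1003},
    "kitB": {1002, 1003, 1004},
    "kitC": {1003, 1005},
}

shared = shared_variants(samples)
private = private_variants(samples, shared)
```

Variants shared across kits from different subclades are candidates for being artefacts of the reference sequence rather than genuinely new branch-defining SNPs, which is why removing them brings the count of true novel variants down.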

Other SNP tests
Full Genomes Corporation is the only other company which currently offers comprehensive Y-chromosome sequencing. Their test is substantially more expensive than the BIG Y but sequences more of the Y-chromosome. When the BIG Y raw data files become available it will be possible to do a comparison of the two tests. For comparisons of the available SNP tests, including the Geno 2.0 and Chromo 2 chip tests, see the SNP testing comparison chart in the ISOGG Wiki.

What are we going to do with all these SNPs?
I wrote in a previous blog post about the confusion of SNPs generated by the various SNP tests offered by the different testing companies. We now have a situation where four companies/organisations (Family Tree DNA/Genographic Project, Full Genomes, BritainsDNA/ScotlandsDNA and YFull) are maintaining their own proprietary SNP databases. There is a great need for an open access independent database of validated SNPs. ISOGG – the International Society of Genetic Genealogy – are probably in the best position to produce such a database, but they also have responsibility for maintaining the Y-SNP tree. The sheer amount of data generated from the next-generation sequencing tests will represent a significant challenge for the volunteer Y-SNP team. I do wonder if the present tree system is actually sustainable and if, in the long run, it might be better to report results as differences from the reference sequence, as is the practice for mitochondrial DNA. Whatever happens, we will have an interesting year ahead of us.

Are you interested in ordering the BIG Y or another SNP test?
My advice for anyone thinking of ordering SNP testing is to be patient and wait for a few months until all the results from the first batches of BIG Y and Full Genomes tests have been analysed and compared. Once this process has been completed we will have a better picture of the new Y-chromosome landscape and the shape of the tree, and it will then be possible to make an informed choice as to which test to purchase. Dr Mittelman advised that there are no immediate plans for another BIG Y sale. At the moment the priority is to bring down the turnaround time for new orders which is currently 8 to 10 weeks.

If you are interested in being involved make sure you join the relevant haplogroup mailing lists and Facebook groups. If you've tested at Family Tree DNA make sure you join the appropriate haplogroup or subclade project. The mailing lists and groups are usually linked from the haplogroup project websites. There is also a list of mailing lists and Facebook groups in the ISOGG Wiki:

www.isogg.org/wiki/Genetic_genealogy_mailing_lists

Further information
There is a set of BIG Y FAQs in the FTDNA Learning Center:

www.familytreedna.com/learn/y-dna-testing/big-y

The BIG Y page in the Learning Center provides screenshots and descriptions of the user interface:

www.familytreedna.com/learn/user-guide/other-test-results/big-y-page

Elise Friedman presented a webinar on 28th February on the subject of "Getting to know BIG Y Results". A recording of the webinar should eventually be made available in the webinar archive in the Learning Center:

www.familytreedna.com/learn/ftdna/webinars

Update 2nd March 2014
The recording of the BIG Y webinar is now available online and can be accessed via this link (free registration required):

https://attendee.gotowebinar.com/recording/4739415541486853122

Update 3rd March 2014
I have put the full text of the letter from Nir Leibovich, in which he apologises for the lack of communication about the expected date of release of BIG Y results, online here. Despite expectations to the contrary, it was never FTDNA's intention to deliver all the results on 28th February. That was the date when the results were expected to start rolling out. It also transpires that there is currently no way for FTDNA to change the expected date on customers' personal pages until the expected date has actually passed.

I've received a number of comments about the problem with reagents which contributed to the delay. Dr David Mittelman has contacted me to clarify the issue:

"We sequence the Y using Illumina HiSeq equipment and we ran out of reagents to do this, and for a period in December and January, Illumina had a back order in place so we could not order more. Illumina filled the orders in the second half of January and we continued our work. Back orders happen and since Illumina is the only game in town, we don’t have other vendors to go to, when Illumina runs out. Of course we are now rolling out samples continuously and each week, in batches. Just like we do for all our products and just like Full Genomes and other companies do."

He adds:

"In the meantime as more batches complete I am confident people will be thrilled with the data. We were able to deliver better specs than I originally promised and... we will not ship subpar results to anyone. Everyone will get great data."

Update 4th March 2014
Dr Jim Wilson of BritainsDNA/ScotlandsDNA has now released a spreadsheet with details of the genome reference co-ordinates for all the Y-SNPs on the Chromo 2 chip. See the following blog post from CeCe Moore for further details and to download the spreadsheet:

- Dr. Jim Wilson and ScotlandsDNA Release Y-SNP Positions for Chromo2

Thomas Krahn has now uploaded the 8000 or so novel markers to Ybrowse. This will allow the genetic genealogy community to cross-check all the new tree branches discovered by Jim Wilson earlier this year. Thomas Krahn has advised that his company YSEQ can design primers for some of the new SNPs as required.

Update 1st April 2014
Although the BIG Y .vcf and .bed files do not include mitochondrial DNA data, it now transpires that mtDNA is included in the BAM files. The mtDNA data can be extracted using third-party tools. For further details see the following blog post from Roberta Estes:

http://dna-explained.com/2014/04/01/mitochondrial-dna-results-from-the-big-y-test

See also Felix Chandrakumar's blog post on the YFull interpretation service which includes a report on the mtDNA data extracted from his BIG Y BAM file:

http://www.fc.id.au/2014/03/yfull-y-chr-sequence-interpretation.html

Update 29th August 2014
Family Tree DNA have published a white paper outlining the methodology used for the test and the analysis.

Footnotes and references
1. For links and resources on next-generation sequencing see the ISOGG Wiki page: www.isogg.org/wiki/Next_generation_sequencing

2. A good description of the Y-chromosome reference sequence is provided by Poznik et al (2013) Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science 2013 341; 6145: 562-565:

The Y-chromosome reference sequence is 59.36 Mb, but this includes a 30-Mb stretch of constitutive heterochromatin on the q arm, a 3-Mb centromere, 2.65-Mb and 330-kb telomeric pseudoautosomal regions (PAR) that recombine with the X chromosome, and eight smaller gaps.

This effectively leaves around 22.98 Mb of “assembled reference sequence”. If you can get hold of the Poznik paper it contains a very nice figure (Figure 1. Callability mask for the Y-chromosome) showing the regions of the Y-chromosome in which reliable genotype calls can be made.
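
The arithmetic behind that figure can be checked with a short Python sketch. The sizes are those quoted from Poznik et al above. The paper's quoted passage does not give a total for the eight smaller gaps, so the value computed for them here is only what the two published figures imply, not a number taken from the paper.

```python
# Sizes in megabases (Mb), as quoted from Poznik et al (2013).
reference_total = 59.36
heterochromatin = 30.0
centromere = 3.0
par_large = 2.65
par_small = 0.33  # 330 kb
assembled_quoted = 22.98

# What is left after removing the four named regions.
after_named_regions = (
    reference_total - heterochromatin - centromere - par_large - par_small
)

# Implied combined size of the eight smaller gaps (an inference, not a quoted figure).
implied_small_gaps = after_named_regions - assembled_quoted

print(f"After named regions: {after_named_regions:.2f} Mb")
print(f"Implied size of smaller gaps: {implied_small_gaps:.2f} Mb")
```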

On a side note, this paper has come in for a lot of criticism, not the least of which is for the authors' mistaken assumption that mitochondrial Eve and Y-chromosomal Adam should be expected to date back to the same time. For a critique of this paper and some useful related diagrams see the three-part series of articles by Melissa Wilson Ayres: Y and mtDNA are not Adam and Eve: Part 1; Y and mtDNA are not Adam and Eve: Part 2 - What it means to be the Most Recent Common Ancestor and Y and mtDNA are not Adam and Eve: Part 3 - Resolving a discrepancy.

3. Further papers of interest are listed on the Y-chromosome page in the ISOGG Wiki: http://www.isogg.org/wiki/Y_chromosome

4. The term "slow-motion tsunami" was coined by Charles Moore, the administrator of the R1b-U106 project: https://groups.yahoo.com/neo/groups/R1b1c_U106-S21/conversations/messages/21323

5. Danecek P, Auton A, Abecasis G et al (2011). The variant call format and VCFtools. Bioinformatics 27 (15): 2156-2158.

© 2014 Debbie Kennett

Friday, 8 November 2013

A simplified Y-tree and a common standard for Y-DNA haplogroup and SNP nomenclature

This article is for experienced genetic genealogists and requires an understanding of SNPs and haplogroups.

A very useful online resource for Y-chromosome researchers in the form of a simplified version of the Y-chromosome SNP tree has come online this week. The new pared-down version of the Y-tree is introduced in a paper by Mannis Van Oven, Anneleen Van Geystelen, Manfred Kayser, Ronny Decorte and Maarten H D Larmuseau entitled Seeing the Wood for the Trees: A Minimal Reference Phylogeny for the Human Y Chromosome. The paper has been accepted for publication in the scientific journal Human Mutation but has yet to go through the full editorial process. Mannis Van Oven's name is already well known to mitochondrial DNA researchers because he maintains the Phylotree website which hosts the definitive mtDNA tree. The simplified Y-tree is conveniently being maintained on the same website and can be found at www.phylotree.org/Y

The new Phylotree version of the Y-tree will serve as a complement to the full Y-SNP tree which is maintained by ISOGG (the International Society of Genetic Genealogy). The Y-tree is now a very complicated structure and is set to become even more detailed in the coming months with the flood of new Y-SNPs that are being discovered from academic projects and through commercial testing with Full Genomes Corp, the Genographic Project (Geno 2.0) and BritainsDNA/ScotlandsDNA (Chromo 2). There will always be a need to have the fine detail of the full high-resolution tree, especially when one is trying to drill right down to the low-hanging branches. However, sometimes it's useful to get an overview of the structure of the tree as a whole without the complication of all the additional sub-branches, twigs and twiglets, and this is something that the new Phylotree Y-tree does very nicely.

I'm very pleased to see that the paper acknowledges the contributions made by the many "independent researchers" within the genetic genealogy community. The resources that the authors used to compile their reference phylogeny included "a large number of websites maintained by independent researchers", all of whom are named in the acknowledgements.

An important innovation in this paper is a very welcome attempt to introduce a much-needed common standard for Y-SNP and Y-haplogroup nomenclature. As the authors explain "Due to multiple independent discovery events, a considerable number of Y-SNPs are known by multiple names". This diversity of names is a source of considerable confusion for both academic researchers and genetic genealogists. For example, haplogroup R1b1a2, the predominant European haplogroup, has two major branches. The markers that define these branches are known as P312 and U106 at the Genographic Project and Family Tree DNA but have the alternative names S116 and S21 at BritainsDNA/ScotlandsDNA. All four of these marker names appear in the scientific literature but the scientists often don't provide the alternative names. ISOGG provides a Y-SNP index which allows the researcher to check for other SNP names but not every researcher will know of this resource. The solution proposed by Van Oven et al is to decide on "one default name depending on which of the aliases is most frequently used in the literature", and these are the names which appear in the Phylotree Y-tree, though the alternative names are given in the accompanying spreadsheet.

It does of course remain to be seen if the scientists and testing companies will adopt the recommended nomenclature for the 417 SNPs included on the simplified Y-tree, but we can certainly hope that they will do so. Most of the names are already in use at Family Tree DNA and within the various FTDNA haplogroup projects. The one SNP name on the tree which will probably cause the most difficulties is R-M529, which is currently better known as L21 and sometimes S145. The name M529 seems to have been chosen because it was cited in an academic paper published in 2011 by Myres et al.¹ However, the name L21 is now so ingrained in the collective genetic genealogy consciousness that I suspect that the proposed new name will probably not catch on. BritainsDNA have always used their own proprietary S series naming system but I hope that they will at least consider adopting the new nomenclature for the core SNPs included on the Phylotree Y-tree so that we can all speak a common language.

In the coming months we can expect an explosion of new Y-SNPs now that the first results have started to come in for the Chromo 2 test from BritainsDNA/ScotlandsDNA and from the full Y-chromosome sequencing tests at Full Genomes. However, the nomenclature will continue to be a big problem as each company tries to maintain a competitive advantage. Full Genomes have already indicated that they will be offering custom single SNPs for sale to compete with FTDNA. We can probably expect to see a flood of FG SNPs being made available in the next few months. The positions of the new FG SNPs on the tree are not yet known so no other companies will be able to offer these new SNPs. So far I've only seen one data file from the BritainsDNA Chromo 2 test. This file contains over 14,000 Y-SNPs, of which around 8000 or more are proprietary S series SNPs, only a tiny percentage of which are listed in the ISOGG Y-SNP index. It may be that many of the BritainsDNA SNPs will turn out to be equivalent to the SNPs that are already on the ISOGG tree or included on the Geno 2.0 chip, and these SNPs will almost certainly be included in the Full Genomes test. However, neither BritainsDNA nor the Genographic Project provide the genome reference positions for the SNPs on their chips so there is currently no way of knowing which S series SNPs are already known about and which ones are new. Fortunately there are many pioneers with deep pockets in the genetic genealogy community who can afford to have their DNA tested at Full Genomes, BritainsDNA and the Genographic Project. With data available for comparison from two or more companies it should then be possible for the volunteer haplogroup project administrators to compare the results and establish the positions of any newly discovered SNPs on the Y-tree.

The other unknown is whether or not Family Tree DNA will be responding to the competition from Full Genomes and BritainsDNA. Their group administrators' conference is taking place this weekend in Houston, Texas, and the conference schedule has now been made available online. Miguel Vilar from the Genographic Project will be providing a Geno 2.0 update and talking about the Y-2014 tree, and Michael Hammer will be talking about the "implications of the 2014 Y-tree". FTDNA usually make a big announcement at the conference and the speculation is that they will perhaps be announcing the launch of a new Geno chip and/or the introduction of a full Y-chromosome test. Spencer Wells has already indicated that a new Geno chip might be on the way as early as 2014.²

Unfortunately, all three currently available Y-SNP tests are very expensive and well beyond the means of the average genetic genealogist. I'm rather hoping that at some point one of the companies will introduce a cheaper Y-SNP test that will allow a customer to have a refined haplogroup designation sufficient to rule out false positive matches but without breaking the bank.

For the moment I would advise anyone considering ordering a Y-SNP test to wait and see what the results are from the tests taken by the early adopters. If you want to join the pioneers and experiment with one of the new SNP tests then you can see a chart comparing the services offered by the main testing companies in the ISOGG Wiki.

With so many exciting new developments I wonder what the Y-chromosome tree will look like in 2014. The ISOGG SNP Index lists all the SNPs that are either on the Y-tree or which are under investigation, but these SNPs represent less than 10% of the known Y-SNPs. David Reynolds maintains the ISOGG Y-SNP Compendium Spreadsheet which currently contains almost 40,000 additional Y-SNPs, and has indicated that he still has over 12,000 SNPs to add, time permitting. The SNPs in this spreadsheet have not all been validated and many are not available for testing at any commercial company. It may well be that the tree will increase in size ten-fold or more in the next twelve months which will represent a significant challenge for the volunteer ISOGG Y-SNP team who maintain the tree in their own free time.

Chris Tyler-Smith cautioned us in February at a special ISOGG presentation at the Sanger Institute in Cambridge that the Y-tree nomenclature system was set to break down in 2013, and indeed that already seems to be the case. He raised the possibility of using an ancestral reference sequence for the Y-chromosome along the lines of the RSRS (Reconstructed Sapiens Reference Sequence) introduced for mitochondrial DNA in 2012.³ I wonder if that is something that we will see implemented in 2014.

Whatever the future has in store it is certainly a very exciting time for Y-chromosome researchers and, as Chris Tyler-Smith commented in February, there will be "more opportunities than ever for computer-literate citizen scientists".

References
1. Myres NM, Rootsi S, Lin AA et al. A major Y-chromosome haplogroup R1b Holocene era founder effect in Central and Western Europe. European Journal of Human Genetics 2011; 19 (1): 95-101.
2. Petrone J. National Geographic considering move to new SNP chip for Genographic Project. GenomeWeb, 13 August 2013.
3. Behar DM, Van Oven M, Rosset S et al. A "Copernican" reassessment of the human mitochondrial DNA tree from its root. American Journal of Human Genetics 2012; 90 (5): 936.

Resources
- The ISOGG Y-DNA SNP testing comparison chart
- A list of Y-DNA haplogroup projects
- BritainsDNA haplogroup nicknames

See also
- A confusion of SNPs

© 2013 Debbie Kennett
