The genetic genealogy community has been eagerly anticipating the arrival of the so-called SNP tsunami for several months, and it now seems that the first waves are starting to appear on the horizon. I was one of a select few genetic genealogists and bloggers who were invited to participate late on Thursday afternoon (UK time) in a private webinar led by Dr David Mittelman, Family Tree DNA’s Chief Scientific Officer, in preparation for the rollout of the first results from FTDNA’s next-generation sequencing BIG Y test.[1] During the webinar we were given a sneak preview of some sample results from the test and we had the opportunity to ask lots of questions. I don't know what it says about me and my enthusiasm for Y-SNP testing but I seemed to be the one asking most of the questions! I am very excited about the implications of comprehensive Y-chromosome sequencing. These tests will not only allow us to define the exact branching within each haplogroup but will also reach right down into genealogical time and will eventually make it possible to delineate recent branches of the Y-line and identify the common ancestor almost down to the exact generation.
Background
There are almost 60 million base pairs in the Y-chromosome but about half of it is full of repeating complexities which have yet to be deciphered. There are only around 20 million or so bases which are good candidates for sequencing.[2, 3] The BIG Y test was designed to provide the most information at the most affordable price. The intention is also to provide information in the clearest and most easy-to-use way.
There seems to have been some confusion about how much of the Y-chromosome is sequenced for the BIG Y test so I asked Dr Mittelman for clarification. He advised that the test sequences around 13.5 million bases on the Y-chromosome and provides results for between 11.5 and 12.5 million positions. It is not possible to give a precise figure because NGS results vary from person to person. This is an improvement on the spec that was advertised when the pre-sale was announced in November when a figure of 10 million bases was quoted.
When the BIG Y pre-sale was announced the coverage was advertised as 60x (the figure is the average number of times each base is read by the Illumina machines; the more reads the better). The information on the BIG Y FAQ page has since been updated and the coverage is now being advertised as “55x to 80x average coverage”.
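As a back-of-the-envelope illustration of what an average coverage figure means, the Python sketch below applies the standard relationship: coverage = (number of reads × read length) / number of target bases. The 13.5 million base target is the figure Dr Mittelman quoted; the 100-base read length is purely my assumption for illustration and is not an FTDNA specification.

```python
import math


def average_coverage(num_reads: int, read_length: int, target_bases: int) -> float:
    """Average number of times each targeted base is read."""
    return num_reads * read_length / target_bases


def reads_needed(coverage: float, read_length: int, target_bases: int) -> int:
    """Minimum number of reads required to reach a given average coverage."""
    return math.ceil(coverage * target_bases / read_length)


TARGET_BASES = 13_500_000  # bases sequenced, as quoted for the BIG Y
READ_LENGTH = 100          # hypothetical read length, for illustration only

for depth in (55, 60, 80):
    reads = reads_needed(depth, READ_LENGTH, TARGET_BASES)
    print(f"{depth}x average coverage needs about {reads:,} reads")
```

This is an idealised figure. In practice not every read maps cleanly to the target region, so a real sequencing run needs more reads than the calculation suggests, and the depth at any individual position can be well above or below the average.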
The roll out
The BIG Y tests have been processed in the order in which they have been received, but some people had to supply new DNA samples so their tests will take longer. The first 100 results were released on Thursday 27th February, and there will be a gradual rollout of results running through to the end of March. We had been expecting all the BIG Y results to be released on the same day but it now appears that the anticipated tsunami will be more of a steady trickle of waves, a slow-motion tsunami,[4] rather than one giant flood of data. The following message is now being displayed on the personal pages of people who are awaiting their results:
"We expect that all samples ordered during the initial sale (last November & December) will be delivered by March 28th. We are processing samples in first come first serve order. If a sample doesn't pass quality control, we will place it in the next set of results to be processed as long as we have enough DNA sample. If we require an additional sample, we will send a new test kit and place the new sample in the first set to be processed when it is returned."
My dad is one of the people awaiting his results but I did not place the order until the very end of the pre-sale period so his results will probably be amongst the last to be processed. Along with other people who have ordered the BIG Y test I received an e-mail this morning from Nir Leibovich, FTDNA's Chief Business Officer, apologising for the delay. He advised: "The entire FTDNA team has been working very hard over the last few months with high determination and many late nights. Launching a new product is always a challenge with many moving parts, some more predictable than others. Unfortunately we ran into some surprises beyond our control when one of our suppliers ran out of certain reagents we needed for running the Big Y product... We hope you will let the wonderful product we produced make up for delays that were needed to refine it! We have updated expected results dates on customer pages and will work around the clock to beat them." [Click here to read the full text of the e-mail.]
How many BIG Y tests have been ordered?
I asked if we could be given an idea of the number of BIG Y tests ordered. Although a precise figure was not revealed we were told that there had been "thousands" of orders and that "FTDNA have more Y than anyone else". I know that large numbers of orders have gone through some of the haplogroup projects. There have been 149 orders in the R1b-U106 Project alone and around 340 orders in the R1b-L21 Project. If you have ordered the BIG Y test do make sure you join the relevant haplogroup project so that the very helpful and knowledgeable volunteer admins can help you to understand your results. There is a list of Y-DNA haplogroup projects in the ISOGG Wiki:
www.isogg.org/wiki/Y-DNA_haplogroup_projects
What is reported
Screenshots of the user interface and explanations of the various features can be seen on the BIG Y page in the FTDNA Learning Center:
www.familytreedna.com/learn/user-guide/other-test-results/big-y-page
FTDNA have a big internal SNP database with details of 36,562 known SNPs. Customers will be given a list of their results for all the SNPs in the database. They will be told whether they are ancestral or derived for each position, whether or not the SNP is on the tree, the genome reference co-ordinates, their genotype (their DNA letters) and the confidence rating.
There are three confidence levels for the SNP calls. High confidence means that all the reads essentially agree. Medium confidence means that the information looks good but it has to be manually curated. Low confidence indicates noisy data.
NGS coverage varies from person to person but it is expected that results will be provided for between 25,000 and 35,000 known SNPs per person. The amount of overlap with the tests from Full Genomes, Geno 2.0 and Chromo 2 is not yet known, but it is expected that the BIG Y will cover 90% of the SNPs in the Geno 2.0 and Chromo 2 tests. There are a handful of people in the genetic genealogy community who have tested with all four companies. Some people have also taken the Walk Through the Y test, the previous SNP discovery test from FTDNA which utilised Sanger sequencing. Once the BIG Y results have all been released and compared with the other tests the haplogroup project admins will be able to provide better information on the overlap between all the tests.
Customers will also be given a separate list of novel variants. These are defined as variants which differ from the reference sequence and which are not seen in the FTDNA SNP database. Thankfully the genome reference co-ordinates will be provided which will allow comparisons with SNPs identified in tests from other providers (with the exception of BritainsDNA who have not released the co-ordinates for their new S series SNPs [see my update from 4th March below]). Dr Mittelman does not yet know how many novel SNPs to expect per person. There is currently no function to compare novel variants in the database, but the test is very much a work in progress and he is open to suggestions for new ideas.
Information will not as yet be provided on INDELS (insertions and deletions), but experienced users will be able to extract the information from the raw data.
File formats
Two types of files will be provided: a VCF file and a BED file. These files are not currently available but should be ready for download some time next week.
The VCF (variant call format) file will consist of a list of all the variants identified, tagged by confidence and location. This is essentially a file showing all your differences from the reference sequence. For an explanation of the file format see the paper by Danecek et al (2011).[5] A sample VCF file can be found in the 1000 Genomes Wiki:
www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
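To give a feel for how a VCF file is laid out, here is a minimal Python sketch that reads the fixed, tab-separated columns defined in the VCF 4.1 specification and separates single-base substitutions (SNPs) from insertions and deletions. The sample lines and positions are invented for illustration. How FTDNA will encode its confidence levels in the file is not yet known, so the sketch simply assumes that information appears in the FILTER column.

```python
from typing import Iterable, Iterator, NamedTuple

BASES = {"A", "C", "G", "T"}


class Variant(NamedTuple):
    chrom: str
    pos: int     # 1-based position on the reference sequence
    ref: str     # reference allele
    alts: tuple  # one or more alternate alleles
    filt: str    # FILTER column ("PASS" means all filters passed)


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line of a VCF file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        chrom, pos, _id, ref, alt, _qual, filt = line.split("\t")[:7]
        yield Variant(chrom, int(pos), ref, tuple(alt.split(",")), filt)


def is_snp(v: Variant) -> bool:
    """True when the reference and every alternate allele is a single base."""
    return v.ref in BASES and all(a in BASES for a in v.alts)


# Invented example lines; real BIG Y files may carry additional columns.
SAMPLE_LINES = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrY\t2650000\t.\tA\tG\t50\tPASS\t.",
    "chrY\t2700000\t.\tCT\tC\t30\tPASS\t.",
    "chrY\t2800000\t.\tG\tT\t10\tLowQual\t.",
]

for v in parse_vcf(SAMPLE_LINES):
    kind = "SNP" if is_snp(v) else "indel"
    print(v.chrom, v.pos, v.ref, ",".join(v.alts), v.filt, kind)
```

The same test on allele lengths is one way an experienced user might pick out insertions and deletions, which are not shown on the results page.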
The BED file is a text file listing the ranges of the Y-chromosome for which data is available and for which it was possible to make confident calls. This file will cover all the positions that passed quality control. A useful guide to BED files can be found here:
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
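The practical use of a BED file is to tell a genuine negative result from a position that was simply not read. The Python sketch below, using invented ranges, loads the first three BED columns (chromosome, start, end) and checks whether a given position falls inside a covered range. One detail to watch is that BED start positions are zero-based and the end position is exclusive, whereas positions in a VCF file are one-based. The function names are my own, and the lookup assumes the ranges do not overlap.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

Regions = Dict[str, List[Tuple[int, int]]]


def parse_bed(lines: Iterable[str]) -> Regions:
    """Read chrom/start/end from BED lines into sorted per-chromosome ranges."""
    regions: Regions = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip comments and UCSC header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def is_covered(regions: Regions, chrom: str, pos: int) -> bool:
    """Check a 1-based (VCF-style) position against 0-based, half-open BED ranges."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last range whose start is <= the position (ranges assumed non-overlapping).
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def covered_bases(regions: Regions) -> int:
    """Total number of positions covered by all ranges."""
    return sum(end - start for ivs in regions.values() for start, end in ivs)


# Invented example ranges.
SAMPLE_BED = ["chrY\t100\t200", "chrY\t500\t600"]
regions = parse_bed(SAMPLE_BED)
print(is_covered(regions, "chrY", 150), covered_bases(regions))
```

Summing the range lengths in a real BED file should give the number of positions with results, which Dr Mittelman expects to be between 11.5 and 12.5 million per person.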
Information about the VCF and BED file formats will be added to the BIG Y Learning Center page in due course.
The raw data files in the form of BAM/FASTQ files will also be made available in due course but a decision needs to be made on the best way to provide the data. I imagine that the data will almost certainly be made available in the cloud, perhaps taking advantage of the new Google Genomics service, or another similar application.
Single SNP testing
The value of a DNA test is in the comparison process and the BIG Y test is no exception. It is hoped that large numbers of new SNPs will be discovered, many of which will be in a genealogical time frame. Ideally a paired testing strategy should be adopted with two very distantly related men from the same subclade taking the test. If novel SNPs are found which identify particular family groups then in theory it should be possible to order single SNPs. Single SNPs can be ordered either direct from Family Tree DNA or from Thomas Krahn’s new company YSEQ. The two companies offer a complementary range of SNPs. Single SNPs cost $35 each from YSEQ and $39 each from FTDNA. However, I suspect that if you are able to identify a SNP in the last two hundred years or so that is only likely to be shared by half a dozen men it will not be cost-effective for any company to offer a single SNP test. Much will also depend on the number of new SNPs identified in a given tree. It might well turn out to be more economical for a surname project to club together and pay for BIG Y tests for project members representing branches of the tree that are of particular interest.
There were some misleading reports emanating from the FTDNA group administrators' conference in Houston last November which suggested that FTDNA had an upper limit of 2000 on the number of new SNPs on offer. Dr Mittelman clarified that there is no limit on the number of new SNPs that can be ordered. There is a limit on the number of SNPs that can be tested at one time on the lab deck and that limit is 2000. FTDNA can in theory calibrate for use as many SNPs as they can order and design but it’s a question of managing the time.
SNP validation
I asked whether it was necessary for SNPs identified through next-generation sequencing to be validated using Sanger sequencing. Dr Mittelman advised that with high-confidence SNPs the data is very clean and validation is not necessary. Sanger sequencing might be needed for medium- and low-confidence calls where there are flags and not a lot of data. He also advised that next-generation sequencing is being used to validate the SNPs on the new Geno chip.
Poznik et al (2013) (supplementary data) did in fact validate their NGS SNPs using Sanger sequencing and found a concordance rate of 99.92% with just one discordant genotype.[2]
White paper
Dr Mittelman advised that once all the data has been through quality control FTDNA will then produce a white paper which will provide information on some of the technical details of the test. The paper will cover performance metrics, value proposition, etc, and they also hope to look at mutation rates, something which is of great interest to the genetic genealogy community and a subject of considerable debate and disagreement! The paper should be out in the next four to six weeks or so.
The new Y-tree
BIG Y data is currently being released using the now very out-of-date and somewhat irrelevant 2010 Y-tree. Bennett Greenspan, the Chief Executive Officer of Family Tree DNA, advised in the webinar that they have had teams of people working on the new tree in collaboration with the Genographic Project. The new tree will be fully integrated with Geno 2.0. The tree needs to be ready from both the technical point of view and the graphical interface, and it seems that it is the latter which is proving more problematic. The tree is not dependent on the release of a scientific paper. Bennett advised that it might be ready in the “next several weeks”. When the new tree is finally launched, SNPs from the BIG Y will be automatically mapped on the new tree.
Third-party tools
FTDNA want to encourage people to use third party tools to get more out of their results and to come up with new ways to analyse the data. I have previously written about YFULL, a Russian company which provides a very nice Y-chromosome interpretation service. See my review from November 2013. The service is currently free if you agree to let them have your sequence, but it is expected that they will charge a fee at some point. The Full Genomes Corporation have also indicated that they might be able to analyse BIG Y data though no announcement has yet been made. With the increasing availability of Y-chromosome sequencing data no doubt other tools and analytical services will appear in the future.
Additional questions
After the webinar had finished I realised that there were still some questions that I hadn't asked and David Mittelman kindly provided me with some answers by e-mail.
Q: Are there any plans to provide results for Y-STRs?
A: Big Y does span STRs but that was not the intent of the product. So you can go to the VCF files or the raw data and you will see insertions and deletions at STRs, however, we do not plan to add this to the web page. I would much rather recommend our established and proven STR tests.
Q: Does the BIG Y raw data also include the full mtDNA genome?
A: No, it is comprehensive sequencing of the accessible parts of the Y chromosome. We, as you know, offer full mitochondrial sequencing as a separate product.
Q: Will a list of positive SNP results be posted on the Project SNP pages?
A: Yes, if they are on the tree
Preliminary analysis of BIG Y results
The initial results from the first batch of BIG Y tests were producing an unexpectedly high number of novel variants. Vince Tilroe has analysed some of these results and reports as follows on the U106 mailing list:
It looks like many of the novel variants shared by many Big-Y testees may belong to a particular subclade below R-L20, the haplogroup to which the primary source of the anonymous male donors belongs to, whose sequences were used to build the ChrY reference assembly, and many of those may even be exclusively private to him. Greg Magoon had filtered them out from the 1KGP and FGC reports, but YFull had assigned "Y" identifiers to some of them.
I've compared novel variants from six Big-Y returns belonging to haplogroup R-L51 and below, and have so far identified 56 "novel variants" shared between at least two of them so far, but individual samples only had between 43 and 48 of those. This pretty much cuts the typical true novel variant count in half, leaving a count that is more in line to what was expected for this process.
Charles Moore, the U106 admin, has since received confirmation from another group that many of the novel variants are ancestral shared novel SNPs.
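The kind of comparison Vince describes, finding the "novel variants" that turn up in at least two kits, can be sketched in a few lines of Python. This is only an illustration of the idea and not his actual method. The kit names and variants are invented; real input would be each tester's novel variant list, keyed by position, reference allele and alternate allele.

```python
from collections import Counter
from typing import Dict, Set, Tuple

VariantKey = Tuple[int, str, str]  # (position, reference allele, alternate allele)


def shared_variants(kits: Dict[str, Set[VariantKey]], min_kits: int = 2) -> Set[VariantKey]:
    """Return the variants reported in at least `min_kits` different kits."""
    counts: Counter = Counter()
    for variants in kits.values():
        counts.update(set(variants))  # each kit counts at most once per variant
    return {v for v, n in counts.items() if n >= min_kits}


# Invented example data for three hypothetical kits.
kits = {
    "kit_A": {(1000, "A", "G"), (2000, "C", "T"), (3000, "G", "A")},
    "kit_B": {(1000, "A", "G"), (2000, "C", "T"), (4000, "T", "C")},
    "kit_C": {(1000, "A", "G"), (5000, "G", "C")},
}

shared = shared_variants(kits)
for name, variants in sorted(kits.items()):
    private = variants - shared
    print(f"{name}: {len(variants)} reported, {len(variants & shared)} shared, {len(private)} private")
```

Variants shared across many otherwise unrelated kits are the ones most likely to reflect the reference sequence donor rather than the tester, so removing them leaves a more realistic count of truly novel variants for each person.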
Other SNP tests
Full Genomes Corporation is the only other company which currently offers comprehensive Y-chromosome sequencing. Their test is substantially more expensive than the BIG Y but sequences more of the Y-chromosome. When the BIG Y raw data files become available it will be possible to do a comparison of the two tests. For comparisons of the available SNP tests, including the Geno 2.0 and Chromo 2 chip tests, see the SNP testing comparison chart in the ISOGG Wiki.
What are we going to do with all these SNPs?
I wrote in a previous blog post about the confusion of SNPs generated by the various SNP tests offered by the different testing companies. We now have a situation where four companies/organisations (Family Tree DNA/Genographic Project, Full Genomes, BritainsDNA/ScotlandsDNA and YFull) are maintaining their own proprietary SNP databases. There is a great need for an open access independent database of validated SNPs. ISOGG – the International Society of Genetic Genealogy – are probably in the best position to produce such a database, but they also have responsibility for maintaining the Y-SNP tree. The sheer amount of data generated from the next-generation sequencing tests will represent a significant challenge for the volunteer Y-SNP team. I do wonder if the present tree system is actually sustainable and whether, in the long run, it might be better to report results as differences from the reference sequence, as is the practice for mitochondrial DNA. Whatever happens, we will have an interesting year ahead of us.
Are you interested in ordering the BIG Y or another SNP test?
My advice for anyone thinking of ordering SNP testing is to be patient and wait for a few months until all the results from the first batches of BIG Y and Full Genomes tests have been analysed and compared. Once this process has been completed we will have a better picture of the new Y-chromosome landscape and the shape of the tree, and it will then be possible to make an informed choice as to which test to purchase. Dr Mittelman advised that there are no immediate plans for another BIG Y sale. At the moment the priority is to bring down the turnaround time for new orders which is currently 8 to 10 weeks.
If you are interested in being involved make sure you join the relevant haplogroup mailing lists and Facebook groups. If you've tested at Family Tree DNA make sure you join the appropriate haplogroup or subclade project. The mailing lists and groups are usually linked from the haplogroup project websites. There is also a list of mailing lists and Facebook groups in the ISOGG Wiki:
www.isogg.org/wiki/Genetic_genealogy_mailing_lists
Further information
There is a set of BIG Y FAQs in the FTDNA Learning Center:
www.familytreedna.com/learn/y-dna-testing/big-y
The BIG Y page in the Learning Center provides screenshots and descriptions of the user interface:
www.familytreedna.com/learn/user-guide/other-test-results/big-y-page
Elise Friedman presented a webinar on 28th February on the subject of "Getting to know BIG Y Results". A recording of the webinar should eventually be made available in the webinar archive in the Learning Center:
www.familytreedna.com/learn/ftdna/webinars
Update 2nd March 2014
The recording of the BIG Y webinar is now available online and can be accessed via this link (free registration required):
https://attendee.gotowebinar.com/recording/4739415541486853122
Update 3rd March 2014
I have put the full text of the letter from Nir Leibovich, in which he apologises for the lack of communication about the expected date of release of BIG Y results, online here. Despite expectations to the contrary, it was never FTDNA's intention to deliver all the results on 28th February. That was the date when the results were expected to start rolling out. It also transpires that there is currently no way for FTDNA to change the expected date on customers' personal pages until the expected date has actually passed.
I've received a number of comments about the problem with reagents which contributed to the delay. Dr David Mittelman has contacted me to clarify the issue:
"We sequence the Y using Illumina HiSeq equipment and we ran out of reagents to do this, and for a period in December and January, Illumina had a back order in place so we could not order more. Illumina filled the orders in the second half of January and we continued our work. Back orders happen and since Illumina is the only game in town, we don’t have other vendors to go to, when Illumina runs out. Of course we are now rolling out samples continuously and each week, in batches. Just like we do for all our products and just like Full Genomes and other companies do."
He adds:
"In the meantime as more batches complete I am confident people will be thrilled with the data. We were able to deliver better specs than I originally promised and... we will not ship subpar results to anyone. Everyone will get great data."
Update 4th March 2014
Dr Jim Wilson of BritainsDNA/ScotlandsDNA has now released a spreadsheet with details of the genome reference co-ordinates for all the Y-SNPs on the Chromo 2 chip. See the following blog post from CeCe Moore for further details and to download the spreadsheet:
Dr. Jim Wilson and ScotlandsDNA Release Y-SNP Positions for Chromo2
Thomas Krahn has now uploaded the 8000 or so novel markers to Ybrowse. This will allow the genetic genealogy community to cross-check all the new tree branches discovered by Jim Wilson earlier this year. Thomas Krahn has advised that his company YSEQ can design primers for some of the new SNPs as required.
Update 1st April 2014
Although the BIG Y .vcf and .bed files do not include mitochondrial DNA data, it now transpires that mtDNA is included in the BAM files. The mtDNA data can be extracted using third-party tools. For further details see the following blog post from Roberta Estes:
http://dna-explained.com/2014/04/01/mitochondrial-dna-results-from-the-big-y-test
See also Felix Chandrakumar's blog post on the YFull interpretation service which includes a report on the mtDNA data extracted from his BIG Y BAM file:
http://www.fc.id.au/2014/03/yfull-y-chr-sequence-interpretation.html
Update 29th August 2014
Family Tree DNA have published a white paper outlining the methodology used for the test and the analysis.
Footnotes and references
1. For links and resources on next-generation sequencing see the ISOGG Wiki page:
www.isogg.org/wiki/Next_generation_sequencing
The Y-chromosome reference sequence is 59.36 Mb, but this includes a 30-Mb stretch of constitutive heterochromatin on the q arm, a 3-Mb centromere, 2.65-Mb and 330-kb telomeric pseudoautosomal regions (PAR) that recombine with the X chromosome, and eight smaller gaps. This effectively leaves around 22.98 Mb of “assembled reference sequence”. If you can get hold of the Poznik paper it contains a very nice figure (Figure 1. Callability mask for the Y-chromosome) showing the regions of the Y-chromosome in which reliable genotype calls can be made.