Saturday, 3 May 2014

Driving in the wrong direction with a dodgy DNA satnav

I've been receiving a lot of questions in the last couple of days about the new DNA "satnav" tool called GPS (Geographic Population Structure), which purports to pinpoint the village that your ancestors lived in one thousand years ago. See, for example, the articles in the Daily Mail and the Washington Post. There was also some prominent and uncritical coverage on BBC Breakfast News on Thursday, featuring a segment in which the BBC weather presenter Carol Kirkwood was given the results of her DNA test on air and told that her ancestors were from the town of Crieff in Scotland. As Chris Jiggins has pointed out on Twitter, the acronym GPS seems to have been chosen deliberately to "promote a completely false sense of accuracy".

The company which is offering this service is a new start-up by the name of Prosapia Genetics, which has been set up by Tatiana Tatarinova from the Children's Hospital Los Angeles. The company proudly proclaims on its website: "Our first tool, GPS, will tell you where your DNA was forged, and is accurate to home village with a time resolution of the past 1,000 years."

The reports are based on an analysis of autosomal SNPs. You can either order a test through Prosapia Genetics, who appear to have an affiliate relationship with Family Tree DNA, or you can submit your raw data file from a test you've already taken with one of the companies that offers autosomal DNA testing - AncestryDNA (US only),  23andMe, Family Tree DNA, Geno 2.0 or BritainsDNA/ScotlandsDNA. A range of reports is offered with prices varying depending on the number of reference populations used for the analysis. The reports simply give you a set of geographical co-ordinates, which are supposed to represent the "ancient home" of all of your ancestors, and a map showing where your ancestors lived. We are now getting feedback from a number of people who've paid for this service and it would appear, not surprisingly, that the reality does not match the hype.

Julie Matthews bought the Basic Test, which covers 100 reference populations. She commented in the Facebook R1b-L21 group:
I spent $29 to discover that my "homeland" was in the middle of the River Humber in England. I knew we all descended from fish - here's proof. Don't waste your money!
Teresa Vega paid for the Super Test, which includes 500 reference populations. She writes in the ISOGG Facebook group:
Totally unconvincing. Stupid me paid $42.99 for nada! My ancestral home is smack dab west of Puerto Rico in the Atlantic Ocean! I learned nothing and it told me to upgrade to another test for more detailed results -- a test they don't even have listed! Don't believe the hype!!!!
Teresa's report can be seen online here.

JoAnn O'Linger had a similarly misleading result. She reports in the ISOGG Facebook group:
I had a similarly disappointing result from Prosapia (paid for), it was the "Super Test" as well: 
" JoAnn ordered a Super GPS Test of her DNA data. We found the following GPS Co-ordinates : Latitude 56.7811288256845 and Longitude 4.26921663910535 
A map pointing the location is given below with a short guide on how to interpret this results.
How to interpret your results? 
GPS coordinates indicate the place where your DNA was forged before your family may have moved to your current location. Because borders changed throughout history, your ancestors may have been part of an ancient country once ruled the region. If your GPS coordinates are in the water, it indicates mixture between two populations on the two ends of the body of water, in which case we suggest you register to the upcoming GPS2 tool that would provide you with the origins of your parents. If you wish to learn more about your past, we suggest you try the Advanced test or the Super test, which provide much higher accuracy." 
JoAnn says: "Those coordinates are squarely in the North Sea, which does make sense as I am the typical American mutt, with mostly Irish and English heritage, but if one goes further back, much of that is from Norman French and Gaelic-Norse Orcadians. So it makes sense, but in my opinion it's not worth the high price."
Prosapia Genetics have a Forum where you can read the comments from their customers, many of whom have expressed similar disappointment at the service offered:

http://prosapiagenetics.com/community/viewforum.php?f=2

[Update 10th May 2014 The Prosapia Genetics Forum is now restricted to members only. I am told that complaints and negative comments have been deleted and comments are being moderated.]

This is not surprising as the whole concept of the test is fundamentally flawed. If we assume 30 years per generation and go back 35 generations, or about 1,050 years, we will theoretically have 34,359,738,368 ancestors in that generation alone. This figure does of course exceed the population of the world at that time, and in reality there will be lots of pedigree collapse which will reduce the number of distinct ancestors considerably. Even so, the mind-boggling figures demonstrate that it is quite meaningless to try and pinpoint a single geographical location as the origin of all those diverse ancestors one thousand years ago. Furthermore, we only inherit the DNA of a tiny subset of our ancestors. To understand why this is the case read Luke Jostins' blog post "How many ancestors share our DNA" and the posts from Graham Coop and Blaine Bettinger that are linked in that article.
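
For anyone who wants to check the doubling arithmetic, here is a minimal Python sketch. It assumes two parents per person, no pedigree collapse and the 30-year generation length used above; the function names are my own and purely illustrative.

```python
YEARS_PER_GENERATION = 30  # assumption taken from the paragraph above


def ancestors_in_generation(n: int) -> int:
    """Theoretical ancestors exactly n generations back (no pedigree collapse)."""
    return 2 ** n


def cumulative_ancestors(n: int) -> int:
    """Theoretical ancestors summed over generations 1..n: 2 + 4 + ... + 2**n."""
    return 2 ** (n + 1) - 2


generations = 35
print(generations * YEARS_PER_GENERATION)            # years spanned: 1050
print(f"{ancestors_in_generation(generations):,}")   # 34,359,738,368
print(f"{cumulative_ancestors(generations):,}")      # 68,719,476,734
```

Either figure is far larger than any estimate of the world population a thousand years ago, which is why pedigree collapse is unavoidable.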

Even if it were possible to pinpoint a single location to represent our millions of ancestors from a thousand years ago, we would need accurate "maps" in the form of carefully sampled reference populations in order to be able to use our DNA satnav. Unfortunately, we only have a limited number of reference populations available, many of which have been sampled for medical purposes with no attempt made to collect the relevant "co-ordinates" in the form of  detailed genealogical information. Consequently, any maps included in a reference genome "satnav" are going to have massive black holes. It is therefore not surprising that this DNA satnav is misdirecting people into rivers and oceans!
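
To see how a single "ancestral" point can end up in open water, consider a toy calculation. The sketch below assumes a method that simply places a person at the weighted average of the locations of a few reference populations. This is an illustration only, not a description of the published GPS algorithm, and the two reference points and their weights are made up.

```python
def weighted_centroid(points):
    """points: list of (latitude, longitude, weight) tuples.

    Returns the weighted mean (latitude, longitude). This is a flat
    average of degrees, which is a simplification that ignores the
    curvature of the Earth.
    """
    total = sum(weight for _, _, weight in points)
    lat = sum(la * weight for la, _, weight in points) / total
    lon = sum(lo * weight for _, lo, weight in points) / total
    return lat, lon


# Hypothetical reference locations and weights, not real survey data
references = [
    (57.0, -4.0, 0.5),  # a point in the Scottish Highlands
    (60.0, 8.0, 0.5),   # a point in southern Norway
]

print(weighted_centroid(references))  # (58.5, 2.0), which is in the North Sea
```

Someone with ancestry split evenly between the two places gets a "homeland" where none of their ancestors could have lived.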

The methodology behind the GPS tool was outlined in a paper by Elhaik et al entitled "Geographic population structure analysis of worldwide human populations infers their biogeographical origins". The paper was published in the scientific journal Nature Communications. Despite the fact that the Prosapia Genetics website appears to have been launched on the same day that the paper was published, Tatiana Tatarinova, the founder of the Prosapia Genetics website and one of the lead authors, has not declared any "competing financial interests". The paper has already been the subject of controversy. The technique described in the paper offers nothing new and it is claimed that the methodology has been copied from that used by the blogger Dienekes Pontikos, who writes under a pseudonym. For background see Dienekes' two blog posts on the subject:

- Nature Communications, the Genographic Project, Elhaik et al. re-discover zombies, the Oracle, etc. 3 years after the fact...
- The Geographic Position Structure (GPS) algorithm of Elhaik et al. (2014) is basically wrong

See in particular the comments section of the first of the above two posts, where Eran Elhaik attempts to defend himself against the charge of plagiarism.

Joe Pickrell, one of the reviewers of the paper, has posted a summary of his critique which is well worth a read. The review can be found here:

http://jkplab.org/2014/04/30/review-geographic-population-structure-gps-of-worldwide-human-populations-infers-biogeographical-origin/

The authors themselves concede in the paper that the technique has its limitations and will only work if "the appropriate samples are available in the reference population data set". They appear to have cherry-picked some conveniently isolated populations such as the Sardinians for the purposes of their study, but the technique did not work for other populations:
To test GPS’s accuracy with individuals from populations that were not included in the reference population set, we conducted two analyses. We first repeated the previous analysis using the leave-one-out procedure at the population level. As expected, GPS accuracy decreased with 50% of worldwide individuals predicted to be 450 km away from their true origin. The predicted distance increased to 1,100 and 1,750 km for 80 and 90% of the individuals, respectively (Fig. 4a). Because GPS best localizes individuals surrounded by M genetically related populations, populations from island nations (for example, Japan and United Kingdom) or populations whose most related populations were under-represented in our reference population data set (for example, Peru and Russia) were most poorly predicted. Consequently, the median distances to the true origin were much smaller for individuals residing in Europe (250 km), Africa (300 km) and Asia (450 km) due to their being more commonly represented in the reference population data set compared with Native Americans and Oceanians. These results represent the upper limit of GPS’s accuracy when the specific population of the test individual is absent from the reference population data set.
The University of Sheffield issued a hyped-up press release, which also includes a link to a video on YouTube. As is often the case, the media have picked up on the hype in the press release and have made no attempt to read the scientific paper and understand the limitations of the methodology. I hope that there have not been too many people who have paid out good money for these misleading DNA satnav reports.

Note that if you've taken a test with one of the genetic genealogy companies there are many free services that you can use to get an alternative reading of your data and a prediction of your "ethnicity", all of which will give much better results than the commercial offerings from Prosapia. One of the best free websites is GedMatch, which allows you to get readings from a wide range of different services. You can find a full list of services in the ISOGG Wiki article on admixture analyses. However, it is still very difficult to distinguish between populations at anything finer than the continental level, and all such reports should be treated with a very large pinch of salt.

Update 6 May 2014
Teresa Vega now tells me that she has received a full refund for her test from PayPal. She told PayPal that she had felt misled by the company's claims and she was unhappy that they had recommended upgrading to a test that they did not even have on their site. JoAnn O'Linger is now also in the process of applying for a refund.

Update 3rd September 2014
Although the Prosapia Genetics domain name was originally registered to Dr Tatiana Tatarinova, it was subsequently transferred to Vladimir Makarov.

Update 30th May 2015
In April 2015 Dr Eran Elhaik gave a presentation at Who Do You Think You Are? Live on the subject of "Reaching the Holy Grail in genetic genealogy: from genome to home village". For further details see the summary on the DNA sat nav page on the UCL website. In particular do listen to the recording of the exchange in the Q&A session between Eran Elhaik and Professor Mark Thomas.

Update 16th July 2016
A new paper by Pavel Flegontov, Alexei Kassian, Mark G. Thomas, Valentina Fedchenko, Piya Changmai and George Starostin, "Pitfalls of the geographic population structure (GPS) approach applied to human genetic history: a case study of Ashkenazi Jews", provides a critique of the GPS methodology used for the Prosapia Genetics test with specific reference to its application to infer the origins of the Yiddish language.

Update 31st October 2016
A corrigendum to the Elhaik et al 2014 paper on geographic population structure has been published by Nature Communications. It contains a conflict of interests statement from the authors. The statement includes an acknowledgement that one of the authors (Tatiana Tatarinova) has a link with Prosapia Genetics.

Acknowledgements
Many thanks to Julie Matthews, JoAnn O'Linger and Teresa Vega for permission to use their quotes and reports.

Related blog posts
- My letter in Family Tree Magazine about "genetic homeland" stories

See also
Since writing this article I have discovered other discussions on the subject. I have posted the relevant links below and will update the list if further links become available:
- Prosapia Genetics - Worth the money? A review by Lorine McGinnis Schulze
- Researchers develop DNA GPS tool to accurately trace geographical ancestry -  a discussion on the Reddit forum
- Is GPS DNA tracking too good to be true? An article by Peter Calver in the Lost Cousins newsletter, May 2014
- So many genes, so close to home by Matthew Thomas, BioNews, 12 May 2014.
- Ancestral home pinpointed by DNA by Julie Lutter, Family History Research by Jodi, 13 May 2014.

© 2014-2016 Debbie Kennett

Saturday, 26 April 2014

A new BAM file analysis service from Full Genomes Corporation and a special offer on the FGC test

The following message is posted on behalf of Full Genomes Corporation:

Full Genomes Corporation (FGC) is announcing the official launch of a service to analyze BAM files from Family Tree DNA's Big Y product. The analysis is being launched at a price of $50 per kit. Recently, FGC had offered the Big Y analysis for a limited time, as a beta product, at no charge. FGC will continue to allow individuals to contribute their BAM files to the Full Genomes database without charge, so that their results may be used in kit cross-comparisons. The offering is designed to provide broader access to FGC's proprietary Y chromosome analysis services and to build FGC's database for purposes of kit comparisons.

The analysis will include the same reports as provided to customers of the Full Genomes sequencing product, with the exception of the mitochondrial DNA analysis, which will use the Yoruba reference sequence for Big Y kits. So, the analysis will consider Y-STRs and INDELs, in addition to Y-SNPs. To be clear, however, the results won't be able to achieve the same resolution as the Full Genomes sequencing product due to limitations with the underlying data from the Big Y test.

Interested individuals should first obtain access to their Big Y BAM file by contacting Family Tree DNA customer service. Those interested in ordering analysis can follow the instructions here to set up a Full Genomes account, make payment, and upload their BAM file; analysis will be performed in weekly batches. Those who are only interested in contributing their results to the Full Genomes comparison database may send the download link to fgcfilesharing@gmail.com, while also indicating their interest in donating their results and optionally providing a name (like FTDNA Kit Number) to associate with the results.

According to Dr. Greg Magoon, Y chromosome data analysis consultant for FGC, "I think the FGC analysis will address many of the needs that have been expressed by members of the genetic genealogy community who have been looking at Big Y results in recent weeks. In my view, the main strengths of the FGC analysis include its cross-kit comparisons and its SNP reliability classifications. We have put a lot of R&D into separating the wheat from the chaff to allow customers and researchers to quickly focus on the most reliable, phylogenetically-useful variants. I think the FGC analysis will help to significantly speed the interpretation of results and decrease the burden on busy genetic genealogists."

Separately, FGC is announcing a beta-stage referral program, which will provide customers with access to advanced analyses of their Full Genomes "next-gen" sequencing data. A Full Genomes customer who refers at least three other individuals to order the Full Genomes test will be entitled to a bleeding-edge, advanced analysis of their choosing. Potential analysis options include:
- Remapping of results to the newer, build 38 human genome reference sequence
- Remapping of results with a new and improved alignment algorithm/approach
- Y-STR analysis using a newer, larger STR database
- Phylogenetic analysis for portions of the Y tree
- Variant calling (SNPs and INDELs) for autosomal and X-chromosome data

Interested customers are advised to contact sales@fullgenomes.com to supply documentation of referrals and to discuss custom analysis options.

Dr Magoon said: "From a research perspective, I'm very excited about the potential for the referral program to push the boundaries of Y-chromosome analysis. We've already been able to work with customers on a case-by-case basis to do some very interesting customized analyses with the Full Genomes results, including the identification of large duplications and deletions through copy-number variation (CNV) analysis."

Speaking about FGC's next-gen sequencing test, CEO Justin Loe said:  "The FGC Y chromosome product is the most comprehensive in the market today but it is also, as we recognize, expensive for many potential customers. Over the near term, we expect to be able to make this product more affordable. Additionally, with the advent of new sequencing technologies other products will also be offered."
In fact, in honor of DNA Day, Full Genomes is currently offering a limited-time discount of 20% off the normal price for their comprehensive Y chromosome sequencing test (using coupon code "FGCDNA").

Dr Magoon commented: "I think what we're seeing across genetic genealogy is that companies are finding a niche with products focused on particular areas. For example, 23andMe has been a pioneer in autosomal DNA. We have seen that BritainsDNA has been making great advances in developing innovative chip-based tests for Y chromosome (and other) markers. Family Tree DNA has established a leadership role in Y-STRs and in full mitochondrial DNA sequencing. YSEQ, with Dr. Thomas Krahn, is the world leader in developing Y chromosome marker tests using Sanger sequencing. I am very excited to see FGC working hard to establish a similar role here in the field of "next-gen" Y chromosome sequencing."

Related blog posts
- A confusion of SNPs

Friday, 25 April 2014

The new 2014 Y-DNA haplotree has arrived!

Today saw the launch of Family Tree DNA's new 2014 Y-DNA haplotree which has been created in partnership with National Geographic's Genographic Project. If you've tested with Family Tree DNA you will find the new tree by going to your personal page and clicking on "haplotree and SNPs". Below is a screenshot of the upper portion of the tree for haplogroup R:


The tree is too large to fit into a single screenshot. Here is the relevant portion of the tree for my dad who is R-Z12, a sub-branch of U106.


Note that this is very much an interim tree. It is based on SNPs tested with the Genographic Project's Geno 2.0 chip, and the cut off date for inclusion of SNPs is November 2013. The new tree does not include all the thousands of new SNPs identified from testing with Big Y, Full Genomes and Chromo 2. The tree will eventually be much more comprehensive but FTDNA are being careful about the data they use from other sources and are insisting that SNPs are only added from published data and raw data that they have personally verified rather than from interpreted data. They have promised that at least one update will be released this year. The FTDNA Learning Center will eventually be updated with information about the new haplotree. If you have questions about a particular SNP that is in the wrong place on the tree or if you spot any other errors you should send an e-mail to the FTDNA help desk with Y-Tree in the subject line.

FTDNA are now recommending SNPs for people to test. I've only had a chance to look briefly at the SNP recommendations for a few project members. It is apparent that in some cases the SNPs that are recommended for testing are not appropriate. SNPs are only recommended if they pass certain percentage thresholds and there might well be a more appropriate downstream SNP that would be more suitable. If you are interested in ordering single SNP testing, make sure you join the appropriate Y-DNA haplogroup project and seek advice from the project administrators. If not, you could end up wasting money ordering unnecessary SNPs.

The following information has been provided by Family Tree DNA.

FAST FACTS
• Created in partnership with National Geographic’s Genographic Project
• Used GenoChip containing ~10,000 previously unclassified Y-SNPs
• Some of those SNPs came from Walk Through the Y and the 1000 Genome Project
• Used first 50,000 high-quality male Geno 2.0 samples
• Verified positions from 2010 YCC by Sanger sequencing additional anonymous samples
• Filled in data on rare haplogroups using later Geno 2.0 samples

Statistics
• Expanded from approximately 400 to over 1200 terminal branches
• Increased from around 850 SNPs to over 6200 SNPs
• Cut-off date for inclusion for most haplogroups was November 2013

Total number of SNPs broken down by haplogroup:
A 406
B 69
BT 8
C 371
CT 64
D 208
DE 16
E 1028
F 90
G 401
H 18
I 455
IJ 29
IJK 2
J 707
K 11
K(xLT) 1
L 129
LT 12
M 17
N 168
NO 16
O 936
P 81
Q 198
R 724
S 5
T 148
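
As a quick cross-check, the per-haplogroup counts above can be added up and compared with the "over 6200 SNPs" headline figure. The sketch below simply copies the numbers from the list; nothing else is assumed.

```python
# SNP counts per haplogroup, copied from the list above
snp_counts = {
    "A": 406, "B": 69, "BT": 8, "C": 371, "CT": 64, "D": 208, "DE": 16,
    "E": 1028, "F": 90, "G": 401, "H": 18, "I": 455, "IJ": 29, "IJK": 2,
    "J": 707, "K": 11, "K(xLT)": 1, "L": 129, "LT": 12, "M": 17, "N": 168,
    "NO": 16, "O": 936, "P": 81, "Q": 198, "R": 724, "S": 5, "T": 148,
}

total = sum(snp_counts.values())
print(total)  # 6318

# The three haplogroups with the most SNPs on the tree
top_three = sorted(snp_counts, key=snp_counts.get, reverse=True)[:3]
print(top_three)  # ['E', 'O', 'R']
```

The total of 6,318 is consistent with the stated figure of over 6200 SNPs.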

myFTDNA Interface
• Existing customers receive free update to predictions and confirmed branches based on existing SNP test results.
• Haplogroup badge updated if new terminal branch is available
• Updated haplotree design displays new SNPs and branches for your haplogroup
• Branch names now listed in shorthand using terminal SNPs
• For SNPs with more than one name, in most cases the original name for SNP was used, with synonymous SNPs listed when you click "More…"
• No longer using SNP names with .1, .2, .3 suffixes. Back-end programming will place SNP in correct haplogroup using available data.
• SNPs recommended for additional testing are pre-populated in the cart for your convenience. Just click to remove those you don’t want to test.
• SNPs recommended for additional testing are based on 37-marker haplogroup origins data where possible, 25- or 12-marker data where 37 markers weren't available.
• Once you've tested additional SNPs, that information will be used to automatically recommend additional SNPs for you if they’re available.
• If you remove those prepopulated SNPs from the cart, but want to re-add them, just refresh your page or close the page and return.
• Only one SNP per branch can be ordered at one time – synonymous SNPs can possibly [be] ordered from the Advanced Orders section on the Upgrade Order page.
• Tests taken have moved to the bottom of the haplogroup page.

Coming attractions
• Group Administrator Pages will have longhand removed.
• At least one update to the tree to be released this year.
• Update will include: data from Big Y, relevant publications, other companies' tests from raw data.
• We'll set up a system for those who have tested with other big data companies to contribute their raw data file to future versions of the tree.
• We're committed to releasing at least one update per year.
• The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks. At that time, all Geno 2.0 participants’ results will be updated accordingly and accessible via the Genographic Project website.

BACKGROUND
Family Tree DNA created the 2014 Y-DNA Haplotree in partnership with the National Geographic Genographic Project using the proprietary GenoChip. Launched publicly in late 2012, the chip tests approximately 10,000 Y-DNA SNPs that had not, at the time, been phylogenetically classified.

The team used the first 50,000 male samples with the highest quality results to determine SNP positions. Using only tests with the highest possible “call rate” meant more available data, since those samples had the highest percentage of SNPs that produced results, or “calls.”
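
The idea of ranking samples by call rate can be sketched in a few lines of Python. The sample names and genotype calls below are invented for illustration, and this is not a description of how the team actually processed the Geno 2.0 data.

```python
def call_rate(calls):
    """Fraction of tested SNPs that produced a result; None marks a no-call."""
    return sum(call is not None for call in calls) / len(calls)


# Hypothetical samples, each with calls for the same four SNPs
samples = {
    "sample_a": ["A", "G", None, "T"],
    "sample_b": ["A", "G", "C", "T"],
    "sample_c": [None, None, "C", "T"],
}

# Rank samples from highest to lowest call rate
ranked = sorted(samples, key=lambda name: call_rate(samples[name]), reverse=True)
print(ranked)  # ['sample_b', 'sample_a', 'sample_c']
```

Keeping only the top-ranked samples means fewer gaps in the data used to place each SNP.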

In some cases, SNPs that were on the 2010 Y-DNA Haplotree didn’t work well on the GenoChip, so the team used Sanger sequencing on anonymous samples to test those SNPs and to confirm ambiguous locations.

For example, if it wasn’t clear if a clade was a brother (parallel) clade, or a downstream clade, they tested for it.

The scope of the project did not include going farther than SNPs currently on the GenoChip in order to base the tree on the most data available at the time, with the cutoff for inclusion being about November of 2013.

Where data were clearly missing or underrepresented, the team curated additional data from the chip where it was available in later samples. For example, there were very few Haplogroup M samples in the original dataset of 50,000, so to ensure coverage, the team went through eligible Geno 2.0 samples submitted after November, 2013, to pull additional Haplogroup M data. That additional research was not necessary on, for example, the robust Haplogroup R dataset, for which they had a significant number of samples.

Family Tree DNA, again in partnership with the Genographic Project, is committed to releasing at least one update to the tree this year. The next iteration will be more comprehensive, including data from external sources such as known Sanger data, Big Y testing, and publications. If the team gets direct access to raw data from other large companies’ tests, then that information will be included as well. We are also committed to at least one update per year in the future.

Known SNPs will not intentionally be renamed. Their original names will be used since they represent the original discoverers of the SNP. If there are two names, one will be chosen to be displayed and the additional name will be available in the additional data, but the team is taking care not to make synonymous SNPs seem as if they are two separate SNPs. Some examples of that may exist initially, but as more SNPs are vetted, and as the team learns more, those examples will be removed.

In addition, positions or markers within STRs, as they are discovered, or large insertion/deletion events inside homopolymers, potentially may also be curated from additional data because the event cannot accurately be proven. A homopolymer is a sequence of identical bases, such as AAAAAAAAA or TTTTTTTTT. In such cases it’s impossible to tell which of the bases the insertion is, or if/where one was deleted. With technology such as Next Generation Sequencing, trying to get SNPs in regions such as STRs or homopolymers doesn’t make sense because we’re discovering non-ambiguous SNPs that define the same branches, so we can use the non-ambiguous SNPs instead.

Some SNPs from the 2010 tree have been intentionally removed. In some cases, those were SNPs for which the team never saw a positive result, so while it may be a legitimate SNP, even haplogroup defining, it was outside of the current scope of the tree. In other cases, the SNP was found in so many locations that it could cause the orientation of the tree to be drawn in more than one way. If the SNP could legitimately be positioned in more than one haplogroup, the team deemed that SNP to not be haplogroup defining, but rather a highly polymorphic location.

To that end, SNPs no longer have .1, .2, or .3 designations. For example, J-L147.1 is simply J-L147, and I-147.2 is simply I-147.  Those SNPs are positioned in the same place, but back-end programming will assign the appropriate haplogroup using other available information such as additional SNPs tested or haplogroup origins listed. If other SNPs have been tested and can unambiguously prove the location of the multi-locus SNP for the sample, then that data is used. If not, matching haplogroup origin information is used.
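
The renaming itself is easy to picture. Here is a minimal Python sketch that drops a trailing .1, .2 or .3 style suffix from a branch name. The function name and the regular expression are my own, and Family Tree DNA have not published how their back-end programming handles this.

```python
import re

# Matches a trailing ".<digits>" suffix such as ".1" or ".2"
_LOCUS_SUFFIX = re.compile(r"\.\d+$")


def strip_locus_suffix(snp_name: str) -> str:
    """Return the SNP or branch name without its multi-locus suffix."""
    return _LOCUS_SUFFIX.sub("", snp_name)


print(strip_locus_suffix("J-L147.1"))  # J-L147
print(strip_locus_suffix("R-Z12"))     # R-Z12 (no suffix, so unchanged)
```

Stripping the suffix loses the information about which occurrence of the SNP is meant, which is why other tested SNPs or haplogroup origin data are then needed to place the sample.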

We will also move to shorthand haplogroup designations exclusively. Since we’re committing to at least one iteration of the tree per year, using longhand that could change with each update would be too confusing.  For example, Haplogroup O used to have three branches: O1, O2, and O3. A SNP was discovered that combined O1 and O2, so they became O1a and O1b.

There are over 1200 branches on the 2014 Y Haplogroup tree, as compared to about 400 on the 2010 tree. Those branches contain over 6200 SNPs, so we’ve chosen to display select SNPs as “active” with an adjacent “More” button to show the synonymous SNPs if you choose.

The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks.  At that time, all Geno 2.0 participants’ results will be updated accordingly and will be accessible via the Genographic Project website.

QUOTES
Elliot Greenspan has provided the following quotes in conversation with Janine Cloud, Family Tree DNA's GAP Liaison and Events Co-ordinator:

"I want it to be the most accurate tree it can be, but I also want it to be interesting. That's the key. Historical relevance is what we're to discover. Anthropological relevance. It's not just who has the largest tree, it's who can make the most sense out of what you have [that] is important."

"This year we're committing to launching another tree. This tree will be more comprehensive, utilizing data from external sources: known Sanger data, as well as data such as Big Y, and if we have direct access to the raw data to make the proof (from large companies, such as the Chromo2) or a publication, or something of that nature. That is our intention that it be added into the data."

"We’re definitely committed to update at least once per year. Our intention is to use data from other sources, as well as any SNPs we can, but it must be well-vetted. NGS and SNP technology inherently has errors. You must curate for those errors otherwise you’re just putting slop out to customers. There are some SNPs that may bind to the X chromosome that you didn’t know. There are some low coverages that you didn’t know."

"With technology such as this [next-generation sequencing] you're able to overcome the urge to test only what you’re likely to be positive for, and instead use the shotgun method and test everything. This allows us to make the discovery that SNPs are not nearly as stable as we thought, and they have a larger potential use in that sense."

"Not only does the raw data need to be vetted but it needs to make sense. Using Geno 2.0, I only accepted samples that had the highest call rate, not just because it was the best quality but because it was the most data. I don't want to be looking at data where I'm missing potential information A, or I may become confused by potential information B. That is something that will bog us down. When you’re looking at large data sets, I’d much rather throw out 20% of them because they’re going to take 90% of the time than to do my best to get one extra SNP on the tree or one extra branch modified, that is not worth all of our time and effort. What is, is figuring out what the broader scope of people are, because that is how you break down origins. Figuring one single branch for one group of three people is not truly interesting until it's 50 people, because 50 people is a population. Three people may be a family unit. You have to have enough people to determine relevance. That's why using large datasets and using complete datasets are very, very important."

Update 27 April 2014
A recording of the Family Tree DNA webinar presented by Elise Friedman on the launch of the new 2014 haplotree is now available online and can be accessed here (free registration required).

Related blog posts
- A confusion of SNPs

Wednesday, 23 April 2014

The 2014 Y-DNA haplotree and special offers for DNA day

Family Tree DNA group administrators have received notification that the long-awaited 2014 Y-DNA haplotree is to be launched on Friday 25th April to coincide with DNA Day. The date also coincides with something known in America as National Arbor Day, which I'd never previously heard of but which is rather aptly related to the planting and nurturing of trees, albeit real ones rather than those constructed from DNA. Starting on Friday, the 37-marker Y-DNA test will also be on sale at a reduced price for a limited time. Here are the details:
National DNA Day, celebrated on April 25, commemorates the completion of the Human Genome Project in April 2003 and the discovery of DNA's double helix in 1953.
Since 1970, the U.S. has observed National Arbor Day, dedicated to the planting and nurturing of trees, on the last Friday in April.
This year National Arbor Day falls on National DNA Day, so what better opportunity for Family Tree DNA to release the long-awaited 2014 Y-DNA Haplotree! 
We wanted you, the group administrators who have done so much to contribute to the success of the company, to know before we release the news to the entire Y database and the genetic genealogy community. 
In addition to expanding the tree from 400 to 1000 terminal branches, the Haplotree page will have an updated, fresh design. 
Our engineering team will begin to push the code that will update the database prior to the official release of the tree, so you'll see some changes in terminal SNPs and haplogroups for those who have done additional testing. 
To help with the transition, our Webinar Coordinator, Elise Friedman will host a live webinar on DNA Day for a demonstration of the new tree and more details about this landmark update on Friday, April 25, 2014 @ 12pm Central (5pm UTC). 
To register, click here: http://bit.ly/1dGbbbx 
A recording of this webinar will be posted to the Webinars page of our Learning Center within 24-48 hours after the live event: https://www.familytreedna.com/learn/ftdna/webinars 
*********************************************************************** 
And because we know you're going to ask...we will have a DNA Day sale that suits the occasion!  
Y-DNA SNPs will be 20% off from April 25 - 29. In addition, the Y-DNA 37 test will be 20% off the retail price.
The sale officially begins at 12.01am Houston time on 25th April and ends at 1.59 pm on 29th April. If you are ordering a Y-DNA test make sure you order through a surname project or a geographical project to benefit from the additional project discount. As always I would be very happy to welcome new members to my Cruwys/Cruse/Cruise/Crew(es) DNA Project and my Devon DNA Project.

Thomas Krahn's company YSEQ has also announced a price reduction. Single SNPs are reduced to $25 with immediate effect through until Father's Day on 15th June 2014. For further details about YSEQ see my previous blog post YSEQ.net - a new company offering a single SNP testing service.

Family Tree DNA last updated their Y-DNA haplotree back in 2010. There have been a huge number of changes since then so the new tree will be most welcome. However, with the tsunami of new SNPs now being identified from the Big Y, Full Genomes and Chromo 2 tests, the 2014 tree is already going to be very out of date as soon as it is published. To understand the problem read my previous blog post on a confusion of SNPs. I presume the new tree will also see the full implementation of the shorthand naming system. For example, the format R-Z12 will be used instead of the unwieldy longhand version which, according to the current ISOGG Y-SNP tree, is  R1b1a2a1a1c2b2a1a1a1. I would also hope that the new tree will have the facility built in to allow more frequent updates in the future. Let's wait and see what Friday brings. Here's hoping for a smooth transition.