Saturday, April 9, 2011
Genome Biology | Full text | A standard variation file format for human genome sequences
A standard variation file format for human genome sequences
Martin G Reese1*, Barry Moore2, Colin Batchelor3, Fidel Salas1, Fiona Cunningham4, Gabor T Marth5, Lincoln Stein6, Paul Flicek4, Mark Yandell2 and Karen Eilbeck7*
* Corresponding authors: Martin G Reese mreese@omicia.com - Karen Eilbeck keilbeck@genetics.utah.edu
Author Affiliations
1 Omicia, 2200 Powell Street, Suite 525, Emeryville, CA 94608, USA
2 Department of Human Genetics and Eccles Institute of Human Genetics, 15 North 2030 East, University of Utah, Salt Lake City, UT 84108, USA
3 Royal Society of Chemistry, Thomas Graham House, Cambridge, CB4 0WF, UK
4 EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust, Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
5 Department of Biology, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA
6 Ontario Institute for Cancer Research, 101 College St, Suite 800, Toronto, ON M5G0A3, Canada
7 Department of Biomedical Informatics, Health Sciences Education Building, Suite 5700, 26 South 2000 East, University of Utah, Salt Lake City, UT 84112, USA
Genome Biology 2010, 11:R88 doi:10.1186/gb-2010-11-8-r88
The electronic version of this article is the complete one and can be found online at: http://genomebiology.com/2010/11/8/R88
Received: 29 April 2010
Revisions received: 26 July 2010
Accepted: 26 August 2010
Published: 26 August 2010
© 2010 Reese et al; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
Background
With the advent of personalized genomics, we have seen the first examples of fully sequenced individuals [1-9]. Now, next-generation sequencing technologies promise to radically increase the number of human sequences in the public domain. These data will come not just from large sequencing centers, but also from individual laboratories. For reasons of resource economy, 'variant files', rather than raw sequence reads or assembled genomes, are rapidly emerging as the common currency for exchange and analysis of next-generation whole-genome re-sequencing data. Several data formats have emerged recently for sequencing reads (SRF) [10], read alignments (SAM/BAM) [11], genotype likelihoods/posterior SNP probabilities (GLF) [12], and variant calling (VCF) [13]. However, the resulting variant files of single nucleotide variants (SNVs) and structural variants (SVs) are still distributed as non-standardized tabular text files, with each sequence provider producing its own idiosyncratic data files [1-9]. The lack of a standard format complicates comparisons of data from multiple sources and across projects and sequencing platforms, greatly slowing the progress of comparative personal genome analysis. In response, we have developed GVF, the Genome Variation Format.
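To make the idea of a tab-delimited, GFF3-based variant line concrete, here is a minimal Python sketch of reading one such line. It assumes only the standard nine GFF3 columns (seqid, source, type, start, end, score, strand, phase, attributes). The function name, the sample line, its coordinates, and the `example_caller` source are made up for illustration and are not taken from the 10Gen dataset. The attribute keys shown are meant to resemble GVF usage and should be checked against the GVF specification.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by GFF3, which GVF extends.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_style_line(line):
    """Parse one GFF3-style feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(
            f"expected {len(GFF3_COLUMNS)} columns, got {len(fields)}"
        )
    record = dict(zip(GFF3_COLUMNS, fields))

    # Coordinates are 1-based integers in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Column 9 holds semicolon-separated key=value pairs; a value may be a
    # comma-separated list, and reserved characters are percent-encoded.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Illustrative line only: values are invented, not real data.
example = (
    "chr16\texample_caller\tSNV\t49291141\t49291141\t.\t+\t.\t"
    "ID=ID_1;Variant_seq=A,G;Reference_seq=G"
)
record = parse_gff3_style_line(example)
print(record["type"], record["start"], record["attributes"]["Variant_seq"])
```

Because the type column is expected to carry a Sequence Ontology term, a reader built along these lines could group or filter variants by `record["type"]` without any provider-specific parsing rules.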