JGI Joint Genome Institute CIG Center for Integrative Genomics

Glycine max

browse the genome blast against the genome download the data
 

About the Glycine max genome:

 

Overview

The Soybean (Glycine max) genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Gary Stacey, Randy Shoemaker, Scott Jackson, Jeremy Schmutz, and Dan Rokhsar.

Large-scale shotgun sequencing of soybean began in mid-2006 and will be completed in early 2008. A total of ~13 million attempted shotgun reads have been produced and deposited in the NCBI Trace Archive, in accordance with our commitment to early access and the Fort Lauderdale genome data release policy. See below for our publication plans.

The present assembly (Glyma0), gene set (Glyma0.1b), and browser constitute a preliminary release only, based on a partial dataset. It will be only transiently supported, as we expect to replace it with an improved, chromosome-scale Glyma1 version by the end of 2008. In particular, please note that gene identifiers and coordinates on the Glyma0 soybean genome will not be preserved in future releases. Early users of this data are encouraged to track their favorite genes by saving local copies of the DNA sequences of these loci, and not by identifier or sequence coordinate.

Statistics


Genome Size
Approximately 950Mb arranged in supercontigs with N50 = 6.5Mb
Loci
58556 loci containing protein-coding transcripts
Transcripts
62199 protein-coding transcripts
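For readers unfamiliar with the N50 statistic quoted above: it is the length L such that supercontigs of length L or longer together contain at least half of the assembled bases. The following is a minimal sketch of that calculation. The lengths are made-up toy values, not Glyma0 data, and the function name is our own.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy supercontig lengths in base pairs (hypothetical, not Glyma0 data).
toy_lengths = [8_000_000, 6_500_000, 4_000_000, 1_500_000, 500_000]
print(n50(toy_lengths))
```

A large N50 relative to typical gene size means that most genes sit on long supercontigs with substantial flanking sequence.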

Exploring Soybean

The Soybean Genome
Use the Browse the Genome button at the top of this page to view soybean gene models in their full genomic context. Alternatively, if you'd like to search the genome for regions homologous to a particular sequence, use the BLAST against the genome button. The browsing environment, Gbrowse, provides overview and detailed views of gene structure. Gbrowse also provides a search interface that lets you look up soybean models by model name or location, as well as by the names of Arabidopsis models that align to the same region (see the "How do I find my favorite genes?" section of the FAQ). Once you've located a gene model of interest, click on it to go to its detail page (e.g., here). The detail page provides sequence information on the model and its translated peptide. From the detail page you can see which other genes share sequence similarity with this gene by clicking on the Phytozome BLAST link, which will pull up similar gene clusters at whichever Phytozome node you choose.
Downloading Data
CDS and peptide sequences for individual genes are available from the detail page of the soybean Gbrowse environment.

FAQ

How was the genome sequenced?

Whole genome shotgun methodology
Although the first plant and animal genomes were sequenced by a BAC-by-BAC approach, almost all current animal and fungal genome sequencing projects use the whole genome shotgun strategy, in which the entire genome is randomly sheared, subcloned, and redundantly sequenced. The ease, cost-efficiency, and speed of the whole genome shotgun approach have made it the method of choice in many cases, but there are lingering concerns about its effectiveness for large, repeat-rich plant genomes, especially grasses. Soybean is the most complex plant genome sequenced to date by this strategy.
How was the assembly generated?
The Glyma0 release is a preliminary whole genome shotgun assembly produced by Jeremy Schmutz at JGI-Stanford Human Genome Center using the Arachne2 assembler in a mode tuned to the highly repetitive soybean genome. To make a provisional assembly available to the community as rapidly as possible, this genome reconstruction was carried out prior to the completion of sequencing, with ~7-fold redundant coverage. An additional ~1.0X of plasmid, fosmid, and BAC-end sequence is still in progress and will be complete by mid 2008. This data will be incorporated into the next release, and integrated with the soybean physical and genetic maps. This effort will produce an assembly in chromosomal coordinates by the end of 2008.
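As a rough consistency check on the figures quoted on this page, fold coverage is approximately (number of reads × mean read length) / genome size. The sketch below backs out the mean read length implied by ~13 million reads, a ~950 Mb genome, and ~7-fold coverage. It assumes that all attempted reads contributed to the assembly, which is unlikely to hold exactly (attempted reads include failures), so treat the result as an order-of-magnitude estimate only. The function names are our own.

```python
def fold_coverage(n_reads, mean_read_len, genome_size):
    """Expected redundancy of sequence coverage."""
    return n_reads * mean_read_len / genome_size


def implied_read_length(coverage, n_reads, genome_size):
    """Mean read length needed to reach `coverage` with `n_reads` reads."""
    return coverage * genome_size / n_reads


# Approximate values quoted on this page.
GENOME_SIZE = 950e6   # bases
N_READS = 13e6        # attempted shotgun reads
COVERAGE = 7.0        # fold redundancy of the Glyma0 assembly

mean_len = implied_read_length(COVERAGE, N_READS, GENOME_SIZE)
print(f"implied mean read length: {mean_len:.0f} bp")
```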
What is a "super"?
A "supercontig" (also known as a "scaffold") is a reconstructed genomic region that may contain modest gaps whose size is approximately known.  
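In sequence files, gaps within a supercontig are commonly written as runs of the letter N. Assuming that convention applies to the file you are working with (an assumption, not something stated on this page), a supercontig can be broken back into its gap-free contigs as sketched below. The function name and toy sequence are hypothetical.

```python
import re


def split_into_contigs(scaffold_seq, min_gap=1):
    """Yield (start, end, sequence) for each contig in a scaffold.

    Coordinates are 0-based and end-exclusive. A gap is any run of at
    least `min_gap` N characters (assumed gap convention).
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    pos = 0
    for gap in gap_pattern.finditer(scaffold_seq):
        if gap.start() > pos:
            yield pos, gap.start(), scaffold_seq[pos:gap.start()]
        pos = gap.end()
    # Trailing contig after the last gap, if any.
    if pos < len(scaffold_seq):
        yield pos, len(scaffold_seq), scaffold_seq[pos:]


# Toy supercontig: three contigs separated by gaps of 10 and 5 Ns.
toy_super = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 5 + "TTTT"
contigs = list(split_into_contigs(toy_super))
for start, end, seq in contigs:
    print(start, end, seq)
```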
How come there are no chromosomes?
To accelerate the release of the soybean genome sequence, we are releasing Glyma0 supercontigs prior to the integration of the assembly with the soybean genetic and physical maps. This work is ongoing and the late 2008 Glyma1 release (and subsequent versions) will use chromosome coordinates.
Is it complete?
Comparison with the soybean EST set suggests that more than 98% of known soybean protein-coding genes are represented in the assembly (many of those that aren't are turning out to be contamination of EST libraries). This result supports the claim that Glyma0 is largely complete with respect to "gene space." You'll find that vast tracts of repetitive sequence are also assembled.
Is it accurate?
The vast majority of Glycine max ESTs align to the genome at nearly 100% identity, suggesting that Glyma0 is highly accurate in genic regions. We are currently evaluating the base-pair-level accuracy in repetitive regions by comparing the assembly with BAC clones produced for the project. On a larger scale, there are likely to be several dozen "misjoins" with apparent discrepancies between the shotgun assembly and the independently obtained maps. These will be reconciled in the chromosome-scale Glyma1 release by the end of 2008.
What about polyploidy?
The soybean genome experienced a tetraploidization event in its recent past. Homeologous regions have diverged sufficiently, however, that they can be assembled apart from one another in the shotgun assembly. Thus both homeologs are typically represented in the Glyma0 sequence.

How do I find my favorite genes?

BLAST
To BLAST against the soybean genome with peptide or nucleotide probes, click here. The default BLAST database is a soybean genome assembly that has been masked for high fidelity repeats, and default BLAST parameters are suitable for use with grass peptides and coding sequences. You can view your BLAST alignment against the genome by clicking on the hit of interest to see the detailed alignment, and then clicking on the scaffold name (shown in blue). If you're interested in transposable element families in the soybean genome, please DO NOT BLAST these; it will just clog up our BLAST queue! Similarly, please don't BLAST entire BACs.
Search
We have pre-aligned known soybean, Medicago, and Lotus ESTs to the soybean sequence, along with current proteomes of rice and Arabidopsis. If you enter text keywords from common gene names like "zein" or "agamous", or gene identifiers like "At1g12340," into the Gbrowse "Search" box, the result will be a list of genomic regions that hit ESTs or rice/Arabidopsis genes associated with these words or identifiers. Clicking on the red diamonds will then bring you to the specific region of interest. Note that you may need to zoom in to see details, which are only shown over regions shorter than 70 kb.
NOTE
The "super" location and coordinates of your region of interest will NOT be preserved in future soybean genome releases. Please keep records of these regions by saving unique sequence markers (e.g., the gene space of your gene or genes) on your computer. To map forward, you can then simply BLAST these nucleotide sequences against future releases.
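One way to follow this advice is to cut the locus out of a local copy of the assembly and store it under a descriptive name. The sketch below uses only the Python standard library. The record name, coordinates, and header text are hypothetical placeholders, and the toy input stands in for a downloaded FASTA file.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {name: sequence}."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_region(records, name, start, end):
    """Return bases start..end (1-based, inclusive) of a record."""
    seq = records[name]
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates outside sequence")
    return seq[start - 1:end]


def format_fasta(header, seq, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{header}\n{body}\n"


# Toy input standing in for a downloaded assembly file.
toy = [">super_1 toy record", "ACGTACGTAC", "GGGGCCCCAA"]
records = read_fasta(toy)
locus = extract_region(records, "super_1", 5, 14)
saved = format_fasta("my_gene super_1:5-14 Glyma0", locus)
print(saved, end="")
```

Writing `saved` to a file gives you a sequence you can later BLAST against a new release, independent of any Glyma0 identifier or coordinate.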

How do I work with the soybean Gbrowse browser?

How can I view the soybean sequence and various genomic features?
To facilitate early use of the soybean genome, the DOE-JGI and the UC Berkeley Center for Integrative Genomics have developed a simple genome browser using the Gmod/Gbrowse software. Due to the density of information, detailed features are only visible when looking at regions of 70 kb or smaller. You may need to zoom in to get to this size. Typically, clicking on a feature will reveal its sequence and alignment to the genome. For gene models, you can also click to bo
How do I retrieve soybean sequence of interest to me?
From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar. Then click the "Go" button and the sequence will be displayed in your browser, ready to cut and paste. If you click on a gene model, you can retrieve the predicted peptide and coding sequence.
What happens when I click on a gene on the browser?
You'll see a web page that displays the predicted peptide, genomic span of the gene with coding exons shaded, and the (spliced) coding sequence.  From this page you can also launch BLAST vs. the NCBI non-redundant protein database or Phytozome.

Where do the various tracks on the genome browser come from?
How were repeats identified?
Sixteen-base-pair "words" (16-mers) that are over-represented in and clustered on the genome were used to define repetitive regions (J. Chapman, unpublished). These typically represent recently active retrotransposons and simple sequence repeats in the soybean genome. Nearly 40% of the genome appears to be covered by such clustered/over-represented regions. This is clearly an underestimate of the repeat content of soybean, as many older/more diverged transposable element "fossils", as well as low copy elements, have not been characterized yet.
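To illustrate the general idea of flagging over-represented words, here is a toy sketch. It is not the unpublished method cited above: it uses 4-mers on a short made-up sequence for readability, an arbitrary count threshold, and no clustering step. All names and parameters are our own.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_overrepresented(seq, k, min_count):
    """Lower-case every base covered by a k-mer seen >= min_count times."""
    counts = count_kmers(seq, k)
    upper = seq.upper()
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return "".join(b.lower() if c else b.upper() for b, c in zip(seq, covered))


# Toy sequence: unique flanks around a simple GA repeat.
toy = "ACGTTGCA" + "GAGAGAGAGAGA" + "CTTAGGAC"
masked = mask_overrepresented(toy, k=4, min_count=3)
fraction = sum(ch.islower() for ch in masked) / len(masked)
print(masked)
print(f"fraction masked: {fraction:.2f}")
```

Soft-masking (lower case) keeps the underlying sequence available while marking the regions that downstream tools may want to skip.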
How were ESTs aligned?
We aligned the consensus EST sequences of Glycine max, Medicago truncatula, and Lotus japonicus from the TIGR PlantTA database to the soybean genome using Jim Kent's BLAT, filtering for the best hit to the genome along with any hit within 97% coverage of that hit, to account for genome duplication.
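One possible reading of this filter is: for each EST, keep the best-scoring hit plus any other hit that reaches at least 97% of the best hit's value, so that both homeologous copies are retained. The sketch below implements that reading. The interpretation, the use of a single numeric score per hit, and the tuple layout are all assumptions made for illustration.

```python
from collections import defaultdict


def keep_best_and_near_best(hits, fraction=0.97):
    """Keep, per query, the best hit and hits within `fraction` of it.

    hits: iterable of (query, target, score) tuples.
    """
    by_query = defaultdict(list)
    for hit in hits:
        by_query[hit[0]].append(hit)
    kept = []
    for query_hits in by_query.values():
        best = max(h[2] for h in query_hits)
        kept.extend(h for h in query_hits if h[2] >= fraction * best)
    return kept


# Hypothetical hits: EST_A matches two homeologous regions almost equally well.
toy_hits = [
    ("EST_A", "super_1", 980),
    ("EST_A", "super_7", 960),  # within 97% of 980, kept
    ("EST_A", "super_9", 400),  # weak secondary hit, dropped
    ("EST_B", "super_2", 500),
]
kept_hits = keep_best_and_near_best(toy_hits)
for hit in kept_hits:
    print(hit)
```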
How were rice and Arabidopsis peptides aligned?
The Arabidopsis and rice peptides were downloaded from NCBI RefSeq and aligned to the (unmasked) genome by gapped BLASTX; high-scoring segment pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in yellow) spans adjacent exons and the intervening intron(s). Also, small exons (evident from the aligned legume ESTs) are often missed.

How did you get the gene set for soybean?

Where did the gene set come from?
To produce a provisional "Glyma0.1b" gene set, we used the homology-based gene prediction program GenomeScan from Chris Burge, together with FgenesH predictions provided by Asaf Salamov at JGI. The gene set shown on the browser was produced by Therese Mitros at UC Berkeley. Briefly, Arabidopsis peptides and TIGR legume EST assemblies were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic region was submitted to GenomeScan and FgenesH, along with the Arabidopsis peptide and/or ORFs from the overlapping EST assemblies. GenomeScan identifies likely protein-coding exons, favoring regions that align well to the given homologous peptides. These homology-based predictions were integrated with expressed sequence information from legume ESTs using PASA (Haas et al. 2003). As with the assembly, the provisional Glyma0.1b gene set will be replaced by a new, improved gene set with the Glyma1 release coming in 2008, but we hope this interim set is useful. Please note again that Glyma0.1b gene identifiers will not be carried forward. The Glyma1 gene set will be named based on chromosomal position, following the model set by Arabidopsis AT#'s.
How come my gene is only partially predicted?
GenomeScan and FgenesH are two of the better homology-based gene predictors available, but like all computational gene modeling algorithms, they are imperfect. Similarly, EST and cDNA data are often incomplete.
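The locus-definition step in the gene set description above (overlapping peptide and EST alignments defining putative loci) can be pictured as interval merging on each supercontig. This is a simplified sketch, not the pipeline actually used: it ignores strand and alignment quality, treats an EST-only region as a locus, and all coordinates are hypothetical.

```python
def merge_overlapping(intervals):
    """Merge overlapping (start, end) intervals; coordinates are inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous locus: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


# Hypothetical alignment spans on a single supercontig.
peptide_hits = [(1000, 2500), (9000, 9800)]
est_hits = [(2200, 3100), (5000, 5600), (9700, 10400)]

loci = merge_overlapping(peptide_hits + est_hits)
for start, end in loci:
    print(f"putative locus: {start}-{end}")
```

Each merged span would then be the genomic region handed to the gene predictors, together with the peptides or EST-derived ORFs that overlap it.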
Waiter, there's a repeat element in my gene! (Get it out!)
Since, in the interest of speed, we have only done a cursory initial masking of repeats, and since many unmasked repeats contain significant and/or mildly deteriorated open reading frames, GenomeScan sees these regions with good coding potential as too good to pass up in its automated gene prediction. These repeat-derived coding regions may end up in the intron of your favorite gene. In some cases, this leads to truncation of the automated gene prediction when a stop codon is encountered.
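To see why an in-frame stop codon truncates a predicted peptide, consider the toy translation below. It uses the standard genetic code and two made-up coding sequences that differ by a single base. It illustrates the effect only and says nothing about how GenomeScan or FgenesH work internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_until_stop(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    peptide = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


intact = "ATGGCTGAAAAGTTC"     # hypothetical coding sequence
with_stop = "ATGGCTTAAAAGTTC"  # same, with GAA changed to the stop codon TAA

print(translate_until_stop(intact))
print(translate_until_stop(with_stop))
```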
We hope that the aggravation of an imperfect gene set is partially compensated by the rapid release of the data...
Future gene sets will improve as assembly quality improves along with associated expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine tune a gene set even given a high quality genome assembly. So please be patient!

What can I do with the soybean dataset?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of soybean and other legumes. Please reference "Soybean Genome Project, DOE Joint Genome Institute" as your citation.
I think I found an error. What should I do?
Unfortunately, we are unable to systematically address problems with the Glyma0 assembly or annotation. We expect that these problems will mostly be resolved in the Glyma1 release by the end of 2008. If you would like to bring any items to our attention, please send email to phytozome@jgi-psf.org.
I would like to do a large-scale comparison of soybean to other genomes, and/or a global analysis of its gene content.
The Fort Lauderdale guidelines for large-scale sequencing projects aim to balance the value of rapid data release for the user community with respect for the scientific interests of the generators of the data. Our plans for rapid publication of the soybean genome are described below, and are focused on the large-scale analysis of the gene and repetitive content of the soybean genome and its evolutionary dynamics; we ask that you respect these scientific goals as discussed in the Fort Lauderdale guidelines. A plan for the coordinated submission of companion manuscripts to Genome Research will be developed.

What are the publication plans?

We expect that the initial manuscript describing the assembly, annotation, and first analysis of the soybean genome will be based on the 2008 Glyma assembly and will be submitted in early 2009. Our target is submission of the manuscript nine months after completion of chromosome-scale assembly. The soybean genome analysis group is actively working on the following topics for this initial manuscript:
  • Assembling a highly repetitive plant genome using a whole genome shotgun method, and comparison with filtration methods.
  • Molecular evolution of coding and non-coding sequences informed by the sorghum genome sequence.
  • Synteny and chromosome-scale evolution within the legumes and the larger eudicot clade.
  • Analysis of the impact of polyploidy on the soybean genome.
  • Patterns of plant gene family evolution as illuminated by the soybean gene set.
  • Tempo and mode of recent retrotransposon and other repeat activity in soybean.
©2007 University of California Regents. All rights reserved