Glycine max (Soybean)
About the genome:
Overview
The Soybean (Glycine max) genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Gary Stacey, Randy Shoemaker, Scott Jackson, Jeremy Schmutz, and Dan Rokhsar.
Large-scale shotgun sequencing of soybean began in the middle of 2006 and was completed early in 2008. A total of ~13 million attempted Sanger shotgun reads were produced and deposited in the NCBI Trace Archive in accordance with our commitment to early access and the Fort Lauderdale genome data release policy. See below for our publication plans.
The present assembly (Glyma1) is the first chromosome-scale assembly of the soybean genome. The current gene set (Glyma1.0) integrates ~1.6 million ESTs with homology and ab initio-based gene predictions. Protein-coding genes have been given identifiers using the convention adopted by the Arabidopsis community. The identifiers are of the form Glyma%%g####, where %% is the chromosome number and #### is a numerical index that increases along each chromosome. We expect that these identifiers will be preserved in future releases.
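As an illustration of the identifier convention, the following minimal sketch parses Glyma-style identifiers and orders them along the chromosomes. It assumes a two-digit chromosome number and an index of unspecified width; the function names and the example identifiers in the usage note are hypothetical and are not taken from the gene set.

```python
import re

# Assumed pattern: "Glyma", a two-digit chromosome number, "g", then a numeric index.
# The index width is left open because the text does not fix the digit count.
GLYMA_ID = re.compile(r"^Glyma(\d{2})g(\d+)$")


def parse_glyma_id(identifier):
    """Return (chromosome, index) for a Glyma-style identifier, or None if it does not match."""
    match = GLYMA_ID.match(identifier)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sort_along_chromosomes(identifiers):
    """Order identifiers by chromosome, then by index along each chromosome.

    Identifiers that do not match the assumed pattern are dropped.
    """
    parsed = [(parse_glyma_id(name), name) for name in identifiers]
    return [name for key, name in sorted(p for p in parsed if p[0] is not None)]
```

For example, `parse_glyma_id("Glyma01g0020")` splits a made-up identifier into chromosome 1 and index 20. Because the index increases along each chromosome, sorting on the parsed pair should reproduce genomic order.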
Statistics
- Genome Size
- Approximately 975Mb is captured in 20 chromosomes, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
- Loci
- 66,153 protein-coding loci have been predicted. These genes were assigned a letter-code to indicate the level of support for each gene. The gene is assigned the letter code according to its highest level of support:
Code (genes with this code): code definition
- F (3305 genes): full-length cDNA consistent
- E (32317 genes): EST consistent
- Ei (7361 genes): EST overlap, but model does not match all EST boundaries
- Ea (1832 genes): gene generated from the longest ORF of EST evidence, as modeling programs failed to produce a model at locus
- Hs (13704 genes): Homology and Solexa support
- H (7634 genes): Homology to other plant peptide
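The support-code counts above can be checked and summarized with a few lines of Python. This is a small sketch only: the counts are copied from the list above, while the precedence order (F strongest, H weakest) is an assumption based on the order in which the codes are listed, and `highest_support` is a hypothetical helper name.

```python
# Counts copied from the support-code list above.
SUPPORT_COUNTS = {"F": 3305, "E": 32317, "Ei": 7361, "Ea": 1832, "Hs": 13704, "H": 7634}

# Assumption: the listed order runs from strongest to weakest support.
PRECEDENCE = ["F", "E", "Ei", "Ea", "Hs", "H"]


def highest_support(codes):
    """Return the strongest code among those that apply to one gene."""
    for code in PRECEDENCE:
        if code in codes:
            return code
    raise ValueError("no recognized support code")


# The per-code counts should add up to the number of predicted loci.
total = sum(SUPPORT_COUNTS.values())
fractions = {code: count / total for code, count in SUPPORT_COUNTS.items()}
```

Here `total` comes to 66,153, matching the number of predicted protein-coding loci, and `fractions` gives the share of the gene set at each support level.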
Exploring Glycine max
- Glycine max in the context of Green Plant evolution
- Gene families containing Glycine max genes can be found by searching at the Viridiplantae, Embryophyte, Tracheophyte, Angiosperm, Rosid Pre- and Post-hexaploidy, Eurosid I and Legume nodes. Keyword searches (the default) can be performed using a gene identifier, a gene symbol, an ontology id (e.g., GO, PFAM, Panther) or a full or partial defline. If you don't have this type of information for your gene of interest, but know something about its functional classification, you can perform an ontology search using a descriptive phrase (e.g., "developmental regulator", "apoptosis inhibition", "alkaline phosphatase"). This will search the selected ontologies for these phrases, and then use the resulting set of ontology identifiers (e.g., "PF05918", "KOG3913") to search for gene families whose members carry these classifications. Finally, if you don't have keyword or functional information but do have a gene or gene product sequence, you can search for homologous gene families by BLASTing against the (peptide) gene family consensus sequences.
- The Glycine max Genome
- Use the BROWSE GENOME button at the top of this page to view Glycine max gene models in their full genomic context. Alternatively, if you'd like to search the genome for regions homologous to a particular sequence, use the BLAST GENOME button. The browsing environment, Gbrowse, provides overview and detailed views of gene structure. Gbrowse provides a search interface allowing you to look up Glycine max models by gene name, external identifier, or genomic location. Once you've located a gene model of interest, click on it to go to its detail page. The detail page provides sequence information on the model and its translated peptide. From the detail page you can also jump to viewing the gene in its evolutionary context. To see this gene's orthologs at the Angiosperm node, click on the Cluster link on the detail page. If you'd like to see what other ancestral genes share sequence similarity with this gene, click on the detail page's Phytozome BLAST link, which will pull up similar gene clusters at whichever Phytozome node you choose.
- Downloading Data
- Bulk data downloads are available from a number of locations:
- BioMart: Choose "Phytozome Genomes" and use the Filter to restrict the data to Glycine max. You can use other filters to further restrict the data, and then choose which data Attributes to download.
- Our FTP Site: Phytozome Data. Contains Phytozome-specific data as well as annotation and assembly files for JGI-sourced genomes.
- Gene Family page
- Go to the "Get Data" tab and select either "Get Sequences" or "Get Cluster Data" (which launches BioMart).
- Go to the "Align Cluster Members" tab and select either "Load Member protein sequences" or "Load Member coding sequences." This will launch Jalview, from which sequences can be downloaded.
- Keyword and BLAST search results page: Data for one or more of the Gene Families found via a keyword search or a BLAST search against a gene family consensus database can be downloaded by selecting "ANALYZE RESULTS/GET DATA" and clicking "Get data for selected families". This will launch BioMart with filters set to retrieve data only for the selected families.
- Gbrowse region page: Use the "Download Alignments", "Download Decorated Fasta", or "Download Sequence" plugins under the "Reports and Analysis" pulldown.
FAQ
How was the genome sequenced?
How do I find my favorite genes?
How do I work with the soybean Gbrowse browser?
- How can I view the soybean sequence and various genomic features?
- To facilitate early use of the soybean genome the DOE-JGI and the UC Berkeley Center for Integrative Genomics have developed a simple genome browser using the Gmod/Gbrowse software. Due to the density of information, detailed features are only visible when looking at 70 kb or smaller regions. You may need to zoom in to get to this size. Typically, clicking on a feature will reveal its sequence and alignment to the genome. For gene models, you can also click to bo
- How do I retrieve soybean sequence of interest to me?
- From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar, then click the "Go" button. The sequence is displayed in your browser, where you can copy and paste it. If you click on a gene model, you can retrieve the predicted peptide and coding sequence.
- What happens when I click on a gene in the browser?
- You'll see a web page that displays the predicted peptide, the genomic span of the gene with coding exons shaded, and the (spliced) coding sequence. From this page you can also launch BLAST vs. the NCBI non-redundant protein database or Phytozome.
- Where do the various tracks on the genome browser come from?
- How were repeats identified?
- Sixteen-base-pair "words" (16-mers) that are over-represented and clustered in the genome were used to define repetitive regions (J. Chapman, unpublished). These typically represent recently active retrotransposons and simple sequence repeats in the soybean genome. Nearly 40% of the genome appears to be covered by such clustered/over-represented regions. In a parallel effort, a catalog of DNA transposons and LTR transposable elements was produced by Jianxin Ma. A catalog of short tandem repeats was provided by Steve Cannon. The characterization of these elements is in progress.
- How were ESTs aligned?
- We aligned the consensus EST sequences of Glycine max, Medicago truncatula and Lotus japonicus from the TIGR PlantTA database to the soybean genome using Jim Kent's BLAT, keeping the best hit to the genome along with any hit within 97% coverage of that hit, to account for genome duplication. For final gene verification we aligned G. max ESTs using Brian Haas's PASA pipeline, which aligns ESTs to the best place in the genome via GMAP and then filters hits to ensure proper splice boundaries.
- How were rice and Arabidopsis peptides aligned?
- The Arabidopsis and rice peptides were downloaded from NCBI RefSeq and aligned to the (unmasked) genome by gapped BLASTX; high-scoring segment pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in yellow) spans adjacent exons and the intervening intron(s). Also, small exons (evident from the maize/sorghum/sugarcane ESTs) are often missed.
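The idea behind the 16-mer repeat track described under "How were repeats identified?" can be illustrated with a toy k-mer counter. This is a simplified sketch and not the unpublished method cited above: it counts exact k-mers on one strand only, ignores reverse complements and the clustering step, and the `min_count` threshold and function names are assumptions chosen for illustration.

```python
from collections import Counter


def count_kmers(sequence, k=16):
    """Count every overlapping k-mer in a DNA string (case-insensitive)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repetitive_mask(sequence, k=16, min_count=3):
    """Return a per-base list of booleans, True where a base lies inside
    any k-mer that occurs at least min_count times in the sequence."""
    sequence = sequence.upper()
    counts = count_kmers(sequence, k)
    mask = [False] * len(sequence)
    for i in range(len(sequence) - k + 1):
        if counts[sequence[i:i + k]] >= min_count:
            for j in range(i, i + k):
                mask[j] = True
    return mask


def masked_fraction(mask):
    """Fraction of bases flagged as repetitive."""
    return sum(mask) / len(mask) if mask else 0.0
```

On a real genome the counts would not fit comfortably in an in-memory `Counter`, so a production tool would use a dedicated k-mer counter. The sketch is only meant to show how over-represented words translate into a masked fraction such as the "nearly 40%" figure quoted above.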
How did you determine the soybean gene set?
- Gene prediction
- To produce the current "Glyma1.0" gene set, we used the homology-based gene prediction program GenomeScan from Chris Burge and FgenesH predictions provided by Asaf Salamov at JGI, along with the PASA program to integrate over 1.6 million soybean ESTs. The gene set shown on the browser was produced by Therese Mitros at UC Berkeley. Briefly, peptides from diverse angiosperms and TIGR legume EST assemblies were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic region was submitted to GenomeScan and FgenesH, along with related angiosperm peptides and/or ORFs from the overlapping EST assemblies. GenomeScan identifies likely protein-coding exons, favoring regions that align well to the given homologous peptides. These homology-based predictions were integrated with expressed sequence information from legume ESTs using PASA (Haas et al. 2003). The results were filtered to remove genes identified as transposon-related. Genes with apparently truncated ORFs may be prediction errors or pseudogenes.
- How come my gene is wrong?
- GenomeScan and FgenesH are two of the better homology-based gene predictors available, but like all computational gene modeling algorithms, they are imperfect. Similarly, EST and cDNA data are often incomplete. We hope that the aggravation of an imperfect gene set is partially compensated by the rapid release of the data.
- Future gene sets will improve as assembly quality improves, along with associated expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly. So please be patient!
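The locus-definition step mentioned under "Gene prediction" (overlapping peptide and EST alignments defining putative loci) can be pictured as interval merging. The sketch below is an assumption about the general idea, not the pipeline actually used for Glyma1.0: it ignores strand, uses 1-based inclusive coordinates, and the chromosome names in the usage note are made up.

```python
def merge_into_loci(alignments):
    """Group overlapping (chrom, start, end) spans into putative loci.

    Coordinates are assumed to be 1-based and inclusive; strand is ignored.
    Returns a list of merged (chrom, start, end) tuples in sorted order.
    """
    loci = []
    for chrom, start, end in sorted(alignments):
        if loci and loci[-1][0] == chrom and start <= loci[-1][2]:
            # Overlaps the current locus: extend its right edge if needed.
            last = loci[-1]
            loci[-1] = (chrom, last[1], max(last[2], end))
        else:
            # No overlap with the previous locus: start a new one.
            loci.append((chrom, start, end))
    return loci
```

For instance, spans `("Gm01", 100, 500)` and `("Gm01", 450, 900)` would collapse into one locus covering 100 to 900, and the genomic sequence of each merged span is what would then be handed to a gene predictor.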
What can I do with the soybean dataset?
- I would like to use this data to help clone a gene, analyse a gene family, etc.
- Wonderful! Please feel free to use this data to advance your studies of soybean and other legumes. Please reference "Soybean Genome Project, DoE Joint Genome Institute" as your citation.
- I think I found an error. What should I do?
- If you would like to bring any items to our attention, please send email to phytozome@jgi-psf.org.
- I would like to do a large-scale comparison of soybean to other genomes, and/or a global analysis of its gene content.
- The Fort Lauderdale guidelines for large-scale sequencing projects aim to balance the value of rapid data release for the user community with respect for the scientific interests of the generators of the data. Our plans for rapid publication of the soybean genome are described below, and are focused on the large-scale analysis of the gene and repetitive content of the soybean genome and its evolutionary dynamics; we ask that you respect these scientific goals as discussed in the Fort Lauderdale guidelines. A plan for the coordinated submission of companion manuscripts to Genome Research will be developed.
What are the publication plans?
- We expect that the initial manuscript describing the assembly, annotation, and first analysis of the soybean genome will be based on this Glyma1 chromosome-scale assembly and will be submitted in summer 2009. Our target is submission of the manuscript nine months after completion of chromosome-scale assembly. The soybean genome analysis group is actively working on the following topics for this initial manuscript:
- Assembling a highly repetitive plant genome using a whole genome shotgun method, and comparison with filtration methods
- Molecular evolution of coding and non-coding sequences informed by the sorghum genome sequence
- Synteny and chromosome-scale evolution within the legumes and the larger eudicot clade.
- Analysis of the impact of polyploidy on the soybean genome.
- Patterns of plant gene family evolution as illuminated by the soybean gene set.
- Tempo and mode of recent retrotransposon and other repeat activity in soybean.




