JGI Joint Genome Institute CIG Center for Integrative Genomics

Sorghum bicolor v1.0

browse the genome blast against the genome download the data
 

About the Sorghum bicolor v1.0 genome:

 

Overview

The Sorghum bicolor genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Andy Paterson, John Bowers, Steve Kresovich, C. Thomas Hash, Jo Messing, Daniel Peterson, Jeremy Schmutz, and Dan Rokhsar.

Large-scale shotgun sequencing of sorghum began at the end of 2005 and was completed on January 25th, 2007. A total of 10,717,203 shotgun reads were collected. All raw trace data is deposited in the NCBI Trace Archive in accordance with our commitment to early access and the Fort Lauderdale genome data release policy. See below for our publication plans.

The present v1.0 release, comprising the Sbi1 assembly and Sbi1.4 gene set, is the assembly and annotation used in the sorghum genome paper. In all subsequent releases, chromosome and gene identifiers will be mapped forward whenever possible. This assembly was built with Arachne v20070201 with a data freeze from January 25th, 2007. After the build, 28 breaks were made and 108 manual joins were performed. Ten of these joins were across centromeres. The size of the centromere was estimated for each chromosome from the amount of centromeric sequence already assembled. The main genome is in 10 chromosomes with many small unmapped pieces, some of which contain annotated genes.

Statistics


Genome Size
697,578,683 base pairs arranged in 2n=20 chromosomes
Loci
34,496 loci containing protein-coding transcripts
Transcripts
36,338 protein-coding transcripts

Exploring Sorghum

Sorghum in the context of Land Plant evolution
Sorghum genes will be found in clusters defined at the Land Plants, Angiosperms, and Grasses nodes. If you know something about the Sorghum gene you're interested in (e.g., its model name, a functional domain associated with the gene), you can select "search" from the menu at the top of the page, enter your search terms, pick one of the above nodes (or All nodes), and search. The gene clusters you find can then be examined in detail, with summary pages that include textual and functional (domain) annotation, syntenic context, and links to each cluster's ancestors and descendants. If you don't have keyword information but do have a gene sequence, you can search for related gene clusters using BLAST (click on "search" and then select the BLAST tab).
The Sorghum Genome
Use the Browse the Genome button at the top of this page to view Sorghum gene models in their full genomic context. Alternatively, if you'd like to search the genome for regions homologous to a particular sequence, use the BLAST against the genome button. The browsing environment, Gbrowse, provides overview and detailed views of gene structure. Gbrowse provides a search interface allowing you to look up Sorghum models by model name or location, as well as by model names associated with Arabidopsis models that align to the same region (see the "How do I find my favorite genes?" section of the FAQ). Once you've located a gene model of interest, click on it to go to its detail page (e.g., here). The detail page provides sequence information on the model and its translated peptide. From the detail page you can also return to viewing Sorghum genes in their evolutionary context. To see this gene's orthologs at the Angiosperm node, click on the Cluster link on the detail page. If you'd like to see what other ancestral genes share sequence similarity with this gene, click on the detail page's Phytozome BLAST link, which will pull up similar gene clusters at whichever Phytozome node you choose.
Downloading Data
CDS and peptide sequences for individual genes are available from the detail page of the Sorghum Gbrowse environment. To get there from a Cluster Summary page, however, you'll need to click the "Display Options" tab and make sure that the "Reference ID" column is selected. Then expand all the "Genes in the Cluster" rows (by clicking on the icon near the Org heading) and click on the "[PAC]" link, which will take you to the detail page for that model. You can also get gene sequences for all the members of one or more clusters from any Cluster Summary, Phytozome BLAST results, or Search Results page by launching Jalview (click the "Multiple Sequence Alignments" links), which will provide access to the CDS or peptide sequences of the members of the gene clusters you selected. If you're interested in bulk downloads of all data (gene models, proteomes, genomic sequence), these are available directly from TIGR.

FAQ

How was the genome sequenced?

Whole genome shotgun! What were you thinking?! Doesn't that produce a lousy genome sequence?
Although the first plant and animal genomes were sequenced by a BAC-by-BAC approach, almost all current animal and fungal genome sequencing projects use the whole genome shotgun strategy, in which the entire genome is randomly sheared, subcloned, and redundantly sequenced. The ease, cost-efficiency, and speed of the whole genome shotgun approach have made it the method of choice in many cases, but there are lingering concerns about its effectiveness for large repeat-rich plant genomes, especially grasses. Sorghum is the most complex plant genome sequenced to date by this strategy. Give it a spin. We hope you find it useful!
How was the assembly generated?
The Sbi1 release is a whole genome shotgun assembly produced by Jeremy Schmutz at JGI-Stanford Human Genome Center using the Arachne2 assembler in a mode tuned to the highly repetitive sorghum genome. The genome coverage is approximately 8x.
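As a rough illustration of what "approximately 8x" means, fold coverage is the total number of sequenced bases divided by the genome size. The sketch below plugs in the read count and genome size quoted on this page and solves for the mean read length that 8x would imply. This is only a back-of-envelope check: the function names are illustrative, and it assumes the 8x figure was computed against the assembled genome size with all reads counted, which the page does not state.

```python
# Back-of-envelope coverage arithmetic using figures quoted on this page.
TOTAL_READS = 10_717_203    # shotgun reads collected (Overview)
GENOME_BP = 697_578_683     # genome size (Statistics)
QUOTED_COVERAGE = 8.0       # "approximately 8x"


def coverage(n_reads, mean_read_len, genome_bp):
    """Fold coverage: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_bp


def implied_read_length(n_reads, fold, genome_bp):
    """Mean read length needed for n_reads to reach the given fold coverage."""
    return fold * genome_bp / n_reads


if __name__ == "__main__":
    length = implied_read_length(TOTAL_READS, QUOTED_COVERAGE, GENOME_BP)
    print(f"implied mean read length: {length:.0f} bases")
```

If the coverage was instead computed from quality-trimmed reads or an estimated (rather than assembled) genome size, the implied read length would differ.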
Why are there "supers" and chromosomes?
A "supercontig" (also known as a "super" or "scaffold") is a reconstructed genomic region that may contain modest gaps whose size is approximately known. Most of these supercontigs have been placed into 10 chromosome-size chunks, representing 90% of the genome and 99% of protein-coding regions.  
Is it complete?
Comparison with the physical and genetic map indicates that Sbi1 covers the vast majority of the available non-repetitive markers; comparison with the sorghum EST set suggests that more than 95% of known sorghum protein-coding genes are represented in the assembly (many that aren't are turning out to be contamination of EST libraries). Both results support the claim that Sbi1 is largely complete with respect to "gene space." You'll also find that vast tracts of repetitive sequence are assembled.
Is it accurate?
The vast majority of ESTs align to the genome at nearly 100% identity, suggesting that Sbi1 is highly accurate in genic regions. We are currently evaluating the base-pair-level accuracy in repetitive regions by comparing the assembly with BAC clones produced for the project. On a larger scale, we have identified approximately three dozen locations with apparent discrepancies between the shotgun assembly and the independently obtained maps. These will be reconciled in the chromosome-scale Sorbi1 release in spring 2007.

How do I find my favorite genes?

BLAST
To BLAST against the sorghum genome with peptide or nucleotide probes, click here. The default BLAST database is a sorghum genome assembly that has been masked for high fidelity repeats, and default BLAST parameters are suitable for use with grass peptides and coding sequences. You can view your BLAST alignment against the genome by clicking on the hit of interest to see the detailed alignment, and then clicking on the scaffold name (shown in blue). If you're interested in transposable element families in the sorghum genome, please DO NOT BLAST these; it'll just clog up our BLAST queue! Similarly, please don't BLAST entire BACs.
Search
We have pre-aligned known sorghum, maize, and sugarcane ESTs to the sorghum sequence, along with current proteomes of rice and Arabidopsis. If you enter into the Gbrowser "Search" box text keywords from common gene names like "zein" or "agamous", or gene identifiers like "At1g12340," the result will be a list of genomic regions that hit ESTs or rice/Arabidopsis genes that are associated with these words/identifiers. Clicking on the red diamonds will then bring you to the specific region of interest. Note that you may need to zoom in to see details, which are only shown over regions shorter than 50 kb.
Maize bins
By comparing (non-repetitive) sequence markers associated with "bins" on the maize genetic map, we have provisionally identified the syntenic sorghum regions for each maize bin. Note that since maize is paleotetraploid relative to sorghum, many genic regions of the sorghum genome are covered by two (and occasionally more) maize bins.
NOTE
The current "super" scaffolds bear no relation to the Sorbi0 release superscaffolds. To map forward, you can BLAST nucleotide or peptide sequences from the Sorbi0 release against the current genome.
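If you run the BLAST searches described above on your own machine rather than through this site, the hits can be post-processed with a few lines of code. The sketch below assumes standalone BLAST tabular output with the 12 default columns (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). It keeps the best-scoring hit per query, which is one simple way to map old sequences forward onto the current assembly. The sequence names in the example are hypothetical, and the 95% identity cutoff is an arbitrary illustration, not a recommendation from this project.

```python
import csv
import io
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_tabular(handle):
    """Parse 12-column BLAST tabular output, skipping blank and comment lines."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hits.append(Hit(row[0], row[1], float(row[2]), int(row[3]),
                        int(row[8]), int(row[9]), float(row[10]), float(row[11])))
    return hits


def best_hit_per_query(hits, min_pident=95.0):
    """Keep the highest-bitscore hit per query above an identity cutoff."""
    best = {}
    for hit in hits:
        if hit.pident < min_pident:
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


# Hypothetical example rows; names and coordinates are made up for illustration.
EXAMPLE = (
    "old_model_1\tchromosome_3\t99.50\t400\t2\t0\t1\t400\t10500\t10899\t1e-180\t720\n"
    "old_model_1\tchromosome_7\t96.00\t300\t12\t0\t1\t300\t500\t799\t1e-120\t480\n"
    "old_model_2\tchromosome_1\t88.00\t250\t30\t0\t1\t250\t900\t1149\t1e-60\t210\n"
)

if __name__ == "__main__":
    best = best_hit_per_query(parse_tabular(io.StringIO(EXAMPLE)))
    for query, hit in best.items():
        print(query, hit.subject, hit.sstart, hit.send, hit.bitscore)
```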

How do I work with the sorghum Gbrowse browser?

How can I view the sorghum sequence and various genomic features?
To facilitate early use of the sorghum genome, the DOE-JGI and the UC Berkeley Center for Integrative Genomics have developed a simple genome browser using the Gmod/Gbrowse software. Due to the density of information, detailed features are only visible when looking at regions of 50 kb or smaller. You may need to zoom in to get to this size. Typically, clicking on a feature will reveal its sequence and alignment to the genome. For gene models, you can also click to bo
How do I retrieve sorghum sequence of interest to me?
From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar. Then click the "Go" button and you'll get your sequence in your browser to cut and paste. If you click on a gene model, you can retrieve the predicted peptide and coding sequence.
What happens when I click on a gene on the browser?
You'll see a web page that displays the predicted peptide, genomic span of the gene with coding exons shaded, and the (spliced) coding sequence. From this page you can also launch BLAST vs. the NCBI non-redundant protein database or Phytozome.
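For readers working from downloaded sequence rather than the browser, the relationship between the genomic span, the spliced coding sequence, and the predicted peptide described above can be sketched in a few lines. This is a minimal illustration, not the code behind the detail pages: the function names are hypothetical, exon coordinates are assumed to be 1-based and inclusive on the forward strand, and the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(seq, start, end):
    """Return the subsequence for 1-based, inclusive coordinates."""
    return seq[start - 1:end]


def spliced_cds(genomic, exons, strand="+"):
    """Join coding exons (forward-strand coordinates) into one coding sequence."""
    joined = "".join(extract(genomic, s, e) for s, e in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


def translate(cds):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    # Toy locus: two coding exons (uppercase) separated by an intron (lowercase).
    genomic = "ccATGGCTgtaagtTTCTAAcc"
    cds = spliced_cds(genomic, [(3, 8), (15, 20)])
    print(cds, translate(cds))
```

For a minus-strand gene, the exons are joined in ascending genomic order and the result is reverse-complemented as a whole, which yields the transcript in its 5' to 3' orientation.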

Where do the various tracks on the genome browser come from?

How were repeats identified?
The genome was masked using RepeatMasker. Nearly 66% of the genome appears to be covered by such clustered/over-represented regions. This is clearly an underestimate of the repeat content of sorghum, as many older/more diverged transposable element "fossils", as well as low copy elements, have not been characterized yet.
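A masked fraction like the one quoted above can be recomputed from any masked FASTA file by counting masked bases. The sketch below assumes the common conventions that soft-masked bases are lowercase and hard-masked bases are written as N or X; whether the files distributed here follow those conventions is an assumption, and the function names are illustrative.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(seq):
    """Fraction of bases that are soft-masked (lowercase) or hard-masked (N/X)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base in "NX")
    return masked / len(seq)


def genome_masked_fraction(records):
    """Length-weighted masked fraction over all (name, sequence) records."""
    total = masked = 0
    for _, seq in records:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower() or base in "NX")
    return masked / total if total else 0.0


if __name__ == "__main__":
    toy = [">scaffold_a toy example", "ACGTacgt", "NN", ">scaffold_b", "ACGTACGTAC"]
    print(f"{genome_masked_fraction(read_fasta(toy)):.2f}")
```

Note that hard-masked N runs are indistinguishable from assembly gaps in this simple count, so a gap-rich assembly would inflate the figure.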
What is a SAMI?
"Sorghum assembled methyl-filtered islands" represent assemblies of methyl-filtered sorghum shotgun sequences, obtained from Pat Schnable's MAGI/SAMI analysis. These are enriched for genic regions but only cover portions of genes.
How were ESTs aligned?
We aligned the consensus EST sequences of sorghum, sugarcane, and maize from the TIGR Plant Transcript Assemblies (PlantTA) database to the sorghum genome using Jim Kent's BLAT and NCBI BLAST.
How were rice and Arabidopsis peptides aligned?
The Arabidopsis and rice peptides were downloaded from NCBI RefSeq and aligned to the (unmasked) genome by gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in yellow) spans adjacent exons and the intervening intron(s). Also, small exons (evident from the maize/sorghum/sugarcane ESTs) are often missed.

How did you get the gene set for sorghum?

Where did the gene set come from?
Consensus gene predictions were built around several evidence sources. TIGR transcript assemblies were mapped on repeat-masked genome sequences, applying GenomeThreader with a splice site model of maize. Assemblies and ESTs of the following species were mapped: Allium cepa, Ananas comosus, Avena sativa, Brachypodium distachyon, Curcuma longa, Hordeum vulgare, Oryza sativa, Saccharum officinarum, Secale cereale, Sorghum bicolor, Sorghum halepense, Sorghum propinquum, Triticum aestivum, Zea mays and Zingiber officinale. We also generated optimal spliced alignments (OSAs) as well as BLASTX alignments for a reference set of proteins consisting of the SWISSPROT database and the Arabidopsis (TAIR6), Saccharomyces cerevisiae and rice (RAP2) proteomes. For each OSA, possible reading frames of size ≥50 amino acids were collected as candidates for gene models. In addition, we identified gene models on repeat-masked genomic sequences by ab initio methods (Fgenesh++, GeneID, GenomeScan/PASA). Next, we applied Jigsaw as a statistical combiner of all the supporting information above. A decision tree was trained on a set of 987 gene models that had been edited under human supervision in the Apollo Genome Browser. All models, including those obtained from the first analysis series, were scored by BLASTP against the UniRef90 protein database, and for each locus the best-fitting model, i.e. the model with the highest bitscore, was selected. In our final step, these predictions were rerun through the PASA pipeline in order (i) to predict UTRs from maize, sorghum and sugarcane ESTs, (ii) to identify possible alternative splicing patterns and (iii) to fit all predicted models to the splice sites suggested by EST evidence from closely related species. This pipeline yielded 36,338 transcript models at 34,496 loci. In addition to the 28,003 complete models, we predicted 6,493 candidate genes that lack a start and/or stop codon. These are therefore assigned as partial models.
We only included such models in our annotation if they did not overlap with complete predictions. Note that partial gene models may result from several reasons that are not mutually exclusive: (i) sequencing or assembly errors may prevent both ab initio and homology-based predictors from deducing a correct ORF; (ii) transposon activity may have truncated gene models; (iii) we have insufficient evidence from ab initio predictions or EST matches to provide a complete gene model.
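The selection step described above, keeping for each locus the candidate model with the highest bitscore, can be sketched as follows. This is a simplified illustration rather than the pipeline's actual code: the `(locus, model, bitscore)` tuple layout and all identifiers in the example are hypothetical. The constants at the end simply check that the counts quoted on this page are mutually consistent.

```python
def pick_best_models(candidates):
    """Return {locus: model_id}, keeping the highest-bitscore model per locus.

    `candidates` is an iterable of (locus, model_id, bitscore) tuples.
    On ties, the first model encountered is kept.
    """
    best = {}
    for locus, model_id, bitscore in candidates:
        if locus not in best or bitscore > best[locus][1]:
            best[locus] = (model_id, bitscore)
    return {locus: model_id for locus, (model_id, _) in best.items()}


# Counts quoted on this page.
COMPLETE_MODELS = 28_003
PARTIAL_MODELS = 6_493
LOCI = 34_496
TRANSCRIPTS = 36_338

# Complete plus partial models account for every locus.
assert COMPLETE_MODELS + PARTIAL_MODELS == LOCI

if __name__ == "__main__":
    # Hypothetical candidates from three predictors at two loci.
    candidates = [
        ("locus_1", "fgenesh_model", 100.0),
        ("locus_1", "jigsaw_model", 250.5),
        ("locus_2", "geneid_model", 80.0),
    ]
    print(pick_best_models(candidates))
    print("transcripts beyond one per locus:", TRANSCRIPTS - LOCI)
```

The difference between the transcript and locus counts is 1,842.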
How were UTRs identified in gene predictions?
The Program to Assemble Spliced Alignments, PASA (B. Haas), was run on the gene prediction set with all available sorghum ESTs. This produced 1,842 alternatively spliced alignments and added UTRs to 17,744 transcripts.
Why do models sometimes disagree with "obvious" exons from ESTs or homologous rice genes?
Two reasons. First, while gene prediction programs do take homology information into account, they also adhere to an internal statistical model for what coding sequences in maize and related grasses "should" look like. So homology evidence may be "overridden" if it is inconsistent with expected codon usage, etc. A second and related problem is that ESTs are imperfect and sometimes grossly wrong, as they may include unspliced (retained) introns and/or genomic contamination of the cDNA library. By using a statistical model, gene predictors are able to reject such false data in some cases.
Why don't all the open reading frames (ORFs) start with methionine? Why don't all the ORFs end with a stop codon? How come my gene is only partially predicted?
GenomeScan is one of the better homology-based gene predictors available, but like all computational gene modeling algorithms, it is imperfect. Also, to avoid "run-on" models that inappropriately join adjacent genes, we only provided GenomeScan with our best guess for the genomic extent of a locus. If the statistical model of GenomeScan does not encounter what it believes to be the true start or end of a gene in our locus, the initial ATG or terminal stop codon may not be present in the model. So it's partially GenomeScan's fault, and partially ours.
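If you download coding sequences in bulk, a quick way to flag the partial models described above is to test each one for an initial ATG and a terminal stop codon. The sketch below is a simple illustration with a hypothetical function name; it only inspects the first and last three bases, so it will not detect internal problems such as frameshifts or in-frame stops.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(cds):
    """Label a coding sequence by whether it has a start and/or stop codon."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "missing stop"
    if has_stop:
        return "missing start"
    return "missing start and stop"


if __name__ == "__main__":
    for seq in ("ATGGCTTTCTAA", "ATGGCTTTC", "GCTTTCTAA", "GCTTTC"):
        print(seq, "->", classify_cds(seq))
```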
Waiter, there's a repeat element in my gene! (Get it out!)
Since, in the interest of speed, we have only done a cursory initial masking of repeats, and since many unmasked repeats contain significant and/or mildly deteriorated open reading frames, GenomeScan sees these regions with good coding potential as too good to pass up in its automated gene prediction. These repeat-derived coding regions may end up in an intron of your favorite gene. In some cases, this leads to truncation of the automated gene prediction when a stop codon is encountered.
We hope that the aggravation of an imperfect gene set is partially compensated by the rapid release of the data...
Future gene sets will use the complete JGI annotation pipeline, which treats these problems in a more sophisticated manner and uses a wide variety of gene annotation methods. But the lesson from the annotation of other genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high quality genome assembly. So please be patient!

What can I do with the sorghum dataset?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of sorghum, maize, and other grasses! Please reference "Sorghum Genome Project, DoE Joint Genome Institute" as your citation.
I think I found an error. What should I do?
Unfortunately, we are unable to systematically address problems with the Sorbi0 assembly or annotation, especially in light of the known flaws outlined above and the impending release in Spring 2007 of a more accurate assembly and annotation. Depending on your attitude ...
  • "Thanks for doing this! How can I help you improve my favorite gene?" We will definitely be looking for expert assistance in refining the Sorbi1 gene set. So please hold on to your correction, and see if it's still necessary when Sorbi1 comes out. If the problem is still there, we can try to correct it. The most useful "hint" to help us is your best approximation to a correct transcript from your favorite locus. If you believe that there are indels in our sequence that disrupt the correct reading frame, please adjust the transcript itself. This'll be very helpful going forward.
  • "Damn you, you wasted my time ..." We apologise for any problems that inadvertent errors in this very provisional assembly and annotation may cause, but emphasize again that our goal with Sorbi0 is to accelerate the use of the sorghum genome by the sorghum, maize, and other communities!
I would like to do a large-scale comparison of sorghum to other genomes, and/or a global analysis of its gene content.
The Fort Lauderdale guidelines for large scale sequencing projects aim to balance the value of rapid data release for the user community with respect for the scientific interests of the generators of the data. Our plans for rapid publication of the sorghum genome are described below, and are focused on the large-scale analysis of the gene and repetitive content of the sorghum genome and its evolutionary dynamics. A plan for the coordinated submission of companion manuscripts to Genome Research is described here.

What are the publication plans?

We expect that the initial manuscript describing the assembly, annotation, and initial analysis of the sorghum genome will be based on the spring 2007 Sorbi1 assembly and will be submitted by September 2007 -- only nine months after completion of data collection. The sorghum genome analysis group is actively working on the following topics for this initial manuscript:
  • Assembling a highly repetitive plant genome using a whole genome shotgun method, and comparison with filtration methods
  • Molecular evolution of coding and non-coding sequences informed by the sorghum genome sequence
  • Synteny and chromosome-scale evolution of grasses
  • Analysis of the whole genome duplication shared by rice, sorghum, and other grasses
  • Patterns of plant gene family evolution as illuminated by the sorghum gene set
  • Tempo and mode of recent retrotransposon and other repeat activity in sorghum
©2007 University of California Regents. All rights reserved