Citrus sinensis (Sweet orange)
About the genome:
Sweet orange (citrus, Citrus sinensis) represents the largest citrus cultivar group grown in the world, accounting for about 70% of the total. Brazil, Florida (USA), and China are the three largest sweet orange producers. Sweet orange is considered an introgression of a natural hybrid of mandarin and pummelo.
The goal of the Citrus Genome Project is to generate a draft sequence of the sweet orange genome using next-generation (454) sequence generated by 454 Life Sciences (team headed by Chinnappa Kodira) and the University of Florida (team headed by Fred Gmitter), as well as Sanger sequence generated by JGI (team led by Daniel Rokhsar). EST sequence has been generated by JGI, the University of Florida, and 454 Life Sciences. Separately, the International Citrus Genome Consortium is running a deep Sanger sequencing project on a haploid derived from Clementine mandarin.
- Genome Size
- This version (v.1) of the assembly is 319 Mb spread over 12,574 scaffolds. Half the genome is accounted for by 236 scaffolds 251 kb or longer.
- The current gene set (orange1.1) integrates 3.8 million ESTs with homology and ab initio-based gene predictions (see below). 25,376 protein-coding loci have been predicted, each with a primary transcript. An additional 20,771 alternative transcripts have been predicted, generating a total of 46,147 transcripts. 16,318 primary transcripts have EST support over at least 50% of their length. Two-fifths of the primary transcripts (10,813) have EST support over 100% of their length.
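The scaffold statistic quoted above (half the genome in 236 scaffolds of 251 kb or longer) is the kind of summary usually reported as L50 and N50. As a minimal sketch, assuming you have a list of scaffold lengths in base pairs, the two values can be computed as follows. The function name `n50_l50` and the toy lengths are illustrative only and are not taken from the actual assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of scaffold lengths in bp.

    N50 is the length of the shortest scaffold in the smallest set of
    longest scaffolds that together cover at least half the assembly;
    L50 is the number of scaffolds in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy lengths (hypothetical, not the real sweet orange scaffolds).
toy = [500, 400, 300, 200, 100]
n50, l50 = n50_l50(toy)
print(n50, l50)  # prints: 400 2
```

Applied to the real scaffold lengths, this kind of calculation would be expected to give values comparable to the 251 kb and 236 scaffolds quoted above, although the exact conventions used for the published numbers are not stated here.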
How was the genome sequenced?
- Whole genome sequencing strategy
- Genomic sequence was generated using a whole genome shotgun approach, with 2 Gb of sequence coming from GS FLX Titanium, 2.4 Gb from FLX Standard, 440 Mb from Sanger paired-end libraries, and 2.0 Gb from 454 paired-end libraries.
- How was the assembly generated?
- The 25.5 million 454 reads and 623,000 Sanger sequence reads were generated in a collaborative effort by 454 Life Sciences, the University of Florida, and JGI. The assembly was generated by Brian Desany at 454 Life Sciences using the Newbler assembler.
- How were repeats identified?
- A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed from the library of repeat sequences and the library was then used to mask 31% of the genome with RepeatMasker.
- How were ESTs aligned?
- We aligned the sweet orange EST sequences using Brian Haas's PASA pipeline, which aligns each EST to its best location in the genome via GMAP and then filters the hits to ensure proper splice boundaries.
- How were plant proteins aligned?
- Rice, Arabidopsis and grapevine proteins were downloaded from MSU, TAIR and Genoscope respectively. Soybean proteins were generated in our internal annotation pipeline at the JGI. All proteins were aligned to the soft-masked genome using gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.
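The repeat-masking and protein-alignment steps above both refer to a soft-masked genome. A minimal sketch of how a masked fraction could be checked is shown below, assuming the common convention that soft-masked bases are written in lowercase in the FASTA file. Whether the 31% figure above was computed this way is not stated, and the function name `masked_fraction` and the toy record are illustrative only.

```python
def masked_fraction(fasta_lines):
    """Fraction of bases that are lowercase (soft-masked) in FASTA text lines."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy record (hypothetical sequence, not from the assembly).
toy_fasta = [">scaffold_demo", "ACGTacgtAC", "GTACgtacgt"]
print(f"{masked_fraction(toy_fasta):.0%}")  # prints: 50%
```

For a real file, the same function can be given an open file handle, since it only iterates over lines.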
How did you determine the sweet orange gene set?
- Gene prediction
- To produce the "orange1.1" gene set, we used the homology-based gene prediction program FgenesH, as well as "hybrid" GeneMark-ES+ (Paul Burns and Mark Borodovsky at Georgia Tech) predictions, which integrate EST evidence into the ab initio gene predictions. The best gene prediction at each locus is picked and integrated with EST assemblies using the PASA program. The gene set shown on the browser was generated from the above input gene models by Simon Prochnik at JGI.
- The gene prediction pipeline has the following components: proteins from diverse angiosperms and 112,000 sweet orange EST assemblies (from 3.8M filtered ESTs assembled with PASA) were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1 kb in each direction and submitted to FgenesH (provided by Asaf Salamov at JGI), along with related angiosperm proteins and/or ORFs from the overlapping EST assemblies. FgenesH identifies likely protein-coding exons, favoring regions that align well to the given homologous proteins. In a separate gene prediction effort, Mark Borodovsky's group generated hybrid gene predictions that integrate EST information with ab initio predictions by GeneMark-ES+ (Lomsadze et al. 2005). These two sets of predictions were integrated with expressed sequence information using PASA (Haas et al. 2003) against the 112,000 sweet orange EST assemblies. The results were filtered to remove genes identified as transposon-related.
- How come my gene is wrong?
- FgenesH and GeneMark-ES+ are good gene prediction tools, but like all computational gene modeling algorithms, they are imperfect. In addition, EST and cDNA data are often incomplete. We hope that the aggravation of having an imperfect gene set is partially compensated by the rapid release of the data. Future gene sets will improve as assembly quality improves, along with expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly.
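Returning to the locus-definition step of the gene prediction pipeline (overlapping protein and EST alignments define a putative locus, which is then extended by 1 kb in each direction), the idea can be sketched as below. This is an illustration under stated assumptions, not the JGI implementation: coordinates are taken to be 1-based and inclusive, alignments are plain `(scaffold, start, end)` tuples, and the function name `define_loci` and all toy values are hypothetical.

```python
from collections import defaultdict


def define_loci(alignments, flank=1000, scaffold_lengths=None):
    """Merge overlapping (scaffold, start, end) spans and pad each by `flank` bp.

    Padded regions are clipped to position 1 and, when a length is supplied
    for the scaffold, to the scaffold end.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end in alignments:
        by_scaffold[scaffold].append((start, end))

    loci = []
    for scaffold, spans in sorted(by_scaffold.items()):
        spans.sort()
        merged = []
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlaps the current locus: grow it
                cur_end = max(cur_end, end)
            else:  # gap: close the current locus and start a new one
                merged.append((cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end))

        limit = (scaffold_lengths or {}).get(scaffold)
        for start, end in merged:
            lo = max(1, start - flank)
            hi = end + flank if limit is None else min(limit, end + flank)
            loci.append((scaffold, lo, hi))
    return loci


# Toy alignments (hypothetical coordinates and scaffold names).
hits = [("s1", 5000, 6000), ("s1", 5500, 7000), ("s1", 20000, 21000), ("s2", 300, 900)]
print(define_loci(hits, scaffold_lengths={"s1": 21500, "s2": 5000}))
# prints: [('s1', 4000, 8000), ('s1', 19000, 21500), ('s2', 1, 1900)]
```

The sketch does not handle strand or the case where two padded regions end up overlapping each other; how the actual pipeline treats those cases is not described above.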
What can I do with the sweet orange dataset?
- I would like to use this data to help clone a gene, analyse a gene family, etc.
- Wonderful! Please feel free to use this data to advance your studies of sweet orange. Please cite "Sweet Orange Genome Project 2010, http://www.phytozome.net/orange".
- I think I found an error. What should I do?
- If you would like to bring any items to our attention, please send email to firstname.lastname@example.org.
- I would like to do a large-scale comparison of sweet orange to other genomes, and/or a global analysis of its gene content.
- The Fort Lauderdale guidelines for large scale sequencing projects aim to balance the value of rapid data release for the user community with respect for the scientific interests of the generators of the data. We have released the data prior to any publication because of the importance of this data to the citrus community. Our plans for publication of the genome sequence and associated analyses are still developing. Please contact Fred Gmitter, ICGC at email@example.com if you plan to publish large-scale analysis derived from the sweet orange genome resources available in Phytozome.