Citrus clementina (Clementine)
About the genome:
The Clementine mandarin (Citrus clementina) is related to the sweet orange. Clementine is normally diploid; however, this genome sequence was generated from a haploid plant, which greatly simplifies genomic sequencing and assembly. The haploid plant was generated by in situ parthenogenesis of clementine Clemenules induced by irradiated pollen of Fortune mandarin, followed by direct embryo germination in vitro. The sequencing project is the work of the International Citrus Genome Consortium (ICGC, http://www.citrusgenome.ucr.edu/).
Draft sequencing of the genome by ICGC, Genoscope, IGA, HudsonAlpha and JGI using a Sanger whole genome shotgun approach generated a high-quality draft assembly, which was subsequently integrated with a genetic map to generate a chromosome-scale (pseudo-molecule) assembly.
- Genome Size
- This version of the assembly (v1.0) is 301.4 Mb spread over 1,398 scaffolds with 2.1% gaps at 7.0x coverage. Over 96% of the assembly is accounted for by the 9 chromosome pseudo-molecules ~21-51 Mbp in length.
- The current gene set (clementine1.0) integrates 1.560 million ESTs with homology-based and ab initio gene predictions (by GenomeScan, Fgenesh, exonerate). 24,533 protein-coding loci have been predicted, each encoding a primary transcript. An additional 9,396 alternative transcripts are encoded on the genome, giving a total of 33,929 transcripts. 16,963 primary transcripts have EST support over at least 50% of their length. About a third of the primary transcripts (8,684) have EST support over 100% of their length.
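The gene set figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are hypothetical, and the numbers are copied from the text rather than read from any release file.

```python
# Minimal sketch: recompute the gene set totals quoted in the text.
# All names here are illustrative; the counts are copied from the description above.

PRIMARY_TRANSCRIPTS = 24_533      # one primary transcript per protein-coding locus
ALTERNATIVE_TRANSCRIPTS = 9_396   # additional alternative transcripts
EST_SUPPORT_HALF = 16_963         # primary transcripts with EST support over >= 50% of length
EST_SUPPORT_FULL = 8_684          # primary transcripts with EST support over 100% of length


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def gene_set_summary() -> dict:
    """Summarise transcript totals and EST support fractions."""
    return {
        "total_transcripts": PRIMARY_TRANSCRIPTS + ALTERNATIVE_TRANSCRIPTS,
        "pct_est_support_half": percent(EST_SUPPORT_HALF, PRIMARY_TRANSCRIPTS),
        "pct_est_support_full": percent(EST_SUPPORT_FULL, PRIMARY_TRANSCRIPTS),
    }


if __name__ == "__main__":
    for key, value in gene_set_summary().items():
        print(key, round(value, 1))
```

The total comes out to 33,929 transcripts, matching the text, and the fully supported fraction is a little over one third of the primary transcripts.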
How was the genome sequenced?
- Whole genome sequencing strategy
- Genomic sequence was generated by the ICGC, Genoscope, IGA and JGI with a whole genome shotgun approach, using Sanger sequencing of 2-3 kb and 6-12 kb insert libraries as well as a 39 kb fosmid end library, totaling 6x coverage.
- How was the assembly generated?
- The genome was assembled with Arachne by Jeremy Schmutz at HudsonAlpha. Scaffolds were placed on chromosome pseudo-molecules by aligning SSR markers from the Clementine genetic map (Ollitrault, Terol, Chen, et al., 2012) to scaffolds, followed by breaking and rejoining scaffolds. Over 96% of the genome is on a chromosome-scale pseudo-molecule.
- How were repeats identified?
- A repeat library was generated with the de novo repeat finding algorithm RepeatModeler (Smit & Hubley, 2011). Repetitive sequences shorter than 500 nt were removed, and the remainder was annotated with predicted protein domains using Pfam (Finn, Tate, et al., 2008) and Panther (Mi, Dong, et al., 2010). Sequences that had been annotated with a protein domain not associated with transposable elements or other repeat sequences were removed from the library. The resulting repeat library was used to soft-mask (lower-case) the assembly with RepeatMasker (Smit, Hubley, Green, 1996-2004). To run RepeatMasker efficiently, the input assemblies were broken into 500 kb segments with 1 kb overlap. There is potentially a small had previously been generated from the sweet orange genome sequence. This library was used with RepeatMasker to mask 45% of the assembly.
- How were ESTs aligned?
- We obtained 770,602 EST sequences from LifeSequencing from the diploid Clementine var. Nules, which is the parent of the haploid reference. To these, we added 210,567 C. x sinensis and 118,365 C. x clementina ESTs downloaded from GenBank, 58,656 EST assemblies that had been generated from sweet orange 454 ESTs assembled with Newbler (Mohammed Mohiuddin, Roche 454), and 401,708 454 EST reads from Life Technologies, making a total of 1,559,898 ESTs. These were aligned to the clementine genome (requiring 95% sequence identity and 50% coverage of the input sequence) and further assembled with PASA (Haas, Delcher, et al., 2003) to generate 76,372 EST assemblies.
- How were plant proteins aligned?
- We aligned predicted protein sequences from Arabidopsis (v. TAIR8), peach (JGI v. 1.0) and grapevine (Genoscope 12 x 05/10/10) to the softmasked Clementine v1.0 assembly with gapped BLASTX (Altschul, Gish, et al., 1990) and generated putative protein-coding gene loci from regions with EST assemblies and/or protein homology, extending to include overlap where necessary. High-scoring sequence pairs (HSPs) are shown in the GBrowse genome browser on Phytozome. Note that gapped BLAST was used to increase sensitivity, so in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Small exons are also often missed.
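The repeat-masking answer above mentions breaking the assembly into 500 kb segments with 1 kb overlap before running RepeatMasker. The sketch below shows one way such coordinates could be computed. It is an assumption-laden illustration, not the pipeline actually used: the function name is hypothetical, and it assumes that each segment is at most 500 kb long and shares its last 1 kb with the start of the next segment.

```python
# Minimal sketch of splitting a sequence into overlapping segments.
# Assumptions (not stated in the source): segments are at most `segment` bases
# long, and consecutive segments share exactly `overlap` bases.

from typing import Iterator, Tuple


def split_with_overlap(
    seq_len: int, segment: int = 500_000, overlap: int = 1_000
) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) coordinates covering [0, seq_len)."""
    if segment <= 0 or overlap < 0 or overlap >= segment:
        raise ValueError("need segment > overlap >= 0")
    if seq_len <= 0:
        return
    step = segment - overlap  # distance between consecutive segment starts
    start = 0
    while True:
        end = min(start + segment, seq_len)
        yield start, end
        if end >= seq_len:  # the last segment reaches the end of the sequence
            break
        start += step


if __name__ == "__main__":
    # A hypothetical 1.2 Mb scaffold.
    for start, end in split_with_overlap(1_200_000):
        print(start, end)
```

With these assumptions, a 1.2 Mb scaffold yields three segments, the last one shorter than 500 kb. Masked coordinates from each segment would then need to be shifted by the segment start before being merged back onto the scaffold.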
How did you determine the haploid clementine orange gene set?
- Gene prediction
- Gene predictions were generated from putative loci with FGenesH+ (Solovyev, Kosarev, et al., 2006), exonerate (Slater & Birney, 2005) (with option --model protein2genome) and GenomeScan (Yeh, Lim, Burge, 2001). At each locus, the gene prediction with the most support from EST assemblies and protein homology was chosen and then improved using evidence from the EST assemblies in a second round of PASA. Gene models with homology to repeats were removed. This produced an annotation at each of 24,533 protein-coding loci, with 9,396 predicted alternative splice forms, making a total of 33,929 transcripts. The gene set shown on the browser was generated from the above input gene models by Simon Prochnik at JGI.
- How come my gene is wrong?
- FgenesH, exonerate and GenomeScan are good gene prediction tools, but like all computational gene modeling algorithms, they are not perfect. In addition, EST and cDNA data are often incomplete. We hope that the aggravation of having an imperfect gene set is partially compensated by the rapid release of the data. Future gene sets will improve as assembly quality improves, along with expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly.
What can I do with the clementine dataset?
- I would like to use this data to help clone a gene, analyse a gene family, etc.
- Wonderful! Please feel free to use this data to advance your studies of clementine. Please cite "Haploid Clementine Genome, International Citrus Genome Consortium, 2011, http://int-citrusgenomics.org/, http://www.phytozome.net/clementine".
- I think I found an error. What should I do?
- If you would like to bring any items to our attention, please send email to firstname.lastname@example.org.
- I would like to do a large-scale comparison of clementine to other genomes, and/or a global analysis of its gene content.
- The Fort Lauderdale guidelines for large scale sequencing projects aim to balance the value of rapid data release for the user community with respect for the scientific interests of the generators of the data. We have released the data prior to any publication because of the importance of this data to the citrus community. Our plans for publication of the genome sequence and associated analyses are still developing. Please contact Fred Gmitter, ICGC at email@example.com if you plan to publish large-scale analysis derived from the clementine genome resources available in Phytozome.