Gossypium raimondii (cotton)

About the genome:


This v2.1 annotation release is on genome assembly v2.0, a high quality version of the Cotton D (Gossypium raimondii) genome sequenced from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger, Roche 454 pyrosequencing and Illumina read pairs. This release includes additional screening of small repetitive contigs and a new map integration that corrects several orientation issues within scaffolds.


Scaffold total: 1,033
Contig total: 19,735
Scaffold sequence total: 761.4 Mb
Contig sequence total: 748.1 Mb ( -> 1.7% gap)
Scaffold N50 (L50) = 6 (62.2 Mb)
Contig N50 (L50) = 1,596 (135.6 kb)
41 scaffolds are > 50 kb in size, representing approximately 99.0% of the genome
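
As a rough illustration of how the summary statistics above relate to each other, the sketch below computes an N50-style count and length from a list of sequence lengths, and the gap percentage from the scaffold and contig totals. The function names and the toy length list are hypothetical; only the two totals (761.4 Mb and 748.1 Mb) come from the table above.

```python
def n50_stats(lengths):
    """Return (count, length): the smallest number of sequences, taken
    longest first, whose combined length reaches half of the total, and
    the length of the last sequence needed to get there."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


def gap_percent(scaffold_total, contig_total):
    """Percentage of scaffold sequence that is gap rather than contig."""
    return 100.0 * (scaffold_total - contig_total) / scaffold_total


# Made-up lengths, for illustration only (not real scaffold sizes).
toy_lengths = [80, 70, 50, 30, 20]
count, length = n50_stats(toy_lengths)

# Totals in Mb, taken from the statistics above.
gap = gap_percent(761.4, 748.1)
print(count, length, round(gap, 1))
```

With the totals listed above, the gap fraction works out to about 1.7%, matching the figure reported next to the contig sequence total.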


How was the genome sequenced?

How was the assembly generated?
This release is a high-quality version of the Cotton D genome from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger-based sequence (1.52x assembled coverage, with 0.95x coverage from BAC-end and fosmid-end sequence), Roche 454 pyrosequencing (14.95x linear and 3.1x non-redundant pairs assembled coverage), and Illumina-based short reads (primarily to correct 454 insertion/deletion errors). The genome was assembled using our modified version of Arachne2.

V2.0 includes the removal of small repetitive contigs within scaffolds and a new detailed map integration with the recently available tetraploid map that was used to correct several scaffold orientation issues.

The combination of:
  • BES/markers hybridized to FPC contigs (Lin et al, 2010)
  • Genetic map for the diploid (Rong, et al 2004)
  • Tetraploid map (Byers et al, 2012)
  • Vitis vinifera and Theobroma cacao synteny

was used to identify misjoins in the assembly. Misjoins were characterized by an abrupt change in the linkage group (or synteny) coinciding with a region of low BAC/fosmid support. A total of 13 misjoins were identified and subsequently broken.
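
The misjoin criterion described above can be sketched roughly as follows. This is an illustration only, not the procedure used for the assembly: the `Marker` class, the `candidate_misjoins` function, the `min_support` threshold, the linkage group labels, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    position: int        # bp position of the marker on the scaffold
    linkage_group: str   # linkage group the marker maps to


def candidate_misjoins(markers, clones, min_support=2):
    """Flag intervals between adjacent markers where the linkage group
    changes and fewer than `min_support` clones span the whole interval.

    `clones` is a list of (start, end) scaffold coordinates of BAC/fosmid
    clones placed by their end sequences. The default threshold is an
    assumption made for this sketch.
    """
    ordered = sorted(markers, key=lambda m: m.position)
    flagged = []
    for left, right in zip(ordered, ordered[1:]):
        if left.linkage_group == right.linkage_group:
            continue  # no change in linkage group, nothing to check
        support = sum(
            1 for start, end in clones
            if start <= left.position and end >= right.position
        )
        if support < min_support:
            flagged.append((left.position, right.position))
    return flagged


# Toy data: the linkage group switches between 5,000 and 9,000 bp, and
# only one clone spans that interval, so it is reported as a candidate.
markers = [Marker(1000, "LG1"), Marker(5000, "LG1"),
           Marker(9000, "LG7"), Marker(12000, "LG7")]
clones = [(500, 6000), (800, 5500), (8500, 13000), (4000, 9500)]
print(candidate_misjoins(markers, clones))
```

In practice the evidence would come from several sources at once (the marker maps and synteny listed above), so a single threshold on spanning clones is a simplification.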

Scaffolds were oriented, ordered, and joined together using the aforementioned resources. A total of 51 joins were assembled to form a final assembly containing 13 chromosomes. This release is of suitably high quality to match our previous fully Sanger sequenced plant genomes.

How did you determine the Gossypium raimondii gene set?

Gene prediction
85,746 transcript assemblies were made from about 1B pairs of D5 paired-end Illumina RNAseq reads, 55,294 transcript assemblies from about 0.25B D5 single-end Illumina RNAseq reads, and 62,526 transcript assemblies from 0.15B TET single-end Illumina RNAseq reads. All of these RNAseq transcript assemblies were made using PERTRAN (Shu et al., manuscript in preparation). 120,929 transcript assemblies were constructed using PASA (Haas, 2003) from 56,638 D5 Sanger ESTs, 2.5M D5 454 RNAseq reads, and all RNAseq transcript assemblies above. 133,073 transcript assemblies were constructed using PASA from 296,214 TET Sanger ESTs and about 2.9M TET 454 reads. The larger number of transcript assemblies from fewer TET sequences is due to the fragmented nature of those assemblies.
Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, cacao, rice, soybean, grape, and poplar to the G. raimondii genome, repeat-soft-masked using RepeatMasker (Smit, 1996-2012), with up to 2 kb extension on both ends unless extending into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001). The best-scoring prediction for each locus was selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats.
The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs.
PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models with more than 20% of their CDS overlapping repeats, the Cscore had to be at least 0.9 and the homology coverage at least 70% to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. The final gene set has 37,505 protein-coding genes and 77,267 protein-coding transcripts.
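
The selection thresholds above can be written out as a small filter. This is a minimal sketch of one reading of the rule, not the pipeline's actual code: the function and parameter names are hypothetical, coverage and overlap values are expressed as fractions between 0 and 1, and a repeat overlap of exactly 20% (which the text does not assign to either case) is treated here as the low-repeat case.

```python
def compute_cscore(blastp_score, mbh_blastp_score):
    """Cscore: ratio of a protein's BLASTP score to its mutual-best-hit
    (MBH) BLASTP score."""
    return blastp_score / mbh_blastp_score


def keep_transcript(cscore, protein_coverage, has_est_coverage,
                    repeat_overlap):
    """Return True if a PASA-improved transcript passes the thresholds.

    repeat_overlap is the fraction of the CDS that overlaps repeats.
    """
    if repeat_overlap > 0.20:
        # Repeat-rich CDS: require strong homology support.
        return cscore >= 0.9 and protein_coverage >= 0.70
    # Low repeat overlap: homology support or EST coverage is enough.
    # (Assumption: exactly 20% overlap falls into this branch.)
    return (cscore >= 0.5 and protein_coverage >= 0.5) or has_est_coverage


# Made-up values for illustration.
print(keep_transcript(0.6, 0.6, False, 0.05))   # homology support, few repeats
print(keep_transcript(0.3, 0.2, True, 0.05))    # EST coverage only
print(keep_transcript(0.8, 0.9, True, 0.35))    # repeat-rich, Cscore too low
```

The Pfam TE-domain filter (removal of models whose protein is more than 30% in TE domains) would be applied as a separate step after this one.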


Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654

Smit, A.F.A., Hubley, R., and Green, P. RepeatMasker Open-3.0. 1996-2011.

Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

Salamov, A. A. and Solovyev, V. V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.

What can I do with the Gossypium raimondii dataset?

For public access, in agreement with Fort Lauderdale, we are making the Cotton D genome available from the DOE JGI and our collaborators prior to peer-reviewed publication of the data. We are making this data available with the expectation and desire to publish this data in a reasonable time without preemption by other groups. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole genome or chromosome scale prior to publication by the DOE JGI and/or its collaborators of a comprehensive genome analysis ("Reserved Analyses"). "Reserved analyses" include the identification of complete (whole genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome- scale comparisons with other species including other cotton species and cultivars. For specific questions about data use please contact Andy Paterson (paterson AT plantbio.uga.edu) and Jeremy Schmutz (jschmutz AT hudsonalpha.org).

Work towards publication of the Cotton D genome is underway, and we plan to submit a manuscript within this calendar year. If you will be employing the data for non-reserved analyses, such as cloning a gene of interest, designing mapping panels, or analyzing a gene family, please reference "DOE Joint Genome Institute: Cotton D V2.0" as your citation.

  ©2006-2014 University of California Regents. All rights reserved  