Manihot esculenta (Cassava)

About the genome:


Cassava (Manihot esculenta) is grown throughout tropical Africa, Asia and the Americas for its starchy storage roots, and feeds an estimated 750 million people each day. Farmers choose it for its high productivity and its ability to withstand a variety of environmental conditions (including significant water stress) in which other crops fail. However, it has very low protein content, and is susceptible to a range of biotic stresses. Despite these problems, the crop production potential for cassava is enormous, and its capacity to grow in a variety of environmental conditions makes it the plant of the future for emerging tropical nations. Cassava is also an excellent energy source - its roots contain 20-40% starch that costs 15-30% less to produce per hectare than starch from corn, making it an attractive and strategic source of renewable energy.

The goals of the Cassava Genome Project are to generate a draft sequence of the cassava genome, and because of the humanitarian importance of the crop, to make that sequence available to all - freely and rapidly. Much of the utility of the genome sequence will come from the development of breeding tools, and as such a perfect reference genome sequence is not needed. Our sequencing strategies have been selected accordingly. The project has built upon a pilot initiated through the DOE-JGI Community Sequencing Program (CSP) by a 14-member consortium led by Claude Fauquet, Joe Tohme and Pablo Rabinowicz. This pilot project produced a little under 1x coverage from over 700,000 Sanger shotgun reads using plasmid and fosmid libraries, and it provided insights into the overall characteristics of the cassava genome, and a valuable source of Sanger paired-end sequences to be used later.

The main phase of the project, led by Steve Rounsley, Dan Rokhsar, Chinnappa Kodira, and Tim Harkins, began in Spring 2009 when 454 Life Sciences, a Roche company, partnered with DOE-JGI to provide the resources for whole genome shotgun sequencing of cassava using the 454 GS FLX Titanium platform. Nearly 61 million 454 reads (single and paired-end) were generated and combined with the Sanger data from the pilot project as input for genome assembly. The resulting assembly and its annotation are available through Phytozome and have also been deposited in GenBank in accordance with our commitment to early access and the Fort Lauderdale genome data release policy.

The University of Arizona has recently been awarded a 3-year $1.3 million grant by the Bill & Melinda Gates Foundation to expand and improve upon the initial cassava genome sequence. With its partners, DOE-JGI, 454 Life Sciences and University of Maryland, Baltimore, the newly funded project seeks targeted improvement of the genome assembly and SNP discovery via resequencing of many varieties of cassava. The SNP resource will also be accessible through Phytozome.


Genome Size
This version (Cassava4) of the assembly consists of 12,977 scaffolds spanning 533 Mb. Half of the assembled sequence is in the largest 487 scaffolds, each 258 kb or larger.
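The statement that half of the assembled sequence lies in the largest 487 scaffolds, each 258 kb or larger, corresponds to what are commonly called the L50 and N50 statistics. The following is a minimal sketch of how such values can be computed from a list of scaffold lengths. The function name `n50_stats` and the toy lengths are illustrative assumptions, not part of the project's pipeline or data.

```python
def n50_stats(lengths):
    """Return (N50, L50) for a list of scaffold lengths in bp.

    N50 is the length of the shortest scaffold in the smallest set of
    largest scaffolds that together cover at least half of the assembly;
    L50 is the number of scaffolds in that set.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk scaffolds from largest to smallest, accumulating length.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count


# Toy example with made-up lengths (not cassava data).
toy = [100, 80, 60, 40, 20]
print(n50_stats(toy))  # (80, 2)
```

Applied to the real scaffold lengths, the same calculation would be expected to reproduce the figures quoted above (N50 of about 258 kb, L50 of 487).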
Although cassava has an estimated genome size of ~760 Mb, this assembly spans only 533 Mb. We believe that the 533 Mb assembly represents nearly all of the genic regions of the genome, and that the missing portion is repetitive sequence that could not be assembled. This is supported by two pieces of evidence. 1. A large fraction of reads (both Sanger and 454) were not used by the assembly software, and these were primarily repetitive in nature. 2. Transcripts assembled from publicly available cassava ESTs (P. Rabinowicz, unpublished) were mapped to the genome assembly. We were able to map 97% of the transcripts, showing near-complete coverage of protein-coding genes in the assembly.
The current gene set (Cassava4.1) integrates 1.5 million ESTs with homology and ab initio-based gene predictions. 30,666 protein-coding loci encoding 34,151 transcripts have been predicted.


How was the genome sequenced?

Whole genome shotgun methodology
We used a whole genome shotgun strategy in which the entire genome is randomly sheared, subcloned, and redundantly sequenced. The ease, cost-efficiency, and speed of the whole genome shotgun approach have made it the method of choice in many cases. This is one of the first publicly available plant genomes sequenced primarily with 454 technology.
How was the assembly generated?
The initial assembly (v1) of 61 million reads was produced by a collaboration between Steve Rounsley at University of Arizona and 454 Life Sciences using the Newbler assembly software. An incremental approach was taken in which single-end reads were assembled alone, and then paired-end reads (Sanger and 454) were added. The current improved v4 assembly incorporates an additional 6x sequence data produced using 454 Life Sciences' new longer-read platform (GS FLX Titanium 1K-beta), and was assembled with Newbler v2.5. Scaffold N50 has increased by 17%, and contig N50 has increased by 43%, indicating that the longer reads have contributed to the closing of many gaps present in the previous assembly.
Is it complete?
Comparison with a set of EST assemblies (sequenced from a different cassava genotype) that have homology to Arabidopsis proteins suggests that 97% of known cassava genes are represented in the v1 assembly. This result supports the claim that Cassava4.1 is largely complete with respect to "gene space." In addition to the gene space, repeat masking shows that about 200 Mb of repetitive sequence was assembled. We will pursue further improvements to the assembly through additional sequencing and improved assembly algorithms as they become available.
Which germplasm was sequenced?
Both the Sanger and 454 sequence data were generated from a partially inbred line called AM560-2 which was generated at CIAT (International Center for Tropical Agriculture) in Cali, Colombia.

Methods for generating data in the cassava genome browser

How were repeats identified?
RepeatScout (Price & Pevzner, 2005) was used to generate a catalog of 599 over-represented sequences over 500nt long with homology to known transposons in the nr database at NCBI. Known transposon sequences from RepBase (version from 20090604) from Viridiplantae were added to make a custom library of repeats that masked 38% of the genome with RepeatMasker.
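As a rough consistency check, 38% of the 533 Mb assembly is about 203 Mb, in line with the roughly 200 Mb of assembled repeats mentioned elsewhere on this page. The sketch below shows one way a masked fraction could be measured from a soft-masked sequence set, assuming the common convention that masked bases are written in lowercase. The function name `masked_fraction` and the toy sequences are hypothetical, and counting every base (including any gap characters) in the denominator is an assumption rather than the project's documented method.

```python
def masked_fraction(sequences):
    """Fraction of bases that are soft-masked (lowercase) across sequences."""
    masked = 0
    total = 0
    for seq in sequences:
        total += len(seq)
        # Soft-masked bases are assumed to be lowercase letters.
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data given")
    return masked / total


# Toy sequences (not cassava data): 8 of 20 bases are lowercase.
toy = ["ACGTacgtAC", "ggggTTTTTT"]
print(masked_fraction(toy))  # 0.4
```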
How were ESTs aligned?
We aligned the cassava EST sequences using Brian Haas's PASA pipeline, which aligns each EST to its best location in the genome via GMAP.
How were plant proteins aligned?
Rice and castor bean proteins were downloaded from TIGR; Arabidopsis proteins were downloaded from TAIR; poplar and soybean proteins were generated in our annotation pipeline. All proteins were aligned to the soft-masked genome using gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.

How did you determine the cassava gene set?

Gene prediction
To produce the current "Cassava4.1" gene set, we used the homology-based gene prediction programs FgenesH and GenomeScan, along with the PASA program to integrate over 1.5 million cassava ESTs. The gene set shown on the browser was produced by Simon Prochnik at JGI. Briefly, proteins from diverse angiosperms and cassava EST assemblies (from PASA) were aligned to the genome, and their overlaps used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1 kb in each direction and submitted to FgenesH and GenomeScan, along with related angiosperm proteins and/or ORFs from the overlapping EST assemblies. The homology-based prediction with the best homology and EST support at each locus was chosen and then integrated with cassava EST information using PASA (Haas et al. 2003). The results were filtered to remove genes with homology to transposable elements. Genes with apparently truncated ORFs may be prediction errors or pseudogenes.
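The locus-definition step described above (overlapping alignments grouped into putative loci, then extended by 1 kb in each direction) can be illustrated with a short sketch. This is not the annotation pipeline's actual code: the function name `define_loci`, the 0-based half-open coordinates, the rule that any overlap merges two alignments, and the decision to ignore strand are all assumptions made for illustration.

```python
def define_loci(alignments, scaffold_length, flank=1000):
    """Merge overlapping (start, end) alignment intervals on one scaffold
    into putative loci, then extend each locus by `flank` bp per side.

    Coordinates are 0-based, half-open. Extended regions are clipped to
    the scaffold and may overlap each other.
    """
    merged = []
    for start, end in sorted(alignments):
        if merged and start < merged[-1][1]:
            # Overlaps the previous locus: grow it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [
        (max(0, start - flank), min(scaffold_length, end + flank))
        for start, end in merged
    ]


# Toy alignments (not cassava data) on a hypothetical 21,500 bp scaffold.
toy = [(5000, 6200), (6000, 7500), (500, 900), (20000, 21000)]
print(define_loci(toy, scaffold_length=21500))
# [(0, 1900), (4000, 8500), (19000, 21500)]
```

Each resulting region is the kind of genomic window that would then be handed to a gene predictor together with the supporting protein or EST evidence.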
How come my gene is wrong?
Sometimes errors arise from gene prediction programs. In addition, EST and cDNA data are often incomplete. We hope that the aggravation of having an imperfect gene set is partially compensated by the rapid release of the data.
Future gene sets will improve as assembly quality improves and as more expressed sequence data and genomic data from related species become available. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly. So please be patient!

What can I do with the cassava dataset?
What are the publication plans?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of cassava and other Malpighiales. Please cite the genome paper (see below).
I think I found an error. What should I do?
If you would like to bring any items to our attention, please send email to
I would like to do a large-scale comparison of cassava to other genomes, and/or a global analysis of its gene content.
The data in this release is freely available. Please cite the genome paper, Prochnik et al. (2012), J. Tropical Plant Biology, in press, and note that you downloaded the v.4.1 data from