Eucalyptus grandis (Eucalyptus)

Access the latest Eucalyptus grandis data and information at Phytozome v10

About the genome:


A major challenge in achieving a sustainable energy future is understanding the molecular basis of superior growth and adaptation in woody plants suitable for biomass production. Eucalyptus species are among the fastest growing woody plants in the world, with mean annual increments of up to 100 cubic meters per hectare. Eucalyptus is the most valuable and most widely planted genus of plantation forest trees in the world (ca. 18 million hectares) due to its wide adaptability, extremely fast growth rate, good form, and excellent wood and fiber properties.

Eucalyptus is also listed as one of the U.S. Department of Energy's candidate biomass energy crops. Genome sequencing is essential for understanding the basis of its superior properties and for extending these attributes to other species. Genomics will also allow us to adapt Eucalyptus trees for green energy production in regions (such as the Southeastern USA) where they cannot currently be grown. The unique evolutionary history, keystone ecological status, and adaptation to marginal sites make Eucalyptus an excellent focus for expanding our knowledge of the evolution and adaptive biology of perennial plants.

(from JGI - The Joint Genome Institute)


This is a release of the initial 8X mapped Eucalyptus grandis BRASUZ1 genome assembly and a version 1.1 annotation.

The main genome assembly is approximately 691 Mb arranged in 4952 scaffolds
Approximately 641 Mb arranged in 32,762 contigs (~ 7.3% gap)
Scaffold N50 (L50) = 5 (53.9 Mb)
Contig N50 (L50) = 2261 (67.2 kb)
300 scaffolds are > 50kb in size, representing approximately 94.2% of the genome
36,376 total loci containing protein-coding transcripts
33,917 loci containing protein-coding transcripts on the 11 main linkage groups/chromosome assemblies (93% of total above)
Alternative Transcripts
9939 total alternatively spliced transcripts
9741 alternatively spliced transcripts on the 11 main linkage groups/chromosome assemblies (98% of total above)
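
The N50/L50 figures above can be recomputed from a list of scaffold or contig lengths. The following is a minimal sketch, not the pipeline used for this release: n50_stats is a hypothetical helper and the toy lengths are invented for illustration. It returns the values in the same order as they are reported above (number of sequences first, then the sequence length in parentheses).

```python
def n50_stats(lengths):
    """Return (count, length) for a set of sequence lengths.

    count  = smallest number of sequences, taken from longest to shortest,
             whose combined length reaches half of the total length
    length = length of the last sequence added to reach that point
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length


# Toy data (invented), lengths in bp.
toy_lengths = [50, 40, 30, 20, 10]
count, length = n50_stats(toy_lengths)
print(count, length)
```

With real data, the lengths would come from the scaffold or contig FASTA files on the FTP site.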

This new annotation was produced by manually filtering 8620 low-confidence gene models from the original v1.0 annotation. Stricter c-score and protein homology coverage thresholds were employed in this case, especially when considering partial transcripts missing a modeled start or stop codon. EST support was also examined to check that aligned coverage followed the same intron splicing pattern as the gene model. Filtered gene models were removed from consideration in the Phytozome v8 gene family generation, but remain searchable in Gbrowse and can be displayed as an additional transcript track. Associated FASTA and annotation info files are available on the FTP site.

Note: As of August 22, 2011, accession IDs and transcript names have been updated to better reflect gene locus location in the current v1.0 assembly:

  1. for gene loci as defined by the primary transcript dataset on the 11 main chromosome linkage groups
        - scaffold_1 is designated A, scaffold_2 is designated B, ... scaffold_11 is designated K
        - loci are numbered sequentially on each linkage group, beginning with 00001
        - primary transcripts receive a .1 suffix
        - alternatively spliced transcripts receive the suffix .2, .3, etc. as needed
  2. for gene loci on the remaining scaffolds (12 and above)
        - all scaffolds are designated L to indicate they are not in the main chromosome-level assembly
        - all loci are numbered sequentially, beginning with 00001
        - primary transcripts receive a .1 suffix
        - alternatively spliced transcripts receive the suffix .2, .3, etc. as needed
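
The naming rules above can be sketched in a few lines. This is an illustration only: scaffold_letter and transcript_name are hypothetical helpers, and the prefix argument is an assumption, since any fixed prefix that precedes the letter in the released IDs is not described here.

```python
import string

MAIN_SCAFFOLDS = 11  # scaffold_1 .. scaffold_11 are designated A .. K


def scaffold_letter(scaffold_number):
    """Map a scaffold number to its linkage-group letter (L for 12 and above)."""
    if scaffold_number < 1:
        raise ValueError("scaffold numbers start at 1")
    if scaffold_number <= MAIN_SCAFFOLDS:
        return string.ascii_uppercase[scaffold_number - 1]
    return "L"


def transcript_name(scaffold_number, locus_index, transcript_index=1, prefix=""):
    """Build letter + five-digit locus number + .N transcript suffix.

    For the main linkage groups, locus_index counts loci on that scaffold.
    For scaffolds 12 and above, it counts loci across all remaining scaffolds.
    prefix is a hypothetical parameter, empty by default.
    """
    letter = scaffold_letter(scaffold_number)
    return f"{prefix}{letter}{locus_index:05d}.{transcript_index}"


print(transcript_name(1, 1))       # primary transcript, first locus on scaffold_1
print(transcript_name(11, 42, 2))  # first alternative transcript on scaffold_11
print(transcript_name(300, 7))     # locus on a scaffold outside the main assembly
```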

The older accession IDs (e.g. Egrandis_v1_0.052539m) remain available for keyword searching and are displayed throughout Phytozome as E. grandis gene aliases. A comprehensive list of old-to-new accession ID mappings is available on the Eucalyptus grandis FTP site as a file named Egrandis_201_synonym.txt.
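
A mapping file of this kind can be loaded into a lookup table as sketched below. The layout of Egrandis_201_synonym.txt is not described here, so the tab delimiter and the column order are assumptions exposed as parameters; check the actual file before relying on them. load_synonyms is a hypothetical helper and the sample IDs are placeholders, not real accessions.

```python
import csv
import io


def load_synonyms(handle, old_col=0, new_col=1, delimiter="\t"):
    """Read a delimited text file into a dict of old ID -> new ID.

    old_col, new_col and delimiter are assumptions about the file layout.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) <= max(old_col, new_col):
            continue  # skip blank or short lines
        mapping[row[old_col]] = row[new_col]
    return mapping


# Placeholder content standing in for the real file.
sample = io.StringIO("old_id_1\tnew_id_1\n\nold_id_2\tnew_id_2\n")
aliases = load_synonyms(sample)
print(aliases["old_id_1"])
```

For the real file, open it with open(path, newline="") and pass the handle to load_synonyms.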


How was the genome sequenced?

How was the assembly generated?
The genome was assembled with Arachne by Jeremy Schmutz at HudsonAlpha.
How were repeats identified?
A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed from the library of repeat sequences and the library was then used to mask ~38% of the genome with RepeatMasker.
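
The masked fraction can be checked directly on a soft-masked FASTA file. The sketch below assumes masked bases are written in lowercase, the usual soft-masking convention; masked_fraction is a hypothetical helper and the toy sequence is invented.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy record (invented): 6 of 20 bases are lowercase.
toy = [">scaffold_demo", "ACGTacgtAC", "ggTTAACCGG"]
print(masked_fraction(toy))
```

Because the function accepts any iterable of lines, an open file handle for the genome FASTA can be passed in place of the toy list.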
How were ESTs aligned?
We aligned ~2.9M E. grandis EST sequences and ~2.4M EST sequences from sister Eucalyptus species using Brian Haas's PASA pipeline, which aligns each EST to its best location in the genome via GMAP and then filters hits to ensure proper splice boundaries.
How were plant proteins aligned?
Rice, Arabidopsis and grapevine proteins were downloaded from MSU, TAIR and Genoscope respectively. Soybean proteins were generated in our internal annotation pipeline at the JGI. All proteins were aligned to the soft-masked genome using gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.

How did you determine the Eucalyptus grandis gene set?

Gene prediction
To produce the current "egrandis1.0" gene set, we used the homology-based FgenesH and GenomeScan predictions. The best gene prediction at each locus was picked and integrated with EST assemblies using the PASA program. The gene set shown on the browser was generated from the above input gene models by Richard D. Hayes at JGI.
The gene prediction pipeline has the following components: proteins from diverse angiosperms and ~260,000 EST assemblies (from ~2.9M filtered E. grandis ESTs and ~2.4M EST sequences from sister Eucalyptus species, assembled with PASA) were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1 kb in each direction and submitted to FgenesH (provided by Asaf Salamov at JGI) and GenomeScan, along with related angiosperm proteins and/or ORFs from the overlapping EST assemblies. FgenesH identifies likely protein-coding exons, favoring regions that align well to the given homologous proteins.
These two sets of predictions were integrated with expressed sequence information using PASA (Haas et al. 2003) against ~260,000 Eucalyptus EST assemblies. The results were filtered to remove genes identified as transposon-related.
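
The locus-definition step described above (overlapping alignments grouped into putative loci, then extended by 1 kb in each direction) can be illustrated with simple interval merging. This is a simplified sketch under stated assumptions, not the JGI pipeline: define_loci is a hypothetical helper, coordinates are taken as 1-based inclusive, strand is ignored, and the toy alignments are invented.

```python
def define_loci(alignments, scaffold_lengths, flank=1000):
    """Merge overlapping alignments per scaffold and pad each merged region.

    alignments       -- iterable of (scaffold, start, end), 1-based inclusive
    scaffold_lengths -- dict of scaffold name -> length in bp
    flank            -- bases added on each side (1 kb in the text above)
    """
    by_scaffold = {}
    for scaffold, start, end in alignments:
        by_scaffold.setdefault(scaffold, []).append((start, end))

    loci = []
    for scaffold, spans in sorted(by_scaffold.items()):
        spans.sort()
        merged = []
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:          # overlaps the current region
                cur_end = max(cur_end, end)
            else:                         # gap: close the current region
                merged.append((cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end))

        limit = scaffold_lengths[scaffold]
        for start, end in merged:
            # Pad by the flank, clamped to the scaffold boundaries.
            loci.append((scaffold, max(1, start - flank), min(limit, end + flank)))
    return loci


# Toy alignments (invented).
toy_alignments = [
    ("scaffold_1", 5000, 6000),
    ("scaffold_1", 5500, 7000),
    ("scaffold_1", 20000, 21000),
    ("scaffold_2", 300, 900),
]
toy_lengths = {"scaffold_1": 21500, "scaffold_2": 50000}
regions = define_loci(toy_alignments, toy_lengths)
print(regions)
```

Padded regions are not merged again here, so two neighboring loci may overlap after extension.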
How come my gene is wrong?
FgenesH and GenomeScan are good gene prediction tools, but like all computational gene modeling algorithms, they are imperfect. In addition, EST and cDNA data are often incomplete. We hope that the aggravation of having an imperfect gene set is partially compensated by the rapid release of the data. Future gene sets will improve as assembly quality improves and as more expressed sequence data and genomic data from related species become available. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly.

What can I do with the Eucalyptus grandis dataset?

The Eucalyptus genome paper has been published and the data may be used without restriction. Please cite the following publication:

Myburg AA, Grattapaglia D, Tuskan GA, Hellsten U, Hayes RD, Grimwood J, Jenkins J, Lindquist E, Tice H, Bauer D, Goodstein DM, Dubchak I, Poliakov A, Mizrachi E, Kullan ARK, Hussey SG, Pinard D, van der Merwe K, Singh P, van Jaarsveld I, Silva-Junior OB, Togawa RC, Pappas MR, Faria DA, Sansaloni CP, et al., The genome of Eucalyptus grandis, Nature, published online 2014 June 11.

  ©2006-2014 University of California Regents. All rights reserved  