Eucalyptus grandis (Eucalyptus)



About the genome:


Overview

A major challenge for the achievement of a sustainable energy future is our understanding of the molecular basis of superior growth and adaptation in woody plants suitable for biomass production. Eucalyptus species are among the fastest growing woody plants in the world, with mean annual increments up to 100 cubic meters per hectare. Eucalyptus is the most valuable and most widely planted genus of plantation forest trees in the world (ca. 18 million hectares) due to its wide adaptability, extremely fast growth rate, good form, and excellent wood and fiber properties.

Eucalyptus is also listed as one of the U.S. Department of Energy's candidate biomass energy crops. Genome sequencing is essential for understanding the basis of its superior properties and for extending these attributes to other species. Genomics will also allow us to adapt Eucalyptus trees for green energy production in regions (such as the Southeastern USA) where it cannot currently be grown. The unique evolutionary history, keystone ecological status, and adaptation to marginal sites make Eucalyptus an excellent focus for expanding our knowledge of the evolution and adaptive biology of perennial plants.

(from JGI - The Joint Genome Institute)

Statistics

This is a release of the initial 8X mapped Eucalyptus grandis BRASUZ1 genome assembly and a version 1.1 annotation.

Genome
The main genome assembly is approximately 691 Mb arranged in 4952 scaffolds
Approximately 641 Mb arranged in 32,762 contigs (~ 7.3% gap)
Scaffold N50 (L50) = 5 (53.9 Mb)
Contig N50 (L50) = 2261 (67.2 kb)
300 scaffolds are > 50kb in size, representing approximately 94.2% of the genome
Loci
36,376 total loci containing protein-coding transcripts
33,917 loci containing protein-coding transcripts on the 11 main linkage groups/chromosome assemblies (93% of total above)
Alternative Transcripts
9939 total alternatively spliced transcripts
9741 alternatively spliced transcripts on the 11 main linkage groups/chromosome assemblies (98% of total above)
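
The scaffold and contig statistics above can be reproduced from a list of sequence lengths. The following is a minimal sketch, not the script used to produce the figures on this page. It returns the pair in the order the page lists it: first the number of sequences needed to reach half of the total assembled length, then the length of the shortest sequence in that set. Other sources sometimes swap the N50 and L50 labels. The function name and the toy lengths are hypothetical.

```python
def n50_stats(lengths):
    """Return (count, length) for a collection of sequence lengths.

    count  : smallest number of sequences, taken longest first, whose
             combined length reaches at least half of the total length
    length : length of the last (shortest) sequence in that set
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length


# Hypothetical toy assembly (lengths in bp), not the E. grandis scaffolds.
toy_lengths = [50, 40, 30, 20, 10]
count, length = n50_stats(toy_lengths)
print(count, length)
```

Applied to the real scaffold lengths, the same calculation would be expected to give the count and length reported above, assuming the statistics were computed over the main genome assembly.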

This new annotation was produced by manually filtering 8620 low-confidence gene models from the original v1.0 annotation. Stricter c-score and protein homology coverage thresholds were employed in this case, especially when considering partial transcripts missing a modeled start or stop codon. EST support was also examined to check that aligned coverage followed the same intron splicing pattern as the gene model. Filtered gene models were removed from consideration in the Phytozome v8 gene family generation, but remain searchable in Gbrowse and can be displayed as an additional transcript track. Associated FASTA and annotation info files are available on the FTP site.

Note: As of August 22, 2011, accession IDs and transcript names have been updated to better reflect gene locus location in the current v1.0 assembly:

  1. for gene loci as defined by the primary transcript dataset on the 11 main chromosome linkage groups
        - scaffold_1 is designated A, scaffold_2 is designated B, ... scaffold_11 is designated K
        - loci are numbered sequentially on each linkage group, beginning with 00001
        - primary transcripts receive a .1 suffix
        - alternatively spliced transcripts receive the suffix .2, .3, etc. as needed
  2. for gene loci on the remaining scaffolds (12 and above)
        - all scaffolds are designated L to indicate they are not in the main chromosome-level assembly
        - all loci are numbered sequentially, beginning with 00001
        - primary transcripts receive a .1 suffix
        - alternatively spliced transcripts receive the suffix .2, .3, etc. as needed
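
The naming rules above can be sketched in a few lines. This is an illustration only: the page does not spell out the full identifier syntax, so the optional prefix and the exact way the letter, locus number, and suffix are joined are assumptions, and the function names are hypothetical.

```python
import string

MAIN_GROUPS = 11  # scaffold_1 .. scaffold_11 are designated A .. K


def scaffold_letter(scaffold_number):
    """Map a scaffold number to its linkage group letter.

    Scaffolds 1-11 map to A-K; every scaffold from 12 upward maps to L.
    """
    if scaffold_number < 1:
        raise ValueError("scaffold numbers start at 1")
    if scaffold_number <= MAIN_GROUPS:
        return string.ascii_uppercase[scaffold_number - 1]
    return "L"


def transcript_name(scaffold_number, locus_index, transcript_index=1, prefix=""):
    """Build a transcript name from the rules in the list above.

    locus_index is zero-padded to five digits (00001, 00002, ...).
    transcript_index 1 is the primary transcript; 2, 3, ... are
    alternatively spliced transcripts.
    The prefix and the joining format are assumptions for illustration.
    """
    if locus_index < 1 or transcript_index < 1:
        raise ValueError("locus and transcript indices start at 1")
    letter = scaffold_letter(scaffold_number)
    return f"{prefix}{letter}{locus_index:05d}.{transcript_index}"


print(transcript_name(1, 1))       # primary transcript, first locus on scaffold_1
print(transcript_name(11, 42, 2))  # second transcript of locus 42 on scaffold_11
print(transcript_name(57, 3))      # locus on a scaffold outside the main 11
```

Note that for scaffolds 12 and above the locus number is a single running count across all of those scaffolds, so the scaffold number itself cannot be recovered from the name.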

The older accession IDs (e.g. Egrandis_v1_0.052539m) remain available for keyword searching and are displayed throughout Phytozome as E. grandis gene aliases. A comprehensive list of these old-to-new accession ID mappings is available on the Eucalyptus grandis FTP site as a file named Egrandis_201_synonym.txt.

FAQ

How was the genome sequenced?

How was the assembly generated?
The genome was assembled with Arachne by Jeremy Schmutz at HudsonAlpha.
How were repeats identified?
A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed from the library of repeat sequences and the library was then used to mask ~38% of the genome with RepeatMasker.
How were ESTs aligned?
We aligned ~2.9M E. grandis EST sequences and ~2.4M EST sequences from sister Eucalyptus species using Brian Haas's PASA pipeline, which aligns ESTs to the best place in the genome via gmap and then filters hits to ensure proper splice boundaries.
How were plant proteins aligned?
Rice, Arabidopsis and grapevine proteins were downloaded from MSU, TAIR and Genoscope respectively. Soybean proteins were generated in our internal annotation pipeline at the JGI. All proteins were aligned to the soft-masked genome using gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.
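
For readers who download the soft-masked genome mentioned above, the masked fraction (about 38% according to the repeat answer) can be estimated by counting lowercase bases. This is a minimal sketch that assumes masked bases are written in lowercase, which is the usual soft-masking convention. The function name and the toy sequence are hypothetical, and gap characters (N) are counted as unmasked bases here.

```python
def masked_fraction(fasta_text):
    """Return the fraction of sequence characters that are lowercase.

    Header lines (starting with '>') and blank lines are skipped.
    Assumes soft masking is encoded as lowercase letters.
    """
    masked = 0
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence found")
    return masked / total


# Hypothetical toy record, not E. grandis sequence.
toy_fasta = ">scaffold_x\nACGTacgtAC\nGGttaaCCGG\n"
print(f"{masked_fraction(toy_fasta):.0%} of bases are soft-masked")
```

For a genome-sized file it would be better to read the FASTA line by line from disk rather than from a single string, but the counting logic stays the same.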

How did you determine the Eucalyptus grandis gene set?

Gene prediction
To produce the current "egrandis1.0" gene set, we used the homology-based FgenesH and GenomeScan predictions. The best gene prediction at each locus is picked and integrated with EST assemblies using the PASA program. The gene set shown on the browser was generated from the above input gene models by Richard D. Hayes at JGI.
The gene prediction pipeline has the following components: proteins from diverse angiosperms and ~260,000 EST assemblies (from ~2.9M filtered E. grandis ESTs and ~2.4M EST sequences from sister Eucalyptus species, assembled with PASA) were aligned to the genome, and their overlaps used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1 kb in each direction and submitted to FgenesH (provided by Asaf Salamov at JGI) and GenomeScan, along with related angiosperm proteins and/or ORFs from the overlapping EST assemblies. FgenesH identifies likely protein-coding exons, favoring regions that align well to the given homologous proteins.
These two sets of predictions were integrated with expressed sequence information using PASA (Haas et al. 2003) against ~260,000 Eucalyptus EST assemblies. The results were filtered to remove genes identified as transposon-related.
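
The locus definition step described above (overlapping alignments define a putative locus, which is then extended by 1 kb in each direction) can be sketched as follows. This is a simplified illustration, not the JGI pipeline: it ignores strand, treats any coordinate overlap as belonging to the same locus, and uses hypothetical function names and coordinates.

```python
def merge_overlaps(intervals):
    """Merge overlapping (start, end) intervals on one scaffold.

    Coordinates are 1-based and inclusive. Returns merged intervals
    sorted by start position.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous locus: grow it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extend(locus, scaffold_length, flank=1000):
    """Extend a locus by `flank` bp on each side, clamped to the scaffold."""
    start, end = locus
    return max(1, start - flank), min(scaffold_length, end + flank)


# Hypothetical protein and EST alignment coordinates on a 21,000 bp scaffold.
alignments = [(5000, 6200), (6000, 7500), (500, 900), (20000, 20800)]
loci = merge_overlaps(alignments)
regions = [extend(locus, scaffold_length=21000) for locus in loci]
print(loci)
print(regions)
```

Regions extended this way can overlap their neighbors. How such cases were handled is not stated on this page, so the sketch leaves them as they are.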
How come my gene is wrong?
FgenesH and GenomeScan are good gene prediction tools, but like all computational gene modeling algorithms, they are imperfect. In addition, EST and cDNA data are often incomplete. We hope that the aggravation of having an imperfect gene set is partially compensated by the rapid release of the data. Future gene sets will improve as assembly quality improves along with expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly.

What can I do with the Eucalyptus grandis dataset?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of Eucalyptus. Please cite "Eucalyptus grandis Genome Project 2010, http://www.phytozome.net/eucalyptus".
I think I found an error. What should I do?
If you would like to bring any items to our attention, please send email to phytozome@jgi-psf.org.
I would like to do a large-scale comparison of Eucalyptus grandis to other genomes, and/or a global analysis of its gene content.
As a public service, the Department of Energy's Joint Genome Institute (JGI) is making the completed Eucalyptus grandis genome sequence available before scientific publication according to the Ft. Lauderdale Accord. This balances the imperative of the DOE and the JGI that the data from its sequencing projects be made available as soon and as completely as possible with the desire of contributing scientists and the JGI to reserve a reasonable period of time to publish on the genome sequencing and analysis without concerns about preemption by other groups.
JGI policy is that early release should aid the progress of science. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole genome or chromosome scale prior to publication by JGI and/or its collaborators of a comprehensive genome analysis ("Reserved Analyses"). Reserved Analyses include the identification of complete (whole genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome-scale comparisons with other species.
The embargo on publication of Reserved Analyses by researchers outside of the Eucalyptus Genome Sequencing Project is expected to extend until the publication of the results of the sequencing project is accepted. Scientific users are free to publish papers dealing with specific genes or small sets of genes using the sequence data. If these data are used for publication, the following acknowledgment should be included: 'These sequence data were produced by the US Department of Energy Joint Genome Institute'. This letter has been circulated to Journal Editors so that they are aware of the conditions of access and publication detailed above.
These data may be freely downloaded and used by all who respect the restrictions in the previous paragraphs. The assembly and sequence data should not be redistributed or repackaged without permission from the JGI. Any redistribution of the data during the embargo period should carry this notice: "The Joint Genome Institute provides these data in good faith, but makes no warranty, expressed or implied, nor assumes any legal liability or responsibility for any purpose for which the data are used. Once the sequence is moved to unreserved status, the data will be freely available for any subsequent use."
We prefer that potential users of this sequence assembly contact us at phytozome@jgi-psf.org and the Eucalyptus Genome Sequencing Project co-directors listed below with their plans, to ensure that the proposed usage of the sequence data does not constitute a Reserved Analysis:
Zander Myburg: zander.myburg@up.ac.za (University of Pretoria)
Dario Grattapaglia: dario@cenargen.embrapa.br (EMBRAPA and Catholic University of Brasilia)
Jerry Tuskan: tuskanga@ornl.gov (Oak Ridge National Laboratory)

  ©2006-2014 University of California Regents. All rights reserved  