Thellungiella halophila (Salt cress a.k.a. Eutrema salsugineum)
About the genome:
The genome sequence of Thellungiella halophila, a salt-tolerant relative of both the genetic model Arabidopsis thaliana (Arabidopsis) and agriculturally important members of the genus Brassica (e.g., Oilseed Rape), will be critical for understanding the molecular mechanisms of salt tolerance and will be an essential part of strategies to engineer and breed more salt-tolerant crop plants.
PLEASE NOTE: the genus Thellungiella is now known as Eutrema, so the species formerly known as Thellungiella halophila is now known as Eutrema halophilum. In addition, the species of Eutrema sequenced at the JGI has been determined to be salsugineum (a close relative of halophila). Therefore, this genome is actually classified as Eutrema salsugineum. We will be updating all labels that reference Thellungiella halophila to refer instead to Eutrema salsugineum in the next release of Phytozome.
This is a release of the initial 8X mapped Thellungiella halophila genome assembly and a preliminary annotation.
- The main genome assembly is approximately 243.1 Mb arranged in 639 scaffolds
- Approximately 238.5 Mb arranged in ~3,511 contigs (~1.9% gap)
- Scaffold N50 (L50) = 8 (13.4 Mb)
- Contig N50 (L50) = 272 (251.6 kb)
- 35 scaffolds are > 50kb in size, representing approximately 98.3% of the genome
- 26,351 total loci containing protein-coding transcripts
- Alternative Transcripts
- 2,933 total alternatively spliced transcripts
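The assembly statistics listed above (a sequence count followed by a length in parentheses, and the gap percentage) can be reproduced from a list of sequence lengths. The following is a minimal sketch, not the pipeline used for this release: the function names `n50_stats` and `gap_percent` are hypothetical, and the toy lengths are invented for illustration. Only the 243.1 Mb and 238.5 Mb totals come from this page.

```python
def n50_stats(lengths):
    """Return (count, length) for a set of sequence lengths.

    count  = smallest number of longest sequences whose combined length
             reaches at least half of the total length
    length = length of the last sequence added to reach that half
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


def gap_percent(scaffold_total, contig_total):
    """Percentage of the scaffolded assembly not covered by contig sequence."""
    return 100.0 * (scaffold_total - contig_total) / scaffold_total


# Toy lengths (invented): total 150, so half is 75.
toy_lengths = [50, 40, 30, 20, 10]
toy_count, toy_length = n50_stats(toy_lengths)

# Totals quoted on this page, in Mb.
reported_gap = gap_percent(243.1, 238.5)
print(toy_count, toy_length, round(reported_gap, 1))
```

With the totals quoted above, the gap fraction works out to (243.1 - 238.5) / 243.1, which rounds to the ~1.9% reported.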
How was the genome sequenced?
- How was the assembly generated?
- The genome was assembled with Arachne by Jeremy Schmutz and Jerry Jenkins at HudsonAlpha.
- How were repeats identified?
- A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome assembly to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed. Arabidopsis repeats were extracted from RepBase v 20090604 and added to the de novo repeats to make a custom repeat library. This library was then used to mask 45.5% of the genome with RepeatMasker.
- How were ESTs aligned?
- We obtained 1.6M 454 ESTs from 5 Thellungiella halophila cDNA libraries. These were filtered for contaminating sequences, then aligned and assembled using Brian Haas's PASA pipeline, which aligns ESTs to the best place in the genome using gmap and then joins hits into assemblies (transcripts) where they share splice sites.
- How were plant proteins aligned?
- Arabidopsis and grapevine proteins were downloaded from TAIR and Genoscope respectively. Soybean and Populus proteins were generated in our internal annotation pipeline at the JGI. All proteins were aligned to the soft-masked genome using gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange in Gbrowse) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.
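The repeat-masking step above reports that 45.5% of the genome was masked, and the protein alignments were run against the soft-masked genome. In a soft-masked FASTA file, masked bases are conventionally written in lowercase, so the masked fraction can be estimated by counting lowercase characters. The sketch below assumes that convention; the function name `masked_fraction` and the toy sequences are hypothetical, and whether the reported figure was computed exactly this way (for example, how gap `N`s were counted) is not stated on this page.

```python
import io


def masked_fraction(fasta_handle):
    """Fraction of sequence characters that are lowercase (soft-masked).

    Every sequence character, including N, counts toward the total.
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy soft-masked FASTA (invented): 8 of 20 bases are lowercase.
toy_fasta = io.StringIO(
    ">scaffold_1\n"
    "ACGTacgtAC\n"
    ">scaffold_2\n"
    "ggggTTTTTT\n"
)
toy_fraction = masked_fraction(toy_fasta)
print(f"{100 * toy_fraction:.1f}% masked")
```

For a real assembly, the same function can be given an open file handle instead of the `io.StringIO` object.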
How did you determine the Thellungiella halophila gene set?
- Gene prediction
- To produce the Thellungiella v1.0 gene set, we used the homology-based gene prediction programs GenomeScan and FgenesH, which integrate protein and EST evidence into ab initio gene predictions. The best gene prediction at each locus was picked and integrated with EST assemblies using the PASA program (see above). The gene set shown on the browser was generated from the above input gene models by Simon Prochnik at JGI. The gene prediction pipeline has the following components: proteins from diverse angiosperms and 61,797 Thellungiella EST assemblies (from 1.6M filtered ESTs assembled with PASA) were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic regions were extended by up to 2 kb in each direction and submitted to GenomeScan and FgenesH (provided by Asaf Salamov at JGI), along with related angiosperm proteins and/or ORFs from the overlapping EST assemblies. GenomeScan and FgenesH identify likely protein-coding exons, favoring regions that align well to the given homologous proteins. These predictions were integrated with expressed sequence information using PASA (Haas et al. 2003) against the 61,797 PASA EST assemblies. The results were filtered to remove genes identified as transposon-related.
- How come my gene is wrong?
- GenomeScan and FgenesH are good gene predictors but, like all computational gene modeling algorithms, they are imperfect. In addition, EST and cDNA data are often incomplete. For these reasons, there can be errors in gene models. We hope that the inconvenience of having an imperfect gene set is partially compensated by the rapid release of the data. Future gene sets will improve as assembly quality improves, along with expressed sequence data and genomic data from related species. However, the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly.
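The locus-definition step in the gene prediction description (overlapping protein and EST alignments define a putative locus, which is then extended by up to 2 kb in each direction) can be illustrated with a simple interval merge. This is only a sketch under stated assumptions, not the JGI pipeline: the function name `define_loci`, the coordinate convention (1-based, inclusive), and the toy alignments are hypothetical, and strand is ignored for simplicity.

```python
from collections import defaultdict


def define_loci(alignments, scaffold_lengths, flank=2000):
    """Merge overlapping alignments per scaffold, then extend each merged
    region by up to `flank` bases on both sides, clipped to the scaffold.

    alignments: iterable of (scaffold, start, end), 1-based inclusive.
    scaffold_lengths: dict mapping scaffold name to its length.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end in alignments:
        by_scaffold[scaffold].append((start, end))

    loci = []
    for scaffold, spans in sorted(by_scaffold.items()):
        spans.sort()
        merged = []
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:              # overlaps the current region
                cur_end = max(cur_end, end)
            else:                             # gap: close region, start a new one
                merged.append((cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end))

        limit = scaffold_lengths[scaffold]
        for start, end in merged:
            # "up to" 2 kb: the extension is clipped at the scaffold ends
            loci.append((scaffold, max(1, start - flank), min(limit, end + flank)))
    return loci


# Toy alignments (invented coordinates).
toy_alignments = [
    ("scaffold_1", 5000, 6000),    # e.g. a protein alignment
    ("scaffold_1", 5500, 7000),    # e.g. an overlapping EST assembly
    ("scaffold_1", 20000, 21000),
    ("scaffold_2", 500, 1500),
]
toy_scaffold_lengths = {"scaffold_1": 22000, "scaffold_2": 10000}
toy_loci = define_loci(toy_alignments, toy_scaffold_lengths)
for locus in toy_loci:
    print(locus)
```

In this toy case the first two alignments merge into one locus, and the extensions of the last two regions are clipped at the scaffold boundaries. A fuller treatment would also track strand and the source of each alignment.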