Populus trichocarpa (Western poplar)



About the genome:


Overview

Populus trichocarpa was sequenced ten times over to attain the highest quality standards and to produce a relatively contiguous, high quality plant genome. Poplar was chosen as the first tree DNA to be sequenced because of its relatively compact genetic complement, some 50 times smaller than the genome of pine, making the poplar an ideal model system for trees.

The poplar genome, divided into 19 chromosomes, is four times larger than the genome of the first sequenced plant, Arabidopsis thaliana. Analysis of the assembled genome reveals a whole-genome duplication event; about 8,000 pairs of duplicated genes from that event survived in the poplar genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages (from JGI, the Joint Genome Institute, and Tuskan et al.).

This is the second improved version of poplar (v3 assembly) and includes 81 Mb of finished clone sequence combined with a new, high density poplar map. This genome release represents 6 years of genome improvement, advances in genome assembly, and poplar experimental work funded by the DOE.

Assembly

We constructed the v3 Populus genome assembly with Arachne version 20071016HA, attempting to merge the outbred haplotypes and making an extensive effort to remove contaminating sequence. We also integrated 81 Mb of finished clone sequence and the latest genetic map information to construct the 19 chromosome-sized scaffolds, which contain 388 Mb of sequence, a majority of the assembled poplar sequence. We took care to minimize alternative haplotypes within the assembly. The first 19 scaffolds of the assembly correspond to the poplar chromosomes. The full release covers 423 Mb of assembled sequence at an average read depth of 9.44x.

Annotation

75,566 RNAseq transcript assemblies were constructed from about 0.6B pairs of tremulas paired-end Illumina RNAseq reads using PERTRAN (Shu et al., manuscript in preparation). Transcript assemblies (86,677 from Populus trichocarpa and 151,316 from related poplar ESTs/mRNAs) were constructed using PASA from the Populus trichocarpa RNAseq transcript assemblies, ESTs/mRNAs, and ESTs/mRNAs of other poplar species, including >2.6M 454-sequenced Populus deltoides EST reads generated at JGI. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from the Arabidopsis thaliana, rice, soybean, or grape genomes to the P. trichocarpa genome, repeat-soft-masked with RepeatMasker. Gene models were predicted by homology-based predictors, mainly FGENESH+, FGENESH_EST (similar to FGENESH+, but taking ESTs as splice site and intron input instead of protein/translated ORFs), and GenomeScan.

The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. The Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs.

PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript whose CDS overlaps with repeats by less than 20% was selected if its Cscore is at least 0.5 and its protein coverage is at least 0.5, or if it has EST coverage. A gene model whose CDS overlaps with repeats by more than 20% was selected only if its Cscore is at least 0.9 and its homology coverage is at least 70%. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed.
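
The selection thresholds described above can be written out as a small decision function. The following is only an illustrative sketch, not the JGI pipeline code: the `Transcript` class, its field names, and the function names are hypothetical, and the handling of a repeat overlap of exactly 20% (which the text does not specify) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # All field names are hypothetical; fractions are expressed in the range 0..1.
    cscore: float                  # BLASTP score divided by the MBH BLASTP score
    protein_coverage: float        # best fraction of the protein aligned to a homolog
    has_est_coverage: bool         # True if the transcript has EST support
    repeat_overlap: float          # fraction of the CDS overlapping repeats
    pfam_te_fraction: float = 0.0  # fraction of the protein in Pfam TE domains


def cscore(blastp_score: float, mbh_blastp_score: float) -> float:
    """Cscore as defined in the text: BLASTP score relative to the MBH BLASTP score."""
    return blastp_score / mbh_blastp_score


def is_selected(t: Transcript) -> bool:
    """Apply the selection thresholds stated in the annotation description."""
    if t.repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST coverage is enough.
        passes = (t.cscore >= 0.5 and t.protein_coverage >= 0.5) or t.has_est_coverage
    else:
        # Assumption: exactly 20% overlap is treated like "more than 20%".
        passes = t.cscore >= 0.9 and t.protein_coverage >= 0.70
    # Models with more than 30% of the protein in Pfam TE domains are removed.
    return passes and t.pfam_te_fraction <= 0.30
```

For example, a transcript with EST coverage but a CDS that is half repeats is rejected unless its homology support is strong: `is_selected(Transcript(0.6, 0.6, True, 0.5))` is `False`, while `is_selected(Transcript(0.95, 0.8, False, 0.5))` is `True`.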

v2.2 loci were tentatively mapped to v3.0 loci by BLAT alignment of both the v2.2 locus sequence, including introns, bounded by its CDS range and the v2.2 locus sequence, including introns, bounded by its range extended by up to 1 kb. For each locus pairing, the proteins were aligned to each other. When the MBH protein is >= 70% identical, the v2.2 locus name becomes a v3.0 locus synonym (82% of v2.2 loci mapped this way). When the MBH protein is >= 90% identical, the v2.2 synonym and defLine from v1.1 become the v3.0 locus synonym and defLine, respectively (~90.5% of v2.2 loci with a synonym or defLine mapped).
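
The two identity thresholds for carrying v2.2 information over to v3.0 can be summarized as follows. This is a minimal sketch of the rule as stated, assuming the MBH protein identity has already been computed as a percentage; the function name and the dictionary keys are hypothetical.

```python
def carried_over_annotations(mbh_identity_pct: float) -> dict:
    """Return which v2.2 annotations transfer to the paired v3.0 locus.

    mbh_identity_pct is the percent identity of the mutual-best-hit
    protein alignment between the v2.2 and v3.0 loci.
    """
    return {
        # At >= 70% identity the v2.2 locus name becomes a v3.0 synonym.
        "v22_name_as_synonym": mbh_identity_pct >= 70,
        # At >= 90% identity the v2.2 synonym and defLine (from v1.1)
        # are also transferred.
        "v11_synonym_and_defline": mbh_identity_pct >= 90,
    }
```

A pairing at 85% identity would therefore receive the v2.2 locus name as a synonym but not the older synonym and defLine.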

Statistics

This release of Phytozome includes the JGI v3.0 gene annotation of assembly v3.
Genome
The main genome assembly is approximately 422.9 Mb arranged in 1,446 scaffolds
Approximately 422.9 Mb arranged in 8,313 contigs (~2.585 gap)
Scaffold N50 (L50) = 8 (19.5 Mb)
Contig N50 (L50) = 206 (552.8 Kb)
181 scaffolds are > 50 Kb in size, representing approximately 97.3% of the genome
Loci
41,335 loci containing protein-coding transcripts
Transcripts
73,013 protein-coding transcripts
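
For readers unfamiliar with the scaffold and contig summary figures above, the sketch below shows the usual way such a pair of numbers is derived: sort the sequence lengths from longest to shortest and accumulate them until half of the total assembled length is reached. The function name and the toy lengths are illustrative assumptions, not JGI code or data. Note that this page labels the sequence count "N50" and the length "L50"; many tools use the two labels the other way around, so the function simply returns both values.

```python
def half_assembly_stats(lengths):
    """Return (count, length) at the point where the longest sequences
    together cover at least half of the total assembled length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return count, length
    raise ValueError("lengths must contain at least one sequence")


# Toy example with made-up lengths (not the poplar scaffolds):
count, length = half_assembly_stats([50, 40, 30, 20, 10])
print(count, length)  # 2 40
```

Applied to the actual v3 scaffold lengths, this calculation would be expected to reproduce the count of 8 and the length of about 19.5 Mb reported above.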

What can I do with the Populus trichocarpa v3 dataset?

In accordance with the Fort Lauderdale agreement and the DOE-JGI mission, we are making the poplar v3 genome available from the DOE JGI and our collaborators prior to publication of the data. We are making these data available with the expectation and desire to publish an updated whole-genome poplar analysis in a reasonable time without preemption by other groups. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole-genome or chromosome scale prior to publication by the DOE JGI and/or its collaborators without consulting the DOE JGI ("Reserved Analyses"). "Reserved Analyses" include the identification of complete (whole-genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome-scale comparisons with other species, including other poplar species and cultivars. You are welcome to publish gene-level, local-region, or gene-family-level analyses of poplar. Please cite Tuskan et al. (Science, 2006) and the data source as "Populus trichocarpa v3.0, DOE-JGI, http://www.phytozome.net/poplar". For specific questions about data use, please contact Jeremy Schmutz (jschmutz@hudsonalpha.org).
  ©2006-2014 University of California Regents. All rights reserved  