Overview

Phytozome is a joint project of the Department of Energy's Joint Genome Institute and the Center for Integrative Genomics to facilitate comparative genomic studies among green plants. Clusters of orthologous and paralogous genes, representing the modern descendants of ancestral gene sets, are constructed at key phylogenetic nodes. These clusters provide easy access to clade-specific orthology and paralogy relationships, as well as to clade-specific genes and gene expansions. As of release v5.0, Phytozome provides access to twenty sequenced and annotated green plant genomes, which have been clustered into gene families at fifteen evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable.

Included Organisms

The proteomes of the following organisms are clustered in release v5.0 of Phytozome:

Organism | Common name | Source
Arabidopsis thaliana | Mouse-ear cress | TAIR release 9 acquired from TAIR
Arabidopsis lyrata | Lyre-leaved rock cress | JGI release 1.0
Carica papaya | Papaya | ASGPB release of 2007
Populus trichocarpa | Poplar | JGI v2.0 annotation of the v2 assembly
Medicago truncatula | Barrel medic | Release Mt3.0 from the Medicago Genome Sequence Consortium
Glycine max | Soybean | JGI Glyma1.0 annotation of the chromosome-based Glyma1 assembly
Ricinus communis | Castor bean | TIGR release 0.1
Manihot esculenta | Cassava | JGI/Roche v1.1 assembly and annotation
Cucumis sativus | Cucumber | Roche 454-XLR assembly and JGI v1 annotation
Vitis vinifera | Grape | Sept 2007 annotation from Genoscope
Sorghum bicolor | Sweet sorghum | Sbi1.4 models from MIPS/PASA on v1.0 assembly
Zea mays | Maize | Protein coding models from Maizesequence.org release 4a.53
Oryza sativa | Rice | MSU Release 6.0 of the Rice Genome Annotation
Brachypodium distachyon | Purple false brome | JGI v1.0 8x assembly of strain Bd21 with JGI/MIPS PASA annotation
Mimulus guttatus | Monkey flower | JGI 7x assembly of strain IM62, annotation v1.0
Selaginella moellendorffii | Spikemoss | JGI v1.0 assembly and annotation
Physcomitrella patens | Moss | JGI v1.1 assembly and annotation
Chlamydomonas reinhardtii | Green alga | Augustus u9 annotation of JGI v4 assembly

Nodes

Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:

Viridiplantae (~475 Mya): Genes representing the most recent common ancestor of Embryophytes and Chlorophyta (represented by the alga Chlamydomonas).
Embryophyte (~450 Mya): Genes representing the most recent common ancestor of Tracheophytes and Bryophyta (represented by Physcomitrella).
Tracheophyte (~420 Mya): Genes representing the most recent common ancestor of Selaginella and the angiosperms.
Angiosperm (~160 Mya): Genes representing the most recent common ancestor of grasses and eudicots.
Pre-hexaploidy Rosid (~130 Mya): Genes representing the most recent common ancestor of grape and the eurosids before the rosid hexaploidy.
Post-hexaploidy Rosid (~120 Mya): Genes representing the most recent common ancestor of grape and the eurosids after the rosid hexaploidy.
Eurosid I (~110 Mya): Genes representing the most recent common ancestor of poplar and legumes.
Eurosid II (~110 Mya): Genes representing the most recent common ancestor of Arabidopsis and papaya.
Eudicot (~155 Mya): Genes representing the most recent common ancestor of Mimulus and the rosids.
Malpighiales (~78 Mya): Genes representing the most recent common ancestor of poplar, castor bean, and cassava.
Legume (~50 Mya): Genes representing the most recent common ancestor of Medicago and soybean.
Arabidopsis (~50 Mya): Genes representing the most recent common ancestor of Arabidopsis thaliana and Arabidopsis lyrata.
Pre-duplication Grass (~70 Mya): Genes representing the most recent common ancestor of Sorghum, rice, maize, and Brachypodium before the grass whole genome duplication.
Post-duplication Grass (~50 Mya): Genes representing the most recent common ancestor of Sorghum, rice, maize, and Brachypodium after the grass whole genome duplication.
Andropogoneae (~12 Mya): Genes representing the most recent common ancestor of Sorghum and maize.

Clustering Methodology

All-against-all blastp alignments were performed for all 18 plants to be clustered. The bit score per unit peptide length was chosen as the similarity metric between two peptides. Clustering was performed hierarchically, from the crown nodes to the root, creating ingroup paralogous clusters and merging ingroup and outgroup clusters across nodes. (All organisms reachable via the same branch from a given node are considered ingroup with respect to that node; organisms not reachable via that branch constitute an outgroup to the ingroup organisms.) First, paralogous single-organism ingroup clusters are constructed for each organism by comparing intra-organism similarity against inter-organism similarity: only peptides that are more similar to each other than either is to any outgroup peptide are joined into clusters (the actual thresholding rule is more complicated, to avoid the spurious creation of large paralogous clusters of weakly similar peptides). Then, clusters are merged across nodes via a mutual-best-hit criterion.
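
The similarity metric and the mutual-best-hit criterion described above can be illustrated with a minimal Python sketch. It is not the Phytozome implementation: the function names and the example identifiers are hypothetical, and normalizing by the length of the longer peptide is an assumption, since the text does not say which length is used.

```python
def normalized_score(bit_score, len_a, len_b):
    """Bit score per unit peptide length.

    Assumption: the score is divided by the length of the longer peptide.
    """
    return bit_score / max(len_a, len_b)


def mutual_best_hits(scores):
    """Return the pairs (a, b) in which each peptide is the other's best hit.

    scores maps (a, b) -> similarity, with a drawn from one group
    and b drawn from the other group.
    """
    best_for_a = {}
    best_for_b = {}
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    return {(a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a}


# Hypothetical blastp hits: (query, subject, bit score, query length, subject length)
hits = [
    ("At1", "Al1", 400.0, 200, 210),
    ("At1", "Al2", 150.0, 200, 300),
    ("At2", "Al2", 500.0, 290, 300),
    ("At2", "Al1", 100.0, 290, 210),
]
scores = {(q, s): normalized_score(bits, lq, ls) for q, s, bits, lq, ls in hits}
pairs = mutual_best_hits(scores)
```

In this toy input, At1 pairs with Al1 and At2 pairs with Al2, so those two pairs would be the candidates for merging clusters across a node.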

Synteny-based ortholog and paralog identification is incorporated in the following manner. For all nodes with significant synteny, orthologs are first assigned to syntenic segments. Syntenic segments are defined as regions bounded by two or more aligning genes, with a maximum of 10 non-aligning genes between pairs of aligning genes. Aligning genes are defined as those with an E-value < 1e-25 and a coverage greater than 60%, where coverage is the length of the alignment divided by the length of the longer peptide. The average 4DTv of these segments is examined to determine the relative timings of duplications and speciations (4DTv is a measure of the rate of transversions at fourfold degenerate coding sites in the gene). Orthologs are assigned as 1:1 aligning genes that occur on syntenic segments from the correct 4DTv era for that node, and on which mutual-best hits account for at least 20% of the hits on that segment. If aligning genes have other than a 1:1 relationship between segments (e.g. tandem duplication), genes are assigned as orthologs only if they are mutual-best hits. The methodology for identifying paralogs is the same as for orthologs, except that tandem duplications (genes assorting in other than 1:1 across segments) are included and paralogs are required to be more recent (lower 4DTv) than the maximum 4DTv for that node.
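
The aligning-gene thresholds and the gap rule for syntenic segments can be sketched as follows. This is a deliberately simplified, one-dimensional illustration: it walks along a single chromosome and assumes it is already known which genes align to the partner region, whereas a real synteny search must also check gene order in the second genome. The function names and default parameter names are hypothetical; only the numeric thresholds come from the text.

```python
def is_aligning(evalue, aln_len, len_a, len_b, max_evalue=1e-25, min_cov=0.60):
    """Apply the E-value and coverage thresholds quoted in the text.

    Coverage is the alignment length divided by the longer peptide length.
    """
    coverage = aln_len / max(len_a, len_b)
    return evalue < max_evalue and coverage > min_cov


def syntenic_segments(aligned_flags, max_gap=10, min_anchors=2):
    """Group aligning genes into segments along one chromosome.

    aligned_flags lists the genes in chromosome order; True marks an aligning gene.
    A segment is closed when more than max_gap non-aligning genes separate
    two consecutive aligning genes. Returns (first_index, last_index) pairs.
    """
    segments = []
    anchors = []  # indices of aligning genes in the segment being built
    for i, aligned in enumerate(aligned_flags):
        if not aligned:
            continue
        if anchors and i - anchors[-1] - 1 > max_gap:
            if len(anchors) >= min_anchors:
                segments.append((anchors[0], anchors[-1]))
            anchors = []
        anchors.append(i)
    if len(anchors) >= min_anchors:
        segments.append((anchors[0], anchors[-1]))
    return segments


# Two aligning genes, a run of 11 non-aligning genes, then two more aligning genes.
flags = [True, False, True] + [False] * 11 + [True, True]
segments = syntenic_segments(flags)
```

Because the run of 11 non-aligning genes exceeds the limit of 10, the example yields two separate segments rather than one.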

This process continues down to the root, with paralogous clusters being merged via comparison of ingroup to outgroup similarity, and mutual best hits being used to merge clusters across nodes. Minimum coverage thresholds are used to minimize the clustering of multi-domain proteins that may share only a single common domain, and the clustering of peptides from fragmentary gene predictions. Maximum 4DTv thresholds are used to eliminate pairwise homologies corresponding to divergence times more ancient than the node in question. The complete clustering algorithm will be discussed in detail in an upcoming publication.
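
To make the 4DTv thresholds more concrete, the sketch below computes an uncorrected 4DTv for a pair of coding sequences: the fraction of fourfold degenerate third-codon positions at which the two sequences differ by a transversion. It is an illustration under stated assumptions, not the pipeline's code. It assumes the standard genetic code, assumes the inputs are already codon-aligned and gap-free, compares only codons that share the same first two bases, and applies no correction for multiple substitutions. The function name raw_4dtv is hypothetical.

```python
# First two bases of the eight fourfold degenerate codon families in the
# standard genetic code: Leu (CTN), Val (GTN), Ser (TCN), Pro (CCN),
# Thr (ACN), Ala (GCN), Arg (CGN), Gly (GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def raw_4dtv(cds_a, cds_b):
    """Uncorrected 4DTv for two codon-aligned, gap-free coding sequences.

    Returns None when no fourfold degenerate site can be compared.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    sites = 0
    transversions = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        # Compare only codons from the same fourfold degenerate family.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        if codon_a[2] not in "ACGT" or codon_b[2] not in "ACGT":
            continue  # skip ambiguous bases
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (codon_a[2] in PURINES) != (codon_b[2] in PURINES):
            transversions += 1
    return transversions / sites if sites else None


# Codons: GCT/GCA (transversion), GGA/GGG (transition), CCC/CCT (transition), ATG/ATG (not fourfold)
value = raw_4dtv("GCTGGACCCATG", "GCAGGGCCTATG")
```

A pair whose value exceeds the maximum 4DTv chosen for a node would then be discarded as reflecting a divergence older than that node.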

Note that, by construction, every gene from an organism present at a particular node is in one and only one cluster at that node. Some clusters may contain only one extant gene (singletons). Singletons can arise from "fast" evolution (so much sequence divergence that sequence-similarity-based clustering is confounded), from gene loss, or from gene-calling errors.

Phytozome Team


Software: Joni Fazo, David M. Goodstein, Richard D. Hayes, Rusty Howson, Rochak Neupane, Shengqiang Shu
Analysis: Uffe Hellsten, Therese Mitros, Simon Prochnik, Dan Rokhsar
©2010 University of California Regents. All rights reserved