MMETSP Review Publications

Keeling PJ et al. (2014) The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): Illuminating the Functional Diversity of Eukaryotic Life in the Oceans through Transcriptome Sequencing. PLoS Biol 12(6): e1001889. doi:10.1371/journal.pbio.1001889

For a full list of MMETSP-related publications please visit GBMF's MMETSP project page.

Dataset Access

b. iMicrobe
c. SRA
d. PhyloDB via PhyloMetarep

*The combined assemblies originate from the reads of the single assemblies.
PhyloMetarep will have only one representation for each strain: either the combined assembly or, if only one sample was sequenced, a single assembly. Updated annotations have been completed and will be available soon. They include BLAST searches against the updated PhyloDB (including all MMETSP data) and KEGG, assignment of KEGG ortholog (KO) IDs, KOG, core eukaryotic gene annotation (CEGMA), Lineage Probability Index (LPI) assignment (to assist with filtering contaminants), transmembrane domain (TMHMM) and transporter annotation, Pfam, TIGRFAM, SUPERFAMILY, and GO. Combined assemblies have been integrated into a single annotation, and read counts across conditions are provided. In addition, a 7-level taxonomy tied to PR2 has been implemented, which will facilitate more meaningful comparative analyses at various levels of phylogenetic resolution. For questions, please contact Andy Allen or John McCrow.

Transcriptome Assembly Methods

Assembly methods were refined midway through the project to reduce sequence redundancy in the final contigs. Samples submitted earlier in the project were assembled using BPA1.0, while samples submitted later were assembled using BPA2.0. Version BPA2.0 can be downloaded from GitHub. To determine which version was used to assemble a specific sample, please see the associated README_MMETSP. All combined assemblies were done using BPA2.0.

Detailed methods for BPA1.0 and BPA2.0:
BPA1.0 methods.pdf
BPA2.0 methods.pdf
BPA2.0 combined methods.pdf

Notes on How to Verify Strain Identity

Transcriptome datasets are complex and should be approached with the following caveats in mind, especially given the very deep Illumina sequencing used to generate these data. While many researchers attempted to provide axenic and uni-algal total RNA extracts for sequencing, this was not always achieved. The very deep Illumina sequencing may occasionally have picked up very low levels of non-target RNA, and low levels of bacteria may have been present without the knowledge of the laboratory that provided the sample.

The identity of the target alga should be verified through 18S rRNA sequence analysis of the rRNA reads included in the transcriptome. This analysis has been performed by the National Center for Genome Resources (NCGR), but you may wish to repeat it using your own protocols. Please contact NCGR for information about the 18S rRNA sequence analysis performed on a given sample. Many cultures were known to be non-axenic, and this is indicated on the sample page. Cultures that were believed to be axenic must nevertheless be verified through transcriptome data analysis. Some cultures were known to contain multiple algal species (e.g. predator/prey experiments), and this is also indicated on the sample page. Unanticipated, very low levels of non-target algal species may be present, as revealed by the deep sequencing of the transcriptomes (upwards of 2.5 Gb per sample).

We recommend that you analyze the transcriptome datasets with these caveats in mind and encourage discussions in the research community regarding best practices and for sharing useful bioinformatics approaches to aid in the community's analyses.

Notes on Non-Axenic and Non-Clonal Samples

During the sample nomination process, researchers submitted a preliminary indication of whether cultures were clonal and/or axenic. In some cases, the responses were definitively known. In other cases, however, the submitter may have thought a culture was clonal and/or axenic but, upon transcriptome sequencing, found that this was not the case. We strongly recommend analyzing transcriptomes for bacterial and non-target eukaryote reads before downstream analyses. Please see the general instructions for evaluating species 18S diversity in your data as an approach to evaluating non-target reads.
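
As one illustration of screening for non-target reads, the minimal sketch below tallies which reference each rRNA read matches best. It is not the NCGR protocol. It assumes you have already compared rRNA reads or contigs against an 18S reference set of your choosing and saved the result in the standard 12-column BLAST tabular format (query ID in column 1, subject ID in column 2, bit score in column 12). The read IDs, reference IDs, and taxon labels are hypothetical placeholders.

```python
from collections import Counter


def best_hits(blast_lines):
    """Keep the highest-bit-score hit per query from BLAST tabular lines."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def taxon_fractions(best, subject_to_taxon):
    """Fraction of queries whose best hit falls under each taxon label."""
    counts = Counter(
        subject_to_taxon.get(subject, "unknown") for subject, _ in best.values()
    )
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()} if total else {}


# Hypothetical example data: three reads, one of which has two hits.
example = [
    "read1\tref_target\t99.0\t150\t1\t0\t1\t150\t10\t159\t1e-70\t270",
    "read1\tref_other\t90.0\t150\t15\t0\t1\t150\t10\t159\t1e-40\t180",
    "read2\tref_target\t98.0\t150\t3\t0\t1\t150\t10\t159\t1e-68\t262",
    "read3\tref_bacterium\t97.0\t150\t4\t0\t1\t150\t10\t159\t1e-66\t255",
]
# Hypothetical mapping from reference sequence ID to a taxon label.
labels = {
    "ref_target": "target alga",
    "ref_other": "other eukaryote",
    "ref_bacterium": "bacterium",
}
fractions = taxon_fractions(best_hits(example), labels)
for taxon, fraction in sorted(fractions.items()):
    print(f"{taxon}\t{fraction:.2f}")
```

A noticeable share of best hits outside the expected target would be a prompt for closer inspection rather than proof of contamination, since best-hit assignment depends heavily on the reference set used.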

Preliminary 18S sequences and taxonIDs may be found in the metadata forms for each sample. Researchers who submitted samples were requested to provide a freshly generated 18S sequence and confirm or obtain an accurate taxonID by communicating with NCBI. Users of the MMETSP datasets are strongly advised to confirm this information when using the transcriptome datasets, including bearing in mind that some samples were intentionally non-clonal (e.g. predator-prey experiments).

FASTA of 18S sequences provided by researchers who submitted samples to MMETSP, by sample: 18S by sample

FASTA of 18S sequences with redundant sequences removed (use this FASTA for alignments): 18S non-redundant
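
The source does not state how redundant sequences were removed from the second file. As a rough sketch of one possible approach, the Python below collapses only exact duplicate sequences (ignoring case) and keeps the first header seen for each. The sample headers and sequences are hypothetical, and near-identical sequences would need a clustering tool instead.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, sequence))
    return unique


# Hypothetical input: sample_B repeats the sequence of sample_A.
example_fasta = [
    ">sample_A",
    "ACGTACGT",
    "TTGA",
    ">sample_B",
    "acgtacgtttga",
    ">sample_C",
    "GGGTTTAA",
]
unique_records = remove_exact_duplicates(read_fasta(example_fasta))
for header, sequence in unique_records:
    print(f">{header}\n{sequence}")
```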

Lookup Table for Inconsistencies Between NCBI’s SRA and Other Data Hosts

Upon submission of sequence reads to the SRA, it was necessary to change a number of taxon IDs and a few organism names to fit into NCBI’s existing taxonomy database. Please use the table below to obtain the NCBI-provided taxon ID or organism synonym for any sample.

Taxon ID lookup table
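
If the lookup table is saved as a tab-delimited file, it can be indexed by sample as in the minimal sketch below. The column names (`sample_id`, `submitted_taxon_id`, `ncbi_taxon_id`, `ncbi_organism_name`) and all values are invented for illustration and will need to be adjusted to match the headers of the actual table.

```python
import csv
import io


def load_taxon_lookup(handle, key_column="sample_id"):
    """Index the rows of a tab-delimited table by sample ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[key_column]: row for row in reader}


# Placeholder table: column names and values are hypothetical.
example_table = io.StringIO(
    "sample_id\tsubmitted_taxon_id\tncbi_taxon_id\tncbi_organism_name\n"
    "SAMPLE_A\t111\t222\tExample organism one\n"
    "SAMPLE_B\t333\t333\tExample organism two\n"
)
lookup = load_taxon_lookup(example_table)

# Samples whose taxon ID differs between the submitter and NCBI.
changed = [
    sample
    for sample, row in lookup.items()
    if row["submitted_taxon_id"] != row["ncbi_taxon_id"]
]
print(lookup["SAMPLE_A"]["ncbi_taxon_id"])
print(changed)
```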

Protist 2014 Meeting

Banff, Alberta, August 3–8, 2014

The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) Special Session at Protist 2014 will feature presentations and discussions focusing on novel biological insights garnered from the MMETSP. The MMETSP is a collaborative project supported by the Gordon and Betty Moore Foundation to sequence, assemble, and annotate the transcriptomes of approximately 700 microeukaryotes representing hundreds of diverse species. Datasets are available through NCBI and CAMERA.


1. Dail Laughinghouse (Smith College) “Taxon-rich analyses using a phylogenomic pipeline resolve the eukaryotic tree of life and reveal the power of subsampling by sites”

2. Chris Lane (University of Rhode Island) “Using networks to facilitate transcriptome assembly and analysis”

3. Martin Kolisko (University of British Columbia) “Investigation of contamination levels in the Marine Microbial Eukaryote Transcriptome Sequencing Project data”

4. Sonya Dyhrman (Lamont Doherty Earth Observatory at Columbia University) “Transcriptome-based profiling of algal nutritional physiology; linking physiology to geochemistry in cultures and field populations”

5. Denis Lynn (University of British Columbia) “Large-scale phylogenomic analysis reveals the phylogenetic position of the problematic taxon Protocruzia and unravels the deep phylogenetic affinities of the ciliate lineages”

Workshop handout

Protist 2014 MMETSP handout.pdf

Microeukaryote Bioinformatics Workshop on Analyzing Protist Transcriptomes

International Congress of Protistology XIV Meeting, Vancouver, July 31, 2013

This bioinformatics workshop was held during the July 2013 ICOP meeting and was designed to help scientists analyze protist transcriptomes. The workshop featured presentations and discussions focusing on bioinformatics methods for analyzing microbial eukaryote transcriptomes. The basis for the workshop was the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), a collaborative project supported by the Gordon and Betty Moore Foundation to sequence, assemble, and annotate the transcriptomes of approximately 700 microeukaryotes representing hundreds of diverse species. Datasets that have been made public may be found via CAMERA.

Workshop agenda

MMETSP workshop agenda final.pdf

Presenters and organizers

1. Andy Allen (J. Craig Venter Institute)
2. Jon Kaye (Gordon and Betty Moore Foundation)
3. Stephanie Guida (National Center for Genome Resources): MMETSP project status and assembly techniques: Presentation and Handout
4. Debashish Bhattacharya (Rutgers University): Presentation
5. Martin Krzywinski (Genome Sciences Centre)
6. Aaron Tenney (J. Craig Venter Institute): PhyloMetarep Presentation, practice exercises, and PhyloMetarep Tutorial on YouTube
7. Robin Kodner (Western Washington University)

Recommended reading and resources

Experimental evolution with algae:

Lohbeck KT, Riebesell U, Reusch TBH. 2012. Adaptive evolution of a key phytoplankton species to ocean acidification. Nature Geoscience 5: 346–351.

Kawecki TJ, Lenski RE, Ebert D, Hollis B, Olivieri I, Whitlock MC. 2012. Experimental evolution. Trends in Ecology and Evolution 27: 547–560.

Collins S, Bell G. 2004. Phenotypic consequences of 1,000 generations of selection at elevated CO2 in a green alga. Nature 431: 566–569.

Transcriptome/genome analysis of algae:

Bhattacharya D, Price DC, Chan CX, Qiu H, Rose N, Ball S, Weber AP, Arias MC, Henrissat B, Coutinho PM, Krishnan A, Zäuner S, Morath S, Hilliou F, Egizi A, Perrineau MM, Yoon HS. 2013. Genome of the red alga Porphyridium purpureum. Nature Communications 4: 1941.

Chan CX, Soares MB, Bonaldo MF, Wisecaver JH, Hackett JD, Anderson DM, Erdner DL, Bhattacharya D. 2012. Analysis of Alexandrium tamarense (Dinophyceae) genes reveals the complex evolutionary history of a microbial eukaryote. Journal of Phycology 48: 1130–1142.

General RNA-seq background:

Garber M, Grabherr MG, Guttman M, Trapnell C. 2011. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods 8: 469–477.

Ozsolak F, Milos PM. 2011. RNA sequencing: advances, challenges and opportunities. Nature Reviews Genetics 12: 87–98.

De Wit P, Pespeni MH, Ladner JT, Barshis DJ, Seneca F, Jaris H, Therkildsen NO, Morikawa M, Palumbi SR. 2012. The simple fool's guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis. Molecular Ecology Resources 12: 1058–1067.

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. 2012. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7: 562–578.

Martin JA, Wang Z. 2011. Next-generation transcriptome assembly. Nature Reviews Genetics 12: 671–682.

Wolf JBW. 2013. Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. Molecular Ecology Resources 13: 559–572.

Additional bioinformatics resources:

1. Bioconductor
Tools for the analysis of high-throughput genomic and transcriptomic data, using the R statistical programming language. RNA-seq specific packages are available.

2. GenePattern (Broad Institute)
Suite of tools for short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, differential expression, quality control metrics, visualization, and file utilities.

3. Galaxy
Open, web-based platform for accessible, reproducible, and transparent computational research. The public server hosts a suite of NGS: RNA analyses; community-hosted instances are also available.

Please contact Andy Allen, Stef Guida, Robin Kodner, or Jon Kaye for additional information or questions.
