Frequently Asked Questions
rev. April 10, 2015
How do I know the status of my approved sample(s)?
You can track whether metadata forms have been QCed and entered in our database
(usually within a week of receipt) and also the exact stage your sample is at
(bar-code sample tube sent, sample QC, library construction, sequencing, assembly
& annotation) by visiting:
When do I submit environmental and experimental metadata?
Once a nomination is accepted, PIs will be notified by e-mail. At this stage, PIs
should download the metadata form (excel spreadsheet) from
and rename the downloaded file after its unique ID (MMETSPXXXX.xlsx) that can be found
in the e-mail approval notice. The instructions are in the metadata form itself.
The completed metadata form (one metadata file for each transcriptome) should be submitted to
Dr. Callum Bell via e-mail.
NCGR will then send a bar-coded sample tube.
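The renaming step above can be sketched in Python. This is only an illustration: it assumes the unique ID is "MMETSP" followed by four digits (use whatever ID appears in your approval e-mail), and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: IDs follow the MMETSPXXXX pattern with four digits.
ID_PATTERN = re.compile(r"MMETSP\d{4}")


def metadata_filename(sample_id: str) -> str:
    """Return the expected metadata file name for one sample ID."""
    if not ID_PATTERN.fullmatch(sample_id):
        raise ValueError(f"unexpected sample ID: {sample_id!r}")
    return f"{sample_id}.xlsx"


def rename_metadata_form(downloaded: Path, sample_id: str) -> Path:
    """Rename the downloaded form in place and return the new path."""
    target = downloaded.with_name(metadata_filename(sample_id))
    if target.exists():
        # Refuse to overwrite a form that was already prepared.
        raise FileExistsError(target)
    return downloaded.rename(target)
```

For example, `rename_metadata_form(Path("metadata_form.xlsx"), "MMETSP0001")` would produce `MMETSP0001.xlsx` in the same folder; both names here are made up.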
This is a good time to start growing your organism and harvesting total RNA. If you do not already have an NCBI taxon ID, you must obtain one; this is the responsibility of the PI. A recently generated (within the last 2-3 months) full-length 18S rRNA gene sequence is required, even if the organism was obtained from a culture collection. Because this requirement ensures that we are sequencing the intended organism, we have been strict about it. If you are facing technical difficulties generating a fresh 18S rRNA gene sequence, please contact Dr. Jon Kaye at GBMF.
If the organism is not available in a culture collection, it must be deposited in one. To seek an exception for very difficult-to-grow organisms that cannot be continuously propagated, please contact Julia Metzner at GBMF.
How often are the bar-coded sample tubes shipped out to PIs?
Once the metadata forms have been checked and entered in our database, bar-coded sample
tubes are sent out in the next weekly shipment (every Friday) by USPS (national) and FedEx (international).
The shipping address used by NCGR is the one the PI provided when filling out the online nomination form. Please check your user account and make sure that our database holds your correct MAILING ADDRESS, not a PO Box. If you see an error, please log into your account at https://www.marinemicroeukaryotes.org/users/sign_in and click on the "Edit your registration information" tab. On that page you can edit your personal information.
Please note that two tubes will be sent to you for each sample. Please use the tube marked "QC" for a 5 µl aliquot of your sample. This will be used to check the sample quality and quantity upon arrival without the need to thaw the main sample. Use the other tube (not marked QC) for the remainder of your sample. Please make sure that the QC tube and the main sample tube represent the same batch having the same concentration.
In case you are planning to send dry RNA pellets, please let us know the original volumes you had in both tubes before drying them. We will add exactly the same amount you started with so that the RNA concentration of both QC & main tube is the same.
- What should I send to NCGR? You should submit total RNA suspended in RNase-free water at a concentration above approximately 60 ng/µl. In certain circumstances (see FAQ 11), cDNA may be sent to NCGR, but you must first discuss this possibility with NCGR for approval.
- How should I isolate RNA from my sample? Any standard RNA isolation protocol is acceptable (e.g. Qiagen RNeasy). The total RNA must be treated with RNase-free DNase before quantitation.
- Does the RNA have to be completely free of DNA or is some DNA contamination OK? All samples must be treated with RNase-free DNase to ensure they are free of DNA contamination, which would otherwise hinder the correct quantification of total RNA. There is also a small chance that contaminating DNA may get incorporated into the library, which is not desirable.
- How much total RNA do I need to send to NCGR? For the standard TruSeq RNA library protocol, it is ideal to send 5-10 µg of total RNA.
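The concentration and mass targets above combine through simple arithmetic (mass = concentration x volume). A minimal sketch for checking a sample before shipping follows; the function names are hypothetical and the thresholds are simply the figures quoted in this FAQ for the standard protocol.

```python
MIN_CONC_NG_PER_UL = 60.0    # FAQ: concentration above ~60 ng/µl
IDEAL_MIN_MASS_UG = 5.0      # FAQ: 5-10 µg of total RNA is ideal


def total_rna_ug(conc_ng_per_ul: float, volume_ul: float) -> float:
    """Total RNA mass in micrograms (ng/µl x µl = ng; 1000 ng = 1 µg)."""
    return conc_ng_per_ul * volume_ul / 1000.0


def check_submission(conc_ng_per_ul: float, volume_ul: float) -> list:
    """Return a list of warnings; an empty list means both targets are met."""
    warnings = []
    if conc_ng_per_ul <= MIN_CONC_NG_PER_UL:
        warnings.append("concentration is not above 60 ng/µl")
    mass = total_rna_ug(conc_ng_per_ul, volume_ul)
    if mass < IDEAL_MIN_MASS_UG:
        warnings.append(f"only {mass:.1f} µg of total RNA; 5-10 µg is ideal")
    return warnings
```

For instance, 60 µl at 100 ng/µl is 6 µg of total RNA, which meets both targets.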
How should I quantify my RNA?
Total RNA can be stained with RiboGreen and the fluorescence measured with a fluorometer.
Note: RiboGreen binds to both RNA and DNA, so the RNase-free DNase step is essential for
accurate quantification. If your total RNA sample appears too viscous, it may be contaminated
with carbohydrate and/or protein; spectrophotometric (NanoDrop) analysis will help to check
this. Ideally the 260:280 nm absorbance ratio for RNA will be close to 2. Feel free to refer to
"nucleic acids analysis" on Wikipedia.
- If you have access to a Qubit, you may use it to quantify the concentration of total RNA.
- Although all samples will undergo quality control (QC) at NCGR (using Invitrogen Qubit Q32855), researchers are strongly encouraged to perform robust QC in their own laboratories to maximize the likelihood that their samples will pass QC at NCGR.
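The spectrophotometric check described above can be sketched as follows. The conversion of one A260 unit to about 40 ng/µl is the standard figure for single-stranded RNA at a 1 cm path length; the 1.8-2.2 acceptance window and the function names are assumptions for illustration, not NCGR's QC criteria.

```python
RNA_NG_PER_UL_PER_A260 = 40.0  # standard factor for ssRNA, 1 cm path length


def purity_ratio(a260: float, a280: float) -> float:
    """Return the 260:280 nm absorbance ratio."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def rna_conc_ng_per_ul(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate RNA concentration from A260 and the dilution used for reading."""
    return a260 * RNA_NG_PER_UL_PER_A260 * dilution_factor


def looks_pure(ratio: float, low: float = 1.8, high: float = 2.2) -> bool:
    """Assumed window around the ideal ratio of ~2; adjust to your lab's practice."""
    return low <= ratio <= high
```

Because DNA also absorbs at 260 nm, this estimate is only meaningful after the DNase treatment.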
What should I do if, after several attempts, I only have a few hundred nanograms of total RNA?
Such samples will be processed through the total RNA-Seq protocol using DSN (duplex-specific
nuclease) normalization. If you want us to process your sample through the DSN protocol,
please send ~500 ng of total RNA at a concentration of ~100 ng/µl.
This protocol can also detect genes expressed at low levels as well as non-coding RNAs, including contaminating bacterial transcripts, if any. The DSN protocol is not recommended for non-axenic samples (see #20 below).
- How much poly-A+ RNA goes into library preparation? 1-10 µg of total RNA is recommended for the standard Illumina TruSeq RNA protocol (poly-A+ selection). NCGR will prepare the poly-A+ RNA as part of the standard protocol.
What about organisms with 'spliced leaders' on their mRNAs?
'Target amplified' cDNA submission is permitted for organisms that put specific leaders
on their mRNAs. This protocol is useful, for example, when the
transcriptome of ONLY the eukaryote host or ONLY its endosymbiont is of interest to the
researcher and one of the two carries out trans-splicing.
We prefer cDNA larger than 500 bp. Please send up to 1 µg of the amplified cDNA suspended in TE (preferably at a concentration of ~100-200 ng/µl). When sending your cDNA sample, please state very clearly that the sample is 'cDNA'. For all cDNA samples received from PIs, NCGR will proceed from the end repair step of Illumina's TruSeq RNA protocol.
How should I send a sample to NCGR?
Samples are due at NCGR 3 months after receipt of the e-mail approval notice. Total RNA
samples should be shipped frozen on dry ice by overnight express mail to:
Attn: Sequencing Lab, 2935 Rodeo Park Drive East, Santa Fe, NM 87505, USA
If the shipment is from outside the USA and the PI is concerned about samples being held at customs, the samples may be sent as a dry pellet, i.e. ethanol-precipitated and dried using a speed-vac or lyophilizer (shipment at room temperature). A good way to avoid problems during customs/quarantine declaration is to stay away from descriptions like "biological sample", which could attract attention. Examples of labels you could use are "RNA extract for research use only, non-hazardous, non-toxic, non-infectious RNA only, RNA isolate, not known to be hazardous". Please ship the samples early in the week because NCGR does not receive shipments on weekends (Saturday / Sunday).
- How will I know if my sample has been received by NCGR and has passed QC procedures? NCGR will notify you by email upon sample receipt and whether your sample has passed QC.
- If my sample fails QC, can I resubmit it? You may resubmit a sample within six weeks of notification from NCGR that it failed QC. However, the maximum number of resubmissions allowed is two.
- How will samples be sequenced? RNA libraries with an insert size of ~200 bp will be made and sequenced from both ends (paired-end reads, 2 x 50 nt) on the Illumina HiSeq 2000. The total sequence data generated for each sample will be approximately 2.5 Gbp.
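The sequencing figures above imply an approximate number of read pairs per sample. A small sketch of that arithmetic follows; the function name is hypothetical and the result is only as approximate as the 2.5 Gbp target.

```python
def read_pairs(total_bases: float, read_length: int = 50) -> float:
    """Number of read pairs needed to yield total_bases with 2 x read_length reads."""
    bases_per_pair = 2 * read_length  # one read from each end of the insert
    return total_bases / bases_per_pair


# 2.5 Gbp of 2 x 50 nt data corresponds to about 25 million read pairs.
pairs = read_pairs(2.5e9)
```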
What is the workflow of assembly and annotation at NCGR?
Sequence reads will first be assembled using ABySS (Simpson et al., 2009) at various
k-mer sizes. These intermediate contigs, which may differ within a sample between differently
parameterized runs of the same assembly algorithm, will be grouped using an OLC
(overlap-layout-consensus) assembler such as CAP3 (Huang & Madan, 1999). The final contigs will be
annotated with protein motifs (HMM search against PFAM, SUPERFAMILY & TIGRFAMs; BLASTP against SwissProt).
Assembly methods were refined midway through the project in an attempt to provide less sequence redundancy in the final contigs. Samples submitted earlier in the project were assembled using BPA1.0, while samples submitted later in the project were assembled using BPA2.0. Version BPA2.0 can be downloaded from GitHub. To determine which version was used to assemble a specific sample please see the associated README_MMETSP. A formal methods writeup for both assembly versions can be downloaded from Resources.
The assembly and annotation data will be available as a compressed file. The final format of the MMETSP bundle is:
| |-- pfam.gff3
| |-- superfamily.gff3
| |-- swissprot.gff3
| `-- tigrfams.gff3
| |-- blastp_swissprot.xml
| |-- hmmer3_pfam.hits
| |-- hmmer3_superfam.hits
| `-- hmmer3_tigrfams.hits
| |-- cds.dat
| `-- contigs.dat
The 'README' file in each bundle will contain detailed information about each file.
Annotation files based on hits to Swiss-Prot, Pfam-A, and TIGRFAMs include InterPro associations in the Ontology term attribute, when available. These are based on assignments of protein accessions and hmm models to InterPro terms as published by the InterPro group.
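The .gff3 annotation files can be read with a few lines of Python. The sketch below follows the general GFF3 layout (nine tab-separated columns, with `key=value` attributes separated by semicolons); the exact sources, feature types and attribute keys used in the MMETSP bundles are described in each README, and the example line in the usage note is made up.

```python
from urllib.parse import unquote


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # Multiple values for one key are comma-separated in GFF3.
            attributes[key] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "phase": phase, "attributes": attributes,
    }
```

With a hypothetical line whose ninth column is `ID=hit1;Ontology_term=InterPro:IPR000001`, the InterPro association would be available as `record["attributes"]["Ontology_term"]`.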
In addition to the assembly / annotation bundle, CAMERA will also make available the raw reads. The 2 compressed fastq files will have sequence data for read-1 and read-2, respectively.
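The two read files can be walked in parallel to recover each read pair. This is a minimal sketch that assumes gzip compression and the common four-lines-per-record FASTQ layout; the file names in the usage note are hypothetical.

```python
import gzip
from itertools import islice


def fastq_records(path):
    """Yield (name, sequence, quality) tuples from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            block = list(islice(handle, 4))  # one record = 4 lines
            if not block:
                return
            if len(block) < 4:
                raise ValueError(f"truncated FASTQ record in {path}")
            header, seq, _plus, qual = (s.rstrip("\n") for s in block)
            yield header[1:], seq, qual  # drop the leading '@'


def paired_records(read1_path, read2_path):
    """Yield (read-1 record, read-2 record) pairs in file order.

    zip() stops at the shorter file, so unequal files are not detected here.
    """
    for r1, r2 in zip(fastq_records(read1_path), fastq_records(read2_path)):
        yield r1, r2
```

A call such as `paired_records("sample_1.fastq.gz", "sample_2.fastq.gz")` would then yield one tuple per read pair.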
- How long will it take NCGR to sequence, assemble and annotate my sample? Depending on the number of samples in the queue, NCGR expects to sequence, assemble and annotate samples within three months of receipt of samples that pass QC. However, it may take longer for some researchers to receive their datasets if NCGR receives a large number of samples during any given time period.
- Can I use the read counts as an expression profile? It depends. In most cases, the transcriptome datasets are just a starting point for further analysis and experimentation and are not designed to provide quantitative information. However, depending on experimental design (such as pooled biological replicates), it may be possible to obtain quantitative information.
- Do the raw sequence reads undergo any filtering step? NCGR's sequencing pipeline automatically filters reads for Illumina primers/adaptors and for the control DNA (phiX174). In addition, before assembly the reads are quality-trimmed (>Q15) using SGA preprocess.
- Since the Illumina TruSeq RNA protocol has a poly-A+ selection, should we expect ribosomal RNA, chloroplast or mitochondrial sequences in the assembled contigs? Yes, ribosomal RNA passes the poly-A selection stage simply because of its huge abundance. Organelle (chloroplast & mitochondrial) reads also come through the poly-A step for the same reason. Bacterial transcripts (no poly-A tail) are unlikely to pass the oligo-dT column (TruSeq RNA library) unless they are so abundant that they match the levels of the main sample's rRNA. However, if a sample is processed through the DSN protocol, bacterial transcripts will definitely be present. Therefore we do not recommend DSN for non-axenic samples.
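To give a feel for what quality trimming at Q15 means, here is a simplified 3'-end trimmer. It is not the algorithm used by SGA preprocess; it assumes Phred+33 quality encoding and simply removes trailing bases whose quality is below the threshold.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 15, offset: int = 33):
    """Drop bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]
```

A read whose last two bases have qualities 2 and 0 loses those two bases; a read made up entirely of low-quality bases is trimmed to an empty string.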
Do I need to be concerned about the proportion of bacterial contamination in my sample?
Among the first transcriptome assemblies we have in hand, two samples offered a useful
comparison: one without any bacteria, and the other mixed 50:50 with a ciliate culture
containing bacteria. The assembled contigs from both samples were searched with BLAST
against the Greengenes 16S rDNA database. Because of its high relative abundance,
ribosomal RNA from any contaminant is expected to exceed the coverage threshold needed
for assembly. Only an insignificant number of contigs from the contaminated sample
hit bacterial 16S sequences. Although this interpretation should be taken with a grain of
salt, the analysis suggests that the proportion of bacterial contigs in samples
contaminated with bacteria is low. A similar observation was made when the BLAST search
was repeated against a database of complete bacterial genomes downloaded from NCBI.
We encourage PIs who are concerned about their non-axenic cultures to carry out additional rounds of poly-A selection in their labs before sending cDNA samples to NCGR. Fixed library kit components from Illumina, as well as high-throughput constraints, make additional rounds of poly-A selection at NCGR difficult.
- Is alternative splicing being reported? NCGR does not attempt to separate alternatively spliced forms.
- How will the project deal with overlapping genes? Our approach will be to sort these out bioinformatically; reading frames on either strand having strong coding potential will be annotated. The researcher is strongly encouraged to submit information about overlapping genes in the organism, if known.
Has any QC been carried out with the initial round of assembled contigs?
We have used the genome sequences of Guillardia theta
(sequenced & assembled at JGI, PI: John Archibald) for QC. Depending on the alignment
criteria one is comfortable with, 85-96% of the transcriptome contigs can be
mapped onto the genome assemblies.
- How do I access my data? Sample metadata form, Illumina raw reads, transcriptome assembly & annotation will be deposited to the Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA). CAMERA will notify the respective PIs when their data is available for download from its FTP site.
- What is the embargo policy? Researchers will have six months of exclusive use of their data following the date of notification from CAMERA that transcriptome datasets are available for download. The embargo policy is described here.
- Are there exceptions to the embargo policy? No. For questions, please contact Dr. Jon Kaye at +1 650-213-3122.
- Will my nomination be kept confidential? Yes. Sample nominations will not be available or viewable through the website. For samples accepted into the program, only the information described in FAQ 29 will be made publicly viewable.
What sample information will be publicly visible?
The "approved transcriptomes" page will
show the investigator name, taxonomy, genus, species, strain and status of all
accepted samples with respect to bar-code sample tube sent, sample QC, library
construction, sequencing, assembly and annotation. Sample-specific statistics
from some of the first few samples processed so far are shown below:
Whom should I contact with additional questions?
For questions about metadata, please contact Dr. Callum Bell at NCGR (+1 505-995-4428).
For scientific/technical questions about library preparation, sequencing, or assembly, please contact Dr. Stephanie Guida via e-mail.
For questions about iMicrobe, please contact Dr. Bonnie Hurwitz or Kenneth Youens-Clark.
You may also contact Dr. Jon Kaye at GBMF with any questions (+1 650-213-3122).
See also: iMicrobe's MMETSP Homepage with Access to Datasets
Regarding detailed information on any Illumina protocol, please call Illumina Tech support at 1-800-809-4566. In order to get direct access to product literature and protocols, you can also register on their website https://icom.illumina.com
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19: 1117-23.
Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9: 868-77.