Model SIRVs - Spike-In RNA Variant Control Mixes

Biases in RNA Sequencing: RNA sequencing (RNA-Seq) workflows comprise RNA purification, library generation, the sequencing itself, and the evaluation of the sequenced fragments. The initial steps impose biases for which the data processing algorithms try to compensate afterwards. Key tasks for data evaluation algorithms are the concordant assignment of fragments to the transcript variants, robustness towards annotation flaws and the subsequent deduction of the corresponding abundance values. Unless the quality of all individual processing steps can be unequivocally determined, subsequent comparisons of experimental data remain ambiguous.

Spike-in Transcripts in RNA-Seq
The proliferation of RNA-Seq platforms and protocols, together with ongoing efforts to translate next-generation sequencing (NGS) into clinical diagnostics, has created the need for multi-functional spike-in controls. These are integrated and processed with real samples to enable the monitoring and comparison of key performance parameters such as sensitivity and input-output correlation, as well as the detection and quantification of transcript variants. The external controls are RNA molecules of known sequence that are added in pre-determined amounts to a sample. They are then subjected to the same protocol steps (with the same restrictions and biases) as the endogenous RNA and are separated from it only at the final step of NGS data analysis (Figure 1).

Figure 1 | Workflow for using spike-in controls in RNA-Seq. Spike-In RNA Variants (SIRVs) are defined synthetic RNA molecules that mimic the main aspects of transcriptome complexity. They are added in minuscule amounts to samples before library preparation to undergo the very same processing steps as the endogenous RNA. After mapping the reads to the combined artificial genome, the spike-in data are used to analyze quality metrics and to categorize the experiments. The dotted lines show the two decisions to be made: i) whether the complete data set is worth processing further (or whether the experiment needs to be repeated), and ii) which data sets are concordant enough to permit meaningful comparison of the full data sets with each other.

SIRV Modules: Isoforms, ERCCs, and Long SIRVs

Transcriptomes are complex and consist of several RNA classes with specific properties. Spike-in RNA controls must reflect these to be representative of a given experimental design. The Spike-In RNA Variants (SIRVs) were conceived as a family of modules to offer tailored solutions for the control of RNA-Seq experiments. SIRVs are available as an isoform module, which contains a group of synthetic transcripts that mimic transcriptome complexity, and as a length module that covers transcript lengths of up to 12 kb. While the SIRV isoform module is available as a stand-alone module (Cat. No. 050) or mixed with ERCCs to additionally mimic abundance complexity (Cat. No. 051), the long SIRVs module is provided in a mix together with the SIRV isoform module and the ERCC module (Cat. No. 141). See Modular Design for more information on the spike-in concept and SIRV Sets for details on the mixes available to users.

Spike-in Experiment Rationales
Here, we describe considerations for planning RNA-Seq experiments. However, the SIRV mixes are not only suitable for assessing NGS setups but also for quantification on microarray platforms and in qPCR assays.

Spiking of samples

SIRVs are spiked into samples before library preparation, either into purified RNA or at an upstream processing stage such as homogenization (e.g. RNA extraction from tissues or fluids) or lysis (e.g. single cell applications). Because their sequences are not identical to any genomic or transcriptomic database entries, they can be combined with RNA from any organism (see Modular Design for details). Since the SIRV RNAs are polyadenylated, library preparation can start from poly(A)-selected fractions as well as from total RNA, depleted RNA, etc.

Typically, the amount of spike-in RNA is adjusted so that only 1% of all NGS reads map to the SIRV genome, the “SIRVome”. This may be increased to 2-5% for setups with low read depth (< 5M reads) that analyze SIRV-Set 3 or SIRV-Set 4, which contain more than one SIRV module. The spike-in amounts are best tailored to the RNA fraction of interest (e.g. total RNA, ribosomal RNA-depleted RNA or poly(A)-enriched RNA) and to the amount of sample. Alternatively, spike-in amounts may be kept constant to measure variations between samples, such as differences in mRNA content or metabolic state.
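As a rough illustration of the arithmetic behind such an adjustment, the following Python sketch estimates a spike-in mass for a target read share. It rests on a simplifying assumption that is not part of the product documentation: the share of reads is taken to be proportional to the mass share of spike-in RNA within the RNA fraction that actually enters the library. The function name and the 3% mRNA share in the example are hypothetical and purely illustrative.

```python
def spike_in_mass_ng(target_rna_ng: float, read_fraction: float = 0.01) -> float:
    """Estimate the spike-in mass (ng) needed for a target share of reads.

    Assumption (illustrative only): read share equals mass share, i.e.
    read_fraction = spike / (spike + target_rna). Solving for spike gives
    spike = target_rna * f / (1 - f).
    """
    if not 0.0 < read_fraction < 1.0:
        raise ValueError("read_fraction must be strictly between 0 and 1")
    if target_rna_ng < 0.0:
        raise ValueError("target_rna_ng must not be negative")
    return target_rna_ng * read_fraction / (1.0 - read_fraction)


# Hypothetical example: 100 ng total RNA going into a poly(A)-selecting
# library prep, with an assumed mRNA share of 3% (varies by sample type).
total_rna_ng = 100.0
assumed_mrna_share = 0.03
mrna_ng = total_rna_ng * assumed_mrna_share

# Aim for 1% of reads, the typical target quoted above.
print(f"{spike_in_mass_ng(mrna_ng, 0.01):.4f} ng spike-in RNA")
```

In practice the read share actually obtained should be checked after mapping and the spike-in amount adjusted in later experiments, since protocol-specific biases can shift the ratio.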

Library Preparation and Sequencing

The SIRVs can be analyzed with almost any RNA-Seq protocol and any NGS platform (e.g., Illumina®, IonTorrent®, PacBio™, or Oxford Nanopore Technologies™). Being part of one sample, SIRVs undergo the very same reaction steps of library preparation and sequencing as the endogenous RNA. The sequencing data file then contains reads from SIRVs and endogenous RNA.

Evaluation

The origin of reads is determined by mapping to a combined index consisting of the reference genome and the SIRVome, the spike-in genome that comprises the transcript sequences and annotations. Although the SIRV data is linked to the data stemming from the endogenous RNA, it is only a fraction of its size, which enables a very fast evaluation of the SIRV data subset.
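Once reads are mapped to the combined index, the share of reads assigned to the spike-ins can be checked from per-reference read counts. The sketch below parses text in the tab-separated format produced by `samtools idxstats` (reference name, length, mapped reads, unmapped reads). The reference name prefixes `SIRV` and `ERCC` and the example counts are assumptions for illustration; the prefixes must match the names used in the combined index at hand.

```python
def spike_in_read_fraction(idxstats_text, spike_prefixes=("SIRV", "ERCC")):
    """Return the fraction of mapped reads assigned to spike-in references.

    idxstats_text: tab-separated lines of
        reference name, sequence length, mapped reads, unmapped reads
    spike_prefixes: assumed name prefixes of spike-in references.
    """
    spike_reads = 0
    total_reads = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The "*" line collects reads without a reference; skip it.
            continue
        mapped = int(mapped)
        total_reads += mapped
        if name.startswith(spike_prefixes):
            spike_reads += mapped
    return spike_reads / total_reads if total_reads else 0.0


# Made-up counts for illustration only.
EXAMPLE_IDXSTATS = (
    "chr1\t248956422\t9000\t0\n"
    "chr2\t242193529\t890\t0\n"
    "SIRV1\t12643\t60\t0\n"
    "SIRV2\t5911\t40\t0\n"
    "ERCC-00002\t1061\t10\t0\n"
    "*\t0\t0\t500\n"
)

print(f"{spike_in_read_fraction(EXAMPLE_IDXSTATS):.3%} of mapped reads are spike-in reads")
```

Comparing this fraction with the intended target (for example 1%) gives a quick first check before any of the quality metrics are computed.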

The data from the SIRVs can be used for the quality control of the NGS experiment, to assess sequencing errors and biases, and for troubleshooting. The quality of RNA-Seq experiments can be determined by calculating unique quality metrics in the form of

coefficient of deviation (CoD), calculated by comparing the measured coverage with the expected coverage,
precision, a measure for the statistical variability, and
accuracy, a measure of the statistical bias.

These quality metrics are derived from the spike-in transcripts but reflect the situation in the endogenous RNA data set.
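To make the distinction between bias and variability concrete, the following Python sketch computes two generic stand-ins from spike-in abundances: the median absolute log2 fold change between measured and expected values (as an indicator of bias) and the coefficient of variation across replicates (as an indicator of variability). These are common textbook formulas chosen for illustration and are not claimed to be the definitions used for the CoD, precision, or accuracy metrics named above. All function names, transcript identifiers, and values are hypothetical.

```python
import math
import statistics


def log2_fold_changes(measured, expected):
    """Per-transcript log2(measured / expected) for transcripts detected above zero."""
    return {
        t: math.log2(measured[t] / expected[t])
        for t in expected
        if measured.get(t, 0.0) > 0.0
    }


def median_abs_log2_fc(measured, expected):
    """Bias indicator: 0 means measured matches expected; larger is worse."""
    lfc = log2_fold_changes(measured, expected)
    return statistics.median(abs(v) for v in lfc.values())


def coefficient_of_variation(replicate_values):
    """Variability indicator: sample standard deviation divided by the mean."""
    return statistics.stdev(replicate_values) / statistics.mean(replicate_values)


# Hypothetical relative abundances; identifiers and numbers are illustrative.
expected = {"SIRV101": 1.0, "SIRV102": 1.0, "SIRV103": 1.0}
measured = {"SIRV101": 2.0, "SIRV102": 0.5, "SIRV103": 1.0}

print("median |log2 FC|:", median_abs_log2_fc(measured, expected))
print("CV across replicates:", coefficient_of_variation([2.0, 4.0]))
```

A data set can score well on one indicator and poorly on the other, which is why bias and variability are reported separately.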

Comparison

Because SIRV data sets are well defined and compact, comparisons require proportionally less computational power, which ensures fast processing. Differences between the linked SIRV data sets proportionally mirror those in the main data of the endogenous RNA. Concordance is independent of accuracy; it describes the coherence of data sets and identifies endogenous RNA data sets that are suitable for meaningful comparisons, e.g., differential expression analyses.
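The point that concordance does not depend on accuracy can be illustrated with a small Python sketch. Here concordance between two experiments is approximated by the Pearson correlation of log2 spike-in abundances, which is an assumption made for illustration and not necessarily the measure used in the SIRV evaluation. Function names, identifiers, and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def concordance(run_a, run_b):
    """Correlation of log2 abundances over transcripts detected in both runs."""
    shared = sorted(t for t in run_a if run_b.get(t, 0.0) > 0.0 and run_a[t] > 0.0)
    return pearson(
        [math.log2(run_a[t]) for t in shared],
        [math.log2(run_b[t]) for t in shared],
    )


# Hypothetical data: every transcript is expected at 1.0, so both runs are
# inaccurate, yet they share the same distortion and are highly concordant.
run_a = {"T1": 2.0, "T2": 0.5, "T3": 1.0, "T4": 4.0}
run_b = {"T1": 2.2, "T2": 0.45, "T3": 1.1, "T4": 3.8}

print(f"concordance(run_a, run_b) = {concordance(run_a, run_b):.3f}")
```

Two runs that share the same systematic distortion correlate strongly with each other even though neither matches the expected values, which is the situation in which a comparison between them can still be meaningful.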

Background

At present, comparisons are carried out only for exemplary inter-laboratory studies using reference RNA samples, which investigate different RNA treatments, NGS platforms and data evaluation algorithms (SEQC/MAQC-III Consortium 2014; Li et al. 2014b).
