RNASeq is a protocol to quantitate the transcriptome and also to detect novel genes, transcripts, splice sites, exons, and ncRNAs. The focus here is on estimating the relative number of mRNA transcripts in order to assess differential gene expression between sample phenotypes.
The protocol, in brief, is as follows. cDNA from reverse transcription of poly-A captured transcripts is sequenced, and the fragment reads are aligned to a reference genome or transcriptome. The relative number of reads (after normalization) for each gene is the measure used to compare expression between samples.
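This normalized measure, Reads Per Kilobase per Million mapped reads (RPKM), is the read count for a gene divided by the gene length in kilobases and by the total number of mapped reads in millions. It is calculated as follows.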
Say gene G has a length of 1000bp.
Sample A has 100 reads mapping to this gene out of a total of 10 million mappable reads for the entire sample, which gives RPKM=100/(1*10)=10
Sample B has 25 reads mapping to this gene out of a total of 5 million mappable reads for the entire sample, which gives RPKM=25/(1*5)=5
However, this 2-fold change in expression might be misleading. For example, gene G in sample A might be expressed as a long 1000bp isoform while sample B expresses a shorter isoform of length, say, 500bp.
In that case sample B has RPKM=25/(0.5*5)=10, so there is no difference in expression between the two samples.
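The RPKM arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch of the formula as described here; the function name rpkm and its parameter names are chosen for illustration and do not come from any particular tool.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (illustrative helper)."""
    length_kb = length_bp / 1_000                      # gene or isoform length in kilobases
    mapped_millions = total_mapped_reads / 1_000_000   # library size in millions of reads
    return read_count / (length_kb * mapped_millions)


# Gene G treated as 1000bp in both samples
sample_a = rpkm(100, 1000, 10_000_000)       # 100 / (1 * 10)
sample_b = rpkm(25, 1000, 5_000_000)         # 25 / (1 * 5)

# Sample B re-computed with the shorter 500bp isoform
sample_b_short = rpkm(25, 500, 5_000_000)    # 25 / (0.5 * 5)

print(sample_a, sample_b, sample_b_short)    # 10.0 5.0 10.0
```

The apparent 2-fold difference disappears once the length used for sample B matches the isoform that is actually expressed.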
A better way to estimate gene abundance is to add up individual isoform totals. However, this brings up the problem of assigning short reads to specific isoforms of a gene.
In addition, there is the problem of short reads that map equally well to several genes in the same family or to several isoforms. Some algorithms discard these reads while others assign them in proportion to uniquely mapped reads. The discarded fraction is reportedly significant, for example 17% of reads for the mouse.
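The proportional strategy mentioned above can be illustrated with a toy example. The sketch below assumes each ambiguous read is split among its candidate genes in proportion to their uniquely mapped read counts, with an even split when no candidate has unique reads (that fallback is an assumption made here). The function name allocate_multireads and the gene names are hypothetical, and real tools may also account for gene length and other factors.

```python
from collections import defaultdict


def allocate_multireads(unique_counts, multireads):
    """Split each multi-mapped read across its candidate genes,
    weighted by the candidates' uniquely mapped read counts."""
    counts = defaultdict(float)
    for gene, n in unique_counts.items():
        counts[gene] += n

    for candidates in multireads:
        weights = [unique_counts.get(gene, 0) for gene in candidates]
        total = sum(weights)
        for gene, weight in zip(candidates, weights):
            # Assumed fallback: split evenly if no candidate has unique reads
            share = weight / total if total > 0 else 1 / len(candidates)
            counts[gene] += share
    return dict(counts)


# Hypothetical data: two paralogous genes and four reads that map to both
unique = {"geneX": 30, "geneY": 10}
multi = [("geneX", "geneY")] * 4

result = allocate_multireads(unique, multi)
print(result)  # {'geneX': 33.0, 'geneY': 11.0}
```

Discarding the four ambiguous reads instead would leave the counts at 30 and 10, losing about 9% of the reads in this toy case.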
RSEM (http://bioinformatics.oxfordjournals.org/content/26/4/493.full) improves on earlier methods (http://www.ncbi.nlm.nih.gov/pubmed/18516045) of estimating gene abundance: it assigns reads to specific isoforms using a rigorous statistical approach, and gene quantities are then estimated by summing the isoform values.
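The final summing step can be sketched as follows. This shows only the roll-up from isoform-level estimates to gene-level totals, not RSEM's statistical model for assigning reads. The function name gene_abundance, the isoform identifiers, and the abundance values are all hypothetical.

```python
from collections import defaultdict


def gene_abundance(isoform_abundance, isoform_to_gene):
    """Sum isoform-level abundance estimates into gene-level totals."""
    totals = defaultdict(float)
    for isoform, value in isoform_abundance.items():
        totals[isoform_to_gene[isoform]] += value
    return dict(totals)


# Hypothetical isoform-level estimates (arbitrary units)
isoform_abundance = {"G.long": 4.0, "G.short": 6.0, "H.1": 2.5}
isoform_to_gene = {"G.long": "G", "G.short": "G", "H.1": "H"}

genes = gene_abundance(isoform_abundance, isoform_to_gene)
print(genes)  # {'G': 10.0, 'H': 2.5}
```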
The TCGA portal provides RNASeq relative gene abundance data for various cancer samples, processed through both methods. Older data is available under the data type RNASeq, while output from RSEM is available as data type RNASeqV2. The same data in a consolidated format is available on the Broad website.