🧬 RNA Sequencing (RNA-seq)
20251217
RNA sequencing (RNA-seq) is a high-throughput transcriptomic technology based on next-generation sequencing (NGS) that enables comprehensive profiling of RNA molecules within a biological sample. RNA-seq allows both quantitative and qualitative interrogation of the transcriptome, including gene expression levels, alternative splicing, transcript isoforms, gene fusions, and diverse classes of non-coding RNAs. It has become a core methodology in functional genomics, systems biology, and translational biomedical research.
Compared with hybridization-based platforms such as microarrays, RNA-seq provides superior dynamic range, nucleotide-level resolution, and the ability to detect previously unannotated transcripts, making it the current gold standard for transcriptome analysis.
1. Experimental Workflow of RNA-seq
1.1 RNA Extraction and Library Preparation
Isolation of high-quality total RNA from cells or tissues
Evaluation of RNA quantity and integrity (e.g., RIN or DV200)
RNA enrichment strategies: poly(A) selection for mRNA-focused profiling, or rRNA depletion for total RNA and non-coding RNA analysis
Reverse transcription of RNA into complementary DNA (cDNA)
Fragmentation, adapter ligation, and incorporation of sample-specific indices
1.2 Sequencing
Sequencing is most commonly performed on Illumina platforms
Paired-end sequencing (e.g., 2 × 100 bp or 2 × 150 bp) is widely adopted
Sequencing output is generated in FASTQ format containing read sequences and quality scores
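As a minimal sketch of what a FASTQ record contains, the following Python reads four-line records and decodes the quality string into Phred scores. It assumes uncompressed input, strictly four lines per record, and Phred+33 encoding (the convention of current Illumina output); the function name `read_fastq` and the example record are illustrative only.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        # Keep only the read identifier (text before the first whitespace)
        yield header[1:].split()[0], sequence, scores


# Hypothetical single-record input for illustration
EXAMPLE = "@read1 sample=demo\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
print(records)  # [('read1', 'ACGT', [40, 40, 20, 0])]
```

In practice FASTQ files are usually gzip-compressed and very large, so established parsers are preferable; the sketch is only meant to show the record layout.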
2. Bioinformatics Analysis Pipeline
RNA-seq analysis transforms raw sequencing reads into biologically interpretable measurements through a multi-step computational workflow.
2.1 Quality Control and Preprocessing
Assessment of raw read quality using FastQC
Aggregation of quality metrics across samples with MultiQC
Removal of adapter sequences and low-quality bases (e.g., Cutadapt, Trimmomatic, fastp)
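The trimming step can be illustrated with a deliberately simplified sketch: clip everything from the first exact adapter match onward, then remove trailing bases whose Phred score falls below a threshold. This is not how Cutadapt, Trimmomatic, or fastp are implemented (they handle partial and mismatched adapter matches and use more refined quality algorithms); the function `trim_read`, the threshold of 20, and the example read are assumptions made for illustration.

```python
from typing import List, Tuple


def trim_read(sequence: str, scores: List[int], adapter: str,
              min_quality: int = 20) -> Tuple[str, List[int]]:
    """Clip an exact adapter match, then trim low-quality bases from the 3' end."""
    cut = sequence.find(adapter)
    if cut != -1:
        # Drop the adapter and everything downstream of it
        sequence, scores = sequence[:cut], scores[:cut]
    end = len(sequence)
    # Walk back from the 3' end while the base quality is below the threshold
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], scores[:end]


# Hypothetical read: 8 insert bases followed by a 13-base adapter sequence
read_seq = "ACGTACGT" + "AGATCGGAAGAGC"
read_qual = [38, 38, 37, 36, 30, 12, 25, 8] + [30] * 13
trimmed_seq, trimmed_qual = trim_read(read_seq, read_qual, "AGATCGGAAGAGC")
print(trimmed_seq)  # ACGTACG
```

Note that only trailing low-quality bases are removed here; the internal base with score 12 is kept, which is one reason real tools use windowed or cumulative-score approaches instead.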
2.2 Read Alignment or Mapping
Two primary strategies are used for RNA-seq read mapping:
Genome alignment: splice-aware aligners such as STAR and HISAT2 map reads to the reference genome, which enables detection of exon–intron boundaries and novel splice junctions
Transcriptome-based mapping (pseudo-alignment or quasi-mapping): lightweight approaches such as Salmon and Kallisto assign reads to compatible transcripts without base-level alignment, which provides rapid and computationally efficient transcript quantification
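The idea behind transcriptome-based lightweight mapping can be sketched as follows: index every k-mer of every transcript, then report the transcripts that are compatible with all k-mers of a read. This is a toy illustration of the concept, not the actual algorithm of Salmon or Kallisto; it ignores reverse complements, sequencing errors, and the statistical step that resolves multi-mapping reads. The transcript names, sequences, and k = 5 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        # Intersect the compatibility sets of successive k-mers
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()  # empty if the read is shorter than k


# Two hypothetical isoforms that share a 5' region and differ at the 3' end
transcripts = {"tx1": "ATGGCGTACGTTAGC", "tx2": "ATGGCGTACCCGATT"}
index = build_kmer_index(transcripts, k=5)
print(pseudoalign("GGCGTAC", index, k=5) == {"tx1", "tx2"})  # True: shared region
print(pseudoalign("GTACGTTA", index, k=5) == {"tx1"})        # True: tx1-specific
```

A read from the shared region stays ambiguous between both isoforms, which is why quantification tools follow this step with a probabilistic assignment of reads to transcripts.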
2.3 Expression Quantification
Gene-level quantification: featureCounts, HTSeq-count
Transcript-level quantification: Salmon, Kallisto
Quantification outputs may be expressed as raw counts, TPM, or FPKM, depending on downstream analytical requirements.
3. Expression Units and Normalization
RNA-seq expression values can be represented using multiple normalization schemes, each designed for a specific purpose. Correct usage of these metrics is critical for valid biological interpretation.
3.1 Expression Units (Library-size / Length–based Normalization)
| Metric | Description | Accounted factors | Typical use | Not recommended for |
| --- | --- | --- | --- | --- |
| CPM (Counts Per Million) | Raw gene counts divided by the total number of mapped reads in the sample, multiplied by one million | Sequencing depth | Between-sample comparison of the same gene across biological replicates; quality control and filtering | Within-sample gene comparisons; differential expression (DE) analysis |
| TPM (Transcripts Per Million) | Counts divided by transcript length (kb), then rescaled so that the values in each sample sum to one million | Sequencing depth, gene length | Within-sample gene expression comparison; cross-sample visualization within the same condition | DE analysis |
| RPKM / FPKM | Read (RPKM) or fragment (FPKM) counts scaled by library size (per million mapped reads) and by transcript length (kb); similar to TPM but calculated in a different order, so per-sample totals differ | Sequencing depth, gene length | Comparison of expression levels between genes within a single sample | Between-sample comparisons; DE analysis |
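The three units above differ only in what the counts are divided by and in which order. A minimal sketch, assuming per-gene counts and lengths in base pairs held in plain dictionaries (the gene names and numbers are made up for illustration):

```python
from typing import Dict


def cpm(counts: Dict[str, float]) -> Dict[str, float]:
    """Counts per million: scale by library size only."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def fpkm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """RPKM/FPKM: scale by library size (millions) and by length (kb)."""
    total = sum(counts.values())
    return {g: c / (lengths_bp[g] / 1e3) / (total / 1e6) for g, c in counts.items()}


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """TPM: length-normalize first, then rescale so the sample sums to one million."""
    rate = {g: c / (lengths_bp[g] / 1e3) for g, c in counts.items()}
    scale = sum(rate.values())
    return {g: r / scale * 1e6 for g, r in rate.items()}


# Hypothetical single-sample data
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

cpm_values = cpm(counts)
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(round(sum(tpm_values.values())))   # 1000000
print(round(sum(fpkm_values.values())))  # 500000
```

In this example geneA and geneB have the same TPM despite a threefold difference in raw counts, because geneB is three times longer. TPM values always sum to one million per sample, whereas the FPKM total depends on the length distribution of the expressed genes, which is one reason FPKM is discouraged for between-sample comparisons.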
3.2 Differential Expression–oriented Normalization
| Method | Description | Accounted factors | Recommended use |
| --- | --- | --- | --- |
| DESeq2 (median of ratios) | Estimates a sample-specific size factor as the median, across genes, of the ratio of each gene's count to that gene's geometric mean across all samples | Sequencing depth, RNA composition | Between-sample normalization for DE analysis using raw counts |
| edgeR (TMM) | Uses a weighted trimmed mean of log expression ratios (M-values) between each sample and a reference sample to estimate normalization factors | Sequencing depth, RNA composition (robust to composition bias) | Between-sample normalization for DE analysis using raw counts |
Reference: DGE_workshop from Harvard Chan Bioinformatics Core
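The median-of-ratios idea can be written out in a few lines. The sketch below is a plain-Python illustration of the calculation, not DESeq2's own code: it assumes a small count matrix stored as one list per sample with genes in the same order, and it skips any gene with a zero count in at least one sample when forming the reference. Sample names and counts are hypothetical.

```python
import math
from statistics import median
from typing import Dict, List


def size_factors(counts: Dict[str, List[int]]) -> Dict[str, float]:
    """Median-of-ratios size factors; counts maps sample -> per-gene counts."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples (None if any count is zero)
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = exp(median over genes of log(count) - log(geometric mean))
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm
                      for g, lgm in enumerate(log_geo_means) if lgm is not None]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical data: sample B was sequenced twice as deeply as sample A,
# and the last gene is additionally strongly up-regulated in B.
raw = {"A": [10, 20, 30, 40, 100], "B": [20, 40, 60, 80, 2000]}
factors = size_factors(raw)
normalized = {s: [c / factors[s] for c in raw[s]] for s in raw}
print(round(factors["B"] / factors["A"], 6))  # 2.0
```

Because the median is taken over genes, the single strongly up-regulated gene does not distort the estimate: the size factors recover the twofold depth difference, whereas scaling by total counts would be pulled toward the outlier. This is the sense in which the method accounts for RNA composition.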
4. Differential Expression Analysis
Differential expression (DE) analysis aims to identify genes whose expression levels differ significantly between experimental conditions.
DESeq2: negative binomial modeling of count data; shrinkage estimation for dispersion and fold change
edgeR: negative binomial framework with empirical Bayes dispersion estimation; highly flexible for complex experimental designs
limma-voom: linear modeling with precision weights derived from count data; efficient for large sample cohorts
Robust DE analysis requires appropriate normalization, adequate biological replication, and careful control of multiple testing.
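Multiple-testing control is commonly done with the Benjamini-Hochberg false discovery rate procedure, which yields the adjusted p-values reported alongside each gene. A minimal sketch of that adjustment, assuming a plain list of raw p-values (the values below are invented for illustration):

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print([round(p, 4) for p in adj_p])  # [0.02, 0.04, 0.04, 0.02]
```

With thousands of genes tested at once, thresholding raw p-values at 0.05 would produce many false positives by chance alone; thresholding the adjusted values instead bounds the expected proportion of false discoveries among the genes called significant.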
5. Functional Interpretation
Downstream analyses are performed to translate DE results into biological insights:
Gene Ontology (GO) enrichment analysis
Pathway analysis (KEGG, Reactome)
Gene set enrichment analysis (GSEA)
Commonly used tools include:
clusterProfiler
fgsea
DAVID
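Enrichment analysis of GO terms or pathways is often framed as an over-representation test: given a list of DE genes, is a gene set hit more often than expected by chance? One common formulation is a one-sided hypergeometric test, sketched below with the standard library only. This is an illustration of the statistic, not the implementation used by any of the tools listed above; the gene identifiers and the toy pathway are hypothetical.

```python
from math import comb
from typing import Set


def over_representation_pvalue(de_genes: Set[str], gene_set: Set[str],
                               universe: Set[str]) -> float:
    """One-sided hypergeometric p-value: P(overlap >= observed)."""
    # Restrict both lists to the background universe of tested genes
    de = de_genes & universe
    members = gene_set & universe
    n_universe, n_set, n_de = len(universe), len(members), len(de)
    observed = len(de & members)
    upper = min(n_set, n_de)
    # Sum the probabilities of seeing the observed overlap or a larger one
    tail = sum(comb(n_set, k) * comb(n_universe - n_set, n_de - k)
               for k in range(observed, upper + 1))
    return tail / comb(n_universe, n_de)


# Hypothetical example: 20 tested genes, a 5-gene pathway, 4 DE genes, 3 in the pathway
universe = {f"g{i}" for i in range(20)}
toy_pathway = {"g0", "g1", "g2", "g3", "g4"}
de_genes = {"g0", "g1", "g2", "g10"}
p_value = over_representation_pvalue(de_genes, toy_pathway, universe)
print(round(p_value, 4))  # 0.032
```

The choice of universe matters: it should contain the genes that could have been detected as DE in the experiment, not the whole genome. GSEA-style methods take a different approach and use the full ranked gene list rather than a thresholded DE set.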
6. Common Output Files and Data Formats
FASTQ: raw sequencing reads with base quality scores
SAM / BAM: aligned reads in text (SAM) or binary (BAM) format
Count matrix: gene-by-sample read count table
DE results: log2 fold change, p-values, adjusted p-values
7. Applications of RNA-seq
Genome-wide gene expression profiling
Cancer transcriptomics and fusion gene discovery
Developmental and stem cell biology
Disease mechanism studies and biomarker discovery
Extensions to single-cell RNA-seq for cellular heterogeneity analysis
8. Strengths and Limitations
Strengths
High sensitivity and wide dynamic range
Detection of novel transcripts and alternative splicing
Applicability to non-model organisms via de novo assembly
Limitations
Computationally intensive data processing
Sensitivity to RNA quality and batch effects
Transcript abundance does not directly reflect protein activity
9. Experimental Design Considerations
Adequate biological replication and statistical power
Appropriate sequencing depth and read length selection
Library preparation strategy (poly(A) selection vs rRNA depletion)
Mitigation of batch effects and rigorous quality control
RNA-seq is a powerful and versatile approach for transcriptome-wide analysis. When paired with thoughtful experimental design and rigorous bioinformatics workflows, RNA-seq provides critical insights into gene regulation, disease mechanisms, and molecular phenotypes that support precision medicine and translational research.