🧬 RNA Sequencing (RNA-seq)

20251217

RNA sequencing (RNA-seq) is a high-throughput transcriptomic technology based on next-generation sequencing (NGS) that enables comprehensive profiling of RNA molecules within a biological sample. RNA-seq allows both quantitative and qualitative interrogation of the transcriptome, including gene expression levels, alternative splicing, transcript isoforms, gene fusions, and diverse classes of non-coding RNAs. It has become a core methodology in functional genomics, systems biology, and translational biomedical research.

Compared with hybridization-based platforms such as microarrays, RNA-seq provides superior dynamic range, nucleotide-level resolution, and the ability to detect previously unannotated transcripts, making it the current gold standard for transcriptome analysis.


1. Experimental Workflow of RNA-seq

1.1 RNA Extraction and Library Preparation

  • Isolation of high-quality total RNA from cells or tissues

  • Evaluation of RNA quantity and integrity (e.g., RIN or DV200)

  • RNA enrichment strategies:

    • Poly(A) selection for mRNA-focused profiling

    • rRNA depletion for total RNA and non-coding RNA analysis

  • Reverse transcription of RNA into complementary DNA (cDNA)

  • Fragmentation, adapter ligation, and incorporation of sample-specific indices

1.2 Sequencing

  • Sequencing is most commonly performed on Illumina platforms

  • Paired-end sequencing (e.g., 2 × 100 bp or 2 × 150 bp) is widely adopted

  • Sequencing output is generated in FASTQ format containing read sequences and quality scores
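As a minimal sketch of what a FASTQ file holds, the Python snippet below reads four-line records and decodes each quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina output and unwrapped (single-line) sequences; `FastqRecord` and `parse_fastq` are illustrative names, not part of any library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (a blank line also ends parsing in this sketch)
            return
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33, i.e. score = ASCII code of the character minus 33.
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Tiny in-memory example instead of a real file.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(parse_fastq(example))
print(records[0].qualities)  # [40, 40, 20, 2]
```

For real data, dedicated parsers (for example in Biopython or pysam) are usually a better choice than hand-written code.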


2. Bioinformatics Analysis Pipeline

RNA-seq analysis transforms raw sequencing reads into biologically interpretable measurements through a multi-step computational workflow.

2.1 Quality Control and Preprocessing

  • Assessment of raw read quality using FastQC

  • Aggregation of quality metrics across samples with MultiQC

  • Removal of adapter sequences and low-quality bases (e.g., Cutadapt, Trimmomatic, fastp)
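To illustrate what quality trimming does, here is a deliberately naive sketch that removes low-quality bases from the 3' end of a read and discards reads that become too short. It is not the algorithm implemented by Cutadapt, Trimmomatic, or fastp; the function names and the thresholds (Phred 20, 30 bp) are assumptions chosen for the example.

```python
from typing import List, Optional, Tuple


def trim_3prime(
    sequence: str, qualities: List[int], min_quality: int = 20
) -> Tuple[str, List[int]]:
    """Drop trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def trim_and_filter(
    sequence: str,
    qualities: List[int],
    min_quality: int = 20,
    min_length: int = 30,
) -> Optional[Tuple[str, List[int]]]:
    """Trim the 3' end, then return None if the read is shorter than min_length."""
    trimmed_seq, trimmed_qual = trim_3prime(sequence, qualities, min_quality)
    if len(trimmed_seq) < min_length:
        return None
    return trimmed_seq, trimmed_qual


# The last two bases fall below Phred 20 and are removed.
trimmed = trim_3prime("ACGTAC", [38, 37, 30, 35, 12, 8])
print(trimmed)  # ('ACGT', [38, 37, 30, 35])
```

Production trimmers also handle adapter matching, paired reads, and more robust quality criteria, so this sketch is only meant to show the principle.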

2.2 Read Alignment or Mapping

Two primary strategies are used for RNA-seq read mapping:

  • Genome alignment

    • Splice-aware aligners such as STAR and HISAT2

    • Enables detection of exon–intron boundaries and novel splice junctions

  • Transcriptome-based mapping (pseudo-alignment)

    • Lightweight approaches such as Salmon and Kallisto

    • Provides rapid and computationally efficient transcript quantification
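The idea behind pseudo-alignment can be sketched with a toy k-mer index: instead of computing a base-by-base alignment, each read is assigned the set of transcripts that contain all of its k-mers. The snippet below is a conceptual illustration only. Real tools use far more elaborate index structures and handle reverse complements, sequencing errors, and paired reads; the transcript names, sequences, and function names here are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Set


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, sequence in transcripts.items():
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]].add(name)
    return index


def compatible_transcripts(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for start in range(len(read) - k + 1):
        hits = index.get(read[start:start + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:  # no transcript explains every k-mer seen so far
            break
    return compatible or set()


# Two hypothetical isoforms sharing their first seven bases.
transcripts = {"tx1": "ACGTACGGTCA", "tx2": "ACGTACGCCAT"}
index = build_kmer_index(transcripts, k=5)

print(sorted(compatible_transcripts("ACGTACG", index, k=5)))  # ['tx1', 'tx2']
print(sorted(compatible_transcripts("TACGGTC", index, k=5)))  # ['tx1']
```

Reads compatible with several transcripts, like the first one, are what the statistical model of a quantifier then has to apportion among isoforms.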

2.3 Expression Quantification

  • Gene-level quantification:

    • featureCounts

    • HTSeq-count

  • Transcript-level quantification:

    • Salmon

    • Kallisto

Quantification outputs may be expressed as raw counts, TPM, or FPKM, depending on downstream analytical requirements.


3. Expression Units and Normalization

RNA-seq expression values can be represented using multiple normalization schemes, each designed for a specific purpose. Correct usage of these metrics is critical for valid biological interpretation.

3.1 Expression Units (Library-size / Length-based Normalization)

| Metric | Description | Factors accounted for | Typical use | Not recommended for |
| --- | --- | --- | --- | --- |
| CPM (Counts Per Million) | Raw gene counts scaled by the total number of mapped reads per sample | Sequencing depth | Between-sample comparison of the same gene across biological replicates; quality control and filtering | Within-sample gene comparisons; differential expression (DE) analysis |
| TPM (Transcripts Per Million) | Gene counts divided by transcript length (kb), then scaled so that the values in each sample sum to one million | Sequencing depth, gene length | Within-sample gene expression comparison; cross-sample visualization within the same condition | DE analysis |
| RPKM / FPKM | Read (RPKM) or fragment (FPKM) counts scaled by library size (per million mapped reads) and by gene length (kb); similar to TPM, but with the two scaling steps applied in the opposite order | Sequencing depth, gene length | Comparison of expression levels between genes within a single sample | Between-sample comparisons; DE analysis |
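The difference between these units is easiest to see on a small example. The following sketch computes CPM, FPKM, and TPM for one sample in plain Python; the gene names, counts, and lengths are invented, and the sum of the counts is used as a stand-in for the number of mapped reads.

```python
from typing import Dict

# Hypothetical raw counts and gene lengths (bp) for a single sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 4000}


def cpm(counts: Dict[str, int]) -> Dict[str, float]:
    """Counts per million: scale by library size only."""
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}


def fpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Scale by library size (millions) and by gene length (kb)."""
    total_millions = sum(counts.values()) / 1e6
    return {
        gene: count / total_millions / (lengths_bp[gene] / 1e3)
        for gene, count in counts.items()
    }


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Length-normalize first, then rescale so the sample sums to one million."""
    rates = {gene: count / (lengths_bp[gene] / 1e3) for gene, count in counts.items()}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


cpm_values = cpm(counts)
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

for gene in counts:
    print(gene, round(cpm_values[gene]), round(fpkm_values[gene]), round(tpm_values[gene]))
# geneA 100000 100000 250000
# geneB 300000 150000 375000
# geneC 600000 150000 375000
```

In this example geneB and geneC have different raw counts but identical length-normalized values. TPM values always sum to one million per sample, whereas FPKM values generally do not (here they sum to 400,000).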

3.2 Differential Expression–oriented Normalization

| Method | Description | Factors accounted for | Recommended use |
| --- | --- | --- | --- |
| DESeq2 (median of ratios) | Estimates a size factor for each sample as the median, across genes, of the ratio between the gene's count in that sample and the gene's geometric mean across all samples | Sequencing depth, RNA composition | Between-sample normalization for DE analysis using raw counts |
| edgeR (TMM) | Uses a trimmed mean of log expression ratios (M-values) between samples to estimate normalization factors | Sequencing depth, RNA composition (robust to composition bias) | Between-sample normalization for DE analysis using raw counts |

Reference: DGE_workshop from Harvard Chan Bioinformatics Core
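The median-of-ratios idea can be sketched in a few lines of plain Python. This is a simplified illustration of the principle, not the DESeq2 implementation; the count matrix and the function name `size_factors` are assumptions made for the example.

```python
import math
from statistics import median
from typing import Dict, List


def size_factors(counts: Dict[str, List[int]]) -> List[float]:
    """Median-of-ratios size factors; counts maps gene -> raw counts per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) == 0:
            continue  # a zero count makes the geometric mean zero, so the gene is skipped
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for sample, count in enumerate(gene_counts):
            log_ratios[sample].append(math.log(count) - log_geo_mean)
    # Raises statistics.StatisticsError if every gene has a zero in some sample.
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical data: sample 2 was sequenced exactly twice as deeply as sample 1.
raw = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [50, 100],
    "geneD": [0, 5],  # ignored when estimating size factors
}
factors = size_factors(raw)
normalized = {
    gene: [count / factor for count, factor in zip(values, factors)]
    for gene, values in raw.items()
}
print([round(f, 3) for f in factors])  # [0.707, 1.414]
```

After dividing by the size factors, geneA, geneB, and geneC have equal normalized counts in both samples, which is the expected result when the only difference between the samples is sequencing depth.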


4. Differential Expression Analysis

Differential expression (DE) analysis aims to identify genes whose expression levels differ significantly between experimental conditions.

  • DESeq2

    • Negative binomial modeling of count data

    • Shrinkage estimation for dispersion and fold change

  • edgeR

    • Negative binomial framework with empirical Bayes dispersion estimation

    • Highly flexible for complex experimental designs

  • limma-voom

    • Linear modeling with precision weights derived from count data

    • Efficient for large sample cohorts

Robust DE analysis requires appropriate normalization, adequate biological replication, and careful control of multiple testing.
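Multiple-testing control is commonly handled by adjusting p-values for the false discovery rate. As one example, the sketch below implements the Benjamini-Hochberg adjustment in plain Python; the choice of this procedure and the p-values shown are assumptions for illustration, and DE packages provide their own implementations.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene p-values
adjusted = benjamini_hochberg(p_values)
print([round(p, 4) for p in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```

Genes are then typically called significant when the adjusted p-value falls below a chosen threshold, which should be fixed before looking at the results.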


5. Functional Interpretation

Downstream analyses are performed to translate DE results into biological insights:

  • Gene Ontology (GO) enrichment analysis

  • Pathway analysis (KEGG, Reactome)

  • Gene set enrichment analysis (GSEA)

  • Commonly used tools include:

    • clusterProfiler

    • fgsea

    • DAVID
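Over-representation analysis of a GO term or pathway typically rests on a hypergeometric (or closely related Fisher-type) test: given how many genes in the background belong to a gene set, how surprising is the overlap with the DE gene list? The sketch below computes that upper-tail probability with the standard library. The numbers are invented, the function name is illustrative, and rank-based methods such as GSEA work differently and are not covered here.

```python
from math import comb


def enrichment_p_value(
    n_universe: int, n_in_set: int, n_selected: int, n_overlap: int
) -> float:
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_in_set belong to the gene set."""
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 DE genes, 2 of them in the pathway.
p_small = enrichment_p_value(n_universe=10, n_in_set=4, n_selected=3, n_overlap=2)
print(round(p_small, 4))  # 0.3333
```

Because many gene sets are tested at once, the resulting p-values also need multiple-testing correction, and the choice of background gene universe can change the outcome substantially.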


6. Common Output Files and Data Formats

| File type | Description |
| --- | --- |
| FASTQ | Raw sequencing reads with base quality scores |
| SAM / BAM | Aligned reads in text or binary format |
| Count matrix | Gene-by-sample read count table |
| DE results | log2 fold change, p-values, adjusted p-values |
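A DE results table is usually filtered on adjusted p-value and effect size before interpretation. The sketch below reads a tab-separated table and returns the genes passing both filters. The column names (`gene`, `log2FoldChange`, `padj`) are modeled on DESeq2-style output but are assumptions here, as are the thresholds and the example rows.

```python
import csv
import io
from typing import List, TextIO


def significant_genes(
    handle: TextIO, padj_cutoff: float = 0.05, min_abs_log2fc: float = 1.0
) -> List[str]:
    """Return genes with padj below the cutoff and |log2 fold change| at or above the minimum."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Missing values (for example genes removed by filtering) are skipped.
        if row["padj"] in ("", "NA") or row["log2FoldChange"] in ("", "NA"):
            continue
        if (
            float(row["padj"]) < padj_cutoff
            and abs(float(row["log2FoldChange"])) >= min_abs_log2fc
        ):
            hits.append(row["gene"])
    return hits


# Hypothetical results table.
table = io.StringIO(
    "gene\tlog2FoldChange\tpvalue\tpadj\n"
    "geneA\t2.5\t0.0001\t0.001\n"
    "geneB\t-1.8\t0.002\t0.01\n"
    "geneC\t0.3\t0.0004\t0.004\n"
    "geneD\t3.1\t0.2\tNA\n"
)
hits = significant_genes(table)
print(hits)  # ['geneA', 'geneB']
```

geneC is statistically significant but fails the effect-size filter, and geneD is skipped because its adjusted p-value is missing.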


7. Applications of RNA-seq

  • Genome-wide gene expression profiling

  • Cancer transcriptomics and fusion gene discovery

  • Developmental and stem cell biology

  • Disease mechanism studies and biomarker discovery

  • Extensions to single-cell RNA-seq for cellular heterogeneity analysis


8. Strengths and Limitations

Strengths

  • High sensitivity and wide dynamic range

  • Detection of novel transcripts and alternative splicing

  • Applicability to non-model organisms via de novo assembly

Limitations

  • Computationally intensive data processing

  • Sensitivity to RNA quality and batch effects

  • Transcript abundance does not directly reflect protein activity


9. Experimental Design Considerations

  • Adequate biological replication and statistical power

  • Appropriate sequencing depth and read length selection

  • Library preparation strategy (poly(A) selection vs rRNA depletion)

  • Mitigation of batch effects and rigorous quality control
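One design problem that no downstream correction can repair is complete confounding of batch and condition, for example when all controls are processed in one batch and all treated samples in another. The sketch below flags that situation from a sample sheet; the sample annotations and the function name are hypothetical, and partial confounding or unbalanced designs are not detected by this simple check.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def is_fully_confounded(samples: Iterable[Tuple[str, str]]) -> bool:
    """samples holds (condition, batch) pairs.

    Returns True when there are at least two conditions and every batch
    contains samples from only one of them.
    """
    conditions_by_batch: Dict[str, Set[str]] = defaultdict(set)
    for condition, batch in samples:
        conditions_by_batch[batch].add(condition)
    all_conditions: Set[str] = set()
    for conditions in conditions_by_batch.values():
        all_conditions |= conditions
    if len(all_conditions) < 2:
        return False  # nothing to compare, so nothing to confound
    return all(len(conditions) == 1 for conditions in conditions_by_batch.values())


confounded_design = [
    ("control", "batch1"), ("control", "batch1"),
    ("treated", "batch2"), ("treated", "batch2"),
]
balanced_design = [
    ("control", "batch1"), ("treated", "batch1"),
    ("control", "batch2"), ("treated", "batch2"),
]
print(is_fully_confounded(confounded_design))  # True
print(is_fully_confounded(balanced_design))    # False
```

In the balanced layout, batch can be included as a covariate in the DE model; in the confounded layout, batch and condition effects cannot be separated.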


RNA-seq is a powerful and versatile approach for transcriptome-wide analysis. When paired with thoughtful experimental design and rigorous bioinformatics workflows, RNA-seq provides critical insights into gene regulation, disease mechanisms, and molecular phenotypes that support precision medicine and translational research.
