PCA & Quality Control Plots

Results & Interpretation

Use principal component analysis and other QC visualisations to assess sample relationships, detect outliers, and verify that your experimental design drives the dominant variance in the data.

When to Use

  • Before interpreting differential expression results, to verify that samples group as expected by experimental condition.
  • When you suspect batch effects, sample outliers, or sample swaps that could compromise your analysis.
  • When you want to confirm that the primary factor of interest (e.g., Treatment) drives the largest component of variance rather than technical artifacts.

Required Inputs

  • A completed RNA-seq run. PCA is computed from the count data after a variance-stabilising transformation (VST).

What to Expect

  • PCA plot: samples are coloured by the selected factor, showing PC1 versus PC2 with the percentage of variance explained by each component.
  • Samples from the same experimental group should cluster together. Clear separation between groups suggests a strong biological signal.
  • The gene selection mode controls which genes are used to compute PCA: significant genes, most variable genes, or both.
  • A volcano plot of log2FoldChange versus -log10(padj) provides a complementary view of the differential expression landscape.
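
The PCA step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function name `pca_top_variable`, the `n_top` parameter, and the toy matrix are all hypothetical, and the input is assumed to be already variance-stabilised, with genes in rows and samples in columns.

```python
import numpy as np

def pca_top_variable(vst, n_top=500, n_components=2):
    """PCA of samples using the n_top most variable genes (illustrative helper).

    vst: 2-D array, genes in rows and samples in columns (already transformed).
    Returns (scores, explained): scores is samples x components, explained is
    the fraction of total variance carried by each returned component.
    """
    vst = np.asarray(vst, dtype=float)
    # Rank genes by variance across samples and keep the most variable ones.
    gene_var = vst.var(axis=1, ddof=1)
    keep = np.argsort(gene_var)[::-1][:n_top]
    # Samples become rows; centre each gene so PCA describes variation around its mean.
    x = vst[keep].T
    x = x - x.mean(axis=0)
    # SVD of the centred matrix: squared singular values are proportional to PC variances.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]

# Toy data (hypothetical): 200 genes, 3 control and 3 treated samples.
rng = np.random.default_rng(0)
vst = rng.normal(8.0, 0.2, size=(200, 6))
vst[:20, 3:] += 2.0  # 20 genes shifted upwards in the treated samples

scores, explained = pca_top_variable(vst, n_top=50)
for name, (pc1, pc2) in zip(["c1", "c2", "c3", "t1", "t2", "t3"], scores):
    print(f"{name}: PC1={pc1:.2f} PC2={pc2:.2f}")
print("variance explained (%):", np.round(explained * 100, 1))
```

In this toy example the treated and control samples fall on opposite sides of PC1, which is the pattern to look for when the experimental factor drives the dominant variance.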

Interpretation

  • If samples from the same group cluster tightly and groups are well separated, the experimental factor is likely the dominant source of variation.
  • If batch drives PC1 instead of your factor of interest, consider adding the batch variable to the model formula (~batch + condition).
  • A single outlier sample that separates from its group along PC1 or PC2 may indicate a technical problem (low RNA quality, contamination). Consider removing it and re-running the analysis.
  • When PC1 + PC2 together explain less than 50% of the total variance, the data structure is complex and may require additional components or covariates to interpret.
  • The variance-stabilising transformation (VST) used for PCA makes the variance approximately independent of the mean, which raw count data are not. Euclidean distances between samples then become meaningful instead of being dominated by the most highly expressed genes.
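
One way to judge whether batch or the factor of interest drives PC1 is to ask how much of the PC1 variance each factor accounts for. The sketch below uses the between-group sum of squares divided by the total sum of squares (the R² of a one-way layout). The helper name `factor_r2`, the sample labels, and the PC1 scores are hypothetical values chosen for illustration, not output from the tool.

```python
from collections import defaultdict

def factor_r2(scores, labels):
    """Fraction of the variance in one PC explained by a categorical factor
    (between-group sum of squares divided by total sum of squares)."""
    grand = sum(scores) / len(scores)
    total = sum((s - grand) ** 2 for s in scores)
    groups = defaultdict(list)
    for s, lab in zip(scores, labels):
        groups[lab].append(s)
    between = sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in groups.values())
    return between / total if total > 0 else 0.0

# Hypothetical PC1 scores for eight samples in a balanced 2 x 2 design.
pc1 = [-4.1, -3.8, 4.0, 3.9, -4.3, -3.6, 4.2, 3.7]
condition = ["ctrl", "trt", "ctrl", "trt", "ctrl", "trt", "ctrl", "trt"]
batch = ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"]

r2_condition = factor_r2(pc1, condition)
r2_batch = factor_r2(pc1, batch)
print(f"PC1 R^2: condition={r2_condition:.2f}, batch={r2_batch:.2f}")
if r2_batch > r2_condition:
    print("Batch explains more of PC1 than condition; consider ~batch + condition")
```

Adding batch to the formula only helps when batch and condition are not perfectly confounded. If every sample of one condition comes from a single batch, the two effects cannot be separated by the model.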

Common Pitfalls

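  • If batch dominates PC1 instead of your biological factor, a model containing only the condition term (~condition) leaves the batch variance unmodelled. This reduces power and can bias results when batches are unevenly distributed across conditions. Add batch to the model formula (~batch + condition).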
  • A single outlier sample can dominate the PCA projection and compress all other samples together. Remove or flag the outlier and re-run.
  • Running PCA with too few genes may not capture the true variance structure. Use at least the top 500 most variable genes.
  • Variance explained by PC1 + PC2 below 50% suggests the data has many independent sources of variation; proceed with caution and check higher components.
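
Outlier candidates can also be flagged numerically before deciding whether to remove them. The sketch below is a simple heuristic, not a method taken from the tool or from DESeq2: each sample is compared with the centroid of the other members of its group in PC1/PC2 space, and it is flagged when that distance exceeds `k` times the median distance. The function name `flag_outliers`, the threshold `k=3.0`, and the coordinates are assumptions for illustration.

```python
import math
from statistics import median

def flag_outliers(coords, groups, k=3.0):
    """Flag samples lying far from the other members of their group.

    coords: {sample: (pc1, pc2)}; groups: {sample: group label}.
    A sample is flagged when its distance to the centroid of the remaining
    group members exceeds k times the median of those distances (heuristic).
    """
    dist = {}
    for sample, (x, y) in coords.items():
        others = [coords[s] for s in coords
                  if s != sample and groups[s] == groups[sample]]
        if not others:
            continue  # a group of one has no reference point
        cx = sum(p[0] for p in others) / len(others)
        cy = sum(p[1] for p in others) / len(others)
        dist[sample] = math.hypot(x - cx, y - cy)
    cutoff = k * median(dist.values())
    return sorted(s for s, d in dist.items() if d > cutoff)

# Hypothetical PC1/PC2 coordinates: t4 sits far from the other treated samples.
coords = {
    "c1": (-5.0, 0.5), "c2": (-5.2, -0.3), "c3": (-4.8, 0.1), "c4": (-5.1, 0.2),
    "t1": (5.0, 0.2), "t2": (5.3, -0.4), "t3": (4.9, 0.3), "t4": (-1.0, 6.0),
}
groups = {s: ("ctrl" if s.startswith("c") else "trt") for s in coords}

flagged = flag_outliers(coords, groups)
print("possible outliers:", flagged)
```

A flag from a rule like this is a prompt to inspect the sample's quality metrics, not proof of a technical problem. With very small groups the centroid itself is pulled towards the outlier, so the threshold may need adjusting.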

References

  • Jolliffe, I.T. (2002). Principal Component Analysis (2nd ed.). Springer.
  • Love, M.I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.