PCA & Quality Control Plots
Use principal component analysis (PCA) and other QC visualisations to assess sample relationships, detect outliers, and verify that your experimental design drives the dominant variance in the data.
When to Use
- You are about to interpret differential expression results and need to verify that samples group as expected by experimental condition.
- You suspect batch effects, sample outliers, or possible sample swaps that could compromise your analysis.
- You want to confirm that the primary factor of interest (e.g., Treatment) drives the largest component of variance rather than technical artifacts.
Required Inputs
- A completed RNA-seq run (PCA is computed from variance-stabilising transformed count data).
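As a rough illustration of what that computation involves, the sketch below runs PCA on a small, already-transformed matrix (rows are samples, columns are genes) and reports the scores and percentage of variance for PC1 and PC2. It is a minimal, standard-library-only sketch, not this tool's implementation: the function names `power_iteration` and `pca_top2` and the toy values are hypothetical, and power iteration stands in for the SVD routine a real pipeline would normally use.

```python
import math


def power_iteration(sym, iters=500):
    """Approximate the leading eigenvector and eigenvalue of a symmetric
    positive semi-definite matrix given as a list of lists."""
    n = len(sym)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-constant start
    start_norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / start_norm for x in vec]
    for _ in range(iters):
        nxt = [sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # nothing left to extract
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    val = sum(vec[i] * sym[i][j] * vec[j] for i in range(n) for j in range(n))
    return vec, val


def pca_top2(matrix):
    """Return (scores, pct_var) for PC1 and PC2.

    matrix: list of samples, each a list of transformed gene values.
    Assumes at least two samples and some variation in the data.
    """
    n, p = len(matrix), len(matrix[0])
    # Centre every gene (column) at zero mean across samples.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Sample-by-sample Gram matrix: its eigenvalues are the squared
    # singular values of the centred data matrix.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred]
            for r1 in centred]
    total = sum(gram[i][i] for i in range(n))  # sum of all eigenvalues
    scores, pct_var = [], []
    for _ in range(2):
        vec, val = power_iteration(gram)
        val = max(val, 0.0)  # guard against tiny negative rounding error
        scores.append([v * math.sqrt(val) for v in vec])
        pct_var.append(100.0 * val / total)
        # Deflate: remove the component just found before extracting the next.
        gram = [[gram[i][j] - val * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return scores, pct_var


# Toy data: two groups separated on gene 1, small spread on gene 2.
toy = [[12.0, 7.0], [12.0, 5.0], [4.0, 7.0], [4.0, 5.0]]
scores, pct_var = pca_top2(toy)
print("PC1 scores:", [round(s, 2) for s in scores[0]])
print("Variance explained (%):", [round(v, 1) for v in pct_var])
```

The sign of each component is arbitrary, so only the relative positions of samples along a PC carry meaning.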
What to Expect
- PCA plot: samples are coloured by the selected factor, showing PC1 versus PC2 with the percentage of variance explained by each component.
- Samples from the same experimental group should cluster together. Clear separation between groups suggests a strong biological signal.
- The gene selection mode controls which genes are used to compute PCA: significant genes, most variable genes, or both.
- A volcano plot of log2FoldChange versus -log10(padj) provides a complementary view of the differential expression landscape.
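The gene selection modes described above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual logic: the function name `select_genes`, the 0.05 cut-off on padj, and the reading of "both" as the union of the two gene sets are all assumptions made for the example.

```python
from statistics import variance


def select_genes(expr, padj, mode="both", n_top=500, alpha=0.05):
    """Pick the genes that feed into PCA.

    expr: dict mapping gene -> list of transformed values (one per sample).
    padj: dict mapping gene -> adjusted p-value, or None if not tested.
    mode: "significant", "variable", or "both" (assumed to mean the union).
    """
    # Genes passing the (assumed) significance cut-off.
    significant = {
        g for g in expr
        if padj.get(g) is not None and padj[g] < alpha
    }
    # Genes ranked by variance across samples, highest first.
    ranked = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)
    variable = set(ranked[:n_top])

    if mode == "significant":
        return significant
    if mode == "variable":
        return variable
    if mode == "both":
        return significant | variable
    raise ValueError(f"unknown mode: {mode!r}")


# Toy input with four genes and four samples.
expr = {
    "geneA": [1.0, 1.0, 1.0, 1.0],
    "geneB": [0.0, 10.0, 0.0, 10.0],
    "geneC": [5.0, 6.0, 5.0, 6.0],
    "geneD": [2.0, 2.0, 2.0, 3.0],
}
padj = {"geneA": 0.9, "geneB": 0.2, "geneC": 0.01, "geneD": None}
print(sorted(select_genes(expr, padj, mode="both", n_top=1)))
```

Ranking by variance only makes sense on transformed values; on raw counts the most highly expressed genes would dominate the ranking.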
Interpretation
- If samples from the same group cluster tightly and groups are well separated, the experimental factor is the dominant source of variation.
- If batch drives PC1 instead of your factor of interest, consider adding the batch variable to the model formula (~batch + condition).
- A single outlier sample that separates from its group along PC1 or PC2 may indicate a technical problem (low RNA quality, contamination). Consider removing it and re-running the analysis.
- When PC1 and PC2 together explain less than 50% of the total variance, the two-dimensional plot shows only part of the data structure. Inspect higher components or additional covariates before drawing conclusions.
- The variance-stabilising transformation (VST) used for PCA makes the variance of each gene approximately independent of its mean expression, which removes most of the mean-variance dependence inherent in count data and makes Euclidean distances between samples meaningful.
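One way to put a number on whether batch or the factor of interest "drives" a component is to ask what fraction of the variance in that component's scores lies between the factor's groups (the R-squared of a one-way ANOVA). The sketch below is a hedged illustration: the function name `variance_explained_by_factor`, the PC1 scores, and the sample labels are all made up for the example.

```python
def variance_explained_by_factor(scores, labels):
    """Fraction of the variance in one PC's scores that lies between the
    groups of a categorical factor (between-group SS / total SS)."""
    n = len(scores)
    grand_mean = sum(scores) / n
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0.0:
        return 0.0  # no variation to explain

    # Collect scores per group label.
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)

    between_ss = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return between_ss / total_ss


# Hypothetical PC1 scores for four samples and their annotations.
pc1 = [-5.0, -4.0, 5.0, 4.0]
batch = ["b1", "b1", "b2", "b2"]
condition = ["control", "treated", "control", "treated"]

print("batch:", round(variance_explained_by_factor(pc1, batch), 3))
print("condition:", round(variance_explained_by_factor(pc1, condition), 3))
```

A value near 1 for batch and near 0 for condition would be consistent with batch dominating PC1. With very few samples per group the fraction is noisy, so treat it as a guide alongside the plot rather than a formal test.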
Common Pitfalls
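- If batch dominates PC1 instead of your biological factor, a model that omits batch leaves that variation unexplained, which reduces power and, when batches are unevenly distributed across conditions, can bias the condition effect. Add batch to the model formula (~batch + condition). This cannot separate the two effects if batch and condition are completely confounded.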
- A single outlier sample can dominate the PCA projection and compress all other samples together. Remove or flag the outlier and re-run.
- Running PCA with too few genes may not capture the true variance structure. Use at least the top 500 most variable genes.
- When PC1 and PC2 together explain less than 50% of the variance, the data have many independent sources of variation. Interpret the two-dimensional plot with caution and check higher components.
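The 50% check in the last point is simple arithmetic on the per-component percentages. A minimal sketch, assuming the percentages are already available as a list ordered from PC1 onwards (the function name `summarise_variance` and the example numbers are hypothetical):

```python
from itertools import accumulate


def summarise_variance(pct_var, threshold=50.0):
    """Summarise how much variance the first two PCs capture.

    pct_var: percentages of variance explained, ordered PC1, PC2, ...
    threshold: cumulative percentage used as the caution line.
    """
    cumulative = list(accumulate(pct_var))
    pc1_pc2 = cumulative[1] if len(cumulative) > 1 else cumulative[0]
    # Number of leading components needed to reach the threshold, if any.
    n_needed = next(
        (i + 1 for i, c in enumerate(cumulative) if c >= threshold),
        None,
    )
    return {
        "pc1_pc2": pc1_pc2,
        "below_threshold": pc1_pc2 < threshold,
        "components_to_threshold": n_needed,
    }


# Hypothetical percentages for the first five components.
summary = summarise_variance([28.0, 17.0, 12.0, 9.0, 6.0])
print(summary)
```

When `below_threshold` is true, `components_to_threshold` indicates how far beyond PC2 to look.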
Citations
- Jolliffe, I.T. (2002). Principal Component Analysis (2nd ed.). Springer.
- Love, M.I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.