Designing Sample Metadata

Data Preparation

The metadata CSV maps each sample to its experimental conditions, batch variables, and covariates. Its first column must contain sample IDs that match the count-matrix column headers.

When to Use

  • You are setting up a new RNA-seq analysis and need to define which samples belong to which experimental groups.
  • You want to include additional covariates such as batch, RIN score, sequencing lane, or patient age in your model.
  • Your experiment has a multi-factor design (e.g., Drug x CellLine x TimePoint) and you need each factor recorded in its own column.

Required Inputs

  • First column: sample identifiers that match the column headers in the count matrix (matching ignores case and leading or trailing whitespace).
  • At least one categorical factor column (e.g., Treatment, Genotype, CellLine) defining the comparison of interest.
  • Optional: continuous covariates (e.g., RIN_score, age, library_size) that can be included in the model.
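
The ID-matching requirement can be checked before upload. The following is a minimal sketch, not easyCris's own code: the function names are hypothetical, and it assumes the first header of the count matrix is a gene-ID column rather than a sample. It trims and case-folds IDs on both sides, then reports which samples appear in only one of the two files.

```python
import csv
import io


def normalize_id(sample_id):
    """Trim surrounding whitespace and fold case before comparing IDs."""
    return sample_id.strip().casefold()


def compare_sample_ids(count_header, metadata_csv_text):
    """Return (missing_from_metadata, missing_from_counts) as sorted lists."""
    # Assumption: count_header[0] names the gene-ID column, not a sample.
    count_ids = {normalize_id(c) for c in count_header[1:]}

    reader = csv.reader(io.StringIO(metadata_csv_text))
    next(reader)  # skip the metadata header row
    # The first column of every non-empty row holds the sample ID.
    meta_ids = {normalize_id(row[0]) for row in reader if row}

    return sorted(count_ids - meta_ids), sorted(meta_ids - count_ids)


# Illustrative data only: "S1 " has a trailing space, "s2" differs in case.
metadata_text = """sample_id,Treatment,RIN_score
S1 ,Drug,8.9
s2,Control,9.1
S3,Drug,7.8
"""
count_header = ["gene_id", "S1", "S2", "S4"]

missing_in_meta, missing_in_counts = compare_sample_ids(count_header, metadata_text)
print(missing_in_meta)    # ['s4']
print(missing_in_counts)  # ['s3']
```

Here S1 and S2 still match despite the trailing space and the case difference, while S4 and S3 are each present in only one file.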

What to Expect

  • easyCris auto-detects categorical versus numeric columns from the metadata file.
  • All detected factor columns appear as options in the model configuration dropdowns.
  • Continuous covariates can be added to the design formula to control for confounding variables.
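
How easyCris performs this detection is not documented here. As a rough illustration only, one plausible heuristic treats a column as numeric when every value parses as a number and as categorical otherwise. The function names below are hypothetical, and the sketch assumes a file with at least one data row.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_columns(metadata_csv_text):
    """Label every column after the sample-ID column as numeric or categorical."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    rows = list(reader)
    kinds = {}
    for col in reader.fieldnames[1:]:  # skip the sample-ID column
        values = [row[col].strip() for row in rows]
        all_numeric = all(is_number(v) for v in values)
        kinds[col] = "numeric" if all_numeric else "categorical"
    return kinds


# Illustrative data only.
metadata_text = """sample_id,Treatment,Batch,RIN_score
S1,Drug,1,8.9
S2,Control,2,9.1
S3,Drug,1,7.8
"""
kinds = classify_columns(metadata_text)
print(kinds)
# {'Treatment': 'categorical', 'Batch': 'numeric', 'RIN_score': 'numeric'}
```

Under this heuristic the Batch column, coded as 1 and 2, is labeled numeric even though it is conceptually a factor. Relabeling it with strings such as B1 and B2 would make it categorical.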

Common Pitfalls

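  • Mismatched sample IDs between the count matrix and the metadata file trigger a mismatch warning and can block analysis until resolved. Matching trims whitespace and ignores case, so most remaining mismatches come from extra or missing characters; keeping IDs identical in both files is still the safest practice.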
  • Spaces or special characters in sample IDs (e.g., "Sample #1") can cause silent matching failures. Use underscores or simple alphanumeric names.
  • A factor column with only one level provides no contrast: the model has nothing to compare, so fitting fails with an error.
  • Numeric-coded factors (1, 2, 3) may be interpreted as continuous covariates. Use string labels (Group_A, Group_B) to ensure correct treatment as categorical variables.
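
Several of these pitfalls can be caught with a quick check before upload. The sketch below is a hypothetical helper, not part of easyCris. It assumes that "safe" IDs contain only letters, digits, and underscores, and it flags any integer-coded column for review, since it cannot know whether that column is meant as a factor.

```python
import csv
import io
import re

# Assumption: a safe sample ID uses only letters, digits, and underscores.
SAFE_ID = re.compile(r"[A-Za-z0-9_]+")


def lint_metadata(metadata_csv_text):
    """Return a list of human-readable warnings about the metadata table."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    rows = list(reader)
    id_col, *other_cols = reader.fieldnames
    problems = []

    # Sample IDs containing spaces or special characters.
    for row in rows:
        if not SAFE_ID.fullmatch(row[id_col]):
            problems.append(f"unsafe sample ID: {row[id_col]!r}")

    for col in other_cols:
        values = [row[col].strip() for row in rows]
        if len(set(values)) == 1:
            # Only one level: nothing to contrast.
            problems.append(f"column {col!r} has a single level")
        elif all(v.isdigit() for v in values):
            # Integer codes may be read as a continuous covariate.
            problems.append(
                f"column {col!r} is integer-coded; use string labels if it is a factor"
            )
    return problems


# Illustrative data only.
example = """sample_id,Treatment,Group,Batch
Sample #1,Drug,1,A
S2,Control,2,A
S3,Drug,3,A
"""
problems = lint_metadata(example)
for message in problems:
    print(message)
```

For this example the helper reports the ID "Sample #1", the integer-coded Group column, and the single-level Batch column.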

References

  • Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.