Preparing a Count Matrix

Data Preparation

The count matrix is a CSV file with one row per gene and one column per sample. Every value must be a raw, un-normalised integer count taken directly from your quantification pipeline.

When to Use

  • You have aligned reads quantified by a tool such as featureCounts, HTSeq, or STAR --quantMode GeneCounts and need to format them for import.
  • You downloaded count data from a public repository (GEO, recount3, ENCODE) and need to confirm it is in the correct shape.
  • You want to verify your file before importing it into easyCris so the pipeline does not reject it.
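
As an illustration of the first case, here is a minimal sketch that reshapes a featureCounts table into a gene-by-sample CSV. It assumes the standard featureCounts layout (a leading comment line, then Geneid, Chr, Start, End, Strand, Length, followed by one column per BAM file). The function name and the gene_id header are hypothetical choices for this example, not something easyCris is known to require.

```python
import csv
import io

# Annotation columns that featureCounts writes before the per-sample counts.
ANNOTATION_COLUMNS = ["Chr", "Start", "End", "Strand", "Length"]


def featurecounts_to_matrix(src, dst):
    """Copy a featureCounts table from `src` to `dst` as a gene-by-sample CSV."""
    # Skip the comment line(s) at the top of the file.
    lines = (line for line in src if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    samples = [
        name for name in reader.fieldnames
        if name != "Geneid" and name not in ANNOTATION_COLUMNS
    ]
    writer = csv.writer(dst, lineterminator="\n")
    # "gene_id" is an assumed header for the identifier column.
    writer.writerow(["gene_id"] + samples)
    for row in reader:
        writer.writerow([row["Geneid"]] + [row[s] for s in samples])
    return samples


# Small in-memory stand-in for a featureCounts output file (made-up counts).
example = (
    "# comment line written by featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "ENSG00000000003\tchrX\t100\t200\t-\t100\t12\t30\n"
    "ENSG00000000005\tchrX\t300\t400\t+\t100\t0\t4\n"
)

output = io.StringIO()
featurecounts_to_matrix(io.StringIO(example), output)
print(output.getvalue())
```

With real data, pass open file handles instead of the in-memory buffers, and rename the sample columns if the BAM paths are not suitable sample names.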

Required Inputs

  • First column: gene identifiers (Ensembl IDs, Entrez IDs, UniProt accessions, or user-provided gene symbols).
  • Remaining columns: one column per biological sample, each containing integer counts.
  • No normalised values (FPKM, TPM, RPKM) -- differential expression methods require raw counts to model variance correctly.
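
The layout described above can be sketched as follows. The example assembles per-sample counts into one table with gene identifiers in the first column and one column per sample. The function name, the gene_id header, the sample names, and the counts are all made up for illustration. Writing 0 for a gene that is missing from a sample is an assumption of this sketch, not documented easyCris behaviour.

```python
import csv
import io


def write_count_matrix(counts_by_sample, dst):
    """Write {sample: {gene: count}} as a CSV with genes in rows and samples in columns."""
    samples = list(counts_by_sample)
    genes = sorted({gene for counts in counts_by_sample.values() for gene in counts})
    writer = csv.writer(dst, lineterminator="\n")
    # First column holds gene identifiers; "gene_id" is an assumed header.
    writer.writerow(["gene_id"] + samples)
    for gene in genes:
        # A gene absent from a sample is written as 0 (assumption of this sketch).
        writer.writerow([gene] + [counts_by_sample[s].get(gene, 0) for s in samples])


# Made-up raw integer counts for two samples.
counts_by_sample = {
    "control_1": {"ENSG00000141510": 523, "ENSG00000012048": 87},
    "treated_1": {"ENSG00000141510": 498, "ENSG00000012048": 140},
}

matrix = io.StringIO()
write_count_matrix(counts_by_sample, matrix)
print(matrix.getvalue())
```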

What to Expect

  • easyCris validates that every sample column contains integers and flags any column with decimal or non-numeric values.
  • Gene identifiers in the first column are matched against your selected ID type during configuration.
  • Low-count genes can be filtered during run configuration using the Min Count threshold, so you do not need to pre-filter your matrix.
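
To run a similar integer check locally before importing, something like the sketch below may help. It is not the validator easyCris uses. It only reports, for each sample column, the first value that is not a plain non-negative integer. The function name is hypothetical.

```python
import csv
import io


def find_non_integer_columns(src):
    """Return {column_name: first offending value} for sample columns with non-integer entries."""
    reader = csv.reader(src)
    header = next(reader)
    problems = {}
    for row in reader:
        # Skip the first column, which holds gene identifiers.
        for name, value in zip(header[1:], row[1:]):
            if name in problems:
                continue
            text = value.strip()
            # Accept only plain digits such as "0" or "1523"; "3.5", "NA" and "" are flagged.
            if not (text.isascii() and text.isdigit()):
                problems[name] = value
    return problems


# Made-up matrix with one decimal value and one non-numeric value.
sample_csv = (
    "gene_id,s1,s2,s3\n"
    "G1,10,3.5,7\n"
    "G2,0,4,NA\n"
)

print(find_non_integer_columns(io.StringIO(sample_csv)))
```

The check is deliberately strict: values such as 12.0 are reported as well, because they usually mean the file passed through a step that converted the counts to floating point.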

Common Pitfalls

  • Normalised data (FPKM, TPM, RPKM) violates the statistical model used for differential expression and will produce unreliable results. Always use raw counts.
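  • Decimal values in the count columns indicate that the data have been normalised or are estimated counts (for example from RSEM or Salmon). Round estimated counts to the nearest integer; if the values are normalised, re-quantify from the BAM files instead.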
  • Duplicate gene identifiers require a merge policy (sum counts or keep the first occurrence), which is configured separately in easyCris.
  • Excel may silently convert gene names like SEPT1, MARCH1, and DEC1 to calendar dates. Export your matrix as CSV directly from your bioinformatics pipeline to avoid this.
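
Two of the pitfalls above can be checked before import. The sketch below shows one way to spot identifiers that look like Excel-converted dates and to apply either merge policy to duplicate identifiers. The function names and the date pattern are illustrative assumptions. easyCris applies its own merge policy during configuration, so the merge function is only a way to preview the effect.

```python
import re

# Values Excel tends to leave behind after converting a gene symbol to a date,
# e.g. "1-Sep" for SEPT1. Illustrative pattern, not exhaustive.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def date_like_ids(gene_ids):
    """Return the identifiers that look like day-month dates."""
    return [gene for gene in gene_ids if DATE_LIKE.match(gene.strip())]


def merge_duplicates(rows, policy="sum"):
    """Merge rows that share a gene identifier.

    rows: iterable of (gene_id, [counts per sample]).
    policy: "sum" adds the counts, "first" keeps the first occurrence.
    """
    if policy not in ("sum", "first"):
        raise ValueError("policy must be 'sum' or 'first'")
    merged = {}
    for gene, counts in rows:
        if gene not in merged:
            merged[gene] = list(counts)
        elif policy == "sum":
            merged[gene] = [a + b for a, b in zip(merged[gene], counts)]
        # With policy "first", later duplicates are ignored.
    return merged


# Made-up rows: TP53 appears twice and one symbol was turned into a date.
rows = [("TP53", [10, 20]), ("1-Sep", [3, 0]), ("TP53", [1, 2])]

print(date_like_ids(gene for gene, _ in rows))
print(merge_duplicates(rows, policy="sum"))
print(merge_duplicates(rows, policy="first"))
```

A date-like identifier cannot be repaired reliably after the fact, since several symbols can map to the same date. Regenerating the CSV from the pipeline output is the safer fix.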

References

  • Love, M.I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.