Preparing a Count Matrix

Data Preparation

The count matrix is a CSV file in which each row is a gene and each column is a sample. Every value must be a raw, un-normalised integer count taken directly from your quantification pipeline.

When to Use

You have aligned reads quantified by a tool such as featureCounts, HTSeq, or STAR --quantMode GeneCounts and need to format them for import.
You downloaded count data from a public repository (GEO, recount3, ENCODE) and need to confirm it is in the correct shape.
You want to verify your file before importing it into easyCris so the pipeline does not reject it.

Required Inputs

First column: gene identifiers (Ensembl IDs, Entrez IDs, UniProt accessions, or user-provided gene symbols).
Remaining columns: one column per biological sample, each containing integer counts.
No normalised values (FPKM, TPM, RPKM) -- differential expression methods require raw counts to model variance correctly.
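
The layout described above can be illustrated with a short sketch. The example matrix, its gene IDs and sample names, and the `read_count_matrix` helper are hypothetical and not part of easyCris; the code uses only Python's standard library and simply shows what a correctly shaped file looks like when parsed.

```python
import csv
import io

# Hypothetical example: gene IDs in the first column, one column per sample.
EXAMPLE_CSV = """gene_id,ctrl_1,ctrl_2,treated_1,treated_2
ENSG00000141510,120,98,305,287
ENSG00000012048,0,3,1,0
ENSG00000146648,45,51,12,9
"""

def read_count_matrix(handle):
    """Return (sample_names, rows) where rows is a list of (gene_id, counts)."""
    reader = csv.reader(handle)
    header = next(reader)
    if len(header) < 2:
        raise ValueError("expected a gene ID column plus at least one sample column")
    samples = header[1:]
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        # int() raises ValueError on decimals such as "12.5" or text such as "NA".
        rows.append((row[0], [int(value) for value in row[1:]]))
    return samples, rows

samples, rows = read_count_matrix(io.StringIO(EXAMPLE_CSV))
print(samples)   # sample names taken from the header
print(rows[0])   # first gene and its counts
```

To check a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("counts.csv", newline="")`, where the file name is a placeholder.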

What to Expect

easyCris validates that every sample column contains integers and flags any column with decimal or non-numeric values.
Gene identifiers in the first column are matched against your selected ID type during configuration.
Low-count genes can be filtered during run configuration using the Min Count threshold, so you do not need to pre-filter your matrix.
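
If you want to catch problem columns before importing, a rough pre-check along the following lines may help. This is only an approximation of the validation described above, not easyCris's own code; the `find_non_integer_columns` helper and the sample data are hypothetical.

```python
import csv
import io

def find_non_integer_columns(handle):
    """Return {sample_name: [(gene_id, bad_value), ...]} for invalid count cells."""
    reader = csv.DictReader(handle)
    id_column = reader.fieldnames[0]
    problems = {}
    for row in reader:
        for sample in reader.fieldnames[1:]:
            value = (row[sample] or "").strip()
            # A raw count is a non-negative whole number written with digits only,
            # so decimals, negative numbers, empty cells, and text are all flagged.
            if not (value.isascii() and value.isdigit()):
                problems.setdefault(sample, []).append((row[id_column], value))
    return problems

messy = io.StringIO("gene_id,s1,s2,s3\nG1,10,4.7,3\nG2,0,12.1,NA\n")
print(find_non_integer_columns(messy))
# {'s2': [('G1', '4.7'), ('G2', '12.1')], 's3': [('G2', 'NA')]}
```

An empty dictionary means every sample column contains only whole-number counts.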

Common Pitfalls

Normalised data (FPKM, TPM, RPKM) violates the statistical model used for differential expression and will produce unreliable results. Always use raw counts.
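Decimal values in the count columns indicate that the data has been normalised or that the counts are estimates. Estimated counts can be rounded to the nearest integer; normalised values cannot be turned back into raw counts by rounding, so re-quantify from the BAM files instead.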
Duplicate gene identifiers require a merge policy (sum counts or keep the first occurrence), which is configured separately in easyCris.
Excel may silently convert gene names like SEPT1, MARCH1, and DEC1 to calendar dates. Export your matrix as CSV directly from your bioinformatics pipeline to avoid this.
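
The last two pitfalls can be checked for with a short sketch. It assumes date-mangled symbols appear in the common `1-Sep` form (other locales or Excel settings may produce different strings), and it only illustrates what the two merge policies do; the actual policy is configured in easyCris. The helper names and example rows are hypothetical.

```python
import re

# Assumption: Excel rewrote symbols such as SEPT1 to the "1-Sep" style.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)

def find_date_like_ids(gene_ids):
    """Return identifiers that look like dates rather than gene names."""
    return [gene_id for gene_id in gene_ids if DATE_LIKE.match(gene_id.strip())]

def merge_duplicates(rows, policy="sum"):
    """Merge rows of (gene_id, counts) using the 'sum' or 'first' policy."""
    if policy not in ("sum", "first"):
        raise ValueError("policy must be 'sum' or 'first'")
    merged = {}
    for gene_id, counts in rows:
        if gene_id not in merged:
            merged[gene_id] = list(counts)
        elif policy == "sum":
            # Add the duplicate row's counts sample by sample.
            merged[gene_id] = [a + b for a, b in zip(merged[gene_id], counts)]
        # With policy == "first", later duplicates are ignored.
    return merged

rows = [("TP53", [10, 5]), ("1-Sep", [3, 0]), ("TP53", [2, 1])]
print(find_date_like_ids(gene_id for gene_id, _ in rows))  # ['1-Sep']
print(merge_duplicates(rows, policy="sum"))    # {'TP53': [12, 6], '1-Sep': [3, 0]}
print(merge_duplicates(rows, policy="first"))  # {'TP53': [10, 5], '1-Sep': [3, 0]}
```

Any identifier reported by `find_date_like_ids` suggests the file passed through a spreadsheet at some point and should be regenerated from the pipeline output.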

References

Love, M.I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.

Designing Sample Metadata