Preparing a Count Matrix
The count matrix is a CSV where rows are genes and columns are samples. Every value must be a raw, un-normalised integer count straight from your quantification pipeline.
When to Use
- You have aligned reads quantified by a tool such as featureCounts, HTSeq, or STAR --quantMode GeneCounts and need to format them for import.
- You downloaded count data from a public repository (GEO, recount3, ENCODE) and need to confirm it is in the correct shape.
- You want to verify your file before importing it into easyCris so the pipeline does not reject it.
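As a rough illustration of the first case, the sketch below reshapes featureCounts output into a genes-by-samples CSV. It relies on the standard featureCounts layout (a leading `#` comment line, then `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, then one column per BAM file). The function name `featurecounts_to_csv` and the example values are hypothetical, and the code is not part of easyCris.

```python
import csv
import io

# Annotation columns that featureCounts writes between the gene ID and the counts.
ANNOTATION_COLUMNS = {"Chr", "Start", "End", "Strand", "Length"}


def featurecounts_to_csv(source, destination):
    """Copy gene IDs and per-sample counts from featureCounts output into a CSV.

    `source` and `destination` are open text streams (hypothetical helper).
    """
    # featureCounts starts its output with a '#' line recording the command used.
    lines = (line for line in source if not line.startswith("#"))
    reader = csv.reader(lines, delimiter="\t")
    writer = csv.writer(destination, lineterminator="\n")

    header = next(reader)
    keep = [i for i, name in enumerate(header) if name not in ANNOTATION_COLUMNS]

    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])


# Made-up input standing in for a real featureCounts file.
example = io.StringIO(
    "# Program:featureCounts (illustrative comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "GENE_A\tchr1\t1\t1000\t+\t1000\t120\t98\n"
    "GENE_B\tchr1\t2001\t2500\t-\t500\t0\t3\n"
)
output = io.StringIO()
featurecounts_to_csv(example, output)
print(output.getvalue())
```

With real files, pass handles opened with `open(path, newline="")` in place of the in-memory streams.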
Required Inputs
- First column: gene identifiers (Ensembl IDs, Entrez IDs, UniProt accessions, or user-provided gene symbols).
- Remaining columns: one column per biological sample, each containing integer counts.
- No normalised values (FPKM, TPM, RPKM): differential expression methods require raw counts to model variance correctly.
What to Expect
- easyCris validates that every sample column contains integers and flags any column with decimal or non-numeric values.
- Gene identifiers in the first column are matched against your selected ID type during configuration.
- Low-count genes can be filtered during run configuration using the Min Count threshold, so you do not need to pre-filter your matrix.
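To catch problem columns before importing, a check along the following lines can be run locally. This is only a sketch of the kind of validation described above, not easyCris's actual implementation; the function names and the example matrix are hypothetical, and the first column is assumed to hold gene identifiers.

```python
import csv
import io
import re

# A valid raw count is a plain non-negative integer such as "0" or "152".
INTEGER_COUNT = re.compile(r"[0-9]+")


def is_integer_count(text):
    """Return True for '0' or '152'; False for '98.5', '3.0', 'n/a', or ''."""
    return INTEGER_COUNT.fullmatch(text.strip()) is not None


def find_non_integer_columns(source):
    """Return the sample columns that contain decimal or non-numeric values."""
    reader = csv.reader(source)
    header = next(reader)
    samples = header[1:]  # the first column holds gene identifiers

    flagged = set()
    for row in reader:
        for name, value in zip(samples, row[1:]):
            if not is_integer_count(value):
                flagged.add(name)

    # Report flagged columns in their original left-to-right order.
    return [name for name in samples if name in flagged]


# Made-up matrix: ctrl_2 holds decimals, treated_1 holds a non-numeric entry.
example = io.StringIO(
    "gene_id,ctrl_1,ctrl_2,treated_1\n"
    "GENE_A,120,98.5,n/a\n"
    "GENE_B,0,3.0,15\n"
)
bad_columns = find_non_integer_columns(example)
print(bad_columns)
```

Note that this sketch treats `3.0` as a decimal value, on the assumption that any decimal point signals processed data.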
Common Pitfalls
- Normalised data (FPKM, TPM, RPKM) violates the statistical model used for differential expression and will produce unreliable results. Always use raw counts.
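- Decimal values in the count columns indicate that the data has been normalised or estimated. Estimated counts (for example, expected counts from a transcript-level quantifier) can be rounded to integers; normalised values cannot be turned back into raw counts by rounding, so re-quantify from the BAM files.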
- Duplicate gene identifiers require a merge policy (sum counts or keep the first occurrence), which is configured separately in easyCris.
- Excel may silently convert gene names like SEPT1, MARCH1, and DEC1 to calendar dates. Export your matrix as CSV directly from your bioinformatics pipeline to avoid this.
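Two of the pitfalls above can be screened for before import. The sketch below looks for identifiers that resemble Excel-converted dates (such as `1-Sep` for SEPT1) and illustrates what the two merge policies do to duplicate rows. The date patterns are an assumption about common English-locale Excel output, the helper names are hypothetical, and the merge logic only illustrates the policies; easyCris applies its own configured policy.

```python
import re

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
# Matches forms such as "1-Sep" or "Sep-01" (assumed patterns; adjust for your locale).
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def find_date_like_ids(gene_ids):
    """Return identifiers that look like dates rather than gene symbols."""
    return [gene_id for gene_id in gene_ids if DATE_LIKE.match(gene_id)]


def merge_duplicates(rows, policy="sum"):
    """Merge rows that share a gene ID.

    `rows` is a list of (gene_id, counts) pairs. With policy "sum" the counts
    are added per sample; with "first" only the first occurrence is kept.
    """
    if policy not in ("sum", "first"):
        raise ValueError(f"unknown merge policy: {policy!r}")

    merged = {}  # dicts preserve first-seen order of gene IDs
    for gene_id, counts in rows:
        if gene_id not in merged:
            merged[gene_id] = list(counts)
        elif policy == "sum":
            merged[gene_id] = [a + b for a, b in zip(merged[gene_id], counts)]
        # policy == "first": later occurrences are ignored
    return merged


# Made-up rows: GENE_A appears twice, and one symbol was mangled into a date.
rows = [
    ("GENE_A", [10, 20]),
    ("1-Sep", [5, 1]),
    ("GENE_A", [7, 3]),
    ("GENE_B", [2, 4]),
]
suspicious = find_date_like_ids([gene_id for gene_id, _ in rows])
summed = merge_duplicates(rows, policy="sum")
kept_first = merge_duplicates(rows, policy="first")
print(suspicious, summed["GENE_A"], kept_first["GENE_A"])
```

A date-like identifier cannot be mapped back to its symbol reliably, so the safest fix is to regenerate the CSV from the pipeline rather than edit the names by hand.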
Citations
- Love, M.I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.