Gene ID Lookup & Duplicate Handling
Configure how gene identifiers in your count matrix are resolved to gene symbols and how duplicate gene entries are merged before analysis.
When to Use
- Your count matrix uses Ensembl, Entrez, UniProt, or UniProt Swiss-Prot identifiers and you want human-readable gene symbols in the results.
- Your Ensembl gene IDs have version suffixes (e.g., ENSG00000141510.12) that need to be stripped before lookup.
- Multiple rows in your count matrix map to the same gene and you need to decide how to handle them.
Required Inputs
- Gene label source: "user_provided" (use identifiers as-is) or "id_lookup" (map through an organism-specific bundled cache).
- Organism: human, mouse, rat, or another supported species (required for ID lookup mode).
- Gene ID type: ensembl, entrez, uniprot, or uniprot_swissprot (required for ID lookup mode).
- Duplicate policy: "sum_duplicates" (sum counts across duplicates) or "keep_first" (retain only the first occurrence).
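The inputs above can be captured in a small settings object that rejects invalid combinations early. The following is a minimal sketch only: the `GeneLabelConfig` class and its field names are hypothetical and are not part of easyCris; only the allowed option values are taken from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values come from the option list above.
# The class and field names are hypothetical, for illustration only.
LABEL_SOURCES = {"user_provided", "id_lookup"}
ID_TYPES = {"ensembl", "entrez", "uniprot", "uniprot_swissprot"}
DUPLICATE_POLICIES = {"sum_duplicates", "keep_first"}


@dataclass
class GeneLabelConfig:
    gene_label_source: str
    duplicate_policy: str
    organism: Optional[str] = None      # required only for id_lookup
    gene_id_type: Optional[str] = None  # required only for id_lookup

    def validate(self) -> None:
        """Raise ValueError if the settings are inconsistent."""
        if self.gene_label_source not in LABEL_SOURCES:
            raise ValueError(f"unknown gene label source: {self.gene_label_source!r}")
        if self.duplicate_policy not in DUPLICATE_POLICIES:
            raise ValueError(f"unknown duplicate policy: {self.duplicate_policy!r}")
        if self.gene_label_source == "id_lookup":
            # Organism and ID type are only needed when IDs are mapped.
            if not self.organism:
                raise ValueError("organism is required for id_lookup")
            if self.gene_id_type not in ID_TYPES:
                raise ValueError(f"unknown gene ID type: {self.gene_id_type!r}")


# Example: Ensembl IDs from a human dataset, conservative duplicate handling.
config = GeneLabelConfig(
    gene_label_source="id_lookup",
    duplicate_policy="keep_first",
    organism="human",
    gene_id_type="ensembl",
)
config.validate()
```

Validating up front means a missing organism or a misspelled policy name surfaces before any counts are touched.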
What to Expect
- Version suffixes on Ensembl IDs are automatically stripped (ENSG00000141510.12 becomes ENSG00000141510) before lookup.
- ID lookup uses the bundled local annotation cache distributed with easyCris.
- Duplicate genes are merged according to the selected policy, and the number of merged entries is reported in a diagnostic message.
- Rows with empty or unmappable gene labels are dropped and reported so you can verify no important genes were lost.
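The behaviour described above can be approximated in a few lines of plain Python. This is a sketch under stated assumptions, not the easyCris implementation: the function names, the `(gene_id, counts)` row layout, and the toy one-entry cache are all hypothetical, and the version-stripping regex is written for Ensembl-style IDs only.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Matches a trailing ".<digits>" version suffix, as found on Ensembl IDs.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(gene_id: str) -> str:
    """ENSG00000141510.12 -> ENSG00000141510; IDs without a suffix are unchanged."""
    return _VERSION_SUFFIX.sub("", gene_id.strip())


def resolve_and_merge(
    rows: Sequence[Tuple[str, List[int]]],
    cache: Dict[str, str],
    policy: str = "keep_first",
) -> Tuple[Dict[str, List[int]], List[str], int]:
    """Map IDs to symbols, drop unmappable rows, merge duplicates.

    Returns (merged counts by symbol, dropped original IDs, number of merged rows).
    """
    if policy not in ("sum_duplicates", "keep_first"):
        raise ValueError(f"unknown duplicate policy: {policy!r}")
    merged: Dict[str, List[int]] = {}
    dropped: List[str] = []
    n_merged = 0
    for gene_id, counts in rows:
        symbol = cache.get(strip_version(gene_id), "")
        if not symbol:
            dropped.append(gene_id)  # empty or unmappable label
            continue
        if symbol not in merged:
            merged[symbol] = list(counts)
            continue
        n_merged += 1  # a later row that resolves to an already-seen symbol
        if policy == "sum_duplicates":
            merged[symbol] = [a + b for a, b in zip(merged[symbol], counts)]
        # keep_first: the later row is discarded
    return merged, dropped, n_merged


# Toy stand-in for the bundled annotation cache (hypothetical contents).
cache = {"ENSG00000141510": "TP53"}
rows = [
    ("ENSG00000141510.12", [10, 5]),
    ("ENSG00000141510.3", [2, 1]),
    ("ENSG00000000000", [7, 7]),  # not in the toy cache
    ("", [1, 1]),                 # empty label
]
merged, dropped, n_merged = resolve_and_merge(rows, cache, "sum_duplicates")
print(merged)    # {'TP53': [12, 6]}
print(dropped)   # ['ENSG00000000000', '']
print(n_merged)  # 1
```

Returning the dropped IDs and the merge count alongside the merged matrix mirrors the diagnostic reporting described above and makes it straightforward to check that no important gene was lost.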
Common Pitfalls
- Selecting the wrong organism maps gene IDs incorrectly. Always verify that the selected organism matches the species of your sequencing data.
- The sum_duplicates policy can inflate counts if duplicates arise from annotation artifacts rather than from several identifiers that genuinely map to the same gene. Use it only when you are confident the duplicates represent the same biological gene.
- The keep_first policy is safer for initial exploration; switch to sum_duplicates only when you understand why duplicates exist in your data.
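One way to understand why duplicates exist before choosing a policy is to list which original identifiers collapse onto the same symbol. The helper below is a hypothetical exploration aid, not an easyCris function; it assumes you already have `(original_id, symbol)` pairs, for example from an initial keep_first run.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def duplicate_groups(resolved: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group original IDs by resolved symbol and keep only symbols seen more than once."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for original_id, symbol in resolved:
        groups[symbol].append(original_id)
    return {symbol: ids for symbol, ids in groups.items() if len(ids) > 1}


# Illustrative input: two versioned IDs for one gene, plus one unique gene.
resolved = [
    ("ENSG00000141510.12", "TP53"),
    ("ENSG00000141510.3", "TP53"),
    ("ENSG00000012048", "BRCA1"),
]
print(duplicate_groups(resolved))
# {'TP53': ['ENSG00000141510.12', 'ENSG00000141510.3']}
```

If a group contains only different versions of one identifier, as here, summing may double-count the same gene; if it contains distinct identifiers for the same gene, summing is more likely to be appropriate.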
Citations
- Cunningham, F. et al. (2022). Ensembl 2022. Nucleic Acids Research, 50(D1), D988-D995.