Gene ID Lookup & Duplicate Handling

Model Configuration

Configure how gene identifiers in your count matrix are resolved to gene symbols and how duplicate gene entries are merged before analysis.

When to Use

Your count matrix uses Ensembl, Entrez, UniProt, or UniProt Swiss-Prot identifiers and you want human-readable gene symbols in the results.
Your Ensembl gene IDs have version suffixes (e.g., ENSG00000141510.12) that need to be stripped before lookup.
Multiple rows in your count matrix map to the same gene and you need to decide how to handle them.

Gene label source: "user_provided" (use identifiers as-is) or "id_lookup" (map through an organism-specific bundled cache).
Organism: human, mouse, rat, or another supported species (required for ID lookup mode).
Gene ID type: ensembl, entrez, uniprot, or uniprot_swissprot (required for ID lookup mode).
Duplicate policy: "sum_duplicates" (sum counts across duplicates) or "keep_first" (retain only the first occurrence).
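
As a rough illustration of how these four settings depend on one another, the sketch below checks a configuration before it is used. The dictionary keys (gene_label_source, organism, gene_id_type, duplicate_policy) and the validate_config helper are hypothetical names chosen for this example; they are not taken from easyCris and may not match its actual schema.

```python
# Hypothetical sketch: key names and this helper are illustrative only,
# not the actual easyCris configuration schema.
ALLOWED = {
    "gene_label_source": {"user_provided", "id_lookup"},
    "gene_id_type": {"ensembl", "entrez", "uniprot", "uniprot_swissprot"},
    "duplicate_policy": {"sum_duplicates", "keep_first"},
}


def validate_config(cfg):
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []

    source = cfg.get("gene_label_source")
    if source not in ALLOWED["gene_label_source"]:
        problems.append(f"unknown gene_label_source: {source!r}")

    policy = cfg.get("duplicate_policy")
    if policy not in ALLOWED["duplicate_policy"]:
        problems.append(f"unknown duplicate_policy: {policy!r}")

    # Organism and gene ID type only matter when IDs are looked up.
    if source == "id_lookup":
        if not cfg.get("organism"):
            problems.append("organism is required when gene_label_source is 'id_lookup'")
        id_type = cfg.get("gene_id_type")
        if id_type not in ALLOWED["gene_id_type"]:
            problems.append(f"unknown gene_id_type: {id_type!r}")

    return problems


example = {
    "gene_label_source": "id_lookup",
    "organism": "human",
    "gene_id_type": "ensembl",
    "duplicate_policy": "keep_first",
}
print(validate_config(example))
```

Returning a list of problems rather than raising on the first one lets all configuration mistakes be reported together.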

Version suffixes on Ensembl IDs are automatically stripped (ENSG00000141510.12 becomes ENSG00000141510) before lookup.
ID lookup uses the bundled local annotation cache distributed with easyCris.
Duplicate genes are merged according to the selected policy, and the number of merged entries is reported in a diagnostic message.
Rows with empty or unmappable gene labels are dropped and reported so you can verify no important genes were lost.
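
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the easyCris implementation: the function names strip_version and resolve_and_merge, the row format (a list of ID and count-list pairs), and the two-entry toy cache are all made up for the example, and the version suffix is assumed to be a trailing dot followed by digits.

```python
import re

# Assumption: a version suffix is a trailing "." followed by digits.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(gene_id):
    """ENSG00000141510.12 -> ENSG00000141510; IDs without a suffix are unchanged."""
    return _VERSION_SUFFIX.sub("", gene_id.strip())


def resolve_and_merge(rows, cache, policy="keep_first"):
    """Map IDs to symbols, drop unmappable rows, and merge duplicates.

    rows:  list of (gene_id, counts) pairs, counts being one value per sample
    cache: dict from unversioned ID to gene symbol (toy stand-in for a bundled cache)
    Returns (merged, dropped_ids, n_merged).
    """
    if policy not in ("sum_duplicates", "keep_first"):
        raise ValueError(f"unknown duplicate policy: {policy!r}")

    merged = {}    # symbol -> counts, in first-seen order
    dropped = []   # original IDs with an empty or missing symbol
    n_merged = 0   # number of rows folded into an earlier row

    for gene_id, counts in rows:
        symbol = cache.get(strip_version(gene_id), "")
        if not symbol:
            dropped.append(gene_id)
            continue
        if symbol not in merged:
            merged[symbol] = list(counts)
            continue
        n_merged += 1
        if policy == "sum_duplicates":
            merged[symbol] = [a + b for a, b in zip(merged[symbol], counts)]
        # keep_first: the later duplicate is simply ignored

    return merged, dropped, n_merged


toy_cache = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}
toy_rows = [
    ("ENSG00000141510.12", [10, 5]),
    ("ENSG00000141510.3", [1, 2]),    # same gene, different version suffix
    ("ENSG00000012048", [7, 0]),
    ("ENSG99999999999", [3, 3]),      # made-up ID that is not in the cache
]

merged, dropped, n_merged = resolve_and_merge(toy_rows, toy_cache, "sum_duplicates")
print(f"{n_merged} duplicate row(s) merged; {len(dropped)} row(s) dropped: {dropped}")
```

Returning the dropped IDs alongside the merged table mirrors the reporting described above, so the caller can check whether any gene of interest was lost.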

Selecting the wrong organism maps gene IDs incorrectly; always verify that the organism matches your sequencing data.
The sum_duplicates policy can inflate counts if duplicates arise from annotation artifacts rather than genuine multi-mapping. Use it when you are confident the duplicates represent the same biological gene.
The keep_first policy is safer for initial exploration; switch to sum_duplicates only when you understand why duplicates exist in your data.
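
One way to understand why duplicates exist before choosing a policy is to list, for each duplicated symbol, which source IDs map to it and what total each policy would produce. The sketch below does this with plain Python; duplicate_report and the input format (source ID, symbol, counts) are hypothetical and not part of easyCris, and the second TP53 row uses a made-up ID purely to create a duplicate.

```python
from collections import defaultdict


def duplicate_report(labelled_rows):
    """Summarise duplicated symbols.

    labelled_rows: list of (source_id, symbol, counts) tuples.
    Returns a dict keyed by symbol, covering only symbols that occur more than once.
    """
    groups = defaultdict(list)
    for source_id, symbol, counts in labelled_rows:
        groups[symbol].append((source_id, counts))

    report = {}
    for symbol, members in groups.items():
        if len(members) < 2:
            continue  # not a duplicate
        report[symbol] = {
            "source_ids": [source_id for source_id, _ in members],
            # Total across samples if only the first row is kept.
            "keep_first_total": sum(members[0][1]),
            # Total across samples if all duplicate rows are summed.
            "sum_duplicates_total": sum(sum(counts) for _, counts in members),
        }
    return report


labelled = [
    ("ENSG00000141510", "TP53", [10, 5]),
    ("HYPOTHETICAL_ALT_ID", "TP53", [1, 2]),   # made-up ID for illustration
    ("ENSG00000012048", "BRCA1", [7, 0]),
]

for symbol, info in duplicate_report(labelled).items():
    print(symbol, info)
```

A large gap between the two totals for a symbol is a hint to look at its source IDs before settling on sum_duplicates.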

Cunningham, F. et al. (2022). Ensembl 2022. Nucleic Acids Research, 50(D1), D988-D995.