Gene ID Lookup & Duplicate Handling

Model Configuration

Configure how gene identifiers in your count matrix are resolved to gene symbols and how duplicate gene entries are merged before analysis.

When to Use

  • Your count matrix uses Ensembl, Entrez, UniProt, or UniProt Swiss-Prot identifiers and you want human-readable gene symbols in the results.
  • Your Ensembl gene IDs have version suffixes (e.g., ENSG00000141510.12) that need to be stripped before lookup.
  • Multiple rows in your count matrix map to the same gene and you need to decide how to handle them.

Required Inputs

  • Gene label source: "user_provided" (use identifiers as-is) or "id_lookup" (map through an organism-specific bundled cache).
  • Organism: human, mouse, rat, or another supported species (required for ID lookup mode).
  • Gene ID type: ensembl, entrez, uniprot, or uniprot_swissprot (required for ID lookup mode).
  • Duplicate policy: "sum_duplicates" (sum counts across duplicates) or "keep_first" (retain only the first occurrence).
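
The four inputs above can be checked before any lookup runs. The following is a minimal sketch, not the easyCris API: the class name GeneLabelConfig and its field names are hypothetical, and only the quoted option values come from the list above. It enforces the rule that organism and gene ID type are required in ID lookup mode.

```python
from dataclasses import dataclass
from typing import Optional

# Option values are taken from the list above; every name below is hypothetical.
GENE_LABEL_SOURCES = {"user_provided", "id_lookup"}
GENE_ID_TYPES = {"ensembl", "entrez", "uniprot", "uniprot_swissprot"}
DUPLICATE_POLICIES = {"sum_duplicates", "keep_first"}


@dataclass(frozen=True)
class GeneLabelConfig:
    """Hypothetical container for the gene label settings."""

    gene_label_source: str
    duplicate_policy: str
    organism: Optional[str] = None
    gene_id_type: Optional[str] = None

    def __post_init__(self) -> None:
        if self.gene_label_source not in GENE_LABEL_SOURCES:
            raise ValueError(f"unknown gene label source: {self.gene_label_source!r}")
        if self.duplicate_policy not in DUPLICATE_POLICIES:
            raise ValueError(f"unknown duplicate policy: {self.duplicate_policy!r}")
        # Organism and ID type only matter when identifiers are mapped.
        if self.gene_label_source == "id_lookup":
            if not self.organism:
                raise ValueError("organism is required for id_lookup")
            if self.gene_id_type not in GENE_ID_TYPES:
                raise ValueError(f"unknown gene ID type: {self.gene_id_type!r}")


# A valid ID lookup configuration for human Ensembl IDs.
config = GeneLabelConfig(
    gene_label_source="id_lookup",
    duplicate_policy="keep_first",
    organism="human",
    gene_id_type="ensembl",
)
```

Validating at construction time means a missing organism surfaces immediately instead of as a batch of unmapped rows later.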

What to Expect

  • Version suffixes on Ensembl IDs are automatically stripped (ENSG00000141510.12 becomes ENSG00000141510) before lookup.
  • ID lookup uses the bundled local annotation cache distributed with easyCris.
  • Duplicate genes are merged according to the selected policy, and the number of merged entries is reported in a diagnostic message.
  • Rows with empty or unmappable gene labels are dropped and reported so you can verify no important genes were lost.
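
To make the lookup behavior concrete, here is a rough sketch of version stripping followed by a cache lookup that sets aside empty and unmappable rows. It is an illustration only: the function names are hypothetical, a plain dictionary stands in for the bundled annotation cache, and the suffix pattern (a trailing dot followed by digits) is an assumption about how versions are detected.

```python
import re
from typing import Dict, List, Tuple

# Assumption: a version suffix is a trailing "." followed by digits, e.g. ".12".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_ensembl_version(gene_id: str) -> str:
    """Remove a trailing version suffix from an Ensembl ID, if present."""
    return _VERSION_SUFFIX.sub("", gene_id)


def resolve_labels(
    gene_ids: List[str], cache: Dict[str, str]
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Map IDs to symbols; return (resolved, dropped) with original row indices."""
    resolved: List[Tuple[int, str]] = []
    dropped: List[Tuple[int, str]] = []
    for row, raw in enumerate(gene_ids):
        label = raw.strip() if raw else ""
        symbol = cache.get(strip_ensembl_version(label)) if label else None
        if symbol:
            resolved.append((row, symbol))
        else:
            # Empty or unmappable labels are collected so they can be reported.
            dropped.append((row, raw))
    return resolved, dropped


# Toy stand-in for the bundled cache (two real human Ensembl gene IDs).
toy_cache = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}
ids = ["ENSG00000141510.12", "ENSG00000012048", "", "ENSG99999999999.1"]

resolved, dropped = resolve_labels(ids, toy_cache)
print(resolved)  # [(0, 'TP53'), (1, 'BRCA1')]
print(dropped)   # [(2, ''), (3, 'ENSG99999999999.1')]
```

Keeping the original row index alongside each dropped label makes it easy to trace a missing gene back to the input matrix.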

Common Pitfalls

  • Selecting the wrong organism maps gene IDs incorrectly, so always verify that the organism matches your sequencing data.
  • The sum_duplicates policy can inflate counts if duplicates arise from annotation artifacts rather than from several identifiers that genuinely map to the same gene. Use it only when you are confident the duplicates represent the same biological gene.
  • The keep_first policy is safer for initial exploration; switch to sum_duplicates only when you understand why duplicates exist in your data.
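
The difference between the two policies is easiest to see on a small example. The sketch below is a hypothetical illustration, not the easyCris implementation: the function merge_duplicates and the row layout (a gene symbol paired with per-sample counts) are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def merge_duplicates(
    rows: List[Tuple[str, List[int]]], policy: str
) -> Tuple[Dict[str, List[int]], int]:
    """Collapse rows that share a gene symbol; return (merged, n_duplicate_rows)."""
    if policy not in ("sum_duplicates", "keep_first"):
        raise ValueError(f"unknown duplicate policy: {policy!r}")
    merged: Dict[str, List[int]] = {}
    n_duplicates = 0
    for symbol, counts in rows:
        if symbol not in merged:
            merged[symbol] = list(counts)
            continue
        n_duplicates += 1
        if policy == "sum_duplicates":
            # Add the duplicate row's counts sample by sample.
            merged[symbol] = [a + b for a, b in zip(merged[symbol], counts)]
        # keep_first: the later row is ignored.
    return merged, n_duplicates


rows = [
    ("TP53", [10, 0, 5]),
    ("BRCA1", [3, 4, 1]),
    ("TP53", [2, 7, 1]),  # second row resolving to TP53
]

summed, n_sum = merge_duplicates(rows, "sum_duplicates")
first, n_first = merge_duplicates(rows, "keep_first")
print(summed["TP53"], n_sum)   # [12, 7, 6] 1
print(first["TP53"], n_first)  # [10, 0, 5] 1
```

Running both policies on the same input and comparing the affected genes is a quick way to judge whether summing changes the counts enough to matter.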

References

  • Cunningham, F. et al. (2022). Ensembl 2022. Nucleic Acids Research, 50(D1), D988-D995.