Hidden Hazards: Why Machine Translation Can't Be Trusted with Numbers

Steven W. Halim

Eko J. Salim

The Discovery

At Omni, we recently stumbled upon a surprising issue: our AI translation models were struggling with numbers. After several users reported inconsistencies, we dug deeper and found that in our tests, numerical translations were incorrect in 22% of cases (n=396). This wasn’t a small glitch; it was a significant problem that needed attention.
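
For readers who want to run a similar audit on their own data, the tally itself is simple. Here is a minimal sketch, assuming a `values_match` predicate that encodes your own number-comparison rules (this is illustrative, not our production pipeline):

```python
def numeric_error_rate(cases, values_match):
    """Compute the fraction of translations whose numbers are wrong.

    cases: list of (source_text, model_translation) pairs.
    values_match: predicate returning True when the numeric values
    in source and translation agree.
    """
    errors = sum(1 for src, out in cases if not values_match(src, out))
    return errors / len(cases)

# Example: 87 mismatches out of 396 cases gives ~0.22, i.e. a 22% error rate.
```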

Our Investigation

This piqued our curiosity about how other LLMs (large language models) handle numerical translations. We tested several models, including ChatGPT, Gemini, and Claude, and discovered that the problem isn’t unique to us: numerical mistranslation is a broader phenomenon across AI models.

Experiment 1: Simple Number Translation with ChatGPT

[Figure: experiment-1.png]

We started with a basic test: translating “1个亿” from Chinese to English. ChatGPT successfully rendered this as “100 million,” the correct interpretation. This gave us some initial confidence that numerical accuracy might be more reliable than we thought.
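
For context, Chinese groups large numbers by 万 (10,000) and 亿 (100,000,000) rather than by thousands, which is exactly where translations go wrong: “1个亿” is 1 × 100,000,000, i.e. 100 million. A small Python sketch of how the values can be extracted and compared (the regexes and unit tables are simplified illustrations, not a full numeral parser):

```python
import re

# Simplified magnitude tables; real Chinese numerals need a fuller parser.
CN_UNITS = {"万": 10_000, "亿": 100_000_000}
EN_UNITS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

def cn_values(text):
    """Extract values like '1个亿' (1 x 100,000,000) from Chinese text."""
    return [int(m.group(1)) * CN_UNITS[m.group(2)]
            for m in re.finditer(r"(\d+)\s*个?\s*([万亿])", text)]

def en_values(text):
    """Extract values like '100 million' from English text."""
    return [int(m.group(1)) * EN_UNITS[m.group(2).lower()]
            for m in re.finditer(r"(\d+)\s+(thousand|million|billion)", text, re.I)]

# "他赚了1个亿" means "He earned 100 million."
assert cn_values("他赚了1个亿") == [100_000_000]
assert en_values("He earned 100 million") == [100_000_000]  # values agree
assert en_values("He earned 1 billion") == [1_000_000_000]  # off by 10x
```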

Experiment 2: Story-Based Translations

[Figure: experiment-2.png]

Next, we attempted to translate a story with “1个亿” embedded within the passage. Surprisingly, ChatGPT’s output changed dramatically, translating it as “1 billion” instead of “100 million.” You can check out the result yourself at this link. This inconsistency highlighted a critical issue: context significantly affects numerical translation accuracy.

Experiment 3: More Examples

[Figure: experiment-3a.png]

[Figure: experiment-3b.png]

We also tested another numerical term, “万年”, which translates to “ten thousand years.” When presented as a standalone phrase, all models translated it correctly. However, when embedded within a narrative context, the output shifted to “a thousand years,” an order of magnitude too small. It’s clear that context changes how these models interpret numbers, often leading to misinterpretation.
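
One wrinkle makes “万年” especially easy to misread: it carries no explicit digit, so the coefficient 1 is implicit (万 = 10,000). Extending the earlier extraction sketch to accept a bare magnitude word (still an illustration, not a complete numeral parser):

```python
import re

CN_UNITS = {"万": 10_000, "亿": 100_000_000}

def cn_values(text):
    """Like the earlier sketch, but a bare unit implies a coefficient of 1."""
    vals = []
    for m in re.finditer(r"(\d*)\s*个?\s*([万亿])", text):
        coeff = int(m.group(1)) if m.group(1) else 1
        vals.append(coeff * CN_UNITS[m.group(2)])
    return vals

assert cn_values("万年") == [10_000]        # ten thousand years
assert cn_values("1个亿") == [100_000_000]  # earlier case still works
# "a thousand years" corresponds to 1_000: off by a factor of ten.
```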

The Real Issue: Inconsistent and Unreliable Outputs

Our research suggests that these errors aren’t random but follow specific patterns:

  1. Longer sentences and paragraphs are more prone to numerical mistranslations (see the sketch after this list)
  2. Literary or complex contexts increase the likelihood of errors
  3. The issue appears consistent across different LLMs
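
To make the first pattern measurable, you can bucket test cases by source length and compare per-bucket error rates; if the pattern holds, longer buckets show higher rates. A quick sketch, assuming a `values_match` predicate like the ones sketched earlier:

```python
from collections import defaultdict

def error_rate_by_length(cases, values_match, bucket_size=50):
    """Bucket (source, translation) pairs by source length in characters
    and report the numeric error rate within each bucket."""
    tallies = defaultdict(lambda: [0, 0])  # bucket -> [errors, total]
    for src, out in cases:
        b = (len(src) // bucket_size) * bucket_size
        tallies[b][1] += 1
        if not values_match(src, out):
            tallies[b][0] += 1
    return {b: errs / total for b, (errs, total) in sorted(tallies.items())}

# If pattern (1) holds, the rate should climb as the bucket key grows.
```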

We hypothesize that these models may be overfitting to web translations or training data in which numbers were incorrectly translated or rounded off. The most concerning aspect is the unpredictability: users cannot reliably tell when a model will produce an incorrect numerical translation.

What We’re Doing at Omni

At Omni, we’re not just highlighting this challenge; we’re actively solving it. While our current system faces the same hurdles as others in the industry, our team of language and AI experts is developing dedicated solutions for precise numerical translation. We’re excited to share that our upcoming release will feature specialized algorithms designed to handle numbers with far greater accuracy, setting a new standard for reliable translations.

The Bottom Line

AI translation has come a long way, but it’s not perfect—especially with numbers. For now, we recommend users exercise caution when translating content containing important numerical information and verify critical numbers through multiple channels.
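
One lightweight way to automate that verification is to diff the numbers on both sides of a translation and flag any disagreement for human review. A self-contained sketch, reusing the simplified extraction rules from earlier (the function names are ours, and the rules are far from complete):

```python
import re

CN_UNITS = {"万": 10_000, "亿": 100_000_000}
EN_UNITS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

def cn_values(text):
    """Extract magnitude-word values from Chinese text; bare unit implies 1."""
    return [(int(m.group(1)) if m.group(1) else 1) * CN_UNITS[m.group(2)]
            for m in re.finditer(r"(\d*)\s*个?\s*([万亿])", text)]

def en_values(text):
    """Extract values like '100 million' from English text."""
    return [int(m.group(1)) * EN_UNITS[m.group(2).lower()]
            for m in re.finditer(r"(\d+)\s+(thousand|million|billion)", text, re.I)]

def flag_numeric_drift(src_zh, out_en):
    """Return a warning string when the numeric values disagree, else None."""
    src, out = sorted(cn_values(src_zh)), sorted(en_values(out_en))
    if src != out:
        return f"numeric mismatch: source {src} vs translation {out}"
    return None

# "他赚了1个亿" means "He earned 100 million."
warning = flag_numeric_drift("他赚了1个亿", "He earned 1 billion.")
if warning:
    print("REVIEW NEEDED:", warning)
# -> REVIEW NEEDED: numeric mismatch: source [100000000] vs translation [1000000000]
```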