Catching AI Lies in Translation: An OTEL-Powered Quality Loop
Here’s a sentence from an AI-translated biography of a Brazilian dance artist:
“A jornada que começou na Holanda, atravessou oceanos e agora retorna ao lar” (“The journey that began in the Netherlands, crossed oceans, and now returns home”)
Beautiful, right? Poetic, even. One problem: the artist never went to the Netherlands. The English source doesn’t mention the Netherlands. Nobody mentioned the Netherlands. The AI made it up, wove it into flawless Portuguese, and kept going like nothing happened.
If you’ve ever trusted an AI translation without reading it against the source — and let’s be honest, who has the time — this is what’s slipping through. Not garbled grammar. Not obvious errors. Confident fiction that reads like fact.
We built a pipeline that catches it. Here’s how.
Fluent Doesn’t Mean Faithful
Old-school machine translation was bad in obvious ways. Mangled grammar, robotic phrasing, wrong idioms. You could tell at a glance that something was off.
Modern LLMs have the opposite problem. The grammar is perfect. The tone is natural. The idioms land. And sometimes the model quietly invents information that doesn’t exist in the source — a date, a statistic, a biographical detail — and buries it in prose so fluent you’d never question it.
For client-facing business reports, this isn’t a style issue. A fabricated statistic in a market analysis misleads decisions. An invented biographical detail damages professional relationships. You can’t catch it by reading the translation alone. You have to compare every claim against the source.
Nobody does that manually. So we automated it.
The Loop
The translation-improvement agent runs a continuous quality loop over translated documents:
- Read both files — English source and PT-BR translation
- Line-by-line audit — flag translated content with no source counterpart
- Classify — is it a hallucination (fabricated fact, P0) or an embellishment (tone-only, safe)?
- Fix — remove or correct fabricated content
- Score — emit OTEL evaluations for every quality dimension
- Check thresholds — pass or fail against criteria
- Iterate — repeat until everything passes
Step 3 is the key insight. Not all deviations are problems. Brazilian Portuguese is expressive — a good translation should add flavor. “Energia demais!” (So much energy!) isn’t a hallucination. It’s localization. A pipeline that flags every deviation produces noise. Ours distinguishes between cultural embellishments and fabricated facts.
What Gets Measured
Every run emits gen_ai.evaluation.result records — structured OTEL events — tracking seven dimensions per file:
| Metric | What It Measures | Target |
|---|---|---|
translation.hallucination | Rate of fabricated factual content | <= 0.01 |
translation.faithfulness | Fidelity to source document facts | >= 0.98 |
translation.readability_parity | Reading level gap (FK grade) | <= 2 grades |
translation.coherence | Natural flow in target language | >= 0.98 |
translation.relevance | All source content preserved | >= 0.98 |
translation.length_ratio | Word count EN-to-PT ratio | 1.05 - 1.25 |
translation.quality.score | Weighted composite | >= 0.90 |
The composite formula weights hallucination prevention highest:
composite = (1 - hallucination) * 0.30
+ faithfulness * 0.25
+ relevance * 0.20
+ coherence * 0.15
+ readability_parity * 0.10
The Numbers: From 17% to Zero
The pipeline has run across multiple sessions since early February 2026, translating three PT-BR documents for a dance company. Here’s the trajectory.
Feb 11: First Pass
Early LLM-as-Judge evaluations surfaced warning signals immediately:
| Session | Faithfulness | Hallucination | Status |
|---|---|---|---|
a75991f7 (early) | 0.828 | 0.172 | Warning |
a75991f7 (mid) | 0.901 | 0.099 | Improving |
a75991f7 (late) | 0.957 | 0.043 | Near-pass |
bab0a186 | 0.906 | 0.095 | Warning |
Hallucination rates between 5-17%. The session also burned 1.75M tokens with a 998:1 input-to-output ratio, and 63% of tool calls went to task management overhead rather than actual translation. Expensive, noisy, and untrustworthy.
Feb 13-14: Post-Mortem
We found the Netherlands sentence, traced it to likely LLM confabulation from training data about Brazilian dance communities in Europe, and designed specialized agents to replace the bloated general-purpose approach.
Session evaluations showed the improvement work paying off:
| Session | Faithfulness | Hallucination | Context |
|---|---|---|---|
4c6726ce | 0.857 | 0.143 | Pre-fix session |
b95c2314 | 0.966 | 0.034 | Post-fix session |
b95c2314 (re-eval) | 0.968 | 0.032 | Confirmed |
Feb 17: Clean Sweep
The mature pipeline hit its stride. Per-file evaluation data from session trans-quality-1771322194:
| File | Quality | Hallucination | Faithfulness | Coherence | Relevance | Length Ratio |
|---|---|---|---|---|---|---|
edghar_nadyne_perfil_artista.html | 0.999 | 0.000 | 1.000 | 0.990 | 1.000 | 1.169 |
analise_mercado_zouk.html | 0.999 | 0.000 | 1.000 | 0.990 | 1.000 | 1.204 |
analise_mercado_austin.html | 0.999 | 0.000 | 1.000 | 0.990 | 1.000 | 1.247 |
All thresholds pass. One fix applied: the Netherlands sentence, removed.
The full trajectory:
Session Hallucination Rate
───────────── ──────────────────
Feb 11 (early) ████████████████░░░░ 17.2%
Feb 11 (mid) █████████░░░░░░░░░░░ 9.9%
Feb 11 (late) ████░░░░░░░░░░░░░░░░ 4.3%
Feb 14 (pre) ██████████████░░░░░░ 14.3%
Feb 14 (post) ███░░░░░░░░░░░░░░░░░ 3.2%
Feb 17 (final) ░░░░░░░░░░░░░░░░░░░░ 0.0%
Why the Netherlands Sentence Matters
That fabricated sentence is instructive beyond this one project.
The model drew on training data — probably real information about Brazilian dance communities in Europe — inferred a plausible biographical detail, and inserted it with the same tone as verified content. It reads beautifully. It sounds authentic. It is entirely made up.
This is the invisible failure mode that makes AI translation dangerous for business content. Not wrong grammar. Not awkward phrasing. Plausible fiction presented as fact. And the pipeline caught it the only way it can be caught at scale: comparing every sentence in the translation against the source, flagging content with no counterpart.
Embellishment Is Not Hallucination
Ten colloquial additions survived the audit — intentionally:
- “Energia demais!” (So much energy!)
- “So gente top!” (Only top-notch people!)
- “Maravilhoso!” (Wonderful!)
- “Incrivel!” (Incredible!)
- “Gratidao!” (Gratitude!)
These carry no factual claims. They’re characteristic of natural Brazilian Portuguese. A translation that stripped them would be technically faithful and culturally dead.
This is the hard part of automated translation evaluation: you need a system smart enough to know that “Energia demais!” is localization and “The journey began in the Netherlands” is fabrication. Flag everything, and you drown in false positives. Flag nothing, and the Netherlands sentence ships.
The OTEL Architecture
Each evaluation emits a structured JSON record using the OpenTelemetry gen_ai.evaluation.result semantic convention:
{
"timestamp": "2026-02-17T09:56:34.000Z",
"name": "gen_ai.evaluation.result",
"attributes": {
"gen_ai.evaluation.name": "translation.hallucination",
"gen_ai.evaluation.score.value": 0.000,
"gen_ai.evaluation.evaluator": "translation-improvement",
"gen_ai.evaluation.evaluator.type": "llm",
"gen_ai.agent.name": "translation-improvement",
"gen_ai.evaluation.file": "edghar_nadyne_perfil_artista.html",
"session.id": "trans-quality-1771322194"
}
}
Written to daily JSONL files, aggregatable across sessions, files, and metrics. This gives you:
- Regression detection — a future edit re-introduces a hallucination? Next run catches it.
- Cross-file comparison — quality scores comparable across documents and languages
- Trend analysis — is the pipeline improving or degrading over time?
- Alerting — threshold breaches trigger notifications before bad translations reach production
What’s Next
Three of 19 English reports have PT-BR translations (15.8% coverage). The architecture scales from here:
- New languages: Evaluation dimensions are language-agnostic; only the length ratio target changes
- New documents: Each session automatically emits evaluations for all files processed
- Voice matching: A planned voice style matcher will score translations against the client’s actual communication style (target: > 0.85 fidelity)
- Continuous monitoring: Translation quality feeds into the same observability stack as everything else
The lesson is the same one that keeps surfacing across AI work: generation is the easy part. Verification is the competitive advantage. Without structured telemetry, the Netherlands sentence ships to production. With it, you catch the fabrication — and keep the “Energia demais!”
Translation quality is one instance of a broader problem: AI systems that appear to work perfectly while producing invisible failures. At Integrity Studio, we build the measurement systems that make quality visible. If your AI outputs look right but you can’t prove it, that’s where we start.
Appendix: Readability Analysis (textstat)
Readability scores computed via the textstat MCP server on the full article body and individual sections.
Full Article
| Metric | Score | Interpretation |
|---|---|---|
| Flesch Reading Ease | 43.4 | Difficult (college level) |
| Flesch-Kincaid Grade | 9.6 | 10th grade |
| SMOG Index | 11.7 | 12th grade |
| Gunning Fog | 12.5 | College freshman |
| Automated Readability Index | 10.3 | 10th grade |
| Dale-Chall | 11.1 | College |
| Coleman-Liau Index | 13.6 | College freshman |
| Linsear Write | 8.4 | 8th grade |
| Consensus | 10th-11th grade | High school junior |
| Statistic | Value |
|---|---|
| Word count | 785 |
| Sentence count | 81 |
| Syllable count | 1,425 |
| Polysyllabic words | 183 (23.3%) |
| Difficult words | 193 (24.6%) |
| Est. reading time | 1.1 min |
Per-Section Breakdown
| Section | Flesch Ease | FK Grade | Gunning Fog | Dale-Chall | Consensus |
|---|---|---|---|---|---|
| Introduction | 63.3 | 7.2 | 9.7 | 9.9 | 9th-10th |
| Fluent vs. Faithful | 49.3 | 8.4 | 11.1 | 10.3 | 10th-11th |
| Netherlands Analysis | 50.2 | 8.8 | 11.9 | 11.0 | 11th-12th |
| What’s Next / Conclusion | 38.2 | 10.3 | 13.7 | 11.1 | 11th-12th |
Target audience: Technical professionals and AI practitioners; readability is consistent with industry case studies and technical blog posts (typically 10th-14th grade). The introduction scores most accessible (9th-10th grade, Flesch 63.3 “Standard”) due to short punchy sentences and direct address. The conclusion scores highest difficulty due to compound technical noun phrases.
Methodology
- Scores generated by the
textstatPython library via MCP server integration - Analysis run on article body text only (front matter, code blocks, and table markup excluded)
- Section boundaries follow the H2 headings in the article structure
- Flesch Reading Ease scale: 0-29 Very Confusing, 30-49 Difficult, 50-59 Fairly Difficult, 60-69 Standard, 70+ Easy