Catching AI Lies in Translation: An OTEL-Powered Quality Loop

Here’s a sentence from an AI-translated biography of a Brazilian dance artist:

“A jornada que começou na Holanda, atravessou oceanos e agora retorna ao lar” (“The journey that began in the Netherlands, crossed oceans, and now returns home”)

Beautiful, right? Poetic, even. One problem: the artist never went to the Netherlands. The English source doesn’t mention the Netherlands. Nobody mentioned the Netherlands. The AI made it up, wove it into flawless Portuguese, and kept going like nothing happened.

If you’ve ever trusted an AI translation without reading it against the source — and let’s be honest, who has the time — this is what’s slipping through. Not garbled grammar. Not obvious errors. Confident fiction that reads like fact.

We built a pipeline that catches it. Here’s how.

Fluent Doesn’t Mean Faithful

Old-school machine translation was bad in obvious ways. Mangled grammar, robotic phrasing, wrong idioms. You could tell at a glance that something was off.

Modern LLMs have the opposite problem. The grammar is perfect. The tone is natural. The idioms land. And sometimes the model quietly invents information that doesn’t exist in the source — a date, a statistic, a biographical detail — and buries it in prose so fluent you’d never question it.

For client-facing business reports, this isn’t a style issue. A fabricated statistic in a market analysis misleads decisions. An invented biographical detail damages professional relationships. You can’t catch it by reading the translation alone. You have to compare every claim against the source.

Nobody does that manually. So we automated it.

The Loop

The translation-improvement agent runs a continuous quality loop over translated documents:

1. Read both files: English source and PT-BR translation
2. Line-by-line audit: flag translated content with no source counterpart
3. Classify: is it a hallucination (fabricated fact, P0) or an embellishment (tone-only, safe)?
4. Fix: remove or correct fabricated content
5. Score: emit OTEL evaluations for every quality dimension
6. Check thresholds: pass or fail against criteria
7. Iterate: repeat until everything passes

Step 3 is the key insight. Not all deviations are problems. Brazilian Portuguese is expressive — a good translation should add flavor. “Energia demais!” (So much energy!) isn’t a hallucination. It’s localization. A pipeline that flags every deviation produces noise. Ours distinguishes between cultural embellishments and fabricated facts.
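The seven steps can be sketched as a small control loop. This is an illustration only, not the agent's actual code: `run_quality_loop`, `Finding`, `Kind`, and the four injected callables are hypothetical names, and the real audit and classification are done by an LLM rather than by anything shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Kind(Enum):
    HALLUCINATION = "hallucination"  # fabricated fact, P0: must be fixed
    EMBELLISHMENT = "embellishment"  # tone-only addition: allowed to stay


@dataclass(frozen=True)
class Finding:
    text: str  # translated segment with no counterpart in the source
    kind: Kind


Scores = dict[str, float]


def run_quality_loop(
    source: str,
    translation: str,
    audit: Callable[[str, str], list[Finding]],
    fix: Callable[[str, list[Finding]], str],
    score: Callable[[str, str], Scores],
    passes: Callable[[Scores], bool],
    max_iterations: int = 5,  # hypothetical safety cap, not from the article
) -> tuple[str, Scores, int]:
    """Audit, fix, score, and repeat until the thresholds pass."""
    for iteration in range(1, max_iterations + 1):
        findings = audit(source, translation)  # steps 1-3: read, audit, classify
        p0 = [f for f in findings if f.kind is Kind.HALLUCINATION]
        if p0:
            translation = fix(translation, p0)  # step 4: only P0 findings are fixed
        scores = score(source, translation)  # step 5: score every dimension
        if passes(scores):  # step 6: threshold check
            return translation, scores, iteration
        # step 7: otherwise go around again
    raise RuntimeError(f"thresholds not met after {max_iterations} iterations")
```

Embellishments are returned by `audit` but never passed to `fix`, which is the whole point of step 3: “Energia demais!” stays, the Netherlands sentence goes.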

What Gets Measured

Every run emits gen_ai.evaluation.result records — structured OTEL events — tracking seven dimensions per file:

Metric	What It Measures	Target
`translation.hallucination`	Rate of fabricated factual content	<= 0.01
`translation.faithfulness`	Fidelity to source document facts	>= 0.98
`translation.readability_parity`	Reading level gap (FK grade)	<= 2 grades
`translation.coherence`	Natural flow in target language	>= 0.98
`translation.relevance`	All source content preserved	>= 0.98
`translation.length_ratio`	Word count ratio, PT-BR / EN	1.05 - 1.25
`translation.quality.score`	Weighted composite	>= 0.90
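One way to turn the targets in this table into an automated check is a dictionary of predicates. This is a minimal sketch: `THRESHOLDS` and `failing_metrics` are hypothetical names, and the choice to skip metrics that were not reported is an assumption made for the example.

```python
# Targets copied from the table above; each predicate returns True when the
# value is within target.
THRESHOLDS = {
    "translation.hallucination": lambda v: v <= 0.01,
    "translation.faithfulness": lambda v: v >= 0.98,
    "translation.readability_parity": lambda v: v <= 2.0,  # FK grade gap
    "translation.coherence": lambda v: v >= 0.98,
    "translation.relevance": lambda v: v >= 0.98,
    "translation.length_ratio": lambda v: 1.05 <= v <= 1.25,
    "translation.quality.score": lambda v: v >= 0.90,
}


def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return the reported metrics that miss their target.

    Metrics absent from `scores` are skipped (an assumption of this sketch).
    """
    return [
        name
        for name, within_target in THRESHOLDS.items()
        if name in scores and not within_target(scores[name])
    ]
```

Fed an early-session result such as faithfulness 0.828 and hallucination 0.172, `failing_metrics` returns both metric names. Fed scores that all sit inside their targets, it returns an empty list.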

The composite formula weights hallucination prevention highest:

composite = (1 - hallucination) * 0.30
          + faithfulness * 0.25
          + relevance * 0.20
          + coherence * 0.15
          + readability_parity * 0.10
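The formula translates directly into a few lines of Python. Treat this as a sketch: the function name is made up, and it assumes `readability_parity` arrives as a 0-1 score, since the article gives that target as a grade gap and does not say how the gap is normalized.

```python
# Weights from the composite formula above; they sum to 1.0.
WEIGHTS = {
    "hallucination": 0.30,  # applied to (1 - hallucination)
    "faithfulness": 0.25,
    "relevance": 0.20,
    "coherence": 0.15,
    "readability_parity": 0.10,  # assumed to be a 0-1 score here
}


def composite_score(
    hallucination: float,
    faithfulness: float,
    relevance: float,
    coherence: float,
    readability_parity: float,
) -> float:
    """Weighted composite; every input is expected to lie in [0, 1]."""
    return (
        (1 - hallucination) * WEIGHTS["hallucination"]
        + faithfulness * WEIGHTS["faithfulness"]
        + relevance * WEIGHTS["relevance"]
        + coherence * WEIGHTS["coherence"]
        + readability_parity * WEIGHTS["readability_parity"]
    )
```

With no hallucination, perfect faithfulness and relevance, coherence of 0.990, and an assumed readability parity of 1.0, the result is about 0.9985. Because hallucination carries the largest weight, a 17.2% hallucination rate alone costs more than five points of composite score.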

The Numbers: From 17% to Zero

The pipeline has run across multiple sessions since early February 2026, translating three documents into PT-BR for a dance company. Here’s the trajectory.

Feb 11: First Pass

Early LLM-as-Judge evaluations surfaced warning signals immediately:

Session	Faithfulness	Hallucination	Status
`a75991f7` (early)	0.828	0.172	Warning
`a75991f7` (mid)	0.901	0.099	Improving
`a75991f7` (late)	0.957	0.043	Near-pass
`bab0a186`	0.906	0.095	Warning

Hallucination rates ranged from roughly 4% to 17%. The session also burned 1.75M tokens with a 998:1 input-to-output ratio, and 63% of tool calls went to task management overhead rather than actual translation. Expensive, noisy, and untrustworthy.

Feb 13-14: Post-Mortem

We found the Netherlands sentence, traced it to likely LLM confabulation from training data about Brazilian dance communities in Europe, and designed specialized agents to replace the bloated general-purpose approach.

Session evaluations showed the improvement work paying off:

Session	Faithfulness	Hallucination	Context
`4c6726ce`	0.857	0.143	Pre-fix session
`b95c2314`	0.966	0.034	Post-fix session
`b95c2314` (re-eval)	0.968	0.032	Confirmed

Feb 17: Clean Sweep

The mature pipeline hit its stride. Per-file evaluation data from session trans-quality-1771322194:

File	Quality	Faithfulness	Coherence	Relevance	Length Ratio
`edghar_nadyne_perfil_artista.html`	0.999	1.000	0.990	1.000	1.169
`analise_mercado_zouk.html`	0.999	1.000	0.990	1.000	1.204
`analise_mercado_austin.html`	0.999	1.000	0.990	1.000	1.247

All thresholds pass. One fix applied: the Netherlands sentence, removed.

The full trajectory:

Session         Hallucination Rate
──────────────  ──────────────────
Feb 11 (early)  █████████████████░░░  17.2%
Feb 11 (mid)    █████████░░░░░░░░░░░   9.9%
Feb 11 (late)   ████░░░░░░░░░░░░░░░░   4.3%
Feb 14 (pre)    ██████████████░░░░░░  14.3%
Feb 14 (post)   ███░░░░░░░░░░░░░░░░░   3.2%
Feb 17 (final)  ░░░░░░░░░░░░░░░░░░░░   0.0%

Why the Netherlands Sentence Matters

That fabricated sentence is instructive beyond this one project.

The model drew on training data — probably real information about Brazilian dance communities in Europe — inferred a plausible biographical detail, and inserted it with the same tone as verified content. It reads beautifully. It sounds authentic. It is entirely made up.

This is the invisible failure mode that makes AI translation dangerous for business content. Not wrong grammar. Not awkward phrasing. Plausible fiction presented as fact. And the pipeline caught it the only way it can be caught at scale: comparing every sentence in the translation against the source, flagging content with no counterpart.

Embellishment Is Not Hallucination

Ten colloquial additions survived the audit, and that was intentional. Examples include:

“Energia demais!” (So much energy!)
“Só gente top!” (Only top-notch people!)
“Maravilhoso!” (Wonderful!)
“Incrível!” (Incredible!)
“Gratidão!” (Gratitude!)

These carry no factual claims. They’re characteristic of natural Brazilian Portuguese. A translation that stripped them would be technically faithful and culturally dead.

This is the hard part of automated translation evaluation: you need a system smart enough to know that “Energia demais!” is localization and “The journey began in the Netherlands” is fabrication. Flag everything, and you drown in false positives. Flag nothing, and the Netherlands sentence ships.

The OTEL Architecture

Each evaluation emits a structured JSON record using the OpenTelemetry gen_ai.evaluation.result semantic convention:

{
  "timestamp": "2026-02-17T09:56:34.000Z",
  "name": "gen_ai.evaluation.result",
  "attributes": {
    "gen_ai.evaluation.name": "translation.hallucination",
    "gen_ai.evaluation.score.value": 0.000,
    "gen_ai.evaluation.evaluator": "translation-improvement",
    "gen_ai.evaluation.evaluator.type": "llm",
    "gen_ai.agent.name": "translation-improvement",
    "gen_ai.evaluation.file": "edghar_nadyne_perfil_artista.html",
    "session.id": "trans-quality-1771322194"
  }
}

Records are appended to daily JSONL files, so they can be aggregated across sessions, files, and metrics. This gives you:

Regression detection — a future edit re-introduces a hallucination? Next run catches it.
Cross-file comparison — quality scores comparable across documents and languages
Trend analysis — is the pipeline improving or degrading over time?
Alerting — threshold breaches trigger notifications before bad translations reach production
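A minimal sketch of both halves, writing a record and scanning the logs for breaches, might look like the following. The function names, the `evaluations-YYYY-MM-DD.jsonl` file naming, and the 0.01 default limit are assumptions for illustration. The article only states that records go to daily JSONL files in the shape shown above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def emit_evaluation(log_dir, metric, value, file, session_id, now=None) -> Path:
    """Append one gen_ai.evaluation.result record to that day's JSONL file."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "name": "gen_ai.evaluation.result",
        "attributes": {
            "gen_ai.evaluation.name": metric,
            "gen_ai.evaluation.score.value": value,
            "gen_ai.evaluation.evaluator": "translation-improvement",
            "gen_ai.evaluation.evaluator.type": "llm",
            "gen_ai.agent.name": "translation-improvement",
            "gen_ai.evaluation.file": file,
            "session.id": session_id,
        },
    }
    # Hypothetical naming scheme: one file per UTC day.
    path = Path(log_dir) / f"evaluations-{now:%Y-%m-%d}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path


def hallucination_breaches(log_dir, limit=0.01):
    """Scan every daily file and list (session, file, score) above the limit."""
    breaches = []
    for path in sorted(Path(log_dir).glob("evaluations-*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                attrs = json.loads(line)["attributes"]
                score = attrs["gen_ai.evaluation.score.value"]
                if attrs["gen_ai.evaluation.name"] == "translation.hallucination" and score > limit:
                    breaches.append((attrs["session.id"], attrs["gen_ai.evaluation.file"], score))
    return breaches
```

Because every record carries the metric name, file, and session, the same scan can be pointed at any other metric for trend analysis, or wired to a notifier for alerting.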

What’s Next

Three of 19 English reports have PT-BR translations (15.8% coverage). The architecture scales from here:

New languages: Evaluation dimensions are language-agnostic; only the length ratio target changes
New documents: Each session automatically emits evaluations for all files processed
Voice matching: A planned voice style matcher will score translations against the client’s actual communication style (target: > 0.85 fidelity)
Continuous monitoring: Translation quality feeds into the same observability stack as everything else

The lesson is the same one that keeps surfacing across AI work: generation is the easy part. Verification is the competitive advantage. Without structured telemetry, the Netherlands sentence ships to production. With it, you catch the fabrication — and keep the “Energia demais!”

Translation quality is one instance of a broader problem: AI systems that appear to work perfectly while producing invisible failures. At Integrity Studio, we build the measurement systems that make quality visible. If your AI outputs look right but you can’t prove it, that’s where we start.

Appendix: Readability Analysis (textstat)

Readability scores computed via the textstat MCP server on the full article body and individual sections.

Full Article

Metric	Score	Interpretation
Flesch Reading Ease	43.4	Difficult (college level)
Flesch-Kincaid Grade	9.6	10th grade
SMOG Index	11.7	12th grade
Gunning Fog	12.5	College freshman
Automated Readability Index	10.3	10th grade
Dale-Chall	11.1	College
Coleman-Liau Index	13.6	College freshman
Linsear Write	8.4	8th grade
Consensus	10th-11th grade	High school junior

Statistic	Value
Word count	785
Sentence count	81
Syllable count	1,425
Polysyllabic words	183 (23.3%)
Difficult words	193 (24.6%)
Est. reading time	1.1 min
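The two Flesch scores can be reproduced from the word, sentence, and syllable counts above using the standard English formulas. This is a cross-check sketch rather than the textstat implementation itself: the function names are hypothetical, and textstat does its own counting internally, so results on raw text may differ slightly.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula (higher = easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_band(score: float) -> str:
    """Map a Reading Ease score to the bands listed under Methodology."""
    if score < 30:
        return "Very Confusing"
    if score < 50:
        return "Difficult"
    if score < 60:
        return "Fairly Difficult"
    if score < 70:
        return "Standard"
    return "Easy"


# Counts from the statistics table: 785 words, 81 sentences, 1,425 syllables.
ease = flesch_reading_ease(785, 81, 1425)
grade = flesch_kincaid_grade(785, 81, 1425)
print(round(ease, 1), flesch_band(ease))  # 43.4 Difficult
print(round(grade, 1))                    # 9.6
```

Both values match the Full Article table, which suggests the reported counts and scores are internally consistent.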

Per-Section Breakdown

Section	Flesch Ease	FK Grade	Gunning Fog	Dale-Chall	Consensus
Introduction	63.3	7.2	9.7	9.9	9th-10th
Fluent vs. Faithful	49.3	8.4	11.1	10.3	10th-11th
Netherlands Analysis	50.2	8.8	11.9	11.0	11th-12th
What’s Next / Conclusion	38.2	10.3	13.7	11.1	11th-12th

Target audience: Technical professionals and AI practitioners. Readability is consistent with industry case studies and technical blog posts (typically 10th-14th grade). The introduction is the most accessible section (9th-10th grade, Flesch 63.3 “Standard”) thanks to short, punchy sentences and direct address. The conclusion is the most difficult, largely because of compound technical noun phrases.

Methodology

Scores generated by the textstat Python library via MCP server integration
Analysis run on article body text only (front matter, code blocks, and table markup excluded)
Section boundaries follow the H2 headings in the article structure
Flesch Reading Ease scale: 0-29 Very Confusing, 30-49 Difficult, 50-59 Fairly Difficult, 60-69 Standard, 70+ Easy