LayoutTranslateBench — Methodology Roadmap

This document tracks methodology critiques against LTB and their status. The benchmark is a research artifact; methodology is expected to improve with each release. This file is the public, honest record of what’s been fixed and what remains.

Status as of v0.1.7.2

| # | Critique | Status | Notes |
|---|---|---|---|
| 1 | Sample size N=5 statistically meaningless | ✅ Largely fixed (v0.1.7) | Core 8: N=20 (en-es/de/ar/fr/th/ms), N=27 (en-zh), N=28 (en-ja). Extension 8 (v0.1.7): N=10 for en-ru (certified-only); N=13 each for en-ko/vi/id/ur/uz/kk/zh-tw (10 certified-FLORES + 3 ml-curated rileykim). Industry-grade FLORES-200 refs back every pair. |
| 2 | Identity baseline = chrF Latin-bleed artifact | ✅ Fixed (v0.1.1) | Language-detection gate (methodology.md) |
| 3 | chrF wrong text metric (paraphrase / adequacy blind) | ✅ Fixed (v0.1.2 round-2) | COMET-Kiwi-22 ships as a second leaderboard, run on all systems. See docs/comet-setup.md. |
| 4 | Oracle vs end-to-end conflation in headline scores | ✅ Fixed (v0.1.1) | Separate leaderboard tables |
| 5 | DeepL doc-context unfairness | ✅ Clarified (v0.1.1) | Both DeepL Text API and NLLB are per-region; this is a fair comparison. DeepL Documents (with context) is a v0.2 runner. |
| 6 | Composite weights 50/30/20 unvalidated | ✅ Empirically defended (v0.1.1) | Weight ablation script shows ranking stable (τ_norm = 1.0) across (50/30/20), (40/40/20), (60/20/20), (33/33/33). See results/weight_ablation.json. |
| 7 | Kendall τ partial-coverage hole | ✅ Fixed (v0.1.1) | Coverage-aware τ (methodology.md) |
| 8 | Single author-curated reference | ⚠️ Partially addressed (v0.1.5 + v0.1.7) | FLORES-200 adds industry-grade refs alongside the author-curated set; v0.1.7 extends FLORES coverage to all 16 LTB pairs. Still single-ref per region; multi-ref scoring deferred to v0.2. |
| 9 | Qwen-VL parser/model conflation | ✅ Fixed (v0.1.2) | --exclude-parser-failures flag on ltbench score. With-failures = 14.82; restricted = 16.94 (v0.1.7.2 scoring). |
| 10 | No human evaluation correlation | ⚠️ Infrastructure shipped (v0.1.2) | ltbench.human_eval module + ltbench export-eval-prompts + ltbench correlate-human CLI commands. Collecting judgments still requires paid raters (v0.2). |
| 11 | Reference corruption on mined refs (rileykim) | ✅ Fixed (v0.1.6.4) | Script-validation gate at ingest + post-hoc scripts/validate_extension_refs.py. Refs whose tgt_text script doesn’t match the expected target-language script are dropped. The gate surfaced that the en-ru rileykim rows carried Chinese target text, so en-ru dropped to N=0 until the v0.1.7 FLORES back-fill. |
| 12 | Open-source MT licensing (NLLB CC-BY-NC) | ✅ Fixed (v0.1.6.2) | Helsinki-NLP/opus-mt runner ships commercial-safe MT at 16/16 coverage (Apache-2.0 / CC-BY-4.0). |
| 13 | Per-pair quality variance for open MT | ✅ Documented (v0.1.6.3) | opus-mt is competitive with NLLB on European pairs, dramatically weaker on Asian / Central Asian. Per-pair gaps recorded in methodology.md. |
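
The weight ablation behind critique #6 can be illustrated with a minimal sketch. It assumes the composite is a weighted sum of three component scores (labelled text, layout and order here, which is an assumption about how the 50/30/20 split is applied) and checks whether the system ranking changes across the four weightings listed in the table. The system names and scores are invented for illustration and are not LTB results; identical rankings correspond to a Kendall τ of 1.0.

```python
# Hypothetical component scores per system: (text, layout, order), each 0-100.
# These values are made up for the example; they are NOT benchmark results.
SYSTEMS = {
    "sys_a": (80.0, 70.0, 90.0),
    "sys_b": (60.0, 65.0, 70.0),
    "sys_c": (40.0, 50.0, 55.0),
}

# The four weightings named in the critique table.
WEIGHTINGS = [(50, 30, 20), (40, 40, 20), (60, 20, 20), (33, 33, 33)]


def composite(scores, weights):
    """Weighted mean of the component scores (weights need not sum to 100)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def ranking(weights):
    """System names ordered from best to worst composite under `weights`."""
    return sorted(SYSTEMS, key=lambda name: composite(SYSTEMS[name], weights), reverse=True)


baseline = ranking(WEIGHTINGS[0])
# The ranking is "stable" if every alternative weighting reproduces the baseline order.
stable = all(ranking(w) == baseline for w in WEIGHTINGS)
print(baseline, stable)
```

A ranking that flips under a plausible alternative weighting would be the signal that the default weights need stronger justification.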

Of the 13 critiques, nine are marked fixed (#1 largely, #2, #3, #4, #6 by empirical defence, #7, #9, #11, #12), one is clarified (#5), two are mitigated (#8 partially addressed; #10 infrastructure shipped, judgment data pending), and one is documented (#13). No critique remains fully open at v0.1.7.2. The remaining v0.2 work (#8 multi-ref, #10 human-eval data, plus visual-fidelity and OCR round-trip metrics) is scope that depends on budget or engineering time, not methodology debt.

What blocks the remaining items

| Item | What’s needed | Cost | Effort |
|---|---|---|---|
| #8 Multi-reference + certified translators | 2–3 certified translators producing 1 reference each for 59 docs × 16 pairs | ~€20k–50k | Calendar weeks |
| #10 Human evaluation (data) | 5 raters × ~50 DA judgments × 4–5 systems = ~1,000–1,250 judgments | ~€2k–5k via Mechanical Turk / Prolific, or €5k–10k via certified translators | Calendar weeks |
| Visual fidelity (v0.2) | LPIPS implementation on non-text regions of rendered output | $0 | ~1 week engineering |
| OCR round-trip (v0.2) | Tesseract / PaddleOCR on rendered output, compare to predicted text | $0 | ~1 week engineering |
| DeepL coverage expansion | Run DeepL on the remaining 10 pairs | ~$5 (API) | 1 hour |

Sequencing

v0.1.2 shipped:

  • Parser-failure-excluded scoring: --exclude-parser-failures flag on ltbench score. Restricts aggregation to documents whose submission did NOT trigger the runner’s parser fallback (1-region empty placeholder). Per-doc scores still include them with parser_failure=True so the count is visible. Demonstrated on Qwen-VL: with-failures = 14.82, without = 16.94 (v0.1.7.2 scoring).
  • Human-evaluation infrastructure: ltbench.human_eval module + ltbench export-eval-prompts (export CSV for raters) + ltbench correlate-human (compute Kendall τ / Pearson r between human DA and automatic LTB-100) + judgment JSONL schema. v0.1.2 ships the scaffolding; collecting actual judgments still requires paid raters (v0.2).
  • COMET-Kiwi-22 integration: initially deferred in v0.1.2 round 1 due to a numpy version conflict on Windows; resolved in round 2 with the .comet-env workflow. All 4 systems scored on both chrF and COMET-Kiwi; both leaderboards published.
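
The parser-failure-excluded aggregation described in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the ltbench implementation: the `DocScore` record and `aggregate` helper are hypothetical names, the aggregate is assumed to be a plain mean, and the scores are invented. The point is that flagged documents stay in the per-doc list but are left out of the restricted aggregate.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class DocScore:
    """Hypothetical per-document score record."""
    doc_id: str
    score: float
    parser_failure: bool = False  # True when the runner fell back to a placeholder


def aggregate(doc_scores: List[DocScore], exclude_parser_failures: bool = False) -> Optional[float]:
    """Mean score, optionally restricted to documents without a parser failure."""
    kept = [d for d in doc_scores if not (exclude_parser_failures and d.parser_failure)]
    return mean(d.score for d in kept) if kept else None


# Invented scores; doc_c stands in for a submission that triggered the fallback.
docs = [
    DocScore("doc_a", 20.0),
    DocScore("doc_b", 10.0),
    DocScore("doc_c", 0.0, parser_failure=True),
]

with_failures = aggregate(docs)
restricted = aggregate(docs, exclude_parser_failures=True)
n_failures = sum(d.parser_failure for d in docs)  # the count stays visible
print(with_failures, restricted, n_failures)
```

Reporting both numbers side by side separates "the model translated badly" from "the output could not be parsed".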

v0.1.3 shipped: Dataset doubled to N=10 (5 new English-source real-world-template docs).

v0.1.4 shipped: rileykim/multilingual-document ML-curated expansion (10 → 25 docs, +8 en-ja, +7 en-zh).

v0.1.5 shipped: FLORES-200 industry-grade reference expansion. N=20 on six of the core 8 pairs (10 author-curated + 10 FLORES-200); N=27 on en-zh and N=28 on en-ja.

v0.1.5.1 shipped: NLLB-200-distilled-600M + identity-baseline re-scored on the full N=35 corpus with chrF + COMET-Kiwi.

v0.1.6 shipped: LangPair extended from 8 to 16 (added en-ru, en-ko, en-vi, en-id, en-ur, en-uz, en-kk, en-zh-tw); NLLB runner covers all 16; language-detection gate gets script rules for non-Latin extension pairs; CORE_LANG_PAIRS frozen at the v0.1 core 8.

v0.1.6.1 shipped: NLLB extension rerun + macro/micro aggregation consistency fix.
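
For readers unfamiliar with the macro/micro distinction, a minimal sketch follows. It assumes "macro" means averaging per-pair means and "micro" means averaging over all documents pooled together; the roadmap does not spell out the definitions, and the scores below are invented. The two diverge whenever pairs have different document counts, which is the case in LTB (for example N=20 versus N=28).

```python
from statistics import mean

# Invented per-document scores grouped by language pair (unequal group sizes).
scores_by_pair = {
    "en-es": [80.0, 60.0],
    "en-ja": [40.0, 40.0, 40.0, 40.0],
}


def macro_average(by_pair):
    """Mean of per-pair means: every pair counts equally."""
    return mean(mean(values) for values in by_pair.values())


def micro_average(by_pair):
    """Mean over all pooled documents: every document counts equally."""
    return mean(score for values in by_pair.values() for score in values)


macro = macro_average(scores_by_pair)
micro = micro_average(scores_by_pair)
print(macro, micro)
```

Mixing the two conventions between the scorer and the leaderboard generator would produce numbers that disagree without either being wrong, so one convention should be applied end to end.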

v0.1.6.2 shipped: Helsinki-NLP/opus-mt runner with 14 models, 16/16 coverage, and commercial-safe licensing (Apache-2.0 / CC-BY-4.0). It can ship in commercial settings where NLLB-200’s CC-BY-NC-4.0 cannot.

v0.1.6.3 shipped: Methodology note documenting opus-mt vs NLLB per-pair quality variance (competitive on European pairs, dramatically weaker on Asian / Central Asian).

v0.1.6.4 shipped: fugumt en-ja swap (replaced opus-mt-en-jap with staka/fugumt-en-ja); ingest-time and post-hoc script validation drops rileykim refs whose tgt_text script doesn’t match expected. After cleanup, en-ru has N=0; other extension pairs retain ≥1.
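
A script-validation gate of this kind can be sketched with the standard library alone. The version below is an assumption-laden illustration, not the code in scripts/validate_extension_refs.py: the `EXPECTED_SCRIPT` mapping, the helper names and the 0.5 threshold are all hypothetical. It classifies letters by the prefix of their Unicode character name and keeps a reference only if enough letters belong to the expected script.

```python
import unicodedata

# Hypothetical mapping from language pair to the Unicode-name prefix of its script.
EXPECTED_SCRIPT = {
    "en-ru": "CYRILLIC",
    "en-ko": "HANGUL",
    "en-zh-tw": "CJK",
}


def script_share(text: str, script: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with `script`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


def keep_ref(pair: str, tgt_text: str, threshold: float = 0.5) -> bool:
    """Keep the reference only if the expected script dominates (threshold is an assumption)."""
    return script_share(tgt_text, EXPECTED_SCRIPT[pair]) >= threshold


print(keep_ref("en-ru", "Привет, мир"))  # Cyrillic target text
print(keep_ref("en-ru", "你好,世界"))    # Chinese text filed under en-ru
```

A check like this catches the en-ru failure mode described above, where a row's target text is in a different script than its language-pair label implies.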

v0.1.7 shipped: FLORES-200 extension-pair refs (scripts/add_flores_extension_refs.py) back-fill the 10 FLORES-derived docs (doc_026–doc_035) with certified-translator refs for all 8 v0.1.6 extension pairs. Closes the en-ru gap (0 → 10 docs) and upgrades the 7 surviving extension pairs from ml-curated rileykim refs to a mix of certified-FLORES + ml-curated-rileykim (13 docs each). Now every LTB pair has at least 10 certified-translator-grade reference documents.

v0.1.7.1 shipped: All systems re-scored against v0.1.7 certified-translator FLORES refs. NLLB and opus-mt re-run for all 8 extension pairs (101 actual translations each, skip_no_ref guard, 16/16 coverage). COMET-Kiwi-22 re-scored for all four systems. Fixed skip_no_ref bug in run_nllb (parameter declared but not applied in loop body). Leaderboard B staleness warning removed.
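
The skip_no_ref bug class (a parameter that is declared but never consulted inside the loop) is easy to reintroduce, so a small sketch of the intended behaviour may help. The function below is hypothetical and is not the real run_nllb; the document layout (`id`, `src`, `refs` keys) and the `translate` callable are assumptions made for the example.

```python
def run_translations(docs, pair, translate, skip_no_ref=True):
    """Translate each document for `pair`, optionally skipping docs with no reference."""
    outputs = {}
    for doc in docs:
        # The guard must sit inside the loop body; declaring the
        # parameter in the signature alone has no effect.
        if skip_no_ref and pair not in doc["refs"]:
            continue
        outputs[doc["id"]] = translate(doc["src"], pair)
    return outputs


def fake_translate(text, pair):
    """Stand-in for a real MT call, used only to make the sketch runnable."""
    return f"[{pair}] {text}"


docs = [
    {"id": "doc_a", "src": "Hello", "refs": {"en-ru": "Привет"}},
    {"id": "doc_b", "src": "Goodbye", "refs": {}},  # no en-ru reference
]

guarded = run_translations(docs, "en-ru", fake_translate)
unguarded = run_translations(docs, "en-ru", fake_translate, skip_no_ref=False)
print(len(guarded), len(unguarded))
```

A regression test that counts translations with the guard on and off would likely have caught the original bug.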

Key v0.1.7.1 findings:

  • chrF (oracle-layout): NLLB 73.71, opus-mt 68.60 (both 16/16)
  • COMET-Kiwi (oracle-layout): NLLB 86.51, DeepL 84.89, opus-mt 79.35 (NLLB/opus-mt now 16/16)
  • en-uz structural zero confirmed: NLLB chrF 0.07 / opus-mt chrF 0.04; COMET 0.31 — Uzbek-Latin unsupported at 600M-class scale
  • chrF–COMET gap largest for non-Latin pairs: NLLB en-kk COMET 83.5 vs chrF 52.2 (+31 pts), en-th COMET 76.0 vs chrF 45.2 (+31 pts), en-ko COMET 77.8 vs chrF 36.9 (+41 pts) — validates dual-metric benchmark design
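
The gap figures in the last bullet are plain differences between the two quoted scores. The short sketch below recomputes them from the values given above; COMET-Kiwi and chrF are different metrics, so the gap is a descriptive contrast rather than a calibrated quantity.

```python
# (COMET-Kiwi, chrF) for NLLB, copied from the findings above.
nllb_scores = {
    "en-kk": (83.5, 52.2),
    "en-th": (76.0, 45.2),
    "en-ko": (77.8, 36.9),
}

# Gap in points, rounded to the nearest integer as in the bullet.
gaps = {pair: round(comet - chrf) for pair, (comet, chrf) in nllb_scores.items()}

for pair, gap in gaps.items():
    print(f"{pair}: +{gap} pts")
```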

v0.1.7.2 shipped: Region-matching bug fix. The scorer’s match_regions accepted an exact region_id string match before checking spatial overlap. Both end-to-end runners (Florence-NLLB, Qwen3-VL-2B-Instruct) emitted 0-indexed region ids (r0, r1, …) that collided with the ground-truth r1, r2, … namespace, so end-to-end regions were force-paired off-by-one to non-overlapping ground-truth regions, corrupting chrF, IoU, τ and LTB-100 for every end-to-end system. Oracle-layout and identity systems were unaffected: they copy ground-truth ids and boxes, so the exact-id match coincides with the correct spatial match.

The fix has two parts:

  • match_regions now honors an exact-id match only when the boxes also overlap (IoU ≥ 0.10), so a coincidental id collision falls through to greedy IoU matching.
  • The end-to-end runners emit non-colliding p{i} ids, and the committed end-to-end submissions were re-indexed.

Both end-to-end systems were re-scored. Florence-NLLB: chrF-board LTB-100 21.22 → 39.79, COMET-board 34.35 → 48.11. Qwen3-VL-2B-Instruct: chrF-board 16.35 → 14.82, COMET-board 22.79 → 18.43 (its earlier reading-order τ had been inflated by the bad matching). The leaderboard was regenerated, and the leaderboard generator’s stale v0.1.6 sample-size caveat was corrected to v0.1.7.
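
The overlap-gated matching described for v0.1.7.2 can be sketched as below. This is not the scorer’s actual match_regions: the function name, the `(x0, y0, x1, y1)` box format, and the reuse of the 0.10 threshold as the floor for the greedy pass are assumptions. The example reproduces the collision described above, with 0-indexed predicted ids against 1-indexed ground-truth ids.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_regions_sketch(pred, gold, min_iou=0.10):
    """Map predicted region ids to ground-truth ids (dicts of id -> box)."""
    matches, used_gold = {}, set()
    # Pass 1: honor an exact-id match only when the boxes actually overlap.
    for pid, pbox in pred.items():
        if pid in gold and iou(pbox, gold[pid]) >= min_iou:
            matches[pid] = pid
            used_gold.add(pid)
    # Pass 2: greedy IoU matching for everything left, best overlap first.
    candidates = sorted(
        (
            (iou(pbox, gbox), pid, gid)
            for pid, pbox in pred.items() if pid not in matches
            for gid, gbox in gold.items() if gid not in used_gold
        ),
        reverse=True,
    )
    for score, pid, gid in candidates:
        if score < min_iou:
            break  # remaining candidates overlap even less
        if pid in matches or gid in used_gold:
            continue
        matches[pid] = gid
        used_gold.add(gid)
    return matches


gold = {"r1": (0, 0, 10, 10), "r2": (0, 20, 10, 30)}
pred = {"r0": (0, 0, 10, 10), "r1": (0, 20, 10, 30)}  # same boxes, ids shifted by one
matched = match_regions_sketch(pred, gold)
print(matched)
```

With the id-first behaviour, predicted r1 would have been paired with ground-truth r1 despite zero overlap; with the overlap gate, that pairing is rejected and both regions are matched spatially.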

v0.2 (next):

  • LPIPS visual-fidelity metric (10% weight) + OCR round-trip metric (10% weight) — engineering only, no budget
  • Held-out split rotation (20% private, refreshed quarterly) — engineering only
  • DeepL coverage expansion to all 16 pairs (currently 6/16) — ~$5 API cost
  • Multi-reference scoring (2 references per doc on the core 8, certified-translator quality) — requires budget
  • DA / SQM human evaluation on 50 doc-pair outputs across 4+ systems — requires budget
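
Once human judgments exist, the correlation step attributed above to ltbench correlate-human (Kendall τ and Pearson r between human DA and LTB-100) amounts to the computation sketched here. This is a dependency-free illustration with hypothetical helper names, using τ-a without tie handling; the scores are invented and do not come from any rater.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs, ties ignored."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented per-system scores: mean human DA versus automatic LTB-100.
human_da = [70.0, 55.0, 80.0, 40.0]
ltb_100 = [60.0, 50.0, 75.0, 45.0]

tau = kendall_tau_a(human_da, ltb_100)
r = pearson_r(human_da, ltb_100)
print(tau, round(r, 3))
```

With only four or five systems, τ takes very few distinct values, so the per-document judgments matter more than the system-level correlation for judging metric validity.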

Why this is published openly

A benchmark whose methodology weaknesses are not publicly tracked is harder to trust. By writing this file we accept that:

  • Readers will use this list to argue against specific findings. That’s correct — they should.
  • Reviewers will use it as a checklist when scoring future submissions.
  • Contributors can pick an open item and propose a PR addressing it.

If you have a proposed fix for any open critique, open an issue tagged methodology with a concrete proposal before opening a PR — methodology changes need discussion before code.