LayoutTranslateBench — Methodology Roadmap
This document tracks methodology critiques against LTB and their status. The benchmark is a research artifact; methodology is expected to improve with each release. This file is the public, honest record of what’s been fixed and what remains.
Status as of v0.1.7.2
| # | Critique | Status | Notes |
|---|---|---|---|
| 1 | Sample size N=5 statistically meaningless | ✅ Largely fixed (v0.1.7) | Core 8: N=20 (en-es/de/ar/fr/th/ms), N=27 (en-zh), N=28 (en-ja). Extension 8 (v0.1.7): N=10 for en-ru (certified-only); N=13 each for en-ko/vi/id/ur/uz/kk/zh-tw (10 certified-FLORES + 3 ml-curated rileykim). Industry-grade FLORES-200 refs back every pair. |
| 2 | Identity baseline = chrF Latin-bleed artifact | ✅ Fixed (v0.1.1) | Language-detection gate (methodology.md) |
| 3 | chrF wrong text metric (paraphrase / adequacy blind) | ✅ Fixed (v0.1.2 round-2) | COMET-Kiwi-22 ships as a second leaderboard, run on all systems. See docs/comet-setup.md. |
| 4 | Oracle vs end-to-end conflation in headline scores | ✅ Fixed (v0.1.1) | Separate leaderboard tables |
| 5 | DeepL doc-context unfairness | ✅ Clarified (v0.1.1) | Both DeepL Text API and NLLB are per-region; this is a fair comparison. DeepL Documents (with context) is a v0.2 runner. |
| 6 | Composite weights 50/30/20 unvalidated | ✅ Empirically defended (v0.1.1) | Weight ablation script shows ranking stable (τ_norm = 1.0) across (50/30/20), (40/40/20), (60/20/20), (33/33/33). See results/weight_ablation.json. |
| 7 | Kendall τ partial-coverage hole | ✅ Fixed (v0.1.1) | Coverage-aware τ (methodology.md) |
| 8 | Single author-curated reference | ⚠️ Partially addressed (v0.1.5 + v0.1.7) | FLORES-200 adds industry-grade refs alongside the author-curated set; v0.1.7 extends FLORES coverage to all 16 LTB pairs. Still single-ref per region — multi-ref scoring deferred to v0.2. |
| 9 | Qwen-VL parser/model conflation | ✅ Fixed (v0.1.2) | --exclude-parser-failures flag on ltbench score. With-failures = 14.82; restricted = 16.94 (v0.1.7.2 scoring). |
| 10 | No human evaluation correlation | ⚠️ Infrastructure shipped (v0.1.2) | ltbench.human_eval module + ltbench export-eval-prompts + ltbench correlate-human CLI commands. Collecting judgments still requires paid raters (v0.2). |
| 11 | Reference corruption on mined refs (rileykim) | ✅ Fixed (v0.1.6.4) | Script-validation gate at ingest + post-hoc scripts/validate_extension_refs.py. Refs whose tgt_text script doesn’t match the expected target-language script are dropped. The gate revealed that the en-ru rileykim rows carried Chinese target text, so en-ru dropped to N=0 (restored to N=10 in v0.1.7 with FLORES refs). |
| 12 | Open-source MT licensing (NLLB CC-BY-NC) | ✅ Fixed (v0.1.6.2) | Helsinki-NLP/opus-mt runner ships commercial-safe MT at 16/16 coverage (Apache-2.0 / CC-BY-4.0). |
| 13 | Per-pair quality variance for open MT | ✅ Documented (v0.1.6.3) | opus-mt is competitive with NLLB on European pairs, dramatically weaker on Asian / Central Asian. Per-pair gaps recorded in methodology.md. |
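Nine critiques are fixed (#1, #2, #3, #4, #6, #7, #9, #11, #12; #1 is marked largely fixed), one is clarified (#5), two are partially addressed (#8, where multi-reference scoring is pending, and #10, where the infrastructure has shipped but no judgments have been collected), and one is documented (#13). No critique remains fully open at v0.1.7.2: the v0.2 work (#8 multi-ref, #10 human eval, plus visual-fidelity and OCR round-trip metrics) is scoped future work that needs budget or engineering time, not methodology debt.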
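The weight-ablation check behind critique #6 can be illustrated with a small sketch. It assumes the composite is a weighted mean of three component scores (text, layout, reading order) and uses plain Kendall τ on the resulting rankings; the system names and numbers are hypothetical, and the τ_norm reported in results/weight_ablation.json may be defined differently.

```python
from itertools import combinations

# Hypothetical per-system component scores on a 0-100 scale: (text, layout, order).
# Illustrative values only; these are not LTB results.
SCORES = {
    "sys_a": (70.0, 90.0, 95.0),
    "sys_b": (65.0, 60.0, 70.0),
    "sys_c": (40.0, 55.0, 50.0),
}

# The four weightings named in the status table.
WEIGHTINGS = [(50, 30, 20), (40, 40, 20), (60, 20, 20), (33, 33, 33)]


def composite(components, weights):
    """Weighted mean of component scores; weights are normalised by their sum."""
    return sum(c * w for c, w in zip(components, weights)) / sum(weights)


def ranking(weights):
    """System names ordered from best to worst composite score."""
    return sorted(SCORES, key=lambda s: composite(SCORES[s], weights), reverse=True)


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same items (no ties)."""
    pos_a = {name: i for i, name in enumerate(rank_a)}
    pos_b = {name: i for i, name in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


baseline = ranking(WEIGHTINGS[0])
taus = {w: kendall_tau(baseline, ranking(w)) for w in WEIGHTINGS}

for weights, tau in taus.items():
    print(weights, tau)
```

A τ of 1.0 for every weighting means the system order never changes. That is the property the ablation is meant to establish; it says nothing about whether the weights themselves are well chosen.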
What blocks the remaining items
| Item | What’s needed | Cost | Effort |
|---|---|---|---|
| #8 Multi-reference + certified translators | 2–3 certified translators producing 1 reference each for 59 docs × 16 pairs | ~€20k–50k | Calendar weeks |
| #10 Human evaluation (data) | 5 raters × ~50 DA judgments × 4–5 systems = ~1000 judgments | ~€2k–5k via Mechanical Turk / Prolific, or €5k–10k via certified translators | Calendar weeks |
| Visual fidelity (v0.2) | LPIPS implementation on non-text regions of rendered output | $0 | ~1 week engineering |
| OCR round-trip (v0.2) | Tesseract / PaddleOCR on rendered output, compare to predicted text | $0 | ~1 week engineering |
| DeepL coverage expansion | Run DeepL on remaining 10 pairs; ~$5 API cost | ~$5 | 1 hour |
Sequencing
v0.1.2 shipped:
- ✅ Parser-failure-excluded scoring: --exclude-parser-failures flag on ltbench score. Restricts aggregation to documents whose submission did NOT trigger the runner’s parser fallback (1-region empty placeholder). Per-doc scores still include them with parser_failure=True so the count is visible. Demonstrated on Qwen-VL: with-failures = 14.82, without = 16.94 (v0.1.7.2 scoring).
- ✅ Human-evaluation infrastructure: ltbench.human_eval module + ltbench export-eval-prompts (export CSV for raters) + ltbench correlate-human (compute Kendall τ / Pearson r between human DA and automatic LTB-100) + judgment JSONL schema. v0.1.2 ships the scaffolding; collecting actual judgments still requires paid raters (v0.2).
- ✅ COMET-Kiwi-22 integration: initially deferred in v0.1.2 round 1 due to a numpy version conflict on Windows; resolved in round 2 with the .comet-env workflow. All 4 systems scored on both chrF and COMET-Kiwi; both leaderboards published.
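The parser-failure-excluded aggregation described in the first item above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ltbench implementation: the aggregate function, the doc_id and score fields, and the example numbers are hypothetical. Only the parser_failure flag name comes from the text.

```python
from statistics import mean

# Hypothetical per-document records; the scores are illustrative, not LTB results.
PER_DOC = [
    {"doc_id": "doc_a", "score": 20.0, "parser_failure": False},
    {"doc_id": "doc_b", "score": 0.0, "parser_failure": True},   # parser fallback fired
    {"doc_id": "doc_c", "score": 10.0, "parser_failure": False},
]


def aggregate(per_doc, exclude_parser_failures=False):
    """Mean score over documents, optionally skipping parser-failure documents.

    The failure count is always reported so the exclusion stays visible.
    """
    kept = [
        d for d in per_doc
        if not (exclude_parser_failures and d["parser_failure"])
    ]
    return {
        "score": mean([d["score"] for d in kept]) if kept else 0.0,
        "n_docs": len(kept),
        "n_parser_failures": sum(1 for d in per_doc if d["parser_failure"]),
    }


with_failures = aggregate(PER_DOC)
restricted = aggregate(PER_DOC, exclude_parser_failures=True)
print(with_failures)
print(restricted)
```

Reporting both numbers side by side separates what the model can translate from how often its output can be parsed at all, which is the conflation critique #9 raised.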
v0.1.3 shipped: Dataset doubled to N=10 (5 new English-source real-world-template docs).
v0.1.4 shipped: rileykim/multilingual-document ML-curated expansion (10 → 25 docs, +8 en-ja, +7 en-zh).
v0.1.5 shipped: FLORES-200 industry-grade reference expansion — N=20 on core 8 (10 author-curated + 10 FLORES-200), N=27/28 on en-zh/en-ja.
v0.1.5.1 shipped: NLLB-200-distilled-600M + identity-baseline re-scored on the full N=35 corpus with chrF + COMET-Kiwi.
v0.1.6 shipped: LangPair extended from 8 to 16 (added en-ru, en-ko, en-vi, en-id, en-ur, en-uz, en-kk, en-zh-tw); NLLB runner covers all 16; language-detection gate gets script rules for non-Latin extension pairs; CORE_LANG_PAIRS frozen at the v0.1 core 8.
v0.1.6.1 shipped: NLLB extension rerun + macro/micro aggregation consistency fix.
v0.1.6.2 shipped: Helsinki-NLP/opus-mt runner with 14 models, 16/16 coverage, and commercial-safe licensing (Apache-2.0 / CC-BY-4.0). It can ship in commercial settings where NLLB-200’s CC-BY-NC-4.0 license cannot.
v0.1.6.3 shipped: Methodology note documenting opus-mt vs NLLB per-pair quality variance (competitive on European pairs, dramatically weaker on Asian / Central Asian).
v0.1.6.4 shipped: fugumt en-ja swap (replaced opus-mt-en-jap with staka/fugumt-en-ja); ingest-time and post-hoc script validation drops rileykim refs whose tgt_text script doesn’t match the expected target-language script. After cleanup, en-ru has N=0; the other extension pairs each retain at least one ref.
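A script-validation gate of the kind described for v0.1.6.4 can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the contents of scripts/validate_extension_refs.py: the EXPECTED_SCRIPT table, the tgt_lang field, and the function names are hypothetical, and the script of a character is inferred from the first word of its Unicode name.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from target language to the expected Unicode-name prefix.
EXPECTED_SCRIPT = {
    "ru": "CYRILLIC",
    "kk": "CYRILLIC",
    "ko": "HANGUL",
    "ur": "ARABIC",
    "zh-tw": "CJK",
    "vi": "LATIN",
    "id": "LATIN",
    "uz": "LATIN",
}


def dominant_script(text):
    """Return the most common script prefix among alphabetic characters, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def script_matches(tgt_lang, tgt_text):
    """True when the dominant script of tgt_text is the one expected for tgt_lang."""
    return dominant_script(tgt_text) == EXPECTED_SCRIPT[tgt_lang]


def filter_refs(rows):
    """Keep only rows whose tgt_text passes the script check."""
    return [r for r in rows if script_matches(r["tgt_lang"], r["tgt_text"])]


rows = [
    {"tgt_lang": "ru", "tgt_text": "Привет, мир"},
    {"tgt_lang": "ru", "tgt_text": "你好，世界"},  # Chinese text labelled as Russian
    {"tgt_lang": "ko", "tgt_text": "안녕하세요"},
]
kept = filter_refs(rows)
print(len(kept), "of", len(rows), "refs kept")
```

A dominant-script check is deliberately coarse. It catches wrong-language refs like the en-ru case, but it cannot tell apart two languages that share a script, so it complements a language-detection gate and does not replace one.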
v0.1.7 shipped: FLORES-200 extension-pair refs (scripts/add_flores_extension_refs.py) back-fill the 10 FLORES-derived docs (doc_026–doc_035) with certified-translator refs for all 8 v0.1.6 extension pairs. Closes the en-ru gap (0 → 10 docs) and upgrades the 7 surviving extension pairs from ml-curated rileykim refs to a mix of certified-FLORES + ml-curated-rileykim (13 docs each). Now every LTB pair has at least 10 certified-translator-grade reference documents.
v0.1.7.1 shipped: All systems re-scored against v0.1.7 certified-translator FLORES refs. NLLB and opus-mt re-run for all 8 extension pairs (101 actual translations each, skip_no_ref guard, 16/16 coverage). COMET-Kiwi-22 re-scored for all four systems. Fixed skip_no_ref bug in run_nllb (parameter declared but not applied in loop body). Leaderboard B staleness warning removed.
Key v0.1.7.1 findings:
- chrF (oracle-layout): NLLB 73.71, opus-mt 68.60 (both 16/16)
- COMET-Kiwi (oracle-layout): NLLB 86.51, DeepL 84.89, opus-mt 79.35 (NLLB/opus-mt now 16/16)
- en-uz structural zero confirmed: NLLB chrF 0.07 / opus-mt chrF 0.04; COMET 0.31 — Uzbek-Latin unsupported at 600M-class scale
- chrF–COMET gap largest for non-Latin pairs: NLLB en-kk COMET 83.5 vs chrF 52.2 (+31 pts), en-th COMET 76.0 vs chrF 45.2 (+31 pts), en-ko COMET 77.8 vs chrF 36.9 (+41 pts) — validates dual-metric benchmark design
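The gaps quoted in the last finding can be recomputed directly from the figures in that bullet. The snippet below is only an arithmetic check; the dictionary and function names are illustrative.

```python
# (COMET-Kiwi, chrF) for NLLB, copied from the v0.1.7.1 findings above.
NLLB_SCORES = {
    "en-kk": (83.5, 52.2),
    "en-th": (76.0, 45.2),
    "en-ko": (77.8, 36.9),
}


def metric_gap(comet, chrf):
    """COMET-Kiwi minus chrF, rounded to one decimal place."""
    return round(comet - chrf, 1)


gaps = {pair: metric_gap(*scores) for pair, scores in NLLB_SCORES.items()}

# Print pairs from largest to smallest gap.
for pair, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pair}: COMET-Kiwi exceeds chrF by {gap} points")
```

The two metrics are not on a shared scale, so the gap is best read as a signal that they disagree on a pair, not as a quantity with a unit.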
v0.1.7.2 shipped: Region-matching bug fix. The scorer’s match_regions accepted an exact region_id string match before checking spatial overlap. Both end-to-end runners (Florence-NLLB, Qwen3-VL-2B-Instruct) emitted 0-indexed region ids (r0, r1, …) that collided with the ground-truth r1, r2, … namespace, so end-to-end regions were force-paired off-by-one to non-overlapping ground-truth regions — corrupting chrF, IoU, τ and LTB-100 for every end-to-end system. Oracle-layout and identity systems were unaffected: they copy ground-truth ids and boxes, so the exact-id match coincides with the correct spatial match. Fix: (a) match_regions now honors an exact-id match only when the boxes also overlap (IoU ≥ 0.10), so a coincidental id collision falls through to greedy IoU matching; (b) the end-to-end runners emit non-colliding p{i} ids and the committed end-to-end submissions were re-indexed. Both end-to-end systems re-scored: Florence-NLLB chrF-board LTB-100 21.22 → 39.79, COMET-board 34.35 → 48.11; Qwen3-VL-2B-Instruct 16.35 → 14.82 and 22.79 → 18.43 (its earlier reading-order τ had been inflated by the bad matching). Leaderboard regenerated; the leaderboard generator’s stale v0.1.6 sample-size caveat was corrected to v0.1.7.
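The matching rule after the fix can be sketched as below. This is a simplified reconstruction from the description above, not the scorer's actual match_regions: the dict-of-boxes data layout, the return type, and the reuse of the 0.10 threshold in the greedy pass are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_regions(pred, gt, min_iou=0.10):
    """Map predicted region ids to ground-truth ids.

    pred and gt are dicts of region_id -> box. An exact id match is honoured
    only when the two boxes also overlap; everything else is matched greedily
    by descending IoU.
    """
    matches = {}
    free_gt = set(gt)

    # Pass 1: exact id match, gated on spatial overlap.
    for pid, pbox in pred.items():
        if pid in free_gt and iou(pbox, gt[pid]) >= min_iou:
            matches[pid] = pid
            free_gt.discard(pid)

    # Pass 2: greedy IoU matching for whatever is left.
    candidates = sorted(
        (
            (iou(pbox, gt[gid]), pid, gid)
            for pid, pbox in pred.items()
            if pid not in matches
            for gid in free_gt
        ),
        reverse=True,
    )
    for score, pid, gid in candidates:
        if score < min_iou:
            break
        if pid in matches or gid not in free_gt:
            continue
        matches[pid] = gid
        free_gt.discard(gid)
    return matches


# Ground truth uses 1-indexed ids; the prediction uses 0-indexed ids for the
# same two boxes, which reproduces the id collision on "r1".
gt = {"r1": (0, 0, 10, 10), "r2": (0, 20, 10, 30)}
pred = {"r0": (0, 0, 10, 10), "r1": (0, 20, 10, 30)}
print(match_regions(pred, gt))
```

In this example the predicted r1 shares an id with ground-truth r1 but does not overlap it, so the id match is rejected and the greedy pass pairs each predicted box with the ground-truth box it actually covers. When a submission copies ground-truth ids and boxes, as the oracle-layout systems do, pass 1 matches everything and the result is the identity mapping.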
v0.2 (next):
- LPIPS visual-fidelity metric (10% weight) + OCR round-trip metric (10% weight) — engineering only, no budget
- Held-out split rotation (20% private, refreshed quarterly) — engineering only
- DeepL coverage expansion to all 16 pairs (currently 6/16) — ~$5 API cost
- Multi-reference scoring (2 references per doc on the core 8, certified-translator quality) — requires budget
- DA / SQM human evaluation on 50 doc-pair outputs across 4+ systems — requires budget
Why this is published openly
A benchmark whose methodology weaknesses are not publicly tracked is harder to trust. By writing this file we accept that:
- Readers will use this list to argue against specific findings. That’s correct — they should.
- Reviewers will use it as a checklist when scoring future submissions.
- Contributors can pick an open item and propose a PR addressing it.
If you have a proposed fix for any open critique, open an issue tagged methodology with a concrete proposal before opening a PR — methodology changes need discussion before code.