LayoutTranslateBench v0.1 — Findings
This document records what we measured in v0.1 of the LayoutTranslateBench dataset. It is intentionally written as research — observations and numbers, not recommendations. Methodology lives in BENCHMARK.md and docs/methodology.md; raw scores live under results/.
TL;DR
On the v0.1 sample dataset (5 documents × 8 language pairs), four systems were scored:
| Rank | System | LTB-100 | chrF | Layout IoU | Reading-order τ | Coverage |
|---|---|---|---|---|---|---|
| 1 | deepl-text-oracle | 84.52 | 69.04 | 1.000 | 1.000 | 6/8 |
| 2 | nllb-text-oracle | 77.58 | 55.17 | 1.000 | 1.000 | 8/8 |
| 3 | identity-baseline | 63.71 | 27.41 | 1.000 | 1.000 | 8/8 |
| 4 | qwen3-vl-2b-instruct | 20.83 | 4.28 | 0.040 | 0.875 | 8/8 |
All four runs cost $0 (DeepL free tier, the rest CPU-local). Total wall-clock to reproduce all four rows: approximately 3 hours.
How LTB-100 decomposes
The composite score combines three signals, weighted 50/30/20:
LTB-100 = 100 × ( 0.50 × chrF/100 + 0.30 × Layout-IoU + 0.20 × Reading-order-τ )
| Metric | Range | What it captures |
|---|---|---|
| chrF (text quality) | 0–100 | Character-level F-score against reference translations |
| Layout IoU | 0–1 | Mean IoU of predicted text-region bounding boxes vs ground truth |
| Reading-order Kendall τ | 0–1 | Normalised order correlation between source and predicted regions |
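The composite weights text quality at 50%; the remaining 50% is layout (30% bounding-box IoU, 20% reading order), so a system that translates well but places text poorly does not get full credit. Conversely, any oracle-layout run starts from 50 points before translation is scored: the identity baseline's 63.71 is 50 + 0.5 × 27.41.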
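As a check on the formula, the following minimal sketch recomputes LTB-100 from the aggregate columns of the TL;DR table. The function name `ltb100` and the `rows` dictionary are illustrative and are not taken from the repository; because the inputs are the rounded table values, results can differ from the published column by about 0.01.

```python
def ltb100(chrf: float, layout_iou: float, reading_order_tau: float) -> float:
    """Composite score: chrF is on a 0-100 scale, the other two on 0-1."""
    return 100.0 * (0.50 * chrf / 100.0 + 0.30 * layout_iou + 0.20 * reading_order_tau)


# (chrF, Layout IoU, Reading-order tau) copied from the TL;DR table.
rows = {
    "deepl-text-oracle": (69.04, 1.000, 1.000),
    "nllb-text-oracle": (55.17, 1.000, 1.000),
    "identity-baseline": (27.41, 1.000, 1.000),
    "qwen3-vl-2b-instruct": (4.28, 0.040, 0.875),
}

scores = {name: ltb100(*metrics) for name, metrics in rows.items()}

for name, score in scores.items():
    print(f"{name:24s} {score:6.2f}")
```

With oracle layout the last two terms are fixed at 30 and 20 points, so the three text-oracle rows differ only through chrF.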
Observation 1 — open-weight MT vs commercial MT, oracle layout
When both systems are given oracle bounding boxes (predicted bboxes = ground truth) and asked only to translate, the gap between commercial DeepL and open-source NLLB-200-distilled-600M is smaller than is commonly assumed. Scores below are per-pair LTB-100; Δ is NLLB-200 minus DeepL:
| Pair | DeepL | NLLB-200 | Δ |
|---|---|---|---|
| en-es | 91.23 | 87.15 | −4.1 |
| en-de | 88.46 | 82.32 | −6.1 |
| en-zh | 71.16 | 72.01 | +0.9 |
| en-ar | 87.11 | 77.97 | −9.1 |
| en-ja | 79.59 | 66.89 | −12.7 |
| en-fr | 89.59 | 81.26 | −8.3 |
| en-th | — | 74.22 | DeepL n=0 |
| en-ms | — | 78.85 | DeepL n=0 |
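On en-zh, NLLB-200 scores 0.9 points higher than DeepL on this sample. On the other five shared pairs, DeepL leads by about 8.1 LTB-100 points on average (about 6.6 across all six shared pairs). The largest DeepL lead (en-ja, 12.7 points) and the smallest (en-zh, −0.9) bracket the spread on this small sample. Because both runs use oracle layout, the layout and reading-order terms are identical, so each LTB-100 point of difference corresponds to 2 chrF points.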
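The averages can be recomputed from the table with a few lines of Python. This is only a sketch over the six shared pairs as printed above; the variable names are illustrative.

```python
from statistics import mean

# Per-pair LTB-100 from the table above: (DeepL, NLLB-200).
shared = {
    "en-es": (91.23, 87.15),
    "en-de": (88.46, 82.32),
    "en-zh": (71.16, 72.01),
    "en-ar": (87.11, 77.97),
    "en-ja": (79.59, 66.89),
    "en-fr": (89.59, 81.26),
}

# Delta follows the table's sign convention: NLLB-200 minus DeepL.
delta = {pair: nllb - deepl for pair, (deepl, nllb) in shared.items()}

mean_all = mean(delta.values())
mean_excl_zh = mean(d for pair, d in delta.items() if pair != "en-zh")

# With oracle layout, LTB-100 = 50 + 0.5 * chrF, so chrF gaps are twice as large.
chrf_delta = {pair: 2.0 * d for pair, d in delta.items()}

print(f"mean delta, all six pairs:   {mean_all:+.2f}")
print(f"mean delta, excluding en-zh: {mean_excl_zh:+.2f}")
```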
Observation 2 — DeepL coverage on en-th and en-ms
DeepL’s Text API supports the following target languages as of v0.1 of this benchmark: AR, BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HU, ID, IT, JA, KO, LT, LV, NB, NL, PL, PT, RO, RU, SK, SL, SV, TR, UK, ZH (source: DeepL API documentation, retrieved 2026-05-18). The pairs en-th (Thai) and en-ms (Bahasa Melayu) are not in this list. Both pairs are supported by NLLB-200 and were scored on the LTB v0.1 sample.
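The 6/8 coverage figure follows directly from that list. The sketch below hard-codes the language codes quoted above rather than querying DeepL, and the helper `target_code` is a hypothetical name used only for illustration.

```python
# Target-language codes as quoted above (not fetched from the DeepL API).
DEEPL_TARGETS = {
    "AR", "BG", "CS", "DA", "DE", "EL", "EN", "ES", "ET", "FI",
    "FR", "HU", "ID", "IT", "JA", "KO", "LT", "LV", "NB", "NL",
    "PL", "PT", "RO", "RU", "SK", "SL", "SV", "TR", "UK", "ZH",
}

BENCH_PAIRS = ["en-es", "en-de", "en-zh", "en-ar", "en-ja", "en-fr", "en-th", "en-ms"]


def target_code(pair: str) -> str:
    """Return the upper-cased target language of a 'src-tgt' pair."""
    return pair.split("-")[1].upper()


covered = [p for p in BENCH_PAIRS if target_code(p) in DEEPL_TARGETS]
missing = [p for p in BENCH_PAIRS if target_code(p) not in DEEPL_TARGETS]

print(f"coverage: {len(covered)}/{len(BENCH_PAIRS)}; missing: {missing}")
```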
Observation 3 — zero-shot VLM bbox grounding on documents
A popular open-weight 2B vision-language model (Qwen3-VL-2B-Instruct) was scored end-to-end (no oracle layout): the model was given the source image and prompted to output both translated text and bounding boxes per region.
| Pair | LTB-100 | chrF | Layout IoU | Reading-order τ |
|---|---|---|---|---|
| en-es | 20.13 | 5.60 | 0.044 | 0.800 |
| en-de | 19.55 | 4.66 | 0.041 | 0.800 |
| en-zh | 17.85 | 1.62 | 0.035 | 0.800 |
| en-ar | 19.10 | 4.13 | 0.034 | 0.800 |
| en-ja | 22.16 | 1.63 | 0.045 | 1.000 |
| en-fr | 20.02 | 5.88 | 0.036 | 0.800 |
| en-th | 23.16 | 3.97 | 0.039 | 1.000 |
| en-ms | 24.65 | 6.78 | 0.042 | 1.000 |
Overall LTB-100 = 20.83. Layout IoU averages 0.040 across pairs (vs 1.000 for oracle-layout runners), which dominates the composite. Reading order is mostly preserved (mean τ = 0.875). The translation per region is competent on average; the bounding boxes are not.
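To see how each term feeds the composite, the sketch below splits the aggregate Qwen3-VL-2B-Instruct row into its three weighted contributions and compares them with the nllb-text-oracle row. Using NLLB as the point of comparison is a choice made here for illustration, and the function name `contributions` is hypothetical.

```python
def contributions(chrf: float, layout_iou: float, tau: float) -> dict:
    """Points (out of 100) contributed by each term of LTB-100."""
    return {
        "chrF": 100.0 * 0.50 * chrf / 100.0,       # at most 50 points
        "layout_iou": 100.0 * 0.30 * layout_iou,   # at most 30 points
        "reading_order_tau": 100.0 * 0.20 * tau,   # at most 20 points
    }


# Aggregate rows from the TL;DR table.
qwen = contributions(4.28, 0.040, 0.875)
nllb = contributions(55.17, 1.000, 1.000)

# Points lost per term relative to the oracle-layout NLLB run.
lost = {term: nllb[term] - qwen[term] for term in qwen}

for term in qwen:
    print(f"{term:18s} qwen={qwen[term]:6.2f}  nllb={nllb[term]:6.2f}  lost={lost[term]:6.2f}")
```

On these rounded inputs the layout term earns about 1.2 of its 30 available points, a loss of roughly 28.8 points; the chrF term accounts for roughly 25.4 points and reading order for 2.5.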
Why vision-language models struggle with document bbox grounding
This is an architectural observation, not a quality judgement of the model. Four contributing factors:
- Image tokens are coarse. A vision encoder reduces a typical full-page document (≈ 880k pixels) to 256–1024 patch tokens, so each token covers roughly 860–3,440 source pixels. Pixel-precise bounding boxes are not recoverable at this granularity through a text decoder.
- Coordinates are predicted as text, not regressed. The model emits coordinate strings character-by-character via autoregressive decoding. There is no spatial inductive bias — no convolutional detection head, no anchor boxes, no IoU loss.
- Document layouts are out-of-distribution. Public VLM pretraining is dominated by natural images, captions, VQA, and free-form OCR. The “identify every text region with pixel-precise bbox + translation” task is rare in pretraining data.
- Output coordinate conventions vary. Qwen-VL family models sometimes emit coordinates normalised to 1000-unit space, sometimes to the model’s internal vision resolution. Without explicit rescaling, the output can be off by a constant factor.
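Two of these factors are simple arithmetic and can be sketched directly. The example assumes an 800 × 1100 pixel page (880k pixels, chosen here to match the figure above) and assumes boxes arrive as (x1, y1, x2, y2) in a 0-1000 normalised space; both function names are hypothetical and not part of the benchmark code.

```python
from math import sqrt


def pixels_per_token(page_pixels: int, n_tokens: int) -> float:
    """Average number of source pixels summarised by one patch token."""
    return page_pixels / n_tokens


def rescale_from_1000(box, width: int, height: int):
    """Map an (x1, y1, x2, y2) box from 0-1000 normalised units to pixels."""
    x1, y1, x2, y2 = box
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)


WIDTH, HEIGHT = 800, 1100  # assumed page size, 880,000 pixels

for n_tokens in (1024, 256):
    area = pixels_per_token(WIDTH * HEIGHT, n_tokens)
    # Side length of an equivalent square patch, in pixels.
    print(f"{n_tokens:5d} tokens -> {area:7.1f} px/token (~{sqrt(area):.0f} px per side)")

print(rescale_from_1000((100, 200, 500, 600), WIDTH, HEIGHT))
```

If a box in 1000-unit space is scored as if it were already in pixels, every coordinate is off by the factor `width / 1000` or `height / 1000`, which is enough to push IoU close to zero even when the region was located correctly.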
The runner in ltbench/runners/qwen_vl.py implements bbox normalisation that accepts both (x, y, w, h) and (x1, y1, x2, y2) formats; the residual IoU error after that conversion is what is reported.
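For readers unfamiliar with the two box formats, here is a minimal sketch of the conversion and of the IoU computed afterwards. It is not the code in ltbench/runners/qwen_vl.py: the names `to_xyxy` and `iou` are hypothetical, and the sketch takes the format as an explicit argument rather than guessing it from the values.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_xyxy(box: Box, fmt: str) -> Box:
    """Convert a box to (x1, y1, x2, y2). fmt is 'xywh' or 'xyxy'."""
    a, b, c, d = box
    if fmt == "xywh":
        return (a, b, a + c, b + d)
    if fmt == "xyxy":
        # Tolerate swapped corners by sorting each axis.
        return (min(a, c), min(b, d), max(a, c), max(b, d))
    raise ValueError(f"unknown box format: {fmt!r}")


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


predicted = to_xyxy((10, 20, 30, 40), "xywh")   # x, y, width, height
reference = to_xyxy((10, 20, 40, 60), "xyxy")
print(predicted, iou(predicted, reference))
```

Reading an (x, y, w, h) box as (x1, y1, x2, y2) silently shrinks or inverts it, so a mismatch in format shows up as low IoU rather than as an error.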
Limitations of v0.1
- Sample size. 5 documents × 8 pairs = 40 doc-pair combinations per system (30 for DeepL, which covers 6 of the 8 pairs). Numbers are indicative; significance testing requires more documents.
- Author-curated references. v0.1 references were produced by the project authors with reasonable care, not by certified human translators. v0.2 will replace them.
- Single VLM size class. Only Qwen3-VL-2B was scored. Larger Qwen3-VL variants (4B, 8B) and grounding-tuned models (e.g. Florence-2) may score very differently and are deferred to v0.2.
- Visual fidelity not yet scored. LPIPS / SSIM on non-text regions and OCR round-trip are documented as v0.2 metrics in BENCHMARK.md.
- One reference per pair. Multiple-reference scoring is a future option for noisier targets.
One-paragraph summary
On LayoutTranslateBench v0.1 — a public, reproducible benchmark for document translation that scores layout fidelity and reading order alongside translation quality — four systems were scored on a 5-document × 8-language-pair sample. Open-source NLLB-200-distilled-600M scored 77.58 LTB-100 across all 8 pairs; commercial DeepL scored 84.52 across the 6 pairs it supports (en-th and en-ms are not in DeepL’s supported set). An identity baseline (source text returned unchanged) scored 63.71. A 2B zero-shot vision-language model (Qwen3-VL-2B-Instruct) scored 20.83 end-to-end, with the composite drop attributable primarily to bounding-box predictions (mean IoU 0.040 vs the oracle 1.000). All scores reproduce at zero cost from the repository.
For methodology, reproduction steps, or submitting a new system, see BENCHMARK.md, docs/methodology.md, and docs/submission.md.