LayoutTranslateBench Leaderboard

The first public benchmark for document translation that scores layout fidelity and reading order alongside translation quality. The composite score, LTB-100, combines chrF (text quality), layout IoU (visual fidelity), and reading-order Kendall tau.
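
To make the two layout components concrete, here is a minimal sketch of box IoU and Kendall tau over block orderings. It is an illustration under assumptions, not the benchmark's scoring code: the function names, the `(x0, y0, x1, y1)` box format, and the tie-free tau-a variant are all hypothetical choices. How the components are weighted into LTB-100 is not stated here, so the sketch stops at the components themselves.

```python
from itertools import combinations


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1).

    The box format is an assumption made for this sketch.
    """
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def kendall_tau(pred_order, ref_order):
    """Kendall tau-a between two orderings of the same block ids (no ties).

    +1.0 means identical reading order, -1.0 means fully reversed.
    """
    pos = {block: i for i, block in enumerate(pred_order)}
    concordant = discordant = 0
    # Each pair (x, y) is taken with x before y in the reference order.
    for x, y in combinations(ref_order, 2):
        if pos[x] < pos[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0
```

For example, `kendall_tau(["a", "c", "b"], ["a", "b", "c"])` has two concordant pairs and one discordant pair, giving 1/3, and two 2×2 boxes offset by one unit in each direction overlap with an IoU of 1/7.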

Generated 2026-05-22T05:20:16+00:00 from LayoutTranslateBench v0.1.7.2.
Sample size (v0.1.5). Per-pair counts: N=20 for en-es/en-de/en-ar/en-fr/en-th/en-ms (10 author-curated + 10 FLORES-200), N=28 for en-ja (+8 rileykim), N=27 for en-zh (+7 rileykim). Each LTB-100 cell shows the point estimate followed by its [95% CI], computed with a 1000-resample percentile bootstrap. Because CI width shrinks roughly as 1/√N, CIs at N=20 are about √2× tighter than at v0.1.3's N=10. v0.2 will scale to N≥25 author-curated per pair.
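
A percentile bootstrap of the kind described above can be sketched as follows. This is a generic illustration, not the benchmark's implementation: the function name, the use of the mean as the statistic, and the fixed seed are assumptions.

```python
import random


def percentile_bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point, lo, hi): the mean and a percentile-bootstrap CI.

    Assumes the statistic of interest is the mean of per-document scores.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    # Read the CI bounds off the empirical quantiles of the resampled means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = sum(scores) / n
    return point, lo, hi
```

Calling `percentile_bootstrap_ci(per_doc_scores)` on a list of per-document LTB-100 values (a hypothetical input) yields the three numbers shown in each `point [lo, hi]` cell.
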
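Two metrics are shown: chrF (character-level F-score; fast and deterministic) and COMET-Kiwi-22 (reference-free neural QE; slower but better correlated with human judgment). Rankings can differ between them, because paraphrasing systems often score lower on chrF than on COMET. In the oracle-layout tables, for example, deepl-text-oracle leads on chrF while nllb-text-oracle leads on COMET-Kiwi.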
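
To illustrate why paraphrases lose chrF points, here is a simplified character n-gram F-score. It is a sketch only: it uses the common chrF defaults (n-grams up to order 6, β=2) and strips spaces, but it is not guaranteed to match the scores of any particular chrF implementation, including whichever one the benchmark uses. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after dropping spaces (a simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale; not an exact reimplementation."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # F-beta with beta=2 weights recall more heavily than precision.
    return 100 * (1 + b2) * p * r / (b2 * p + r)
```

With the reference `"the cat sat"`, the near-verbatim hypothesis `"the cat sat down"` scores far higher than the paraphrase `"a feline was seated"`, even though both could be acceptable translations. A reference-free neural metric does not depend on surface overlap in this way.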

End-to-end systems — chrF

Runners that produce their own bounding boxes — the realistic score.

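| # | System | LTB-100 [95% CI] | chrF | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|--------|------------------|------|------------|-----------------|----------|--------------|--------------|----------|----------|
| 1 | identity-baseline v0.1.0 | 50.38 [50.2, 50.6] | 0.62 | 1.0000 | 1.0000 | 16/16 | | | | cpu |
| 2 | florence-nllb v0.1.0 | 39.79 [36.6, 42.9] | 37.93 | 0.1541 | 0.8104 | 8/16 | | | | cpu |
| 3 | qwen3-vl-2b-instruct v0.1.0 | 14.82 [12.1, 17.4] | 5.13 | 0.1084 | 0.4504 | 8/16 | 5 | 184.90 | $0.0000 | cpu |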

Oracle-layout reference — chrF

Runners given ground-truth bounding boxes; text-quality ceiling only.

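| # | System | LTB-100 [95% CI] | chrF | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|--------|------------------|------|------------|-----------------|----------|--------------|--------------|----------|----------|
| 1 | deepl-text-oracle v0.1.0 | 78.20 [75.2, 81.7] | 56.40 | 1.0000 | 1.0000 | 6/16 | | 0.40 | $0.0000 | api |
| 2 | nllb-text-oracle v0.1.0 | 73.71 [72.5, 74.9] | 47.61 | 1.0000 | 1.0000 | 16/16 | | | | cpu |
| 3 | opus-mt-text-oracle v0.1.0 | 68.60 [67.1, 70.0] | 37.26 | 1.0000 | 1.0000 | 16/16 | | | | cpu |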

End-to-end systems — COMET-Kiwi

Same systems scored with reference-free COMET-Kiwi-22.

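| # | System | LTB-100 [95% CI] | COMET-Kiwi | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|--------|------------------|------------|------------|-----------------|----------|--------------|--------------|----------|----------|
| 1 | identity-baseline v0.1.0 | 50.47 [50.3, 50.7] | 0.81 | 1.0000 | 1.0000 | 16/16 | | | | cpu |
| 2 | florence-nllb v0.1.0 | 48.11 [44.1, 51.7] | 54.56 | 0.1541 | 0.8104 | 8/16 | | | | cpu |
| 3 | qwen3-vl-2b-instruct v0.1.0 | 18.43 [14.6, 22.3] | 12.33 | 0.1084 | 0.4504 | 8/16 | 5 | 184.90 | $0.0000 | cpu |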

Oracle-layout reference — COMET-Kiwi

The same oracle-layout systems, scored with COMET-Kiwi-22.

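| # | System | LTB-100 [95% CI] | COMET-Kiwi | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|--------|------------------|------------|------------|-----------------|----------|--------------|--------------|----------|----------|
| 1 | nllb-text-oracle v0.1.0 | 86.51 [85.3, 87.7] | 72.15 | 1.0000 | 1.0000 | 16/16 | | | | cpu |
| 2 | deepl-text-oracle v0.1.0 | 84.89 [81.4, 88.0] | 69.77 | 1.0000 | 1.0000 | 6/16 | | 0.40 | $0.0000 | api |
| 3 | opus-mt-text-oracle v0.1.0 | 79.35 [77.9, 80.8] | 57.36 | 1.0000 | 1.0000 | 16/16 | | | | cpu |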