LayoutTranslateBench Leaderboard
The first public benchmark for document translation that scores layout fidelity and reading order alongside translation quality. The composite score, LTB-100, combines chrF (text), layout IoU (visual), and reading-order Kendall τ.
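For readers unfamiliar with the two layout components, here is a minimal sketch of their textbook definitions. It is not the benchmark's scoring code: the function names are hypothetical, the `(x0, y0, x1, y1)` box format is an assumption, and this page does not state how predicted boxes are matched to reference boxes or how the three components are weighted into LTB-100.

```python
from itertools import combinations


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes get an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def kendall_tau(pred_order, ref_order):
    """Kendall tau between two orderings of the same block ids (no ties)."""
    ref_rank = {block: i for i, block in enumerate(ref_order)}
    ranks = [ref_rank[block] for block in pred_order]
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks)), 2):
        if ranks[i] < ranks[j]:
            concordant += 1  # pair appears in the same relative order
        else:
            discordant += 1  # pair is swapped relative to the reference
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0
```

A perfect reading order gives τ = 1, a fully reversed one gives τ = -1, and identical boxes give IoU = 1, which is why the oracle-layout rows show 1.0000 in both columns.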
Sample size — v0.1.5. Per-pair counts: N=20 for en-es/en-de/en-ar/en-fr/en-th/en-ms (10 author-curated + 10 FLORES-200), N=28 for en-ja (+8 rileykim), N=27 for en-zh (+7 rileykim).
The LTB-100 cell shows point [95% CI], computed via a 1000-resample percentile bootstrap. Because CI width scales roughly as 1/√N, CIs at N=20 are about √2 ≈ 1.4× tighter than at v0.1.3's N=10. v0.2 will scale to N≥25 author-curated per pair.
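A percentile bootstrap of this kind can be sketched as follows. This is an illustration under assumptions, not the leaderboard's own code: it assumes the resampling unit is a per-document score and that the point estimate is the mean, and the function name and parameters are hypothetical.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point, lower, upper) for the mean of `scores`.

    Resamples the scores with replacement `n_resamples` times and reads
    the CI bounds off the sorted resample means (nearest-rank percentiles).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    resample_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return mean(scores), resample_means[lo_idx], resample_means[hi_idx]


# Hypothetical per-document scores for one system (illustrative values only).
example_scores = [float(x) for x in range(20)]
point, lower, upper = percentile_bootstrap_ci(example_scores)
print(f"{point:.2f} [{lower:.1f}, {upper:.1f}]")
```

With small N the percentile interval can be noticeably asymmetric around the point estimate, which matches the shape of some intervals in the tables.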
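Two text metrics are shown: chrF (character-level F-score, fast, deterministic) and COMET-Kiwi-22 (reference-free neural QE, slower but better correlated with human judgment). Rankings can differ between them, because paraphrasing systems often score lower on chrF than on COMET-Kiwi. In the oracle-layout tables, for example, deepl-text-oracle ranks first under chrF and second under COMET-Kiwi.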
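To see why a paraphrase is penalized by chrF, here is a simplified sketch of a character n-gram F-score. It is an approximation for illustration only: it assumes n-gram orders 1 to 6, β = 2, and whitespace removal, and it makes no claim to reproduce the scores of the implementation used for this leaderboard. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after dropping whitespace (an assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale; recall is weighted beta^2 times precision."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100 * (1 + b2) * p * r / (b2 * p + r)


reference = "The meeting was postponed until Friday."
near_copy = "The meeting was postponed till Friday."
paraphrase = "They moved the meeting to the end of the week."
print(simple_chrf(near_copy, reference), simple_chrf(paraphrase, reference))
```

The near copy shares almost every character n-gram with the reference, while the paraphrase shares few, even though a human reader might accept both. A reference-free neural metric such as COMET-Kiwi does not depend on surface overlap with a reference in this way.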
End-to-end systems — chrF
Runners that produce their own bounding boxes — the realistic score.
| # | System | LTB-100 [95% CI] | chrF | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | identity-baseline v0.1.0 | 50.38 [50.2, 50.6] | 0.62 | 1.0000 | 1.0000 | 16/16 | — | — | — | cpu |
| 2 | florence-nllb v0.1.0 | 39.79 [36.6, 42.9] | 37.93 | 0.1541 | 0.8104 | 8/16 | — | — | — | cpu |
| 3 | qwen3-vl-2b-instruct v0.1.0 | 14.82 [12.1, 17.4] | 5.13 | 0.1084 | 0.4504 | 8/16 | 5 | 184.90 | $0.0000 | cpu |
Oracle-layout reference — chrF
Runners given ground-truth bounding boxes; text-quality ceiling only.
| # | System | LTB-100 [95% CI] | chrF | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | deepl-text-oracle v0.1.0 | 78.20 [75.2, 81.7] | 56.40 | 1.0000 | 1.0000 | 6/16 | — | 0.40 | $0.0000 | api |
| 2 | nllb-text-oracle v0.1.0 | 73.71 [72.5, 74.9] | 47.61 | 1.0000 | 1.0000 | 16/16 | — | — | — | cpu |
| 3 | opus-mt-text-oracle v0.1.0 | 68.60 [67.1, 70.0] | 37.26 | 1.0000 | 1.0000 | 16/16 | — | — | — | cpu |
End-to-end systems — COMET-Kiwi
Same systems scored with reference-free COMET-Kiwi-22.
| # | System | LTB-100 [95% CI] | COMET-Kiwi | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | identity-baseline v0.1.0 | 50.47 [50.3, 50.7] | 0.81 | 1.0000 | 1.0000 | 16/16 | — | — | — | cpu |
| 2 | florence-nllb v0.1.0 | 48.11 [44.1, 51.7] | 54.56 | 0.1541 | 0.8104 | 8/16 | — | — | — | cpu |
| 3 | qwen3-vl-2b-instruct v0.1.0 | 18.43 [14.6, 22.3] | 12.33 | 0.1084 | 0.4504 | 8/16 | 5 | 184.90 | $0.0000 | cpu |
Oracle-layout reference — COMET-Kiwi
The same oracle-layout runners, scored with reference-free COMET-Kiwi-22.
| # | System | LTB-100 [95% CI] | COMET-Kiwi | Layout IoU | Reading-order τ | Coverage | Parser fails | Median s/doc | Cost USD | Hardware |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | nllb-text-oracle v0.1.0 | 86.51 [85.3, 87.7] | 72.15 | 1.0000 | 1.0000 | 16/16 | — | — | — | cpu |
| 2 | deepl-text-oracle v0.1.0 | 84.89 [81.4, 88.0] | 69.77 | 1.0000 | 1.0000 | 6/16 | — | 0.40 | $0.0000 | api |
| 3 | opus-mt-text-oracle v0.1.0 | 79.35 [77.9, 80.8] | 57.36 | 1.0000 | 1.0000 | 16/16 | — | — | — | cpu |