LayoutTranslateBench Leaderboard

Generated 2026-05-22T05:20:16+00:00 from LayoutTranslateBench v0.1.7.2.

Sample size (v0.1.7). Core 8 pairs: N=20 each for en-es, en-de, en-ar, en-fr, en-th, and en-ms; N=28 for en-ja; N=27 for en-zh. Extension 8 pairs (v0.1.7): N=10 for en-ru (FLORES-200 only; rileykim references removed at v0.1.6.4) and N=13 each for en-ko, en-vi, en-id, en-ur, en-uz, en-kk, and en-zh-tw (3 rileykim + 10 FLORES). All 16 pairs are now covered. Each LTB-100 cell shows the point estimate followed by [95% CI low, CI high], computed with a 1000-resample percentile bootstrap (seed 42). The Coverage column counts the pairs, out of 16, that have at least one scored document.
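As an illustration of how a percentile-bootstrap interval of this kind can be computed, here is a minimal sketch. It assumes the resampling unit is a per-document score and that the statistic is the mean; neither is confirmed by this page, and the function name `percentile_bootstrap_ci` and the example scores are hypothetical. See BENCHMARK.md for the actual procedure.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=42):
    """Return (point, ci_low, ci_high) for the mean of `scores`.

    Assumption: documents are resampled with replacement and the
    statistic is the arithmetic mean. The benchmark may differ.
    """
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(scores)
    # Draw n_resamples bootstrap samples of size n and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Percentile method: cut alpha/2 of the resampled means from each tail.
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-document scores, not benchmark data.
    demo = [40.0, 55.0, 60.0, 35.0, 70.0, 45.0, 50.0, 65.0, 30.0, 58.0]
    point, low, high = percentile_bootstrap_ci(demo)
    print(f"{point:.2f} [{low:.1f}, {high:.1f}]")
```

With samples as small as N=10 to N=28 per pair, intervals from this method can be noticeably wide, which is worth keeping in mind when comparing adjacent ranks.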

Metric. Two parallel leaderboards are shown: chrF (a character-level F-score; fast, deterministic, paraphrase-blind) and COMET-Kiwi-22 (reference-free neural MT quality estimation; slower, but better correlated with human judgment). System rankings can differ between the two metrics, especially for systems that paraphrase well.
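To make "character-level F-score" and "paraphrase-blind" concrete, the following is a simplified, standard-library sketch of a chrF-style score. It assumes character n-grams of order 1 to 6, whitespace removed, and β = 2; these are common defaults, not settings confirmed for this benchmark. It is not the scorer used to produce the tables, and the names `char_ngrams` and `chrf` are illustrative.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 100] (sketch, not the official scorer)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta over averaged precision and recall; beta=2 weights recall higher.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(chrf("the cat sat on the mat", reference))            # exact match
    print(chrf("a feline rested upon the rug", reference))      # paraphrase
```

Because the score depends only on shared character sequences, a fluent paraphrase that uses different words is penalized even when its meaning matches the reference, which is one reason rankings can shift under a neural metric.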

Leaderboard A — chrF

End-to-end systems (chrF)

These runners produce their own bounding boxes, so these are the realistic end-to-end scores.

Rank	System	LTB-100 [95% CI]	chrF	Layout IoU	Reading-order τ	Coverage	Median runtime (s/doc)	Cost (USD)	Hardware
1	identity-baseline v0.1.0	50.38 [50.2, 50.6]	0.62	1.0000	1.0000	16/16	—	—	cpu
2	florence-nllb v0.1.0	39.79 [36.6, 42.9]	37.93	0.1541	0.8104	8/16	—	—	cpu
3	qwen3-vl-2b-instruct v0.1.0	14.82 [12.1, 17.4]	5.13	0.1084	0.4504	8/16	184.90	$0.0000	cpu

Oracle-layout reference (chrF)

These runners are given the ground-truth bounding boxes as predictions; only the text is translated. The scores are upper bounds on text quality, not realistic end-to-end measurements.
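The Layout IoU of 1.0000 reported for oracle runs follows directly from predicting the ground-truth boxes. A minimal sketch of intersection-over-union for a single pair of boxes, assuming axis-aligned boxes given as (x0, y0, x1, y1); how the benchmark matches predicted regions to reference regions and aggregates per document is not described on this page, and `box_iou` is a hypothetical helper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    truth = (0, 0, 10, 10)
    print(box_iou(truth, truth))            # identical boxes, as in oracle runs
    print(box_iou(truth, (5, 0, 15, 10)))   # half-overlapping prediction
```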

Rank	System	LTB-100 [95% CI]	chrF	Layout IoU	Reading-order τ	Coverage	Median runtime (s/doc)	Cost (USD)	Hardware
1	deepl-text-oracle v0.1.0	78.20 [75.2, 81.7]	56.40	1.0000	1.0000	6/16	0.40	$0.0000	api
2	nllb-text-oracle v0.1.0	73.71 [72.5, 74.9]	47.61	1.0000	1.0000	16/16	—	—	cpu
3	opus-mt-text-oracle v0.1.0	68.60 [67.1, 70.0]	37.26	1.0000	1.0000	16/16	—	—	cpu

Leaderboard B — COMET-Kiwi-22

COMET-Kiwi is reference-free. In this leaderboard, the per-region chrF column of Leaderboard A is replaced by the COMET-Kiwi score (also on a [0, 100] scale; higher is better).

End-to-end systems (COMET-Kiwi)

These runners produce their own bounding boxes, so these are the realistic end-to-end scores.

Rank	System	LTB-100 [95% CI]	COMET-Kiwi	Layout IoU	Reading-order τ	Coverage	Median runtime (s/doc)	Cost (USD)	Hardware
1	identity-baseline v0.1.0	50.47 [50.3, 50.7]	0.81	1.0000	1.0000	16/16	—	—	cpu
2	florence-nllb v0.1.0	48.11 [44.1, 51.7]	54.56	0.1541	0.8104	8/16	—	—	cpu
3	qwen3-vl-2b-instruct v0.1.0	18.43 [14.6, 22.3]	12.33	0.1084	0.4504	8/16	184.90	$0.0000	cpu

Oracle-layout reference (COMET-Kiwi)

These runners are given the ground-truth bounding boxes as predictions; only the text is translated. The scores are upper bounds on text quality, not realistic end-to-end measurements.

Rank	System	LTB-100 [95% CI]	COMET-Kiwi	Layout IoU	Reading-order τ	Coverage	Median runtime (s/doc)	Cost (USD)	Hardware
1	nllb-text-oracle v0.1.0	86.51 [85.3, 87.7]	72.15	1.0000	1.0000	16/16	—	—	cpu
2	deepl-text-oracle v0.1.0	84.89 [81.4, 88.0]	69.77	1.0000	1.0000	6/16	0.40	$0.0000	api
3	opus-mt-text-oracle v0.1.0	79.35 [77.9, 80.8]	57.36	1.0000	1.0000	16/16	—	—	cpu

See BENCHMARK.md for the spec and docs/submission.md to submit.