LayoutTranslateBench Leaderboard

The first public benchmark for document translation that scores layout fidelity and reading order alongside translation quality. The composite score, LTB-100, combines a text-quality score (chrF or COMET-Kiwi), layout IoU (visual), and reading-order Kendall tau.
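How the three components are weighted is not stated on this page. As a minimal sketch, the weights below (0.5 text, 0.3 layout IoU, 0.2 reading-order tau) are an assumption inferred from the published rows: they reproduce the end-to-end rows on this page to within rounding, while the oracle-layout rows differ by up to about a tenth of a point, which may mean the real score is aggregated per document. The function name `ltb100` and its signature are hypothetical.

```python
def ltb100(text_score, layout_iou, reading_order_tau, weights=(0.5, 0.3, 0.2)):
    """Combine the three components into one 0-100 composite.

    text_score is on a 0-100 scale (chrF or COMET-Kiwi as reported in the
    tables); layout_iou and reading_order_tau are on a 0-1 scale.
    The default weights are an ASSUMPTION inferred from the published rows,
    not taken from the benchmark specification.
    """
    w_text, w_layout, w_order = weights
    return w_text * text_score + 100.0 * (
        w_layout * layout_iou + w_order * reading_order_tau
    )


# florence-nllb, chrF table: reported LTB-100 is 39.79
print(round(ltb100(37.93, 0.1541, 0.8104), 2))
```

Under these assumed weights, a system with perfect layout and reading order starts from 50 points before any text quality is counted, which matches the identity-baseline scoring just above 50.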

Generated 2026-05-22T05:20:16+00:00 from LayoutTranslateBench v0.1.7.2.

Sample size (v0.1.5). Per-pair counts: N=20 for en-es, en-de, en-ar, en-fr, en-th, and en-ms (10 author-curated + 10 FLORES-200); N=28 for en-ja (+8 rileykim); N=27 for en-zh (+7 rileykim). The LTB-100 cell shows the point estimate followed by a [95% CI] from a 1000-resample percentile bootstrap. Because CI width shrinks roughly as 1/√N, CIs at N=20 are roughly √2× tighter than v0.1.3's N=10. v0.2 will scale to N≥25 author-curated per pair.
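A percentile bootstrap of this kind can be sketched as follows. This assumes the resampling unit is the document and the statistic is the mean of per-document scores; neither detail is confirmed here, and the function name and example scores are hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point, lower, upper) for the mean of `scores`.

    Resamples documents with replacement, recomputes the mean each time,
    and reads the CI off the empirical percentiles of those means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


# Hypothetical per-document composite scores for one language pair (N=20).
example_scores = [
    38.0, 41.5, 36.2, 44.1, 39.9, 35.7, 42.3, 40.8, 37.4, 43.0,
    39.1, 36.9, 41.0, 38.6, 40.2, 42.8, 37.9, 39.5, 41.7, 38.3,
]
point, lower, upper = percentile_bootstrap_ci(example_scores)
print(f"{point:.2f} [{lower:.1f}, {upper:.1f}]")
```

The fixed seed only makes the sketch repeatable. A different seed shifts the interval endpoints slightly, which is expected with 1000 resamples.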

Two metrics shown. chrF (character-level F-score, fast, deterministic) and COMET-Kiwi-22 (reference-free neural QE, slower but better correlated with human judgment). Rankings can differ: paraphrasing systems often score lower on chrF than on COMET. In the oracle-layout tables, for example, deepl-text-oracle ranks first under chrF but second under COMET-Kiwi.
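To make the chrF penalty for paraphrases concrete, here is a simplified character n-gram F-score. It is an illustrative sketch, not the scorer used for this leaderboard: it assumes n-grams up to length 6, β=2, whitespace removed, and a plain average of precision and recall over n-gram orders. The function names and example sentences are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100.0 * (1 + b2) * p * r / (b2 * p + r)


reference = "The meeting was postponed until Monday."
literal = "The meeting was postponed to Monday."
paraphrase = "They pushed the meeting back to Monday."
print(simple_chrf(literal, reference) > simple_chrf(paraphrase, reference))
```

Both hypotheses carry the same meaning, but the paraphrase shares fewer character n-grams with the reference and so scores lower. A reference-free neural metric is not tied to surface overlap in this way.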

End-to-end systems — chrF

Runners that produce their own bounding boxes — the realistic score.

#	System	LTB-100 [95% CI]	chrF	Layout IoU	Reading-order τ	Coverage	Parser fails	Median s/doc	Cost USD	Hardware
1	identity-baseline v0.1.0	50.38 [50.2, 50.6]	0.62	1.0000	1.0000	16/16	—	—	—	cpu
2	florence-nllb v0.1.0	39.79 [36.6, 42.9]	37.93	0.1541	0.8104	8/16	—	—	—	cpu
3	qwen3-vl-2b-instruct v0.1.0	14.82 [12.1, 17.4]	5.13	0.1084	0.4504	8/16	5	184.90	$0.0000	cpu
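The Layout IoU and Reading-order τ columns above can be illustrated with a small sketch. It assumes axis-aligned boxes given as (x0, y0, x1, y1) and a tie-free Kendall tau over block IDs that appear in both orderings. How predicted boxes are matched to ground-truth boxes, and how missing blocks are handled, is not described here, so those steps are left out. All names are hypothetical.

```python
from itertools import combinations


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def kendall_tau(predicted_order, reference_order):
    """Kendall tau (no ties) between two orderings of the same block IDs."""
    rank = {block: i for i, block in enumerate(reference_order)}
    ranks = [rank[block] for block in predicted_order]
    n = len(ranks)
    if n < 2:
        return 1.0
    total = n * (n - 1) // 2
    # A pair is concordant when both orderings put it in the same order.
    concordant = sum(1 for first, second in combinations(ranks, 2) if first < second)
    discordant = total - concordant
    return (concordant - discordant) / total


# Two boxes of equal size, the second shifted right by half a width.
print(round(box_iou((0, 0, 10, 10), (5, 0, 15, 10)), 4))
# Four blocks with the last two swapped relative to the reference.
print(round(kendall_tau(["a", "b", "d", "c"], ["a", "b", "c", "d"]), 4))
```

A half-width shift already drops IoU to one third, whereas a single swapped pair among four blocks still leaves τ at about 0.67. This is one reason a system can show low Layout IoU alongside a fairly high reading-order score.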

Oracle-layout reference — chrF

Runners given ground-truth bounding boxes; text-quality ceiling only.

#	System	LTB-100 [95% CI]	chrF	Layout IoU	Reading-order τ	Coverage	Parser fails	Median s/doc	Cost USD	Hardware
1	deepl-text-oracle v0.1.0	78.20 [75.2, 81.7]	56.40	1.0000	1.0000	6/16	—	0.40	$0.0000	api
2	nllb-text-oracle v0.1.0	73.71 [72.5, 74.9]	47.61	1.0000	1.0000	16/16	—	—	—	cpu
3	opus-mt-text-oracle v0.1.0	68.60 [67.1, 70.0]	37.26	1.0000	1.0000	16/16	—	—	—	cpu

End-to-end systems — COMET-Kiwi

Same systems scored with reference-free COMET-Kiwi-22.

#	System	LTB-100 [95% CI]	COMET-Kiwi	Layout IoU	Reading-order τ	Coverage	Parser fails	Median s/doc	Cost USD	Hardware
1	identity-baseline v0.1.0	50.47 [50.3, 50.7]	0.81	1.0000	1.0000	16/16	—	—	—	cpu
2	florence-nllb v0.1.0	48.11 [44.1, 51.7]	54.56	0.1541	0.8104	8/16	—	—	—	cpu
3	qwen3-vl-2b-instruct v0.1.0	18.43 [14.6, 22.3]	12.33	0.1084	0.4504	8/16	5	184.90	$0.0000	cpu

Oracle-layout reference — COMET-Kiwi

The same oracle-layout runners, scored with reference-free COMET-Kiwi-22.

#	System	LTB-100 [95% CI]	COMET-Kiwi	Layout IoU	Reading-order τ	Coverage	Parser fails	Median s/doc	Cost USD	Hardware
1	nllb-text-oracle v0.1.0	86.51 [85.3, 87.7]	72.15	1.0000	1.0000	16/16	—	—	—	cpu
2	deepl-text-oracle v0.1.0	84.89 [81.4, 88.0]	69.77	1.0000	1.0000	6/16	—	0.40	$0.0000	api
3	opus-mt-text-oracle v0.1.0	79.35 [77.9, 80.8]	57.36	1.0000	1.0000	16/16	—	—	—	cpu
