LayoutTranslateBench — How to Submit

A submission to LayoutTranslateBench (LTB) is a directory containing one manifest.json (system metadata) and one JSONL file per language pair (en-es.jsonl, en-de.jsonl, en-zh.jsonl, en-ar.jsonl, en-ja.jsonl). Each line of each JSONL is one translated document, with predicted text regions and bounding boxes.

Quickstart

# 1. Install
pip install -e .

# 2. Verify the dataset is intact
ltbench verify

# 3. Produce an identity-baseline submission as a reference
ltbench run-baseline

# 4. Score your submission
ltbench score --submission submissions/<your-system>

# 5. Rebuild the leaderboard from the results/ directory
ltbench leaderboard

Directory layout

submissions/<your-system-name>/
├── manifest.json     # SystemManifest — required
├── en-es.jsonl       # one DocumentSubmission per line — required
├── en-de.jsonl
├── en-zh.jsonl
├── en-ar.jsonl
└── en-ja.jsonl

Partial submissions are accepted (e.g., only en-es.jsonl); the unscored pairs are simply omitted from the overall score, and the leaderboard flags the submission as partial.
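Before scoring, it can help to see which pair files a directory actually contains. The following is a minimal sketch, not part of ltbench: the helper name check_layout is hypothetical, and the list of pairs is taken from the five files named in this document.

```python
from pathlib import Path

# The five pair files named in this document (assumed to be the full set here).
EXPECTED_PAIRS = ["en-es", "en-de", "en-zh", "en-ar", "en-ja"]


def check_layout(submission_dir):
    """Return (present, missing) language pairs for a submission directory.

    Hypothetical helper: it only inspects file names and does not validate
    the contents of manifest.json or of the JSONL files.
    """
    root = Path(submission_dir)
    if not (root / "manifest.json").is_file():
        raise FileNotFoundError(f"{root / 'manifest.json'} is required")
    present = [p for p in EXPECTED_PAIRS if (root / f"{p}.jsonl").is_file()]
    missing = [p for p in EXPECTED_PAIRS if p not in present]
    return present, missing
```

A non-empty missing list corresponds to what the leaderboard would flag as a partial submission.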

`manifest.json` schema

See ltbench/schemas.py — SystemManifest. Example:

{
  "system_name": "deepl-doc",
  "system_version": "2026-04-01",
  "manifest_version": "0.1",
  "model_id_or_url": "https://www.deepl.com/docs-api",
  "runner_config": {"formality": "default", "preserve_formatting": true},
  "hardware": "DeepL Cloud",
  "total_runtime_seconds": 142.5,
  "median_per_doc_runtime_seconds": 5.3,
  "cost_usd": 0.42,
  "submitter": "Jane Doe <jane@example.org>",
  "notes": "Used DeepL Document API. Reading order extracted post-hoc by sorting by bbox y-coordinate."
}
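The two runtime fields can be filled in from the per-document runtime_seconds values recorded in the JSONL files. The sketch below assumes that is how they are meant to be derived, which this document does not state. The helper name runtime_summary is hypothetical.

```python
import json
import statistics
from pathlib import Path


def runtime_summary(submission_dir):
    """Aggregate per-document runtime_seconds across all *.jsonl files.

    Assumption: the manifest totals are simple aggregates of the
    per-document values; check ltbench/schemas.py for the real definition.
    """
    runtimes = []
    for path in sorted(Path(submission_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                doc = json.loads(line)
                if "runtime_seconds" in doc:
                    runtimes.append(float(doc["runtime_seconds"]))
    if not runtimes:
        return {}
    return {
        "total_runtime_seconds": sum(runtimes),
        "median_per_doc_runtime_seconds": statistics.median(runtimes),
    }
```

The returned dictionary can be merged into the manifest before it is written to disk.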

JSONL line schema

Each line in <lang-pair>.jsonl is a DocumentSubmission:

{
  "doc_id": "doc_001",
  "regions": [
    {
      "region_id": "r1",
      "bbox": [200, 60, 400, 60],
      "text": "Certificado de Nacimiento",
      "reading_order": 0
    }
  ],
  "output_file": "outputs/doc_001.es.pdf",
  "runtime_seconds": 4.8
}
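A record of this shape can be written with the standard json module. This is a minimal sketch: the function name to_jsonl_line is hypothetical, and the field set is copied from the example above rather than from ltbench/schemas.py.

```python
import json


def to_jsonl_line(doc_id, regions, runtime_seconds, output_file=None):
    """Serialise one document as a single JSONL line.

    json.dumps escapes any newlines inside strings, so the result always
    fits on one line.
    """
    record = {"doc_id": doc_id, "regions": regions}
    if output_file is not None:
        record["output_file"] = output_file
    record["runtime_seconds"] = runtime_seconds
    # ensure_ascii=False keeps translated text readable (accents, CJK, Arabic).
    return json.dumps(record, ensure_ascii=False)


regions = [
    {
        "region_id": "r1",
        "bbox": [200, 60, 400, 60],
        "text": "Certificado de Nacimiento",
        "reading_order": 0,
    }
]
line = to_jsonl_line("doc_001", regions, 4.8, "outputs/doc_001.es.pdf")
```

When appending lines to a file, open it with encoding="utf-8" and write line + "\n" for each document.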

region_id should match the ground-truth region id when known. The scorer falls back to greedy IoU matching otherwise.
bbox is (x, y, width, height) in pixels relative to the source page. Use the bbox where your system actually placed the translated text.
reading_order is the position of this region in your system’s reading order (0-based, document-global).
output_file (optional, recommended) is a path to the rendered output document. Required for v0.2 visual-fidelity scoring.
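To illustrate the fallback mentioned for region_id, here is a generic sketch of greedy IoU matching over (x, y, width, height) boxes. It is not the ltbench scorer: the function names, the min_iou threshold of 0.5, and the tie-breaking order are all assumptions made for illustration.

```python
def iou_xywh(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred, gold, min_iou=0.5):
    """Greedily pair predicted and ground-truth regions by descending IoU.

    pred and gold map region ids to bboxes. Returns {pred_id: gold_id}.
    min_iou is a hypothetical cut-off; the real scorer may use another value.
    """
    pairs = sorted(
        (
            (iou_xywh(pb, gb), pid, gid)
            for pid, pb in pred.items()
            for gid, gb in gold.items()
        ),
        key=lambda t: t[0],
        reverse=True,
    )
    matched, used_gold = {}, set()
    for score, pid, gid in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if pid in matched or gid in used_gold:
            continue  # each region is matched at most once
        matched[pid] = gid
        used_gold.add(gid)
    return matched
```

Greedy matching is simple and deterministic but not globally optimal, which is one reason to supply matching region_id values whenever the ground-truth ids are known.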

Submission lifecycle

Submitter runs their system on the public split, produces a submission directory, runs ltbench score locally, and opens a pull request with their submissions/<name>/ and results/<name>.json.
Maintainers verify the submission is well-formed, then re-score against the held-out split (not present in the public manifest).
If held-out-split scores match within tolerance, the submission is promoted to the leaderboard and flagged “verified”.
Submissions that fail re-scoring (for example, because of clear overfitting to the public split) are rejected with a public explanation.

What “good faith” looks like

Submit a single system per PR — not five variants.
Disclose any LTB documents you used as part of model development.
Report a single primary configuration on the leaderboard; report sweeps in notes or a linked write-up.

Reproducibility

The leaderboard prefers submissions with full reproducibility metadata: hardware, runtime, cost, and a runner_config that lets a third party re-run your system. Closed or proprietary systems are accepted but listed as “unverified”, and the leaderboard distinguishes them visually.

Open-source runners welcome

ltbench/runners/ is the place to drop a public adapter. PRs adding adapters for new translation systems are welcome — the runner pattern means the same adapter can be used by anyone to re-score the system without re-implementing the integration.

Oracle-layout runners

Some runners — like the bundled deepl-text-oracle — translate only the text and copy the ground-truth bboxes verbatim as predicted bboxes. This is intentional:

It measures the upper bound on text quality for a given translation system, isolated from layout-extraction failure.
It exposes the gap between a hypothetical “perfect-layout commercial MT” and real end-to-end systems.
It is not a fair comparison to a true end-to-end runner that has to extract its own bboxes — and the leaderboard flags oracle-layout systems so readers don’t confuse them.

If you submit a new oracle-layout runner, set "oracle_layout": true in your runner_config and explain the choice in notes. Run-time and cost still belong in the manifest as usual.
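Setting the flag can be done by hand or with a few lines of Python. The sketch below is one way to do it. The helper name mark_oracle_layout is hypothetical, and appending the explanation to any existing notes is a design choice of this example, not a requirement stated here.

```python
import json
from pathlib import Path


def mark_oracle_layout(manifest_path, reason):
    """Set runner_config.oracle_layout to true and record why in notes."""
    path = Path(manifest_path)
    manifest = json.loads(path.read_text(encoding="utf-8"))
    # Create runner_config if the manifest does not have one yet.
    manifest.setdefault("runner_config", {})["oracle_layout"] = True
    existing = manifest.get("notes", "")
    manifest["notes"] = f"{existing} {reason}".strip()
    path.write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return manifest
```

For example, mark_oracle_layout("submissions/<your-system>/manifest.json", "Bboxes copied from ground truth; text-only translation.") updates the file in place.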

Built-in runners

Runner	CLI	Extras to install	What it measures
Identity baseline	`ltbench run-baseline`	(none)	Trivial lower bound — returns source text in source boxes
Qwen-VL family	`ltbench run-qwen-vl`	`pip install -e ".[runners-qwen]"`	End-to-end zero-shot VLM (Qwen3-VL by default). Heavy install.
DeepL Text + oracle layout	`ltbench run-deepl`	`pip install -e ".[runners-deepl]"`	Commercial MT text-quality upper bound. Covers 6/8 pairs (no th, ms). Requires `DEEPL_API_KEY`.
NLLB-200 Text + oracle layout	`ltbench run-nllb`	`pip install -e ".[runners-nllb]"`	Open-source MT text-quality baseline. Covers all 8 pairs including th + ms. NLLB-200 is CC-BY-NC-4.0 (research-only).
Florence-2 + NLLB end-to-end	`ltbench run-florence-nllb`	`pip install -e ".[runners-florence-nllb]"`	True end-to-end product baseline. Deferred to v0.2: Florence-2 has a compatibility issue with transformers 5.x.