LayoutTranslateBench Dataset
This directory holds the LTB dataset: the manifest, ground-truth annotations, and source page images.
Layout
data/
├── manifest.json # Index of all documents (one entry per doc_id)
├── annotations/ # Author-curated annotations (v0.1, v0.1.3) — CC-BY-4.0
├── sources/ # Author-rendered page PNGs (v0.1, v0.1.3) — CC-BY-4.0
├── rileykim_derived/ # v0.1.4 expansion — Apache-2.0
│ ├── annotations/ # rileykim/multilingual-document derivatives
│ └── sources/ # Real-world OCR'd document images (downsampled <=1600px)
└── flores_derived/ # v0.1.5 expansion — CC-BY-SA-4.0
  ├── LICENSE # Share-alike clause (FLORES-200 derivative)
  ├── README.md # FLORES-derived documentation
  ├── annotations/ # FLORES-200 derivatives w/ certified-translator refs
  └── sources/ # Author-rendered synthetic layouts of FLORES sentences
Manifest schema
See ltbench.schemas.Manifest. Each entry contains a doc_id, category, paths to the source and annotation files, the page size, a per-document license, and an optional source URL.
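As a rough illustration of the entry shape described above, the sketch below parses a manifest into one record per doc_id. The field names (`source_path`, `annotation_path`, `page_size`, `source_url`) and the top-level `documents` key are assumptions inferred from the prose, not the actual schema; `ltbench.schemas.Manifest` remains the authoritative definition.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManifestEntry:
    # Hypothetical field names inferred from the prose description;
    # ltbench.schemas.Manifest is the authoritative definition.
    doc_id: str
    category: str
    source_path: str
    annotation_path: str
    page_size: tuple  # assumed (width, height) in pixels
    license: str
    source_url: Optional[str] = None  # optional per the description


def parse_manifest(text: str) -> dict:
    """Parse manifest JSON text into a {doc_id: ManifestEntry} mapping."""
    raw = json.loads(text)
    entries = {}
    for item in raw["documents"]:  # assumed top-level key
        entry = ManifestEntry(
            doc_id=item["doc_id"],
            category=item["category"],
            source_path=item["source_path"],
            annotation_path=item["annotation_path"],
            page_size=tuple(item["page_size"]),
            license=item["license"],
            source_url=item.get("source_url"),
        )
        # The manifest holds one entry per doc_id, so duplicates are an error.
        if entry.doc_id in entries:
            raise ValueError(f"duplicate doc_id: {entry.doc_id}")
        entries[entry.doc_id] = entry
    return entries
```

If the real file follows this shape, it could be read with `parse_manifest(open("data/manifest.json", encoding="utf-8").read())`.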
Annotation schema
See ltbench.schemas.Annotation. Each annotation lists text regions with bounding boxes, the source text, a reading-order index, layout class, optional style hints, and reference translations for every supported language pair.
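The region fields listed above might be modelled roughly as follows. This is a minimal sketch: the `Region` class and its attribute names are hypothetical, the box is assumed to be xywh, and the real definition lives in `ltbench.schemas.Annotation`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Region:
    # Hypothetical names; see ltbench.schemas.Annotation for the real schema.
    bbox: tuple            # assumed (x, y, width, height)
    text: str              # source text of the region
    reading_order: int     # position in the page's reading order
    layout_class: str      # e.g. a title or body label (values assumed)
    references: dict = field(default_factory=dict)  # language pair -> translation
    style: Optional[dict] = None                    # optional style hints


def text_in_reading_order(regions):
    """Return the source text of each region, sorted by reading-order index."""
    ordered = sorted(regions, key=lambda r: r.reading_order)
    return [r.text for r in ordered]


def pairs_covered(regions):
    """Return the language pairs for which every region carries a reference."""
    if not regions:
        return set()
    common = set(regions[0].references)
    for region in regions[1:]:
        common &= set(region.references)
    return common
```

Treating a pair as covered only when every region has a reference for it is a design choice of this sketch, not something the schema is known to enforce.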
Dataset status by version
v0.1 — 5 sample documents (initial release)
- doc_001 — Certificate of Birth (certificate)
- doc_002 — Sunny Beans Coffee receipt (receipt-invoice)
- doc_003 — Scientific paper page (scientific-paper)
- doc_004 — Acme Industries business letter (business-letter)
- doc_005 — Q1 planning meeting notes, handwritten style (handwritten-mixed)
v0.1.3 — 5 additional documents (doubles N)
- doc_006 — USCIS Arrival/Departure record I-94 style (gov-form)
- doc_007 — UK Driving Licence application excerpt (gov-form)
- doc_008 — Clinical Laboratory lipid panel report (scientific-paper, medical sub-class)
- doc_009 — Residential Lease Agreement rent clause (legal-contract)
- doc_010 — Public Health Advisory bilingual government notice (magazine-news, gov-comms sub-class)
All 10 documents available as of v0.1.3 have:
- Source pages rendered from author-written text that reproduces real-world public-form templates
- Bounding-box annotations across 5–8 text regions per doc
- Reference translations in 8 language pairs (en-es, en-de, en-zh, en-ar, en-ja, en-fr, en-th, en-ms)
v0.1.4 — rileykim/multilingual-document expansion
15 additional documents imported from rileykim/multilingual-document (Apache-2.0):
- 8 docs (doc_011–doc_018) with en-ja references
- 7 docs (doc_019–doc_025) with en-zh references (mapped from rileykim’s en-zh-cn)
These docs differ from v0.1.3 in three ways:
- Real-world OCR’d document images (not author-rendered): patent documents, ad imagery, multi-column layouts. Stored under data/rileykim_derived/sources/ at native rileykim resolution, downsampled to ≤1600px wide where needed for repo size.
- Real OCR bounding boxes (not author-drawn rectangles): merge_ocr segments from the source dataset, converted from xyxy to xywh.
- Single-pair reference coverage per doc: each rileykim image was paired with exactly one target language, so doc_011–018 have en-ja refs only and doc_019–025 have en-zh refs only. They do NOT contain all 8 LTB pairs.
Use case: the v0.1.4 expansion provides layout-fidelity and reading-order ground truth on real document images, with reference translations for the two LTB pairs that overlap with rileykim. Use these docs to stress-test runners on real (not synthetic) document scans. Scores against rileykim references should be reported with the “ml-curated” provenance caveat; see the provenance table below.
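The xyxy-to-xywh conversion mentioned for the OCR boxes is a small piece of arithmetic. The function below is an illustrative sketch, not the project's import script; the function name is hypothetical.

```python
def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corner coordinates to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    # Take the min of each axis so the result is valid even if the
    # corners arrive in reverse order.
    left, top = min(x1, x2), min(y1, y2)
    return (left, top, abs(x2 - x1), abs(y2 - y1))


print(xyxy_to_xywh((10, 20, 110, 70)))  # (10, 20, 100, 50)
```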
v0.1.5 — FLORES-200 expansion: first industry-grade references
10 synthetic documents (doc_026–doc_035) composed from FLORES-200 devtest sentences. Each doc has references in all 8 LTB pairs, produced by professional translators as part of the No Language Left Behind project. See data/flores_derived/README.md for full details.
These are the first certified-translator-grade references in LTB. The v0.2 roadmap originally planned to commission certified translators at ~€10–25k for all docs; FLORES integration delivers part of that quality upgrade without budget, by reusing existing professional translations under the CC-BY-SA-4.0 share-alike license.
- License segregation: data/flores_derived/ is CC-BY-SA-4.0 (share-alike), separated from the LTB core (CC-BY-4.0) to avoid contamination
- Composition: 5–6 consecutive sentences per article from FLORES devtest, single-column layout, first sentence as title
- Source articles are Wikinews, diverse topics (business, science, politics, health, crime, culture, etc.)
- Coverage: all 8 LTB pairs in every FLORES doc
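The composition rule above (a handful of consecutive sentences, with the first used as the title) could be sketched as below. The function name, the returned dictionary shape, and the handling of short articles are assumptions for illustration; the actual generation script may differ.

```python
def compose_document(sentences, max_sentences=6):
    """Build a title/body pair from consecutive sentences of one article.

    Takes up to `max_sentences` sentences in order; the first becomes the
    title and the rest form the single-column body.
    """
    if len(sentences) < 2:
        raise ValueError("need at least a title sentence and one body sentence")
    chosen = list(sentences[:max_sentences])
    return {"title": chosen[0], "body": chosen[1:]}
```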
v0.1.6 — 16-pair coverage
LANG_PAIRS is extended from 8 to 16 pairs by adding rileykim’s 8 net-new languages: en-ru, en-ko, en-vi, en-id, en-ur, en-uz, en-kk, en-zh-tw. Each new pair gets 3 docs (doc_036–doc_059) imported from rileykim/multilingual-document (Apache-2.0). In v0.1.6.4, script validation at ingest time and post hoc dropped mislabeled refs (rileykim’s en-ru rows were actually Chinese), so the surviving per-pair coverage from rileykim is 3 docs each, except en-ru, which has 0.
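One way a script check of this kind can work is to compare the dominant Unicode script of a reference against the script expected for its language pair. The sketch below is a heuristic written for illustration, not the project's validation code; the `EXPECTED_SCRIPT` table and function names are assumptions, and pairs not listed in the table are simply not checked.

```python
import unicodedata

# Assumed mapping from pair to the first word of the Unicode character name.
EXPECTED_SCRIPT = {
    "en-ru": "CYRILLIC",
    "en-ko": "HANGUL",
    "en-zh-tw": "CJK",
    "en-ur": "ARABIC",
}


def dominant_script(text):
    """Return the most frequent script among alphabetic characters, or None."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def ref_matches_pair(pair, text):
    """True if the reference text is in the script expected for the pair."""
    expected = EXPECTED_SCRIPT.get(pair)
    if expected is None:
        return True  # no expectation recorded for this pair
    return dominant_script(text) == expected
```

Under this heuristic, a Chinese string labelled en-ru fails the check, which is the kind of mislabeling described above.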
v0.1.7 — certified-translator coverage for all 16 pairs (this release)
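scripts/add_flores_extension_refs.py enriches the 10 FLORES-derived docs (doc_026–doc_035) with references for the 8 extension pairs added in v0.1.6. These are also certified-translator-grade and come from the same FLORES-200 source. Effect on the number of docs with references per pair:
| Pair | Docs with refs (v0.1.6.4) | Docs with refs (v0.1.7) |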
|---|---|---|
| en-ru | 0 | 10 (FLORES only — rileykim en-ru rows dropped by script-validation) |
| en-ko, en-vi, en-id, en-ur, en-uz, en-kk, en-zh-tw | 3 | 13 each (10 FLORES + 3 rileykim) |
| Core 8 pairs | 20–28 | unchanged |
Every LTB pair now has at least 10 certified-translator reference documents. The author-curated / certified split documented below still applies: CORE_LANG_PAIRS is the required coverage set for author-curated docs; the extension pairs are required only where the doc actually carries them.
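The coverage rule in the previous paragraph can be expressed as a small check. This is a sketch under assumptions: the pair values are copied from the 8-pair list earlier in this file, while the tuple form of `CORE_LANG_PAIRS`, the `"author-curated"` provenance label, and the function name are hypothetical.

```python
# Values taken from the 8 core pairs listed in this README; the real
# constant's exact form in the codebase is an assumption.
CORE_LANG_PAIRS = (
    "en-es", "en-de", "en-zh", "en-ar",
    "en-ja", "en-fr", "en-th", "en-ms",
)


def missing_core_pairs(provenance, pairs_present):
    """Return the core pairs a doc is required to carry but does not.

    Only author-curated docs must cover every core pair; other docs are
    checked only against the pairs they actually carry, so nothing is
    reported as missing for them.
    """
    if provenance != "author-curated":
        return set()
    return set(CORE_LANG_PAIRS) - set(pairs_present)
```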
Reference-translation provenance
| Version | Doc range | Provenance | Quality grade |
|---|---|---|---|
| v0.1 | docs 001–005 | Author-curated | Comparable to a competent native-speaker non-professional translator |
| v0.1.3 | docs 006–010 | Author-curated, same standard as v0.1 | Same as above |
| v0.1.4 | docs 011–025 | rileykim/multilingual-document (Apache-2.0) | ml-curated (ML output published as references) |
| v0.1.5 | docs 026–035 | FLORES-200 (CC-BY-SA-4.0) | certified-translator (industry-grade, NLLB project) |
| v0.1.6 | docs 036–059 | rileykim/multilingual-document (Apache-2.0), 3 docs each for 8 net-new pairs | ml-curated |
| v0.1.7 | docs 026–035 (refs only) | FLORES-200 (CC-BY-SA-4.0), extension 8 pairs back-filled | certified-translator |
| v0.2 target | all docs | Certified-translator, 2 references per pair | Industry-grade, multi-reference |
References on docs 001–010 are author-curated. References on docs 011–025 and 036–059 are ml-curated. References on docs 026–035 are certified-translator-grade (the same quality that LTB v0.2 was targeting). v0.2 will extend the certified-translator coverage to all 8 LTB pairs across all 35+ docs.
When reporting LTB results, segment by reference grade where relevant:
- overall_ltb_100: averages across all available references
- Certified-subset score: restrict to docs 026–035 for the most reliable signal (industry-grade refs, all 8 pairs)
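A minimal sketch of this segmentation, assuming per-document scores are already available as a mapping from doc_id to a number. How overall_ltb_100 is actually aggregated is not specified here, so the plain mean below is an assumption, as are the function and key names.

```python
from statistics import mean

# Docs 026-035 carry the certified-translator references.
CERTIFIED_DOCS = {f"doc_{i:03d}" for i in range(26, 36)}


def segment_scores(per_doc_scores):
    """Report an overall mean and a certified-subset mean.

    `per_doc_scores` maps doc_id -> score. The certified-subset value is
    None when no certified doc was scored.
    """
    if not per_doc_scores:
        raise ValueError("no scores to aggregate")
    certified = [s for d, s in per_doc_scores.items() if d in CERTIFIED_DOCS]
    return {
        "overall": mean(per_doc_scores.values()),
        "certified_subset": mean(certified) if certified else None,
    }
```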
Why these 5 new docs
The v0.1.3 expansion specifically targets the launch-market verticals identified in the broader project strategy:
| Doc | Vertical | Market relevance |
|---|---|---|
| doc_006 USCIS I-94 | Immigration paperwork | US market entry; also de facto template for many world airports |
| doc_007 UK Driving Licence | Government identity documents | UK + EU market entry |
| doc_008 Lab Report | Medical / HIPAA-style records | DE / FR / US healthcare verticals |
| doc_009 Lease Clause | Legal / sworn translation | DE + FR sworn-translation industries (the launch-market wedge) |
| doc_010 Public Health Notice | Government public communications | SG / SE Asia bilingual government materials |
If you would like to contribute documents (especially in underrepresented categories like handwritten / certificates / non-Latin scripts), see docs/submission.md.
License
The annotations and manifest are CC-BY-4.0 unless a different per-document license is recorded in the manifest, as is the case for data/rileykim_derived/ (Apache-2.0) and data/flores_derived/ (CC-BY-SA-4.0).