v0.1.4 Dataset Expansion — rileykim/multilingual-document Integration
Status: Shipped. 15 documents (doc_011–doc_025) ingested. Outcome: LTB expanded from 10 → 25 docs; en-ja n=18, en-zh n=17 (other 6 pairs remain at n=10).
Candidate datasets considered
| Dataset | License | Layout/bbox? | Translations? | Pair coverage | Outcome |
|---|---|---|---|---|---|
| rileykim/multilingual-document | Apache-2.0 | Yes (merge_ocr xyxy) | Yes (10 pairs) | en-ja + en-zh-cn overlap with LTB | Adopted (Tier A) |
| facebook/flores (FLORES-200) | CC-BY-SA-4.0 | No (sentence-only) | Yes, professional translators | 8/8 of LTB pairs | Tier B: quality oracle, license-segregated |
| DocHPLT (2025) | TBD | Document-level | Yes | TBD | Tier C — investigate later |
| EUR-Lex parallel corpus | EU Open Data | No | Yes, certified | EU only | Tier D |
Decision tree
Run scripts/inspect_rileykim.py
├── Has bboxes + >= 4 of our 8 pairs + CC-BY or CC0
│   → Tier A path: write adapter, ingest 20-30 docs as v0.1.4
├── Has translations but no bboxes
│   → Tier B-prime path: use as text source, render synthetic
│     LTB pages via scripts/render_samples.py
└── Wrong shape / restrictive license
    → Fall back to FLORES-200 (Tier B)
    → Compose synthetic LTB pages from FLORES sentences
    → Keep FLORES-derived docs in data/flores_derived/
      with own LICENSE file (CC-BY-SA-4.0)
    → Do NOT mix into main annotations/ directory
      (avoids share-alike contamination of CC-BY-4.0 LTB core)
Fitness criteria (rileykim spike)
A rileykim row is directly usable if:
- License is CC-BY-4.0, CC-BY-SA-4.0, CC0, or Apache-2.0 (we can attribute)
- Bbox info per text region is present (or derivable from OCR output)
- At least 4 of our 8 target language pairs are represented (en-es, en-de, en-zh, en-ar, en-ja, en-fr, en-th, en-ms)
- Document category is recognizable (legal/medical/business/etc.) for category-balanced sampling
A rileykim row is partially usable (text-only, render pipeline) if:
- License OK + translations present, but bboxes absent
- We then use scripts/render_samples.py to synthesize a layout
A rileykim row is rejected if:
- License is non-commercial or restrictive
- No translations to any of our 8 pairs
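The three outcomes above can be read as a small classification function. The following is a minimal sketch, not the logic of `scripts/inspect_rileykim.py`: the `SourceSummary` fields, the `classify` name, and the `"needs-review"` outcome are hypothetical. It assumes pair names are already normalized to LTB's naming (for example `en-zh-cn` mapped to `en-zh`) and treats any license outside the accepted list as restrictive.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Target pairs and accepted licenses as listed in the criteria above.
TARGET_PAIRS = frozenset(
    {"en-es", "en-de", "en-zh", "en-ar", "en-ja", "en-fr", "en-th", "en-ms"}
)
ACCEPTED_LICENSES = frozenset({"CC-BY-4.0", "CC-BY-SA-4.0", "CC0", "Apache-2.0"})
MIN_PAIRS = 4


@dataclass(frozen=True)
class SourceSummary:
    """Hypothetical summary of what an inspection pass found."""

    license_id: str
    has_bboxes: bool              # present, or derivable from OCR output
    lang_pairs: FrozenSet[str]    # pairs that have translations
    category: Optional[str]       # e.g. "legal"; None if not recognizable


def classify(summary: SourceSummary) -> str:
    """Return "direct", "partial", "rejected", or "needs-review"."""
    overlap = summary.lang_pairs & TARGET_PAIRS
    # Assumption: a license not on the accepted list counts as restrictive.
    if summary.license_id not in ACCEPTED_LICENSES or not overlap:
        return "rejected"
    if not summary.has_bboxes:
        return "partial"  # text-only: synthesize a layout via the render pipeline
    if len(overlap) >= MIN_PAIRS and summary.category is not None:
        return "direct"
    # The criteria above do not cover this case; flag it rather than guess.
    return "needs-review"
```

Returning an explicit `"needs-review"` value keeps cases the criteria do not address (bboxes present but fewer than four overlapping pairs, say) from silently falling into one of the three documented buckets.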
License segregation strategy
LTB’s core annotations are CC-BY-4.0. To avoid share-alike contamination:
- Any CC-BY-SA-4.0-derived documents (e.g. FLORES) live in `data/<source>_derived/` with their own LICENSE file
- `data/manifest.json` records the license per entry; downstream users must respect per-document license
- A `provenance` field will be added to each annotation in v0.1.4: `{"source": "author-curated" | "rileykim" | "flores-200" | ...}`
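One way to picture the segregation rule is a helper that decides where a document is stored and what its manifest entry records. This is a sketch under stated assumptions: `storage_dir`, `manifest_entry`, the `dir_stem` parameter, the `doc_id`/`path` keys, and the `<doc_id>.json` file naming are all hypothetical. Only the `data/<source>_derived/` layout, the per-entry license, and the `provenance.source` field come from the text above. It also assumes non-share-alike sources may sit in the main `annotations/` directory.

```python
import json

# Licenses whose share-alike terms must not mix with the CC-BY-4.0 core.
SHARE_ALIKE = frozenset({"CC-BY-SA-4.0"})


def storage_dir(dir_stem: str, license_id: str) -> str:
    """Pick a directory: segregated for share-alike, main annotations otherwise."""
    if license_id in SHARE_ALIKE:
        return f"data/{dir_stem}_derived"
    # Assumption: everything else is allowed in the main directory.
    return "annotations"


def manifest_entry(doc_id: str, source: str, dir_stem: str, license_id: str) -> dict:
    """Build one hypothetical manifest record with license and provenance."""
    return {
        "doc_id": doc_id,
        "license": license_id,
        "provenance": {"source": source},
        # Hypothetical file naming, for illustration only.
        "path": f"{storage_dir(dir_stem, license_id)}/{doc_id}.json",
    }


if __name__ == "__main__":
    entry = manifest_entry("doc_011", "rileykim", "rileykim", "Apache-2.0")
    print(json.dumps(entry, indent=2))
```

Keeping the directory stem separate from the provenance value allows a source recorded as `flores-200` to live under `data/flores_derived/`, matching the directory name used in the decision tree.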
Quality grading
| Source | Quality grade (per docs/methodology.md) |
|---|---|
| Author-curated (v0.1, v0.1.3) | Comparable to competent native-speaker non-professional |
| rileykim (v0.1.4 candidate) | TBD — depends on stated provenance |
| FLORES-200 (v0.1.4 fallback) | Industry-grade (professional translators) — would be the first certified-quality references in LTB |
| Certified-translator (v0.2 target) | Industry-grade |
If FLORES integrates cleanly, we can claim v0.1.4 has partial industry-grade references (the FLORES-derived subset) without waiting for the v0.2 €10–25k budget. This would be a meaningful credibility upgrade for the benchmark.
What was actually shipped
- Ran `scripts/inspect_rileykim.py` + `inspect_rileykim_stream.py` + `inspect_rileykim_groups.py`: schema verified
- Ran `scripts/import_rileykim.py` (streaming-mode adapter, en-ja batch)
- Ran `scripts/import_rileykim_parquet.py` (direct-parquet adapter, en-zh-cn batch; faster since en-zh-cn lives at the end of the 27-shard stream)
- Added `Provenance` model + `ocr-document` Category + `ReferenceGrade` Literal to `ltbench/schemas.py`
- Updated `ltbench/cli.py` `verify` to permit partial reference coverage for non-author-curated docs
- Updated `ltbench/cli.py` `score` to drop (doc, pair) combinations without references (was double-counting at chrF≈0)
- Added `tests/test_v014.py` with 7 regression tests for partial-coverage scoring
- Re-scored identity baseline on N=25 → `results/identity-baseline-v014.json`
- Updated `data/README.md` and `docs/methodology.md` with v0.1.4 provenance table
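The partial-coverage scoring change can be illustrated with a small aggregation sketch. It is not the code in `ltbench/cli.py`: the `per_pair_means` name and the input shape are hypothetical, and the numbers are made-up illustrative values, not benchmark results. It assumes a missing reference is represented as `None` and that the earlier behavior amounted to scoring such combinations at roughly zero.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Optional, Tuple


def per_pair_means(
    scores: Dict[Tuple[str, str], Optional[float]]
) -> Dict[str, float]:
    """Average scores per language pair, skipping (doc, pair) with no reference."""
    by_pair = defaultdict(list)
    for (_doc_id, pair), score in scores.items():
        if score is None:
            continue  # no reference: drop the combination instead of counting ~0
        by_pair[pair].append(score)
    return {pair: mean(values) for pair, values in by_pair.items()}


if __name__ == "__main__":
    # Illustrative values only. doc_011 has an en-ja reference but no en-es one.
    example = {
        ("doc_001", "en-ja"): 40.0,
        ("doc_011", "en-ja"): 60.0,
        ("doc_001", "en-es"): 50.0,
        ("doc_011", "en-es"): None,
    }
    print(per_pair_means(example))
```

With these inputs the en-es mean stays at 50.0; counting the missing reference as 0 would have halved it to 25.0, which is the kind of distortion the change removes.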
Key implementation findings
- Each rileykim image_id appears in exactly ONE lang_pair — no image is paired with multiple targets. So we cannot get multi-pair coverage from a single rileykim doc; this is the structural reason v0.1.4 docs are “partial coverage.”
- The test split (1k rows) has a parquet schema bug that breaks `load_dataset(repo)` in non-streaming mode; streaming-mode and direct-parquet-on-cache both work.
- Rows are physically grouped by lang_pair in the parquet shards, so en-zh-cn lives near the end of the stream. Direct-parquet read of shards 24–26 in reverse hit en-zh-cn in seconds; streaming from the start took minutes.
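The reverse-shard shortcut can be sketched as follows. This is an illustration rather than the contents of `scripts/import_rileykim_parquet.py`: `rows_for_pair` and the injectable `read_shard` callable are hypothetical, and the sketch assumes each row is a dict with a `lang_pair` key. In practice `read_shard` could wrap something like `pyarrow.parquet.read_table(path).to_pylist()`.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Row = Dict[str, object]


def rows_for_pair(
    shard_paths: Sequence[str],
    read_shard: Callable[[str], Iterable[Row]],
    lang_pair: str,
    limit: int,
) -> List[Row]:
    """Scan shards from last to first and stop once `limit` rows are found."""
    found: List[Row] = []
    for path in reversed(shard_paths):
        for row in read_shard(path):
            if row.get("lang_pair") == lang_pair:
                found.append(row)
                if len(found) >= limit:
                    return found  # early exit: earlier shards are never opened
    return found


if __name__ == "__main__":
    # Tiny in-memory stand-in for parquet shards grouped by lang_pair.
    shards = {
        "shard-0": [{"image_id": "a", "lang_pair": "en-ja"}],
        "shard-1": [{"image_id": "b", "lang_pair": "en-ko"}],
        "shard-2": [{"image_id": "c", "lang_pair": "en-zh-cn"}],
    }
    hits = rows_for_pair(list(shards), shards.__getitem__, "en-zh-cn", limit=1)
    print(hits)
```

Because rows are grouped by pair, a pair stored near the end is reached after opening only the last shard or two, whereas a forward stream has to pass every earlier row first.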
Deferred to future versions
- FLORES-200 integration: `scripts/inspect_flores.py` exists; not used yet. Would give certified-translator quality across all 8 LTB pairs but with CC-BY-SA-4.0 license segregation requirement.
- Extending `LangPair` Literal: rileykim has 8 net-new pairs (en-ru, en-ko, en-vi, en-id, en-ur, en-uz, en-kk, en-zh-tw). Not adopted in v0.1.4 to avoid widening blast radius. v0.1.5 candidate.
- DocHPLT (2025) document-level translations: investigate as additional source.