COMET-Kiwi-22 Setup (Optional Text Metric)

LTB v0.1.2 ships COMET-Kiwi-22 as an optional text metric (--text-metric comet-kiwi). chrF₂ remains the default because COMET requires a clean Python environment.

Why it’s optional

unbabel-comet (the COMET-Kiwi package) requires scipy + numpy ≥ 2.x. Many development environments pin numpy < 2 for opencv-contrib-python or other ML packages, which makes installing COMET into the main ltbench environment fail at scipy-import time.

The fix is a dedicated virtual environment for COMET-based runs.

One-time setup

Step 1: Create the venv and install dependencies

# From the repo root:
python -m venv .comet-env

# Activate the venv:
.comet-env\Scripts\activate          # Windows (PowerShell or cmd)
source .comet-env/bin/activate       # macOS / Linux

# Install torch first (the slowest dep — speeds up later resolution):
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Install COMET + ltbench:
pip install unbabel-comet
pip install -e .

The venv directory is gitignored. ~3 GB on disk after install (torch + pytorch-lightning).

Step 2: HuggingFace authentication (required)

Unbabel/wmt22-cometkiwi-da is a gated model — Unbabel requires free license acceptance before download. Without auth, you’ll see KeyError: "Model 'Unbabel/wmt22-cometkiwi-da' not supported by COMET."

To fix:

  1. Visit https://huggingface.co/Unbabel/wmt22-cometkiwi-da and click “Agree and access repository” (free, ~10 seconds).
  2. Create a read-only access token at https://huggingface.co/settings/tokens.
  3. Authenticate the venv:
    pip install huggingface_hub  # already a transitive dep, but make sure
    huggingface-cli login        # paste token when prompted
    # OR
    set HF_TOKEN=hf_xxxxxxxx      # Windows
    export HF_TOKEN=hf_xxxxxxxx   # macOS / Linux
    

Subsequent runs use the cached token — this is one-time per machine.

Running with COMET-Kiwi

From inside the activated .comet-env:

ltbench score \
  --submission submissions/deepl-text-oracle \
  --text-metric comet-kiwi \
  --output results/deepl-text-oracle.comet.json

The first invocation downloads Unbabel/wmt22-cometkiwi-da (~600 MB) to the HuggingFace cache. Subsequent runs load from cache.

Approximate runtime on CPU (4 systems × 8 lang-pairs × 5 docs × ~7 regions = ~1120 region calls):

Stage Cost
Model download (first-time only) 1–5 min
Per-region inference (batched per doc) 5–20 sec/doc
Full 4-system re-score 30–90 min

GPU inference is ~10–50× faster; pass LTB_COMET_DEVICE=cuda if you have one.

Result file changes

The result JSON’s overall_chrf and per_lang_pair[*].chrf fields now hold the COMET-Kiwi score (still in [0, 100], so the LTB-100 composite formula is unchanged). The weights field is unchanged.

To distinguish chrF-scored and COMET-scored runs, name the output file accordingly (<system>.json vs <system>.comet.json) and rebuild the leaderboard against the appropriate result set.

When to use which metric

Use chrF (default) Use COMET-Kiwi
Quick iteration / CI / unit tests Pre-launch sanity-check on rankings
No GPU available, want <1 sec/region Have GPU or willing to wait
Reproducibility on CPU-only machines Higher human-judgment correlation
All v0.1.1 leaderboard rows A separate *.comet.json set of result files

chrF is the metric for the v0.1.1 leaderboard until COMET-scored runs are available for every submission — otherwise the leaderboard would mix metrics on the same axis.

Verifying setup

Inside the activated venv:

python -c "
from ltbench.metrics.comet import score_one
print(score_one('Certificate of Birth', 'Certificado de Nacimiento'))
"

A first-time run prints model-download progress, then a float in [0, 100] (typically 70–95 for a competent translation). Subsequent runs skip the download.

Limitations

  • First-load overhead. Model loading takes ~30–60 sec; bake this into your timing.
  • Per-pair quality varies. COMET-Kiwi-22 is trained mostly on WMT data; low-resource pairs (en-th, en-ms) have less training signal than en-de or en-es.
  • No GPU on Windows by default. Pytorch CUDA on Windows needs a specific torch install. If you have an NVIDIA GPU and want speedup, follow the torch Windows-CUDA install instructions inside .comet-env.

Roadmap

v0.2 will ship a Docker image that bundles unbabel-comet and the LTB scoring pipeline in one container, eliminating the dual-venv workflow. Until then, this guide is the supported path.