JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation

Announcing JP-TL-Bench, an open benchmark that finally answers: which of these two good translations is actually better? Built for Japanese-English translation development.
japanese
llm
benchmark
translation
evaluation
Author

Leonard Lin

Published

December 31, 2025

Note

A Japanese version of this article is available here.

We’re pleased to announce the official release of JP-TL-Bench, an open benchmark we built to help iterate on Japanese-English translation quality. This benchmark has been one of our “secret weapons” this year (mostly because we’ve been too busy to post about it) during development of Shisa V2, Shisa V2 405B, and Shisa V2.1, and it’s now available along with a full paper.

We’ve been fortunate to have our work cited by others over the past couple of years, and we’ve mostly been happy to stay heads-down shipping models and products, but it’s probably time to roll up our sleeves and throw our hat into the ring on the research side too. So here’s Shisa.AI’s first academic-style paper, to end 2025 in style.

TL;DR

JP-TL-Bench helps you figure out which translation model is actually better when they’re both already pretty good. Unlike traditional metrics that bunch all the good models together, JP-TL-Bench spreads them out so you can see real differences.

The Problem: Looks Good To Me, Boss

When we were developing our Shisa V2 models, we found that for some of the use cases we cared about, the existing benchmarks were not very useful in telling us which model was actually better at the task. Translation (Machine Translation, or MT, in academic parlance) was one of these tasks.

You can imagine training a new LLM, running it through a standard MT eval and getting a score like 0.89. Great! And then you train a new model that you know is better since you actually bothered to read its raw output, and you get a score of … 0.89. It turns out the standard metrics for measuring Japanese MT performance are great for catching obvious problems, but once your models are already producing good translations, the scores all bunch together. A model that produces awkward but technically correct translations scores almost the same as one with natural, flowing prose.

This matters a lot for Japanese-English translation specifically. Japanese has layers of politeness baked into the grammar, frequently omits subjects and objects that need to be inferred, and uses cultural references that don’t translate directly. The difference between a “good” versus a “good enough” translation often comes down to subtle choices that automatic metrics miss.

Our Solution: Let Them Fight

Instead of trying to assign an absolute score to each translation, we use a common technique, pairwise comparison, which asks a simple question: which one is better?

Shisa.AI is all about the AI life, so JP-TL-Bench works by having an LLM judge look at two translations of the same text and pick the winner. Do this enough times and you get stable, meaningful rankings.

Those familiar with chess Elo ratings or Chatbot Arena will recognize the approach, but we add a twist. The key innovation is anchored comparison: rather than comparing every model against every other model (which gets expensive fast), we compare each new model against a fixed set of 20 “anchor” models ranging from very strong to very weak. This means (see the sketch after this list):

  1. Consistent scores: A score from today means the same thing as a score from six months ago
  2. Affordable: ~$7 and 10-30 minutes per model evaluation
  3. Discriminating: Models that look identical on BLEU or COMET get well separated
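
To make the protocol concrete, here’s a minimal sketch of the anchored evaluation loop in Python. The helper names (translate, judge_pair) and the tie handling are hypothetical stand-ins; the real harness, prompts, and judge template are in the GitHub repo.

# Minimal sketch of anchored pairwise evaluation. translate() and
# judge_pair() are hypothetical stand-ins for the real harness.
from collections import defaultdict

def evaluate_candidate(candidate, anchors, prompts, translate, judge_pair):
    """Compare one candidate model against every anchor on every prompt.

    translate(model, prompt)  -> translated text (hypothetical helper)
    judge_pair(source, a, b)  -> "A", "B", or "TIE" (hypothetical LLM judge)
    Returns per-anchor win tallies and an overall win rate (%).
    """
    wins = defaultdict(float)        # ties counted as half a win
    comparisons = 0
    for prompt in prompts:           # the 70 benchmark prompts
        cand_out = translate(candidate, prompt)
        for anchor in anchors:       # the fixed, versioned set of 20 anchors
            anchor_out = translate(anchor, prompt)
            verdict = judge_pair(prompt, cand_out, anchor_out)
            if verdict == "A":
                wins[anchor] += 1.0
            elif verdict == "TIE":
                wins[anchor] += 0.5
            comparisons += 1
    win_rate = 100.0 * sum(wins.values()) / comparisons
    return dict(wins), win_rate

Because a candidate only ever faces the frozen anchor set, the number of judge calls grows linearly with the number of anchors rather than quadratically with the number of models you’re tracking, which is what keeps a full run in the roughly $7, 10-30 minute range quoted above.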

What JP-TL-Bench Measures

The benchmark has 70 translation prompts covering:

  • Both directions: English→Japanese and Japanese→English
  • Easy and Hard: a mix of easier and extremely challenging texts
  • Different Lengths: from short texts to pieces over a thousand words

The hard prompts include things like:

  • Dialogue from video games with cultural references
  • Literature passages that are famously difficult to translate
  • Text that requires reading between the lines

Here’s how the 20 anchor models score (the full range from world-class to struggling):

# Model Win Rate (WR%) LT Score (0–10)
1 google/gemini-2.5-pro 96.15 9.94
2 google/gemini-2.5-flash 92.92 9.89
3 Qwen/Qwen3-30B-A3B-Instruct-2507 84.33 9.63
4 shisa-ai/shisa-v2-llama3.1-405b 81.45 9.49
5 openai/gpt-4o 76.02 9.12
6 shisa-ai/shisa-v2-unphi4-14b 72.81 8.83
7 tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5 62.16 7.45
8 nvidia/NVIDIA-Nemotron-Nano-12B-v2 59.91 7.07
9 meta-llama/Llama-3.3-70B-Instruct 58.05 6.74
10 microsoft/phi-4 49.80 5.13
11 cyberagent/Mistral-Nemo-Japanese-Instruct-2408 47.52 4.69
12 Qwen/Qwen3-4B 44.68 4.11
13 LiquidAI/LFM2-2.6B 43.83 3.92
14 meta-llama/Llama-3.1-8B-Instruct 38.84 2.95
15 microsoft/Phi-4-mini-instruct 24.94 0.98
16 augmxnt/shisa-7b-v1 21.36 0.68
17 meta-llama/Llama-3.2-3B-Instruct 19.18 0.54
18 Rakuten/RakutenAI-2.0-mini-instruct 14.20 0.29
19 LiquidAI/LFM2-350M 8.75 0.13
20 SakanaAI/TinySwallow-1.5B 2.52 0.03

There are a few Shisa models in there, of course, but this Base Set V1.0 was chosen primarily to space win rates as evenly as possible from highest to lowest.

Why Traditional Metrics Fall Short

Here’s the problem visualized. We ran COMET evaluations on the same models:

Score progression across all 20 anchor models. JP-TL-Bench (green) spreads models out across the range, while COMET variants (other colors) compress all the good models into a narrow band at the top.

See how all the COMET lines flatten out at the top? That’s not because those models are equally good - it’s because COMET runs out of resolution when translations are already relatively solid.

The red zones show where models with different JP-TL-Bench scores cluster at identical COMET values.

The scatter plot tells the same story. Those red shaded areas are models that COMET says are basically identical, but where JP-TL-Bench shows real quality differences between them.

That’s Interesting

One thing JP-TL-Bench revealed was that many models are surprisingly lopsided in which direction they translate better.

JP-TL-Bench scores by direction. Blue is Japanese→English, red is English→Japanese. The candlestick bars show the Easy-Hard range.

Some findings:

  • Llama 3.1 8B scores 4.52 translating Japanese→English but only 1.40 going the other way - a massive gap
  • Swallow v0.5 8B shows the opposite: 8.80 for English→Japanese vs 5.96 for Japanese→English (it was optimized more for EN→JA)
  • As expected, the Hard prompts expose brittleness in smaller models that Easy prompts miss

This kind of directional analysis is invisible in aggregate metrics but critical if you’re deploying a model for a specific use case.
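
If you want this kind of per-direction breakdown for your own model, it falls out of the same anchored protocol run separately on each subset of prompts. A hypothetical sketch, building on the evaluate_candidate function above and assuming each prompt carries a direction tag (not necessarily how the repo structures its data):

# Hypothetical per-direction breakdown, reusing the evaluate_candidate
# sketch from earlier. Assumes each prompt object carries a .direction
# attribute ("ja-en" or "en-ja"); the actual data layout may differ.
def evaluate_by_direction(candidate, anchors, prompts, translate, judge_pair):
    scores = {}
    for direction in ("ja-en", "en-ja"):
        subset = [p for p in prompts if p.direction == direction]
        _, win_rate = evaluate_candidate(candidate, anchors, subset,
                                         translate, judge_pair)
        scores[direction] = win_rate
    return scores                    # overall win rate (%) per direction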

The Magic of Pairwise Comparison

Here’s a real-world example of why asking “which is better?” beats “rate this from 1-10”:

Original Japanese:

華やかでおしゃれな街が多く、自然にも恵まれた兵庫県。県内には41の市町があり、県庁所在地の神戸市は中央区や東灘区など9つの行政区に分かれています。それぞれの市町や区には、その土地ならではの魅力がありますよね

Mistral 7B (respect to the OG):

The prefecture of Hyogo is known for its vibrant, fashionable cities that are abundantly blessed by nature. There are 41 cities, towns within the prefecture, with Kobe City as its capital being divided into nine administrative districts such as Chuou Ward and Higashinada Ward. Each city, town, or district has unique charms specific to their land, don’t they?

Shisa Chotto (the new hotness):

Hyogo Prefecture boasts many vibrant and stylish cities as well as abundant natural beauty. The prefecture consists of 41 cities and towns, with its capital, Kobe City, divided into nine administrative wards such as Chuo Ward and Higashinada Ward. Each city, town, and ward has its own unique charm, doesn’t it?

Both translations convey the same information. Both are technically correct. But the Mistral version has a dropped conjunction (“cities, towns”), an awkward phrase (“abundantly blessed by nature”), and overly literal wording (“unique charms specific to their land”).

How many points off for a dropped conjunction? Is “abundantly blessed” worth 2 points or 5? These questions are almost impossible to answer consistently. But ask “which is better?” and evaluators agree almost immediately: Shisa Chotto.

Multiply this by hundreds of comparisons, and you get rankings that actually reflect quality differences humans care about.
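
Under the hood, each comparison is just that question posed to an LLM judge with both candidates side by side. As a purely illustrative sketch (the actual judge model, rubric, and template used by JP-TL-Bench live in the repo and differ in detail), the prompt might look something like this:

# Illustrative only: a bare-bones pairwise judge prompt. The actual rubric
# and template used by JP-TL-Bench differ in detail.
JUDGE_TEMPLATE = """You are comparing two translations of the same source text.

Source ({direction}):
{source}

Translation A:
{translation_a}

Translation B:
{translation_b}

Which translation is better overall, considering accuracy, naturalness,
register, and how well implied subjects and cultural references are handled?
Reply with exactly one word: A, B, or TIE."""

def build_judge_prompt(direction, source, a, b):
    return JUDGE_TEMPLATE.format(
        direction=direction, source=source, translation_a=a, translation_b=b
    )

Forcing a one-word verdict keeps each judgment cheap to run and trivial to parse, which is what makes hundreds of comparisons per model practical.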

Would you like to know more?

We’ve been iterating on JP-TL-Bench (formerly shisa-jp-tl-bench) for most of this year (we’ve been busy!), but it’s time to get it out there for the community. It’s a massive leap beyond the other Japanese translation evals that have been published, it’s Apache 2.0 licensed, and it’s available on our GitHub: https://github.com/shisa-ai/jp-tl-bench

Over the holiday break, we also wrote up our first academic-style technical paper. It is long, and detailed, and jargony, and it has a bunch of math that formally describes how our rating system works (including turning up a few wrinkles in the scoring). If you’re interested in the nitty-gritty, the full paper is 20+ pages of, well, what we just posted about, but … [MORE TECHNICAL].

Made it this far but not sure if you want to read all that? Let’s wrap up with the paper abstract so you can decide for yourself:

Abstract

We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese↔︎English translation systems. In this context, the challenge is often “which of these two good translations is better?” rather than “is this translation acceptable?” This distinction matters for Japanese↔︎English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley–Terry model and reported as win rates plus a normalized 0–10 “LT” score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.
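
The abstract’s aggregation step, a Bradley–Terry fit over the pairwise results followed by a logistic transform of the fitted log-strengths onto a 0–10 scale, can be sketched in a few lines. The sketch below is a simplified illustration of the idea, not the benchmark’s actual fitting code; the MM update scheme and the scale/shift constants are placeholders, and the real definitions are in the repo and paper.

# Simplified Bradley-Terry fit plus a placeholder logistic "LT" transform.
# This is an illustration of the idea, not the benchmark's actual code;
# the real scaling constants are defined in the paper.
import math
from collections import defaultdict

def fit_bradley_terry(wins, n_iters=200, eps=1e-6):
    """Fit log-strengths from pairwise results via simple MM updates.

    wins[(i, j)] = number of times model i beat model j (ties as 0.5 each).
    """
    models = sorted({m for pair in wins for m in pair})
    n_games = defaultdict(float)     # comparisons per unordered pair
    total_wins = defaultdict(float)
    for (i, j), w in wins.items():
        n_games[frozenset((i, j))] += w
        total_wins[i] += w
    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new = {}
        for i in models:
            denom = sum(
                n_games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and n_games[frozenset((i, j))] > 0
            )
            new[i] = (total_wins[i] + eps) / denom if denom > 0 else strength[i]
        # renormalize so the geometric mean strength stays at 1
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}
    return {m: math.log(v) for m, v in strength.items()}

def lt_score(log_strength, scale=1.0, shift=0.0):
    """Map a fitted log-strength onto 0-10 with a logistic curve.

    scale/shift are placeholders; the paper defines the real normalization.
    """
    return 10.0 / (1.0 + math.exp(-(log_strength - shift) / scale))

Feeding the candidate-vs-anchor tallies from the earlier evaluation loop into a fit like this is, roughly, what produces the WR% and LT columns in the anchor table above.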

Citation

@misc{jp-tl-bench,
  title={JP-TL-Bench: Anchored Pairwise LLM Evaluation
         for Bidirectional Japanese-English Translation},
  author={Lin, Leonard and Lensenmayer, Adam},
  year={2025},
  howpublished={\url{https://github.com/shisa-ai/jp-tl-bench}}
}