Last week we announced JP-TL-Bench, our benchmark for evaluating Japanese-English translation quality. That post covered the high-level design decisions we made (anchored pairwise comparisons, LT scores, etc.) and included a link to our technical writeup (now on arXiv!), but we wanted to dig deeper into one aspect we feel is extremely important: directional translation asymmetry.
Most translation evals give you a single score for a language pair, but we found that, at least for Japanese/English, direction matters a lot for many models. Some models are great at translating Japanese to English but terrible at English to Japanese, or vice versa. We also found that some models handle “easy” translations fine but fall apart on harder, more complex ones. JP-TL-Bench measures both of these gaps.
We mentioned this in our previous post, but what does it mean in real terms? To find out, you have to actually read the output, so that’s what we’ll do here: in this post, we walk through actual outputs from JP-TL-Bench’s Base Set v1.0 to see what directional gaps look like in practice.
Translation direction matters - a lot. Llama 3.1 8B scores 4.52 on JA→EN but only 1.40 on EN→JA. Swallow v0.5 shows the opposite pattern (8.80 EN→JA vs 5.96 JA→EN). These scores represent real differences in capabilities, and we show examples of what this means by looking at actual outputs.
What We’re Looking At
All numbers in this post come from our Base Set v1.0 anchor evaluations using Gemini-2.5-Flash as the judge. You can find:
- Full model outputs: in the `translations/` directory
- Pairwise judgments: `base_set.gemini-2.5-flash.jsonl`
- Slice scores: `base_set.gemini-2.5-flash_scores.json`
If you want to reproduce any of these comparisons or dig into the raw data, check out the repo.
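If you’d rather poke at the raw data programmatically, here’s a minimal sketch for loading both files (adjust the paths to wherever they live in the repo):

```python
import json

# Aggregated slice scores (per model, direction, and difficulty).
with open("base_set.gemini-2.5-flash_scores.json") as f:
    scores = json.load(f)

# Raw pairwise judgments, one JSON object per line.
with open("base_set.gemini-2.5-flash.jsonl") as f:
    judgments = [json.loads(line) for line in f]

print(f"Loaded {len(judgments)} pairwise judgments")
```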
Directional Gaps
Before we get into concrete examples, here’s what the JP-TL-Bench scores show:
| Model | EN→JA | JA→EN | Gap | Note |
|---|---|---|---|---|
| Llama 3.1 8B | 1.40 | 4.52 | +3.12 | Strong JA→EN bias |
| Swallow v0.5 8B | 8.80 | 5.96 | -2.84 | Strong EN→JA bias |
| Shisa Chotto | 9.65 | 9.44 | -0.21 | Balanced |
| LFM2-2.6B | 4.91¹ | 2.79 | -2.12 | Easy/Hard gap |

¹ Average scores hide difficulty gaps - LFM2 drops from 6.38 on Easy to 4.06 on Hard EN→JA
Our LT scoring uses a 0-10 scale, and a 3-point gap is exactly what it looks like: a significant difference in quality, probably the difference between useful and unusable in production. One note: our Chotto model isn’t in the Base Set, but we’ll use it as an additional comparison/reference for all the samples we review. You can treat it as a stand-in for one of the top Gemini 2.5 Base Set models (very strong, balanced JA/EN), and of course, we recommend you try it for yourself.
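For the record, the Gap column is just the JA→EN score minus the EN→JA score. A quick sketch, with the numbers hard-coded from the table above:

```python
# Directional LT scores (0-10) from the table above.
slice_scores = {
    "Llama 3.1 8B":    {"en_ja": 1.40, "ja_en": 4.52},
    "Swallow v0.5 8B": {"en_ja": 8.80, "ja_en": 5.96},
    "Shisa Chotto":    {"en_ja": 9.65, "ja_en": 9.44},
    "LFM2-2.6B":       {"en_ja": 4.91, "ja_en": 2.79},
}

for model, s in slice_scores.items():
    # Positive gap = stronger JA→EN; negative gap = stronger EN→JA.
    gap = s["ja_en"] - s["en_ja"]
    bias = "JA→EN bias" if gap > 0.5 else "EN→JA bias" if gap < -0.5 else "balanced"
    print(f"{model}: {gap:+.2f} ({bias})")
```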
Case Study 1: Llama 3.1 8B — Decent JA→EN, Struggles with EN→JA
Scores: JA→EN 4.52 vs EN→JA 1.40 (Δ +3.12)
Llama 3.1 8B was trained primarily on English data with some multilingual coverage. It can understand Japanese reasonably well and produce solid English output, but despite its additional multilingual tuning over Llama 3, its generated Japanese is … not great.
Where Llama Succeeds: Technical Japanese → English
Prompt: amazon_2 (JA→EN, Easy) - Kindle support documentation
Source (Japanese):
デジタルサービスとデバイスサポート › Kindle無料アプリのヘルプ › 設定と登録 ›
Kindle for PCのコンピュータへのインストールまたはアップデート
Kindle無料アプリを使用して、お持ちのPCまたはMacでKindle本をご利用いただけます。
サポートされているオペレーティングシステム:
PC: Windows 8、8.1、10、11
Mac: OS X 10.14以降
Llama 3.1 8B Translation:
Digital Services and Device Support › Kindle Free App Help › Settings and Registration ›
Installing or Updating Kindle for PC on Your Computer
You can use the Kindle Free App to read your Kindle books on your PC or Mac.
Supported Operating Systems:
PC: Windows 8, 8.1, 10, and 11
Mac: macOS 10.14 or later
Technical documentation has predictable structure and terminology, and Llama 3.1 8B correctly renders the breadcrumb navigation, updates “OS X” to the more current “macOS,” and produces clear, professional English. This is the kind of structured content where Llama’s English fluency compensates for weaker Japanese understanding.
Shisa Chotto Translation:
**Digital Services & Device Support › Kindle App Help › Settings & Registration ›**
**Installing or Updating Kindle for PC**
Use the free Kindle app to read Kindle books on your PC or Mac.
**Supported Operating Systems:**
- PC: Windows 8, 8.1, 10, 11
- Mac: OS X 10.14 or later
Chotto produces cleaner formatting with markdown headers and bullet points, making it more usable for documentation purposes. Both translations are accurate, but Chotto’s is more polished.
Where Llama Fails: English → Japanese Dialogue
Prompt: sample_convo_14 (EN→JA, Easy) - Marketing team discussion
Source (English):
Manager: "So where are we with the Q3 digital campaign metrics? The client meeting
is next Thursday."
Team Member: "I've analyzed the engagement rates across platforms. Our Instagram
carousel ads are showing a 3.2% CTR, which is above industry average, but the
conversion rate drops to 1.1% on the landing page."
Llama 3.1 8B Translation:
マネージャー: 「Q3のデジタルキャンペーンのメトリクスはどうですか? 次の木曜日に
クライアント会議があります。」
チームメンバー: 「私はプラットフォームをまたがってエンゲージメントレートを分析
しました。Instagramのキャローセールアドは、業界平均を上回る3.2%のクリック率を
示していますが、ランディングページでは1.1%のコンバージョーレートに落ちます。」
Problems identified:
- Unnatural word order: “私はプラットフォームをまたがって” (I across platforms) is stilted; natural Japanese would front the topic differently
- Katakana transliteration errors: “キャローセールアド” is a garbled rendering of “carousel ads”—should be “カルーセル広告”
- Register mismatch: The Japanese sounds like machine translation rather than how colleagues actually speak in a business setting
- Missing discourse markers: Japanese business conversation uses specific softening phrases that are absent here
Compare to Swallow v0.5 8B:
マネージャー:「それで、Q3のデジタルキャンペーンの指標はどうなっていますか?
クライアントとの会議は来週の木曜日です。」
チームメンバー:「プラットフォーム全体でのエンゲージメント率を分析しました。
Instagramのカルーセル広告はCTRが3.2%で、業界平均を上回っていますが、
ランディングページでのコンバージョン率は1.1%に低下しています。」
Swallow correctly uses “カルーセル広告,” employs natural topic-comment structure, and reads like actual Japanese business dialogue.
Shisa Chotto Translation:
マネージャー: 「第3四半期のデジタルキャンペーンの指標はどうなっていますか?
クライアントとの会議は来週の木曜日です。」
チームメンバー: 「プラットフォームごとのエンゲージメント率を分析しました。
Instagramのカルーセル広告のCTRは3.2%で、業界平均を上回っていますが、
ランディングページのコンバージョン率は1.1%に低下しています。」
Shisa Chotto renders “Q3” as the more natural “第3四半期” (third quarter), uses proper business register, and maintains appropriate formality throughout. The translation reads like a real Japanese business conversation.
Case Study 2: Swallow v0.5 8B — Excellent EN→JA, Weaker JA→EN
Scores: EN→JA 8.80 vs JA→EN 5.96 (Δ -2.84)
Swallow v0.5 8B is a Llama 3-based model that was explicitly fine-tuned on Japanese, and it shows: the model produces remarkably natural Japanese, but struggles more when generating English from Japanese sources.
Where Swallow Excels: Literary English → Japanese
Prompt: jdate1 (EN→JA, Hard) - Dark philosophical monologue from fiction
Source (English):
"Oh, there is, Arnie. I know there is. It's called being willing to sacrifice
everyone around me for the cause. And why not? All of the great men do it.
The pyramids were built with tens of thousands of nobodies who were worked to
death so that the big thing could be achieved. That's the name of the game,
that's how you defeat the bad guys. Just be willing to spend your friends
like pennies, that's all."
Swallow v0.5 8B Translation:
「ああ、いる、アルニー。知ってる、いるんだ。原因のために、私の周りの全てを
犠牲にする覚悟があるってことだ。なぜなら、偉大な男たちは皆、そうするからだ。
ピラミッドは、何万人もの無名の者たちが死ぬまで働かされて建てられた。偉大な
ものを成し遂げるためだ。ゲームの名前はそれだ、悪者を倒すにはそうするしかない。
友人をペニーみたいに使い捨てる覚悟があればいい、それだけだ。」
Why this works:
- Natural conversational flow: Phrasing like “なぜなら、偉大な男たちは皆、そうするからだ。” captures the speaker’s casual, almost manic tone, even if some of it isn’t grammatically perfect
- Appropriate register: Uses informal verb forms (~だ、~しかない) matching the character’s direct speech style
- Idiomatic rendering: “友人をペニーみたいに使い捨てる” (spend friends like pennies) finds a natural Japanese equivalent rather than literal translation
- Dialogue markers: Proper use of Japanese quotation brackets「」
Compare to Llama 3.1 8B:
私たちは、アルニー。私はそれが存在することを知っている。そう呼ばれるものは、
自分を取り巻く人々をすべて犠牲にしてもよいという意欲だ。
Problems with Llama’s version:
- Opens with “私たちは” (“we are”), which makes no sense; the source says “Oh, there is”
- Overly formal register (~という意欲だ) clashes with the casual, dark tone of the original
- Reads like dictionary definitions strung together rather than natural speech
Shisa Chotto Translation:
「ああ、あるさ、アーニー。あるに決まってる。それは、自分のために周りの奴らを
犠牲にする覚悟だ。なぜそうしない?偉大な人たちがみんなそうやってるんだ。
ピラミッドだって、無名の者たちを死ぬほど働かせて建てたんだ。大きな目的の
ためにな。それが世の中のルールだ、悪党を倒す方法だ。友達を小銭のように
使う覚悟があればいい。」
Shisa Chotto correctly renders “Arnie” as “アーニー” (more accurate transliteration than Swallow’s “アルニー”), uses punchy colloquial Japanese (“あるに決まってる”, “周りの奴ら”), and captures the dark, cynical tone of the original. The phrase “それが世の中のルールだ” (that’s how the world works) is more idiomatic than the literal “ゲームの名前はそれだ”.
Where Swallow Struggles: Creative Japanese → English
Prompt: writing_tips_1 (JA→EN, Easy) - Blog post about dialogue writing
Source (Japanese):
セリフの書き方講座~不自然じゃない会話文とは~
創作おTips@定期購読はじめました
2024年2月19日 17:15
地の文講座に続いて、セリフ編です。
■会話文の書き方
じぶんはよく台詞・会話文に悩むため、よければ、そのあたりの書き方を
台詞講座として記事化していただけると幸いです。
有料記事感想フォームより
嬉しい・参考になるおフィードバック、ありがとうございますワ!
会話……会話って悩ましいですよネ……!難しい!
Swallow v0.5 8B Translation:
Okay, so I've been struggling with dialogue/conversation writing a lot, and I
was hoping someone could help me out with that. I'd love it if you could write
a little "dialogue writing lesson" article on how to make it sound more natural.
I've been getting some really great and helpful feedback on my paid articles,
and I'm so grateful! Conversation... conversation is tricky, you know? It's
really hard!
Problems identified:
- Missing content: The title, author byline, date, and section header are completely omitted
- Restructured meaning: The translation conflates the reader’s request (from the feedback form) with the author’s response, making it unclear who is speaking
- Lost structure: The original clearly separates the quoted reader feedback from the author’s reply; this distinction disappears
Shisa Chotto Translation:
**Dialogue Writing Workshop: How to Write Natural Dialogue**
*Creative Writing Tips – Now on Subscription!*
February 19, 2024, 5:15 PM
Following the Narrative Text Workshop, here's the dialogue edition!
■ How to Write Dialogue
(…)
I often struggle with dialogue, so if you'd like, I'd be happy to write an
article on that as a dialogue workshop.
*From the Paid Article Feedback Form*
Thanks so much for the kind and helpful feedback!
Dialogue… dialogue is tricky, isn't it? So difficult!
Shisa Chotto preserves the document structure with proper headers, clearly marks the quoted feedback, maintains the date/byline, and keeps the section marker (■). The translation is complete and structurally faithful while remaining natural English.
Case Study 3: Easy vs Hard — Exposing Brittleness
Some models work fine on easy prompts but become unreliable on hard ones—even within the same translation direction. LFM2-2.6B shows a clear EN→JA drop, while Chotto stays essentially flat.
| Model | Easy (EN→JA) | Hard (EN→JA) | Gap |
|---|---|---|---|
| LFM2-2.6B | 6.38 | 4.06 | -2.32 |
| Shisa Chotto | 9.77 | 9.60 | -0.17 |
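The same sort of quick check works for difficulty slices (the 1.5-point threshold here is our own rough rule of thumb, not something the benchmark defines):

```python
# Easy vs Hard EN→JA scores from the table above.
difficulty = {
    "LFM2-2.6B":    (6.38, 4.06),
    "Shisa Chotto": (9.77, 9.60),
}

for model, (easy, hard) in difficulty.items():
    drop = easy - hard
    # A big Easy-to-Hard drop means the aggregate score overstates how the
    # model will behave on complex real-world inputs.
    print(f"{model}: drop {drop:.2f} -> {'brittle' if drop > 1.5 else 'stable'}")
```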
Easy Success: Simple Expository English → Japanese
Prompt: sample_convo_6 (EN→JA, Easy) - Scientific explanation of rainbows
Source (English):
#6 - Simple Scientific Explanation of Rainbows
Have you ever wondered why rainbows always appear as an arc of the same seven colors? The science behind rainbows combines two fascinating processes: refraction and reflection. When sunlight hits a raindrop, it doesn't just bounce off - it enters the drop and bends, or refracts, as it passes through. This bending happens because light travels at different speeds through water and air.
White sunlight is actually made up of all the colors we see in a rainbow. As light bends through the raindrop, it splits into these different colors because each color bends at a slightly different angle. Red light bends the least, while violet bends the most. After the light splits, it reflects off the back of the raindrop and exits toward our eyes. This is why we always see rainbows in the opposite direction from the sun.
The rainbow's arc shape happens because raindrops at a specific angle (about 42 degrees) from your line of sight will reflect sunlight back to your eyes. Since there are countless raindrops in the sky during or after rain, you see a continuous arc of colors. That's also why no two people see exactly the same rainbow - each person sees light reflected from different raindrops!
LFM2-2.6B Translation:
#6 – 虹の自然科学的説明:なぜ虹は七色の弧として現れるのか?
あなたは、なぜ虹が常に同じ七色の弧として現れるのか、たまに考えたことはありませんか?虹の科学は、屈折と反射という二つの興味深い現象によって成り立っています。太陽光が雨粒に当たると、それは単に跳ね返るのではなく、雨粒の中に入り込み、水と空気を通る際に曲がります。この曲がりは、光が水中と空気中を異なる速度で進むため起こります。実際、白い太陽光は、私たちが虹で見るすべての色を含んでいます。光が雨粒を通って曲がると、それぞれの色がわずかに異なる角度で曲がるため、分離されます。赤い光は最も曲がりにくく、紫の光は最も曲がります。光が分かれた後、雨粒の裏面で反射し、私たちの目に向かって放出されます。これが、私たちが太陽とは逆方向に見る虹の理由です。
虹の弧形は、視線に対して約42度の角度にある雨粒によって生じます。この角度で太陽光を反射する雨粒だけが見えるため、私たちは連続した色の弧として虹を認識します。そのため、同じ虹を二人が全く同じように見ることはありません。それぞれの人々が見ているのは、異なる雨粒から反射された光なのです。
Shisa Chotto Translation:
#6 - 虹のシンプルな科学的説明
虹がいつも同じ七色の弧の形で現れるのはなぜでしょうか? 虹の科学的な仕組みは、屈折と反射という二つの興味深い現象が組み合わさって生まれます。 太陽光が雨粒に当たると、単に跳ね返るだけでなく、雨粒の中に入り込み、通過する際に曲がる(屈折する)のです。 この曲がりは、光が水と空気の中で異なる速度で進むために起こります。
白い太陽光は実は、虹で見られるすべての色の組み合わせです。 光が雨粒の中で屈折する際、それぞれの色がわずかに異なる角度で曲がるため、色が分かれていきます。 赤色の光は最も曲がりにくく、紫色の光は最も曲がりやすいのです。 光が分かれた後、雨粒の裏側で反射し、私たちの目に向かって出てきます。 そのため、虹はいつも太陽とは反対方向に見えるのです。
虹が弧の形をしているのは、あなたの視線から約42度の角度にある雨粒が太陽光を反射してあなたの目に届くからです。 雨の時や雨上がりには空に無数の雨粒があるため、連続した色の弧として見えます。 また、二人として同じ虹を見ることはないのは、それぞれの人が異なる雨粒から反射した光を見ているからです!
Both translations are fluent; the key point is that LFM2 stays stable on this kind of easy, expository content.
Hard Failure: Code-heavy How-to Content
Prompt: howto1 (EN→JA, Hard) - Python tutorial with diagrams + code blocks
Source (English, excerpt):
How to Reverse a Binary Tree in Python
Reversing a binary tree is a fundamental programming operation where we swap all the left and right children of each node, effectively creating a mirror image of the original tree. While this might sound complex, the implementation is surprisingly elegant using recursive techniques.
First, let's understand what we mean by "reversing" a tree. Consider this simple binary tree:
```
1
/ \
2 3
/ \ / \
4 5 6 7
```
Let's implement this in Python. First, we'll define our basic tree structure:
```python
class TreeNode:
def __init__(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
```
LFM2-2.6B Translation (excerpt):
```json
{
"translation": {
"How to Reverse a Binary Tree in Python": "バイナリツリーを逆転させる方法 – Pythonでの実装",
"Introduction": "バイナリツリーの逆転は、プログラミングにおける基本的な操作であり…",
...
```
Critical problems:
- Wrong output format: Returns a JSON object instead of a clean Japanese article
- Degeneration on hard inputs: Later sections devolve into garbled code/tokens (e.g., \new_token, runaway escaping, broken Python)
- Unusable as a translation: Even when parts are accurate, the structure is no longer publishable or directly reusable
Shisa Chotto Translation:
Pythonにおける二分木の反転方法
二分木の反転は、各ノードの左右の子をすべて入れ替えることで、元の木の鏡像を作成する基本的なプログラミング操作です。複雑に聞こえるかもしれませんが、再帰的な手法を用いることで、実装は驚くほど簡潔になります。このガイドでは、その手順をステップごとに解説します。
まず、「反転」とはどういう意味かを理解しましょう。以下のような単純な二分木を考えます。
```
1
/ \
2 3
/ \ / \
4 5 6 7
```
反転後は、以下のようになります。
```
1
/ \
3 2
/ \ / \
7 6 5 4
```
これをPythonで実装してみましょう。まず、基本的な木構造を定義します。
```python
class TreeNode:
def __init__(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
```
...
Shisa Chotto preserves the diagrams and code blocks while keeping the Japanese natural and consistent.
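If you’re deploying a smaller model for translation, failure modes like LFM2’s suggest a cheap output sanity check is worth having. Here’s a rough sketch (the heuristics are ours and purely illustrative, not part of JP-TL-Bench):

```python
import json

def looks_degenerate(output: str) -> bool:
    """Rough heuristics for the failure modes above: JSON-wrapped
    'translations' and runaway escape sequences / token soup."""
    stripped = output.strip().strip("`")  # drop any code-fence wrapper
    if stripped.startswith("json"):
        stripped = stripped[len("json"):].lstrip()
    if stripped.startswith("{"):
        try:
            json.loads(stripped)
            return True  # model returned structured data, not prose
        except json.JSONDecodeError:
            pass
    # Escape-sequence spam is a strong degeneration signal in plain prose.
    return output.count("\\") > 0.02 * max(len(output), 1)

print(looks_degenerate('{"translation": {"title": "..."}}'))  # True
print(looks_degenerate("二分木の反転は、基本的な操作です。"))  # False
```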
Key Takeaways
Direction matters more than aggregate scores: A model scoring 4.0 overall might be 6.0 in one direction and 2.0 in the other. Use directional scores for deployment decisions.
Training data determines bias: Llama’s English-heavy pretraining makes it better at understanding Japanese (JA→EN) than generating it (EN→JA). Swallow’s fine-tuning on primarily Japanese data creates the opposite pattern.
Easy/Hard gaps reveal capabilities: LFM2-2.6B drops from 6.38→4.06 on Easy vs Hard EN→JA sets, indicating it will be less predictable on harder content. Shisa Chotto stays near-flat (9.77→9.60).
Failure modes widen with difficulty: On hard prompts, weaker models may hallucinate content (Llama) or collapse into invalid formats/token soup (LFM2 on code-heavy inputs).
Choosing the right model depends on use case:
- Translating Japanese documentation to English? Something like Llama 3.1 8B might be serviceable for easy content
- Translating English content for Japanese audiences? Swallow v0.5 8B is much better than base Llama (but is much weaker in the reverse direction)
- Be sure to pay attention to Easy and Hard scores: a large gap may mean a model has more limitations than the aggregate score suggests
- Need reliable performance across directions and difficulty levels? Shisa Chotto maintains 9.4+ (close to frontier model performance) in all slices. The Chotto translation model is extremely strong and also now available via API.
Hopefully this concrete look at actual outputs shows exactly what JP-TL-Bench is designed to measure, and why having separate directional (and difficulty!) splits really matters.