Qwen 3 Japanese Performance

Author: Leonard Lin
Published: May 1, 2025
Categories: japanese, llm, research

Similar to our previous Llama 4 Japanese Performance review, here’s an initial one for Alibaba’s latest Qwen 3 release. This is going to be more of a first look/preview, and we’ll revisit as time and compute free up, but since it’s topical and useful, I wanted to publish at least some initial testing. Qwen 3 claims support for 119 languages and dialects, so let’s see how it does on Japanese!

Since we’ve just wound down our METI-funded cluster (some fun results coming soon), these tests were judged by GPT-4.1 instead of the PoLL (LLM Jury) judges we used for our prior testing. While not yet validated as rigorously as our LLM Jury, in preliminary testing gpt-4.1-2025-04-14 scored within 0-2% of gpt-4-turbo-preview judging, and we’ll be switching to it for convenience, stability, cost, performance, and easier 3rd party reproducibility.
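For the curious, a single judge call is conceptually just a chat completion. Here’s a minimal sketch using the official OpenAI Python client (the rubric and prompts below are simplified placeholders, not the actual Shaberi judge prompts):

```python
# A minimal sketch of a single LLM-judge call. The rubric and prompt
# here are simplified placeholders, not the actual Shaberi judge prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(question: str, answer: str) -> str:
    """Ask gpt-4.1-2025-04-14 to score a model answer (toy rubric)."""
    resp = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        temperature=0.0,  # deterministic judging for reproducibility
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a strict grader. Score the answer to the "
                    "question on a 1-10 scale. Reply with the number only."
                ),
            },
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```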

As a reminder, Shaberi uses the following Japanese functional benchmarks, scored by an LLM judge:

- ELYZA Tasks 100
- JA MT-Bench
- Rakuda
- Tengu-Bench

These tests were run with the latest vLLM 0.8.5 release, which defaulted to “thinking” mode. For judging, the <think> blocks are stripped.
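The stripping itself is simple. A minimal sketch (not necessarily our exact code):

```python
import re

# Qwen 3's "thinking" mode wraps its chain of thought in <think>...</think>;
# only the text after the block gets judged.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    return THINK_RE.sub("", text).strip()

print(strip_think("<think>step 1... step 2...</think>こんにちは!"))  # -> こんにちは!
```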

And with that out of the way, here are the results of running our Shaberi evals for the 8B and 30B A3B MoE (FP8) as well as some comparison baselines:

| Model | Average | ELYZA 100 | JA-MT | Rakuda | Tengu |
|---|---:|---:|---:|---:|---:|
| gpt-4o-2024-11-20 | 9.21 | 9.30 | 9.30 | 9.88 | 8.38 |
| shisa-ai/shisa-v2-llama3.3-70b | 8.39 | 8.60 | 8.82 | 8.85 | 7.28 |
| Qwen/Qwen3-30B-A3B-FP8 | 8.36 | 8.52 | 8.67 | 8.85 | 7.42 |
| Qwen/Qwen3-8B | 7.81 | 8.08 | 8.48 | 7.80 | 6.86 |
| shisa-ai/shisa-v2-qwen2.5-7b | 7.32 | 7.56 | 7.88 | 7.35 | 6.49 |
| shisa-ai/shisa-v2-llama3.1-8b | 7.14 | 7.54 | 6.83 | 7.85 | 6.34 |
| Qwen/Qwen2.5-7B-Instruct | 6.78 | 7.50 | 6.88 | 6.25 | 6.49 |

Not bad! The Qwen 3 results are incredibly strong on these benchmarks, with the 8B model beating our 7B/8B class Shisa V2 models and the 30B A3B MoE scoring almost as well as the 70B Shisa V2 model.

I’m incredibly excited to test out the 235B A22B model… (If you revisit this post in the future, you might see the results here instead). 😂

Cross-Lingual Token Leakage

One issue Qwen models have traditionally had is a propensity for cross-lingual token leakage. For our Shaberi testing, we found min_p to be the most useful sampler parameter for reducing this; all tests were done with min_p=0.1 and temp=0.2. Note that for the character and language counts below, the <think> block is ignored (it accounts for roughly 3X the token count).
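For reference, those sampler settings look like this in vLLM (a minimal sketch; the actual Shaberi harness handles chat templating, batching, etc.):

```python
from vllm import LLM, SamplingParams

# min_p discards tokens whose probability falls below min_p * P(top token),
# which in practice helps suppress stray non-Japanese tokens.
params = SamplingParams(temperature=0.2, min_p=0.1, max_tokens=2048)

llm = LLM(model="Qwen/Qwen3-8B")  # or "Qwen/Qwen3-30B-A3B-FP8"

# A real run would apply the chat template first; raw prompt for brevity.
outputs = llm.generate(["日本の首都はどこですか?"], params)
print(outputs[0].outputs[0].text)
```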

More real-world testing is required, but an initial run through our character analysis code (a simplified sketch follows the table below) suggests this may still be an issue:

| Model | # Chars | JA | ZH | KO | EN | # WS | JP_P | LAT_P | OTH_P | Other |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen/Qwen2.5-7B-Instruct | 28031 | 81.29% | 0.04% | 2.36% | 1.22% | 0.00% | 0.00% | 0.00% | 0.00% | 1.07% |
| Qwen/Qwen3-8B | 51451 | 73.20% | 0.00% | 0.70% | 1.27% | 0.00% | 0.00% | 0.00% | 0.00% | 7.37% |
| Qwen/Qwen3-30B-A3B-FP8 | 53742 | 74.25% | 0.02% | 0.68% | 1.38% | 0.00% | 0.00% | 0.00% | 0.00% | 2.09% |
| shisa-ai/shisa-v2-qwen2.5-7b | 38844 | 76.26% | 0.06% | 0.48% | 1.24% | 0.00% | 0.00% | 0.00% | 0.00% | 1.17% |
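The analysis itself is conceptually just Unicode-range bucketing. Here’s a simplified sketch (our actual script differs, and since kanji are shared between Japanese and Chinese, real leakage detection needs extra heuristics):

```python
from collections import Counter

def bucket(ch: str) -> str:
    """Assign a character to a rough script bucket (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:      # hiragana + katakana
        return "JA"
    if 0x4E00 <= cp <= 0x9FFF:      # CJK ideographs (kanji, shared JA/ZH)
        return "CJK"
    if 0xAC00 <= cp <= 0xD7A3:      # hangul syllables
        return "KO"
    if ch.isascii() and ch.isalpha():
        return "EN"
    if ch.isspace():
        return "WS"
    return "OTHER"

def char_report(text: str) -> dict[str, str]:
    """Return the percentage of characters falling into each bucket."""
    counts = Counter(bucket(c) for c in text)
    total = sum(counts.values())
    return {k: f"{100 * v / total:.2f}%" for k, v in counts.items()}

print(char_report("これはテストです。This has some English."))
```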

Alignment and Refusals

One last major issue with most Chinese models is their state-enforced alignment and refusals. We tested with the augmxnt/deccp EN/ZH refusal set and some new testing code that uses the recently released NousResearch/Minos-v1 refusal classifier to analyze refusal rates:

| Model | Always Refuse | Sometimes Refuse | Never Refuse |
|---|---:|---:|---:|
| Qwen/Qwen3-8B | 29.47% | 12.63% | 57.89% |
| Qwen/Qwen2.5-7B-Instruct | 31.58% | 5.26% | 63.16% |
| shisa-ai/shisa-v2-qwen2.5-7b | 4.21% | 3.16% | 92.63% |
| shisa-ai/shisa-v2-llama3.1-8b | 0.00% | 3.16% | 96.84% |

The refusal testing was run 5 times on the 95-sample “censored” set, batched at temp=0.0. Qwen 3 appears to have similar refusal rates to Qwen 2.5, so the results of our prior Qwen 2 analysis are likely still valid. We don’t explicitly do alignment training with our current Shisa V2 recipe, but it appears that our post-training recipe reduces, but does not entirely eliminate, these types of refusals.
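For anyone wanting to run similar checks, here’s a rough sketch of scoring a single response with Minos-v1, assuming it loads as a standard Hugging Face sequence-classification model (check the model card for the exact input template it expects; the formatting below is hypothetical):

```python
from transformers import pipeline

# Load the refusal classifier; we assume a standard text-classification head.
clf = pipeline("text-classification", model="NousResearch/Minos-v1")

def looks_like_refusal(prompt: str, response: str) -> bool:
    # Hypothetical input formatting; the real template may differ.
    text = f"<|user|>\n{prompt}\n<|assistant|>\n{response}"
    result = clf(text)[0]  # e.g. {"label": "...", "score": 0.98}
    return "refusal" in result["label"].lower()

print(looks_like_refusal("天安門事件について教えて", "申し訳ありませんが、お答えできません。"))
```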

NOTE: in manual review, we found that Minos-v1 has a tendency to misclassify non-refusals as refusals. We highly recommend using any Minos results as a general guide only, and carefully reviewing the raw outputs for any critical decision-making. (This is highly recommended regardless, since even a perfect refusal classifier will of course miss any non-refusal misalignment.)

Final Thoughts

These are just a few initial tests, and we may revisit at some point with a full eval, including the largest Qwen 3 model (235B A22B). The entire Qwen 3 family, from the smallest 0.6B to the massive 235B, has been open sourced under an Apache 2.0 license, which warms our hearts.

The Qwen 3 30B A3B MoE is a particularly interesting training target since it looks to be highly capable and, in theory, should be very efficient at inference (only ~3B parameters active per token).

Friendly Qwen 3