Llama 4 Japanese Performance

Categories: japanese, llm, research

Author: Leonard Lin
Published: April 11, 2025

Last weekend Meta launched Llama 4, starting with two models: Scout - a 17B active parameter, 16 expert (109B total parameter) model, and Maverick - a 17B active parameter, 128 expert (400B total parameter) model. At Shisa.AI, we’re most interested in the Japanese performance of LLMs. We’ve been able to get some great results out of Llama 3 (see our recent shisa-ai/shisa-v2-llama3.1-8b-preview model), and Llama 4 claims to have even stronger multi-lingual capabilities, with “10x more multilingual tokens than Llama 3” and “pretraining on 200 languages.”

It was a surprise launch, and reception has been somewhat mixed due to uneven performance, LM Arena results gamed with a custom model, and the lack of stable Day 1 inference support. But with validated vLLM inference now released, we took a short break from our other work to see how the Llama 4 models perform on Japanese evals.
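For anyone who wants to poke at these models themselves, below is a minimal sketch of loading a Llama 4 checkpoint with vLLM’s offline Python API. The tensor-parallel degree, context length, and sampling settings are illustrative assumptions that you will need to adapt to your own hardware.

```python
from vllm import LLM, SamplingParams

# Llama 4 Scout: 17B active / 109B total parameters. The parallelism and
# context-length settings here are placeholders; size them to your GPUs.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
# "Briefly explain Japan's capital."
messages = [{"role": "user", "content": "日本の首都について簡単に説明してください。"}]

# chat() applies the model's chat template before generation.
for output in llm.chat(messages, params):
    print(output.outputs[0].text)
```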

For our primary Japanese testing, we maintain our own fork of Shaberi, an evaluation harness originally created by Lightblue that uses four different LLM-judged functional benchmarks:

- ELYZA Tasks 100
- JA MT-Bench
- Rakuda
- Tengu-Bench

Our current version uses an LLM Jury consisting of NexusFlow Athene V2, Tulu 3.1 405B FP8, and Llama 3.3 70B judges, which we have statistically validated to be highly correlated both with GPT-4 and with human gold-standard ratings.
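Conceptually, each judge model scores a candidate answer independently and the jury result is the average of those scores. The snippet below is only an illustrative sketch of that aggregation step, not the actual Shaberi code; the judge identifiers and the score_with_judge helper are placeholders.

```python
from statistics import mean

# Placeholder identifiers; the real jury uses Athene V2, Tulu 3.1 405B FP8,
# and Llama 3.3 70B served behind an inference endpoint.
JUDGES = ["athene-v2", "tulu-3.1-405b-fp8", "llama-3.3-70b-instruct"]

def score_with_judge(judge: str, question: str, answer: str) -> float:
    """Hypothetical helper: prompt `judge` to rate `answer` on a 1-10 scale
    and parse the numeric score from its response."""
    raise NotImplementedError

def jury_score(question: str, answer: str) -> float:
    # Averaging across several independent judges reduces the idiosyncratic
    # bias of relying on any single judge model.
    return mean(score_with_judge(j, question, answer) for j in JUDGES)
```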

First, let’s see how Maverick does. Besides the models that Meta compares against in their announcement blog post, we’ve also added the top-scoring models we’ve ever tested, OpenAI’s GPT-4.5 and the new stealth model Quasar Alpha, for comparison.

| Model Name | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu |
|---|---|---|---|---|---|
| openrouter/quasar-alpha | 9.20 | 9.41 | 9.01 | 9.42 | 8.97 |
| gpt-4.5-preview-2025-02-27 | 9.19 | 9.50 | 8.85 | 9.56 | 8.86 |
| gpt-4o-2024-11-20 | 9.15 | 9.34 | 9.10 | 9.55 | 8.60 |
| deepseek-ai/DeepSeek-V3-0324 | 8.98 | 9.22 | 8.68 | 9.24 | 8.77 |
| gemini-2.0-flash | 8.83 | 8.75 | 8.77 | 9.48 | 8.33 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 8.64 | 8.54 | 8.81 | 9.14 | 8.08 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 8.41 | 8.52 | 8.42 | 9.07 | 7.63 |

While Llama 4 Maverick trails the top frontier models, all of these models show high proficiency and are quite capable on regular Japanese tasks. Remember that while Maverick occupies a large amount of VRAM (basically a whole 8xH100 node is required to run it), it only activates 17B parameters per forward pass, making it quite compute efficient. In this context, Maverick is performing quite respectably for its size. It’s also worth pointing out that Llama 4 Maverick performs better than Llama 3.1 405B Instruct (FP8) across the board on these evals.
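As a rough back-of-the-envelope illustration of that efficiency, forward-pass compute per token scales with active (not total) parameters, roughly 2 FLOPs per active parameter per token if we ignore attention and routing overhead. Under that simplifying assumption:

```python
# Approximate decode FLOPs per token ~= 2 * active parameters.
# This ignores attention, routing, and KV-cache costs, so treat it purely
# as an order-of-magnitude comparison.
ACTIVE_PARAMS = {
    "Llama-4-Maverick (MoE)": 17e9,   # 17B active of 400B total
    "Llama-4-Scout (MoE)":    17e9,   # 17B active of 109B total
    "Llama-3.1-405B (dense)": 405e9,  # every parameter is active
}

for name, n_active in ACTIVE_PARAMS.items():
    print(f"{name}: ~{2 * n_active:.1e} FLOPs/token")

# The dense 405B model needs roughly 405/17 ~= 24x more compute per token
# than either Llama 4 model, even though Maverick's total VRAM footprint
# is still that of a 400B-class model.
```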

Perhaps more interesting, though, are the results at the Scout level. While we don’t have numbers for Gemini 2.0 Flash Lite, we did throw in a few other models, including the prior Llama 3.1/3.3 models, a couple of smaller models (Phi-4 14B, Gemma 3 12B), and of course our own shisa-v2 preview model:

| Model Name | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu |
|---|---|---|---|---|---|
| google/gemma-3-27b-it | 8.53 | 8.53 | 8.71 | 8.85 | 8.03 |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | 8.51 | 8.56 | 8.63 | 9.12 | 7.74 |
| microsoft/phi-4 | 8.48 | 8.49 | 8.65 | 9.11 | 7.68 |
| google/gemma-3-12b-it | 8.48 | 8.34 | 8.67 | 9.02 | 7.88 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 8.41 | 8.52 | 8.42 | 9.07 | 7.63 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 8.35 | 8.07 | 8.54 | 8.94 | 7.86 |
| meta-llama/Llama-3.3-70B-Instruct | 8.28 | 8.09 | 8.76 | 8.88 | 7.40 |
| shisa-ai/shisa-v2-llama-3.1-8b-preview | 8.10 | 7.58 | 8.32 | 9.22 | 7.28 |
| meta-llama/Llama-3.1-8B-Instruct | 7.34 | 6.95 | 7.67 | 8.36 | 6.40 |

In these particular Japanese benchmarks, we can see that Llama 4 Scout matches Llama 3.3 70B and gets within margin-of-error of Llama 3.1 405B. While Japanese is not explicitly listed as a supported language, it’s clear that Llama 4’s native Japanese capabilities are an upgrade over Llama 3.

That being said, not only do Gemma 3 27B and Mistral Small 3.1 beat Llama 4 Scout, but several smaller models do as well. You can see that our shisa-v2 preview model performs quite well and greatly improves on Llama 3.1 8B’s native performance. (It’s worth noting that we do not explicitly include any Shaberi benchmark-specific training in our shisa-v2 training datasets, so we believe these scores are a fair reflection of improved Japanese language proficiency.)

What we are seeing is that now in 2025, smaller models have a good level of Japanese proficiency, which is very different from the landscape even just one year ago.

That being said, these tests aren’t the be-all and end-all of Japanese language proficiency, and over the past few months we’ve developed three new evals to test specific use cases: Japanese role-playing and character personas, Japanese/English translation, and Japanese instruction following. The last of these is especially challenging; even the best frontier models do not come close to saturating this eval.
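To give a concrete sense of what strict instruction-following evaluation involves, here is a generic, IFEval-style verification sketch. It is purely illustrative and is not the actual shisa-jp-ifeval implementation; the constraints and scoring shown here are assumptions for demonstration only.

```python
import re

# Illustrative rule-based constraint checks in the spirit of IFEval-style
# instruction-following evals (not the actual shisa-jp-ifeval rules).

def max_sentences(response: str, limit: int) -> bool:
    # Count Japanese sentences by the full-width period.
    return response.count("。") <= limit

def contains_keyword(response: str, keyword: str) -> bool:
    return keyword in response

def no_latin_characters(response: str) -> bool:
    # e.g. "answer only in Japanese, without any romaji"
    return re.search(r"[A-Za-z]", response) is None

# Hypothetical constraints attached to a single prompt.
CHECKS = [
    lambda r: max_sentences(r, 3),
    lambda r: contains_keyword(r, "東京"),
    no_latin_characters,
]

def score(response: str) -> float:
    # Fraction of constraints satisfied; a stricter variant would require
    # every constraint to pass for the prompt to count as correct.
    return sum(check(response) for check in CHECKS) / len(CHECKS)

print(score("東京は日本の首都です。人口はとても多いです。"))  # -> 1.0
```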

We’ve been using these evals for internal model development and plan to open source them for the benefit of the Japanese LLM community once they are fully baked. For now, here’s a preview of the scores for two of these tests:

| Model Name | shisa-jp-rp-bench | shisa-jp-ifeval |
|---|---|---|
| gpt-4.5-preview-2025-02-27 | 4.76 | 0.640 |
| openrouter/quasar-alpha | 4.76 | 0.720 |
| deepseek-ai/DeepSeek-V3-0324 | 4.73 | 0.593 |
| gpt-4o-2024-11-20 | 4.71 | 0.533 |
| gemini-2.0-flash | 4.70 | 0.513 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 4.70 | 0.533 |
| google/gemma-3-27b-it | 4.67 | 0.567 |
| shisa-ai/shisa-v2-llama-3.1-8b-preview | 4.66 | 0.253 |
| meta-llama/Llama-3.3-70B-Instruct | 4.65 | 0.347 |
| google/gemma-3-12b-it | 4.64 | 0.347 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 4.63 | 0.447 |
| mistralai/Mistral-Small-24B-Instruct-2501 | 4.55 | 0.387 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 4.55 | 0.380 |
| microsoft/phi-4 | 4.55 | 0.353 |
| meta-llama/Llama-3.1-8B-Instruct | 4.13 | 0.160 |

While the rankings are broadly similar, we believe that more challenging, non-saturated evals are required to more accurately assess and improve Japanese language capabilities. Keep an eye out for more releases and publications from us very soon!

GPT-4o made us a cute llama pic