When we released Shisa V2 405B earlier this summer, we teased a “2.1” release. What started as a quick update with some easy fixes and performance gains grew into a bigger refresh of our model lineup, training pipeline, and almost all of our datasets, resulting in substantial across-the-board improvements.
Shisa V2.1 delivers both meaningful uplifts in Japanese performance over our Shisa V2 models at 14B and 70B and brand-new models at 1.2B, 3B, and 8B for local and edge deployments. Open weights are available now on Hugging Face, and API access is rolling out as well.
Shisa V2.1: Made in Japan
Today we’re proud to present the Shisa V2.1 family of Japanese open models, all available for download now on Hugging Face:
| License | Model | Parameters | JA AVG | EN AVG | JA-MT¹ |
|---|---|---|---|---|---|
| LFM | Shisa V2.1 1.2B | 1.2B | 43.4 | 27.6 | 6.69 |
| Llama 3.2 | Shisa V2.1 3B | 3B | 57.9 | 43.2 | 7.55 |
| Apache 2.0 | Shisa V2.1 8B | 8B | 67.8 | 57.8 | 8.93 |
| MIT | Shisa V2.1 14B | 14B | 72.6 | 57.7 | 9.28 |
| Llama 3.3 | Shisa V2.1 70B | 70B | 73.1 | 66.0 | 9.26 |
¹ Japanese MT-Bench scores judged by the standard GPT-4 Turbo judge, for easy comparison to other models.
After our Shisa V2 405B model training, we found that resampling our primary Shisa V2 dataset (Apache 2.0) with our 405B model’s output gave us a noticeable uplift on Shaberi benchmark performance. That’s free real estate, and about 30% of our Shisa V2.1 core synthetic dataset is now based on Shisa V2 405B output.
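For those curious about the mechanics, resampling here just means keeping existing prompts and regenerating the responses with the stronger model. The sketch below shows the general idea against an OpenAI-compatible inference endpoint; the endpoint URL, model name, and ShareGPT-style field layout are illustrative assumptions, not our actual pipeline.

```python
# Minimal sketch: regenerate responses for existing prompts with a stronger model.
# The endpoint URL, model name, and dataset field names are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server

def resample(in_path: str, out_path: str, model: str = "shisa-v2-llama3.1-405b"):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            row = json.loads(line)
            prompt = row["conversations"][0]["value"]  # assumes ShareGPT-style rows
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            # Replace the original response with the 405B model's output
            row["conversations"][1]["value"] = resp.choices[0].message.content
            fout.write(json.dumps(row, ensure_ascii=False) + "\n")
```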
We once again make our improved dataset available under a permissive Apache 2.0 license as part of Shisa.AI’s mission to make all LLMs better at the Japanese language. Having now trained models from 350M to 405B parameters on this dataset, we are more confident than ever that it can make practically any model stronger at Japanese.
Our data updates didn’t end there, and we do have a bit of “secret sauce” in the punch bowl to spice things up. All told, over 80% of the datasets in our Shisa V2.1 data mix are new or improved, including better instruction-following and translation data, plus several new datasets we’ve created to improve handling of Japanese cultural and linguistic nuance. Beyond data, our V2.1 models also add RL and model-merging stages on top of the standard SFT and DPO stages.
Japanese Evals: Built for Real-World Use
Our V2 models remain some of the strongest ever trained in Japan, but our new V2.1 models deliver impressive further improvements in Japanese-language performance. Our V2.1 14B now surpasses the 5X-larger V2 70B, and our V2.1 70B closes in on our 405B at almost 1/6th the parameter count. This translates into huge efficiency gains, as our smaller models can now match larger size classes. We’ve also updated our 8B (which again nearly matches our V2 14B) and added new 1.2B and 3B models that are suitable even for running locally on mobile devices.
These gains were achieved without any benchmark-targeted training, and based on both the broad coverage of our evals and our own production deployments of variants of these models, we believe they reflect genuine improvements in real-world Japanese-language capability.
| Model | JA AVG | EN AVG | Shaberi V2.1 | JA-MT¹ |
|---|---|---|---|---|
| Shisa V2 405B | 74.7 | 67.5 | 8.31 | 9.43 |
| Shisa V2.1 70B | 73.1 | 66.0 | 8.03 | 9.26 |
| Shisa V2.1 14B | 72.6 | 57.7 | 7.71 | 9.28 |
| Shisa V2 70B | 69.0 | 64.3 | 7.68 | 9.07 |
| Shisa V2 14B | 68.7 | 66.7 | 7.62 | 8.69 |
| Shisa V2.1 8B | 67.8 | 57.8 | 7.35 | 8.93 |
| Shisa V2 8B | 58.7 | 55.1 | 6.43 | 7.97 |
| Shisa V2.1 3B | 57.9 | 43.2 | 6.23 | 7.55 |
| Shisa V2.1 1.2B | 43.4 | 27.6 | 5.35 | 6.69 |
As our models have improved in Japanese, we’ve found that legacy judges like GPT-4 and GPT-4o can actually become limiting for model training. We switched to GPT-4.1 for Shisa V2 training, and for Shisa V2.1 we’ve switched again, to much stricter GPT-5.1 judging for both MT-Bench and our Shaberi V2.1 evals. Here’s how the Japanese MT-Bench scores compare under different judges:
| Model | GPT-4-Turbo | GPT-4o | GPT-4.1 | GPT-5.1 |
|---|---|---|---|---|
| shisa-ai/shisa-v2-llama3.1-405b | 9.43 | 8.92 | 9.13 | 7.66 |
| shisa-ai/shisa-v2.1-llama3.3-70b | 9.26 | 8.74 | 8.69 | 7.24 |
| shisa-ai/shisa-v2.1-unphi4-14b | 9.28 | 8.50 | 8.68 | 7.07 |
| shisa-ai/shisa-v2-llama3.3-70b | 9.07 | 8.44 | 8.41 | 6.82 |
| shisa-ai/shisa-v2-unphi4-14b | 8.69 | 8.38 | 8.30 | 6.51 |
| shisa-ai/shisa-v2.1-qwen3-8b | 8.93 | 8.10 | 8.04 | 6.39 |
| shisa-ai/shisa-v2-qwen2.5-7b | 8.16 | 7.62 | 7.31 | 5.79 |
| shisa-ai/shisa-v2-llama3.1-8b | 7.97 | 7.41 | 6.99 | 5.44 |
| shisa-ai/shisa-v2.1-llama3.2-3b | 7.55 | 6.94 | 6.42 | 4.92 |
| shisa-ai/shisa-v2.1-lfm2-1.2b | 6.69 | 6.13 | 5.55 | 4.19 |
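For reference, single-answer MT-Bench-style judging amounts to asking the judge model to grade each response against a rubric and parsing out a numeric score. The sketch below is only illustrative; the prompt wording, parsing, and judge model identifier are simplified assumptions rather than our exact eval harness.

```python
# Illustrative single-answer "LLM-as-judge" scoring; prompt and parsing are simplified.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's question "
    "for helpfulness, accuracy, and fluency in Japanese on a 1-10 scale.\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}\n\n"
    'End your reply with the rating in the form "Rating: [[X]]".'
)

def judge(question: str, answer: str, judge_model: str = "gpt-5.1") -> float | None:
    """Return the judge's 1-10 rating, or None if it could not be parsed."""
    reply = client.chat.completions.create(
        model=judge_model,  # placeholder identifier for the judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    ).choices[0].message.content
    m = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(m.group(1)) if m else None
```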
Besides using variants of our models in production, most of our team members also quite literally talk to our models every day, and we’ve found that even the best of the common LLM evals tell only part of the story. During Shisa V2 development, we created some new benchmarks to better capture important real-world Japanese language use cases, and we’re happy to finally share some of them:
- shisa-jp-ifeval - Inspired by Google’s IFEval, but designed specifically for Japanese. Where IFEval tests English-centric constraints like spelling and capitalization, ours tests verifiable Japanese-specific constraints: mora counting, script selection (hiragana/katakana/kanji), honorific usage, and more. Obviously we’re biased, but we find an IFEval that is carefully constructed and targeted to the Japanese language more relevant than the alternatives we evaluated.
- shisa-jp-rp-bench - Character adherence and natural dialogue in multi-turn conversations are key LLM use cases, and we created a benchmark based on Aratako’s Japanese-RP-Bench that uses pairwise LLM judging with a Bradley-Terry model to rate our models more reliably on these qualities.
- shisa-jp-tl-bench - Our translation benchmark compares model outputs pairwise against a frozen baseline, aggregated via Bradley-Terry scoring (a minimal sketch of this kind of aggregation follows this list). This approach gives us stable, fine-grained rankings even between closely matched models - essential for iterating on translation quality during development.
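For anyone who wants a feel for the aggregation side, a Bradley-Terry fit derives a latent strength for each model from pairwise win counts. Below is a minimal, generic MM-style implementation for illustration; it is not the exact code used in our benchmarks, and real use needs extra handling for ties and confidence intervals.

```python
# Minimal Bradley-Terry fit from pairwise judge outcomes (illustrative only).
# wins[i][j] = number of comparisons where model i beat model j.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200, tol: float = 1e-8) -> np.ndarray:
    n = wins.shape[0]
    p = np.ones(n)                      # initial strengths
    games = wins + wins.T               # total head-to-head comparisons per pair
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            num = wins[i].sum()                 # total wins for model i
            den = games[i] / (p[i] + p)         # MM update denominator terms
            den[i] = 0.0
            new_p[i] = num / den.sum()
        new_p /= new_p.sum()                    # normalize strengths
        if np.abs(new_p - p).max() < tol:
            p = new_p
            break
        p = new_p
    return p  # relative strengths; larger means stronger

# Example: wins = np.array([[0, 7, 9], [3, 0, 6], [1, 4, 0]]); bradley_terry(wins)
```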
We’re now releasing all three benchmarks publicly. Each addresses gaps we found in existing Japanese evaluation tools, and we hope they’ll be useful for the broader Japanese LLM community.
Cross-Lingual Token Leakage
One issue we’ve found surprisingly underexplored is what we call Cross-Lingual Token Leakage (CLTL) - when models output stray non-Japanese words or sub-words in Japanese responses. Even as Japanese capabilities in LLMs have improved, we’ve continued to observe this behavior not just in overseas multilingual models but also in many domestically trained Japanese models - often as Chinese or English fragments appearing mid-sentence.
Detecting CLTL reliably is harder than it sounds. Japanese and Chinese share CJK ideograph code points, and many English words appear legitimately in Japanese text (“AI”, “Google”, acronyms). Even frontier models struggle to identify wrong-language tokens in their own output because of how LLMs “see” and understand text.
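To make the difficulty concrete, here is the kind of naive Unicode-range heuristic one might start with. It is purely illustrative and not our actual detector: an allowlist is needed for legitimate English, and Japanese kanji cannot be separated from Chinese hanzi by code point alone.

```python
# Naive script-range heuristic for spotting possible cross-lingual leakage.
# Illustrative only: real detection needs far more care (shared kanji/hanzi code
# points, loanwords, proper nouns); this is not our production detector.
import re

# Latin-alphabet runs; a small allowlist covers legitimate English-in-Japanese usage.
LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z0-9\-\.]*")
ALLOWLIST = {"AI", "Google", "API", "GPU", "LLM"}  # placeholder allowlist

# Hangul is never expected in Japanese output, so it is a strong leakage signal.
HANGUL = re.compile(r"[\uac00-\ud7af\u1100-\u11ff]")

def flag_leakage(text: str) -> list[str]:
    """Return suspicious fragments in a supposedly-Japanese response."""
    suspects = [m.group(0) for m in LATIN_RUN.finditer(text)
                if m.group(0) not in ALLOWLIST and len(m.group(0)) > 2]
    suspects += HANGUL.findall(text)
    # Kanji vs. simplified-Chinese hanzi cannot be told apart by code point alone,
    # which is exactly why this kind of heuristic falls short in practice.
    return suspects
```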
For V2.1, we developed new evaluation methods specifically designed to identify and measure CLTL, and we can finally quantify the behavior we’ve observed:
| Base Model | Shisa V2.1 Model | Base Leak % | Shisa V2.1 Leak % | Improvement |
|---|---|---|---|---|
| Llama 3.2 3B | Shisa V2.1 3B | 11.48% | 0.24% | 47.8× |
| LFM2 1.2B | Shisa V2.1 1.2B | 4.32% | 0.32% | 13.5× |
| Qwen 3 8B | Shisa V2.1 8B | 2.18% | 0.44% | 5.0× |
| Llama 3.3 70B | Shisa V2.1 70B | 1.90% | 0.36% | 5.3× |
| Phi 4 | Shisa V2.1 14B | 0.12% | 0.06% | 2.0× |
Eliminating language confusion is essential for production Japanese applications - translation, customer service, and content generation are all non-starters if a model frequently outputs wrong-language tokens. Our Shisa V2.1 models show marked improvements over their base models (and over almost all other models we’ve tested). We’ll be publishing more details on our CLTL detection methods in a future writeup.
Trained on AMD
Primary compute for Shisa V2.1 was provided by AMD through the AMD Developer Cloud. All our final V2.1 models (and several hundred ablations) were trained on AMD MI300X GPUs using Axolotl, with RL stages done in TRL and merging via mergekit.
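Since TRL came up, here is roughly what a DPO preference-tuning stage looks like with it. The model name, dataset, and hyperparameters are placeholders, exact argument names vary across TRL versions, and this is not our actual training configuration.

```python
# Rough sketch of a TRL DPO stage (illustrative; argument names vary by TRL version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "shisa-ai/shisa-v2.1-qwen3-8b"  # placeholder starting checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects a preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

args = DPOConfig(
    output_dir="dpo-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,  # DPO temperature
)

trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
                     processing_class=tokenizer)
trainer.train()
```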
These models are the first major models in Japan trained on AMD hardware. And while we are big fans of NVIDIA at Shisa.AI, competition is good for everyone, and we’re glad to see that a genuinely viable alternative now exists. We’re happy to report that our AMD training experience has been largely problem-free.
Along the way, we’ve also been contributing back to the ecosystem, especially when it comes to Mixture of Experts (MoE) training. We ported the HF megablocks kernel to AMD as shisa-ai/megablocks-hip, achieving 4-5x performance gains for MoE training on MI300X. We’ve also developed an Aux-Loss-Free Balancing plugin for multiple MoE architectures for Axolotl, and a new distributed-friendly Muon-clip optimizer implementation. The eagle-eyed may have noticed that the Shisa V2.1 family is currently populated solely with dense models. 😉
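For context, the aux-loss-free approach replaces the usual auxiliary load-balancing loss with a per-expert bias that is added to the router scores used for top-k expert selection (gating weights are still computed from the unbiased scores), and the bias is nudged after each step toward balanced expert load. The PyTorch sketch below is a generic illustration of that idea under our own simplifying assumptions, not the code in our Axolotl plugin.

```python
# Generic sketch of auxiliary-loss-free MoE load balancing (not our Axolotl plugin).
import torch

class BiasBalancedRouter(torch.nn.Module):
    def __init__(self, hidden: int, n_experts: int, top_k: int = 2, gamma: float = 1e-3):
        super().__init__()
        self.gate = torch.nn.Linear(hidden, n_experts, bias=False)
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.top_k, self.gamma = top_k, gamma

    def forward(self, x: torch.Tensor):
        # x: [tokens, hidden]
        scores = torch.sigmoid(self.gate(x))                    # [tokens, n_experts]
        _, idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        weights = torch.gather(scores, -1, idx)                 # gate with unbiased scores
        weights = weights / weights.sum(dim=-1, keepdim=True)
        if self.training:
            with torch.no_grad():                               # nudge bias toward balanced load
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, idx.reshape(-1),
                                  torch.ones_like(idx, dtype=load.dtype).reshape(-1))
                # Overloaded experts get a lower bias, underloaded experts a higher one.
                self.expert_bias += self.gamma * torch.sign(load.mean() - load)
        return idx, weights
```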
Open For Business
The biggest change for Shisa.AI since our last releases is that we’ve spent the summer cranking away on the product side as well. We’ve released our chotto.chat translation product (powered by Shisa models, of course) along with a number of other translation offerings, including real-time event subtitling and enterprise translation. We now offer text, translation, and voice APIs as well.
We understand it’s still hard and expensive to try our best and biggest models, so we’re happy to announce that we’re rolling out Shisa model API access, both via our Shisa Talk platform and soon via OpenRouter. If you need dedicated capacity, on-premise hosting, or even custom training, drop us a line.
Our Shisa V2 models are also now approved METI GENIAC domestic models, and our LLM inference is proudly Japan-hosted, both to deliver the best latency for real-time applications and to support Japanese data residency and regulatory compliance.
It’s been almost exactly two years since our first Shisa V1 model release, and while the pace of AI progress has been breathtaking, our basic mission remains the same. It’s now clear to us that simply training the best Japanese models is only the first step in making sure that the Japanese public and society at large benefit from these advances. So we’re using this release as an opportunity to announce that Shisa.AI is now officially “open for business.”

Compute for Shisa V2.1 training was provided by AMD. Additional compute and credits for ablations, data processing, and evaluations were provided by Chutes, Hot Aisle, Emerald Compute System, Lambda, and Strata. We also thank AWS Activate, Google for Startups, and NVIDIA Inception for their ongoing support.
Per the Llama Community License Agreements, the official names of the Llama-based models are “Llama 3.2 Shisa V2.1 3B” and “Llama 3.3 Shisa V2.1 70B”.