We’re proud to announce Shisa V2, the latest generation of our bilingual Japanese-English language models from Shisa.AI. Over the past few months, our team has been pushing incredibly hard on the frontier of Japanese language training, and today we’re excited to share our results.
Shisa V2 sets a new SOTA on Japanese benchmarks across all size classes (7B–70B), improving the JA benchmark average by up to +32.6% over the base models. Open weights, commercial-ready, available now on Hugging Face.
Introducing Shisa V2
The Shisa V2 model family sets a new standard for open-weight Japanese language models. With parameter sizes ranging from 7B to 70B, Shisa V2 delivers exceptional performance across the board, achieving state-of-the-art or near state-of-the-art results in Japanese benchmarks at every model size.
License | Model | Parameters | Context Length | JA AVG | EN AVG |
---|---|---|---|---|---|
Apache 2.0 | shisa-v2-qwen2.5-7b | 7B | 128K/8K | 71.06 | 54.86 |
Llama 3.1 | shisa-v2-llama3.1-8b | 8B | 128K | 70.83 | 54.75 |
Apache 2.0 | shisa-v2-mistral-nemo-12b | 12B | 128K | 72.83 | 53.33 |
MIT | shisa-v2-unphi4-14b | 14B | 16K | 75.89 | 60.10 |
Apache 2.0 | shisa-v2-qwen2.5-32b | 32B | 128K/8K | 76.97 | 67.41 |
Llama 3.3 | shisa-v2-llama3.3-70b | 70B | 128K | 79.72 | 67.71 |
Consistent Quality, Robust Scaling
While we maintain our focus on improving Japanese language capabilities in smaller models suited to local and edge applications, we found that our improved data and training pipeline also scaled robustly to larger, more capable models.
Recent open-weight LLMs have greatly improved their baseline Japanese capabilities, but we found that there’s still significant value in applying additional training.
Shisa V2 | Base Model | Base JA AVG | Shisa V2 JA AVG | Improvement |
---|---|---|---|---|
shisa-v2-qwen2.5-7b | Qwen 2.5 7B Instruct | 65.30 | 71.06 | +8.8% |
shisa-v2-llama3.1-8b | Llama 3.1 8B Instruct | 53.43 | 70.83 | +32.6% |
shisa-v2-mistral-nemo-12b | Mistral Nemo Instruct 2407 | 58.44 | 72.83 | +24.6% |
shisa-v2-unphi4-14b | Microsoft Phi-4 (Unsloth) | 72.47 | 75.89 | +4.7% |
shisa-v2-qwen2.5-32b | Qwen 2.5 32B Instruct | 66.79 | 76.97 | +15.3% |
shisa-v2-llama3.3-70b | Llama 3.3 70B Instruct | 72.75 | 79.72 | +9.6% |
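For clarity, the Improvement column is the relative gain in JA AVG over the corresponding base model. A quick sanity check of the 8B figure (a minimal sketch, assuming a simple relative-percentage calculation over the table values):

```python
# Relative JA AVG improvement of shisa-v2-llama3.1-8b over Llama 3.1 8B Instruct.
# Values are taken from the table above; assumes a simple relative percentage.
base_ja_avg, shisa_ja_avg = 53.43, 70.83
improvement = (shisa_ja_avg - base_ja_avg) / base_ja_avg * 100
print(f"+{improvement:.1f}%")  # +32.6%
```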
Built for Real-world Performance
Shisa V2 models excel not only on traditional metrics but also on new Japanese evaluations we developed to measure important, previously unmeasured downstream use cases:
- shisa-jp-ifeval: Advanced instruction-following tasks in Japanese
- shisa-jp-rp-bench: Personas, role-play, and multi-turn conversational capabilities
- shisa-jp-tl-bench: High-quality Japanese-English translation proficiency
We will soon open-source these benchmarks to further benefit the broader Japanese LLM community.
Open and Accessible
We’re committed to openness. All Shisa V2 models are released under the most permissive licenses their base models allow, and wherever possible we’ve released Apache 2.0 or MIT licensed models to ensure unrestricted research, experimentation, and commercial use.
For those who are unaware, Japan is somewhat unique in that it has explicit copyright carve-outs for AI training. For more on this, see our writeup from last year on Copyright and AI Training Data in Japan.
Try Shisa V2 Now
The entire Shisa V2 model lineup is now available on Hugging Face (Shisa V2 collection):
- shisa-v2-qwen2.5-7b
- shisa-v2-llama3.1-8b
- shisa-v2-mistral-nemo-12b
- shisa-v2-unphi4-14b
- shisa-v2-qwen2.5-32b
- shisa-v2-llama3.3-70b
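If you want to run one of the models locally, here is a minimal sketch using Hugging Face transformers; the repository id below assumes the shisa-ai namespace, so please check the collection page for the exact model ids:

```python
# Minimal example: load a Shisa V2 model with Hugging Face transformers and
# generate a Japanese reply. The repo id assumes the shisa-ai namespace.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v2-llama3.1-8b"  # assumed repo id; see the collection page

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# "Please briefly explain Japan's four seasons."
messages = [{"role": "user", "content": "日本の四季について簡単に説明してください。"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```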
UPDATE: Thanks to Jon Durbin and Chutes.ai, you can now also try chatting with our best Shisa V2 70B model!
We have also spent considerable time and effort testing different data and training approaches, and we will be publishing additional reports and documentation in the coming weeks, so stay tuned for more on how we’re pushing the boundaries of Japanese language modeling.
Compute for training these models was provided by Ubitus K.K. and METI GENIAC.