Shisa 7B released
A bilingual general-purpose chat model built with a synthetic-data-driven approach.
Shisa 7B (shisa-7b-v1) is a bilingual Japanese and English (JA/EN) general-purpose chat model that aims to achieve strong Japanese language performance while retaining robust English capabilities, using a synthetic-data-driven approach.
This model is based on Mistral 7B with a custom JA-optimized extended tokenizer that is >2X more efficient in Japanese than Mistral's original tokenizer. The base model was pre-trained on an additional 8B primarily-Japanese tokens. It was then fine-tuned with an expanded, machine-translated version of airoboros-3.1, a set of the highest-scoring items from ultrafeedback_binarized, and additional airoboros data generated directly in the target languages.
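The ">2X more efficient" tokenizer claim can be checked empirically by measuring characters encoded per token on Japanese text. Below is a minimal sketch of that measurement; the Hugging Face model ids in the commented usage (`mistralai/Mistral-7B-v0.1`, `augmxnt/shisa-7b-v1`) are assumptions about where the tokenizers are hosted, not confirmed by this post.

```python
def chars_per_token(text: str, token_ids: list) -> float:
    """Average characters encoded per token: higher means a more
    efficient tokenizer for that text (fewer tokens per character)."""
    return len(text) / len(token_ids)

# Hypothetical usage with Hugging Face transformers (repo ids assumed):
#
#   from transformers import AutoTokenizer
#   ja = "日本語のテキストをトークン化して効率を比較します。"
#   base = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
#   shisa = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")
#   ratio = (chars_per_token(ja, shisa(ja)["input_ids"])
#            / chars_per_token(ja, base(ja)["input_ids"]))
#   # A ratio above 2.0 would be consistent with the >2X efficiency claim.
```

A higher characters-per-token figure matters in practice because it directly reduces both inference cost and effective context consumption for Japanese input.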
We also release our base model, datasets, and pipeline code under a permissive Apache 2.0 license, which allows use for any purpose, commercial or otherwise. We are also in the process of publishing extended writeups and more details of our process, including ablation results, testing methodology, and key findings, on our project wiki; these may be of interest to fellow researchers.