Shisa 7B released
A bilingual general-purpose chat model built with a synthetic-data-driven approach.
Shisa 7B (shisa-7b-v1) is a bilingual Japanese and English (JA/EN) general-purpose chat model that aims to achieve strong Japanese language performance while retaining robust English capabilities, using a synthetic-data-driven approach.
This model is based on Mistral 7B with a custom JA-optimized extended tokenizer that is >2X more efficient in Japanese than Mistral's original tokenizer. The base model was pre-trained on an additional 8B primarily-Japanese tokens. It was then fine-tuned with an expanded, machine-translated version of airoboros-3.1, a set of the highest-scoring items from ultrafeedback_binarized, and additional airoboros data generated directly in the target languages.
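The ">2X more efficient" tokenizer claim can be checked empirically by measuring characters encoded per token on Japanese text. Below is a minimal sketch of that measurement; the Hugging Face model ids in the commented usage (`mistralai/Mistral-7B-v0.1`, `augmxnt/shisa-7b-v1`) are assumptions about where the tokenizers are hosted, not confirmed by this post.

```python
def chars_per_token(text: str, token_ids: list) -> float:
    """Average characters encoded per token: higher means a more
    efficient tokenizer for that text (fewer tokens per character)."""
    return len(text) / len(token_ids)

# Hypothetical usage with Hugging Face transformers (repo ids assumed):
#
#   from transformers import AutoTokenizer
#   ja = "日本語のテキストをトークン化して効率を比較します。"
#   base = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
#   shisa = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")
#   ratio = (chars_per_token(ja, shisa(ja)["input_ids"])
#            / chars_per_token(ja, base(ja)["input_ids"]))
#   # A ratio above 2.0 would be consistent with the >2X efficiency claim.
```

A higher characters-per-token figure matters in practice because it directly reduces both inference cost and effective context consumption for Japanese input.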
We also release our base model, datasets, and pipeline code under a permissive Apache 2.0 license, which allows use for any purpose, commercial or otherwise. We are also in the process of publishing extended writeups and more details of our process, including ablation results, testing methodology, and key findings, on our project wiki; these may be of interest to fellow researchers.