Making Misal — India's First Competitive Marathi LLM
Originally published on smallstep.ai on April 13, 2024. Read the full post here: smallstep.ai/making-misal
Overview
Misal is India's first competitive Marathi LLM — 7B and 1B parameter models pretrained and finetuned on ~2B Marathi tokens, with a custom SentencePiece tokenizer that fixes Llama's 3–5x token inefficiency on Devanagari script.
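The tokenizer fix boils down to training a Marathi SentencePiece model and folding its pieces into Llama's existing vocabulary. The snippet below is a minimal sketch of that workflow, not the script from the post; the corpus path, output names, and the add_tokens-based merge are assumptions.

```python
# Minimal sketch: learn ~15K Marathi BPE pieces and extend Llama's tokenizer.
# Paths and the merge strategy are illustrative, not the post's exact pipeline.
import sentencepiece as spm
from transformers import LlamaTokenizer

# Train a Marathi SentencePiece BPE model (hypothetical plain-text corpus).
spm.SentencePieceTrainer.train(
    input="marathi_corpus.txt",
    model_prefix="marathi_sp",
    vocab_size=15000,
    character_coverage=1.0,   # keep every Devanagari character
    model_type="bpe",
)

# Load the base Llama tokenizer and the freshly trained Marathi pieces.
base_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
sp = spm.SentencePieceProcessor(model_file="marathi_sp.model")
marathi_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Add only the pieces Llama doesn't already have, then save the merged tokenizer.
new_pieces = [p for p in marathi_pieces if p not in base_tok.get_vocab()]
base_tok.add_tokens(new_pieces)
base_tok.save_pretrained("misal-tokenizer")
```

With the larger vocabulary, whole Marathi words map to one or two pieces instead of a run of byte-level fallbacks, which is where the 3–5x reduction comes from; the base model's embedding matrix then has to be resized to match before any further training.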
Highlights:
- Custom tokenizer — 15K Marathi tokens added to Llama's vocabulary, cutting tokens per word by 3–5x
- Pretraining — LoRA-based continued pretraining of Llama2 7B/1B on ~2B Marathi tokens on A100 (see the sketch after this list)
- Instruction tuning — 200K Marathi instructions curated from Alpaca translations + IndicQuestionGeneration
- Eval — beat GPT-3.5 on Marathi reading comprehension benchmarks
- Open-sourced — models on Hugging Face, tokenizer, pretraining configs, and eval framework
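For the pretraining highlight above, the rough shape of a LoRA continued-pretraining setup is sketched below, assuming the Hugging Face peft library. The rank, alpha, target modules, and the choice to fully train the resized embeddings are illustrative assumptions, not the hyperparameters reported in the full post.

```python
# Minimal sketch of LoRA-based continued pretraining of Llama2 7B with the
# extended Marathi tokenizer. Hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("misal-tokenizer")   # extended tokenizer from the sketch above
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,                  # bf16 keeps the 7B model trainable on an A100
)
model.resize_token_embeddings(len(tokenizer))    # room for the ~15K new Marathi tokens

lora_config = LoraConfig(
    r=64,                                        # illustrative rank
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"], # train the new embedding rows fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # only adapters + saved modules are updated

# Training itself is a standard causal-LM loop (e.g. transformers.Trainer) over
# the ~2B-token Marathi corpus; instruction tuning reuses the same recipe on
# the 200K-instruction dataset.
```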
Read the full write-up
The complete technical breakdown — data curation, tokenizer training, pretraining recipe, finetuning, and evals — lives on the smallstep.ai site: smallstep.ai/making-misal