Making Misal — India's First Competitive Marathi LLM
Originally published on smallstep.ai on April 13, 2024. Read the full post here: smallstep.ai/making-misal
Overview
Misal is India's first competitive Marathi LLM — 7B and 1B parameter models pretrained and finetuned on ~2B Marathi tokens, with a custom SentencePiece tokenizer that fixes Llama's 3–5x token inefficiency on Devanagari script.
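The tokenizer fix boils down to training a Marathi SentencePiece model and folding its pieces into Llama's existing vocabulary. The snippet below is a minimal sketch of that workflow, not the script from the post; the corpus path, output names, and the add_tokens-based merge are assumptions.

```python
# Minimal sketch: learn ~15K Marathi BPE pieces and extend Llama's tokenizer.
# Paths and the merge strategy are illustrative, not the post's exact pipeline.
import sentencepiece as spm
from transformers import LlamaTokenizer

# Train a Marathi SentencePiece BPE model (hypothetical plain-text corpus).
spm.SentencePieceTrainer.train(
    input="marathi_corpus.txt",
    model_prefix="marathi_sp",
    vocab_size=15000,
    character_coverage=1.0,   # keep every Devanagari character
    model_type="bpe",
)

# Load the base Llama tokenizer and the freshly trained Marathi pieces.
base_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
sp = spm.SentencePieceProcessor(model_file="marathi_sp.model")
marathi_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Add only the pieces Llama doesn't already have, then save the merged tokenizer.
new_pieces = [p for p in marathi_pieces if p not in base_tok.get_vocab()]
base_tok.add_tokens(new_pieces)
base_tok.save_pretrained("misal-tokenizer")
```

With the larger vocabulary, whole Marathi words map to one or two pieces instead of a run of byte-level fallbacks, which is where the 3–5x reduction comes from; the base model's embedding matrix then has to be resized to match before any further training.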
Highlights:
- Custom tokenizer — 15K Marathi tokens added to Llama's vocabulary, cutting tokens per word by 3–5x
- Pretraining — LoRA-based continued pretraining of Llama2 7B/1B on ~2B Marathi tokens on A100 (see the sketch after this list)
- Instruction tuning — 200K Marathi instructions curated from Alpaca translations + IndicQuestionGeneration
- Eval — beat GPT-3.5 on Marathi reading comprehension benchmarks
- Open-sourced — models on Hugging Face, tokenizer, pretraining configs, and eval framework
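For the pretraining highlight above, the rough shape of a LoRA continued-pretraining setup is sketched below, assuming the Hugging Face peft library. The rank, alpha, target modules, and the choice to fully train the resized embeddings are illustrative assumptions, not the hyperparameters reported in the full post.

```python
# Minimal sketch of LoRA-based continued pretraining of Llama2 7B with the
# extended Marathi tokenizer. Hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("misal-tokenizer")   # extended tokenizer from the sketch above
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,                  # bf16 keeps the 7B model trainable on an A100
)
model.resize_token_embeddings(len(tokenizer))    # room for the ~15K new Marathi tokens

lora_config = LoraConfig(
    r=64,                                        # illustrative rank
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"], # train the new embedding rows fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # only adapters + saved modules are updated

# Training itself is a standard causal-LM loop (e.g. transformers.Trainer) over
# the ~2B-token Marathi corpus; instruction tuning reuses the same recipe on
# the 200K-instruction dataset.
```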
Read the full write-up
The complete technical breakdown — data curation, tokenizer training, pretraining recipe, finetuning, and evals — lives on the smallstep.ai site: smallstep.ai/making-misal