Overview
As part of its new series of Small Language Models (SLMs) for Indic languages, OdiaGenAI has released Hindi-Gemma-2B-instruct, a 2-billion-parameter model supervised fine-tuned (SFT) on a 187k-instruction Hindi dataset.
Gemma-2B was chosen as the base model because:
- Its 2B size makes it suitable for CPU and on-device applications.
- Its tokenizer is more efficient on Indic languages than those of other LLMs.
Dataset
We used a 187k Hindi instruction set that combines instruction sets for several NLP tasks, which broadens the Hindi-tuned Gemma model's capabilities. The dataset is a mix of three sources (a sketch of assembling such a mix follows the list):
- Alpaca 67K
- samanantar_100K_hindi
- OdiaGenAI’s Hindi 20K QA Pairs dataset.
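For illustration, here is a minimal sketch of how such a mix could be assembled with the Hugging Face `datasets` library. The file names and schema are hypothetical placeholders, not the released artifacts:

```python
# Hypothetical sketch: the file names below are placeholders, not the
# released artifacts. Assumes all three files already share a common
# schema (e.g. instruction/input/output columns).
from datasets import load_dataset, concatenate_datasets

alpaca = load_dataset("json", data_files="alpaca_hindi_67k.json", split="train")
samanantar = load_dataset("json", data_files="samanantar_100k_hindi.json", split="train")
qa_pairs = load_dataset("json", data_files="odiagenai_hindi_20k_qa.json", split="train")

# Concatenate and shuffle so the different tasks are interleaved during SFT.
mixed = concatenate_datasets([alpaca, samanantar, qa_pairs]).shuffle(seed=42)
print(len(mixed))  # ~187k examples
```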
Tokenizer
The Gemma tokenizer is based on byte-level Byte-Pair Encoding (BPE) and was found to be more efficient for Indic languages than other tokenizers.
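One common way to check this is to compare tokenizer fertility (tokens per word) on Hindi text. A minimal sketch follows, assuming access to the (gated) Hugging Face checkpoints named below; the comparison checkpoint is an illustrative choice, not one used in this work:

```python
# Compare token counts on the same Hindi sentence across tokenizers.
# Checkpoint names are illustrative; both repos are gated on the Hub.
from transformers import AutoTokenizer

hindi_text = "भारत एक विशाल देश है और यहाँ अनेक भाषाएँ बोली जाती हैं।"
words = len(hindi_text.split())

for name in ["google/gemma-2b", "meta-llama/Llama-2-7b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(hindi_text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens, fertility = {len(ids) / words:.2f}")
```

Fewer tokens per word means more Hindi text fits in the context window and generation costs fewer steps.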
Training
We used an NVIDIA RTX A4000 GPU with 16 GB of memory. The model was trained for 5 epochs, which took 17 hours and 41 minutes. The training hyperparameters are shown in Table 1.
Table 1: Training Hyperparameters
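The post does not spell out the fine-tuning recipe, and fitting a 2B model on a 16 GB GPU usually calls for a parameter-efficient method. Below is a minimal sketch using 4-bit QLoRA with bitsandbytes and PEFT; the LoRA settings are placeholders, not the values from Table 1:

```python
# Sketch of parameter-efficient SFT setup; all hyperparameters below are
# illustrative assumptions, not the values reported in Table 1.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to fit within a 16 GB GPU.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", quantization_config=bnb, device_map="auto"
)

# Attach LoRA adapters to the attention projections.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, train with transformers.Trainer for 5 epochs, per the report.
```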
Inference
Below are a few inference samples from the model.
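To reproduce such samples, here is a minimal loading-and-generation sketch; the Hugging Face repo id is an assumption based on the model name, not a confirmed location:

```python
# Minimal generation sketch; the repo id below is a hypothetical guess
# derived from the model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "OdiaGenAI/hindi-gemma-2B-instruct"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = "भारत की राजधानी क्या है?"  # "What is the capital of India?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```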
Table 2: Good quality outputs
Table 3: Outputs with hallucination and errors
License
Hindi-Gemma-2B-instruct is released under the CC-BY-NC-4.0 license.
Availability
Conclusion
OdiaGenAI has released its first Gemma-series Indic LLM, in Hindi (Hindi-Gemma-2B-instruct). Future work includes:
- Adding more data for SFT
- Epoch-wise (1-10) analysis for Hindi, with both human and automatic evaluation
- Extending to multilingual capabilities
Contributors
Acknowledgement
We express our gratitude to Dr. Prasad Reddy of Data Care LLC, USA, and his team for providing the necessary infrastructure support.