Olive: An Instruction-Following LLaMA Model for the Low-Resource Odia Language

 

Author: OdiaGenAI Team


Overview

Generative artificial intelligence is changing many aspects of our lives and transforming the way we interact with technology. Large Language Models (LLMs) are significantly impacting the AI community, and the advent of ChatGPT and GPT-4 has led to a rethinking of the possibilities of artificial general intelligence (AGI).

However, most LLMs are trained on English and other high-resource languages, leaving LLMs and their related technologies and services unavailable for many low-resource languages. Although many LLMs are multilingual and include the Odia language, their performance on various tasks (e.g., content generation, question answering) is limited by the small amount of Odia data ingested during training.

Odia Generative AI (OdiaGenAI, in short) is an initiative by a group of researchers to study Generative AI and Large Language Models (LLMs) for the low-resource Odia language. It is supported by Odisha.ML, the global machine learning community of Odias.

In an effort to build Generative AI and LLM-based technologies for the Odia language, OdiaGenAI released "Olive", an instruction-following LLaMA model for the low-resource Odia language, on 19th May 2023.

The dataset and code (training/inference) are freely available for research and non-commercial purposes. The datasets and models are hosted on the Odia Generative AI Hugging Face page.

Dataset

The dataset contains 171K Odia instruction sets. The instruction sets were prepared by:

i) translating popular instruction sets (listed below) from English to Odia using the IndicTrans machine translation library from AI4Bharat (see the sketch after this list);

  • Alpaca
  • Dolly
  • GPT Teacher

ii) preparing a translation instruction set from OdiEnCorp, an English-Odia parallel corpus;

iii) creating a hard-coded instruction set.
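
To make step (i) concrete, below is a minimal sketch of how an Alpaca-style instruction record can be translated field by field. The `translate_en_to_or` function is a hypothetical stand-in for the IndicTrans English-to-Odia pipeline, and the file names are placeholders, not the actual paths used by OdiaGenAI.

```python
import json

def translate_en_to_or(text: str) -> str:
    # Hypothetical stand-in: replace this body with a call to the
    # AI4Bharat IndicTrans English-to-Odia translation pipeline.
    return text

def translate_record(record: dict) -> dict:
    # Translate each field of an Alpaca-style record (instruction/input/output).
    return {
        "instruction": translate_en_to_or(record["instruction"]),
        "input": translate_en_to_or(record["input"]) if record.get("input") else "",
        "output": translate_en_to_or(record["output"]),
    }

# English source instruction set (e.g., Alpaca) in JSON format.
with open("alpaca_data.json", encoding="utf-8") as f:
    english_records = json.load(f)

odia_records = [translate_record(r) for r in english_records]

with open("odia_instruction_set.json", "w", encoding="utf-8") as f:
    json.dump(odia_records, f, ensure_ascii=False, indent=2)
```

The same record structure also covers steps (ii) and (iii): parallel-corpus sentence pairs become translation instructions, and hard-coded items (such as identity questions) are written directly in Odia.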

Training

The first experimental model was trained for 3 epochs on a single GPU (A100, 40 GB VRAM) provided through Colab Pro+, following the Alpaca-LoRA training script. The training hyperparameters are shown in Table 1.


Table 1: Training Hyperparameters
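
As a reference for this setup, here is a minimal sketch of the LoRA configuration used by the Alpaca-LoRA script, built with Hugging Face transformers and peft. The base-model checkpoint id and the adapter hyperparameters below are assumptions (the common Alpaca-LoRA defaults), not necessarily the exact values in Table 1.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "decapoda-research/llama-7b-hf"  # assumed Llama-7b checkpoint id

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,   # 8-bit loading keeps the 7B model within 40 GB VRAM
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

# LoRA adapter configuration; the values are assumed Alpaca-LoRA defaults.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trained
```

Training then proceeds with the standard transformers Trainer over the Odia instruction set; since only the LoRA adapter weights are updated, a single 40 GB GPU is sufficient.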

The training took more than 30 hours and cost 4,200 INR. We had to restart training midway due to issues with saving checkpoints, which is visible in the train/eval curves.

 


Figure 1: Train/Eval Loss curve

The first OdiaGPT model, odiagenAI-model-v1, was released on 19th May 2023 through Hugging Face under a CC BY-NC-SA 4.0 license. The model uses Llama-7b as the base model, fine-tuned on the Odia instruction set for 3 epochs. The Hugging Face model card provides the model description and running instructions. The code (translation, training, and inference) is available on GitHub.

Inference

The inference script is adapted from Alpaca-LoRA, loading the Llama-7b base model together with the odiagenAI-model-v1 weights. Sample inferences are shown in Figures 2-5.
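
A minimal sketch of this setup follows, loading the base model and attaching the released LoRA weights via peft. The repository ids and the Alpaca-style prompt template are assumptions based on Alpaca-LoRA conventions; see the Hugging Face model card for the exact instructions.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, GenerationConfig
from peft import PeftModel

BASE_MODEL = "decapoda-research/llama-7b-hf"   # assumed base checkpoint id
LORA_WEIGHTS = "OdiaGenAI/odiagenAI-model-v1"  # assumed repo id for the released weights

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)  # attach the Odia LoRA adapter
model.eval()

# Alpaca-style prompt template (assumed); the instruction itself is in Odia.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nତୁମେ କିଏ?\n\n"  # "Who are you?" (as in Figure 2)
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        generation_config=GenerationConfig(temperature=0.1, top_p=0.75, num_beams=4),
        max_new_tokens=128,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```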


Figure 2: Sample inference. The question is, "Who are you?" The answer is, "I am Olive, known as a chat assistant and trained by the researchers of OdiaGenAI." Note: the last sentence is repeated in the output.


Figure 3: Sample inference. The question is, "Which are the main cities of India?" The answer is, "Some of the main cities of India are New Delhi, Mumbai, Chennai, Bengaluru, and Kolkata." Note: a few words are incorrect; the output wrongly mentions population growth in the city of Chennai and gives "Hindu" as a city name.

 


Figure 4: Sample inference. The question is, "Who is the prime minister of India?" The answer is, "Narendra Modi is the prime minister of India."


Figure 5: Sample inference. The question is, "Write a poem in Odia." The answer is, "I apologize, I am an artificial intelligence model and I don't have the capability to write a poem, but I can write a poem that can be written in Odia."

Analysis

  • The model is able to follow Odia instructions and generate content in Odia.
  • The model provides correct answers to general-knowledge questions about India.
  • The model still suffers from hallucinations.
  • Due to the lack of Odisha-related context data, the model fails to answer questions relating to Odisha.
  • The model is still not able to handle arithmetic problems or critical reasoning.

Future Plan

The plan includes:

i) Fine-tuning with more instruction sets containing knowledge about Odisha and its local context (literature, food, places, persons, festivals, history, politics, etc.), arithmetic, and critical reasoning.

ii) Continuing fine-tuning with more instruction sets built from validated Odia data, using larger open-source models that support Odia as the base LLM.

iii) Releasing a pre-trained Odia LLM following the BLOOM specification.

iv) Fine-tuning the LLM for specific domains and releasing Odia chatbots (education, public health, tourism, governance) for general usage.

Acknowledgment

We thank the following institutions/organizations for their LLM resources and support.

Team


 

Feel free to contact us for any feedback/suggestions/contributions.