OdiaGenAI Releases Comprehensive Pre-Trained Dataset for Odia LLM Development

Overview

The widespread adoption of AI technology in recent years has led to transformative changes in numerous industries throughout the world. LLMs tailored to regional languages are essential in India, where barely 10% of the population speaks English proficiently.

The OdiaGenAI initiative is a pioneering endeavor in the context of Odisha and the Odia language, pushing the boundaries of GenAI and LLM technologies for the benefit of Odisha and the global Odia community of approximately 4.37 crore people. It acknowledges the potential of AI to revolutionize regional languages and promote cultural preservation.

We propose a large pre-training dataset in Odia, useful for building a pre-trained LLM. We built a pipeline for data collection and validation using OCR and web scraping, and collected data spanning several domains, including major dialects. The dataset contains more than 300 million Odia tokens. The proposed approach can be easily extended to other Indic and low-resource languages, and the dataset will be freely available for LLM pre-training and research purposes.

 

Dataset Preparation

This rich collection, encompassing over 293 million tokens and nearly 20 million sentences, offers a valuable resource for building a comprehensive Odia LLM.


Figure 1: A visual representation of the diverse datasets listed, each contributing to a comprehensive pool of resources for the Odia language.

Table 1 summarizes the datasets along with their license information, token counts, and sentence counts.


Table 1: Summary of datasets with respective license information, token counts, and sentence counts, providing an overview of the vast array of Odia language resources available for research and analysis.

Data Collection

Odia Newspaper: Newspaper images were collected from the online newspaper Dharitri. The archive hosts a vast collection of digitized newspaper pages in JPG format. Using a custom Python script, we systematically accessed the image files corresponding to each page within a specified range of issue numbers. The script employed concurrent requests to download images in parallel efficiently.


Image Download and Processing: The Python script utilized the requests library to download images directly from the newspaper archive’s server. Each image was saved locally in a designated folder structure, facilitating organization and retrieval. Notably, care was taken to ensure compliance with copyright regulations, and all downloaded images were acquired with proper permissions or under fair use guidelines.
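
To make this step concrete, the sketch below shows a minimal parallel downloader of the kind described; the URL pattern, issue-number range, page count, and output folder are hypothetical placeholders rather than the actual archive layout.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical URL pattern for the digitized newspaper pages.
BASE_URL = "https://archive.example.com/dharitri/{issue}/page_{page}.jpg"
OUT_DIR = "dharitri_pages"

def download_page(issue: int, page: int) -> None:
    """Fetch one newspaper page image and save it under OUT_DIR/<issue>/."""
    url = BASE_URL.format(issue=issue, page=page)
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return  # this page may not exist for the given issue
    issue_dir = os.path.join(OUT_DIR, str(issue))
    os.makedirs(issue_dir, exist_ok=True)
    with open(os.path.join(issue_dir, f"page_{page}.jpg"), "wb") as f:
        f.write(resp.content)

# Download pages for a range of issue numbers in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    for issue in range(1000, 1100):   # hypothetical issue-number range
        for page in range(1, 17):     # assume up to 16 pages per issue
            pool.submit(download_page, issue, page)
```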

OCR: To extract textual content from the newspaper images, we employed OCR technology. Specifically, we utilized the Tesseract OCR engine, integrated via the pytesseract Python library. Tesseract is renowned for its effectiveness in recognizing text from images across diverse languages and fonts. For our study, we configured Tesseract to recognize Odia script, the primary language used in the newspaper publications under analysis.
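
A minimal sketch of this OCR step is shown below, assuming the Tesseract binary and its Odia language data (ori.traineddata) are installed; the file paths are illustrative.

```python
from PIL import Image
import pytesseract

# Requires the Tesseract binary plus the Odia traineddata ("ori").
image = Image.open("dharitri_pages/1000/page_1.jpg")
odia_text = pytesseract.image_to_string(image, lang="ori")

# Save the recognized text alongside the image in plain-text format.
with open("dharitri_pages/1000/page_1.txt", "w", encoding="utf-8") as f:
    f.write(odia_text)
```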

The data collection through OCR is shown in Figure 2.


Figure 2: Pipeline illustrating the systematic process of collecting, downloading, processing, and quality assuring textual data from Odia newspaper images, emphasizing compliance with copyright regulations and rigorous quality assurance measures.

Text Extraction and Preprocessing: Following OCR processing, the extracted text was saved in plain text format (.txt). Additionally, we performed basic pre-processing steps to enhance the accuracy and usability of the extracted text. This included the removal of extraneous characters, normalization of white space, and, where necessary, manual correction of OCR errors.
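
The snippet below sketches what such a cleanup pass might look like; the specific character filters and normalization rules are illustrative approximations, not the exact rules used in the pipeline.

```python
import re
import unicodedata

def clean_ocr_text(raw: str) -> str:
    """Basic post-OCR cleanup: normalize Unicode, drop stray symbols, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    # Keep Odia characters (U+0B00-U+0B7F), digits, danda marks, and common punctuation.
    text = re.sub(r"[^\u0B00-\u0B7F0-9\s.,;:!?()\u0964\u0965-]", "", text)
    # Collapse whitespace runs left behind by multi-column newspaper layouts.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```
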
Data Integrity and Quality Assurance: Throughout the data collection and processing stages, rigorous quality assurance measures were implemented to ensure the integrity and reliability of the extracted textual data. This involved manually verifying all the extracted texts against the original newspaper images to identify and rectify any discrepancies or errors in the pytesseract OCR output.

Odia Lyrics: Utilizing web scraping techniques, Odia lyrics and classical lyrics were systematically collected from two distinct sources: lyricstranslate.com/ and utkalsangeet.wordpress.com/songs/.

The process was streamlined by employing the Scrapy framework, enabling efficient and rapid extraction of data from multiple pages. This approach allowed for a comprehensive gathering of lyrics while optimizing time and resource utilization.

With a focus on expediency, the scraping framework facilitated the extraction of lyrics from each page of the respective websites, ensuring thorough coverage of available content.
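
As a rough illustration of this step, the sketch below shows a minimal Scrapy spider; the start URL points at the Utkal Sangeet song index mentioned above, while the CSS selectors and item fields are assumptions that would have to be adapted to the actual page markup. Such a spider can be run with, for example, scrapy runspider odia_lyrics_spider.py -o lyrics.json.

```python
import scrapy

class OdiaLyricsSpider(scrapy.Spider):
    """Minimal lyrics spider sketch; the selectors below are assumptions."""
    name = "odia_lyrics"
    start_urls = ["https://utkalsangeet.wordpress.com/songs/"]

    def parse(self, response):
        # Follow each song link listed on the index page (selector is hypothetical).
        for href in response.css("div.entry-content a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_song)

    def parse_song(self, response):
        # Collect the lyric text from a song page (selector is hypothetical).
        lines = response.css("div.entry-content p::text").getall()
        yield {
            "url": response.url,
            "lyrics": "\n".join(line.strip() for line in lines if line.strip()),
        }
```
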
District Data: District data was systematically collected utilizing a Python script incorporating the requests and BeautifulSoup libraries. By accessing the designated website, relevant information about various districts was extracted and organized for further analysis.

Iterating through the retrieved paragraphs, the script extracted the text content while ensuring proper formatting. Each paragraph was processed to remove extraneous whitespace characters, ensuring clean and concise data extraction.
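
A minimal sketch of this kind of extraction is given below; the district-portal URL and the output file name are hypothetical placeholders.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical district-information page; the actual source URL is not named here.
url = "https://example-district-portal.odisha.gov.in/about-district"

resp = requests.get(url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

paragraphs = []
for p in soup.find_all("p"):
    # Strip surrounding whitespace and skip empty paragraphs.
    text = p.get_text(separator=" ", strip=True)
    if text:
        paragraphs.append(text)

with open("district_data.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(paragraphs))
```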

Human Validation

Human validation of text in Odia plays a crucial role in ensuring accuracy, cultural sensitivity, and linguistic integrity. It involves having native speakers review and verify text content to ensure its fluency, appropriateness, and adherence to the nuances of the language. The clean dataset we prepared has been validated by native speakers of the language, which distinguishes it from the raw data obtained through web scraping. In this section, we provide sample human-validated Odia data.

Human Validation of Web Data

In Fig. 5, we show sample raw Odia lyric data marked with misspelled words and the corresponding corrected words after human validation. Misspelled words are colored red, and corrected words after human validation are colored green.


Figure 5: The figure displays the human validation process for the Odia Lyrics data, showcasing the raw text obtained from website scraping alongside the corresponding human-validated text. The incorrect words in the scraped data are highlighted in red, and the corrected words in the human-validated text are highlighted in green.

Human Validation of OCR Data

In Fig. 6, we show the misspelled words extracted by OCR and the corresponding corrected words after human validation.


Figure 6: Comparison of incorrect words in the newspaper image, OCR output, and corrected text.

In Fig. 7, we show the original newspaper image, the OCR-extracted data, and the validated data after human validation, respectively.


Figure 7: Three sample newspaper images, their respective OCR outputs, and the corresponding human-validated text after OCR processing are presented in the figure.

Domain Coverage

The datasets selected to train an Odia LLM provide extensive coverage across several domains, guaranteeing the model’s ability to comprehend and produce text in various scenarios, as shown in Fig. 4. General text corpora covering a broad range of subjects, genres, and writing styles, such as CulturaX, OSCAR, Samanantar, and XP3, offer a solid base. These datasets give the LLM a flexible grasp of the Odia language in many settings.

Beyond generic text, specialized datasets such as IndicQA and Paraphrasing concentrate on specific linguistic tasks like question answering and paraphrase generation, improving the LLM’s ability to handle complex language processing tasks precisely. Adding datasets such as PMO, Varta, and Odia Newspaper enhances the model’s comprehension of social concerns, governance dynamics, and current events; these datasets offer insights into political debate, news, and current affairs.

Additionally, datasets like Wiki provide structured information from Wikipedia articles, which aids knowledge extraction and helps the LLM respond accurately to user inquiries on a variety of subjects. The Sentiment Analysis dataset facilitates sentiment analysis tasks by helping the model identify and categorize sentiment polarity in Odia text, improving its capacity to evaluate social media interactions and public opinion.

Furthermore, datasets like OdiEnCorp, Odia Lyrics, and District Data offer a window into Odia culture, language, and regional variety. They help preserve the Odia cultural legacy while also enabling the LLM to understand and produce material appropriate to specific cultural contexts and linguistic subtleties. Overall, the varied topic coverage of these datasets ensures that the trained Odia LLM is adaptable and skilled at handling a variety of NLP tasks.


Figure 4: Pie chart depicting the distribution of datasets across various domains, highlighting the prevalence of datasets in News & Media, Multilingual & Encyclopedia, and Literature domains, among others.

Use Cases

Some of the use cases of the proposed dataset are:

Language Model Training: All datasets provide valuable resources for training LLMs specifically tailored for the Odia language. These models can then be used for various NLP tasks.

Pre-trained model: Using an Odia LLM pre-trained on the proposed dataset, we can develop a chatbot that assists users in Odia, a regional language spoken in the state of Odisha.
Odia tokenizer: The development of an Odia tokenizer enables the processing of Odia-language text into meaningful tokens, facilitating tasks such as named entity recognition and part-of-speech tagging. For example, a software developer building a chatbot for a customer service application in Odia can use the Odia tokenizer to preprocess user queries, extracting relevant entities like names, locations, and dates to provide more accurate responses (see the tokenizer sketch at the end of this section).
General multilingual model: Researchers can create a flexible language model that can comprehend and generate text in multiple languages by training a broad multilingual model on a varied dataset that comprises different dialects and domains of the Odia language. This model can be deployed in educational applications to provide language-learning exercises tailored to different proficiency levels in Odia and other languages spoken in India. Additionally, it can support multilingual content creation for businesses operating in linguistically diverse regions, enabling them to reach a wider audience with localized marketing campaigns and product information.

Cultural Legacy Preservation: The Odia Lyrics dataset helps scholars preserve and comprehend Odisha’s cultural legacy by providing insights into the rich cultural diversity and distinctive vernacular of the state’s musical culture.
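
As noted in the tokenizer use case above, the sketch below shows one possible way to build such a tokenizer by training a SentencePiece model on the pre-training corpus. The corpus file name, vocabulary size, and other hyperparameters are assumptions; the actual tokenizer need not be based on SentencePiece.

```python
import sentencepiece as spm

# Train a subword tokenizer on the collected Odia corpus (file name is illustrative).
spm.SentencePieceTrainer.train(
    input="odia_pretrain_corpus.txt",
    model_prefix="odia_sp",
    vocab_size=32000,
    character_coverage=0.9995,  # keep rare Odia characters in the vocabulary
    model_type="bpe",
)

# Load the trained model and tokenize a sample Odia sentence.
sp = spm.SentencePieceProcessor(model_file="odia_sp.model")
print(sp.encode("ଓଡ଼ିଆ ଭାଷା", out_type=str))
```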

License

The dataset is available freely for research and non-commercial purposes with a CC BY-NC-SA 4.0 license.

Limitations

Limited Coverage of Domains: The dataset lacks representation from certain domains, such as higher mathematics and science. This limitation would restrict the applicability of the pre-trained language model in domains requiring specialized terminology or concepts not present in the dataset.

Sparse Availability of Dialect Odia Data: Data collection efforts may have been hindered by the scarcity of dialect variations in the Odia language. This limitation may impact the model’s ability to capture the full spectrum of linguistic diversity and usage within the Odia-speaking community.

Subjectivity and Time-Intensiveness of Manual Validation: The manual validation process introduces subjectivity and consumes significant time and resources. This limitation not only affects the efficiency of dataset validation but also raises concerns about consistency and reliability across the validation process, potentially impacting the quality of the dataset and the performance of downstream language models.

Future Work

Odia validation: Imagine a tool that checks your Odia writing for correctness, much like spell checkers do for English. We are working on a system that uses state-of-the-art technology to ensure that Odia documents, website content, and social media posts are error-free. This will be a great help for anyone creating Odia content online, and it will also help promote the Odia language itself.

Odia Dictionary: We are building an all-in-one resource for everything Odia. This comprehensive dictionary will be a treasure trove of definitions, synonyms, and usage examples for Odia words. We will combine traditional lexicographic methods with modern technology to create the most thorough and up-to-date Odia dictionary yet.

Odia Spell Checker: Typos happen to everyone, but they do not have to stay. Our Odia spell checker will use intelligent algorithms to find and fix mistakes, making it much easier to write confidently and keep Odia content clean. We also plan to integrate this spell checker into various online tools and applications, so it can be used anywhere you write in Odia.

Contributors
