- Tech Mahindra announces “Project Indus,” billed as a “Made in India” ChatGPT.
- The first model will have 7 billion parameters, be open source, and support 40 Hindi dialects.
- It will help represent Indic languages, preserve them, and ensure cultural sensitivity while being cost-effective.
ChatGPT has taken the world by storm and has undeniably made our lives easier. Whether you need ideas for a project, help with an essay, or are simply curious about a topic, you just ask the chatbot and it generates a response in seconds. Such is the reach of global technology today, and India is not far behind. Tech Mahindra recently announced the launch of Project Indus, a foundational model for Indic languages.
Large language models (LLMs) like OpenAI’s GPT models do possess multilingual capabilities, but they have been trained predominantly on English datasets. This limits their proficiency in understanding and generating content in Indic languages. Thus came the idea for an open-source Indic LLM that could be hugely beneficial for India. This could potentially prove to be one of Tech Mahindra’s most important projects. The company hasn’t disclosed the cost or the launch date, but as per Nikhil Malhotra, Global Head of Makers Lab at Tech Mahindra, the aim is to build a 7-billion-parameter LLM to begin with.
About The Indic LLM
The model, which would be the largest Indic LLM, may potentially serve 25% of the world’s population, according to Tech Mahindra’s MD and CEO, CP Gurnani. Project Indus is expected to support 40 different Hindi dialects initially, with more languages and dialects added subsequently. Tech Mahindra’s main objective is to develop an LLM for text continuation and then extend it to dialogue generation.
“We understand that much work has been done on the Indic suite like Bhashini and AI4 Bharat, etc., but a foundation model still needs to be developed. As we continue to develop the model, we are constantly learning and improving the process. Our interface could have voice and textual information; however, we haven’t considered incorporating a chat interface like ChatGPT yet. Once we are clear that the model performs well and generates dialects well, we would launch it in the open source,” said Nikhil Malhotra.
Benefits Of Project Indus
OpenAI’s GPT models have undoubtedly been groundbreaking. Hence, developing an LLM designed primarily for Indic languages could be highly beneficial for India.
It is essential for the Indic LLM to understand the nuances of local cultures and contexts for effective communication. Project Indus can be designed to prioritise cultural sensitivity, ensuring that the generated content is respectful of local customs and norms.
It could also democratise AI and cater to the large population of non-English speakers in the country.
“One of the benefits of a foundation model is its versatility. For instance, a language model is capable of performing multiple tasks such as Q&A, fill-in-the-blanks, etc. using the same model. This approach is beneficial for specialised healthcare, retail, and tourism industries,” Nikhil Malhotra added.
GPT models break Indic-language text into significantly more tokens than equivalent English text, so the cost per request is much higher. Hence, an Indic LLM, built with a tokenizer suited to these scripts, offers a more cost-effective solution for generating Indic language content without such token pricing constraints.
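A rough way to see why Indic text costs more tokens: byte-level BPE tokenizers of the kind used by the GPT family start from UTF-8 bytes, and Devanagari characters take 3 bytes each versus 1 byte for ASCII, while far fewer merges are learned for low-resource scripts. The stdlib-only sketch below uses bytes-per-character as a crude proxy for this effect (the sample sentences are our own illustration, not from Tech Mahindra; an actual token count would depend on the specific tokenizer):

```python
# Crude proxy for byte-level BPE token cost: average UTF-8 bytes
# per character. ASCII English is 1 byte/char; Devanagari (Hindi)
# is 3 bytes/char, so before any merges the raw byte count alone
# is ~3x higher for Hindi text of the same length.

english = "Hello, how are you?"
hindi = "नमस्ते, आप कैसे हैं?"  # roughly the same greeting in Hindi

def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character of `text`."""
    return len(text.encode("utf-8")) / len(text)

print(f"English: {utf8_bytes_per_char(english):.2f} bytes/char")
print(f"Hindi:   {utf8_bytes_per_char(hindi):.2f} bytes/char")
```

This understates the real gap, since tokenizers trained mostly on English also merge English byte sequences far more aggressively, but it shows why a tokenizer designed for Indic scripts is central to the cost argument.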
Many of these languages and dialects lack adequate digital representation. By representing them, Project Indus also aids in preserving them.
Building The Indic Datasets
The effectiveness of any AI model relies on the quality of its datasets. While English datasets are ample and readily accessible, few exist for Indic languages and dialects. Several stakeholders, including the Indian government, are actively working to create such datasets to address this gap. “Despite various efforts, in India, datasets for languages other than Hindi are scarce and incomplete. Additionally, even Hindi data is fragmented,” Malhotra mentioned.
Last year, Prime Minister Narendra Modi launched the Bhashini project, which aims to develop Indian-language translation technologies and crowd-source voice datasets in multiple Indian languages to enhance the availability and accessibility of digital services in local languages. Educational institutions like the Indian Institute of Science (IISc) and IIT Madras (AI4Bharat), along with Microsoft, are also involved in building datasets for Indic languages.
Bhasha Daan
Tech Mahindra is sourcing information from avenues commonly available on the internet, such as Common Crawl, newspapers, Wikipedia, books written in specific dialects, and more. They are also referring to YouTube descriptions, since information on dialects is primarily available through YouTube videos or spoken-language samples. Still, gathering enough dialect data is a key challenge for the tech giant. To address this, they have established a portal to collect a “bhasha daan” (language donation) from Indians who speak these dialects.
“By clicking ‘Make a Contribution’ on our website, you will find a user-friendly interface with all the dialects in which we collect data. Once you select a dialect, you can listen to a sample voice recording of how Hindi is spoken in that particular dialect. Users can then scroll down and anonymously record a sentence by clicking the record button.” The IT giant’s MD and CEO, CP Gurnani, took to X (formerly Twitter) to request contributions from the general public for the creation of Indic datasets:
“Project Indus: An AI-powered present from @Tech_Mahindra to the millions of Hindi speakers of India.. But we humbly request a bit of #BhashaDaan from you.. Please lend us your expressions, your vocabulary, your conversations.. And help us train India’s biggest indigenous LLM… pic.twitter.com/6kDql3qQno” — CP Gurnani (@C_P_Gurnani) August 21, 2023
While dealing with local languages, it will be essential for Tech Mahindra to put adequate guardrails in place to avoid bias and ensure Project Indus’ success. “When we collect the data at the first phase, it is essential to realise that this data would have to go through cleaning to ensure there is no bias. To address this challenge, we would be using both human annotation and automatic techniques to ensure there is no racial, ethnic, or gender bias, etc.,” Malhotra said.
Project Indus is a commendable move and a proud one for the country. However, its success will depend on a number of factors, including thorough data collection, effective model training, and accounting for linguistic variation. We look forward to its release and to testing it out!