Google DeepMind’s India unit is leading an ambitious project named Morni (Multimodal Representation for India), which aims to build artificial intelligence that covers 125 Indian languages and dialects. The initiative seeks to address India’s linguistic diversity, which spans 22 officially recognized languages and numerous regional dialects.
Manish Gupta, Director at Google DeepMind India, outlined the project’s scope at the Global Fintech Fest in Mumbai. “India has 22 official languages, but we are focusing on over 100 languages. There are 60 languages with over a million speakers and 125 languages with more than 100,000 speakers each,” Gupta explained.
Overcoming Data Challenges
A significant challenge for the project is the lack of digital data for many Indian languages. Gupta noted that 73 of the 125 target languages had no available digital corpus. Even Hindi, despite its wide use, accounts for only 0.1% of internet text.
To tackle this issue, Google launched Project Vaani in collaboration with the Indian Institute of Science and ARTPARK (Artificial Intelligence & Robotics Technology Park). This project has created an open-source database featuring over 14,000 hours of speech data from 58 languages, gathered from 80,000 speakers across 80 districts.
Progress and Future Goals
First announced in December 2022, Project Vaani aims to collect and transcribe 154,000 hours of open-source anonymized speech data from across India. Gupta stated that the project is currently in its second phase, targeting 160 districts throughout the country. This phase is crucial for expanding the data coverage and improving AI models for India’s diverse linguistic landscape.