Preserving Linguistic Diversity: Language Documentation & Technology Projects
Pioneering research initiatives focused on documenting, preserving, and developing technology for endangered and low-resource languages across India.
Speech Datasets for Tibeto-Burman Languages
1. Project Launch
April 2022: SpeeD-TB project initiated with funding from Mission Bhashini, Ministry of Electronics and IT.
2. Data Collection
Building 1,200 hours of speech data across 6 languages: Toto, Chokri, Nyishi, Kok Borok, Bodo and Meetei.
3. Model Development
Creating phone sets, language models and baseline speech recognition systems.
4. Project Completion
March 2026: Scheduled completion with all deliverables.
SpeeD-TB Collaborative Network
Academic Partners
IIT Kharagpur, Tezpur University, and Manipur University providing linguistic expertise and research infrastructure.
Technology Partners
Karya Inc. and Panlingua Language Processing LLP contributing technical knowledge and development resources.
Research Team
Led by Bornini Lahiri, Meiraba Takhellambam, Amalesh Gope, Ritesh Kumar and team members.
Inter-genetic Study of Languages of West Bengal
Tibeto-Burman
Dhimal language case markers analyzed
Austro-Asiatic
Santhali language case markers studied
Dravidian
Kurukh language case system examined
Indo-Aryan
Bangla language case markers compared
Project period: December 2020 – December 2023, funded by ISIRD, SRIC, Indian Institute of Technology, Kharagpur.
Mundari Language Technology Development
Community Assessment
Identified the needs of 1,128,050 Mundari speakers across Jharkhand, Bihar, Odisha and West Bengal.
TTS Dataset Creation
Developed 50 hours of Text-to-Speech data (25 hrs male, 25 hrs female) focused on educational content.
Application Development
Created a game-based educational app in Mundari to support education and knowledge dissemination.
Project funded by Microsoft Research Lab India from July 2021 to June 2024.
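A balanced split such as the 25 hrs male / 25 hrs female one described above can be checked by summing clip durations per speaker group. The sketch below is only illustrative: the manifest layout (a list of group and duration pairs), the function name hours_by_group and the sample values are assumptions, not the project's actual data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hours_by_group(clips: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum clip durations (in seconds) per speaker group and return hours."""
    totals = defaultdict(float)
    for group, seconds in clips:
        totals[group] += seconds
    return {group: seconds / 3600.0 for group, seconds in totals.items()}


# Hypothetical manifest entries: (speaker group, clip duration in seconds).
manifest = [
    ("male", 5400.0),
    ("female", 3600.0),
    ("female", 1800.0),
]

totals = hours_by_group(manifest)
for group, hours in sorted(totals.items()):
    print(f"{group}: {hours:.2f} h")
```

Running the same check over a full recording manifest would show whether the two groups stay close to their target hours as data collection proceeds.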
UGC-SRIELI: Documenting Indigenous Languages
From November 2016 to January 2019, the project documented the Kurmali, Mahali, Toto and Koda languages and published resources including the Kurmali dictionary.
Scheme for Protection and Preservation of Endangered Languages
Identification
Locating Dhimal speakers in West Bengal through systematic field surveys and community engagement.
Documentation
Recording native speakers in natural cultural contexts to capture authentic language patterns and usage.
Preservation
Creating permanent linguistic records including dictionaries, grammar guides, and digital archives for future generations.
Project conducted from August 2014 to July 2016 under a Ministry of Education (Government of India) initiative monitored by the Central Institute of Indian Languages (CIIL).
The ComMA Project: Addressing Online Aggression
Multilingual Models
AI systems to detect harmful content
Multimodal Dataset
59,152 annotated comments + 500 memes + 50 hours speech
Four Languages
Meitei, Bangla, Hindi, and Indian English
Funded by Facebook Research, USA from January 2020 to April 2022, this project developed systems to identify communal and misogynistic aggression in online content.
ComMA Project: Technical Implementation
Hierarchical Annotation
Fine-grained tagset marking different types of aggression and contextual information within conversational threads.
Multi-platform Data
Content collected from YouTube, Facebook, Twitter and Telegram to ensure diverse representation of online discourse.
Discursive Role Analysis
Comments analyzed for their functional role in relation to previous comments, creating context-aware classification.
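One way to picture how hierarchical labels and discursive roles fit together is a record per comment that carries a coarse aggression level, finer-grained bias types, and a role relative to the comment it replies to. The sketch below is a minimal illustration under stated assumptions: the class name, field names and label values ("overt", "defend" and so on) are hypothetical and are not the ComMA tagset or data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Comment:
    comment_id: str
    text: str
    parent_id: Optional[str] = None  # None for a top-level comment
    # Coarse level first, finer-grained types second (hypothetical label values).
    aggression: str = "none"
    bias_types: List[str] = field(default_factory=list)
    # Role of this comment relative to its parent (hypothetical label values).
    discursive_role: Optional[str] = None


def with_context(comments: List[Comment]) -> List[Dict[str, object]]:
    """Pair every comment with its parent's text so a classifier sees context."""
    by_id = {c.comment_id: c for c in comments}
    rows = []
    for c in comments:
        parent = by_id.get(c.parent_id) if c.parent_id else None
        rows.append({
            "comment_id": c.comment_id,
            "text": c.text,
            "context": parent.text if parent else "",
            "aggression": c.aggression,
            "bias_types": list(c.bias_types),
            "discursive_role": c.discursive_role,
        })
    return rows


# Hypothetical two-comment thread.
thread = [
    Comment("c1", "first comment", aggression="overt", bias_types=["communal"]),
    Comment("c2", "reply to the first comment", parent_id="c1",
            discursive_role="defend"),
]
rows = with_context(thread)
```

Keeping the parent text next to each comment is what makes a role label such as "defend" interpretable, since the same reply can read very differently depending on what it answers.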
Project Resources and Outputs
SpeeD-TB
Speech datasets and models available at github.com/speed-tb
ComMA Dataset
Aggression detection data at github.com/unrealtecellp/ComMA
AI Models
Try models at life.unreal-tece.co.in:5001/lifemodels/models_playground
Publications
Dictionaries and linguistic analyses in multiple languages
Impact on Language Communities
6 Tibeto-Burman Languages
Receiving speech technology support through SpeeD-TB
1.1M Mundari Speakers
Benefiting from new educational technology
4 Documented Languages
Under UGC-SRIELI project
50+ Hours of Speech Data
For online aggression detection
Collaborative Research Network
The success of these language documentation and technology projects relies on strong partnerships across academic institutions, government bodies, technology companies and, most importantly, the language communities themselves.
Contact & Connect
Connect professionally to follow research updates and academic collaborations.
Access published papers and track citations of our language research projects.
Reach out directly for research inquiries or collaboration opportunities.
View complete academic background and institutional affiliations.
Let's collaborate on linguistic research initiatives. Reach out to discuss potential partnerships in language technologies, language documentation and description, linguistic typology and other projects in related fields.