Ritesh Kumar | Council for Diversity and Innovation

Groundbreaking Language Research Projects
Ritesh Kumar, PhD
Exploring linguistic diversity and mitigating online harms through cutting-edge research programmes funded by prestigious institutions across India and internationally. His work spans endangered language documentation, speech technology for underrepresented languages, and automatic detection of harmful content.
Speech Datasets for Tibeto-Burman Languages
Data Collection
Building 1,200+ hours of speech data across 6 languages: Toto, Chokri, Nyishi, Kok Borok, Bodo and Meetei.
Phonetic Analysis
Preparing comprehensive phone sets for each language to document unique linguistic features.
Model Development
Creating baseline speech recognition models to preserve these cultural treasures for future generations.
Funded by Mission Bhashini, Ministry of Electronics and Information Technology, Govt. of India (2022-2026). 
Led by Bornini Lahiri, Meiraba Takhellambam, Amalesh Gope, Ritesh Kumar and team in collaboration with IIT-Kharagpur, Tezpur University, Manipur University, Karya Inc., and Panlingua.
Documentation of Beda Language
Field Recordings
Collection of over 5 hours of high-quality audio recordings from the last remaining speakers of Beda.
Cultural Documentation
Systematic recording of cultural practices, rituals, and performances unique to the Beda community.
Lexicographic Work
Creation of a comprehensive dictionary preserving the lexicon of this critically endangered language.
This project was funded by the Scheme of Preservation and Protection of Endangered Languages (SPPEL), Central Institute of Indian Languages, Mysore (January 2015-November 2016). Principal Investigator: Ritesh Kumar.
Verbal Threat Detection in Aggressive Speech
Data Collection
Creation of a 160-hour speech corpus in Hindi and Indian English, capturing various levels of aggression and verbal threats.
Annotation Process
Meticulous annotation of speech samples to identify markers of aggression intensity and threatening language patterns.
Model Development
Building machine learning algorithms to automatically detect aggression markers in multilingual speech contexts.
Validation Testing
Rigorous testing protocols across diverse speech samples to ensure accuracy and reliability of detection systems.
Funded by UGC-UKIERI Thematic Partnerships (January 2015-December 2016), in collaboration with the University of Huddersfield, the University of Sussex, JNU, and Microsoft Research India.
Aggression Detection on Social Media
Data Collection
Gathering diverse social media content in Hindi-English code-mixed environments.
Aggression Tagging
Development of a multilevel tagging scheme to categorize different forms of online aggression.
AI Model
Creation of automatic classification algorithms for real-time detection of aggressive content.
Implementation
Deployment of models to assist content moderation in multilingual Indian contexts.
Supported by an Unrestricted Research Grant from Microsoft Research, India (June 2017). The project produced a robust dataset and model for automatically recognizing multilingual aggression patterns in social media texts.
The ComMA Project: Analysing Online Hate
Multimodal Analysis
Comprehensive examination across text, memes, and speech
Hierarchical Annotation
Fine-grained classification of aggressive content types
Multilingual Coverage
Spanning Meitei, Bangla, Hindi, and Indian English
Conversational Context
Mapping aggression within discourse structures
The Communal and Misogynistic Aggression project analyzed 59,152 comments across social media platforms, alongside 500+ memes and 50+ hours of speech data. Funded by Facebook Research (2020-2022), with partners including Panlingua and IIT-Kharagpur.
The HarmPot Project: Measuring Online Harm
Framework Development
Creation of an India-specific methodology to measure the potential harm of social media content, accounting for linguistic and cultural nuances unique to the Indian context.
Multidisciplinary Approach
Collaboration between linguists, social scientists, legal experts, law enforcement, parliamentarians, activists, and affected communities.
Large-Scale Data Collection
Analysis of 574,000+ datapoints in Hindi and English from 500+ incidents of online harm to establish patterns and metrics.
Model Implementation
Development of AI systems to automatically assess harm potential and recommend proportionate intervention measures.
Funded by the Council for Strategic and Defense Research (2022-2024) in partnership with Unreal Tece LLP, this project addresses growing concerns about online misinformation and hate speech in India.
Research Impact and Knowledge Transfer
Endangered Language Preservation
Our documentation projects have created permanent records of critically endangered languages like Beda, preserving cultural heritage that might otherwise be lost to future generations.
Digital Inclusion
Speech technology development for Tibeto-Burman languages helps bridge the digital divide, enabling millions of speakers to access information and services in their native tongues.
Safer Online Spaces
Our aggression detection models provide practical tools for identifying and mitigating harmful content, contributing to safer digital environments across Indian language communities.
Policy Influence
Research findings inform evidence-based approaches to online content moderation and digital safety policies specific to Indian linguistic and cultural contexts.
All projects make their datasets and models publicly available through GitHub repositories, fostering transparent research practices and enabling broader academic and industry innovation.
Research Methodology and Approach
Needs Assessment
Identifying critical gaps in language technology and documentation through stakeholder consultation.
Community Engagement
Establishing partnerships with language communities to ensure ethical data collection and cultural sensitivity.
Rigorous Analysis
Applying interdisciplinary analytical frameworks combining linguistics, computer science, and social sciences.
Open Dissemination
Making research outputs freely available through open repositories and accessible platforms.
Our research philosophy emphasizes participatory approaches, ethical considerations, and sustainable impact. We prioritize long-term value creation for language communities while maintaining high academic standards and methodological innovation.
International Collaborations
University of Huddersfield
Partnered on verbal threat detection in aggressive speech, providing expertise in computational linguistics and speech analysis methodologies.
Facebook Research
Supported the ComMA Project, bringing expertise in large-scale content moderation and multilingual NLP techniques.
Microsoft Research India
Collaborated on social media aggression detection, contributing advanced machine learning expertise and computational resources.
Our international partnerships extend our research capabilities and global impact while bringing valuable cross-cultural perspectives to addressing linguistic challenges in the Indian context.
Research Applications and Tools
All project resources are publicly accessible. Interactive demonstrations of the aggression detection and harm potential models are available at the Unreal Tece LLP models playground.
Future Research Directions
Expanded Language Coverage
Including more endangered and low-resource Indian languages
Advanced Real-time Detection
Developing faster, more accurate detection systems
Preventive Intervention
Building tools for early-stage harm mitigation
Policy Framework Development
Informing evidence-based digital safety regulations
Our research roadmap focuses on expanding language coverage, improving detection accuracy, developing preventive mechanisms, and contributing to policy frameworks. We aim to build on our foundation of linguistic expertise and technological innovation to address emerging challenges in language preservation and online safety.
Contact & Connect
LinkedIn
Connect professionally to follow research updates and academic collaborations.
Google Scholar
Access published papers and track citations of our language research projects.
Email Contact
Reach out directly for research inquiries or collaboration opportunities.
Institutional Profile
View complete academic background and institutional affiliations.
Let's collaborate on linguistic research initiatives. Reach out to discuss potential partnerships in language technologies, speech technologies, language documentation, online harm mitigation, and related fields.