Ensemble Based Hinglish Hate Speech Detection

Abstract:

Code-mixing is the mixing of multiple languages in speech or text. Easier internet access for a larger population fluent in various regional languages, together with the convenience of ubiquitous languages like English for technical terms, sports-related words, scientific concepts, etc., has led to an increasing presence of code-mixed content on the World Wide Web, especially on social networking and microblogging platforms such as Facebook, Twitter and Instagram. The code-mixing of Hindi, the predominant language of South Asia, with English is colloquially referred to as "Hinglish", a portmanteau of the names of these two languages. Hinglish differs considerably from its parent languages in syntax, phonetics, grammar and even the use of punctuation. The accent and sentiments are drawn from Hindi, while the vocabulary comprises varying English (Roman) transliterations of Hindi words along with certain English terms. The proposed research work aims to build a self-sufficient model, independent of models meant for the English language, that can classify code-mixed posts, tweets or any other text into three categories: Non-Offensive, Abusive and Hate-Inducing. Our work proposes two ensemble models. The first consists of basic machine learning models, which were first evaluated individually and then combined as an ensemble. The second is built by stacking deep learning models. The aim was to combine the individual models, treated as weak learners, into a single model that achieves better results.
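
The idea of combining several weak learners into one classifier can be illustrated with a minimal sketch. The abstract does not state how the first ensemble combines its base models, so hard (majority) voting is assumed here purely for illustration; the function name `majority_vote`, the tie-breaking rule and the example predictions are all hypothetical and are not taken from the paper. Only the three category names come from the text above.

```python
from collections import Counter

# The three target categories named in the abstract.
LABELS = ("Non-Offensive", "Abusive", "Hate-Inducing")


def majority_vote(per_model_predictions):
    """Combine label predictions from several base models by hard voting.

    per_model_predictions: list of lists, one inner list per model,
    each holding one label per input text.
    Ties are broken by the order of LABELS, which is an arbitrary
    choice made for this sketch (assumption, not from the paper).
    """
    if not per_model_predictions:
        raise ValueError("need predictions from at least one model")
    n_texts = len(per_model_predictions[0])
    if any(len(p) != n_texts for p in per_model_predictions):
        raise ValueError("all models must predict the same number of texts")

    combined = []
    # zip(*...) groups the votes of all models for the same text.
    for votes in zip(*per_model_predictions):
        counts = Counter(votes)
        # Highest vote count wins; on a tie, the label earliest in LABELS wins.
        best = max(LABELS, key=lambda lab: (counts[lab], -LABELS.index(lab)))
        combined.append(best)
    return combined


# Hypothetical outputs of three base models on four posts.
model_a = ["Abusive", "Non-Offensive", "Hate-Inducing", "Abusive"]
model_b = ["Abusive", "Non-Offensive", "Abusive", "Non-Offensive"]
model_c = ["Non-Offensive", "Non-Offensive", "Hate-Inducing", "Hate-Inducing"]

ensemble_labels = majority_vote([model_a, model_b, model_c])
```

In this sketch the fourth post receives one vote per category, so the assumed tie-breaking rule decides its label. A stacked ensemble, as described for the deep learning models, would instead train a further model on the base models' outputs rather than counting votes.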