Improve offensive language detection with ensemble classifiers

Ekinci E., Omurca S., Sevim S.

International Journal of Intelligent Systems and Applications in Engineering, vol.8, no.2, pp.109-115, 2020 (Scopus)


© 2020, Ismail Saritas. All rights reserved.

Sharing content easily on social media has become an important means of communication in today's world. However, alongside the conveniences it provides, problems have emerged because content sharing is not bounded by predefined rules. Consequently, offensive language has become a major problem for both social media platforms and their users. This article aims to detect offensive language in short text messages on Twitter. Short texts are difficult to classify because they do not contain sufficient statistical information. To cope with this drawback, semantic word expansion based on concept and word-embedding vectors is proposed. For the classification task, a decision tree and decision-tree-based ensemble classifiers such as Adaptive Boosting, Bootstrap Aggregating, Random Forest, Extremely Randomized Trees, and Extreme Gradient Boosting are used. The imbalanced dataset problem is addressed by oversampling. Experiments on the datasets show that extremely randomized trees taking word-embedding vectors as input are the most successful, with an F-score of 85.66%.
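
The abstract mentions two steps that are easy to illustrate in isolation: balancing the classes by oversampling, and scoring with the F-score. The following is a minimal sketch using only the Python standard library. The abstract does not say which oversampling method the authors used, so plain random oversampling with replacement is assumed here; the function names `random_oversample` and `f_score`, the class labels, and the toy messages are all hypothetical and not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class.
    (Assumed method; the paper may use a different oversampler.)"""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra samples with replacement from the existing ones.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def f_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up data (not from the paper's dataset).
texts = ["msg a", "msg b", "msg c", "msg d", "msg e"]
labels = ["clean", "clean", "clean", "clean", "offensive"]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

In practice, oversampling is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data; whether the paper follows this protocol is not stated in the abstract.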