The Effectiveness of Homogenous Ensemble Classifiers for Turkish and English Texts

Creative Commons License

Kilimci Z. H. , Akyokus S., Omurca S.

International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Sinaia, Romania, 2-5 August 2016


Text categorization is becoming an increasingly important problem because of the rapid proliferation of documents in many fields. To cope with this problem, several machine learning techniques are used for categorization, such as naive Bayes, support vector machines, and artificial neural networks. In this study, we concentrate on ensembles of multiple classifiers instead of a single classifier. We perform a comparative analysis of the impact of ensemble techniques in the text categorization domain. To this end, we use base classifiers of the same type trained on diversified training sets, an approach referred to as homogenous ensembles. To diversify the training sets, various ensemble algorithms are utilized, such as Bagging, Boosting, Random Subspace, and Random Forest. Multivariate Bernoulli Naive Bayes is chosen as the base classifier because of its superior classification performance compared with other single classifiers. Extensive comparative experiments are conducted on four widely used text categorization datasets in both Turkish and English. Finally, the effectiveness of the ensemble algorithms is discussed.
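To make the idea of a homogenous ensemble concrete, the following is a minimal sketch of Bagging over multivariate Bernoulli Naive Bayes base classifiers, written with the Python standard library only. It is an illustration, not the authors' implementation: the function names, the toy documents, the Laplace smoothing, the majority-vote combiner, and the choice of 15 estimators are all assumptions made for this example. Boosting, Random Subspace, and Random Forest diversify the training data in other ways and are not shown.

```python
import math
import random
from collections import Counter


def train_bernoulli_nb(docs, labels, vocab):
    """Fit a multivariate Bernoulli NB model; each doc is a set of tokens."""
    model = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        log_prior = math.log(n / len(docs))
        # P(word present | class) with Laplace smoothing (an assumption here).
        p = {w: (sum(w in d for d in class_docs) + 1) / (n + 2) for w in vocab}
        model[c] = (log_prior, p)
    return model


def predict_bernoulli_nb(model, doc, vocab):
    """Return the class with the highest log-posterior for one document."""
    best, best_score = None, -math.inf
    for c, (log_prior, p) in model.items():
        score = log_prior
        for w in vocab:
            # Bernoulli model: absent words also contribute evidence.
            score += math.log(p[w]) if w in doc else math.log(1.0 - p[w])
        if score > best_score:
            best, best_score = c, score
    return best


def bagging_fit(docs, labels, vocab, n_estimators=15, seed=0):
    """Train one base classifier per bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(docs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_bernoulli_nb([docs[i] for i in idx],
                                         [labels[i] for i in idx], vocab))
    return models


def bagging_predict(models, doc, vocab):
    """Combine the base classifiers by majority vote."""
    votes = Counter(predict_bernoulli_nb(m, doc, vocab) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, not one of the datasets used in the study.
docs = [
    {"match", "goal", "team"}, {"team", "coach", "goal"}, {"match", "score", "team"},
    {"market", "bank", "rate"}, {"bank", "loan", "rate"}, {"market", "stock", "rate"},
]
labels = ["sport", "sport", "sport", "economy", "economy", "economy"]
vocab = sorted(set().union(*docs))

models = bagging_fit(docs, labels, vocab)
print(bagging_predict(models, {"goal", "team"}, vocab))
print(bagging_predict(models, {"bank", "rate"}, vocab))
```

Every ensemble member here is the same kind of classifier; only the bootstrap sample it was trained on differs, which is what makes the ensemble homogenous. A Random Subspace variant would instead give each member a random subset of `vocab`.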