EMPIRICAL COMPARISON OF NAÏVE BAYES EVENT MODELS AND SMOOTHING METHODS FOR TEXT CLASSIFICATION

Thesis Type: Master's

Institution: Doğuş University, Institute of Science and Technology, Computer Engineering, Turkey

Thesis Advisor: Murat Can Ganiz

Thesis Approval Date: 2013

Thesis Language: English

Supporting Program: Other

Abstract:

Naïve Bayes is one of the most commonly used algorithms in text classification because it is easy to implement and has low complexity. Two event models are commonly used in Naïve Bayes for text categorization: the multivariate Bernoulli model and the multinomial model. A very large number of studies choose the multinomial model with Laplace smoothing based solely on the assumption that it performs better than the multivariate Bernoulli model under almost any conditions. This thesis aims to shed some light on this widely adopted assumption by empirically analyzing Naïve Bayes event models and smoothing methods from a different perspective. To clarify the difference between these event models, their classification performance is compared on datasets in two languages, English and Turkish. The results of our extensive experiments demonstrate that the superior performance of the multinomial model is not observed in all cases. On the other hand, the multivariate Bernoulli model can perform well across different training set sizes when combined with an appropriate smoothing method.
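
To make the distinction between the two event models concrete, the following is a minimal sketch of both with Laplace (add-alpha) smoothing. It is an illustration under stated assumptions, not the implementation used in the thesis: the function names `train` and `predict`, the `alpha` parameter, and the toy documents are all hypothetical. The multinomial model counts every token occurrence, while the multivariate Bernoulli model records only whether a word is present in a document and also scores the words that are absent.

```python
import math
from collections import Counter


def train(docs, labels, alpha=1.0):
    """Estimate smoothed parameters for both event models (illustrative only)."""
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    model = {"vocab": vocab, "classes": classes, "prior": {}, "multi": {}, "bern": {}}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = math.log(len(class_docs) / len(docs))
        # Multinomial: term frequencies, every occurrence of a word counts.
        tf = Counter(w for d in class_docs for w in d)
        total = sum(tf.values())
        model["multi"][c] = {
            w: math.log((tf[w] + alpha) / (total + alpha * len(vocab))) for w in vocab
        }
        # Multivariate Bernoulli: document frequencies, presence or absence only.
        df = Counter(w for d in class_docs for w in set(d))
        model["bern"][c] = {
            w: (df[w] + alpha) / (len(class_docs) + 2 * alpha) for w in vocab
        }
    return model


def predict(model, doc, event_model="multinomial"):
    """Return the most probable class under the chosen event model."""
    vocab = set(model["vocab"])
    scores = {}
    for c in model["classes"]:
        score = model["prior"][c]
        if event_model == "multinomial":
            # Sum log-probabilities over tokens; out-of-vocabulary words are skipped.
            score += sum(model["multi"][c][w] for w in doc if w in vocab)
        else:
            # Iterate over the whole vocabulary; absent words contribute log(1 - p).
            present = set(doc)
            for w, p in model["bern"][c].items():
                score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
docs = [
    ["goal", "match", "goal"],
    ["match", "team"],
    ["vote", "election"],
    ["election", "vote", "party"],
]
labels = ["sports", "sports", "politics", "politics"]
model = train(docs, labels)
```

With this toy data, `predict(model, ["goal", "team"], "multinomial")` and `predict(model, ["goal", "team"], "bernoulli")` can be compared directly. Changing `alpha` or swapping in another smoothing estimate only affects the two dictionary comprehensions in `train`.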