Document Embedding based Supervised Methods for Turkish Text Classification

Celenli H. I., Ozturk S. T., Sahin G., Gerek A., GANİZ M. C.

3rd International Conference on Computer Science and Engineering (UBMK), Sarajevo, Bosnia and Herzegovina, 20-23 September 2018, pp. 477-482

Publication Type: Conference Paper / Full-Text Paper
Volume Number:
DOI: 10.1109/ubmk.2018.8566326
City of Publication: Sarajevo
Country of Publication: Bosnia and Herzegovina
Page Numbers: pp. 477-482
Affiliated with Kocaeli Üniversitesi: No

Abstract

Following the recent increase in the amount of available data, Deep Learning has become the most popular branch of Machine Learning. This trend can also be seen in Natural Language Processing (NLP), especially since textual data can now be scraped from the World Wide Web in vast quantities and used in an unsupervised or semi-supervised manner. For this reason, Deep Learning methods are being used more frequently. In this work we devise several classification methods based on the Paragraph Vector model (a.k.a. Doc2Vec), which represents documents as vectors. These include a k-Nearest Neighbors classifier (k-NN), Support Vector Machines (SVM), and a Centroid Classifier (CC), all of which operate on the paragraph vectors of documents, as well as a custom-made method that uses pairwise cosine similarities between documents and class centroids in Doc2Vec space as features. Our experiments use a number of representations and classifiers combined in various ways. On the representation side, the Paragraph Vector model is compared with Term Frequency (tf) and Term Frequency-Inverse Document Frequency (tf-idf), using SVM, k-NN, CC and Centroid Features Support Vector Machine (CFSVM) as classifiers.
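
The centroid-based ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for Doc2Vec document vectors, and the function names (`cosine`, `class_centroids`, `centroid_features`, `centroid_classify`) are hypothetical. The sketch assumes that a class centroid is the mean of that class's training vectors, that a Centroid Classifier assigns the class whose centroid is most similar, and that the centroid-features representation is the list of cosine similarities to every class centroid, which would then be passed to an SVM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def class_centroids(vectors, labels):
    """Mean vector per class (assumed definition of a class centroid)."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }


def centroid_features(vec, centroids):
    """Cosine similarity to each class centroid, in sorted label order.

    In a CFSVM-style setup these values would be the SVM input features.
    """
    return [cosine(vec, centroids[label]) for label in sorted(centroids)]


def centroid_classify(vec, centroids):
    """Centroid Classifier: pick the class with the most similar centroid."""
    return max(sorted(centroids), key=lambda label: cosine(vec, centroids[label]))


# Toy stand-ins for Doc2Vec vectors (illustrative values, not real data).
train_vectors = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
train_labels = ["sports", "sports", "politics", "politics"]

centroids = class_centroids(train_vectors, train_labels)
query = [0.8, 0.2]
features = centroid_features(query, centroids)
predicted = centroid_classify(query, centroids)
print(predicted, features)
```

With this representation each document is described by one feature per class, so the dimensionality of the classifier input equals the number of classes rather than the size of the embedding.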