NET-LDA: a novel topic modeling method based on semantic document similarity

Ekinci, Ekin; İlhan Omurca, Sevinç

doi:10.3906/elk-1912-62

TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, vol. 28, no. 4, pp. 2244-2260, 2020 (SCI-Expanded, Scopus, TRDizin)

Publication Type: Article / Full Article
Volume: 28 Issue: 4
Publication Date: 2020
DOI: 10.3906/elk-1912-62
Journal Name: TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, TR DİZİN (ULAKBİM)
Pages: pp. 2244-2260
Keywords: Aspect extraction, co-occurrence relation, latent Dirichlet allocation (LDA), semantic similarity, topic modeling
Kocaeli University Affiliated: Yes

Abstract

Topic models such as latent Dirichlet allocation (LDA) allow us to categorize each document by its topics. LDA models a document as a mixture of topics and a topic as a probability distribution over words. The key drawback of the traditional topic model, however, is that it cannot handle the semantic knowledge hidden in the documents, so semantically related, coherent, and meaningful topics cannot be obtained. Yet semantic inference plays a significant role in topic modeling, as it does in other text mining tasks. To tackle this problem, this paper proposes a novel model, NET-LDA. In NET-LDA, semantically similar documents are merged to bring all semantically related words together, and the resulting semantic similarity knowledge is incorporated into the model through a new adaptive semantic parameter. The motivation of the study is to reveal the impact of semantic knowledge on topic modeling research. In a given corpus, different documents may contain different words yet speak about the same topic. For such documents to be correctly identified, their feature space must be enriched with more powerful features. To this end, the semantic space of documents is constructed with concepts and named entities. Two datasets, in English and Turkish and spanning 12 different domains, are evaluated to show that the model is independent of both language and domain. The proposed NET-LDA outperforms the baselines in terms of topic coherence, F-measure, and qualitative evaluation.
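
The document-merging idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine measure over feature counts, the fixed `threshold`, the function names `cosine` and `merge_similar`, and the toy concept labels are all assumptions made for illustration, and the adaptive semantic parameter of NET-LDA is not reproduced here. The sketch only shows how documents that share no words can still be grouped when their semantic features (standing in for concepts and named entities) overlap.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar(docs, features, threshold):
    """Concatenate documents whose semantic feature vectors are similar.

    docs:      list of token lists (the words of each document)
    features:  list of feature lists (hypothetical concepts / named entities)
    threshold: assumed fixed similarity cutoff, chosen only for illustration
    """
    vectors = [Counter(f) for f in features]
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link every pair of documents whose similarity reaches the threshold.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    # Concatenate the tokens of each linked group into one merged document.
    groups = {}
    for i, doc in enumerate(docs):
        groups.setdefault(find(i), []).extend(doc)
    return list(groups.values())


# Toy data: the first two documents share no words but share semantic features.
docs = [
    ["battery", "lasts", "long"],
    ["charge", "holds", "all", "day"],
    ["screen", "is", "bright"],
]
features = [
    ["battery_life", "phone"],
    ["battery_life", "phone"],
    ["display", "phone"],
]
merged = merge_similar(docs, features, threshold=0.8)
```

In this toy run the first two documents end up in one merged document while the third stays separate, so a topic model trained on `merged` would see "battery" and "charge" co-occurring. A real pipeline would replace the hand-written feature lists with extracted concepts and named entities.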