SkyWords: An automatic keyword extraction system based on the skyline operator and semantic similarity


GÖZ F., MUTLU A.

Engineering Applications of Artificial Intelligence, vol. 123, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 123
  • Publication Date: 2023
  • DOI: 10.1016/j.engappai.2023.106338
  • Journal Name: Engineering Applications of Artificial Intelligence
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Computer & Applied Sciences, INSPEC, Metadex, Civil Engineering Abstracts
  • Keywords: Candidate keyword selection, Keyword extraction, Majority voting, Semantic similarity, The skyline operator
  • Kocaeli University Affiliated: Yes

Abstract

This study presents SkyWords, a hybrid keyword extraction method. It combines a novel supervised step, based on the skyline operator and the majority voting principle, for high-quality candidate keyword selection with an unsupervised step, based on contextual semantics, for candidate keyword ranking. First, we build a feature vector database from the features of known keywords and apply the skyline operator to retrieve the dominating feature vectors. To select the candidate keywords of a document, we compare each word of the document against the dominating feature vectors and treat the words that are as good as the majority of the dominating feature vectors as candidate keywords. To obtain the final set of keywords, we rank the candidate keywords by their semantic similarity to the document, computed from vector representations produced by the MPNet sentence transformer. We conducted experiments on six benchmark scholarly datasets to evaluate the performance of SkyWords and compared the results against eleven baseline keyword extraction systems. The experimental results show that the proposed keyword selection algorithm reduced the number of candidate keywords several-fold. Moreover, SkyWords achieved statistically significant improvements over the baseline methods in precision, recall, and F1 score. On ranking-based metrics, SkyWords achieved the highest mean average precision score on all datasets and the highest mean reciprocal rank score on all datasets but one. Furthermore, SkyWords extracted more relevant keywords than the baseline methods.
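
The selection and ranking steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several details are assumptions made for illustration only: features are numeric with larger values being better, "as good as" is read as "not dominated by", and the two-dimensional feature vectors and embeddings are toy values (in the paper the embeddings would come from an MPNet sentence transformer). The names `dominates`, `skyline`, `is_candidate`, and `rank_by_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is >= b on every feature and strictly > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(vectors: List[Vector]) -> List[Vector]:
    """Skyline operator: keep the vectors that no other vector dominates."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]


def is_candidate(word_vec: Vector, sky: List[Vector]) -> bool:
    """Majority vote (assumed reading): keep the word if more than half
    of the skyline vectors do not dominate it."""
    votes = sum(1 for s in sky if not dominates(s, word_vec))
    return votes > len(sky) / 2


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(doc_emb: Vector, word_embs: Dict[str, Vector]) -> List[str]:
    """Order candidate words by descending similarity to the document embedding."""
    return sorted(word_embs, key=lambda w: cosine(doc_emb, word_embs[w]), reverse=True)


# Toy feature vectors of known keywords (hypothetical features).
known_keywords = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
sky = skyline(known_keywords)

# Toy feature vectors for the words of one document.
words = {"skyline": (0.6, 0.6), "the": (0.1, 0.3), "operator": (0.95, 0.1)}
candidates = [w for w, v in words.items() if is_candidate(v, sky)]

# Toy embeddings standing in for sentence-transformer output.
doc_embedding = (1.0, 0.0)
embeddings = {"skyline": (0.9, 0.1), "operator": (0.1, 0.9)}
ranked = rank_by_similarity(doc_embedding, {w: embeddings[w] for w in candidates})
```

In this toy run the skyline keeps three of the five known-keyword vectors, the word "the" is filtered out because two of those three dominate it, and the remaining candidates are ordered by cosine similarity to the document embedding. The pairwise skyline computation is quadratic in the number of vectors, which is adequate for a sketch but not necessarily how a production system would compute it.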