An Annotated Corpus for Turkish Sentiment Analysis at Sentence Level


Omurca S. , Ekinci E., Turkmen H.

2017 International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Türkiye, 16 - 17 Eylül 2017 identifier identifier

Özet

With the rapid growth of unstructured data accessible via web, managing these data and finding undiscovered information in huge dataset become a necessary task. Consequently text mining, which can be defined as gleaning important information from natural language text, has emerged. In this study, in order to facilitate information management for aspect based sentiment analysis studies, a Turkish sentiment corpus, which is comprised of user reviews and is annotated semi-automatically, is constructed. In the constructed corpus, the root form of the words, the usage (aspect/multiaspect/seedsentiment/absent) of these words, Part of Speech (POS) tags and their polarities are defined. Turkish hotel review dataset which contains 1000 reviews and 5364 sentences for this study was crawled from a web source. The system takes reviews, aspect and seedsentiment lists and returns JSON data structures of the annotated corpus. In this paper, both we provide a ready to use dataset for developing aspect based sentiment analysis applications and we make this dataset easy to use for Java applications by creating JSON data.