In this article, we apply sequence-to-sequence (seq2seq) models to spelling correction in Turkish, an agglutinative language. In the baseline system, misspelled and target words are split into their letters, and the letter sequences are fed into the seq2seq model. We choose letters as the model unit because the agglutinative nature of Turkish leads to an impractically large vocabulary when words are used as units. To improve on the baseline, we incorporate the right and left context of each misspelled word; in this context-dependent model, every context word is represented by its first three consonants. We train the seq2seq models on a large text corpus of approximately 4 million sentences collected automatically from the Internet, randomly introducing substitution, deletion, and insertion spelling errors into its words. We evaluate the proposed context-dependent seq2seq model on both a synthetic test set, constructed in the same way as the training set, and a realistic test set containing human-made misspellings from Twitter messages. In our experiments, the proposed context-dependent model performs significantly better than the baseline, reaching 94% correction accuracy on the synthetic dataset. It also yields a 2.1% absolute improvement over a state-of-the-art Turkish spelling correction system on the Twitter test set.
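The two data-preparation steps named above, randomly injecting substitution, deletion, and insertion errors into corpus words, and encoding each context word by its first three consonants, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the error probability, the Turkish letter set, and the function names are assumptions of this sketch.

```python
import random

# Assumed Turkish alphabet and consonant set; not taken from the paper.
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"
TURKISH_CONSONANTS = set("bcçdfgğhjklmnprsştvyz")

def corrupt(word, rng, p=0.1):
    """With probability p, apply one random substitution, deletion, or insertion."""
    if not word or rng.random() > p:
        return word
    chars = list(word)
    i = rng.randrange(len(chars))
    op = rng.choice(["substitute", "delete", "insert"])
    if op == "substitute":
        chars[i] = rng.choice(TURKISH_LETTERS)
    elif op == "delete":
        del chars[i]
    else:  # insert
        chars.insert(i, rng.choice(TURKISH_LETTERS))
    return "".join(chars)

def context_code(word):
    """Represent a context word by its first three consonants."""
    consonants = [c for c in word if c in TURKISH_CONSONANTS]
    return "".join(consonants[:3])

def make_example(left, target, right, rng):
    """Build one (source, target) training pair: context codes flank
    the letter sequence of the (possibly corrupted) target word."""
    source = [context_code(left)] + list(corrupt(target, rng)) + [context_code(right)]
    return source, list(target)
```

For instance, `context_code("kelime")` yields `"klm"`, so a misspelled word is presented to the model as its letters surrounded by two compact context tokens, keeping the input vocabulary small despite Turkish's rich morphology.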