A novel framework for mispronunciation detection of Arabic phonemes using audio-oriented transformer models


Çalık Ş. S., KÜÇÜKMANİSA A., KİLİMCİ Z. H.

Applied Acoustics, vol. 215, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 215
  • Publication Date: 2023
  • DOI: 10.1016/j.apacoust.2023.109711
  • Journal Name: Applied Acoustics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Communication & Mass Media Index, Compendex, ICONDA Bibliographic, INSPEC, DIALNET
  • Keywords: Arabic pronunciation detection, Audio transformers, Computer aided language learning, HUBERT, UniSpeech, Wav2Vec
  • Kocaeli University Affiliated: Yes

Abstract

Computer-Aided Language Learning (CALL) is experiencing notable growth because proficiency in multiple languages has become indispensable for effective communication across diverse linguistic contexts. Within CALL, mispronunciation detection is an intrinsic component aimed at automatically pinpointing errors made by non-native speakers. In this work, a novel framework is proposed for detecting mispronunciations of Arabic phonemes using audio-oriented transformer models. To the best of our knowledge, this is the first study to comprehensively address the mispronunciation of Arabic phonemes with audio-focused transformer models, namely Squeezed and Efficient Wav2Vec (SEW), Hidden-Unit BERT (HuBERT), Wav2Vec, and UniSpeech. To demonstrate the effectiveness of the proposed framework, a comprehensive evaluation is conducted on 29 Arabic phonemes, including 8 hafiz sounds, uttered by 11 distinct speakers. For the experiments, two distinct versions of the dataset are used, incorporating additional voice samples obtained from YouTube. The extensive experimental findings show that the UniSpeech transformer model yields notable classification results for Arabic phoneme mispronunciation detection. The systematic comparison of audio-oriented transformer models adds valuable insight to the existing scholarly discourse and clarifies their effectiveness and suitability for this task, while the comparative analysis also reveals their respective advantages, limitations, and avenues for improvement, thereby guiding future investigations in this area.
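As a conceptual illustration of the classification setup described in the abstract, the sketch below loads a pretrained audio transformer for a 29-class phoneme classification task using the Hugging Face transformers library. This is not the authors' implementation: the checkpoint name (facebook/wav2vec2-base), the use of librosa for audio loading, and the single-clip inference helper are illustrative assumptions; only the 29-phoneme label space comes from the abstract.

# Minimal sketch (assumed setup, not the authors' code): an audio transformer
# fine-tuned for 29-class Arabic phoneme classification. HuBERT, SEW, or
# UniSpeech checkpoints could be substituted for the Wav2Vec2 checkpoint below.
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

CHECKPOINT = "facebook/wav2vec2-base"  # illustrative checkpoint, not from the paper
NUM_PHONEMES = 29                      # 29 Arabic phonemes, as described in the abstract

extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
model = AutoModelForAudioClassification.from_pretrained(
    CHECKPOINT, num_labels=NUM_PHONEMES
)

def classify_phoneme(wav_path: str) -> int:
    """Return the predicted phoneme class index for a single recording."""
    # Load and resample the recording to the rate the feature extractor expects.
    speech, _ = librosa.load(wav_path, sr=extractor.sampling_rate)
    inputs = extractor(
        speech, sampling_rate=extractor.sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

In a mispronunciation-detection setting of the kind the paper describes, a prediction that disagrees with the phoneme the learner intended to utter can be flagged as a possible mispronunciation; the paper compares SEW, HuBERT, Wav2Vec, and UniSpeech backbones for this purpose.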