End-to-End Spoken Language Recognition Using Self-Attention Speech Models


Kilinc H. H., Kilimci H., KİLİMCİ Z. H.

17th International Conference on Electronics, Computers and Artificial Intelligence, ECAI 2025, Targoviste, Romania, 26 - 27 June 2025, (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • DOI Number: 10.1109/ecai65401.2025.11095447
  • City of Publication: Targoviste
  • Country of Publication: Romania
  • Keywords: HuBERT, language recognition, transformers, Wav2Vec2, WavLM
  • Affiliated with Kocaeli University: Yes

Abstract

Spoken language recognition (SLR) is a pivotal challenge in speech processing, serving a variety of practical applications such as cross-lingual communication platforms, speech-based authentication systems, and real-time transcription that adapts to multiple languages. This study evaluates the effectiveness of self-attention-driven transformer models in automatically identifying spoken languages, focusing on five languages: German, Turkish, French, Spanish, and English. To build a diverse and representative dataset, speech samples are systematically gathered from YouTube using API integration. This approach covers a broad range of speakers, accents, and environmental conditions, which enriches the model training process. The collected data undergo preprocessing steps, including noise reduction and normalization, to improve audio quality and standardize the input. The refined dataset is used to train and assess the performance of several transformer-based models, namely HuBERT, Wav2Vec2, and WavLM, along with their specific variants. The experimental results show that HuBERT performs best, reaching a near-perfect accuracy of 99.30%. These outcomes underline the efficacy of transformer-based architectures in distinguishing between linguistically diverse languages. Furthermore, the findings point to the substantial potential of these models in real-world multilingual applications, where precise and efficient spoken language recognition is essential for seamless interaction with automated systems.
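
The abstract mentions normalization as a preprocessing step but does not say which method was used. As a minimal sketch, assuming simple peak normalization of a mono waveform held as a list of floats, the idea could look like the following. The function name peak_normalize and the target_peak parameter are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the paper's actual normalization method is not
# described in the abstract. Peak normalization is assumed here.
from typing import List


def peak_normalize(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale a waveform so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty clip: nothing to scale, return an unchanged copy.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# Hypothetical four-sample clip whose loudest sample has magnitude 0.5.
clip = [0.0, 0.25, -0.5, 0.1]
normalized = peak_normalize(clip)
```

Scaling every clip to the same peak level is one simple way to standardize input amplitude across recordings made under different conditions; other schemes (for example, zero-mean and unit-variance scaling) would serve the same purpose.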