7th International Congress on Human-Computer Interaction, Optimization and Robotic Applications, ICHORA 2025, Ankara, Türkiye, 23-24 May 2025 (Full Text Paper)
Automatic language identification (LID) from speech is a fundamental task in speech processing, enabling applications such as multilingual communication systems, speech-based user authentication, and real-time language-aware transcription services. The effectiveness of transformer-based speech models for LID is investigated in this study, with a particular focus on five widely spoken languages: Arabic, Chinese, English, French, and Hindi. To construct a diverse and representative dataset, a web crawler is developed to systematically collect speech recordings in these languages from YouTube. The collected data are preprocessed with noise reduction, segmentation, and feature extraction to enhance model performance. The dataset is then used to train and evaluate five state-of-the-art transformer-based speech models: HuBERT, Wav2Vec2, SEW, UniSpeech, and the Audio Spectrogram Transformer (AST). The experimental evaluation demonstrates that these models achieve exceptionally high classification accuracy, with HuBERT attaining 99.9% and Wav2Vec2, SEW, UniSpeech, and AST each achieving 99.8%. These results highlight the strong capability of transformer-based architectures for speech-driven language identification. Furthermore, comparative analysis indicates that the contextualized speech representations learned by transformer models contribute significantly to their robustness in distinguishing linguistically diverse languages. The findings underscore the potential of such models for real-world deployment in multilingual environments, where accurate and efficient language identification is critical for seamless human-computer interaction.
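The abstract mentions segmentation as one of the preprocessing steps applied to the collected YouTube recordings before model training. The paper's actual segment length and sampling rate are not stated here; a minimal sketch of such a segmentation step, assuming 16 kHz audio and hypothetical 5-second windows, might look like:

```python
import numpy as np

def segment_waveform(waveform, sample_rate=16000, segment_seconds=5.0):
    """Split a 1-D waveform into fixed-length segments, dropping any
    trailing remainder shorter than one full segment.

    Note: sample_rate and segment_seconds are illustrative assumptions,
    not values reported in the paper.
    """
    seg_len = int(sample_rate * segment_seconds)
    n_full = len(waveform) // seg_len
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# Example: 12 seconds of synthetic audio yields two full 5-second segments,
# with the final 2-second remainder discarded.
audio = np.random.randn(12 * 16000).astype(np.float32)
segments = segment_waveform(audio)
```

Fixed-length segments of this kind can then be fed to the feature extractors of models such as Wav2Vec2 or HuBERT, which expect raw waveform input at a fixed sampling rate.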