International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Türkiye, 7-9 November 2024, pp. 1-6 (Full-Text Paper)
Degradation in speech quality during Voice over Internet Protocol (VoIP) calls is any reduction in the clarity or intelligibility of the audio. It can appear as distortion, latency, jitter, packet loss, echo, or background noise, among other forms. Detecting these issues matters for several reasons. First, it improves user experience and satisfaction by keeping communication clear and effective. Second, it supports better communication and collaboration, which directly affects productivity and efficiency. Third, it lets service providers manage and fix problems proactively and so maintain overall quality of service. Finally, it helps network administrators diagnose the underlying causes of quality impairments and take the measures needed to improve system reliability and performance. This study detects degradations in VoIP speech using audio-based transformer models, specifically AST, Wav2Vec, CNN, Whisper, and DeitCNN. To evaluate these models, a dataset is assembled from YouTube videos and categorized into four types of degradation: background noise, choppy speech, competing speaker, and echo. The experimental findings show that these models achieve strong results, indicating significant potential for accurately identifying and categorizing speech degradation in VoIP communications.