Evaluating raw waveforms with deep learning frameworks for speech emotion recognition


KİLİMCİ Z. H., Bayraktar Ü., KÜÇÜKMANİSA A.

Multimedia Tools and Applications, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Publication Date: 2025
  • DOI Number: 10.1007/s11042-025-20930-y
  • Journal Name: Multimedia Tools and Applications
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
  • Keywords: CNN, CNN-LSTM, Deep learning, LSTM, Raw audio files, Speech emotion recognition
  • Kocaeli University Affiliated: Yes

Abstract

Speech emotion recognition is a challenging task in the speech processing field. For this reason, the feature extraction process is of crucial importance for representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, for the recognition of emotions, using six different data sets: the Berlin Database of Emotional Speech (EMO-DB), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Toronto Emotional Speech Database (TESS), Crowd-sourced Emotional Multimodal Actors (CREMA), Surrey Audio-Visual Expressed Emotion (SAVEE), and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely the mel-scale spectrogram and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques. Support vector machines, decision trees, naive Bayes, and random forests are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks (CNN), long short-term memory networks (LSTM), and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, a comparison with state-of-the-art studies is carried out. Based on the experimental results, the CNN model surpasses existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby establishing a new state of the art. The proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS with the CNN model, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model in speaker-independent audio classification problems.
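As a rough illustration of the raw-waveform approach described in the abstract, the sketch below defines a small 1D CNN in PyTorch that classifies fixed-length audio clips directly from their samples, with no spectrogram or MFCC stage. The layer sizes, kernel widths, clip length, and eight-class output (matching RAVDESS's emotion labels) are illustrative assumptions, not the architecture or hyperparameters reported in the paper.

```python
# Minimal sketch of a CNN operating directly on raw waveforms.
# All sizes below are illustrative assumptions, not the paper's
# reported configuration.
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, num_classes: int = 8):  # 8 emotions assumed (RAVDESS)
        super().__init__()
        # Stacked 1D convolutions learn filter banks from the raw
        # signal, replacing hand-crafted features such as MFCCs.
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, num_samples) raw audio, e.g. 3 s at 16 kHz
        h = self.features(x).squeeze(-1)  # (batch, 64)
        return self.classifier(h)         # emotion-class logits

model = RawWaveformCNN()
waveform = torch.randn(2, 1, 48_000)  # dummy batch: two 3 s clips at 16 kHz
logits = model(waveform)              # shape: (2, 8)
```

The same input tensor could instead feed an LSTM or a hybrid CNN-LSTM head, which is how the abstract's remaining deep models would slot into this sketch.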