Assessing Audio-Based Transformer Models for Speech Emotion Recognition

Bayraktar U., Kilimci H., Kilinc H., KİLİMCİ Z. H.

7th International Symposium on Innovative Approaches in Smart Technologies, ISAS 2023, İstanbul, Türkiye, 23 - 25 November 2023 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI Number: 10.1109/isas60782.2023.10391313
City of Publication: İstanbul
Country of Publication: Türkiye
Keywords: Audio Spectrogram, HUBERT, MCTCT, Speech emotion recognition, transformers, Wav2Vec
Affiliated with Kocaeli University: Yes

Abstract

Speech Emotion Recognition (SER) is a field of research and technology that focuses on the automatic detection and classification of emotional states conveyed through speech. SER has a wide range of applications, including customer service, healthcare, entertainment, and market research. It also has the potential to enhance human-computer interaction and improve the understanding of human emotional behavior. Studies so far have mostly focused on traditional machine learning algorithms and deep learning architectures for detecting speech emotion, whereas this work goes one step further by using transformers, a cutting-edge technology. To show the effectiveness of transformer models, the Hidden-Unit BERT, Squeezed and Efficient Wav2Vec, Multi-lingual Concatenated Transformer, and Audio Spectrogram Transformer models are employed to recognize speech emotion on the publicly available and widely used datasets EMO-DB, RAVDESS, and TESS. Experimental results demonstrate that the Audio Spectrogram Transformer model achieves remarkable classification results, specifically 75.42% accuracy on EMO-DB, 88.17% accuracy on RAVDESS, and 98.17% accuracy on TESS.