Effect of number and position of frames in speaker age estimation

Osman, Mohammed; Büyük, Osman; Tangel, Ali

doi:10.14744/sigma.2023.00036

Osman M. M., Büyük O., Tangel A.

Sigma Journal of Engineering and Natural Sciences, vol.41, no.2, pp.243-255, 2023 (ESCI, Scopus)

Publication Type: Article / Full Article
Volume: 41 Issue: 2
Publication Date: 2023
DOI: 10.14744/sigma.2023.00036
Journal Name: Sigma Journal of Engineering and Natural Sciences
Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, Academic Search Premier, Directory of Open Access Journals
Pages: pp.243-255
Keywords: Filter Banks, Frame Position, Mean Absolute Error, Regression, Speaker Age, Utterance Length
Open Archive Collection: AVESİS Open Access Collection
Kocaeli University Affiliated: Yes

Abstract

With the arrival of powerful processing devices and capabilities in the first two decades of the 21st century, machine learning algorithms are expected to predict speaker age with higher accuracy, that is, with a much lower error rate. Profiling individuals remotely, which includes estimating their age, is a long-standing goal of human society. Speaker age estimation has been approached from quite a few perspectives. However, most of these approaches do not show the effect of utterance length, that is, the number of frames, on speaker age estimation. We present a detailed analysis of the effect of the number and position of frames on speaker age estimation using four magnitude-based and one phase-based spectral feature sets. The optimal speech duration for this task is investigated. In addition, the mismatch between training and test utterance durations is explored. The magnitude-based features are mainly derived from filter bank analysis. After the filter bank analysis, an i-vector is generated for each utterance. Least squares support vector regression (LSSVR) is employed for speaker age estimation. The experiments use the aGender database, which consists of utterances from four age groups of German speakers. Increasing the number of frames in the training and test sets increases the age estimation accuracy. This can be attributed to the notion that more data helps the estimation process. Concerning position, the frames located at the centre of utterances tend to give better results for both genders. For longer speech segments, the backend algorithms perform best when the training and test sets have equal utterance lengths; otherwise, training with medium-length utterances and testing with longer ones gives better estimation performance, especially for the female dataset.
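
The two experimental variables named in the abstract, number of frames and frame position, together with the mean absolute error listed in the keywords, can be illustrated with a short Python sketch. It is not the authors' implementation. The function names `select_frames` and `mean_absolute_error`, the position labels `"start"`, `"centre"` and `"end"`, and the choice of a contiguous block of frames are all assumptions made for illustration. The feature extraction, i-vector and LSSVR stages are not shown.

```python
from typing import List, Sequence


def select_frames(frames: Sequence, n_frames: int, position: str = "centre") -> List:
    """Return a contiguous block of frames from the start, centre, or end.

    Assumption: a 'position' is a contiguous block; the abstract does not
    specify how the frames were actually chosen.
    """
    total = len(frames)
    n = min(n_frames, total)  # utterances shorter than n_frames are kept whole
    if position == "start":
        begin = 0
    elif position == "centre":
        begin = (total - n) // 2
    elif position == "end":
        begin = total - n
    else:
        raise ValueError(f"unknown position: {position!r}")
    return list(frames[begin:begin + n])


def mean_absolute_error(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(predicted_ages) or len(true_ages) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total_error = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total_error / len(true_ages)


if __name__ == "__main__":
    utterance = list(range(10))  # stand-in for 10 feature frames
    print(select_frames(utterance, 4, "start"))   # [0, 1, 2, 3]
    print(select_frames(utterance, 4, "centre"))  # [3, 4, 5, 6]
    print(select_frames(utterance, 4, "end"))     # [6, 7, 8, 9]
    print(mean_absolute_error([30, 45, 60], [34, 45, 52]))  # 4.0
```

In a study of this kind, a selection step like this would be applied to each utterance before the features are passed to the i-vector and regression stages, and the error would be computed per configuration of frame count and position.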