Impact of Silent Regions on Spoofed Speech Detection: A Deep Learning Based Analysis


IRMAK M. A., Akdeniz F., Savas B. K.

7th International Congress on Human-Computer Interaction, Optimization and Robotic Applications, ICHORA 2025, Ankara, Türkiye, 23-24 May 2025 (Full-Text Paper)

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1109/ichora65333.2025.11017171
  • City of Publication: Ankara
  • Country of Publication: Türkiye
  • Keywords: LCNN, LFCC, RawNet2, ResNet18, Silent Regions, VAD, Wav2Vec2
  • Affiliated with Kocaeli University: Yes

Abstract

Silent regions in speech signals can significantly affect how spoofing detection models train. The literature suggests that such models rely primarily on silent regions to distinguish genuine from spoofed speech, since genuine speech typically contains more silent segments. This study analyzes the effect of silent regions on the training and generalization capability of spoofing detection models by comparing the test results of several deep learning models. A balanced subset with an equal number of genuine and spoofed speech samples was drawn from the ASVspoof 2019 LA dataset. The speech files in this subset were processed with a Voice Activity Detection (VAD) algorithm to remove silent regions. For the experiments, Spoofing Detection (SD) models were trained and evaluated using the baseline models developed for the ASVspoof 2021 LA dataset (RawNet2 and LFCC-LCNN), along with Wav2Vec2-ResNet18 models. Removing silent regions from the dataset significantly degraded performance for every model: LFCC-LCNN showed a 15.1% decrease, RawNet2 a 69.8% decrease, and Wav2Vec2-ResNet18 a 30.1% decrease. These findings confirm that silent regions play a critical role in the performance of spoofing detection models, and that removing silence information significantly reduces their ability to distinguish genuine from spoofed speech.
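
The silence-removal step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which VAD algorithm was used, so the frame-energy threshold below is an assumption chosen for simplicity and is not the authors' method. The function name `remove_silence` and the parameters `frame_ms` and `threshold_db` are hypothetical.

```python
import numpy as np


def remove_silence(signal, sample_rate, frame_ms=20.0, threshold_db=-40.0):
    """Keep only frames whose RMS level is within |threshold_db| dB of the loudest frame.

    Assumption: a plain energy threshold stands in for the unspecified VAD.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        # Input shorter than one frame: nothing to analyze, return it unchanged.
        return signal.copy()

    # Split into non-overlapping frames; trailing samples that do not
    # fill a whole frame are discarded.
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Per-frame RMS energy.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0.0:
        # The whole input is digital silence.
        return np.zeros(0, dtype=np.float64)

    # Frame level in dB relative to the loudest frame.
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / peak)

    # Concatenate the frames that pass the threshold.
    voiced = level_db > threshold_db
    return frames[voiced].reshape(-1)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)  # 1 s stand-in for speech
    clip = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
    trimmed = remove_silence(clip, sr)
    print(len(clip), "->", len(trimmed))
```

A relative threshold is used here so the sketch does not depend on the absolute recording level. A trained or statistical VAD would likely behave differently on low-energy speech and on noisy pauses.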