Postprocessing Synthetic Speech With a Complex Cepstrum Vocoder for Spoofing Phase-Based Synthetic Speech Detectors

Demiroglu C., BÜYÜK O., Khodabakhsh A., Maia R.

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, vol.11, pp.671-683, 2017 (Journal Indexed in SCI)

  • Volume: 11 Issue: 4
  • Publication Date: 2017
  • DOI: 10.1109/jstsp.2017.2673807
  • Pages: pp.671-683


State-of-the-art speaker verification systems are vulnerable to spoofing attacks. To address the issue, high-performance synthetic speech detectors (SSDs) for existing spoofing methods have been proposed. Phase-based SSDs that exploit the fact that most of the parametric speech coders use minimum-phase filters are particularly successful when synthetic speech is generated with a parametric vocoder. Here, we propose a new attack strategy to spoof phase-based SSDs with the objective of increasing the security of voice verification systems by enabling the development of more generalized SSDs. As opposed to other parametric vocoders, the complex cepstrum approach uses mixed-phase filters, which makes it an ideal candidate for spoofing the phase-based SSDs. We propose using a complex cepstrum vocoder as a postprocessor to existing techniques to spoof the speaker verification system as well as the phase-based SSDs. Once synthetic speech is generated with a speech synthesis or a voice conversion technique, for each synthetic speech frame, a natural frame is selected from a training database using a spectral distance measure. Then, complex cepstrum parameters of the natural frame are used for resynthesizing the synthetic frame. In the proposed method, complex cepstrum-based resynthesis is used as a postprocessor. Hence, it can be used in tandem with any synthetic speech generator. Experimental results showed that the approach is successful at spoofing four phase-based SSDs across nine parametric attack algorithms. Moreover, performance at spoofing the speaker verification system did not substantially degrade compared to the case when no postprocessor is employed.
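
The two operations the abstract describes, selecting a natural frame by spectral distance and extracting its complex cepstrum, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the FFT size, and the choice of an RMS log-spectral distance are all assumptions made for illustration, since the abstract does not specify which spectral distance measure is used.

```python
import numpy as np


def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum of one frame (illustrative sketch).

    Computed as the inverse FFT of log|X| + j * arg(X), where arg(X) is the
    unwrapped phase with its linear (pure delay) component removed. The
    removed delay is discarded here for brevity.
    """
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    phase = np.unwrap(np.angle(spectrum))

    # Remove the linear-phase term so the phase is periodic before the IFFT.
    half = n_fft // 2
    delay = int(np.round(phase[half] / np.pi))
    phase = phase - np.pi * delay * np.arange(n_fft) / half

    return np.real(np.fft.ifft(log_mag + 1j * phase))


def log_magnitude(frame, n_fft=1024):
    """Log-magnitude spectrum on the non-negative frequency bins."""
    return np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-12)


def select_natural_frame(synthetic_frame, natural_frames, n_fft=1024):
    """Index of the natural frame closest to the synthetic frame.

    Assumption: closeness is measured by RMS log-spectral distance.
    """
    target = log_magnitude(synthetic_frame, n_fft)
    candidates = np.array([log_magnitude(f, n_fft) for f in natural_frames])
    distances = np.sqrt(np.mean((candidates - target) ** 2, axis=1))
    return int(np.argmin(distances))


# Toy "database" of natural frames (hypothetical two-sample frames).
natural_frames = [
    np.array([1.0, 0.5]),   # zero inside the unit circle: minimum phase
    np.array([0.5, 1.0]),   # zero outside the unit circle: maximum phase
    np.array([1.0, -0.9]),
]
synthetic_frame = np.array([1.0, -0.85])

best = select_natural_frame(synthetic_frame, natural_frames)
cepstrum = complex_cepstrum(natural_frames[best])
```

The first two toy frames share the same magnitude spectrum but differ in phase. The complex cepstrum of the minimum-phase frame is nonzero only at positive quefrencies, while that of the maximum-phase frame is nonzero only at negative quefrencies (the upper half of the returned array). This is the kind of phase information that a magnitude-only representation discards and that a mixed-phase vocoder can retain.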