Generalization Of Audio Deepfake Detection
Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, Elie Khoury
Pindrop, Atlanta, GA, USA
{tchen,akumar,pnagarsheth,gsivaraman,ekhoury}@pindrop.com
Abstract
Recent audio deepfakes, technically known as logical-access voice spoofing techniques, have become an increasing threat to voice interfaces following recent breakthroughs in speech synthesis and voice conversion technologies. Effectively detecting these attacks is critical to many speech applications, including automatic speaker verification systems. As new types of speech synthesis and voice conversion techniques emerge rapidly, the generalization ability of spoofing countermeasures is becoming an increasingly critical challenge. This paper focuses on overcoming this issue by using the large margin cosine loss function (LMCL) and online frequency masking augmentation to force the neural network to learn more robust feature embeddings. We evaluate the performance of the proposed system on the ASVspoof 2019 logical access (LA) dataset. Additionally, we evaluate it on a noisy version of the ASVspoof 2019 dataset, created using publicly available noises, to simulate more realistic scenarios. Finally, we evaluate the proposed system on a copy of the dataset that is logically replayed through a telephony channel to simulate spoofing attacks in the call center scenario. Our baseline system, based on a residual neural network, achieved the lowest equal error rate (EER) of 4.04% among all single-system submissions during the ASVspoof 2019 challenge. The additional improvements proposed in this paper further reduce the EER to 1.26%.
1. Introduction
The fast-growing adoption of voice-based interfaces between humans and computers has led to the need for more accurate voice biometrics strategies. The accuracy of speaker verification technology has improved by leaps and bounds in the past decade with the help of deep learning. At the same time, the ability to spoof and impersonate voices using deep learning based speech synthesis systems has also significantly improved.
Such high-quality text-to-speech (TTS) and voice conversion (VC) approaches can successfully deceive both humans and automatic speaker verification systems. This has created the need for systems that detect logical access attacks such as speech synthesis and voice conversion, in order to protect voice-based authentication systems from such malicious attacks.
The ASVspoof¹ challenge series started in 2015 and aims to foster research on countermeasures to detect voice spoofing. In 2015 [1], the challenge focused on detecting commonly used state-of-the-art logical speech synthesis and voice conversion attacks that were largely based on hidden Markov models (HMM), Gaussian mixture models (GMM) and unit selection. Since then, the quality of speech synthesis and voice conversion systems has drastically improved with the use of deep learning. WaveNet [2], proposed in 2016, was the first end-to-end speech synthesizer trained directly on raw audio, and achieved a mean opinion score (MOS) very close to that of human speech. Similar quality was shown by other TTS systems such as Deep Voice [3] and Tacotron [4], and also by VC systems [5, 6]. These breakthroughs in TTS and VC technologies have made spoofing attack detection more challenging.
¹ http://www.asvspoof.org
In 2019, the ASVspoof [7] logical access (LA) dataset included seventeen different TTS and VC techniques. The organizers ensured that spoofing detection systems were evaluated against unseen spoofing techniques by excluding eleven unknown technologies from the training and development datasets. Strong robustness is therefore required of any spoofing detection system on this dataset.
The challenge results show that the biggest problem for current spoofing detection systems is their generalization ability. Traditionally, signal processing researchers tried to overcome this problem by engineering different low-level spectro-temporal features. For example, constant-Q cepstral coefficients (CQCC) were proposed in [8], and cosine normalized phase and modified group delay (MGD) were studied in [9, 10]. Although these works confirmed the effectiveness of various audio processing techniques in detecting synthetic speech, they are not able to close the generalization gap on the ASVspoof 2019 dataset with its recent, improved TTS and VC technologies. A detailed analysis of 10 different acoustic features, including linear frequency cepstral coefficients (LFCC) and mel frequency cepstral coefficients (MFCC), was made on the ASVspoof 2019 dataset in [11]. The results show that none of these acoustic features generalize well to unknown spoofing technologies. Using deep learning models to learn discriminative feature embeddings for audio spoofing detection was studied in [12, 13, 14]. A comprehensive study of different traditional acoustic features and learned features from an autoencoder was made in [15].
In this work, we tackle this challenge from a different perspective. Instead of investigating different low-level audio features, we try to increase the generalization ability of the model itself. To do so, we use the large margin cosine loss function (LMCL) [16], which was initially proposed for face recognition. The goal of LMCL is to maximize the inter-class variance between the genuine and spoofed classes while minimizing the intra-class variance. Additionally, inspired by SpecAugment [17], we propose FreqAugment, a layer that randomly masks adjacent frequency channels during DNN training, to further increase the generalization ability of the model. On the ASVspoof 2019 EVAL dataset, we achieve an EER of 1.81%, which is significantly better than the baseline. The proposed system is illustrated in Figure 1. A sketch of these two components is given below.
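To make the two ideas concrete, the following is a minimal PyTorch-style sketch of a CosFace-style LMCL layer and an online frequency masking layer. The scale s, margin m, and maximum mask width are hypothetical illustrative values, not the settings used in our experiments, and the interfaces are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMCL(nn.Module):
    """Large margin cosine loss (CosFace-style); s and m are hypothetical values."""
    def __init__(self, embed_dim, num_classes=2, s=30.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin m from the target-class cosine only, then scale by s.
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (cosine - self.m * one_hot)
        return F.cross_entropy(logits, labels)

class FreqAugment(nn.Module):
    """Randomly mask a contiguous band of frequency channels during training."""
    def __init__(self, max_width=8):
        super().__init__()
        self.max_width = max_width

    def forward(self, spec):  # spec: (batch, channels, freq, time)
        if not self.training:
            return spec
        width = torch.randint(0, self.max_width + 1, (1,)).item()
        if width == 0:
            return spec
        start = torch.randint(0, spec.size(2) - width + 1, (1,)).item()
        spec = spec.clone()
        spec[:, :, start:start + width, :] = 0.0  # zero out the masked band
        return spec
```

Because the mask is drawn independently at every forward pass, the network never sees the same frequency regions reliably, which discourages over-reliance on any narrow band.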
Furthermore, we investigate the effectiveness of audio augmentation techniques. We augment the audio files using publicly available noises, including freely available movies and TV shows, music, other noises and room impulse responses, to train and evaluate our system under noisy scenarios. Adding augmented data to the training set further reduces the EER from 1.81% to 1.64% on the ASVspoof 2019 EVAL dataset.
Finally, we study the performance of the proposed spoofing detection system in a call center environment. To this end, we logically replay the ASVspoof 2019 dataset through a VoIP channel to simulate spoofing attacks. Interestingly, we found that adding those audio samples to the training data further reduces the EER from 1.64% to 1.26% on the ASVspoof 2019 EVAL dataset.
This paper is organized as follows: Section 2 describes the datasets used to train and evaluate the proposed spoofing detection system. Section 3 details the proposed spoofing detection system. Section 4 presents the experimental results on different evaluation datasets. Section 5 concludes this paper.
2. Datasets
We use three different training protocols and three different evaluation benchmarks, as shown in Table 1 and Table 2. The following sections briefly describe the datasets and the data augmentation methods used in this work.

2.1. ASVspoof 2019 Challenge Dataset
The ASVspoof 2019 [7] logical access (LA) dataset is derived from the VCTK base corpus. It includes seventeen text-to-speech (TTS) and voice conversion (VC) techniques, divided into two groups: six known techniques and eleven unknown techniques. The entire dataset is partitioned into training, development and evaluation sets. The training and development sets include spoofed utterances generated from two known voice conversion and four known speech synthesis techniques. Only two of these known techniques are present in the evaluation set; the remaining spoofed utterances in the evaluation set were generated by the eleven unknown algorithms. The training and evaluation parts of this data are named T1 and E1, respectively.
2.2. Noisy ASVspoof 2019 Dataset

In order to evaluate our system under noisy conditions, data augmentation is performed on the original ASVspoof 2019 dataset by adapting the data augmentation technique from Kaldi. Two types of distortion were used to augment the ASVspoof 2019 dataset: reverberation and background noise. Room impulse responses (RIR) for reverberation were chosen from publicly available RIR datasets² [18, 19, 20]. We chose four different types of background noise for augmentation: music, television, babble, and freesound³. Part of the background noise files were selected from the open-source MUSAN noise corpus [21]. We also constructed a television noise dataset using audio segments from publicly available movies and TV shows from YouTube. Around 40 movies and as many TV show videos were downloaded and segmented into 30-second segments to construct the TV-noises set; in all, we collected around 46 hours of TV-noises. For music and TV-noises, both the noise and the speech utterance were reverberated using randomly selected RIRs, and the reverberated noise was then added to the reverberated speech. Babble noise was generated by mixing us-gov utterances from the MUSAN corpus. The freesound noises were the general noise files from the MUSAN corpus, which consist of files collected from freesound and soundbible. For babble and freesound noises, we added the background noise files to the clean audio and then reverberated the mixture using a randomly selected RIR. The noises were added at a random SNR between 5 dB and 20 dB. The training part of this data together with T1 is named T2. Similarly, the evaluation part of this data together with E1 is named E2. A sketch of this mixing procedure is given below.
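The following is a minimal NumPy/SciPy sketch of the mixing procedure described above, assuming single-channel float waveforms at a common sampling rate. The function names and the peak renormalization step are illustrative, not the exact Kaldi-derived implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target SNR (dB)."""
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def reverberate(audio, rir):
    """Convolve audio with a room impulse response and renormalize peak level."""
    wet = fftconvolve(audio, rir, mode="full")[:len(audio)]
    peak = np.max(np.abs(wet)) + 1e-12
    return wet * (np.max(np.abs(audio)) / peak)

def augment_babble(speech, noise, rir, rng=None):
    """Babble/freesound-style augmentation: add noise first, then reverberate."""
    if rng is None:
        rng = np.random.default_rng()
    snr_db = rng.uniform(5.0, 20.0)  # random SNR in the 5-20 dB range
    return reverberate(add_noise_at_snr(speech, noise, snr_db), rir)
```

For the music and TV-noises condition, the order differs: speech and noise would each be passed through reverberate() with independently chosen RIRs before being summed at the target SNR.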
2.3. Logically-Replayed ASVspoof 2019 Dataset
To simulate voice spoofing in a call center environment, Twilio's Voice service⁴ is used to play back the ASVspoof 2019 data over voice calls, and the calls are recorded at the receiver's end. The resulting dataset has VoIP channel characteristics and a reduced bandwidth, with the sampling rate dropping from 16 kHz to 8 kHz. Twilio's default OPUS codec⁵ was used for encoding and decoding the audio. During training and testing, the dataset was upsampled back to 16 kHz. The training part of this data together with T2 is named T3; similarly, the evaluation part of this data together with E2 is named E3. The E3 benchmark is used to understand how well our spoofing detection system generalizes in a call-center environment, and the replayed training set is added to protocol T3. A simple offline approximation of the bandwidth reduction is sketched below.
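The actual dataset was produced by replaying calls through Twilio; the following SciPy sketch only approximates the resulting bandwidth reduction and does not model the OPUS codec or other channel effects.

```python
from scipy.signal import resample_poly

def simulate_telephony_bandwidth(audio_16k):
    """Offline stand-in for the replay pipeline: narrow to 8 kHz, return to 16 kHz.

    This reproduces only the loss of content above 4 kHz, not codec or
    packet-loss artifacts introduced by a real VoIP channel.
    """
    narrowband = resample_poly(audio_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    return resample_poly(narrowband, up=2, down=1)       # back to 16 kHz
```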
² http://www.openslr.org/28/
³ https://freesound.org/
⁴ https://support.twilio.com/hc/en-us/articles/360010317333-Recording-Incoming-Twilio-Voice-Calls
⁵ https://www.opus-codec.org/