Voice conversion as a fraud tool
Deepfake audio covers two types of speech processing techniques: speech synthesis and voice conversion. Speech synthesis typically converts text to speech and is widely deployed in real-world applications such as smart devices and voice assistants (e.g. Siri, Amazon Echo, Google Home), and more recently in tools that help individuals who have lost their voices.
Voice conversion consists of converting a source speaker’s voice (e.g. the attacker’s voice) into a target speaker’s voice (e.g. the victim’s voice). This technique is less known to the public because of its limited real-world applications. It is, however, used in some social media applications to convert someone’s voice into a celebrity’s voice. At the time of writing, few details about the $35 million bank heist are public knowledge, but Pindrop suspects that voice conversion was used in this attack because a live call and natural conversation took place between the attacker and the victim’s bank manager. This is more difficult to achieve with text-to-speech techniques.
Call center deepfake attacks are the most challenging
In Anthony Bourdain’s video, the audio is categorized as wideband, meaning the sampling rate is 16 kHz or higher. Wideband audio signals not only enable high, studio-quality sound, but also inherently carry audio artifacts in the high frequencies. These artifacts are very relevant for detecting a deepfake, whether by software such as Pindrop’s deep learning-based deepfake detector or sometimes even by the human ear, particularly for fricatives such as /s/, /f/, or /v/, which are typically not well produced by speech synthesis systems (e.g. DeepMind’s WaveNet, Google’s Tacotron, or Microsoft’s FastSpeech).
Additionally, the detection of wideband deepfake audio has been the focus of the research community, with academic and industrial competitions such as ASVspoof held since 2015, in which Pindrop technologies have consistently shown high performance and good generalization ability.
While detecting deepfakes in wideband audio is still hard, particularly when it comes to newer and more sophisticated attacks, the problem is significantly more challenging for narrowband call center audio signals. With a Nyquist frequency of only 4 kHz (half of the 8 kHz sampling rate), a good portion of the relevant information is missing from narrowband audio, which greatly reduces the discriminative ability of standard models.
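To make the bandwidth argument concrete, here is a minimal sketch of what happens to high-frequency content when wideband audio is reduced to telephone bandwidth. It is an illustration only, not Pindrop's detection method: the two sine tones are a hypothetical stand-in for speech (a 1 kHz "voiced" component and a 6 kHz "fricative-like" component), and the helper names `nyquist_hz`, `band_energy_fraction`, and `to_narrowband` are invented for this example. The FFT brick-wall filter is a simplification of a real resampler.

```python
import numpy as np

def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2.0

def band_energy_fraction(signal, sample_rate_hz, f_lo, f_hi):
    """Share of total spectral energy that falls in [f_lo, f_hi)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    in_band = (freqs >= f_lo) & (freqs < f_hi)
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0

def to_narrowband(signal, sample_rate_hz, target_rate_hz=8000):
    """Idealized downsampling: zero everything at or above the new
    Nyquist frequency, then keep every k-th sample.
    Assumes sample_rate_hz is an integer multiple of target_rate_hz."""
    factor = sample_rate_hz // target_rate_hz
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    spectrum[freqs >= nyquist_hz(target_rate_hz)] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(signal))
    return filtered[::factor]

# Hypothetical one-second test signal sampled at 16 kHz (wideband).
wideband_rate = 16000
t = np.arange(wideband_rate) / wideband_rate
voiced = np.sin(2 * np.pi * 1000 * t)                 # survives at 8 kHz
fricative_like = 0.5 * np.sin(2 * np.pi * 6000 * t)   # above 4 kHz, lost
wideband = voiced + fricative_like

narrowband = to_narrowband(wideband, wideband_rate)

# Share of the wideband signal's energy that a narrowband channel cannot carry.
high_share = band_energy_fraction(wideband, wideband_rate, 4000, 8000)
print(f"Nyquist at 8 kHz sampling: {nyquist_hz(8000):.0f} Hz")
print(f"Energy above 4 kHz in the wideband signal: {high_share:.1%}")
```

In this toy signal, everything above 4 kHz is gone after the conversion, so any synthesis artifacts living in that band would be invisible to a detector that only sees the narrowband call audio.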
Pindrop deepfake detection holds promise
Pindrop Research has been focusing on this particular problem for a while, and its systems have evolved significantly over the last 5 years. A wide range of approaches has been investigated, from Gaussian Mixture Model-based approaches to, more recently, deep learning-based approaches. This has enabled Pindrop to leverage DSP-level skills, big data, and model architectures to address this problem accurately. Contact us to learn more.