WEBINAR

Deepfake + Voice Clone Deep Dive
with Voicebot.ai

Deepfakes are one of the most controversial applications of GenAI technology. While some view them as a harmless tool, others recognize the greater threats they pose. Pindrop and Voicebot.ai surveyed more than 2,000 U.S. consumers to uncover their perceptions of deepfake and voice clone technology. This collaboration led to an extensive and insightful report, which we’ll break down in this webinar.

  • Consumer sentiment around deepfakes and voice clones
  • Which industries have the highest consumer concern around deepfake risks
  • How pop culture plays a role in AI technology sentiment
  • Strategies to combat the threat of deepfakes and voice clones

Meet the Speakers

Amit Gupta

VP, Product Management, Research & Engineering

Bret Kinsella

Founder, Editor, CEO & Research Director

Background 

Digital audio watermarking received a great deal of attention in the early 2000s as a means of protecting the intellectual property of digital media with the advent of file-sharing and media streaming. While some publications discussed the watermarking of speech, the vast majority of research from this period focused on music, where copyright protection requirements were most significant.

In the past year, the topic of audio watermarking has seen a resurgence in interest, and this time the focus is on speech. The key driver behind this resurgence has been the vast improvement in text-to-speech and voice conversion technologies, which has led to the somewhat negative connotation of ‘deepfakes’. It quickly became apparent that deepfakes can be a vehicle for misinformation, media manipulation, social engineering and fraud, to name but a few. It has therefore become increasingly important to be able to quickly and accurately decide whether a speech signal is real or not – something that by now is far beyond the capabilities of a human listener. And this is where watermarking comes in. It has been proposed to watermark synthetically generated or manipulated audio and then use the watermark to validate the authenticity of the speech signal.

In Pindrop’s contribution to Interspeech 2024, one of the flagship scientific conferences in speech science, we present an improved method of watermarking based on the classic spread-spectrum approach. In this blog, we provide a summary of the main findings of this work. The interested reader is referred to the paper for details [REF TO PAPER].

Fundamentals of watermarking

A watermark typically consists of a pseudo-random sequence of +1s and -1s. This sequence is embedded in a carrier signal, which in our case is speech. The watermarked signal may then be compressed for storage, transmitted over telephone networks or replayed through a loudspeaker. At some point, a user of the speech signal can probe its authenticity by trying to detect the watermark; if the watermark is present, this indicates a deepfake.

Figure 1. Generic watermarking system diagram.
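
To make the generic system in Figure 1 concrete, here is a minimal time-domain sketch of spread-spectrum embedding and correlation-based detection in Python. The frame length, intensity, and the choice to work directly on time-domain samples are simplifying assumptions for illustration only; the method presented in the paper applies perceptual weighting and differs in its details.

    import numpy as np

    def embed_watermark(speech, key=1234, frame_len=400, alpha=0.01):
        """Embed one +/-1 watermark bit per frame by adding a scaled
        pseudo-random spreading sequence (400 samples = 25 ms at 16 kHz)."""
        rng = np.random.default_rng(key)
        n_frames = len(speech) // frame_len
        bits = rng.choice([-1.0, 1.0], size=n_frames)   # hidden bit sequence
        pn = rng.choice([-1.0, 1.0], size=frame_len)    # shared spreading sequence
        marked = speech.astype(np.float64).copy()
        for i, b in enumerate(bits):
            s = i * frame_len
            marked[s:s + frame_len] += alpha * b * pn
        return marked, bits, pn

    def detect_bits(signal, pn, n_frames, frame_len=400):
        """Matched-filter-style detection: correlate each frame with the
        spreading sequence; the sign of the correlation recovers the bit."""
        est = []
        for i in range(n_frames):
            frame = signal[i * frame_len:(i + 1) * frame_len]
            est.append(np.sign(np.dot(frame, pn)))
        return np.array(est)

If the recovered bits match the expected pseudo-random sequence significantly better than chance, the watermark is deemed present and the audio can be flagged as synthetically generated.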

Although conceptually straightforward, watermarking presents conflicting requirements that must be satisfied for a watermark to be useful in practice. These requirements are illustrated in Fig. 2. A balance must be struck between robustness to anticipated and deliberate attacks, imperceptibility (or inaudibility) to a listener or observer, and information-bearing capacity.

Figure 2. Triangle of conflicting requirements of watermarking.

What makes watermarking speech more challenging than music?

There are a number of factors that make watermarking of speech more challenging than watermarking music. The most important of these factors are listed below:

  • Speech communication channels: a typical speech communication channel includes several stages where the speech signal is degraded through, for example, downsampling, additive noise, reverberation, compression, packet loss and acoustic echoes. All of these may be viewed as non-deliberate attacks, and thus they set the minimum requirements for watermark robustness.
  • Tolerance to degradations: the objective of a speech signal is to convey two pieces of information: (i) who is speaking and (ii) a message between a speaker and a listener. Both of these can be achieved successfully even in large amounts of background noise and reverberation. This may be exploited by bad actors to make a watermark undetectable.
  • Limited spectral content: speech signals generally have much less spectral content than music. This makes it more difficult to find space for embedding a watermark in a manner that keeps it imperceptible.
  • Short frame stationarity: speech can be considered stationary only over 20-30 ms frames, which is at least two to three times shorter than for music signals. As will be discussed later in the blog, this has implications for the length of watermark that can be embedded.

Improved spread-spectrum watermarking of speech

Spread-spectrum watermarking is one of the most prominent solutions available in the scientific literature. However, it was developed with a focus on music and, as described earlier, speech imposes a different set of requirements. Below we summarize the important improvements and, thus, the novel contributions of our work.

  • Frame-length analysis: in the original spread-spectrum work, frame sizes of 100 ms were used for embedding the watermark. We demonstrated empirically that the optimal frame size for speech is in the range of 20-30 ms; a longer frame size makes the watermark audible, so its intensity must be reduced, which in turn reduces robustness. We also showed that frame sizes greater than 100 ms may be used for music without compromising robustness or imperceptibility.
  • LPC-based weighting: one commonly used technique to improve imperceptibility without compromising robustness is to embed the watermark in high-magnitude frequency components of the carrier signal. While this has proven to work for music, we demonstrate in our work that it is detrimental to speech. The reason is that the high-magnitude frequency components in speech typically correspond to formant frequencies, and when these are disturbed the speech quality is adversely impacted. Hence, we derive a weighting function from the linear predictive coding (LPC) spectral envelope, which is closely related to the formants, and use it to weight the watermark such that it is reduced within the spectral peaks but emphasized elsewhere (a simplified sketch of this weighting follows the list). Our results show that the intensity of a watermark may be doubled (thereby increasing robustness) when this method is applied.
  • Deep spectral shaping: from classical detection theory, the optimal detector of the watermark (or any signal in general) is a matched filter, i.e., the correlation between the watermarked signal and the watermark. This holds true if the carrier signal is spectrally white and the interference is simple, such as additive white Gaussian noise. As discussed above, this is rarely the case for speech signals. Applying a pre-whitening filter, such as a cepstral filter, can improve detection accuracy by combating the inherent spectral slope in speech; however, it does not deal with more complex degradations. Hence, we considered two different deep neural network (DNN)-based architectures for preprocessing the signal prior to the matched-filter operation. The models were trained on anticipated degradations such as downsampling and compression down to 8 kbit/s. We showed that this could significantly improve detection accuracy in these more challenging cases, with an equal error rate improvement of up to 98%.
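
To illustrate the LPC-based weighting idea from the second bullet, here is a minimal Python sketch. It estimates an LPC spectral envelope per frame and derives per-frequency weights that attenuate the watermark near formant peaks and emphasize it elsewhere. The model order, FFT size, and the exact form of the weighting function are illustrative assumptions and not the precise formulation used in the paper.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_envelope(frame, order=12, n_fft=512):
        """Estimate the LPC spectral envelope of a single speech frame."""
        frame = frame * np.hamming(len(frame))
        # Autocorrelation sequence r[0..N-1].
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        r[0] += 1e-9  # avoid a singular system on silent frames
        # Solve the Toeplitz normal equations R a = r[1..order].
        a = solve_toeplitz(r[:order], r[1:order + 1])
        # A(z) = 1 - sum_k a_k z^{-k}; the envelope is proportional to 1/|A(e^jw)|.
        A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)
        return 1.0 / (np.abs(A) + 1e-9)

    def watermark_weights(frame, order=12, n_fft=512):
        """Per-frequency watermark weights: small near envelope (formant)
        peaks, larger in spectral valleys (illustrative heuristic)."""
        env = lpc_envelope(frame, order, n_fft)
        w = 1.0 / env
        return w / w.max()

In practice, the weighted watermark would then be applied to the frequency components of each frame before the signal is resynthesized, keeping the embedding energy away from the formants that matter most for perceived speech quality.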

Summary

Watermarking has been proposed as a possible solution for detecting synthetically generated or modified speech. While many methods were originally developed for music, they are not directly applicable to speech. We have highlighted the differences between speech and music and addressed several of them in this work. Specifically, we defined an optimal frame-size range for embedding a watermark, derived an LPC-based weighting function for improved watermark embedding, and developed a DNN-based decoding strategy that is robust to complex degradations. This work thus shows that we are able to obtain reasonably robust watermarking strategies for speech signals. However, there is still work to be done in order to fully understand the extent to which this can help combat the misuse of deepfakes.

Learn more about Pindrop liveness detection technology here.

Pindrop, a leader in voice security and deepfake detection solutions, has joined forces with Respeecher, a leading provider of voice cloning solutions, to promote the ethical use of Generative Artificial Intelligence and to strengthen the fight against bad actors. This landmark partnership paves the way for the rapid development of deepfake detection technology by working closely with voice cloning systems. Pindrop and Respeecher will share research tools and data to maximize accuracy in detecting bad actors who use real-time voice cloning systems for fraud. 

With generative artificial intelligence (Gen AI) advancements, voice cloning has become a powerful tool that creates believable replicas of human voices. These cloned voices capture the subtleties, cadences, and imperfections of human speech that make them sound authentic. Respeecher helps content creators use voice clones in new and exciting ways in movies, television, advertising, gaming, and other creative applications. Voice clones, used ethically, can create more social and consumer engagement, help patients with speech disabilities recover their voice, dub content in a different language, or voice a new character. Respeecher has established its leadership in developing best practices for the ethical and safe use of voice cloning with its strict consent, moderation, and data security policies.

However, the same voice cloning technology, in the hands of bad actors, can be used for nefarious purposes such as financial fraud, impersonating family members, or audiojacking live conversations. With the increasing sophistication of AI, the risk of more realistic voice clones and scaled-up organized attacks poses a major challenge. It is imperative that voice cloning providers design their solutions to avoid causing harm, upholding high standards for the ethical use of AI and refusing to clone voices without permission. The next level of threat is real-time voice conversion (voice swap), which removes synthetic artifacts and unnatural pauses and creates a more natural flow. These synthetic voices are nearly impossible for human ears to detect.

This is where Pindrop’s solution can play a pivotal role. Voice clones are not easily detectable by human ears. Research from UCL demonstrates that a person can detect an AI-generated voice only 73% of the time. However, Pindrop’s sophisticated deepfake detection technology can reach 99% accuracy. Pindrop achieves this by leveraging “Liveness detection” technology which analyzes a voice 8000 times per second for artifacts that both should and shouldn’t be there, including sounds that reflect a human vocal tract opening and closing or machine-generated frequencies that are impossible for humans to hear.  

Pindrop’s partnership with Respeecher is a major advancement in the fight against deepfakes. It introduces a unique solution to the market that detects voice conversion in real time and gives the industry a head start on fraudsters planning to fool traditional authentication and biometric defenses. The partnership also lays the groundwork for both companies to promote the ethical use of GenAI in their technology moving forward. Together, Pindrop and Respeecher will help keep deepfake detection technology ahead of evolving deepfake fraud tactics.

“Pindrop is excited about partnering with Respeecher so deepfake detection can stay ahead of the curve to protect against voice clones. This partnership will help Pindrop in our mission to promote trust in every call for businesses like banks, insurance firms, or healthcare providers, allowing them to deliver exceptional experiences that their customers expect and deserve.”

Rahul Sood, Chief Product Officer, Pindrop   

“Respeecher has proven itself not only as a Hollywood-quality AI voice technology with credits in Netflix, Disney+, Paramount, and HBO movies, but also as an ethical company that takes this issue seriously. Within the last two years, we’ve put a lot of effort into developing a benchmark for AI companies on how to work ethically, and into following it, too. That includes our collaboration with a number of initiatives like PAI and Adobe’s CAI. No doubt a partnership with Pindrop will help us bring security, identity, and trust to every voice interaction.”

Alex Serdiuk, CEO and Co-founder, Respeecher  

About Pindrop

Pindrop solutions are leading the way to the future of voice by establishing the standard for identity, security and trust for every voice interaction. Pindrop solutions protect some of the biggest banks, insurers, and retailers in the world using patented technology that extracts intelligence from every call and voice encountered. Pindrop solutions help detect fraudsters and authenticate genuine customers, reducing fraud and operational costs, while improving customer experience and protecting brand reputation. Pindrop, a privately held company, headquartered in Atlanta, GA, was founded in 2011 by Dr. Vijay Balasubramaniyan, Dr. Paul Judge and Dr. Mustaque Ahamad and is venture-backed by Andreessen Horowitz, Citi Ventures, Felicis Ventures, CapitalG, GV, IVP and Vitruvian Partners. For more information, please visit pindrop.com. 

About Respeecher

Respeecher is an AI voice cloning technology company with a portfolio of projects for Hollywood and AAA games. The startup is best known for recreating Darth Vader’s voice for the Star Wars TV series, Edith Piaf’s voice for the Warner Music biopic, and Viktor Vector’s voice in Cyberpunk 2077: Phantom Liberty. In 2021, it won an Emmy Award for the documentary film In Event of Moon Disaster, featuring President Nixon’s speech, and in 2023 it received a Webby Award for the best use of AI. Unlike existing AI voice solutions, Respeecher focuses on an ethical approach and makes it a central pillar of the business.

In a groundbreaking development within the 2024 US election cycle, a robocall imitating President Joe Biden was circulated. Several news outlets arrived at the right conclusion that this was an AI-generated audio deepfake targeting multiple individuals across several US states. However, many noted how hard it is to identify the TTS engine used (“It’s nearly impossible to pin down which AI program would have created the audio” – NBC News). This is the challenge we focused on, and our deepfake analysis suggests that the specific TTS system used was ElevenLabs. Additionally, we showcase how deepfake detection systems work by identifying spectral and temporal deepfake artifacts in this audio. Read further to find out how Pindrop’s real-time deepfake detection determines liveness using a proprietary continuous scoring approach and provides explainability.

Pindrop’s deepfake engine analyzed the 39-second audio clip through a four-stage process: audio filtering and cleansing, feature extraction, breaking the audio into 155 segments of 250 milliseconds each, and continuous scoring of all 155 segments.

After automatically filtering out the nonspeech frames (e.g., silence, noise, music), we downsampled the audio to an 8 kHz sampling rate, mitigating the influence of wideband artifacts. This replication of end-user listening conditions is crucial for simulating the typical phone-channel conditions needed for an unbiased and authentic analysis. Our system extracts low-level spectro-temporal features, runs them through our proprietary deep neural network, and finally outputs an embedding we call a “fakeprint.” A fakeprint is a unit-vector, low-rank mathematical representation that preserves the artifacts distinguishing machine-generated from genuine human speech. These fakeprints help make our liveness system explainable: for example, if a deepfake was created using a text-to-speech engine, they allow us to identify the engine.

Our deepfake detection engine continuously generates scores for each of the 155 segments using our proprietary models that are tested on large and diverse datasets, including data from 122 text-to-speech (TTS) engines and other techniques for generating synthetic speech. 
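
For readers who want a feel for what this segmentation and per-segment scoring could look like in code, below is a rough, purely illustrative Python sketch. The non-speech removal, mel-spectrogram features, and the model.predict call are stand-ins; Pindrop’s actual features and models are proprietary and are not shown here.

    import numpy as np
    import librosa  # assumed available for loading, resampling, and features

    SEGMENT_SEC = 0.25   # 250 ms segments, as described above
    TARGET_SR = 8000     # downsample to 8 kHz to mimic phone-channel audio

    def segment_audio(wav_path):
        """Load audio, drop non-speech, and cut it into 250 ms segments."""
        y, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
        # Crude non-speech removal: keep only intervals above an energy floor.
        intervals = librosa.effects.split(y, top_db=30)
        speech = np.concatenate([y[s:e] for s, e in intervals])
        hop = int(SEGMENT_SEC * TARGET_SR)
        return [speech[i:i + hop] for i in range(0, len(speech) - hop + 1, hop)]

    def score_segments(segments, model):
        """Return one liveness score in [0, 1] per segment; `model` stands in
        for a trained detector and its predict() API is hypothetical."""
        scores = []
        for seg in segments:
            feats = librosa.feature.melspectrogram(y=seg, sr=TARGET_SR, n_mels=40)
            feats = librosa.power_to_db(feats)   # simple spectro-temporal features
            scores.append(model.predict(feats))  # hypothetical scoring call
        return np.array(scores)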

Our analysis of this deepfake audio clip revealed interesting insights explained below:

Liveness Score 

Using our proprietary deepfake detection engine, we assigned a ‘liveness’ score to each segment, ranging from 0 (synthetic) to 1.0 (authentic). The liveness scores of this Biden robocall consistently indicated an artificial voice: the score fell below the liveness threshold of 0.3 after the first 2 seconds and stayed there for the rest of the call, clearly identifying it as a deepfake.

Liveness analysis of President Biden robocall audio
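
As a toy illustration of how such per-segment scores could be turned into a call-level verdict (the 0.3 threshold and the roughly 2-second warm-up come from the description above; the aggregation rule itself is an assumption):

    import numpy as np

    LIVENESS_THRESHOLD = 0.3   # scores below this suggest synthetic speech
    WARMUP_SEGMENTS = 8        # roughly the first 2 s of 250 ms segments

    def call_verdict(scores):
        """Flag the call as a likely deepfake if, after the warm-up period,
        every segment score stays below the liveness threshold."""
        steady = np.asarray(scores[WARMUP_SEGMENTS:])
        return "deepfake" if np.all(steady < LIVENESS_THRESHOLD) else "live"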

TTS system revealed  

Explainability is extremely important in deepfake detection systems. Using our fakeprints, we analyzed President Biden’s audio against the 122 TTS systems typically used for deepfakes. Pindrop’s deepfake detection engine found, with 99% likelihood, that this deepfake was created using ElevenLabs or a TTS system using similar components. We ensured that this result does not suffer from overfitting or bias by following research best practices. Once we narrowed down the TTS system to ElevenLabs, we validated the finding using the ElevenLabs SpeechAI Classifier, which reported that this audio file was likely generated with ElevenLabs (84% probability). Even though the attackers used ElevenLabs this time, future attacks will likely use a different generative AI system, so it is imperative that these tools include enough safeguards to prevent nefarious use. It is encouraging that some generative AI systems, like ElevenLabs, are already on this path by offering “deepfake classifiers” as part of their offerings. However, we suggest that they also ensure the consent for creating a voice clone actually comes from a real human.
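
One simple way such an attribution could be implemented (purely illustrative; the actual engine-matching method is not disclosed) is to compare the clip’s fakeprint against a reference embedding for each known TTS engine and rank the candidates by similarity:

    import numpy as np

    def attribute_tts(fakeprint, engine_refs):
        """Rank candidate TTS engines by cosine similarity between the clip's
        unit-norm fakeprint and one unit-norm reference vector per engine.

        fakeprint   : 1-D unit-norm numpy array for the questioned audio
        engine_refs : dict mapping engine name -> unit-norm reference vector
        """
        sims = {name: float(np.dot(fakeprint, ref))
                for name, ref in engine_refs.items()}
        return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical usage: the top-ranked engine is the most likely source.
    # ranked = attribute_tts(clip_fakeprint, {"engine_a": ref_a, "engine_b": ref_b})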

Prior to determining which TTS system was used, we first determined that this robocall was created with a text-to-speech engine, i.e., the voice clone of the President was generated from text as input rather than by a person altering their voice to sound like President Biden with a speech-to-speech system.

Deepfake artifacts

As we analyzed each segment of the audio clip, we plotted the intensity of the deepfake artifacts in each segment as the call progressed. The plot, depicted below, shows that some parts of the audio carry more deepfake artifacts than others. This is the case for phrases like “New Hampshire presidential preference primary” or “Your vote makes a difference in November.” These phrases are rich in fricatives, as in the words “preference” or “difference,” which tend to be strong spectral identifiers of deepfakes. Additionally, we saw the intensity rise for phrases that President Biden is unlikely to have ever said before; for example, there were a lot of deepfake artifacts in the phrase “If you would like to be removed from future calls, please press two now.” Conversely, phrases that President Biden has used before showed low intensity, for example, “What a bunch of malarkey” – an expression we understand President Biden uses a lot.

Protecting trust in public information & media 

In summary, the 2024 Joe Biden deepfake robocall incident emphasizes the urgency of distinguishing real from AI-generated voices. Pindrop’s methods identified this as a deepfake, pinpointed its use of a text-to-speech engine, and showed that such analysis can be performed at scale.

Companies addressing deepfake misinformation should consider criteria like continuous content assessment, adaptability to acoustics, analytical explainability, linguistic coverage, and real-time performance when choosing detection solutions.

Acknowledgements:

This work was carried out by the exceptional Pindrop research team.


Partner with Pindrop to defend against AI-driven misinformation.
Contact us here for a custom demo. 

Voice security is not a luxury—it’s a necessity.

Take the first step toward a safer, more secure future for your business.