On October 22nd, the nonpartisan group RepresentUs released a public service announcement (PSA) on YouTube, addressing the potential misuse of AI deepfakes in the 2024 election. The PSA warns that malicious actors could use deepfake technology to spread election misinformation on when, where, and how to vote, posing a significant threat to the democratic process.
The PSA features Chris Rock, Amy Schumer, Laura Dern, Orlando Bloom, Jonathan Scott, Michael Douglas, and Rosario Dawson. With the exception of Rosario Dawson and Jonathan Scott, the appearances of these public figures were deepfakes, created to emphasize the deceptive power of AI technology. The PSA encourages Americans to stay vigilant, recognize signs of manipulated media, and ensure they are accurately informed ahead of Election Day.
Given the mix of genuine and synthetic speech, this PSA presented an ideal opportunity to demonstrate the capabilities of Pindrop® Pulse™ Inspect in distinguishing between human and synthetic voices. Our technology can play a crucial role in helping protect election integrity by supporting audiences and organizations in distinguishing between authentic and manipulated media.
Analyzing the Public Service Announcement with Pindrop® Pulse™ Inspect
To start, we ran the PSA through Pindrop® Pulse™ Inspect software to analyze potential deepfake artifacts. Pulse Inspect works by breaking down audio content into segments, analyzing every four seconds of speech, and scoring each segment based on its authenticity:
- Score > 60: AI-generated or other synthetic speech detected
- Score < 40: No AI-generated or other synthetic speech detected
- Scores between 40 and 60: Inconclusive segments, often due to limited spoken content or background noise interference
This initial pass provided a strong overview of synthetic versus human speech throughout the PSA. The four-second segments allowed us to identify precise points in the video where synthetic or human speech was present, making it clear how well our technology highlights the boundaries between authentic and manipulated media.
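To make the thresholds concrete, here is a minimal sketch (plain Python, not Pindrop's actual API) of how per-segment scores could be mapped to the labels above; the scores shown are hypothetical.

```python
# Minimal sketch (not Pindrop's API): map per-segment scores to the labels
# described above. The scores below are hypothetical.
def label_segment(score: float) -> str:
    if score > 60:
        return "synthetic"       # AI-generated or other synthetic speech detected
    if score < 40:
        return "not synthetic"   # no AI-generated or other synthetic speech detected
    return "inconclusive"        # often limited spoken content or background noise

# One hypothetical score per 4-second segment of speech
scores = [12.4, 87.9, 55.0, 91.2]
print([label_segment(s) for s in scores])
# -> ['not synthetic', 'synthetic', 'inconclusive', 'synthetic']
```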
Breaking Down the Video for Multi-Speaker Analysis
Since many segments featured multiple speakers with mixed human and synthetic voices, we diarized the video, logging the start and end time for each speaker. The table below shows the segmented timestamps.
| Start Time | End Time | Speaker Label |
|------------|----------|---------------------|
| 0:00.00 | 0:03.50 | Michael Douglas |
| 0:03.51 | 0:05.29 | Jonathan Scott |
| 0:05.80 | 0:07.25 | Rosario Dawson |
| 0:07.29 | 0:08.96 | Chris Rock |
| 0:08.97 | 0:10.19 | Michael Douglas |
| 0:10.25 | 0:14.04 | Jonathan Scott |
| 0:14.14 | 0:15.41 | Laura Dern |
| 0:15.58 | 0:16.48 | Amy Schumer |
| 0:16.52 | 0:19.25 | Jonathan Scott |
| 0:19.35 | 0:20.90 | Amy Schumer |
| 0:21.15 | 0:26.51 | Chris Rock |
| 0:27.00 | 0:30.93 | Rosario Dawson |
| 0:31.21 | 0:35.70 | Orlando Bloom |
| 0:35.79 | 0:38.80 | Laura Dern |
| 0:39.00 | 0:44.55 | Rosario Dawson |
| 0:44.66 | 0:46.06 | Laura Dern |
| 0:46.13 | 0:48.30 | Jonathan Scott |
| 0:48.42 | 0:50.49 | Amy Schumer |
| 0:50.54 | 0:54.06 | Rosario Dawson |
| 0:54.12 | 0:56.99 | Orlando Bloom |
| 0:57.06 | 1:00.15 | Jonathan Scott |
| 1:00.22 | 1:01.79 | Amy Schumer |
| 1:01.83 | 1:03.40 | Laura Dern |
| 1:03.50 | 1:05.74 | Rosario Dawson |
| 1:05.85 | 1:09.69 | Michael Douglas |
| 1:15.56 | 1:19.28 | Amy Schumer (Actor) |
| 1:21.52 | 1:23.13 | Laura Dern (Actor) |
| 1:24.16 | 1:26.29 | Jonathan Scott |
| 1:26.49 | 1:31.70 | Rosario Dawson |
This speaker diarization enabled us to isolate and analyze each segment individually. For example, here are six clips of Rosario Dawson, all accurately identified as not synthetic—even the first clip, which contains only one second of audio with just 0.68 seconds of speech! By segmenting the PSA at this level, we achieved higher precision in detecting synthetic content while reliably confirming human voices.
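As an illustration of this segmentation step, here is a minimal sketch using the open-source pydub library rather than any Pindrop tooling; the filename is a placeholder, and the segment list is taken from the first few rows of the table above.

```python
from pydub import AudioSegment  # open-source audio library, not Pindrop tooling

# Load the PSA audio (filename is a placeholder for illustration)
psa = AudioSegment.from_file("representus_psa.wav")

# Diarization output as (start_seconds, end_seconds, speaker); values come
# from the table above (only the first few rows are shown here)
segments = [
    (0.00, 3.50, "Michael Douglas"),
    (3.51, 5.29, "Jonathan Scott"),
    (5.80, 7.25, "Rosario Dawson"),
]

# Cut one clip per diarized segment (pydub slices audio in milliseconds)
for i, (start, end, speaker) in enumerate(segments):
    clip = psa[int(start * 1000):int(end * 1000)]
    clip.export(f"clip_{i:02d}_{speaker.replace(' ', '_')}.wav", format="wav")
```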
Tracing the Source of Deepfake Speech
Finally, diarizing and segmenting speakers also let us stitch together all speech from a single speaker. This provided longer, continuous audio samples for our models to analyze, increasing our technology’s ability to detect markers of synthetic content. With this approach, our deepfake detection models had significantly more speech data to work with.
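Continuing the earlier sketch, the per-speaker stitching can be illustrated by concatenating each speaker's clips in order; again, this uses pydub and placeholder filenames, not Pindrop's internal pipeline.

```python
from collections import defaultdict
from pydub import AudioSegment

psa = AudioSegment.from_file("representus_psa.wav")  # placeholder filename

# Same (start_seconds, end_seconds, speaker) rows as in the previous sketch
segments = [
    (0.00, 3.50, "Michael Douglas"),
    (3.51, 5.29, "Jonathan Scott"),
    (5.80, 7.25, "Rosario Dawson"),
]

# Concatenate every clip belonging to the same speaker, in order
combined = defaultdict(AudioSegment.empty)
for start, end, speaker in segments:
    combined[speaker] += psa[int(start * 1000):int(end * 1000)]

# Export one longer, continuous file per speaker for analysis
for speaker, audio in combined.items():
    audio.export(f"{speaker.replace(' ', '_')}_all_speech.wav", format="wav")
```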
With the speaker-separated audio files prepared, we leveraged our Source Tracing feature to identify the probable origin of the deepfakes. Source Tracing is our advanced tool designed to pinpoint the AI engine used to generate synthetic audio, helping us understand the technology behind a given deepfake. After analysis, we identified ElevenLabs as the most likely generator for these deepfakes, with PlayHT as a close alternative. This level of insight is essential for media and cybersecurity teams working to trace and counteract the spread of malicious AI-generated content.
Election Integrity: Key Takeaways
This PSA not only serves as a reminder of how convincing deepfakes have become, but also highlights the role of tools like Pindrop® Pulse™ Inspect in identifying and mitigating the spread of manipulated media to prevent election manipulation. Our technology is already in use by organizations committed to protecting public trust and preventing the spread of misinformation. As deepfake technology advances, so must our efforts to safeguard truth and transparency in the information we consume.
News consumption is changing, especially during election cycles
Scrolling on social media for hours on end has yet another unforeseen consequence: it’s altered the way that the American public consumes the news—and, by extension, statements from political leaders. According to the Pew Research Center, “half of US adults [are getting] news at least sometimes from social media.” When we consume our news on social media, we may assume that the information we’re seeing is honest and credible. Yet, as a recent parody that uses AI-generated voice cloning of VP Kamala Harris implies, we can’t always believe what we’re hearing.
As AI evolves, one troubling fact is emerging: global leaders and average citizens alike can fall victim to voice cloning without their consent. Though the industry is looking towards safety measures like watermarking and consent systems, those tactics may not be enough.
How it started
At 7:11 PM ET on July 26, 2024, Elon Musk reposted a video on X from account @MrReaganUSA. In a follow-up video, @MrReaganUSA acknowledged that, “the controversy is partially fueled by my use of AI to generate Kamala’s voice.” Our research was able to determine more precisely that the audio is a partial deepfake, with AI-generated speech intended to replicate VP Harris’s vocal likeness alongside audio clips from previous remarks by the VP.
As of July 31, 2024, Musk’s post was still live and had over 133M Views, 245K reposts, and 936K likes. Another parody video of VP Harris was posted to X by @MrReaganUSA on July 31, 2024.
Our analysis of the deepfake
When our research team discovered Musk’s post, they immediately ran an analysis using our award-winning PindropⓇ Pulse technology to determine which parts of the audio were manipulated by AI. Pulse is a tool designed for continuous assessment, producing a segment-by-segment breakdown and analyzing for synthetic audio every 4 seconds. This is especially useful in identifying AI manipulation in specific parts of an audio file—helping to spot partial deepfakes.
Synthetic vs. non-synthetic audio
After denoising the audio to reduce the background music, Pulse detected fifteen 4-second segments as “synthetic” and six 4-second segments that were not synthetic, which leads us to believe that this is likely a partial deepfake.
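Pindrop has not published its denoising pipeline; one common open-source approach to suppressing steady background music before re-running detection is spectral gating, for example with the noisereduce package. The filenames below are placeholders, and this sketch stands in for whatever preprocessing was actually used.

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load the parody audio (placeholder filename), keeping its native sample rate
y, sr = librosa.load("harris_parody.wav", sr=None)

# Spectral gating: estimate a noise profile and attenuate it across the clip
y_denoised = nr.reduce_noise(y=y, sr=sr)

# Write the cleaned audio out for a second pass through detection
sf.write("harris_parody_denoised.wav", y_denoised, sr)
```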
With Pulse’s liveness detection capability, our research team found three clips of VP Harris’s previous remarks in the parody video. Each clip, however, was removed from its original context. Listen below:
- Clip 1: This audio was taken from a real speech, but altered to repeat in a loop.
- Clip 2: VP Harris misspoke in this speech; that audio was used here.
- Clip 3: This audio is also from a real speech.
Tracing the source and identifying inadequate AI safety measures
Our research team went one step beyond this breakdown: they identified the voice cloning AI system that was used to create the synthetic voice. Our source attribution system identified a popular open-source text-to-speech (TTS) system, TorToise, as the source. TorToise exists on GitHub, HuggingFace, and in frameworks like Coqui. It’s possible that a commercial vendor could be reusing TorToise in their system. It’s also possible that a user employed the open source version.
This incident demonstrates the challenges of relying on watermarking to identify deepfakes and their sources, an issue Pindrop has raised previously. While several of the top commercial vendors are adopting watermarking, numerous open-source AI systems have not. Several of these systems have been developed outside the US, making enforcement difficult.
Pindrop’s technology doesn’t rely on watermarking. Instead, Pulse detects the “signature” of the AI generating system. Every voice cloning system leaves a unique trace, including the type of input (“text” vs “voice”), the “acoustic model” used, and the “vocoder” used. Pulse analyzes and maps these unique traces against 350+ AI systems to determine the provenance of the audio. Pindrop used this same approach in previous incidents, including the Biden Robocall deepfake in January, which Pulse determined was created by ElevenLabs, a popular commercial TTS system.
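Pindrop's fakeprinting models are proprietary, but the matching idea can be illustrated with a toy sketch: compare a signature embedding extracted from the audio against a library of known generator signatures and report the closest match. The system names and vectors below are placeholders, not real fakeprints.

```python
import numpy as np

# Toy illustration only: a signature embedding extracted from the audio is
# compared against a catalog of known generator signatures.
# Names and vectors are placeholders, not actual Pindrop fakeprints.
library = {
    "tts_system_a": np.array([0.82, 0.11, 0.31]),
    "tts_system_b": np.array([0.20, 0.91, 0.40]),
    "tts_system_c": np.array([0.45, 0.47, 0.88]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_likely_source(query: np.ndarray) -> str:
    # Return the catalogued system whose signature is closest to the query
    return max(library, key=lambda name: cosine(query, library[name]))

print(most_likely_source(np.array([0.78, 0.15, 0.28])))  # -> "tts_system_a"
```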
Through additional research, we identified three platforms that offer AI-generated speech that mimics VP Harris’s voice. Those include TryParrotAI, 101soundboard, and jammable. We also found that 101soundboard seems to be using the TorToise system.
Some commercial vendors are considering adopting measures, like consent systems, to mitigate the misuse of voice cloning; however, with open-source AI systems, these measures are difficult to enforce. While implementing consent systems is a step in the right direction, there isn’t a consistent standard or third-party validation of these measures.
Why information integrity must be top-of-mind
While this audio was labeled as a “parody” in the original post, now that it’s available online, it can be reshared or reposted without that context. For other online interactions, like accepting cookies on a website or verifying your identity with your online bank, governments have established laws to protect consumers. AI and deepfakes, however, are a new and rising threat, with little to no guardrails to prevent misuse.
That’s why maintaining the integrity and authenticity of information that’s shared online—especially as we near the 2024 election—should be a top priority. Not doing so can be damaging to public trust and the belief in our most important and foundational systems.
Putting up protections to help preserve truth
Good AI is sorely needed to mitigate the societal effects of bad AI. As a leader in the voice security space for over a decade, Pindrop is leading the fight against deepfakes and misinformation, with the goal of helping to restore and promote trust in the institutions that are the bedrock of our daily lives. Our Pulse solution offers a way to independently analyze audio files and empower organizations with information to determine if what they’re hearing is real. Read more here about our deepfake detection technology and how we’re leading the way in fighting bad AI.
Disclaimer
This is an actively developing investigation. The information presented in this article is based on our current findings and analysis as of August 1, 2024. Our team is actively staying alert, investigating and uncovering new trends in deepfake and voice security-related incidents. Follow us on LinkedIn or X for any new insights.
Paul Carpenter, a New Orleans street magician, wanted to be famous for fork bending. Instead, he made national headlines on CNN when he got wrapped up in a political scandal involving a fake President Joe Biden robocall sent to more than 20,000 New Hampshire residents urging Democrats not to vote in last month’s primary.
The video, and the ease with which the magician made it, raise concerns about the threat of deepfakes and the volume at which anyone could create them in the future. Here are the highlights from the interview and what you should know to protect your company from deepfakes.
Deepfakes can now be made quickly and easily
Carpenter didn’t know how the deepfake he was making would be used. “I’m a magician and a hypnotist. I’m not in the political realm, so I just got thrown into this thing,” says Carpenter. He says he was playing around with AI apps, getting paid a few hundred bucks here and there to make fake recordings. According to text messages shared with CNN, one of those paying was a political operative named Steve Kramer, employed by the Democratic presidential candidate Dean Phillips. Kramer admitted to CNN that he was behind the robocall, and the Phillips campaign cut ties with him, saying they had nothing to do with it.
But this deepfake raised immediate concern from the White House over the power of AI. The call was fake: it was not recorded by the president, nor intended for election watchers. For Carpenter, it took 5-10 minutes tops to create it. “I was like, no problem. Send me a script. I will send you a recording, and send me some money,” says Carpenter.
The fake Joe Biden robocall was distributed 24-48 hours before the primary
The call was distributed just 24-48 hours before the New Hampshire primary, leaving little time to counter its intent. It could therefore have swayed some people from voting, which is worrisome to think about with an election upcoming. When everyone is connected to their devices, it’s hard to intercept fraud in real time, and the ability to inject generative AI into that ecosystem leads some to project that we could be in for something dramatic.
How Pindrop® Pulse works to detect deepfakes
Deepfake expert Vijay Balasubramaniyan, Co-Founder and CEO of Pindrop, says there’s no shortage of apps, many of them free, that can create these voice clones. He held various engineering and research roles at Google, Siemens, IBM Research, and Intel before co-founding Pindrop.
“It only requires three seconds of your audio, and you can clone someone’s voice,” says Vijay Balasubramaniyan. At Pindrop, we test how quickly an AI voice can be created while leveraging AI to stop it in real time. Pindrop is one of the only companies in today’s market with a product, Pindrop® Pulse, that detects deepfakes at over 90% accuracy for zero-day attacks and unseen models, and at 99% for previously seen deepfake models. The fake Joe Biden audio featured on CNN required only about five minutes of President Biden speaking at an event to create a clone of his voice.
Pindrop® Pulse is different from the competition
Pulse sets itself apart through real-time liveness detection, continuous assessment, resilience, zero-day attack coverage, and explainability. Explainability is key: Pulse provides analysis along with its results, so you can learn from the data and further protect your business. It also provides a liveness score and a reason code with every assessment, without requiring the speaker’s voice to be enrolled.
Every call is automatically analyzed using fakeprinting™ technology. Last but not least, it is all fully integrated as a cloud-native capability, eliminating the need for new APIs or system changes.
What your company can do to protect against deepfakes
Pindrop was able to detect that the robocall of President Biden’s voice was faked and to track down the exact AI company whose engine was used to make it. In today’s environment, it takes AI software to detect whether a voice is AI-generated.
It’s only with technology that you could know it was a deepfake. “You cannot expect a human to do this. You need technology to fight technology, so you need good AI to fight bad AI,” says Vijay Balasubramaniyan. Like magic tricks, AI recordings may not be what they seem.
Watch the whole segment on CNN to see how easy it is to create a deepfake audio file and how Pindrop® Pulse can help in the future. You’ll see that, after adding a voice, these platforms let you type whatever you’d like it to say and produce the audio within minutes. For businesses, it could be as simple as: “I would like to buy a new pair of shoes, but they should be pink,” says Vijay Balasubramaniyan, which makes it problematic for many businesses to catch fraud going forward. Be sure you have a plan to detect fraud and to protect your teams and your company from attacks that can happen this quickly.
What do we mean by the conversational economy?
This is an economy driven by interaction. Currently, that means always-on internet connectivity, access to products and services anytime/anywhere through a plethora of devices, and platforms that allow people to engage directly with businesses and other consumers.
Businesses already participate in the conversational economy when they immediately respond to customer complaints on social media, engage with prospects through chatbots, or provide seamless omnichannel buying experiences for customers across physical stores, the internet, and the phone.
Why has voice become so popular with consumers?
Ease. Voice is the most natural form of communication and the first one we learn to use; it’s ironic that technology has only just now caught up to the rich intricacies of the voice. Now that computing resources, internet bandwidth, and technological innovation can handle voice well, we predict that voice applications will become the next gold rush, just as we saw a gold rush with touch-enabled devices (starting with the iPhone), and will spawn an entirely new economy.
Voice already dominates customer interactions and grows exponentially each year. Currently, 78 percent of all customer interactions are by voice. One estimate suggests that voice shopping will increase from a $2 billion industry in 2018 to a $40 billion industry in 2022.
How has the adoption of voice assistants grown?
Adoption of voice assistants and voice activities is also starting to accelerate. Over 25 percent of the US population has access to a smart device, and a large percentage of people anticipate more voice interactions going forward, such as on cell phones or using voice assistants for shopping. Voice tasks spanning a variety of situations, inside the home, outside the home, and at work, will see increased adoption over the next 18 months and especially over the next five years.
For more answers, download the full Voice Intelligence Report here.