
Written by: Sarosh Shahbuddin

Sr. Director of Product

On October 22, 2024, the nonpartisan group RepresentUs released a public service announcement (PSA) on YouTube addressing the potential misuse of AI deepfakes in the 2024 election. The PSA warns that malicious actors could use deepfake technology to spread misinformation about when, where, and how to vote, posing a significant threat to the democratic process.

The PSA features Chris Rock, Amy Schumer, Laura Dern, Orlando Bloom, Jonathan Scott, Michael Douglas, and Rosario Dawson. With the exception of Rosario Dawson and Jonathan Scott, these appearances were deepfakes, created to emphasize the deceptive power of AI technology. The PSA encourages Americans to stay vigilant, recognize the signs of manipulated media, and make sure they are accurately informed ahead of Election Day.

Given the mix of genuine and synthetic speech, this PSA presented an ideal opportunity to demonstrate how Pindrop® Pulse™ Inspect distinguishes between human and synthetic voices. Our technology can play a crucial role in protecting election integrity by helping audiences and organizations tell authentic media from manipulated media.

Analyzing the Public Service Announcement with Pindrop® Pulse™ Inspect

To start, we ran the PSA through Pindrop® Pulse™ Inspect software to analyze it for potential deepfake artifacts. Pulse Inspect works by breaking audio content into segments, analyzing roughly every four seconds of speech, and scoring each segment for authenticity:

  • Score > 60: AI-generated or other synthetic speech detected
  • Score < 40: No AI-generated or other synthetic speech detected
  • Scores between 40 and 60: Inconclusive segments, often due to limited spoken content or background noise interference


This initial pass provided a strong overview of synthetic versus human speech throughout the PSA. The four-second segments allowed us to identify precise points in the video where synthetic or human speech was present, making it clear how well our technology highlights the boundaries between authentic and manipulated media.
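
To make the rubric concrete, here is a minimal sketch of how per-segment scores could be mapped to these labels. The 40/60 thresholds come from the list above; the function name and the example scores are our own illustration, not Pindrop's API.

```python
# Minimal sketch: map per-segment authenticity scores to labels using the
# rubric above. The thresholds (40/60) come from the post; the function
# name and example scores are illustrative, not Pindrop's API.

def label_segment(score: float) -> str:
    """Classify one ~4-second speech segment by its authenticity score."""
    if score > 60:
        return "synthetic detected"   # AI-generated or other synthetic speech
    if score < 40:
        return "no synthetic speech"
    return "inconclusive"             # limited speech or background noise

# Hypothetical scores for consecutive 4-second windows of a PSA-like file.
scores = [12.5, 88.1, 71.4, 45.0, 9.3]
for i, score in enumerate(scores):
    start = i * 4
    print(f"{start:>3}s-{start + 4:>3}s  score={score:5.1f}  {label_segment(score)}")
```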


Breaking Down the Video for Multi-Speaker Analysis

Since many segments featured multiple speakers with mixed human and synthetic voices, we diarized the video to log the start and end times for each speaker. The table below shows the segmented timestamps; a sketch of one way to produce such a log follows it.


Start Time  End Time  Speaker Label
0:00.00     0:03.50   Michael Douglas
0:03.51     0:05.29   Jonathan Scott
0:05.80     0:07.25   Rosario Dawson
0:07.29     0:08.96   Chris Rock
0:08.97     0:10.19   Michael Douglas
0:10.25     0:14.04   Jonathan Scott
0:14.14     0:15.41   Laura Dern
0:15.58     0:16.48   Amy Schumer
0:16.52     0:19.25   Jonathan Scott
0:19.35     0:20.90   Amy Schumer
0:21.15     0:26.51   Chris Rock
0:27.00     0:30.93   Rosario Dawson
0:31.21     0:35.70   Orlando Bloom
0:35.79     0:38.80   Laura Dern
0:39.00     0:44.55   Rosario Dawson
0:44.66     0:46.06   Laura Dern
0:46.13     0:48.30   Jonathan Scott
0:48.42     0:50.49   Amy Schumer
0:50.54     0:54.06   Rosario Dawson
0:54.12     0:56.99   Orlando Bloom
0:57.06     1:00.15   Jonathan Scott
1:00.22     1:01.79   Amy Schumer
1:01.83     1:03.40   Laura Dern
1:03.50     1:05.74   Rosario Dawson
1:05.85     1:09.69   Michael Douglas
1:15.56     1:19.28   Amy Schumer (Actor)
1:21.52     1:23.13   Laura Dern (Actor)
1:24.16     1:26.29   Jonathan Scott
1:26.49     1:31.70   Rosario Dawson
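
The post does not name the diarization tool used, but as one assumption, an off-the-shelf pipeline such as the open-source pyannote.audio can produce a comparable speaker log. Note its speaker labels come out anonymous and must be mapped to names by hand.

```python
# One way to produce a speaker log like the table above, using the
# open-source pyannote.audio pipeline (an assumption; the post does not
# say which diarization tool was used).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder Hugging Face access token
)

# "psa.wav" is a placeholder name for the PSA's extracted audio track.
diarization = pipeline("psa.wav")

# Each turn yields a start time, an end time, and an anonymous label
# (SPEAKER_00, SPEAKER_01, ...); mapping labels to names is a manual step.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.2f}s  {turn.end:7.2f}s  {speaker}")
```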


This speaker diarization enabled us to isolate and analyze each segment individually. For example, the PSA includes six clips of Rosario Dawson, all accurately identified as not synthetic, including the first clip, which contains only one second of audio with just 0.68 seconds of speech. By segmenting the PSA at this level, we achieved higher precision in detecting synthetic content while reliably confirming human voices.
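
To illustrate that isolation step, the diarized timestamps can be used to cut per-speaker clips out of the audio track. The sketch below uses pydub and the six Rosario Dawson turns from the table; the tool choice and file names are our assumptions, not a description of Pindrop's internal tooling.

```python
# Sketch of the isolation step: cut one speaker's turns out of the PSA
# audio using the diarized timestamps. pydub is our tool choice here;
# the post does not say how the audio was sliced.
from pydub import AudioSegment

audio = AudioSegment.from_file("psa.wav")  # placeholder file name

# The six Rosario Dawson turns from the table, as (start, end) in seconds.
dawson_turns = [(5.80, 7.25), (27.00, 30.93), (39.00, 44.55),
                (50.54, 54.06), (63.50, 65.74), (86.49, 91.70)]

for i, (start, end) in enumerate(dawson_turns):
    clip = audio[int(start * 1000):int(end * 1000)]  # pydub indexes in ms
    clip.export(f"dawson_{i:02d}.wav", format="wav")
```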


Tracing the Source of Deepfake Speech

Lastly, diarizing and segmenting the speakers also allowed us to stitch together all of the speech from a single speaker. This provided longer, continuous audio samples, giving our deepfake detection models significantly more speech data to work with and increasing their ability to detect markers of synthetic content.
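
Continuing the slicing sketch above, the stitching step is then just concatenation of one speaker's clips into a single, longer sample, again assuming pydub as the tool.

```python
# Sketch of the stitching step: concatenate one speaker's clips into a
# single, longer sample for the detection models (pydub again, as an
# assumption about tooling; file names carry over from the sketch above).
from pydub import AudioSegment

clip_files = [f"dawson_{i:02d}.wav" for i in range(6)]

stitched = AudioSegment.empty()
for path in clip_files:
    stitched += AudioSegment.from_file(path)

stitched.export("dawson_stitched.wav", format="wav")
print(f"total speech: {len(stitched) / 1000:.2f}s")  # len() is in milliseconds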

With the speaker-separated audio files prepared, we leveraged our Source Tracing feature to identify the probable origin of the deepfakes. Source Tracing is our advanced tool designed to pinpoint the AI engine used to generate synthetic audio, helping us understand the technology behind a given deepfake. After analysis, we identified ElevenLabs as the most likely generator for these deepfakes, with PlayHT as a close alternative. This level of insight is essential for media and cybersecurity teams working to trace and counteract the spread of malicious AI-generated content.
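
Pindrop has not published the Source Tracing interface, so the following is purely a conceptual sketch: source attribution framed as ranking candidate TTS engines by a classifier's output. The engine list mirrors the finding above, but the probabilities are invented for illustration only.

```python
# Purely conceptual sketch: source tracing framed as multi-class
# classification over candidate TTS engines. The probabilities below are
# invented for illustration; this is not Pindrop's Source Tracing API.
candidate_engines = {
    "ElevenLabs": 0.71,      # hypothetical posterior probabilities a
    "PlayHT": 0.22,          # trained attribution model might assign
    "Other/unknown": 0.07,
}

ranked = sorted(candidate_engines.items(), key=lambda kv: kv[1], reverse=True)
(best_name, best_p), (alt_name, alt_p) = ranked[0], ranked[1]
print(f"most likely generator: {best_name} (p={best_p:.2f}); "
      f"close alternative: {alt_name} (p={alt_p:.2f})")
```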


Election Integrity: Key Takeaways 

This PSA not only serves as a reminder of how convincing deepfakes have become, but also highlights the role of tools like Pindrop® Pulse™ Inspect in identifying and mitigating the spread of manipulated media to prevent election manipulation. Our technology is already in use by organizations committed to protecting public trust and preventing the spread of misinformation. As deepfake technology advances, so must our efforts to safeguard truth and transparency in the information we consume.
