Pindrop Collaborates with NVIDIA to Protect Against Unauthorized Voice Cloning

Sarosh Shahbuddin

Senior Director, Product Management • June 25, 2025 (UPDATED ON June 25, 2025)

Pindrop is excited to share our collaboration with NVIDIA to advance defenses against unauthorized synthetic speech in support of building safe, robust, and responsibly deployed AI systems.

Why responsible release matters

NVIDIA recently introduced NVIDIA Riva Magpie, a multilingual text-to-speech (TTS) model capable of generating high-quality, natural-sounding speech in English, Spanish, German, and French. But one of its most powerful features, zero-shot voice cloning, was deliberately withheld until safeguards could be put in place.

Pindrop is part of a select group of partners granted early access to help develop and reinforce those safeguards. This early access enables us to evaluate realistic voice clones, improve our detection models, and ensure protections are in place before these capabilities are publicly available.

What makes zero-shot voice cloning risky without safeguards

Zero-shot cloning allows a TTS model to reproduce a target voice from just a few seconds of reference audio, with no additional training on that speaker. While it unlocks creative applications, it also creates new opportunities for misuse, such as impersonation, fraud, and misinformation.

That’s where early access matters: it allows industry leaders like Pindrop to proactively train detectors against emerging models before they’re widely available.

Understanding zero-day exploits

‘Zero-day’ cloning exploits occur when a new synthetic speech model is used before detection systems have seen or adapted to its artifacts. These blind spots can make even state-of-the-art protections vulnerable.

We’ve written before about how each stage of a TTS system – the text analysis module, acoustic model, and vocoder – can leave behind subtle artifacts like unnatural prosody or spectral anomalies. Pindrop detectors are designed to find those traces.

Riva Magpie uses a T5-TTS encoder-decoder transformer (an AI model that converts text into speech) paired with an audio codec (software that compresses and decompresses audio data). Because we’ve trained our technology on similar architectures, our system generalizes well, even to models we haven’t seen before. In our initial evaluation of Riva Magpie, using a few thousand 5-second utterances, our technology was able to detect over 90% of synthetic samples with false accept rates below 1% (meaning fewer than 1 in 100 synthetic samples are incorrectly classified as genuine).

Why early access protects

Early access to models like Riva Magpie gives organizations like Pindrop a critical head start – we’re able to assess detection accuracy for our technology across a wide range of conditions, including male and female voices, multiple languages, short and long utterances, and varying sampling rates and compression levels. This enables us to fine-tune for performance and generalization before these models are widely released.
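As a rough illustration only, and not a description of Pindrop's actual tooling, an evaluation across conditions like these could be organized as a grid, with accuracy then broken out along any one axis. The languages are the ones this article lists for Riva Magpie; the duration, sampling-rate, and codec values, and the names build_conditions and accuracy_by, are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import product

# Languages and genders come from the article; every other value is an assumed placeholder.
LANGUAGES = ["English", "Spanish", "German", "French"]
GENDERS = ["male", "female"]
DURATIONS_S = [5, 30]            # assumed "short" and "long" utterance lengths
SAMPLE_RATES_HZ = [8000, 16000]  # assumed sampling rates
CODECS = ["none", "mu-law"]      # assumed compression settings


def build_conditions():
    """Return one dict per combination of evaluation conditions."""
    keys = ("language", "gender", "duration_s", "sample_rate_hz", "codec")
    axes = (LANGUAGES, GENDERS, DURATIONS_S, SAMPLE_RATES_HZ, CODECS)
    return [dict(zip(keys, combo)) for combo in product(*axes)]


def accuracy_by(results, key):
    """Group (condition, is_correct) pairs by one axis and return accuracy per value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, is_correct in results:
        total[condition[key]] += 1
        correct[condition[key]] += int(is_correct)
    return {value: correct[value] / total[value] for value in total}


conditions = build_conditions()
```

Breaking results out per axis, for example accuracy_by(results, "codec"), makes it easier to see whether a detector is weak under one specific condition even when the overall number looks strong.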

After the first evaluation pass, which detected over 90% of synthetic samples, we quickly generated an additional 8,000 samples, each 5 seconds long, split across genders and languages. We augmented these samples with the conditions that our solutions most commonly operate in: noisy environments from call centers, varied sampling rates from social media, and varied compression formats from video conferencing platforms. This small dataset (40,000 seconds, or roughly 11 hours, of audio) gave us enough coverage to retrain and adapt our models. As a result, we increased our solution’s detection accuracy to 99.2%, while keeping false accept rates under 1%.
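The three kinds of augmentation described above (added noise, a changed sampling rate, and lossy compression) can be sketched in a few lines of NumPy, along with the dataset-size arithmetic. This is a minimal sketch under stated assumptions, not Pindrop's pipeline: white Gaussian noise stands in for call-center noise, linear interpolation stands in for a proper resampler, and a mu-law round trip stands in for real codecs. The function names, the 10 dB SNR, the 16 kHz and 8 kHz rates, and the sine-wave clip are all hypothetical.

```python
import numpy as np


def add_noise(audio, snr_db, rng):
    """Mix in white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def resample_linear(audio, src_rate, dst_rate):
    """Crude resampling by linear interpolation (no anti-aliasing filter)."""
    n_out = int(round(len(audio) * dst_rate / src_rate))
    src_t = np.arange(len(audio)) / src_rate
    dst_t = np.arange(n_out) / dst_rate
    return np.interp(dst_t, src_t, audio)


def mu_law_roundtrip(audio, mu=255):
    """Simulate lossy coding: mu-law compress, quantize to mu + 1 levels, expand."""
    x = np.clip(audio, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    quantized = np.round((compressed + 1) / 2 * mu) / mu * 2 - 1
    return np.sign(quantized) * np.expm1(np.abs(quantized) * np.log1p(mu)) / mu


# Dataset size, using the figures quoted in the text.
num_samples = 8000
seconds_each = 5
total_seconds = num_samples * seconds_each  # 40,000 seconds
total_hours = total_seconds / 3600          # about 11.1 hours

# A sine wave stands in for one 5-second clip (assumed 16 kHz source rate).
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(seconds_each * sample_rate) / sample_rate
clip = 0.5 * np.sin(2 * np.pi * 220.0 * t)

noisy = add_noise(clip, snr_db=10, rng=rng)             # noisy environment
narrowband = resample_linear(noisy, sample_rate, 8000)  # lower sampling rate
coded = mu_law_roundtrip(narrowband)                    # lossy compression
```

In practice one would likely use recorded background noise, a filtered resampler, and the actual codecs of the target platforms. The sketch only shows how a single clean clip can be expanded into several degraded variants.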

A shared commitment to safer voice AI

The Pindrop + NVIDIA collaboration ensures that as synthetic speech grows more capable, detection systems stay one step ahead. Whether synthetic speech is used in video interviews, virtual customer interactions, social media content, or call centers, the ability to distinguish real from fake audio is critical. At Pindrop, we’re helping to ensure that trust, identity, and authenticity remain protected as voice becomes a primary interface for digital communication.

We’re proud to collaborate with NVIDIA on this important work – and grateful for the opportunity to contribute to the safe deployment of Riva Magpie and future text-to-speech models.
