Machine Learning Outperforms Humans in Spotting Synthetic Voices

April 23, 2024

Written by: Pindrop

Contact Center Fraud & Authentication Expert

As humans, we often overestimate our performance. This phenomenon has been studied so often that it has a name: the Dunning-Kruger effect. The Dunning-Kruger effect describes a cognitive bias that leads us to believe we are smarter than we actually are. We naturally overestimate our own abilities.

We are obsessed with all things voice here at Pindrop. As the voice experts, we wanted to find out how good we are at detecting real voices vs. synthetic ones. We conducted an informal survey of Pindrop employees to see whether a person or a machine is better at telling a synthetic voice from a human one.

Pindrop Lab’s research team wanted a real challenge: synthetic voices that would fool both employees and the technology as often as possible. So instead of training voice models with 20 minutes of audio, the team trained them using 5 hours of speech for each sample voice. They took the 5-hour trained models and developed 100 sophisticated synthesized voices saying random phrases collected from several sources, including blogs, Google, Dessa, etc. The team then selected 20 phrases at random from an initial set of 100 synthesized and genuine voice samples. With 60 participants from Pindrop, the results were a bit surprising!

Pindroppers were able to spot the synthetic voice only slightly more often than random chance. Then the machines got their shot. Even against these extensively trained synthetic voice models, the machines still beat Pindrop employees by over 25%, showing that technology has the upper hand when spotting fake voices, no matter how convincing.

If you are relying on agents to determine if a caller is who they say they are, what do you think their chances are?