Written by: Mike Yang

As voice has continued to emerge as one of the key interfaces for new devices and apps, including vehicles, bank accounts, and home automation systems, concerns about the security of these systems have evolved as well. Now that both Google and Adobe have demonstrated systems that can insert and replace words in recorded speech or mimic human speech, those concerns are becoming more concrete.
The use of voice for authentication or input is not a new phenomenon by any means. Engineers have been building it into various systems for many years, but it’s only recently that voice has become a primary interface rather than a secondary one. With the emergence of voice-command systems in vehicles and devices such as Amazon Echo and Google Home, the interface has become mainstream and actually useful, rather than a novelty. The accuracy of some of these apps is still a work in progress, but the ability of machines to recognize and interpret speech has evolved to the point that they now can take a short chunk of a person’s recorded voice and generate synthesized speech that mimics the voice.
Adobe has revealed a project known as VoCo that it has compared to a Photoshop for voice recordings. The app can take a small piece of a person’s recorded voice and give the user the ability to rearrange or insert words or short phrases into the recording. The user types the desired text into the app, and the software can then add it to the recording wherever the user specifies.
Google also has been working on a synthetic speech system, known as WaveNet, which models raw audio waveforms to produce speech that sounds more human. Many existing text-to-speech systems rely on a database of recorded words to produce sentences. Google’s model doesn’t have that limitation.

“It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of time steps,” the company’s DeepMind engineers said in a post.
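The exponential growth mentioned in that description can be checked with simple arithmetic. The sketch below is illustrative only: it assumes causal convolutions with a kernel size of 2 and dilation factors that double from 1 to 512, neither of which is stated in the quote, and the function name is made up for this example.

```python
def receptive_field(dilations, kernel_size=2):
    """Input time steps visible to one output of stacked dilated convolutions.

    Each layer widens the view by (kernel_size - 1) * dilation steps.
    Kernel size 2 is an assumption made for illustration.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed dilation schedule: 1, 2, 4, ..., 512 (ten layers).
doubling = [2 ** i for i in range(10)]

one_block = receptive_field(doubling)         # ten dilated layers
three_blocks = receptive_field(doubling * 3)  # the same schedule repeated three times
no_dilation = receptive_field([1] * 10)       # ten layers, dilation fixed at 1

print(one_block, three_blocks, no_dilation)
```

Under these assumptions, ten layers with doubling dilation see 1,024 time steps, against 11 for the same depth without dilation, and repeating the schedule three times reaches 3,070, which is consistent with the "thousands of time steps" in the quote.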
“At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.”
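The sampling loop the engineers describe (draw a value, feed it back, predict the next step) can be sketched in a few lines. This is not Google's implementation: `toy_next_step_distribution` is a hypothetical stand-in for the trained network, and the 16 amplitude levels and the starting value are assumptions chosen to keep the example small.

```python
import math
import random


def toy_next_step_distribution(history, num_levels=16):
    """Hypothetical stand-in for a trained network.

    Returns a probability for each quantized amplitude level,
    favoring levels close to the most recent sample.
    """
    last = history[-1]
    weights = [math.exp(-abs(level - last)) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_steps, seed_value=8, num_levels=16, rng=None):
    """Generate samples one step at a time, feeding each one back as input."""
    rng = rng or random.Random(0)
    samples = [seed_value]  # assumed starting value, not part of the output
    for _ in range(num_steps):
        probs = toy_next_step_distribution(samples, num_levels)
        # Draw one value from the predicted distribution...
        next_value = rng.choices(range(num_levels), weights=probs, k=1)[0]
        # ...and append it so the next prediction is conditioned on it.
        samples.append(next_value)
    return samples[1:]


print(generate(20))
```

Because every new sample requires a fresh prediction that depends on the previous one, the loop cannot be parallelized across time steps, which is why the engineers call the approach computationally expensive.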
Adobe’s VoCo is not commercially available yet, and Google is still working on its WaveNet project, but both could have security and privacy implications. Voice authentication systems rely on many factors, not just the user’s voiceprint, to identify a subject. But being able to change the words in a fragment of speech, or to generate speech out of whole cloth, could help an attacker trick such systems.
“It is an interesting question as to whether VoCo will be able to trick voice biometric software and the only way to know for sure is to test it,” Steven Murdoch, a research fellow in the Information Security Group at University College London, said in an email.
“However, biometric vendors say their products look for different features than what people look for, so it is not necessarily the case that VoCo will be effective against biometrics.”
In some voice-interface applications, determining who is speaking is not necessarily that important. For example, telling a vehicle’s audio system to play a song doesn’t really require authentication. But for systems in which identity is a major goal of the voice system, the ability to change or generate speech would be quite useful for an attacker. However, voice recordings have been regarded with suspicion for a while, Murdoch says, so the existence of these new systems shouldn’t change much in that regard.
“There have been products available for some time that will impersonate a person’s voice, and skilled actors are able to do the same. Voice recordings should therefore be treated with suspicion, and this was the case before Adobe announced VoCo,” he said.
Image: Freestocks, public domain.