While deepfake technology may have legitimate applications in media and entertainment, its misuse poses significant risks for organizations.

AI-generated manipulations, known as deepfakes, can produce convincingly realistic audio and video, leading to significant threats such as financial fraud, identity theft, and the dissemination of false information.

Identifying and addressing these threats is essential for companies—but where can we even start?

Deepfake audits provide a structured and proactive approach to combating these risks. Businesses can protect themselves by identifying vulnerabilities, evaluating the impact of deep learning algorithms, and integrating robust detection tools.

This article explores the importance, components, and actionable steps for effectively implementing deepfake audits. Let’s dive in.

Understanding deepfake technology: How deepfake algorithms work

Deepfake technology employs machine learning (ML) and artificial intelligence (AI) to produce hyper-realistic synthetic media that mimics human audio, video, or both.

This technology relies on advanced algorithms, such as Generative Adversarial Networks (GANs), which enable deepfake systems to learn and replicate intricate details of human behavior, such as speech patterns, facial expressions, and movements.

Key components of deepfake algorithms:

  • Training data: Large datasets of audio or video recordings train the AI models. The more data available, and the higher its quality and diversity, the more accurate and convincing the resulting deepfake becomes.
  • Neural networks: Architectures such as GANs analyze and recreate speech patterns, facial movements, and other markers. They function through a generator that creates synthetic content and a discriminator that evaluates its authenticity.

    This iterative process refines the output until the generated content is nearly indistinguishable from real media (a simplified example of this loop follows the list).
  • Synthetic output: Once trained, the algorithm produces manipulated media to deceive viewers or listeners. For audio deepfakes, the system recreates speech with seamless intonation and fluidity, often bypassing human detection. Video deepfakes involve synchronized lip movements, realistic facial expressions, and body language that align with the audio.
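For readers who want to see what the generator-versus-discriminator loop looks like in practice, here is a heavily simplified training sketch in PyTorch. It is illustrative only: the layer sizes, learning rates, and data shapes are placeholders, not a real deepfake pipeline.

import torch
import torch.nn as nn

# Illustrative only: a toy generator/discriminator pair, not a real deepfake model.
latent_dim, feature_dim = 64, 128   # hypothetical sizes
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim))
discriminator = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def training_step(real_batch):
    batch = real_batch.size(0)
    real_labels, fake_labels = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) The discriminator learns to tell real media features from generated ones.
    fake_batch = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(discriminator(real_batch), real_labels) + bce(discriminator(fake_batch), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) The generator learns to produce content the discriminator accepts as real.
    fake_batch = generator(torch.randn(batch, latent_dim))
    g_loss = bce(discriminator(fake_batch), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

Repeating this step over many batches is what gradually removes the telltale artifacts that older deepfakes showed.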

Benefits of conducting deepfake audits

Deepfake technology has progressed significantly, reducing telltale signs of manipulation such as robotic inflections or visual artifacts. These advancements make detection increasingly challenging, even for trained professionals, which is why conducting audits is crucial.

Prevention of financial fraud

Deepfake audits help organizations detect and mitigate fraudulent activities before they escalate. By identifying synthetic audio or video used to impersonate executives, employees, or customers, audits can:

  • Prevent unauthorized financial transactions initiated through voice phishing or deepfake impersonations.
  • Safeguard sensitive financial information from being exploited by attackers.

Proactive approach for reviewing security

Conducting deepfake audits allows organizations to adopt a proactive security strategy. Regular audits help:

  • Identify gaps in current security frameworks, especially in systems reliant on video or voice authentication
  • Test the effectiveness of detection tools and protocols against emerging deepfake threats
  • Build resilience by ensuring that new AI-driven risks are addressed promptly

Protection of organizational reputation

Deepfake attacks can severely damage a company’s brand and stakeholder trust. For example, a deepfake video of an executive or product announcement could mislead stakeholders and harm the company’s credibility. Audits minimize reputational risks by:

  • Flagging manipulative synthetic media before it spreads widely
  • Ensuring that incidents are managed quickly and effectively to maintain customer confidence

Components of a deepfake audit

Identifying deepfake content

The foundation of any deepfake audit is the ability to detect synthetic media. Identifying deepfake content involves:

  • Content analysis: Use advanced detection tools to catch signs of manipulation. Look for inconsistencies in tone, pitch, background noise, or visual distortions, such as unnatural transitions or mismatched lip movements. For example, an AI-generated audio clip might have subtle variations in vocal intonation or background ambiance that don’t align with authentic recordings.
  • Tool-based detection: Technologies like liveness detection and voice biometrics are essential. For instance, Pindrop® Pulse™ Tech excels at analyzing audio patterns to identify anomalies that indicate deepfake attacks. In one of many cases, we flagged suspicious patterns in contact center interactions, exposing fraudulent attempts early. Learn how we did it with our article about identifying patterns of deepfake attacks in call centers.
  • Manual review: While automated tools are essential, having trained experts to review flagged content ensures accuracy. These professionals can validate findings and provide nuanced insights that technology might miss.

Evaluating the impact of deepfakes

Once deepfake content is identified, assessing its potential impact on the organization is vital. This involves:

  • Risk assessment: Determine the level of harm the deepfake could cause. Consider the following:
    • Could it lead to financial fraud or unauthorized transactions?
    • Does it have the potential to damage the organization’s reputation?
    • Could it erode trust among customers or stakeholders?
  • Operational impact: Evaluate how the deepfake could disrupt business operations, such as impersonating executives or compromising internal communications.
  • Compliance risks: Assess whether the deepfake could lead to regulatory violations, especially those involving financial data or personally identifiable information (PII). For example, a deepfake impersonating a CEO to authorize a fraudulent wire transfer could breach financial reporting regulations, and one used to steal sensitive information could violate privacy regulations.


Assessing the reach and spread of deepfake content

Understanding the dissemination and reach of deepfake content is crucial for containment and mitigation. Key steps include:

  • Content tracking: Use digital tools to monitor the spread of deepfake content across platforms. Tools like media monitoring software can flag where the content has been shared or reposted.
  • Audience analysis: Identify the demographic or groups exposed to the deepfake. This helps prioritize mitigation efforts and communication strategies.
  • Impact quantification: Estimate the scale of the damage based on the spread. For instance:
    • How many individuals or entities might have been misled?
    • Are there public relations implications, such as media coverage or social media backlash?

Best practices for conducting deepfake audits

Developing a deepfake detection framework

A well-structured framework is essential for identifying and addressing deepfake threats. Key elements include:

  • Establish clear protocols: Define processes for analyzing and flagging potential deepfake content. This includes:
    • Identifying high-risk areas such as financial transactions or executive communications.
    • Creating escalation procedures or ticketing systems for suspected deepfakes.
  • Integrate detection at multiple levels: Ensure deepfake detection is embedded into every stage of the organization’s workflow, from initial customer interactions to high-level decision-making.
  • Set metrics for evaluation: Measure the effectiveness of detection methods by tracking metrics like false positive rates, detection speed, and the number of confirmed deepfake cases.
  • Simulate scenarios: Conduct regular simulations of deepfake attacks to evaluate the framework’s robustness and train employees on appropriate responses.

Collaborating with experts in AI and cybersecurity

Deepfake threats require specialized knowledge. Collaborating with experts ensures organizations have access to the latest technologies and insights.

You can begin by collaborating with or subscribing to academic institutions or private companies focusing on AI, deep learning, and machine learning. This will allow you to stay updated on the latest deepfake techniques.

Cybersecurity firms can also be a good option to strengthen your organization’s defenses. You can also join industry groups and forums to share knowledge about deepfake mitigation. These platforms provide valuable insights and foster innovation in combating deepfake fraud.

Leverage vendor expertise to gain the knowledge and resources for deepfake detection. They can help you evaluate your defense strategy against deepfakes and provide the tools needed for the job.

Implementing deepfake detection tools

Unsurprisingly, investing in advanced tools is critical to defending against deepfakes. It’s essentially technology vs. technology—and having the right tools makes all the difference.

When evaluating deepfake detection solutions, look for these key features to promote comprehensive protection:

  • Real-time detection: The ability to identify synthetic media as it’s being used, minimizing the window of opportunity for attackers.
  • Continuous assessment: Ongoing evaluation and improvement of detection algorithms to keep pace with advancing threats.
  • Resilience: Tools that adapt to new attack vectors, ensuring robust defense against evolving deepfake tactics.
  • Zero-day attack coverage: Early detection of novel threats, even those not previously encountered, to prevent breaches.
  • Explainability: Insights into how and why a piece of content is flagged as a deepfake, enabling clear communication of risks to stakeholders.

Pindrop offers cutting-edge solutions tailored for real-time deepfake detection, seamlessly integrating into existing security frameworks. With Pindrop® Pulse™ Tech, organizations can:

  • Analyze audio for manipulation using advanced voice analysis and AI-driven algorithms.
  • Detect and block synthetic media in real time, preserving business continuity and protecting sensitive data.
  • Integrate with current security systems, enhancing the overall fraud prevention strategy without overhauling existing workflows.

Pindrop® Solutions help safeguard call centers in various industries such as financial institutions, retail, and more by proactively identifying deepfake content before it can cause harm.

Be proactive with your business’s security with Pindrop Solutions

As deepfake technology continues to evolve, so must your defenses. Proactive measures are key to protecting your organization from synthetic media’s financial, operational, and reputational risks.

Pindrop® Solutions empower businesses like yours to stay ahead of these threats by providing real-time detection, continuous improvement, and seamless integration into existing systems.

Take the next step in safeguarding your organization—schedule a free demo today.

Returns are a standard part of retail, but they’re not without risks. Fraudulent returns cost businesses significant losses annually. While restricting returns might seem like the only way to fight retail fraud, there are better ways to help reduce fraud losses that don’t sacrifice the customer experience.

Leveraging an advanced voice biometrics analysis solution can help protect customer accounts, spot fraudulent returns, and streamline the call experience. This article will explore the types of return fraud and how to combat it with advanced voice security.

Understanding return fraud

Return fraud involves customers exploiting return policies for personal gain. It comes in various forms, from returning stolen items to abusing liberal return policies. 

According to the National Retail Federation, return fraud costs billions annually and contributes to operational inefficiencies. Retailers often face challenges balancing customer satisfaction with fraud detection.

The most common types of fraud in retail include:

  • Receipt fraud: Customers use fake receipts or receipts from other items to return merchandise
  • Wardrobing: Buying an item, using it briefly, and returning it as “new”
  • Stolen goods returns: Returning stolen goods for refunds or store credits
  • Refund fraud: Manipulating the system to receive more than the value of the returned item

What is voice biometrics in retail?

Voice biometrics is a technology that identifies individuals based on unique vocal characteristics. It analyzes various features of a person’s voice, such as pitch, tone, and rhythm.

This technology can help protect retail contact centers from refund fraud, offering a secure and efficient means of verifying customer voices during transactions, including returns.

Unlike traditional authentication methods, such as passwords, voice biometrics provide an additional layer of security by leveraging something inherently unique to each individual—their voice. When used in tandem with other authentication factors, this advanced technology can assist retailers in combating fraudulent returns while helping create a faster and simpler returns process.

How voice biometrics can detect return fraud

Voice biometric analysis brings multiple benefits to retailers, helping to reduce fraud and improve operational efficiency. 

Real-time authentication

With voice biometrics, you can authenticate customers in real-time, helping to ensure that the person initiating a return is the purchaser. This technology can be particularly useful in contact centers, where authenticating customers through traditional methods is more challenging.

By using multifactor authentication, stores can drastically reduce fraudulent return attempts. This process also minimizes disruptions for genuine customers, maintaining a smooth and efficient return experience.

Fraud detection

Voice biometrics can identify suspicious behavior patterns by the individual attempting the return.

Multifactor authentication 

You can use voice biometrics as part of a multifactor authentication (MFA) approach, combining content-agnostic voice verification with other verification methods like PINs or SMS codes. 

With this approach, even if one method fails, or if some credentials are lost or stolen, you still have a method to detect fraudulent activity.

Secure transactions

Voice biometrics can help create a secure environment for customers during their transactions. Once the system receives authentication information on the customer, it can securely process the return, significantly reducing the chances of refund fraud. This helps protect the retailer from loss and can provide customers with peace of mind, knowing their information is securely handled.

Accelerating return transactions

When using traditional authentication methods, customers can often find the process tedious. Voice biometrics help speed up return transactions, as customers can skip more lengthy verification procedures.

This helps create a faster, hassle-free return process, contributing to a better overall customer experience.

Data protection

Retailers can use voice biometrics to enhance data protection protocols, maintaining their consumers’ trust.

Implementing voice biometrics in your retail system

Integrating voice biometrics into your retail system in a way that’s effective and user-friendly requires careful planning.

Evaluate current systems 

Start by evaluating your existing return processes and fraud detection strategies. Understanding where current vulnerabilities lie will help identify how voice biometric analysis can fill those gaps.

Select a reliable voice biometrics solution provider

Partnering with a reliable voice biometrics provider is crucial. Look for vendors with experience in retail security, a track record of success, and robust data protection measures.

Integrate voice biometrics seamlessly into retail systems

Ensure that voice biometrics integrate smoothly with your existing retail systems. This will reduce disruption during the implementation phase and allow both customers and staff to adapt quickly to the new system.

Train staff on using voice biometrics system 

Training your staff members on how to use the voice biometrics system effectively is critical. Otherwise, no matter how good the technology is, there’s an increased risk of human error that could eventually lead to return fraud. 

Training should include knowing when and how to use the technology and troubleshooting potential issues to prevent delays in the returns process.

Monitor system performance and optimize processes 

After implementation, regularly monitor the system’s performance to ensure it functions as expected. Make necessary adjustments to optimize the system’s capabilities and improve its accuracy and efficiency in supporting fraud prevention efforts. 

Additional benefits of voice biometrics in retail

Beyond helping prevent return fraud, voice biometrics offer additional advantages that enhance the overall retail experience.

  • Reduced fraud costs: By minimizing fraudulent returns, retailers can significantly reduce the financial losses associated with them. This helps merchants optimize their operations, improve profitability, and focus resources on serving genuine customers.
  • Convenience: Voice biometrics streamline the return process by eliminating the need for physical IDs or receipts. Customers can complete their returns quickly and easily, leading to a better shopping experience.
  • Trust and loyalty: Implementing voice biometrics builds trust with customers, as they feel confident that their identities and transactions are secure. This increased level of trust enhances customer loyalty and encourages repeat business.
  • Transparency: Maintaining transparency with customers about the use of voice biometrics for fraud detection can foster confidence. Clear communication regarding how voice analysis is used will help consumers understand the purpose and benefits of this technology.

Adopt a voice biometrics solution to help prevent return fraud

Return fraud is a serious issue affecting retailers worldwide, leading to losses of billions of dollars each year. While strict return policies may be somewhat helpful, retailers need to find better, customer-friendly alternatives. One such approach is voice biometrics, which offers additional defenses against fraudulent returns while improving the customer experience.

Voice biometric solutions can help merchants secure their return processes, reduce fraud costs, and build stronger relationships with customers. Adopting such a technology may seem like a significant shift, but its long-term benefits, both in fraud detection and customer trust, make it a compelling choice for small and large retailers alike.

More and more incidents involving deepfakes have been making their way into the media, like the one mimicking Kamala Harris’ voice in July 2024. Although AI-generated audio can offer entertainment value, it carries significant risks for cybersecurity, fraud, misinformation, and disinformation.

Governments and organizations are taking action to regulate deepfake AI through legislation, detection technologies, and digital literacy initiatives. Studies reveal that humans aren’t great at differentiating between a real and a synthetic voice. Security methods like liveness detection, multifactor authentication, and fraud detection are needed to combat this and the undeniable rise of deepfake AI. 

While deep learning algorithms can manipulate visual content with relative ease, accurately replicating the unique characteristics of a person’s voice poses a greater challenge. Advanced voice security can distinguish real voices from synthetic ones, providing a stronger defense against AI-generated fraud and impersonation.

What is deepfake AI?

Deepfake AI is synthetic media generated using artificial intelligence techniques, typically deep learning, to create highly realistic but fake audio, video, or images. It works by training neural networks on large datasets to mimic the behavior and features of real people, often employing methods such as GANs (generative adversarial networks) to improve authenticity.

The term “deepfake” combines “deep learning” and “fake,” reflecting the use of deep learning algorithms to create authentic-looking synthetic content. These AI-generated deepfakes can range from video impersonations of celebrities to fabricated voice recordings that sound almost identical to the actual person.

What are the threats of deepfake AI for organizations?

Deepfake AI poses serious threats to organizations across industries because of its potential for misuse. From cybersecurity to fraud and misinformation, deepfakes can lead to data breaches, financial losses, and reputational damage and may even alter the public’s perception of a person or issue.

Cybersecurity 

Attackers can use deepfake videos and voice recordings to impersonate executives or employees in phishing attacks. 

For instance, a deepfake voice of a company’s IT administrator could convince employees to disclose their login credentials or install malicious software. Since humans have difficulty spotting the difference between a genuine and an AI-generated voice, the chances of a successful attack are high.

Voice security could help by detecting liveness and using multiple factors to authenticate calls. 

Fraud 

AI voice deepfakes can trick authentication systems in banking, healthcare, and other industries that rely on voice verification. This can lead to unauthorized transactions, identity theft, and financial losses.

A famous deepfake incident led to $25 million in losses for a multinational company. The fraudsters recreated the voice and image of the company’s CFO and several other employees. 

They then proceeded to invite an employee to an online call. The victim was initially suspicious, but seeing and hearing his boss and colleagues “live” on the call reassured him. Consequently, he transferred $25 million into another bank account as instructed by the “CFO.”

Misinformation

Deepfake technology contributes to the spread of fake news, especially on social media platforms. For instance, in 2022, a few months after the Ukraine-Russia conflict began, a disturbing incident took place. 

A video of Ukraine’s President Zelenskyy circulated online, in which he appeared to tell his soldiers to surrender. Despite the gross misinformation, the video stayed online and was shared by thousands of people, and even by some news outlets, before finally being taken down and labeled as fake.

With AI-generated content that appears credible, it becomes harder for the public to distinguish between real and fake, leading to confusion and distrust.

Other industry-specific threats

The entertainment industry, for example, has already seen the rise of deepfake videos in which celebrities are impersonated for malicious purposes. But it doesn’t stop there—education and even everyday business operations are vulnerable to deepfake attacks. For instance, in South Korea, attackers distributed deepfakes targeting underage victims in an attack that many labeled a real “deepfake crisis.”

The ability of deepfake AI to create fake content with near-perfect quality is why robust security systems, particularly liveness detection, voice authentication, and fraud detection, are important.

Why voice security is essential for combating deepfake AI

Voice security can be a key defense mechanism against AI deepfake threats. While you can manipulate images and videos to a high degree, replicating a person’s voice with perfect accuracy remains more challenging.

Unique marker

Voice is a unique marker. The subtle but significant variations in pitch, tone, and cadence are extremely difficult for deepfake AI to replicate accurately. Even the most advanced AI deepfake technologies struggle to capture the complexity of a person’s vocal identity. 

This inherent uniqueness makes voice authentication a highly reliable method for verifying a person’s identity, offering an extra layer of security that is hard to spoof. 

Resistant to impersonation

Even though deepfake technology has advanced, there are still subtle nuances in real human voices that deepfakes can’t perfectly mimic. That’s why you can detect AI voice deepfake attempts by analyzing the micro-details specific to genuine vocal patterns.

Enhanced fraud detection

Integrating voice authentication and liveness detection with other security measures can improve fraud detection. By combining voice verification with existing fraud detection tools, businesses can significantly reduce the risks associated with AI deepfakes.

For instance, voice security systems analyze various vocal characteristics that are difficult for deepfake AI to replicate, such as intonation patterns and micro-pauses in speech. These systems can then catch these indications of synthetic manipulation.

How voice authentication mitigates deepfake AI risks

Voice authentication does more than just help verify identity—it actively helps reduce the risks posed by deepfake AI. Here’s how:

Distinct voice characteristics

A person’s voice has distinct characteristics that deepfake AI struggles to replicate with 100% accuracy. By focusing on these unique aspects, voice authentication systems can differentiate between real human voices and AI-generated fakes.

Real-time authentication

Voice authentication works in real time, meaning that security systems can detect a deepfake voice as soon as an impersonator tries to use it. This is crucial for stopping fraud attempts as they happen.

Multifactor authentication

Voice authentication can also serve as a layer in a multifactor authentication system. In addition to passwords, device analysis, and other factors, voice adds an extra layer of security, making it harder for AI deepfakes to succeed.

Enhanced security measures

When combined with other security technologies, such as AI models trained to detect deepfakes, voice authentication becomes part of a broader strategy to protect against synthetic media attacks and fake content.

Implementing voice authentication as a backup strategy

For many industries—ranging from finance to healthcare—the use of synthetic media, such as AI-generated voices, has increased the risk of fraud and cybersecurity attacks. To combat these threats, businesses need to implement robust voice authentication systems that can detect and help them mitigate deepfake attempts.

Pindrop, a recognized leader in voice security technology, can offer tremendous help. Our solutions include advanced capabilities for detecting deepfake AI, helping companies safeguard their operations from external and internal threats.

Pindrop® Passport is a robust multifactor authentication solution that allows seamless authentication with voice analysis. The system analyzes various vocal characteristics to verify a caller. 

In real-time interactions, such as phone calls with customer service agents or in financial transactions, Pindrop® Passport continuously analyzes the caller’s voice, providing a secure and seamless user experience.

Pindrop® Pulse™ Tech goes beyond basic authentication, using AI and deep learning to detect suspicious voice patterns and potential deepfake attacks. It analyzes content-agnostic voice characteristics and behavioral cues to flag anomalies, helping organizations catch fraud before it happens. 

Pindrop® Pulse™ Tech provides an enhanced layer of security and improves operational efficiency by spotting fraudsters early in the process. For companies that regularly interact with clients or partners over the phone, this is an essential tool for detecting threats in real time. 

For those in the media, nonprofits, governments, and social media companies, deepfake AI can pose even more problems, as the risk of spreading false information can be high. Pindrop® Pulse™ Inspect offers a powerful solution to this problem by providing rapid analysis of audio files to detect synthetic speech. 

The tool helps verify that content is genuine and reliable by analyzing audio for liveness and identifying segments likely affected by deepfake manipulation. 

The future of voice security and deepfake AI

As deepfake AI technologies evolve, we need appropriate defense mechanisms.

Voice authentication is already proving to be a key factor in the fight against deepfakes, but the future may see even more advanced AI models capable of detecting subtle nuances in synthetic media. With them, organizations can create security systems that remain resilient against emerging deepfake threats.

Adopt a voice authentication solution today

Given the rise of deepfake AI and its growing threats, now is the time to consider implementing voice security in your organization’s security strategy. 

Whether you’re concerned about fraud or the spread of misinformation, voice authentication provides a reliable, effective way to mitigate the risks posed by deepfakes.

David Looney, Nikolay D. Gaubitch

Pindrop Inc., London, UK
[email protected], [email protected]

Abstract

We consider the problem of robust watermarking of speech signals using the spread spectrum method. To date, it has primarily been applied to music signals. Here we discuss differences between speech and music, and the implications this has on the use of spread spectrum watermarking. Moreover, we propose enhancements to the watermarking of speech for the detection of deepfake attacks at call centers using classical signal processing techniques and deep learning.

Index Terms: watermarking, spread spectrum

1. Introduction

With the rise of generative AI, it is becoming increasingly difficult to validate the authenticity of audio and video. A possible solution is to apply a digital watermark to synthetically-generated media content, which can then be used to make users aware that the media is indeed synthetically generated [1, 2]. One area where synthetic speech in particular poses a risk is in call centers. However, synthetic speech is typically generated with high quality at high sampling rates, and by the time it reaches a call center, it inevitably undergoes a series of degradations, such as downsampling and compression. Further degradations include acoustic noise and reverberation if replayed through a loudspeaker. This creates a significant challenge to robust watermarking, which is the topic of this work.

Audio watermarking received attention when music streaming and file sharing platforms were becoming popular in the early 2000s, mostly to facilitate intellectual property protection. It is also in this period that much of the work on the topic was published [3]. The required characteristics of a robust watermark are that it should be imperceptible to a human listener (imperceptibility), capable of withstanding deliberate attacks or anticipated signal degradations (robustness), and able to carry information (capacity) [3, 4]. Furthermore, we are only concerned with blind-watermarking detection methods where the original signal is not available at the decoder. The main approaches for audio watermarking that satisfy these requirements include the insertion of time-domain echoes [5, 6], spread spectrum modulation [7, 8, 9], or quantization index modulation [10]. In recent works, such as [11], end-to-end neural watermarking schemes are proposed that appear promising. However, at this stage, they require long speech utterances and tend to be computationally inefficient. Many techniques make use of the operation of the human auditory system to achieve better imperceptibility [3, 6, 7, 12].

We consider the problem of robust watermarking of speech signals using a spread spectrum method based on [7]. While this method was applied previously to music signals, in Section 3 we discuss differences between speech and music signals and the implications this has on the use of spread spectrum watermarking. Moreover, we propose modifications in the form of spectral shaping, both with respect to the encoder and the decoder, to tailor the spread spectrum method to speech signals. In the case of encoding, in Section 4 we show how linear prediction coding (LPC) analysis can be used to dynamically adjust the watermarking sequence to yield increased imperceptibility with no tradeoff in robustness. In the case of decoding, in Section 5 we show how a deep learning model can replace the standard decoding operation to improve robustness by emphasizing spectral components in a data-driven fashion. The analysis is supported in experimental scenarios matching the call center use-case, where encoding is performed on high-quality speech, and decoding is performed after applying typical telephony degradations (additive noise, downsampling, codec).

2. Spread spectrum watermarking

We want to add a watermark $w$ to a speech signal $s(n)$. The watermark is typically a pseudo-random sequence, $w \in \{\pm 1\}^{N_w}$, which is applied to the signal in some transform domain [7, 8]. Here, the watermark is added to the $l$-th frame in the log-spectral domain using the short-time Fourier transform (STFT):

\[ X_{dB}(k,l) = S_{dB}(k,l) + \delta\, w(k), \] (1)

where $S_{dB}(k,l) = 20\log_{10}|S(k,l)|$, $k$ is the frequency index, $S(k,l)$ is the discrete Fourier transform (DFT) of the $l$-th frame of $s(n)$, and $\delta$ is a scaling parameter to control the watermark strength. The watermarked time-domain signal $x(n)$ is reconstructed from $|X(k,l)|$ and the phase of $S(k,l)$.
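As an illustration, the embedding step can be sketched in a few lines of NumPy/SciPy. This is a sketch only: the frame length, overlap, and band placement below are placeholders and do not correspond to the exact configuration used in the experiments.

import numpy as np
from scipy.signal import stft, istft

def embed_watermark(s, fs, w, delta, frame_len=0.02, k_start=10):
    """Add a +/-1 watermark sequence w to the log-magnitude STFT of speech s."""
    nperseg = int(frame_len * fs)
    f, t, S = stft(s, fs=fs, nperseg=nperseg)           # S(k, l)
    S_dB = 20.0 * np.log10(np.abs(S) + 1e-12)            # S_dB(k, l)
    # Embed the same sequence in every frame over Nw consecutive frequency bins.
    k = slice(k_start, k_start + len(w))
    X_dB = S_dB.copy()
    X_dB[k, :] = S_dB[k, :] + delta * w[:, None]          # X_dB = S_dB + delta * w
    X = (10.0 ** (X_dB / 20.0)) * np.exp(1j * np.angle(S))  # keep the original phase
    # Note: overlap-add reconstruction smooths the per-frame modification slightly;
    # the paper's exact framing may differ.
    _, x = istft(X, fs=fs, nperseg=nperseg)
    return x[:len(s)]

# Example: a 63-element pseudo-random watermark, as used later in Section 4.1.
rng = np.random.default_rng(0)
w = rng.choice([-1.0, 1.0], size=63)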

From detection theory, the optimal detector of the watermark is a matched filter:

\[ d(l) = \frac{1}{N_w} \sum_{k} X_{dB}(k,l)\, w(k). \] (2)

The false alarm and false rejection probabilities are given by:

\[ P_{fa} = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\tau\sqrt{N_w}}{\sigma_s\sqrt{2}}\right), \qquad P_{fr} = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{(\delta-\tau)\sqrt{N_w}}{\sigma_s\sqrt{2}}\right), \] (3)

where $\sigma_s$ is the standard deviation of the signal, $\mathrm{erfc}(\cdot)$ is the complementary error function, and $\tau$ is a detection threshold parameter. An important metric for this work is the equal error rate (EER), when $P_{fa} = P_{fr}$, which is obtained by setting the threshold to $\tau = \delta/2$.
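A corresponding sketch of the matched-filter decoder and an empirical EER, assuming the embedding function above; the threshold sweep stands in for the analytical expressions and is illustrative rather than the experimental implementation.

import numpy as np
from scipy.signal import stft

def decode_frame_scores(x, fs, w, frame_len=0.02, k_start=10):
    """Correlate each frame's log-magnitude spectrum with the watermark, as in (2)."""
    nperseg = int(frame_len * fs)
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    X_dB = 20.0 * np.log10(np.abs(X) + 1e-12)
    band = X_dB[k_start:k_start + len(w), :]
    return (band * w[:, None]).mean(axis=0)               # one score per frame

def empirical_eer(scores_marked, scores_clean):
    """Sweep a threshold and return the point where false alarm ~= false rejection."""
    lo = min(scores_clean.min(), scores_marked.min())
    hi = max(scores_clean.max(), scores_marked.max())
    taus = np.linspace(lo, hi, 1000)
    pfa = np.array([(scores_clean >= t).mean() for t in taus])
    pfr = np.array([(scores_marked < t).mean() for t in taus])
    i = np.argmin(np.abs(pfa - pfr))
    return 0.5 * (pfa[i] + pfr[i])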

3. Watermarking speech vs. music signals

It is known to be more challenging to achieve a balance between imperceptibility and robustness when adding a watermark to speech compared to music [3, 7] due to the more limited spectral content of the former. We view this problem from a different angle by considering the choice of frame-length $L$. This is important because $L$ governs the length of the watermark signal, which in turn is related to robustness as seen in the equations above. For our experiments, we employed two objective audio quality metrics: the speech-specific perceptual evaluation of speech quality (PESQ) [13] and the open source implementation GstPEAQ [14] of perceptual evaluation of audio quality (PEAQ) [15, 16], which has been used previously for the evaluation of watermarking of music [17]. We used 200 randomly selected speech utterances from TIMIT [18] with a sampling rate of 16 kHz. We varied the frame-length between 20 ms and 200 ms, adding a watermark of the same length as the frame-length to each signal according to the equation above.

We observed that as the frame-length increases, the watermark becomes more audible, and at the same time, the EER decreases. On the other hand, if we reduce the watermark strength by scaling δ, we can maintain constant perceptual quality but at the expense of an increasing EER.

Next, we performed the same experiment using TIMIT speech samples and 144 seven-second music excerpts from MUSDB [19], with sampling rates of 16 kHz and 44.1 kHz, measuring the audio quality with PEAQ. We observed that the degradation in speech quality with increasing frame-length is noticeable, but the music audio quality remains unaffected by the frame-length, independent of the sampling rate. From these results, we can conclude that for speech signals, the frame-length should be chosen between 20 ms and 30 ms for the best trade-off between imperceptibility and robustness. For music, the frame-length could be much longer for greater robustness without perceptual degradation.

4. LPC Weighting

It has been stated previously that the spread spectrum watermark should be added to the frequency components with the greatest energy to enable robustness to degradations [7, 8]. In the case of speech, there is a counterargument because frequency components with the greatest energy typically correspond to formant peaks, and their disruption will impact speech quality. We introduce a watermark weighting scheme to reduce the strength dynamically along frequency at each frame based on the LPC log-spectrum.

Taking the LPC log-spectrum components of 400 speech utterances (200 female) from the TIMIT corpus, we model the values as a Gaussian distribution with mean zero and standard deviation of 13.5 (see Fig. 2 (left)). We use the Gaussian cumulative distribution function (CDF) $F(x)$ to yield a weighting function $\gamma(x) = (1 - F(x))^{\alpha}$, which reduces the watermark strength at high-energy spectral components; the parameter $\alpha$ controls the degree to which these components are attenuated. The revised watermark sequence, $w_{lpc}$, is created by obtaining the LPC log-spectrum, $X_{lpc}(k)$, for each frame of speech and adjusting the watermark strength along frequency as

\[ w_{lpc}(k) = \gamma\big(X_{lpc}(k)\big)\, w(k). \] (4)
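A sketch of the per-frame weighting step in Python, assuming the LPC envelope is computed with librosa and the Gaussian CDF fit described above; the helper name, LPC order, zero-meaning of the envelope, and band placement are assumptions rather than the exact experimental configuration.

import numpy as np
import librosa
from scipy.signal import freqz
from scipy.stats import norm

def lpc_weighted_watermark(frame, w, fs, alpha=0.35, order=16, k_start=10, sigma=13.5):
    """Scale the watermark per frequency bin by gamma(x) = (1 - F(x))**alpha,
    where x is the LPC log-spectrum and F is a zero-mean Gaussian CDF."""
    a = librosa.lpc(frame.astype(float), order=order)        # LPC coefficients A(z)
    nfft = 2 * (len(frame) // 2)
    _, h = freqz([1.0], a, worN=nfft // 2 + 1, fs=fs)          # LPC spectral envelope
    X_lpc = 20.0 * np.log10(np.abs(h) + 1e-12)
    X_lpc = X_lpc - X_lpc.mean()                               # assumed: zero-mean log-spectrum
    F = norm.cdf(X_lpc, loc=0.0, scale=sigma)                  # Gaussian CDF, sigma = 13.5
    gamma = (1.0 - F) ** alpha
    k = slice(k_start, k_start + len(w))
    return gamma[k] * w                                        # w_lpc for this frame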

4.1. Perceptibility Study

We first study the impact on perceptibility as measured by speech quality using the proposed scheme. In Fig. 3, we show the narrowband (NB) and wideband (WB) PESQ scores for both the standard SS method and the proposed LPC-weighted one obtained from 200 TIMIT utterances (100 female). The data is clean speech sampled at 16 kHz, and the watermark length $N_w$ is 63 (element width: 50 Hz, watermark start frequency: 500 Hz, watermark end frequency: 3650 Hz, frame length: 320 samples (20 ms), each frame is encoded with the same pseudo-random sequence). As expected, for the same value of $\delta$, the PESQ scores are higher using the proposed scheme as the watermark strength has been reduced in high-energy spectral regions. For instance, at $\delta = 4$, the NB and WB PESQ scores are, respectively, 3.50 and 3.24 using the original scheme, and 4.07 using the proposed one. The gender imbalance is also improved.

4.2. Robustness Study

We encoded 200 clean utterances sampled at 16 kHz from the TIMIT corpus (100 female) using both the standard spread spectrum method and the proposed scheme with α = 0.35.

Based on the analysis presented in the previous section, different values for δ were explored such that the narrowband PESQ scores for each scheme would be similar. This allowed us to evaluate any gains in robustness for the same level of imperceptibility. Prior to applying the decoding algorithm, the watermarked and watermark-free utterances were subjected to degradations: added white Gaussian noise for SNRs of 20 dB and 15 dB, and a downsampling operation to 8 kHz.

Fig. 4 shows the PESQ scores versus EER for the standard and proposed methods, where decoding has been performed using (2).

For 20 dB SNR and a NB PESQ of 4.4, the EERs using the standard and proposed methods are, respectively, 5.5% and 1.7% (69% reduction). For 15 dB SNR and the same NB PESQ, the EERs using the standard and proposed methods are 8.2% and 5.9% (28% reduction). Note that the robustness gains are greater at comparable WB PESQ scores; at a WB PESQ of 4.4, the reductions in EER are 83% and 62% for SNRs of 20 dB and 15 dB, respectively.

4.3. Comparison with Related Work

A spread spectrum approach presented in [9] might appear to contradict the objective of the proposed method, as it seeks to adjust the watermark spectrum to more closely match the spectral shape of speech based on the LPC model. We illustrate below how both methods achieve similar outcomes via their application to an example speech signal, the average spectrum of which is shown in Fig. 5 (a).

A core difference between the methods is the domain in which encoding and decoding are performed. In [9], the initial watermark signal is created from a filtered binary phase-shift keying (BPSK) sequence added to the speech signal in the time domain. To enable a comparison with our approach, we obtained the difference between the spectra of the speech signal and the BPSK-watermarked signal on a frame-by-frame basis. Fig. 5 (b) shows the average and standard deviation of that difference, i.e., the mean and standard deviation of the watermark spectrum. Note the flat spectral shape.

Fig. 5 (c) shows the spectral properties of the watermark signal post LPC-filtering. Observe how the spectral mean matches that of the speech signal, which the authors found improved the robustness of the watermark for a comparable speech quality [9].

In contrast, the spread spectrum method considered in this manuscript adds the sequence in the log-spectrum domain. Therefore, the watermark signal already matches the spectral shape of the speech signal by design. However, as shown in Fig. 5 (d), the standard deviation is large relative to the mean. This is expected, as the sequence must be strong in the (log-)spectral domain for robustness, but it can disrupt formant peaks and impact speech quality.

We see the spectral properties after applying the proposed LPC weighting in Fig. 5 (e) and Fig. 5 (f) for α = 0.35 and α = 1.0, respectively. Note that for increasing α, we reduce the watermark spectral deviation, but this comes at the cost of causing the spectral mean to deviate from that of the speech signal.

As the approaches belong to fundamentally different classes of spread spectrum methods—where the encoding/decoding domains are time and log-spectrum—a direct performance comparison is outside the scope of this work. Nonetheless, our studies indicate better performance for speech when the spread spectrum method is implemented in the log-spectrum domain.

5. Deep Decoding

In the spread spectrum method, applying the dot product to the decoded spectrum, as in (2) and often in combination with a cepstral filter, is the optimal decoding solution in clean and simple degradation scenarios (e.g., added white Gaussian noise). However, in telephony use cases, the degradations are more complex. The speech signal may pass through a loudspeaker, introducing ambient noises and delays, and be subjected to filtering as it is acquired by a microphone. Furthermore, the telephony channel itself adds degradations such as downsampling, packet loss, and codec compression.

To address these challenges, we propose a deep-learning decoding strategy—“deep decoding”—which tailors the spread spectrum method to both the host signals (speech) and complex degradation environments.

We consider two low-complexity models, where the input layer operates on the frequency indices of the cepstral-filtered decoded spectral frame corresponding to the embedded watermark. In this study, we assume a watermark length of 63. The first model (Model A) comprises a pair of dense layers with 64 and 32 units, respectively, using ReLU activation functions, and a single-unit dense layer with linear activation as the output (Fig. 6, left). The second model (Model B) uses a 1D convolutional layer (16 filters, kernel width 3), a 16-unit dense layer, and a single-unit dense layer output (Fig. 6, right).
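For concreteness, the two decoder models could be defined as follows in Keras. This is a sketch matching the layer sizes given above; the convolutional activation, the flattening step, and the training loss are assumptions, since they are not specified in the text.

import tensorflow as tf
from tensorflow.keras import layers, models

NW = 63  # watermark length / input dimension

# Model A: dense layers with 64 and 32 units (ReLU) and a single linear output unit.
model_a = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(NW,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                      # linear activation
])

# Model B: a 1D convolutional layer (16 filters, kernel width 3), a 16-unit dense
# layer, and a single linear output unit. ReLU and Flatten are assumptions.
model_b = models.Sequential([
    layers.Conv1D(16, kernel_size=3, activation="relu", input_shape=(NW, 1)),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])

# Training setup per the text (RMSProp, 25 epochs); the binary cross-entropy loss
# on logits is an assumption.
model_b.compile(optimizer="rmsprop",
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))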

To yield a training dataset for the deep decoding models, encoding was performed on 4620 clean utterances (training partition of the TIMIT corpus, 462 speakers, 10 utterances per speaker) with a 16 kHz sampling frequency, a frame size of 20 ms, and a watermark length Nw of 63 (element width: 50 Hz, watermark start frequency: 500 Hz, watermark end frequency: 3650 Hz). The same utterances were used to generate watermark-free data. Two degradation scenarios were considered: (1) a downsampling operation to 8 kHz and (2) a downsampling operation to 8 kHz followed by encoding with the Opus codec at 8 kbps. The encoding watermark strength was δ = 1 for the first scenario and δ = 3 for the second. After degradation, a voice activity detector (VAD) was used to retain speech frames.

Decoded frames were obtained from both the watermarked data and the watermark-free data as $X_b = g(\tilde{X}_{dB})\, w$, where $\tilde{X}_{dB}$ is the log power spectrum of a speech frame at the indices where the watermark is encoded, and $g(\cdot)$ is the cepstral filter.

For each degradation scenario, the models were trained over 25 epochs using the RMSProp method to distinguish between frames with and without the watermark. We evaluated the decoding models on frames obtained from 1680 utterances from the test partition of the TIMIT corpus with the same encoding and decoding parameters, and degradation conditions matching those in training. The standard decoding dot product output was obtained as $\sum_k X_b(k)$.

We evaluated performance not only at the single-frame level but also by aggregating the outputs across speech frames within utterances. The single-frame EERs for the first degradation scenario are 27.7%, 25.8%, and 21.8% for the dot product and decoding models A and B, and 37.2%, 35.0%, and 30.0% for the second degradation scenario. Fig. 7 and Fig. 8 show the EERs versus the number of aggregated frames for each degradation scenario. The greater the number of aggregated frames, the more the deep decoding schemes yield performance gains compared to the dot product. Model B enables the largest reductions in EER, facilitating reductions of 70% (10 frames) and 98% (30 frames) for the first degradation scenario, and 59% (10 frames) and 86% (30 frames) for the second degradation scenario.

6. Conclusions

We have studied the popular spread spectrum watermarking method, typically applied to music, for speech signals. Our analysis has revealed that an encoding frame length of approximately 20 ms to 30 ms for speech achieves the optimal balance between watermark robustness and perceptibility. We have introduced extensions of the core method that address the encoding and decoding operations separately, enabling reductions in equal error rates without compromising speech quality. This work shows promise for applying watermarking to synthetic speech data to facilitate malicious-use detection, even in challenging environments such as call centers.

7. References

1. Ricketts, (2023). Ricketts introduces bill to combat deepfakes, require watermarks on A.I.-generated content. Available: https://www.ricketts.senate.gov/wp-content/uploads/2023/09/Advisory-for-AI-Generated-Content-Act.pdf

2. J. Davidson, (2024). Senate pursues action against AI deepfakes in election campaigns. Available: https://www.washingtonpost.com/politics/2024/04/26/senate-deepfakes-campaigns-ban/

3. M. A. Nematollahi and S. A. R. Al-Haddad, “An overview of digital speech watermarking,” Int. J. Speech Technol., vol. 16, no. 4, pp. 471–488, 2013.

4. M. Arnold, “Audio watermarking: Features, applications and algorithms,” in Proc. Intl. Conf. Multimedia and Expo (ICME), vol. 2, 2000, pp. 1013–1016.

5. D. Gruhl, A. Lu, and W. Bender, “Echo hiding,” in Information Hiding, R. Anderson, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 295–315.

6. G. Hua, J. Goh, and V. L. L. Thing, “Time-spread echo-based audio watermarking with optimized imperceptibility and robustness,” IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 2, pp. 227–239, 2015.

7. D. Kirovski and H. S. Malvar, “Spread-spectrum watermarking of audio signals,” IEEE Trans. Signal Process., vol. 51, no. 4, pp. 1020–1033, 2003.

8. H. S. Malvar and D. A. F. Florencio, “Improved spread spectrum: A new modulation technique for robust watermarking,” IEEE Trans. Signal Process., vol. 51, no. 4, pp. 898–905, 2003.

9. C. Qiang and J. Sorensen, “Spread spectrum signaling for speech watermarking,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), vol. 3, 2001, pp. 1337–1340.

10. B. Chen and G. W. Wornell, “Digital watermarking and information embedding using dither modulation,” in IEEE Workshop on Multimedia Signal Processing, 1998, pp. 273–278.

11. P. O’Reilly, Z. Jin, J. Su, and B. Pardo, “MaskMark: Robust neural watermarking for real and synthetic speech,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 4650–4654.

12. M. D. Swanson, B. Zhu, A. H. Tewfik, and L. Boney, “Robust audio watermarking using perceptual masking,” Signal Processing, vol. 66, no. 3, pp. 337–355, 1998.

13. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) – A new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), vol. 2, 2001, pp. 749–752.

14. M. Holters and U. Zolzer, “GstPEAQ – An open source implementation of the PEAQ algorithm,” in Proc. of the 18th Int. Conference on Digital Audio Effects (DAFx-15), 2015.

15. T. Thiede et al., “PEAQ – The ITU standard for objective measurement of perceived audio quality,” J. Audio Eng. Soc., vol. 48, no. 1/2, pp. 3–29, 2000.

16. P. M. Delgado and J. Herre, “Can we still use PEAQ? A performance analysis of the ITU standard for the objective assessment of perceived audio quality,” in 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6, 2020.

17. C. Neubauer and J. Herre, “Digital watermarking and its influence on audio quality,” in Proc. of the 105th Convention of the Audio Engineering Society, Sep. 1998.

18. J. S. Garofolo, “Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database,” National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, Technical Report, Dec. 1988.

19. Z. Rafii, A. Liutkus, F.-R. Stoter, S. I. Mimilakis, and R. Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017.

Nikolay D. Gaubitch and David Looney

Pindrop Inc., London, UK

ABSTRACT

Presentation attack detection (PAD) aims to determine if a speech signal observed at a microphone was produced by a live talker or if it was replayed through a loudspeaker. This is an important problem to address for secure human-computer voice interactions. One characteristic of presentation attacks where recording and replay occur within enclosed reverberant environments is that the observed speech in a live-talker scenario will undergo one acoustic impulse response (AIR) while there will be a pair of convolved AIRs in the replay scenario. We investigate how this physical fact may be used to detect a presentation attack. Drawing on established results in room acoustics, we show that the spectral standard deviation of an AIR is a promising feature for distinguishing between live and replayed speech. We develop a method based on convolutional neural networks (CNNs) to estimate the spectral standard deviation directly from a speech signal, leading to a zero-shot PAD approach. Several aspects of the detectability based on room acoustics alone are illustrated using data from ASVspoof2019 and ASVspoof2021.

Index Terms – Presentation attack detection, reverberation

1. INTRODUCTION

Automatic speaker verification (ASV) systems are becoming increasingly popular in our connected world and thus there is a growing need to make these not only more accurate but also secure against potential misuse. One critical security aspect is the presentation attack, where a recording of the target voice is replayed through a loudspeaker to the ASV system. Research effort on this topic has been driven largely by the series of ASVspoof challenges, where the majority of the existing literature on the topic may also be found. Existing methods typically treat presentation attack detection (PAD) as a classification problem where classifiers are trained on examples of replayed or bonafide recordings, using both traditional feature design and end-to-end deep learning approaches. There is also related work on one-class detection of modified or synthetic speech.

There have been several indications that room acoustics plays an important role in the ability to detect a presentation attack; however, this has not been studied explicitly. In this paper, we focus on room reverberation: we analyze several qualities of the acoustic impulse response (AIR) and the impact these have on presentation attacks. Furthermore, we use the most promising parameter to train a convolutional neural network (CNN) for estimating that parameter directly from speech. We then demonstrate its ability to successfully separate bonafide from replayed speech using the ASVspoof2019 and ASVspoof2021 evaluation data sets; these results highlight some important aspects of the role that room acoustics plays in PAD.

The remainder of this paper is organised as follows. In Section 2 we formulate the problem of a presentation attack from the point of view of room acoustics, and specifically the convolution of two AIRs that results in the case of a presentation attack. In Section 3 we summarise the key spectral and temporal differences between a single AIR and two convolved AIRs. We define methods for measuring these properties from AIRs in Section 4, and in Section 5 we investigate which of the properties would be suitable for separating a single AIR from two convolved AIRs. In Section 6 we define a CNN architecture to estimate the most promising parameter directly from a speech signal, and in Section 7 we demonstrate how this may be used for PAD, as well as where the relevance of room acoustics reaches its limits. Finally, we summarise the key findings in Section 8.

2. PROBLEM FORMULATION

We assume that a speech signal s(n) is produced by a live talker and captured by a microphone at a distance from the talker at some location A. The observed signal xA(n) is:

xA(n) = s(n) ∗ hA(n) + νA(n), (1)

where ∗ denotes linear convolution while hA(n) and νA(n) denote the AIR and the ambient noise, respectively. Here ‘location’ refers to an acoustic space and some relative position between talker and microphone; this is the bonafide scenario. In the remainder of this work we assume that there is no additive noise, so that νA(n) = 0 in order to emphasise the effects of reverberation. In the case of a presentation attack, speech captured at location A is replayed from location B and the observed signal is given by

xAB(n) = xA(n) ∗ hB(n) = s(n) ∗ hAB(n), (2)

where hB(n) is the AIR of room B and hAB(n) = hA(n) ∗ hB(n) is the composite AIR of the two acoustic spaces. We investigate the effect of two convolved AIRs, the ability to separate hAB(n) from hA(n), and how to do this directly from the observed speech signals xA(n) or xAB(n) in order to detect a presentation attack.
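A small NumPy sketch of the two scenarios in (1) and (2), with the noise term set to zero as assumed above; h_a and h_b stand for measured or simulated AIRs.

import numpy as np

def bonafide(s, h_a):
    """Live talker: speech convolved with a single AIR (eq. (1), noise-free)."""
    return np.convolve(s, h_a)

def presentation_attack(s, h_a, h_b):
    """Replay: the recording from location A is replayed at location B (eq. (2))."""
    x_a = np.convolve(s, h_a)
    return np.convolve(x_a, h_b)       # equivalently s * (h_a * h_b) = s * h_ab

def composite_air(h_a, h_b):
    """The composite AIR h_ab(n) = h_a(n) * h_b(n)."""
    return np.convolve(h_a, h_b)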

3. REVIEW OF THE SPECTRAL AND TEMPORAL PROPERTIES OF TWO CONVOLVED AIRS

The effects of two convolved AIRs have been studied previously in the context of speech perception and intelligibility, with a comprehensive contribution in [7]. In this section we summarise the key theoretical results from [7] which will serve as the basis of our work.

3.1 Change in Pulse Density

The number of reflections in the AIR after t seconds for a shoebox room is given by [8]

NA(t) = 4πc³t³ / (3VA), (3)

where c is the speed of sound in metres per second and VA is the room volume in cubic metres. It was shown in [7] that the number of reflections for two convolved AIRs is

NAB(t) = 4π²c⁶t⁶ / (45 VAVB). (4)

The number of reflections increases with the power of six rather than the power of three for a single room. Thus, the effect of the sparse strong early reflections is reduced.
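To illustrate the difference numerically, the sketch below convolves the reflection (pulse) densities of two single rooms, which exhibits the t³ versus t⁶ growth directly; the room volumes are arbitrary example values, not those of any measured room.

import numpy as np

c = 343.0                      # speed of sound in m/s
V_A, V_B = 60.0, 80.0          # example room volumes in m^3

t = np.linspace(1e-3, 0.2, 2000)
dt = t[1] - t[0]

# Reflection (pulse) density of a single shoebox room: dN/dt = 4*pi*c^3*t^2 / V.
rho_A = 4 * np.pi * c**3 * t**2 / V_A
rho_B = 4 * np.pi * c**3 * t**2 / V_B

N_A = np.cumsum(rho_A) * dt                         # grows as t^3
rho_AB = np.convolve(rho_A, rho_B)[:len(t)] * dt    # density of the convolved AIRs
N_AB = np.cumsum(rho_AB) * dt                       # grows as t^6

print(f"N_A(50 ms)  ~ {np.interp(0.05, t, N_A):.0f} reflections")
print(f"N_AB(50 ms) ~ {np.interp(0.05, t, N_AB):.0f} reflections")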

3.2 Transient Distortion

The decay of the expected sound intensity for an acoustic space is governed by e^(-t/τA), where τA is the time constant associated with the absorption of the room boundaries and is proportional to the reverberation time. On the other hand, it was shown that the expected temporal envelope of hAB(n) is driven by the term (e^(-t/τB) - e^(-t/τA)). Thus, for two convolved AIRs, the intensity is governed by two exponential functions rather than one. The two additive exponentials have opposite signs, which leads to an initial rise of energy after the onset of the exponential decay. This is different from the typical behavior of a diffuse reverberation tail.

3.3 Change in Decay

The decay time of two convolved AIRs is observed to be longer than each of the decay times separately. Thus, the apparent reverberation time will increase. It was shown that this change will be dominated by the larger of the two decays. In other words, the late reverberation decay will be driven by the AIR with the longer reverberation time.

3.4 Modulation Transfer Function

The modulation transfer function (MTF) is related to the reverberation time and the temporal structure of the AIR. It has been observed that the MTF is lower for hAB(n) compared to that of single rooms – in particular for higher modulation frequencies. Again, this is in line with the increase in reverberation time.

3.5 Spectral Effects

One way to characterize the spectrum is by its modulation strength. The modulation strength for the log-spectrum resulting from the AIR has been shown to be [9]

σA, Lspec = 5.56 dB, (5)

which holds when the source-microphone distance is greater than the critical distance and for frequencies above the Schroeder frequency, given by [8]

fSch ≈ 2000√(T60/VA), (6)

where T60 denotes the reverberation time in seconds. When the source-microphone distance is below the critical distance – where the direct sound energy equals the reverberant sound energy – the spectral strength decreases. This was used to estimate the critical distance. The critical distance, dc, is related to the reverberation time and room volume by [8]

dc = (1/4)√(γSA/π) ≈ 0.057√(γVA/T60), (7)

where γ is the directivity of the source and SA is the total absorption surface. Under the assumption that the spectra are uncorrelated, it can be shown that the spectral modulation strength for two convolved AIRs is

σAB, Lspec = 8.28 dB. (8)

4. AIR BASED METRICS

We can summarize the findings in Section 3 into three main categories: temporal at the reflection level, temporal at the decay level, and spectral. While we could use some of the theoretical results for simulated acoustical environments, it is more practical to have metrics based on the AIR. Consequently, we define three metrics, one for each of these categories.

4.1 Energy Decay Curve

One way to measure and analyze the decay of an AIR is the energy decay curve (EDC), as described in [10]. The EDC is directly linked to the reverberation time and is often used to calculate it from an AIR. It is defined as:

\[ \text{EDC}(t) = \int_{t}^{\infty} h^2(\tau)\, d\tau \] (9)
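A minimal sketch of how the EDC and a decay slope could be computed from a sampled AIR via Schroeder backward integration is shown below; the decibel normalisation and the -5 dB to -35 dB fitting range are assumptions rather than values taken from the paper.

```python
import numpy as np

def energy_decay_curve(h, eps=1e-12):
    """Schroeder backward integration of h^2, returned in dB and normalised to 0 dB at t = 0."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(edc / (edc[0] + eps) + eps)

def edc_slope(h, fs, lo_db=-5.0, hi_db=-35.0):
    """Least-squares slope of the EDC (in dB/s) between two decay levels; T60 is roughly -60 / slope."""
    edc_db = energy_decay_curve(h)
    t = np.arange(len(h)) / fs
    mask = (edc_db <= lo_db) & (edc_db >= hi_db)
    if mask.sum() < 2:
        raise ValueError("decay range not reached; choose different fit levels")
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
    return slope
```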

4.2 Spectral Standard Deviation

The spectral characteristics of an AIR can be quantified using the spectral standard deviation (SSTD) [9, 11], defined as:

\[ \sigma_L = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} [H(k) - \bar{H}]^2} \] (10)

where \(H(k)\) is the log-spectral magnitude resulting from the N-point discrete Fourier transform (DFT) of \(h(n)\), and \(\bar{H}\) is the average of \(H(k)\) across frequency.
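A possible implementation of (10) for a sampled AIR is sketched below; the FFT length and the optional lower frequency bound (motivated by the Schroeder-frequency condition in Section 3.5) are assumptions.

```python
import numpy as np

def spectral_std(h, fs, n_fft=None, f_min=0.0, eps=1e-12):
    """Spectral standard deviation (in dB) of the log-magnitude spectrum of an AIR, as in (10)."""
    n_fft = n_fft or len(h)
    H = 20.0 * np.log10(np.abs(np.fft.rfft(h, n=n_fft)) + eps)    # log-spectral magnitude
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    H = H[freqs >= f_min]                                          # optionally discard low frequencies
    return float(np.sqrt(np.mean((H - np.mean(H)) ** 2)))          # deviation around the spectral mean
```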

4.3 Late Reverberation Onset

The echo density profile is a metric used to estimate the onset of the diffuse reverberation tail in an AIR. Based on the discussion in Section 3.1, it can be expected that this onset will occur earlier for two convolved impulse responses. Here we use the method to measure the echo density profile proposed in [12], defined as:

\[ \eta(n) = \frac{1}{\text{erfc}(1/\sqrt{2})} \sum_{\tau=n-\delta}^{n+\delta} w(\tau)\, 1\{|h(\tau)| > \sigma(n)\} \] (11)

where w(τ) is a sliding Hamming window of length 2δ + 1, set at 20 ms and normalised to unit sum, σ(n) is the standard deviation of the AIR within the window, erfc(·) is the complementary error function, and 1{·} is the indicator function, which returns one if the argument is true and zero otherwise. The late reverberation onset is defined as the time when η(n) ≥ 1.
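The echo density profile and the resulting late reverberation onset could be computed roughly as follows, following the reconstruction of [12] given above; the loop-based implementation and the unit-sum Hamming-window normalisation are illustrative choices.

```python
import numpy as np
from scipy.special import erfc

def echo_density_profile(h, fs, win_ms=20.0):
    """Normalised echo density; values near 1 indicate a Gaussian-like (diffuse) reverberation tail."""
    half = max(1, int(round(win_ms * 1e-3 * fs)) // 2)
    w = np.hamming(2 * half + 1)
    w /= w.sum()                                       # window normalised to unit sum
    norm = 1.0 / erfc(1.0 / np.sqrt(2.0))              # expected fraction of Gaussian samples above 1 std
    eta = np.zeros(len(h))
    for n in range(half, len(h) - half):
        seg = h[n - half:n + half + 1]
        sigma = np.sqrt(np.sum(w * seg ** 2))          # windowed standard deviation
        eta[n] = norm * np.sum(w * (np.abs(seg) > sigma))
    return eta

def late_reverberation_onset(h, fs):
    """Time (in seconds) at which the echo density first reaches 1, or None if it never does."""
    eta = echo_density_profile(h, fs)
    idx = int(np.argmax(eta >= 1.0))
    return idx / fs if eta[idx] >= 1.0 else None
```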

5. SEPARATING A SINGLE AND TWO CONVOLVED AIRS

We now investigate the three parameters for the metrics presented in Section 4 using a set of 31 measured and 500 simulated AIRs. The objective is to study the ability of these parameters to distinguish between a single AIR and two convolved AIRs. The measured impulse responses were taken from the first microphone of the ‘Lin8Ch’ array in the ACE database [13] and the first microphone of the binaural measurements (without dummy head) from the AIR database [14]. We simulated AIRs using the image-source method [11]. The room dimensions were chosen at random, drawn from a uniform distribution ranging between 2 m and 15 m for the length and width, and between 2.5 m and 4 m for the height. A randomly selected reverberation time between 0.1 s and 1.2 s was assigned to each room. A source and a microphone were positioned at randomly chosen locations within each room, constraining the distance from any surface to at least 0.5 m and the minimum source-microphone distance to 0.2 m. We considered sampling rates of 16 kHz and 48 kHz.

A randomly selected subset with 30 out of the 531 AIRs representing hA(n) was convolved with every other AIR in that subset to generate hAB(n); only a subset was used in order to keep the data balanced. For each hA(n) and hAB(n), we calculated the slope of the EDC, the SSTD, and the late reverberation onset time. Histograms of these values for the two cases of a single AIR and two convolved AIRs are shown in Fig. 1(a)-(c) for a sampling rate of 48 kHz and Fig. 1(d)-(f) for a 16 kHz sampling rate; each figure title shows the Kolmogorov-Smirnov (KS) test statistic. We can make the following observations:

– Clear separation between a single AIR and two convolved AIRs for the SSTD, independent of the sampling rate.

– SSTD centers around 5.6 dB for a single AIR and close to 8 dB for the convolved AIRs, as predicted by (5) and (8), respectively. The distributions will overlap when the talker or the replay occurs below the critical distance in (7).

– Late reverberation onset provides reasonable separation at a sampling rate of 48 kHz but less so at 16 kHz, which is largely due to the fact that impulses spread out in time at lower sampling rates.

– The EDC slope provides some level of separation, but overall there is a large overlap, which is not surprising since the reverberation time will be within reasonable limits for most realistic situations.

6. ESTIMATING SSTD FROM SPEECH

In Section 5, we demonstrated that the SSTD gives the best separation between a single AIR and two convolved AIRs. However, these measurements were made directly from the AIRs, which we rarely have access to in practice. Instead, we would like to estimate the SSTD from the observed reverberant speech so that we can perform further PAD studies. To this end, we devised a VGG-like [15] CNN architecture implemented using TensorFlow [16].

The input layer operates on the spectrogram obtained from 0.5 s of speech, which is input to two 16-channel convolutional layers followed by max-pooling, and two 32-channel convolutional layers with max-pooling; the filter sizes are 3×3, and the pooling stride is (2, 2). The convolutional layer outputs are flattened and, following a 25% dropout, input to a 32-channel fully connected layer before the final 1-channel output. All layers use ReLU as the activation function. The network was optimized using Adam with a learning rate of 0.001, with the mean absolute error (MAE) between the estimated and measured SSTD as the loss function. We used speech utterances from the training partition of TIMIT [17], sampled at 16 kHz, and the AIRs described in Section 5. A random selection of 80 speech utterances was drawn from TIMIT for each AIR.
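A minimal Keras sketch of a network with the stated layer configuration is shown below; the input spectrogram shape, the "same" padding mode, and the linear regression output are assumptions, since those details are not given in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sstd_estimator(input_shape=(250, 31, 1)):
    """VGG-like regressor mapping a 0.5 s spectrogram patch to a single SSTD value."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(32, activation="relu"),
        layers.Dense(1),                               # linear output for SSTD regression
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mean_absolute_error")
    return model
```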

A pre-emphasis filter was applied in order to counter the inherent spectral decay of speech. This is a common pre-processing step in many speech applications [18], and it was found to be essential when estimating the SSTD. The pre-emphasis filter is defined as:

\[ x_{\text{pre}}(n) = x(n) - \alpha x(n-1), \] (12)

where 0 ≤ α ≤ 1 is the filter coefficient, set here to α = 0.9. The pre-emphasized reverberant speech signals were divided into non-overlapping frames of 0.5 s, and the spectrogram was calculated for each frame with a DFT frame size of 512 samples and 50% overlap. Only frequencies above 200 Hz were considered, which approximately satisfies the Schroeder frequency requirement discussed in Section 3.5. We used 25% of the training data for validation and the remaining 75% for training. The network was trained for 50 epochs. The frame-level estimates were averaged to produce a single SSTD estimate per utterance.
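The pre-emphasis and framing pipeline described above could look roughly as follows; the Hann analysis window and the exact frame bookkeeping are assumptions rather than the authors' implementation.

```python
import numpy as np

def pre_emphasis(x, alpha=0.9):
    """x_pre(n) = x(n) - alpha * x(n-1), countering the spectral tilt of speech."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_spectrograms(x, fs=16000, frame_s=0.5, n_fft=512, f_min=200.0):
    """Split pre-emphasised speech into non-overlapping 0.5 s frames and compute a log spectrogram per frame."""
    frame_len, hop = int(frame_s * fs), n_fft // 2
    keep = np.fft.rfftfreq(n_fft, d=1.0 / fs) >= f_min
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        chunk = x[start:start + frame_len]
        segs = [chunk[i:i + n_fft] * win for i in range(0, frame_len - n_fft + 1, hop)]
        spec = 20.0 * np.log10(np.abs(np.fft.rfft(segs, axis=-1))[:, keep] + 1e-12)
        frames.append(spec.T[..., None])               # shape (freq, time, 1) for the CNN input
    return np.stack(frames) if frames else np.empty((0,))
```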

We generated a test set with AIRs simulated for four rooms with volumes of 28, 58, 77, and 120 m³ and with reverberation times (RTs) ranging from 0.1 to 0.7 s in steps of 0.1 s. For each AIR, 10 speech samples were drawn at random from the test portion of TIMIT. Thus, none of the test data was seen in training. The estimation result on the test data is shown in the two-dimensional histogram in Fig. 2, where we see a good match between the estimated and true values. The correlation coefficient is 0.96, and the MAE is 0.29.

7. SSTD FOR PRESENTATION ATTACK DETECTION

We have shown that the SSTD provides good separation between a single AIR and two convolved AIRs, and that we are able to estimate it directly from reverberant speech. As a final step, we investigated to what extent the CNN can be used as a zero-shot method for PAD. We used the estimated SSTD as a score for each speech utterance and explored different thresholds for separating bonafide from replayed speech. We utilized the ASVspoof 2019 [3] and ASVspoof 2021 [4] evaluation datasets. These datasets were deemed suitable for the task because they specifically consider different reverberant scenarios in a controlled environment: the 2019 data uses simulated reverberation conditions, while the 2021 data contains real room recordings.
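Using the per-utterance SSTD estimate as a PAD score amounts to a simple threshold sweep; a rough sketch of the resulting EER computation, with higher SSTD indicating a replay, is shown below. The array names are chosen for illustration only.

```python
import numpy as np

def equal_error_rate(bonafide_scores, replay_scores):
    """EER when a larger score (higher estimated SSTD) indicates a replayed utterance."""
    thresholds = np.sort(np.concatenate([bonafide_scores, replay_scores]))
    far = np.array([(bonafide_scores >= t).mean() for t in thresholds])   # bonafide flagged as replay
    frr = np.array([(replay_scores < t).mean() for t in thresholds])      # replay accepted as bonafide
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i])

# e.g. eer = equal_error_rate(sstd_bonafide, sstd_replayed)
```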

The ASVspoof 2019 dataset [3] contains 134,730 samples of bonafide and replayed speech. The data is divided into different categories as shown in Table 1. Each sample is annotated with a triplet (S, R, D_s) and a duple (Z, Q) to form different combinations of room sizes, RTs, and speaker-microphone distances. In addition to the complete dataset (annotated as ‘full’), we focused on subsets of the data that clearly illustrate different aspects of the reverberation-driven PAD. We selected (c,a,a) to represent large rooms with short reverberation times and small source-microphone separation, where we could expect poor PAD performance, (b,c,c) for the most favorable conditions for reverberation-driven PAD, and (b,b,b) for a realistic office-like example. There are 4,990 samples in each subset.

The results for these experiments are shown by the detection error trade-off (DET) plots in Fig. 3(a), where we observe the expected outcome. The equal error rate (EER) for the complete dataset is 22.37% and improves progressively from the worst case (c,a,a) to the best case (b,c,c), with EERs of 33.43% and 2.24%, respectively. To put this in perspective, the best-performing baseline of ASVspoof 2019 [3] had an EER of 11.04%. Note that the contributions in the ASVspoof challenge, unlike our zero-shot approach, were trained on development data closely linked to the test partition.

We then focused on the scenario of small but realistic reverberant spaces (b,c,c) to study the effect of the attacker-to-talker separation and the replay device quality. We considered three cases—(A,A), (B,B), and (C,C)—representing increasing attacker-talker distance and decreasing replay device quality. The result is shown in Fig. 3(b), where we can clearly see that the most favorable condition for reverberation-driven PAD is given by the case (C,C). In other words, the best separation between bonafide speech and presentation attacks is achieved when both the recorded and the replayed speech are reverberant and at a sufficiently large distance from the microphone, at which point the EER reaches 1.04%. The distance will depend on the room volume and the reverberation time as seen in (7).

The ASVspoof 2021 dataset contains real recordings from nine rooms at different source-microphone and attacker-talker distances. We used a similar approach as for the 2019 data and analyzed the complete dataset (indicated as ‘full’) and the cases of ‘d1,’ which is the largest attacker-mic distance of 2 m, with ‘c2,’ ‘c3,’ and ‘c4’ indicating attacker-to-talker distances of 1.5 m, 1 m, and 0.5 m, respectively. The results are shown in Fig. 4, where the effects of reverberation are clearly seen. The EER of the complete data is 36.28%, which is slightly lower than the best challenge baseline of 38.07%. When the reverberation conditions are favorable (largest attacker-talker and attacker-mic distances), the EER decreases to 16.39%.

Interestingly, it was observed in [2] that features based on the frame-level log-energy greatly improved PAD performance. We believe that this could be partially explained by the reverberation analysis provided in this paper. While the focus was largely on SSTD, including more information such as the EDC slope and the late reverberation onset can further improve PAD, especially at higher sampling rates. Further studies of combining the aforementioned ideas with more traditional PAD methods are left for future work.

8. CONCLUSIONS

We posed the problem of PAD as the separation between a single AIR and two convolved AIRs, and we summarized several differences derived from room acoustics theory. Our analysis showed that the most significant difference is observed with the SSTD. We used a CNN framework for accurate estimation of the SSTD from speech and applied it for zero-shot PAD. The method was evaluated using the ASVspoof 2019 dataset, where we achieved an EER of 22.37% on the complete dataset and 1.04% on a portion of the data where we expect better discriminability. Similar trends were observed for the ASVspoof 2021 dataset, where an EER of 36.28% was achieved on the complete dataset, 1.79% lower than the challenge baseline. Most importantly, we provided valuable insights into the relevance of room acoustics for PAD.

9. REFERENCES

[1] T. Kinnunen, Md Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The ASVspoof 2017 Challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech, 2017, pp. 2–6.

[2] H. Delgado, M. Todisco, Md Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, “ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancements,” in Odyssey 2018 – The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, June 2018.

[3] M. Todisco, X. Wang, V. Vestman, Md Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Proc. Interspeech, Graz, Austria, Sep. 2019.

[4] X. Liu, X. Wang, Md Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2507–2522, 2023.

[5] D. Looney and N. D. Gaubitch, “On the detection of pitch-shifted voice: Machines and human listeners,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), Toronto, Canada, June 2021, pp. 5804–5808.

[6] F. Alegre, A. Amehraye, and N. Evans, “A one-class classification approach to generalized speaker verification spoofing countermeasures using local binary patterns,” in Proc. IEEE Int. Conf. Biometrics: Theory Applications and Systems (BTAS), Arlington, VA, USA, Sept. 2013.

[7] A. Haeussler and S. van de Par, “Crispness, speech intelligibility, and coloration of reverberant recordings played back in another reverberant room (Room-in-Room),” J. Acoust. Soc. Am., vol. 145, no. 2, pp. 931–942, Feb. 2019.

[8] H. Kuttruff, Room Acoustics, Taylor and Francis, London, U.K., 2000.

[9] J. J. Jetzt, “Critical distance measurement of rooms from the sound energy spectral response,” J. Acoust. Soc. Am., vol. 65, no. 5, pp. 1204–1211, May 1979.

[10] M. R. Schroeder, “New method for measuring reverberation time,” J. Acoust. Soc. Am., vol. 37, p. 409, 1965.

[11] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, Apr. 1979.

[12] J. S. Abel and P. Huang, “A simple, robust measure of reverberation echo density,” in Proc. AES 121st Convention, San Francisco, USA, Oct. 2006.

[13] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 10, pp. 1681–1693, Oct. 2016.

[14] M. Jeub, M. Schafer, and P. Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Proc. Intl. Conf. Digital Signal Process., Jul 2009.

[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014.

[16] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, Software available from tensorflow.org.

[17] J. S. Garofolo, “Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database,” Technical report, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, Dec. 1988.

[18] T. Backström, O. Räsänen, A. Zewoudie, P. P. Zarazaga, L. Koivusalo, S. Das, E. Gomez Mellado, M. Bouafif Mansali, D. Ramos, S. Kadiri, and P. Alku, Introduction to Speech Processing, https://speechprocessingbook.aalto.fi, 2nd edition, 2022.


WEBINAR

Five9 + Pindrop: Fraud Detection with Better Customer Experience

Five9 + Pindrop are partners in balancing caller experience with security and fraud detection. Discover what that looks like during our collaborative webinar, which includes a success story from our shared customer: the contact center at Michigan State University Federal Credit Union (MSUFCU).

Caller authentication is necessary, but often time-consuming and detrimental to the customer experience. That’s why Five9 and Pindrop have partnered to bring advanced authentication and fraud detection software to Five9 customers.

Learn how Five9 + Pindrop technologies united to provide secure and efficient experiences for MSUFCU’s member base

Discover how to help protect your business, reduce average call handle time, increase IVR containment, and improve contact center customer experience

Explore the fraud problem in credit unions and how Five9 + Pindrop technologies work together to defend against it

Your expert panel

Skip Lindgren

Sales Leader, Pindrop

Amanda Miller

Director, ISV Partnerships, Five9

David Tevendale

Manager, Partner Programs, Pindrop

Colleen Pitmon

VP of Contact Center, MSUFCU

Often, technological advances in the healthcare industry are viewed in a positive light. Faster, more accurate diagnoses, non-invasive procedures, and better treatment support this view. More recently, artificial intelligence (AI) has improved diagnostics and patient care by assisting in the early detection of diseases like diabetic retinopathy. But these same technologies made room for a new, alarming threat: deepfakes.

As GenAI becomes more accessible, deepfakes in healthcare are increasingly prevalent, posing a threat to patient safety, data security, and the overall integrity of healthcare systems.

What are deepfakes in the healthcare industry? 

“Deepfakes in healthcare” refers to the application of AI technology to create highly realistic synthetic data in the form of images, audio recordings, or video clips within the healthcare industry.

Audio deepfakes that reproduce someone’s voice are emerging as a specific threat to healthcare because of the industry’s dependence on phone calls and verbal communication. Whether used to steal patient data or disrupt operations, audio deepfakes represent a real and growing danger.

AI deepfakes are a growing threat to healthcare

Deepfake technology being used to steal sensitive patient data is one of the biggest fears at the moment, but it is not the only risk present. Tampering with medical results, which can lead to incorrect diagnoses and subsequent incorrect treatment, is another issue heightened by the difficulty humans have spotting deepfakes.

A 2019 study generated deepfake images of CT scans, adding tumors that were not there or removing tumors that were present. Radiologists were then shown the scans and asked to diagnose patients.

Of the scans with added tumors, 99% were deemed malignant. Of those with tumors removed, 94% were diagnosed as healthy. To double-check, researchers then told the radiologists that the CT scans contained an unspecified number of manipulated images. Even with this knowledge in mind, doctors misdiagnosed 60% of the added tumors and 87% of the removed ones.

Attackers can also use GenAI to mimic the voices of doctors, nurses, or administrators—and potentially convince victims to take actions that could compromise sensitive information.

Why healthcare is vulnerable to deepfakes

While no one is safe from deepfakes, healthcare is a particularly vulnerable sector because of its operations and the importance of the data it works with.

Highly sensitive data is at the core of healthcare units and is highly valuable on the black market. This makes it a prime target for cybercriminals who may use deepfake technology to access systems or extract data from unwitting staff.

The healthcare industry relies heavily on verbal communication, including phone calls, verbal orders, and voice-driven technology. Most people consider verbal interactions trustworthy, which sets the perfect stage for audio deepfakes to exploit this trust.

Plus, both healthcare workers and patients have a deep trust in medical professionals. Synthetic audio can perfectly imitate the voice of a doctor, potentially deceiving patients, caregivers, or administrative staff into taking harmful actions.

How deepfakes can threaten healthcare systems

Deepfakes, especially audio-based ones, pose various risks to healthcare systems. Here are four major ways these sophisticated AI fabrications can threaten healthcare.

1. Stealing patient data

Healthcare institutions store sensitive personal data, including medical histories, social security numbers, and insurance details. Cybercriminals can use audio deepfakes to impersonate doctors or administrators and gain unauthorized access to these data repositories. 

For example, a deepfake of a doctor’s voice could trick a nurse or staff member into releasing confidential patient information over the phone, paving the way for identity theft or medical fraud.

2. Disrupting operations

Deepfakes have the potential to cause massive disruptions in healthcare operations. Imagine a fraudster circulates a deepfake of a hospital director, instructing staff to delay treatment or change a protocol.

Staff might question the order, but that can cause a disruption—and when dealing with emergencies, slight hesitations can lead to severe delays in care.

3. Extortion

Scams using deepfake audio are, sadly, no longer uncommon. Someone could create a fraudulent audio recording, making it sound like a healthcare professional is involved in unethical or illegal activities.

They can then use the audio file to blackmail the professionals or organizations into paying large sums of money to prevent the release of the fake recordings.

4. Hindered communication and trust

Healthcare relies on the accurate and timely exchange of information between doctors, nurses, and administrators. Deepfakes that impersonate these key figures can compromise this communication, leading to a breakdown of trust. 

When you can’t be sure the voice you’re hearing is genuine or the results you’re looking at are real, it compromises the efficiency of the medical system. Some patients might hesitate to follow medical advice, while doctors might struggle to distinguish between legitimate communications and deepfakes.

Protecting healthcare systems from deepfakes

Healthcare deepfakes are a threat to both patients and healthcare professionals. So, how can we protect healthcare systems? Here are a few important steps.

Taking proactive measures

Catching a deepfake early is better than dealing with the consequences of a deepfake scam, so taking proactive measures should be your first line of defense. One of the most useful tools in combatting deepfakes is voice authentication technologies like Pindrop® Passport, which can analyze vocal characteristics like pitch, tone, and cadence to help verify a caller. 

Investing in AI-powered deepfake detection software is another effective mitigation option. Systems like Pindrop® Pulse™ Tech can analyze audio content to identify pattern inconsistencies, such as unnatural shifts in voice modulation. AI-powered tools learn from newly developed deepfake patterns, so they can help protect you against both older and newer technologies.

Remember to train your staff. While humans are not great at detecting synthetic voices or images, when people are aware of the risks deepfakes pose, they can better spot potential red flags. 

These include unusual delays in voice interactions, irregular visual cues during telemedicine appointments, or discrepancies in communication. You can also conduct regular phishing simulations to help staff identify and respond to suspicious communications.

Implementing data security best practices

Proactive measures are the first lines of defense, but you shouldn’t forget about data protection.

Multifactor authentication (MFA) is a simple but strong data protection mechanism that can help confirm that only authorized individuals can access sensitive healthcare systems. With it, a person will need more than one form of verification, so if someone steals one set of credentials or impersonates someone’s voice, there will be a second line of defense.

Encrypting communication channels and even stored data is another vital aspect of data security. In healthcare, sending voice, video, and data across networks is common, so encrypting communication is a must. Protecting stored data adds an extra layer of security, as even if a third party gains access, they would still need a key to unlock it.

Remember to update and monitor your data security practices regularly.

Safeguard your healthcare organization from deepfakes today

When artificial intelligence first came to the public’s attention, its uses were primarily positive. In healthcare, for instance, synthetic media was, and still is, helpful in research, training, and the development of new technologies.

Sadly, the same technology can also take a darker turn, with fraudsters using it to impersonate doctors, gain access to sensitive patient data, or disrupt operations. Solutions like Pindrop® Passport and the Pindrop® Pulse™ Tech add-on offer a powerful way to authenticate voices and detect audio deepfakes before they can infiltrate healthcare communication channels.

By combining proactive detection tools with strong data security practices, healthcare providers can better protect themselves, their patients, and their operations from the devastating consequences of deepfakes.

Working alongside the Webex Contact Center team, Pindrop has certified Pindrop® Passport and Pindrop® Protect and added them to the Webex App Hub.

We are dedicated to helping our customers quickly and easily authenticate inbound calls, drive automation in the IVR (Interactive Voice Response system), and detect fraud. 

With voice-based authentication methods, contact centers can reduce caller frustration, shorten resolution times, and improve security and compliance.

Using the Pindrop® API Connector within the Webex Contact Center, we seamlessly integrate into contact center call flows, enabling quick setup and easy deployment.

How it works

In any partner integration, Pindrop® Technologies captures a copy of an inbound call and runs a thorough analysis. The analysis of an inbound call is predicated upon a deep, carrier-style integration where the Pindrop® Solution ingests the call audio, metadata, keystroke presses, and other signaling. 

This approach allows our technology to perform an accurate, multifactor analysis of the inbound caller’s voice, device, behavior, network, risk, and liveness. This will help you determine if the caller is a genuine consumer or a fraudster.  

For more insight into how fraudsters operate, check out our article on the fraudster playbook.

Webex Contact Center: Customer SIPREC integration

The diagram below showcases the robust architecture of the Webex Contact Center + Pindrop integration. It illustrates a scenario where a customer using a premise-based Session Border Controller (SBC) routes calls to Pindrop. Pindrop also supports a flexible Bring Your Own Carrier (BYOC) model, allowing you to route calls directly from your carrier. Contact Pindrop to determine if your carrier is supported.

A high-level architectural diagram illustrating the call flow from an SBC to the Webex Contact Center and then to the Pindrop network for voice authentication and fraud detection.

Key elements of the Webex CC + Pindrop integration

1. Pindrop® API connector

The Pindrop® API Connector enables your organization to establish a secure trust relationship between your Pindrop account and the Webex Contact Center, allowing you to access Pindrop’s voice authentication and fraud detection services seamlessly. 

Once the trust relationship is established, integrating Pindrop’s capabilities is as straightforward as making HTTP requests within your Webex CC call flows. These requests allow you to initiate voice authentication, detect fraud, capture key data points for analysis, and make intelligent routing decisions.

2. Easy-to-use agent UI

Pindrop has constructed a pre-built agent user interface, delivered through the Webex Contact Center agent desktop. 

This helps implement Pindrop intelligence and policy-driven instructions to Webex Contact Center agents as clearly and intuitively as possible. This user-friendly interface helps agents easily understand and apply Pindrop’s capabilities in their daily operations. 

A view of Pindrop’s pre-built agent user interface, showcasing call risk status, phone number, call duration, and more.

 3. Supportive resources for self-guided implementation

To simplify the process, we have authored a detailed user guide that provides clear, step-by-step instructions to help contact center administrators implement Pindrop® Solutions in their Webex Contact Center environment. 

Additionally, Pindrop resources are readily available for support and guidance, ensuring a smooth and successful integration. 

Real-world success

Some of the largest banks, credit unions, insurance companies, and healthcare providers in the world trust Pindrop to combat fraud and deliver secure, efficient customer service. To read more about how Pindrop integrates with other leading contact center platforms, check out our posts on Five9 + Pindrop authentication and fraud detection or how to integrate Pindrop® Solutions and Genesys Cloud CX.

Ongoing collaboration and future development

At Pindrop, we’re committed to continuous innovation and close collaboration with the Webex Contact Center. We adapt our solutions to address evolving customer needs. Our teams actively monitor and enhance the current integration, exploring new capabilities to support future use cases.

Do you have a call center challenge you’d like Pindrop and Webex Contact Center to address? We’d love to hear from you.

Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, Elie Khoury

Pindrop, Atlanta, GA, USA
{tchen,akumar,pnagarsheth,gsivaraman,ekhoury}@pindrop.com

Abstract

Recent audio Deepfakes, technically known as logical-access voice spoofing techniques, have become an increasing threat to voice interfaces due to recent breakthroughs in speech synthesis and voice conversion technologies. Effectively detecting these attacks is critical to many speech applications, including automatic speaker verification systems. As new types of speech synthesis and voice conversion techniques are emerging rapidly, the generalization ability of spoofing countermeasures is becoming an increasingly critical challenge. This paper focuses on overcoming this issue by using the large margin cosine loss function (LMCL) and online frequency masking augmentation to force the neural network to learn more robust feature embeddings. We evaluate the performance of the proposed system on the ASVspoof 2019 logical access (LA) dataset. Additionally, we evaluate it on a noisy version of the ASVspoof 2019 dataset using publicly available noises to simulate more realistic scenarios. Finally, we evaluate the proposed system on a copy of the dataset that is logically replayed through the telephony channel to simulate spoofing attacks in the call center scenario. Our baseline system is based on a residual neural network and achieved the lowest equal error rate (EER), 4.04%, among all single-system submissions during the ASVspoof 2019 challenge. Furthermore, the additional improvements proposed in this paper reduce the EER to 1.26%.

1. Introduction

The fast-growing voice-based interfaces between humans and computers have led to the need for more accurate voice biometrics strategies. The accuracy of speaker verification technology has improved by leaps and bounds in the past decade with the help of deep learning. At the same time, the ability to spoof and impersonate voices using deep learning based speech synthesis systems has also significantly improved. Such high quality text-to-speech synthesis (TTS) and voice conversion (VC) approaches can successfully deceive both humans and automatic speaker verification systems. This has created the need for systems to detect logical access attacks such as speech synthesis and voice conversion in order to protect voice-based authentication systems from such malicious attacks.

The ASVspoof challenge series1 started in 2015 and aims to foster research on countermeasures to detect voice spoofing. In 2015 [1], the challenge focused on detecting commonly used state-of-the-art logical speech synthesis and voice conversion attacks that were largely based on hidden Markov models (HMM), Gaussian mixture models (GMM), and unit selection. Since then, the quality of speech synthesis and voice conversion systems has drastically improved with the use of deep learning. WaveNet [2], proposed in 2016, was the first end-to-end speech synthesizer that directly uses the raw audio for training, and it showed a mean opinion score (MOS) very close to human speech. Similar quality was shown by other TTS systems such as Deep Voice [3] and Tacotron [4], and also by VC systems [5, 6]. These breakthroughs in TTS and VC technologies made spoofing attack detection more challenging.

In 2019, the ASVspoof [7] logical access (LA) dataset included seventeen different TTS and VC techniques. The organizers took care to evaluate spoofing detection systems against unknown spoofing techniques by excluding eleven of the technologies from the training and development datasets. Therefore, strong robustness is required of spoofing detection systems on this dataset. The challenge results show that the current biggest problem in a spoofing detection system is its generalization ability.

Traditionally, signal processing researchers tried to overcome this problem by engineering different low-level spectro-temporal features. For example, constant-Q cepstral coefficients (CQCC) were proposed in [8], and cosine normalized phase and modified group delay (MGD) were studied in [9, 10]. Although these works have confirmed the effectiveness of various audio processing techniques in detecting synthetic speech, they are not able to narrow the generalization gap on the ASVspoof 2019 dataset with the recently improved TTS and VC technologies. A detailed analysis of 10 different acoustic features, including linear frequency cepstral coefficients (LFCC) and mel frequency cepstral coefficients (MFCC), was made on the ASVspoof 2019 dataset in [11]. The results show that none of these acoustic features are able to generalize well to unknown spoofing technologies. Also, using deep learning models to learn discriminative feature embeddings for audio spoofing detection was studied in [12, 13, 14]. A comprehensive study of different traditional acoustic features and features learned from an autoencoder was made in [15].

In this work, we tackle this challenge from a different perspective. Instead of investigating different low-level audio features, we try to increase the generalization ability of the model itself. To do so, we use the large margin cosine loss function (LMCL) [16], which was initially used for face recognition. The goal of LMCL is to maximize the variance between the genuine and spoofed classes and, at the same time, minimize the intra-class variance. Additionally, inspired by SpecAugment [17], we propose to add FreqAugment, a layer that randomly masks adjacent frequency channels during DNN training, to further increase the generalization ability of the DNN model. On the ASVspoof 2019 EVAL dataset, we achieve an EER of 1.81%, which is significantly better than the baseline. The proposed system is illustrated in Figure 1.

Furthermore, we investigate the effectiveness of audio augmentation techniques. We augment the audio files using publicly available noises, including freely available movies and TV shows, music, other noises, and room impulse responses, to train and evaluate our system under a noisy scenario. Adding augmented data to the training dataset further reduces the EER from 1.81% to 1.64% on the ASVspoof 2019 EVAL dataset.

Finally, we study the performance of the proposed spoofing detection system in a call center environment. To this end, we logically replay the ASVspoof 2019 dataset through a VoIP channel to simulate the spoofing attacks. Interestingly, we found that, by adding those audio samples to the training data, the EER is further reduced from 1.64% to 1.26% on the ASVspoof 2019 EVAL dataset.

This paper is organized as follows: Section 2 describes the datasets used to train and evaluate the proposed spoofing detection system. Section 3 details the proposed spoofing detection system. Section 4 presents the experimental results on different evaluation datasets. Section 5 concludes this paper.

1http://www.asvspoof.org

2. Datasets

We use three different training protocols and three different evaluation benchmarks as shown in Table 1 and Table 2. The following sections briefly describe the dataset and the data augmentation method used in this work.

2.1 ASVspoof 2019 Challenge Dataset

The ASVspoof 2019 [7] logical access (LA) dataset is derived from the VCTK base corpus. It includes seventeen text-to-speech (TTS) and voice conversion (VC) techniques. The spoofing techniques are divided into two groups: six known techniques and eleven unknown techniques. The train and development sets include spoofed utterances generated from two known voice conversion and four speech synthesis techniques. However, only two known techniques are present in the evaluation set; the remaining spoofed utterances were generated from the eleven unknown algorithms. The training and evaluation parts of this data are named T1 and E1, respectively.

2.2 Augmented ASVspoof 2019 Dataset

In order to evaluate our system under noisy conditions, data augmentation is performed on the original ASVspoof 2019 dataset by modifying the data augmentation technique from Kaldi. Two types of distortion were used to augment the ASVspoof 2019 dataset: reverberation and background noise. Room impulse responses (RIRs) for reverberation were chosen from publicly available RIR datasets2 [18, 19, 20]. We chose four different types of background noises for augmentation: music, television, babble, and freesound3. Part of the background noise files for augmentation were selected from the open source MUSAN noise corpus [21]. We also constructed a television noise dataset using audio segments from publicly available movies and TV shows from YouTube. Around 40 movies and as many TV show videos were downloaded and segmented into 30-second segments to construct the TV-noises set; in all, we collected around 46 hours of TV-noises. For music and TV-noises, the noise audio was reverberated using a randomly selected RIR from the RIR dataset, the speech utterances were reverberated using randomly chosen RIRs, and the reverberated noise was then added to the reverberated speech utterance. Babble noise was generated by mixing us-gov utterances from the MUSAN corpus. The freesound noises were the general noise files from the MUSAN corpus, which consist of files collected from freesound and soundbible. For babble and freesound noises, we added the background noise files to the clean audio and then reverberated the mixture using a randomly selected RIR. The noises were added with a random SNR between 5 dB and 20 dB. The training part of this data together with T1 is named T2. Similarly, the evaluation part of this data together with E1 is named E2.
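A simplified sketch of the babble/freesound-style augmentation described above (noise added to the clean audio at a random SNR between 5 and 20 dB, followed by reverberating the mixture with a randomly selected RIR) could look as follows; audio loading and the music/TV-noise variant are omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment_with_noise(speech, rir, noise, snr_range_db=(5.0, 20.0), rng=None):
    """Add background noise to clean speech at a random SNR, then reverberate the mixture with an RIR."""
    rng = rng or np.random.default_rng()
    noise = np.resize(noise, len(speech))                          # loop/trim noise to the speech length
    snr_db = rng.uniform(*snr_range_db)
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    mixture = speech + scale * noise
    return fftconvolve(mixture, rir)[: len(speech)]
```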

2.3 Logically-Replayed ASVspoof 2019 Dataset

To simulate voice spoofing in a call center environment, Twilio’s Voice service4 is used to play back the ASVspoof 2019 data over voice calls and record it at the receiver’s end. The resulting dataset has VoIP channel characteristics and a bandwidth reduced from a 16 kHz to an 8 kHz sampling rate. Twilio’s default OPUS codec5 was used for encoding and decoding the audio. This dataset is used as an evaluation benchmark (E3) to understand how well our spoofing detection system generalizes in a call-center environment, and the replayed training set is added to the protocol (T3). During training and testing, the dataset was upsampled to 16 kHz. The training part of this data together with T2 is named T3. Similarly, the evaluation part of this data together with E2 is named E3.



2http://www.openslr.org/28/

3https://freesound.org/

4https://support.twilio.com/hc/en-us/articles/360010317333-Recording-Incoming-Twilio-Voice-Calls

5https://www.opus-codec.org/

Tianxiang Chen, Elie Khoury, Kedar Phatak, Ganesh Sivaraman

Pindrop, Atlanta, GA, USA

[email protected], [email protected], [email protected], [email protected]

Abstract

Voice spoofing has become a great threat to automatic speaker verification (ASV) systems due to the rapid development of speech synthesis and voice conversion techniques. How to effectively detect these attacks has become a crucial need for those systems. The ASVspoof 2021 challenge provides a unique opportunity to foster the development and evaluation of new techniques to detect logical access (LA), physical access (PA), and Deepfake (DF) attacks covering a wide range of techniques and audio conditions. The Pindrop Lab participated in both the LA and DF detection tracks. Our submissions to the challenge consist of a cascade of an embedding extractor and a backend classifier. Instead of focusing on extensive feature engineering and complex score fusion methods, we focus on improving the generalization of the embedding extractor model and the backend classifier model. We use log filter banks as the acoustic features in all our systems. Different pooling methods and loss functions are studied in this work. Additionally, we investigated the effectiveness of stochastic weight averaging, further improving the robustness of the spoofing detection system. Overall, three different variants of the same system have been submitted to the challenge. They all achieved a very competitive performance on both LA and DF tracks, and their combination achieved a min-tDCF of 0.2608 on the LA track and an EER of 16.05% on the DF track.

1. Introduction

Automatic Speaker Verification (ASV) has been widely adopted in many human-machine interfaces. The accuracy of ASV systems has improved greatly over the past decades with the help of deep learning algorithms. Meanwhile, deep learning-based text-to-speech synthesis (TTS) and voice conversion (VC) techniques are also able to generate extremely realistic speech utterances. TTS and VC techniques like WaveNet [1], Deep Voice [2], and Tacotron [3] greatly enhanced the quality of voice-spoofed utterances. These spoofed utterances are often indistinguishable to human ears and are able to deceive state-of-the-art ASV systems. Thus, the detection of these voice spoofing attacks has drawn great attention in the research community and the technology industry.

To benchmark the progress of research in voice spoofing detection and to foster research efforts, the ASVspoof challenge has released a series of spoofing datasets. In 2019, the ASVspoof [6] challenge released two datasets: physical access (PA) and logical access (LA). The PA dataset focuses on replay attacks, while the LA dataset targets synthesized speech. The LA dataset was largely based on detecting deep learning-based spoofing techniques, and it primarily focused on evaluating the generalization of the spoofing detection model. In total, it includes seventeen different TTS and VC techniques, but only seven of them are in the training and development sets. During the ASVspoof 2019 challenge, many submissions focused on investigating different low-level spectro-temporal features [7, 8, 9, 10, 11, 12] and ensemble-based approaches.

In ASVspoof 2021 [13], the challenge has further included more data to simulate more practical and realistic scenarios of different spoofing attacks. There are three sub-challenges: physical access (PA), logical access (LA), and deepfake detection (DF). The PA dataset contains real replayed speech and a small portion of simulated replayed speech. For the LA dataset, while the training and development data remain the same as ASVspoof 2019, various codec and channel transmission effects are added to the evaluation data. This is aimed at simulating telephony scenarios and evaluating the robustness of the spoofing detection model against different channel effects. The challenge has also further extended the LA track to general speech Deepfake detection (DF). Deepfake detection deals with detecting synthesized voice in any audio recording. The speech Deepfake detection task involves different audio compression techniques such as mp3 and m4a, along with additional spoofing techniques. This Deepfake detection task aims to evaluate the spoofing detection system against different unknown conditions. Therefore, the detection systems for both LA and DF tracks need to be robust to unseen attacks and audio compression techniques.

This paper presents the Pindrop Labs’ submissions to the LA and DF tracks and introduces a novel spoofing detection system. Our submissions were among the top-performing systems in the full evaluation sets on both LA and DF tracks. In total, we have trained three systems. The first system is proposed in [14], which is a ResNet-based spoofing detection system trained using large margin cosine loss. The second system is an extension of the first system, using a novel learnable dictionary encoding (LDE) [15] layer to replace the mean and standard deviation pooling layer. The third system also uses the LDE pooling layer but is trained using Softmax activation in the output layer and the cross-entropy loss function. All systems contain two main components: embedding extractor and backend classifier. Figure 1 shows the framework of our spoof detection system. The final submissions to both LA and DF tracks are the fusion of the three spoofing detection systems.

2. Datasets

We use the ASVspoof 2019 official LA train and development datasets to train and evaluate our systems. Various data augmentation methods are performed on the training dataset to increase the amount of data and robustness of the models. The ASVspoof 2019 and 2021 datasets are presented in Sections 2.1 and 2.2. The data augmentation technique is introduced in Section 2.3.

2.1. ASVspoof 2019 Challenge Dataset

The ASVspoof 2019 [6] logical access (LA) dataset comprises seventeen different text-to-speech (TTS) and voice conversion (VC) techniques, from traditional vocoders to recent state-of-the-art neural vocoders. The spoofing techniques are divided into two groups: six known techniques and eleven unknown techniques. The train and development sets contain the six known spoofing techniques, while the evaluation set contains the eleven unknown spoofing techniques.

In this work, only the training and development sets are used for developing the spoofing detection systems.

2.2. ASVspoof 2021 LA & DF Dataset

The ASVspoof 2021 LA track aims to evaluate the robustness of the spoofing detection model across different channels. Although the spoofing techniques used in this dataset are the same as in 2019, multiple codec and transmission effects are added to the audio samples. Both bonafide and spoofed samples are transmitted through either a public switched telephone network (PSTN) or voice over IP (VoIP). After passing through the different networks, all audio samples are resampled to 16 kHz.


The Deepfake detection (DF) track is an extension of the LA track. In contrast to the LA track, the DF track focuses on evaluating spoofing detection systems across different audio compressions. It represents detecting spoofed audio on social media or other internet platforms, where the audio compression techniques and audio qualities vary widely. The compression algorithms include mp3, m4a, and other unknown techniques.

2.3. Data Augmentation

Three different types of data augmentation are applied to the training dataset in this work. The first type of augmentation is similar to that used in [14]: we added two types of distortion to the clean samples, reverberation and background noise. For the reverberation effect, random room impulse responses (RIRs) were chosen from publicly available RIR datasets2. For the background noises, we used three types of noise: music, babble, and freesound3. The freesound noises were the general noise files from the MUSAN corpus, which consist of files collected from freesound and soundbible. For babble and freesound noises, we added the background noise files to the clean audio and then reverberated the mixture using a randomly selected RIR. The noises were added with a random SNR between 5 dB and 20 dB.

The second type of augmentation simulates audio compression effects: all clean audio samples were also passed through audio compression, using the mp3 and m4a algorithms. Finally, the third type of augmentation was applied to add codec transmission effects. The training dataset was logically replayed through Twilio’s Voice service4 and recorded at the receiver’s end. The resulting dataset has VoIP channel characteristics and a bandwidth reduced from a 16 kHz to an 8 kHz sampling rate. Twilio’s default OPUS codec5 was used for encoding and decoding the audio.
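The compression augmentation can be approximated by an encode-decode round trip, for example through ffmpeg as sketched below; the mp3 bitrate is an assumption, since the exact encoder settings used by the authors are not stated.

```python
import os
import subprocess
import tempfile

def mp3_roundtrip(wav_in, wav_out, bitrate="64k"):
    """Encode a WAV file to mp3 and decode it back to WAV to simulate compression artifacts."""
    with tempfile.TemporaryDirectory() as tmp:
        mp3_path = os.path.join(tmp, "compressed.mp3")
        subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-codec:a", "libmp3lame",
                        "-b:a", bitrate, mp3_path], check=True)
        subprocess.run(["ffmpeg", "-y", "-i", mp3_path, wav_out], check=True)
```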

3. Methodology

In this section, we first describe the low-level features and the preprocessing techniques used during training (Sec. 3.1). Then, we present the architectures of the embedding extraction models in the spoofing detection systems (Sec. 3.2). Finally, we describe the backend classifiers used in all systems (Sec. 3.3).

3.1 Features and Preprocessing

The features used in this work are linear filter banks (LFBs). LFBs are a directly compressed version of the short-time Fourier transform (STFT) obtained with a linearly spaced filter bank; they are therefore cheaper to compute and carry a lower risk of overfitting at training time. We use 60-dimensional LFBs extracted on 30 ms windows with an overlap of 10 ms. Mean and variance normalization was performed per utterance during training and testing.

Online frequency masking is applied during training to randomly drop out a contiguous band of frequency channels [f0, f0 + f). The value f is chosen from a uniform distribution between 0 and the parameter F, which defines the maximum number of frequency channels to be masked, and f0 is then chosen from [0, ν − f], where ν is the number of frequency channels. After creating the frequency mask, an element-wise multiplication is performed between the original LFBs and the mask, so that the features of the selected frequency channels are set to zero.
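A small numpy sketch of this online frequency masking is shown below; the maximum mask width F = 12 is an illustrative value, not one reported in the paper.

```python
import numpy as np

def frequency_mask(lfb, F=12, rng=None):
    """Zero a random contiguous band of frequency channels in an (n_frames, n_channels) LFB matrix."""
    rng = rng or np.random.default_rng()
    n_channels = lfb.shape[1]
    f = int(rng.integers(0, F + 1))                 # band width, 0..F
    f0 = int(rng.integers(0, n_channels - f + 1))   # band start, so that f0 + f <= n_channels
    mask = np.ones(n_channels)
    mask[f0:f0 + f] = 0.0
    return lfb * mask                               # element-wise multiplication with the mask
```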

2http://www.openslr.org/28/

3https://freesound.org/

4https://support.twilio.com/hc/en-us/articles/360010317333-Recording-Incoming-Twilio-Voice-Calls

5https://www.opus-codec.org/

3.2 Embedding Extractors

For this challenge, three different embedding extractors are used for logical access and Deepfake detection. All three embedding extractors are modified versions of the residual neural network (ResNet) [16]. The ResNet architecture has shown great generalization ability in many classification tasks and allows us to train substantially deeper networks to achieve more compelling results. In the following sections, all three networks are explained in detail.

3.2.1. ResNet-L-FM system

The spoof embedding extractor used in the first system is the ResNet18-L-FM model described in [14]. As shown in Table 1, this residual network is a variant of ResNet-18 [16] where the global average pooling layer is replaced by mean and standard deviation pooling layers [17]. Before the input layer, a random frequency masking augmentation is applied to randomly mask a range of frequency bins. Large margin cosine loss (LMCL) was used during training to increase the generalization ability of the model. The model is trained to classify the audio recordings into two classes: bonafide and spoofed. The spoof embedding is the output of the second fully connected layer, and its dimension is 256.
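A hedged TensorFlow sketch of LMCL for the two-class (bonafide vs. spoofed) case is given below; the scale s and margin m values are assumptions, since they are not stated in the text.

```python
import tensorflow as tf

def large_margin_cosine_loss(embeddings, labels, class_weights, s=30.0, m=0.2):
    """LMCL for two classes (bonafide vs. spoofed): cross-entropy on scaled cosine logits with a margin."""
    x = tf.math.l2_normalize(embeddings, axis=1)        # (batch, dim)
    w = tf.math.l2_normalize(class_weights, axis=0)     # (dim, 2)
    cosine = tf.matmul(x, w)                            # cosine similarity to each class centre
    one_hot = tf.one_hot(labels, depth=2)
    logits = s * (cosine - m * one_hot)                 # margin subtracted from the target class only
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=one_hot, logits=logits))
```

Subtracting the margin only from the target-class cosine pushes genuine and spoofed embeddings apart on the unit hypersphere, which is the intuition behind the generalization gains reported here.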

3.2.2. ResNet-L-LDE system

The ResNet-L-LDE is an evolution of the ResNet18-L-FM system described above. Similar to the ResNet-L-FM system, the ResNet-L-LDE system also uses the ResNet-18 architecture as an encoder. However, it replaces the mean and standard deviation pooling layer with a learnable dictionary encoding (LDE) layer [15]. The LDE pooling layer assumes that the frame-level representations after the ResNet encoder are distributed across C clusters. Motivated by the GMM supervector encoding procedure, it learns the encoding parameters and the inherent dictionary in a fully supervised manner. Figure 2 illustrates the forward diagram of the LDE layer. In this work, we set the number of components equal to 16 and the hidden feature dimension for each component to 256. Thus, the output size of the LDE pooling layer is 4,096. The ResNet-L-LDE system is also trained using both frequency masking augmentation and LMCL.
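A rough TensorFlow sketch of an LDE pooling layer consistent with the description above (C = 16 components, soft assignment of frames to learnable centres, aggregation of the residuals) is shown below; the initializers and numerical details are assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

class LDEPooling(tf.keras.layers.Layer):
    """Learnable dictionary encoding: soft-assign frames to C learnable centres and pool the residuals."""
    def __init__(self, n_components=16, **kwargs):
        super().__init__(**kwargs)
        self.C = n_components

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.mu = self.add_weight(name="centres", shape=(self.C, d),
                                  initializer="random_normal", trainable=True)
        self.scale = self.add_weight(name="scales", shape=(self.C,),
                                     initializer="ones", trainable=True)

    def call(self, x):                                    # x: (batch, frames, feat_dim)
        resid = x[:, :, None, :] - self.mu[None, None]    # residual to each centre
        dist = tf.reduce_sum(resid ** 2, axis=-1)         # squared distance, (batch, frames, C)
        w = tf.nn.softmax(-self.scale * dist, axis=-1)    # soft assignment over components
        pooled = tf.reduce_sum(w[..., None] * resid, axis=1) / (
            tf.reduce_sum(w, axis=1)[..., None] + 1e-8)   # (batch, C, feat_dim)
        return tf.reshape(pooled, (tf.shape(x)[0], -1))   # e.g. 16 * 256 = 4096 dimensions
```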

3.2.3. ResNet-S-LDE system

The ResNet-S-LDE system also uses the LDE pooling layer. Its architecture is the same as that of the ResNet-L-LDE model, but it is trained using a Softmax output and the cross-entropy loss. After training the model, we extract the outputs of the LDE pooling layer on the full utterances. The pooling output embeddings are then used to fine-tune the last two fully connected layers, with LMCL used to train the network in the fine-tuning stage. Because the model is trained on fixed-length two-second audio chunks, this fine-tuning stage consists of adapting the embedding to full utterances of variable length.

3.3 Backend Classifier

After extracting the feature embedding, the embeddings are fed into the backend classifier to decide whether the audio is spoofed or bonafide. The backend classifier has the same architecture proposed in [14]. It is a shallow neural network that consists of one fully connected (FC) layer with 256 neurons, followed by a batch normalization layer, ReLU activation, a dropout layer with a dropout rate of 50%, and a Softmax output layer. In order to further increase the generalization ability, we use the stochastic weight averaging (SWA) [18] procedure to train the backend classifier. SWA approximates ensembling in the weight space: it averages the weights of the models at different training epochs, and by averaging the weights it can obtain properties similar to a traditional ensemble method, providing better generalization ability. The SWA average can be written as:

w_SWA = (1/K) Σ_{k=1}^{K} w_k,

where w_k denotes the model weights saved at the k-th selected training epoch and K is the number of averaged checkpoints.
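A minimal sketch of SWA applied to a Keras backend classifier, assuming weight checkpoints saved at the selected epochs, is given below. Note that in practice the batch-normalization statistics are usually recomputed with a forward pass over the training data after the weights are averaged.

```python
import numpy as np

def stochastic_weight_average(model, checkpoint_paths):
    """Load weight checkpoints saved at the selected epochs and set the model to their average."""
    averaged = None
    for path in checkpoint_paths:
        model.load_weights(path)
        weights = model.get_weights()
        if averaged is None:
            averaged = [np.asarray(w, dtype=np.float64) for w in weights]
        else:
            averaged = [a + w for a, w in zip(averaged, weights)]
    model.set_weights([a / len(checkpoint_paths) for a in averaged])
    return model
```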

4. Experiments

In this section, we report the results of our systems on the official ASVspoof 2021 evaluation set. Two key performance metrics are used to evaluate the systems. The first is EER that represents the point where false rejection rate (FRR) equals the false acceptance rate (FAR). In this case, the negative class is spoofing. The second metric is the minimum normalized tandem detection cost function (t-DCF) [19]. The t-DCF is defined as follows:

where β depends on application parameters (priors, costs) and ASV performance, and $P^{\mathrm{cm}}_{\mathrm{miss}}(s)$ and $P^{\mathrm{cm}}_{\mathrm{fa}}(s)$ are the countermeasure miss and false alarm rates at threshold s. In contrast to the EER computation, the negative class in the t-DCF computation is either spoofing or a zero-effort impostor; therefore, ASV scores must be provided.

Table 4 shows the results on the LA track for all of our systems. The ResNet-L-LDE system has the best min t-DCF and EER. Table 2 shows detailed min t-DCF results for all conditions and spoofing algorithms, and it clearly shows that our system is robust to most of the channel effects. Because there is only one type of codec augmentation in our training dataset, the system does not perform well on some codec and transmission effects, such as LA-C3 and LA-C6.

To investigate the effectiveness of the SWA strategy, we added the system ResNet-L-FM*. Its embedding extractor is the same as in the ResNet-L-FM system; however, its backend classifier is trained with the conventional strategy, whereas the backend classifier of ResNet-L-FM is trained with SWA. The results show that SWA training provides better generalization.

Detailed results on the Deepfake detection evaluation set are reported in Table 5. The final submission ensembles the three systems at the score level and achieves an EER of 16.05%. It is worth noting that our best single-model system achieves an EER of 16.36%, which is very competitive with the ensemble result. Table 3 reports the EER under different conditions. Although we only added mp3 and m4a compression effects to our training dataset, our system shows good generalization against most of the audio compression artifacts.

5. Conclusions

In this paper, we presented Pindrop Labs' submission to the ASVspoof 2021 challenge. For the LA and DF tracks, we combined the residual neural network architecture with different pooling techniques and achieved very competitive results on the final evaluation set. Our final submission is a fusion of only three systems. We also investigated the effectiveness of stochastic weight averaging: using SWA to train the backend classifier yielded a clear improvement.

Although our system obtained good results on the LA evaluation set, it still does not generalize well on the DF evaluation set. We believe that adding more audio compression augmentations to the training data will further narrow this gap. More research is needed to improve generalization across different audio conditions.

6. References

[1] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” SSW, vol. 125, 2016.

[2] Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al., “Deep voice: Real-time neural text-to-speech,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 195–204.

[3] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, et al., “Tacotron: Towards end-to-end speech synthesis,” in Interspeech, 2017.

[4] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 195–202.

[5] Tomi Kinnunen, Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, and Zhenhua Ling, “A spoofing benchmark for the 2018 voice conversion challenge: Leveraging from spoofing countermeasures for speech artifact assessment,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 187–194.

[6] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Hector Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee, “Asvspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019.

[7] Massimiliano Todisco, Hector Delgado, and Nicholas Evans, “A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients,” in Odyssey 2016, The Speaker and Language Recognition Workshop, 2016.

[8] Rohan Kumar Das, Jichen Yang, and Haizhou Li, “Long range acoustic and deep features perspective on asvspoof 2019,” in IEEE Autom. Speech Recognit. Understanding Workshop, 2019.

[9] Yanmin Qian, Nanxin Chen, Heinrich Dinkel, and Zhizheng Wu, “Deep feature engineering for noise robust spoofing detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1942–1955, 2017.

[10] Hossein Zeinali, Themos Stafylakis, Georgia Athanasopoulou, Johan Rohdin, Ioannis Gkinis, Lukáš Burget, Jan Černocký, et al., “Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge,” arXiv preprint arXiv:1907.12908, 2019.

[11] Yexin Yang, Hongji Wang, Heinrich Dinkel, Zhengyang Chen, Shuai Wang, Yanmin Qian, and Kai Yu, “The sjtu robust anti-spoofing system for the asvspoof 2019 challenge,” Proc. Interspeech 2019, pp. 1038–1042, 2019.

[12] Balamurali BT, Kin Wah Edward Lin, Simon Lui, JerMing Chen, and Dorien Herremans, “Towards robust audio spoofing detection: a detailed comparison of traditional and learned features,” arXiv preprint arXiv:1905.12439, 2019.

[13] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, and H. Delgado, “Asvspoof2021: accelerating progress in spoofed and deep fake speech detection,” in Proc. ASVspoof 2021 Workshop, 2021.

[14] Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, and Elie Khoury, “Generalization of audio deepfake detection,” in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 132–137.

[15] Weicheng Cai, Zexin Cai, Xiang Zhang, Xiaoqi Wang, and Ming Li, “A novel learnable dictionary encoding layer for end-to-end language identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5189–5193.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[17] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” ICASSP, 2018.

[18] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson, “Averaging weights leads to wider optima and better generalization,” arXiv preprint arXiv:1803.05407, 2018.

[19] Tomi Kinnunen, Hector Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, et al., “Tandem assessment of spoofing countermeasures and automatic speaker verification: Fundamentals,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2195–2210, 2020.

Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury

Pindrop, Atlanta, GA, USA
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Recent progress in generative AI technology has made audio deepfakes remarkably more realistic. While current research on anti-spoofing systems primarily focuses on assessing whether a given audio sample is fake or genuine, limited attention has been paid to discerning the specific techniques used to create audio deepfakes. Algorithms commonly used in audio deepfake generation, such as text-to-speech (TTS) and voice conversion (VC), undergo distinct stages including input processing, acoustic modeling, and waveform generation. In this work, we introduce a system designed to classify various spoofing attributes, capturing the distinctive features of individual modules throughout the entire generation pipeline. We evaluate our system on two datasets: the ASVspoof 2019 Logical Access and the Multi-Language Audio Anti-Spoofing Dataset (MLAAD). Results from both experiments demonstrate the robustness of the system in identifying the different spoofing attributes of deepfake generation systems.

Index Terms: Anti-spoofing, audio deepfake detection, explainability, ASVspoof

1. Introduction

In recent years, deepfake generation and detection have attracted significant attention. On January 21, 2024, an advanced text-to-speech (TTS) system was used to generate fake robocalls imitating the voice of US President Joe Biden, encouraging voters to skip the 2024 primary election in the state of New Hampshire [1]. This incident underscores the critical need for reliable and trusted deepfake detection, and thus for explainability in deepfake detection systems. Within this research area, the task of deepfake audio source attribution has recently been gaining interest [2-10]. The goal of this task is to predict the source system that generated a given utterance. For example, the study in [2] aims to predict the specific attack systems used to produce utterances in ASVspoof 2019 [11]. Directly identifying the name of the system, however, misses the opportunity to categorize spoofing systems based on their attributes. Such attribute-based categorization allows for better generalization to spoofing algorithms that are unseen in training but are composed of building blocks, such as acoustic models or vocoders, that are seen.

Along these lines, the authors in [3] propose a more generalizable approach by classifying the vocoder used in the spoofing system. The authors in [4] explore classifying both the acoustic model and the vocoder, finding that the acoustic model is more challenging to predict. The work in [5] takes this further by proposing to classify several attributes of spoofing systems in ASVspoof 2019 LA: conversion model, speaker representation, and vocoder. However, their findings demonstrate accuracy challenges in discerning the speaker representation. Another drawback of their evaluation protocol is that the ASVspoof 2019 dataset is relatively outdated, as there have been many advancements in voice cloning techniques in the last five years. Finally, their categories for the acoustic model and vocoder are very broad (e.g., “RNN related” for the acoustic model and “neural network” for the vocoder) and may not be particularly useful in narrowing down the identity of the spoofing system.

Figure 1: Illustration of proposed frameworks for spoofing attribute-classification. Top: End-to-end learning from audio. Bottom: Two-stage learning that includes a traditional countermeasure (CM) and an auxiliary classifier trained on embeddings.

In this work, we investigate two attribute classification strategies, as illustrated in Fig. 1: an end-to-end learning method that trains standalone systems for each attribute, and a two-stage learning method that leverages the learned representations of existing countermeasure systems. To this end, we leverage three state-of-the-art systems, namely ResNet [12], self-supervised learning (SSL) [13], and Whisper [14]. In addition to identifying the acoustic model and vocoder, we propose classifying the input type (i.e., speech, text, or bonafide) rather than the speaker representation. This allows for distinguishing between TTS and VC systems. As an anchor to previous work, we evaluate our methods on the ASVspoof 2019 protocol designed by [5]. To address the limitations of the outdated ASVspoof-based protocol, we design a new protocol based on the recent MLAAD dataset, which consists of multilingual utterances produced by 52 systems comprising a variety of state-of-the-art TTS systems. Compared to the ASVspoof-based protocol, this protocol uses more modern attack systems and replaces vague categories with specific acoustic models and vocoders. We make this novel MLAAD source tracing protocol publicly available.2 To the best of our knowledge, this is the first study of source tracing on a multi-lingual TTS dataset.

1 In [5], the term “conversion model” is used instead of “acoustic model” to refer more generally to the encoder part of the system for both TTS and VC systems.

Table 1: ASVspoof 2019 LA protocol for attribute-classification tasks, adapted from [5].

2. Attribute Classification of Spoof Systems

In this section, we describe our approaches for classifying the input type, acoustic model, and vocoder of the spoofing system used to generate a given audio.

2.1 Proposed Strategies

We present two strategies for leveraging existing state-of-the-art (SOTA) spoofing countermeasure (CM) systems for the task of component classification:

  1. End-to-End (E2E): This approach takes an existing CM architecture and trains the whole model for each of the multi-class component classification tasks separately.
  2. Two-Stage: This approach splits training into two steps: first, an existing CM is trained for the standard binary spoof detection task; next, the CM backbone is frozen, and a lightweight classification head is trained on the CM’s embeddings for each separate component classification task.

While the second approach is limited to the information that the binary-trained CM learns, it is attractive in practice due to reduced computational costs and the ability to leverage existing binary systems trained on significantly more data than available component labels.
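As a concrete illustration of the two-stage strategy, the sketch below trains a small classification head on frozen CM embeddings, mirroring the shallow FC back-end described in [12]. The embedding dimension, class count, and the assumption that embeddings have already been extracted offline are placeholders, not the exact components used in this work.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 160-d CM embeddings (as for the SSL CM), K attribute classes.
EMB_DIM, NUM_CLASSES = 160, 8

class AttributeHead(nn.Module):
    """Lightweight classifier trained on frozen countermeasure embeddings."""
    def __init__(self, emb_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def train_head(embeddings, labels, epochs=20, lr=1e-3, batch_size=256):
    """Fit the auxiliary head; the frozen CM is only used offline to produce `embeddings`."""
    head = AttributeHead(EMB_DIM, NUM_CLASSES)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    dataset = torch.utils.data.TensorDataset(embeddings, labels)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, drop_last=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(head(x), y)
            loss.backward()
            opt.step()
    return head
```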

2.2 Countermeasures

We used three different CMs to validate our hypothesis:

  • ResNet: This system consists of a front-end spoof embedding extractor and a back-end classifier. The front-end model is the ResNet18-L-FM model detailed in [12, 15]. To enhance the model's generalization capability, large margin cosine loss (LMCL) [16] and random frequency masking augmentation are applied during training. The back-end model is trained on the spoof embedding vectors for the classification tasks described in Section 2; it is a feed-forward neural network with one FC layer, as described in [12].
  • Self-Supervised Learning (SSL): SSL-based front-ends have attracted significant attention in the speech community, including spoofing and deepfake detection [13, 17–23]. The SSL-based CM architecture3 is a combination of SSL-based front-end feature extraction and an advanced graph neural network based backend, named AASIST [24]. The 160-dimensional CM embeddings are extracted prior to the final fully-connected output layer. The SSL feature extractor is a pre-trained wav2vec 2.0 model [25, 26], the weights of which are fine-tuned during CM training.
  • Whisper: This CM is based on the encoder-decoder Transformer architecture developed for automatic speech recognition. The Whisper CM architecture combines Whisper-based front-end features with a light convolutional neural network back-end.

2MLAAD protocol: doi.org/10.5281/zenodo.11593133

3. Datasets and Protocols

Two publicly available spoofing detection benchmarks are used in our study: the ASVspoof 2019 LA [11, 30] and the most recent MLAAD dataset [31].

3.1 ASVspoof 2019

The ASVspoof 2019 LA dataset has three independent partitions: train, development, and evaluation. Spoofed utterances are generated using a set of different TTS, VC, and hybrid TTS-VC algorithms [11]. To compare our methods against those presented in [5], we adopt their protocol partition as detailed in Table 1. Notably, it only includes a train and development set, so we do not perform any hyper-parameter search on this protocol. While we use the same categories as [5] for the acoustic-model and vocoder tasks, we create a new “Input type” task, which is helpful for separating TTS and VC systems. Table 1 summarizes the statistics of each partition used for the different attribute classification tasks on the ASVspoof 2019 dataset.

3.2 MLAAD

MLAAD consists of TTS attacks only; however, it includes 52 different state-of-the-art spoofing algorithms [31]. We manually label the acoustic models and vocoders based on the available metadata.5 Since MLAAD includes only TTS systems, we focus on acoustic model and vocoder classification without any input-type prediction. For end-to-end systems such as VITS and Bark, we use the name of the full system as both the acoustic model and vocoder label. Additionally, while the MLAAD dataset labels 19 different architectures, our protocol groups several systems that are identical aside from their training data. For example, the systems “Jenny”, “VITS”, “VITS-Neon”, and “VITS-MMS” are all labeled with the same acoustic model and vocoder category “VITS”. For the bonafide class, we include bonafide samples from the multilingual M-AILABS dataset [33]. We divide the data into train, development, and evaluation partitions while preventing speaker overlap. To enable this for the spoofed samples, we assign voice labels using spherical k-means clustering on embeddings from the state-of-the-art speaker verification system ECAPA-TDNN [34], and we use the elbow criterion on the inertia values to select K=75 clusters. We remove two vocoders, Griffin-Lim [35] and Fullband-MelGAN [36], since each has a single cluster containing most of its samples. The resulting acoustic model and vocoder labels, along with their number of examples in each partition, are presented in Table 2 and Table 3, respectively.
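A minimal sketch of this voice-clustering step is shown below, assuming speaker embeddings have already been extracted with ECAPA-TDNN. Spherical k-means is approximated here by L2-normalizing the embeddings before standard k-means, and the elbow is inspected manually from the inertia curve; the candidate K grid is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_voices(embeddings: np.ndarray, k_values=range(25, 151, 25), k_final=75):
    """Assign pseudo speaker labels to spoofed utterances via (approximate) spherical k-means."""
    # L2-normalize so that Euclidean k-means approximates clustering by cosine similarity.
    unit_embeddings = normalize(embeddings)

    # Inertia curve for the elbow criterion (inspected manually).
    inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
                      .fit(unit_embeddings).inertia_
                for k in k_values}

    # Final clustering with the chosen K (K=75 in this protocol).
    labels = KMeans(n_clusters=k_final, n_init=10, random_state=0).fit_predict(unit_embeddings)
    return labels, inertias
```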

3github.com/TakHemlata/SSL_Anti-spoofing
4github.com/piotrkawa/deepfake-whisper-features
5We use the “model name” field provided in the dataset's accompanying “meta.csv” file. System descriptions for each model name can be found in the Coqui-TTS [32] and HuggingFace repositories.

Table 3: MLAAD protocol for the vocoder classification task. Abbreviations: Multiband-MelGAN: mul; WaveGrad: w-grad.

4. Experimental Results

4.1 Implementation Details

The ResNet and SSL models take 4 seconds (s) of raw audio as input, whereas the Whisper model processes 30 s of audio. For ResNet, LFCC features are extracted using a 20 ms window and a 10 ms frame shift, along with their delta and double-delta features. Since fine-tuning large SSL models requires substantial GPU computation, experiments with SSL are performed with a smaller batch size of 16 and a lower learning rate of 10^-6. We use the same set-up for the SSL- and Whisper-based models as described in [13] and [14], respectively. The SSL- and Whisper-based models are fine-tuned on the ASVspoof and MLAAD datasets in their respective experiments, whereas the ResNet model is trained from scratch. For the auxiliary classifier, a batch size of 256 and a learning rate of 10^-3 are used with no hyper-parameter tuning. The best model is chosen based on Dev set accuracy and average F1-score for the ASVspoof and MLAAD experiments, respectively.
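The ResNet front-end features can be reproduced roughly along the following lines with torchaudio; the FFT size and coefficient count are placeholders, since the text only specifies the 20 ms window and 10 ms shift.

```python
import torch
import torchaudio

def lfcc_with_deltas(wav_path: str, sample_rate: int = 16000):
    """LFCCs with delta and double-delta features (20 ms window, 10 ms shift)."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)

    lfcc = torchaudio.transforms.LFCC(
        sample_rate=sample_rate,
        n_lfcc=20,                       # placeholder coefficient count
        speckwargs={"n_fft": 512,
                    "win_length": int(0.020 * sample_rate),   # 20 ms window
                    "hop_length": int(0.010 * sample_rate)},  # 10 ms shift
    )(waveform)

    delta = torchaudio.functional.compute_deltas(lfcc)
    delta2 = torchaudio.functional.compute_deltas(delta)
    return torch.cat([lfcc, delta, delta2], dim=1)   # (channel, 3 * n_lfcc, frames)
```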

4.2 Results on ASVspoof 2019

Our results are compared with the previous study [5] on ASVspoof 2019 in terms of unweighted accuracy in Table 4.

Input type classification: This study introduces a novel task, predicting input types, which the previous study did not explore. We train classification heads using fixed ResNet, SSL, and Whisper binary spoof detection models, named ResNet (two-stage), SSL (two-stage), and Whisper (two-stage). These experiments achieve 97.8%, 96.7%, and 78.4% accuracy, respectively. Our SSL model fine-tuned end-to-end, SSL (E2E), further improves accuracy to 99.9%.

Acoustic model classification: Several of our models surpass the previous study’s highest accuracy of 88.4%, achieved by the multi-task-trained RawNet2 model in [5]. Specifically, SSL (two-stage), ResNet (two-stage), and SSL (E2E) achieve accuracies of 91.4%, 92.6%, and 99.4% (a 12.4% relative improvement over the previous study), respectively. The substantial increase in accuracy may be due to the fact that our models are specifically trained for these tasks, unlike the previous study’s multi-task approach that jointly trained on acoustic, vocoder, and speaker representation tasks.

Vocoder classification: Our SSL (E2E) model slightly outperforms the previous study with an accuracy of 84.6% (a 0.1% relative improvement). Unlike the acoustic model task, we do not see the same level of improvement. Analyzing errors from our top-performing model, SSL (E2E), we find that 882 out of 896 mis-predictions come from predicting attack A07 as “Neural Network”. Attack A07 uses a non-neural WORLD vocoder; however, it also uses a GAN-based post filter that identifies areas of the waveform to mask out (see [11] for further details). This post filter is not seen in training and appears to have consistently affected the final waveform in a way that obscured its resemblance to traditional vocoder audio. Aside from this one kind of error, our SSL (E2E) model's accuracy is 99.7%.

4.3 Results on MLAAD

We report results in terms of macro-averaged F1 and accuracy scores in Table 5. With the larger number of specific vocoder and acoustic model categories compared to the ASVspoof protocol, we find that the vocoder is easier to distinguish than the acoustic model, as observed in [4]. Our best performance on each of these tasks is achieved by our ResNet (E2E) model, with average F1-scores of 93.3% for the vocoder and 82.3% for the acoustic model task. Our two-stage strategy performed noticeably worse here, indicating that the binary spoof detection models discarded much architecture-specific information when fitting to the binary task. The auxiliary head models that performed worst on the acoustic and vocoder classification tasks are the ones that leveraged the ResNet architecture. This is likely due to the ResNet model's use of the LMCL loss function [16], which minimizes intra-class variation and thus reduces the separability of deepfake examples produced by different architectures.

Error analysis: We analyze the mistakes most commonly made by our top-performing ResNet (E2E) model. In the acoustic model task, we get <90% accuracy on three categories, as can be seen in the confusion matrix illustrated in Fig. 2. Fastpitch is mistaken for Tacotron2-DDC 38% of the time, Overflow 19% of the time, and VITS 16% of the time; GlowTTS is mistaken for VITS 36% of the time; and Neural-HMM is mistaken for VITS 21% of the time. In each of these cases, the predicted and actual acoustic models have a high degree of overlapping voice clusters in the test set. This indicates that the acoustic model embeddings are capturing voice information, and systems that share a common voice in the test set are more challenging to distinguish. In the vocoder task, the ResNet (E2E) model’s performance on the different categories is high. The most mistaken category is bonafide, in which case VITS is mistakenly predicted 7% of the time.

4.4 Embedding Visualization

Our top performing models’ embeddings for the acoustic classification task using ASVspoof and MLAAD protocols are visualized using UMAP in Fig. 3. Notably, the acoustic models in the MLAAD dataset exhibit more difficulty in separation. This challenge may stem from overlapping voices among different models in the test set, as discussed in the previous error analysis section. Additionally, we observe distinct clusters of acoustic models with similar architectures: XTTS-v1 and XTTS-v2; as well as Neural-HMM [37] and Overflow [38] (which combines Neural-HMM with normalizing flows).
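The embedding visualization can be reproduced roughly as follows with the umap-learn package; the embeddings and attribute labels are assumed to come from a trained classifier and are placeholders here.

```python
import matplotlib.pyplot as plt
import umap  # umap-learn package

def plot_embeddings(embeddings, labels, title="Acoustic-model embeddings (UMAP)"):
    """Project high-dimensional embeddings to 2-D and color by attribute label."""
    reducer = umap.UMAP(n_components=2, random_state=0)
    points = reducer.fit_transform(embeddings)
    for label in sorted(set(labels)):
        mask = [l == label for l in labels]
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=str(label))
    plt.legend(markerscale=3, fontsize="small")
    plt.title(title)
    plt.show()
```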

5. Conclusions and Discussions

In this paper, we propose three multi-class classification tasks to give more explanatory predictions in place of traditional binary spoof detection: input-type, acoustic model, and vocoder classification. We experiment with two methods of leveraging open-source spoof detection systems to accomplish this and evaluate them on a recently introduced ASVspoof 2019 protocol as well as a new protocol that we design using the more modern MLAAD dataset. Our SSL (E2E) method outperforms the previous ASVspoof study that we compare to on the acoustic and vocoder tasks, with relative improvements in accuracy of 12.4% and 0.1%, respectively, while achieving 99.9% accuracy on our newly introduced input-type classification task. On our MLAAD protocol, which includes a greater number of vocoder and acoustic categories from more modern TTS systems, our ResNet (E2E) model yields an average F1-score of 82.3% for the acoustic model and 93.3% for the vocoder classification task. Our findings support existing literature suggesting that the vocoder is easier to distinguish than the acoustic model. Additionally, we observe that the acoustic models of systems that produce similar voices are more challenging to discriminate. Thus, a potential area of future study is to more explicitly ignore voice-specific information.

Our experiments with two-stage classification methods that leverage embeddings from binary spoof detection systems show promise, though they underperform on MLAAD compared to full model fine-tuning. Future research in this area is crucial, as models that augment rather than replace existing binary spoof detection systems are attractive, especially in industry, where changes in the behavior of the binary detection system require thorough evaluation. One possible future experiment is to assess which parts of the binary model contain the most useful information for discriminating the different spoof system components. Additionally, assessing how the choice of loss function for the binary model affects downstream multi-class performance could give insight into which existing models are best suited to two-stage learning.

6. References

[1] “Fake Biden robocall tells voters to skip New Hampshire primary election – BBC News,” https://www.bbc.com/news/world-us-canada-68064247, Last Accessed: 05/03/2024.

[2] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, “Synthetic speech detection through short-term and long-term prediction traces,” EURASIP Journal on Information Security, vol. 2021, pp. 1–14, 2021.

[3] X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, T. Wang, S. Wang, and R. Fu, “An initial investigation for detecting vocoder fingerprints of fake audio,” in Proc. of the 1st International Workshop on Deepfake Detection for Audio Multimedia, 2022.

[4] C. Y. Zhang, J. Yi, J. Tao, C. Wang, and X. Yan, “Distinguishing neural speech synthesis models through fingerprints in speech waveforms,” ArXiv, vol. abs/2309.06780, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:261705832

[5] T. Zhu, X. Wang, X. Qin, and M. Li, “Source tracing: Detecting voice spoofing,” in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022.

[6] J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren et al., “ADD 2023: the second audio deepfake detection challenge,” in Proc. IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023.

[7] X.-M. Zeng, J.-T. Zhang, K. Li, Z.-L. Liu, W.-L. Xie, and Y. Song, “Deepfake algorithm recognition system with augmented data for add 2023 challenge,” in Proc. IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023.

[8] Y. Tian, Y. Chen, Y. Tang, and B. Fu, “Deepfake algorithm recognition through multi-model fusion based on manifold measure,” in Proc. IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023.

[9] J. Lu, Y. Zhang, Z. Li, Z. Shang, W. Wang, and P. Zhang, “Detecting unknown speech spoofing algorithms with nearest neighbors,” in Proceedings of IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023.

[10] J. Deng, Y. Ren, T. Zhang, H. Zhu, and Z. Sun, “Vfd-net: Vocoder fingerprints detection for fake audio,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 151–12 155.

[11] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020.

[12] T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, and E. Khoury, “Generalization of audio deepfake detection,” in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020.

[13] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. The Speaker and Language Recognition (Speaker Odyssey) Workshop, 2022.

[14] P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Proc. INTERSPEECH, 2023.

[15] T. Chen and E. Khoury, “Spoofprint: a new paradigm for spoofing attacks detection,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 538–543.

[16] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “CosFace: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[17] Z. Jiang, H. Zhu, L. Peng, W. Ding, and Y. Ren, “Self-supervised spoofing audio detection scheme.” in INTERSPEECH, 2020.

[18] Y. Xie, Z. Zhang, and Y. Yang, “Siamese network with wav2vec feature for spoofing speech detection.” in Interspeech, 2021.

[19] X. Wang and J. Yamagishi, “Investigating self-supervised front ends for speech spoofing countermeasures,” in Proc. The Speaker and Language Recognition (Speaker Odyssey) Workshop, 2022.

[20] Y. Eom, Y. Lee, J. S. Um, and H. Kim, “Anti-spoofing using transfer learning with variational information bottleneck,” in Proc. INTERSPEECH, 2022.

[21] X. Wang and J. Yamagishi, “Investigating active-learning-based training data selection for speech spoofing countermeasure,” in Proc. SLT, 2023.

[22] J. M. Martín-Doñas and A. Álvarez, “The Vicomtech audio deepfake detection system based on wav2vec2 for the 2022 ADD challenge,” in Proc. ICASSP, 2022.

[23] X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in Proc. ICASSP, 2023.

[24] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in Proc. ICASSP, 2022.

[25] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in neural information processing systems (NIPS), 2020.

[26] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., “XLS-R: Selfsupervised cross-lingual speech representation learning at scale,” in Proc. INTERSPEECH, 2022.

[27] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023.

[28] X. Wu, R. He, Z. Sun, and T. Tan, “A light CNN for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.

[29] M. Sahidullah, T. Kinnunen, and C. Hanilc¸i, “A comparison of features for synthetic speech detection,” in Proc. INTERSPEECH, 2015.

[30] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Proc. INTERSPEECH, 2019.

[31] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “MLAAD: The multi-language audio anti-spoofing dataset,” arXiv preprint arXiv:2401.09512, 2024.

[32] G. Eren and The Coqui TTS Team, “Coqui TTS,” Jan. 2021. [Online]. Available: https://github.com/coqui-ai/TTS

[33] T. M. S. Dataset, “The m-ailabs speech dataset,” https://www. caito.de/2019/01/03/the-m-ailabs-speech-dataset/, 2023, Last accessed on 05/03/2024.

[34] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” in Proc. INTERSPEECH, 2020.

[35] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[36] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” in IEEE Proc. Spoken Language Technology Workshop (SLT), 2021.

[37] S. Mehta, É. Székely, J. Beskow, and G. E. Henter, “Neural HMMs are all you need (for high-quality attention-free TTS),” in Proc. ICASSP, 2022.

[38] S. Mehta, A. Kirkland, H. Lameris, J. Beskow, É. Székely, and G. E. Henter, “OverFlow: Putting flows on top of neural transducers for better TTS,” in Proc. INTERSPEECH, 2023.

Abstract

When it comes to authentication in speaker verification systems, not all utterances are created equal. It is essential to estimate the quality of test utterances to account for varying acoustic conditions. In addition to the net-speech duration of an utterance, this paper observes that phonetic richness is also a key indicator of utterance quality, playing a significant role in accurate speaker verification. Several phonetic histogram-based formulations of phonetic richness are explored using transcripts obtained from an automatic speech recognition system. The proposed phonetic richness measure is found to be positively correlated with voice authentication scores across evaluation benchmarks. Additionally, the proposed measure, in combination with net speech, helps calibrate the speaker verification scores, obtaining a relative EER improvement of 5.8% on the Voxceleb1 evaluation protocol. The proposed phonetic richness-based calibration provides higher benefits for short utterances with repeated words.



Index Terms—Speech Quality, Speaker Verification, Phonetic Richness, Score Calibration

Introduction

Automatic Speaker Verification (ASV) systems are increasingly used to authenticate users by their voice for various secure transactions. There is an increasing, user-experience-driven demand for secure authentication with shorter utterances of free-flowing speech. Short utterances are those which contain 1 to 8 seconds of net-speech [1]. Net-speech is the duration of actual speech content within a given audio, as determined by speech activity detection. When utterances are shorter, the accuracy of ASV varies greatly with the amount of net-speech and the signal-to-noise ratio (SNR) [2].

However, in real-world applications where the ASV system is used passively and no fixed phrases are required, the speaker may choose to repeat the same word or expression several times. For example, repetition of the words “agent” or “representative” is very common in call center applications, where some customers try to avoid talking to the automated interactive voice response (IVR) system and want to speak to a real agent. Thus, utterances with repeated words may satisfy the minimum net-speech requirements of an ASV system, but due to their low phonetic diversity, they are still effectively low-quality utterances. It is therefore essential to estimate the quality of the test audio to improve the authentication of short utterances.

Fig. 1. T-SNE plot of ECAPA-TDNN speaker verification embeddings for utterances with similar net-speech but varying number of unique phonemes, spoken by APLAWD speakers ‘a’ through ‘e’. Number markers indicate the number of unique phonemes in the utterance while their color distinguishes speakers. Clusters of utterances with more phonemes are circled in red.

Quality-aware calibration has been explored in various works where a quality measure function (QMF) is used as side information for the calibration model to adjust the raw score. This enables the calibration model to adapt to varying conditions in the enrollment and test utterances. Various QMFs have been explored in the literature, including: the utterance embedding magnitude [3]; the expected mean imposter score [3], a measure related to the popular adaptive symmetric normalization (as-norm) [4] method; noise features such as SNR [5], [6]; and duration-related features such as duration or net speech, where both the duration of the enrollment and the test utterance can be leveraged [3], [6]–[10].

Duration QMFs are of particular interest in short utterance speaker verification as the performance of ASV systems is known to degrade when there is an insufficient quantity of speech on either the enrollment or test side [11]. There is a consensus amongst these works that the short duration is challenging due to the higher variation in the lexical content leading to higher variation of the speaker embeddings belonging to the same speaker [12], [13]. For example, a speaker simply blabbering “blah-blah”, coughing, or clearing their throat would be considered as poor quality utterances as they do not contain enough phonetic content normally seen in continuous speech.

In Fig. 1, we observe that utterances with more unique phonemes appear to be clustered more tightly, while those with fewer unique phonemes are more spread out, despite all visualized utterances having similar net-speech. The authors of [13] experimented with addressing this issue at the i-vector level through short-utterance variance normalization, and [10] applied duration-based score calibration. However, neither of these works focuses explicitly on the issue of phonetic variance in short utterances; both use duration as a proxy.

In this paper, we explore the importance of the diversity of phonetic content for speaker verification. In particular, we define phonetic richness as the number of unique phonemes present in a given utterance. The phonetic richness measure is positively correlated with the raw cosine score between the enrollment and test embeddings. To explore the unequal contribution of phonemes to the speaker verification score, we also define a phoneme-weighted phonetic richness measure and learn the phoneme weights from data. Section II describes the proposed phonetic richness measures. We perform score calibration using these measures on short-utterance speaker verification protocols. Details of the datasets and evaluation protocols are provided in Section III, and the speaker verification system is described in Section IV. The experiments and results presented in Section V show that the proposed phonetic richness measures improve score calibration over net-speech alone and are also complementary to net-speech as a quality measure for score calibration.

Phonetic Richness Measures

The phonetic richness of an utterance can be quantified using the orthographic transcription and a grapheme-to-phoneme conversion system [14], [15]. We define phonetic richness in terms of the number of unique phonemes present in the utterance, consistent with the phonetic measure explored in [10]. We obtain an automatic phonetic transcription of each utterance using a 39-phoneme English phone set. For a given phonetic transcription of an utterance, we create an N-dimensional binary vector p_u of phoneme presence, where p_{u,i} (i = 1..N) is a binary flag indicating the presence of phoneme P_i in the transcription. Count-unique (CU) is thus defined as
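$$ \mathrm{CU}(u) \;=\; \sum_{i=1}^{N} p_{u,i}. \tag{1} $$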

To explore whether some phonemes are more important than others for speaker verification, we define the weighted count-unique (WCU) measure, which computes a weighted sum of phoneme presences. Formally, WCU for a given utterance u is defined as:
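$$ \mathrm{WCU}(u) \;=\; \mathbf{w} \cdot \mathbf{p}_u \;=\; \sum_{i=1}^{N} w_i\, p_{u,i}, \tag{2} $$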

where w is a vector of positive real-valued weights and pu is a binary vector of phoneme presences for a given utterance.

TABLE 1: Test utterance net speech and phonetic richness across evaluation protocols (mean (std)).

Protocol            Net Speech (s)   CU
Voxceleb1           7.3 (5.3)        26.4 (5.0)
Voxceleb1-5s        4.3 (0.4)        23.2 (3.4)
Voxceleb1-2s        1.8 (0.2)        13.9 (3.8)
Voxceleb1-1s        0.8 (0.2)        7.6 (3.7)
Aplawd              0.6 (0.1)        3.0 (0.8)
Aplawd-Repetitive   3.6 (2.0)        9.0 (5.0)

The weight vector w is learned by fitting a linear regression model to predict the speaker-match score, w · p_u = score_u, using least squares, where score_u is the score that a given speaker verification system assigned to utterance u when tested against the same speaker's enrollment. Thus, WCU combines individual phonetic contributions to estimate the speaker-match score for a positive trial. We create a set of 3,150 positive enrollment-test pairs from the TIMIT dataset [16] on which to fit these weights. We extract p_u vectors for each of the test utterances as described above, use them to fit the regression model, and then fix the weights of the WCU. These weights are used for all experiments in the paper.
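A minimal sketch of this weight-fitting step, assuming the phoneme-presence vectors and same-speaker scores from the TIMIT pairs are already available as arrays; non-negative least squares is used here to keep the weights positive, as the text requires.

```python
import numpy as np
from scipy.optimize import nnls

def fit_phoneme_weights(P: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Learn per-phoneme weights w such that P @ w approximates the positive-trial scores.

    P      -- (n_trials, n_phonemes) binary phoneme-presence matrix for the test utterances
    scores -- (n_trials,) speaker-match scores for the same-speaker enrollment-test pairs
    """
    w, _residual = nnls(P, scores)   # least squares with the constraint w >= 0
    return w

def wcu(p_u: np.ndarray, w: np.ndarray) -> float:
    """Weighted count-unique measure of Eq. 2 for one utterance."""
    return float(np.dot(w, p_u))
```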

We use the Quartznet automatic speech recognition (ASR) system [17] to obtain transcripts of the test utterances. Quartznet is a 1D time-separable convolutional neural network architecture trained on both telephonic conversational and read-speech wideband datasets. The character-level transcription is obtained by n-best hypothesis decoding with a beam width of 50. We convert the transcription to a sequence of phonemes using the grapheme-to-phoneme converter [15]. Finally, the set of unique phonemes in the utterance forms the phoneme presence indicator vector p_u, and the CU and WCU measures are computed using Eq. 1 and 2.
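The transcript-to-phoneme step can be sketched as follows, assuming the Quartznet transcript is already available as a string and using the g2p package from [15]. The 39-phoneme ARPAbet-style inventory below (with stress digits stripped) is an assumed choice for illustration.

```python
import numpy as np
from g2p_en import G2p   # grapheme-to-phoneme converter from [15]

# Assumed 39-phoneme inventory (ARPAbet, stress stripped).
PHONEMES = sorted({"AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER",
                   "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG",
                   "OW", "OY", "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W",
                   "Y", "Z", "ZH"})
INDEX = {p: i for i, p in enumerate(PHONEMES)}
g2p = G2p()

def phoneme_presence(transcript: str) -> np.ndarray:
    """Binary phoneme-presence vector p_u for an ASR transcript."""
    phones = {ph.rstrip("012") for ph in g2p(transcript) if ph.strip()}
    p_u = np.zeros(len(PHONEMES))
    for ph in phones:
        if ph in INDEX:          # ignore punctuation and out-of-inventory symbols
            p_u[INDEX[ph]] = 1.0
    return p_u

# Example: CU is simply the sum of the presence vector (Eq. 1).
p_u = phoneme_presence("representative representative agent")
cu = p_u.sum()
```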

Datasets

The Voxceleb1-E [18] protocol is a popular speaker verification benchmark consisting of 40 speakers, 4,715 speaker models, and 4,713 test utterances. The official protocol consists of 18,860 positive and 18,860 negative model-test trials. In addition, several short-utterance versions of this protocol are created following the method in [19]: 5s-, 2s-, and 1s-Voxceleb1-E protocols are made by taking random clips of the target size from each test probe. Probes are repeated to reach sufficient duration, if needed, prior to random clipping. We discard any clips whose content is unintelligible to our ASR model. The enrollment audios are left unchanged.

Next, we create a protocol using the Aplawd dataset [20]. The Aplawd dataset consists of 10 subjects (5 male and 5 female). The subjects provide speech samples for 5 phonetically diverse sentences, 10 digits, and 66 isolated words, all repeated 10 times.1 Each speaker's sentences are reserved for forming a single high-quality enrollment, leading to a mean model net speech of 117.8s (σ=15.4s) and a constant CU of 39 (the maximum value). Meanwhile, the short one-word utterances are used for testing. The net-speech of the short test utterances is around 0.6 seconds on average and the CU is around 3. Matching-gender trials are created, resulting in 5,107 positive and 20,428 negative trials. Notably, this dataset has much cleaner audio than Voxceleb1 and can better assess the utility of phonetic richness in the absence of noise.

1Some subjects do not supply recordings for all of the transcripts.

To test the impact of utterances with low phonetic variability but higher net-speech, we create a test dataset and evaluation protocol containing repeated words. This protocol uses the Aplawd dataset, and thus we name it Aplawd-Repetitive. Again, we designate the spoken sentences for enrolling the speakers. Test utterances are constructed by concatenating the single-word audios, so that net speech and phonetic richness can be independently controlled by varying the utterance length in words and the number of unique words, respectively. Each probe consists of between 2 and 10 concatenated single-word utterances and between 1 and 10 unique words, with the total and unique word counts chosen independently for each probe. The resulting probes contain an average of 2.5 repeated words per probe. Note that we do not concatenate digital copies of single-word utterances, but instead make use of the repeated recordings in the Aplawd dataset. This protocol contains 1,600 positive and 6,400 negative trials. See Table 1 for the details of all our evaluation protocols.2
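The probe construction can be sketched roughly as below; `word_recordings` is a hypothetical mapping from (speaker, word) to a list of that word's repeated recordings as 1-D waveform arrays at a common sampling rate, and the unique-word count is capped by the total word count to keep the sketch coherent.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_repetitive_probe(word_recordings: dict, speaker: str) -> np.ndarray:
    """Concatenate single-word recordings with independently chosen totals and unique counts."""
    n_total = rng.integers(2, 11)                       # 2-10 words per probe
    n_unique = rng.integers(1, min(n_total, 10) + 1)    # 1-10 unique words
    vocab = list({w for (spk, w) in word_recordings if spk == speaker})
    unique_words = rng.choice(vocab, size=n_unique, replace=False)
    words = rng.choice(unique_words, size=n_total, replace=True)

    clips = []
    for w in words:
        takes = word_recordings[(speaker, w)]
        clips.append(takes[rng.integers(len(takes))])   # distinct takes, not digital copies
    return np.concatenate(clips)
```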

System Description

Speaker Verification System

We perform speaker verification experiments using the Emphasized Channel Attention Propagation and Aggregation Time-Delay Neural Network (ECAPA-TDNN) architecture [21]. ECAPA-TDNN, which improves the TDNN architecture with multi-headed attention and squeeze-excitation blocks, has achieved state-of-the-art performance on several speaker verification benchmarks. We use the pretrained model checkpoint3 trained on the Voxceleb1 [18] + Voxceleb2 [22] datasets provided in the SpeechBrain toolkit [23] for all of our experiments. At inference time, the output of the final fully connected layer is used as the 192-dimensional speaker embedding. Voice-match scoring is performed using the cosine score between the enrollment and test embeddings. The cosine scores are further calibrated using net-speech and phonetic richness to compute the final score.
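A minimal sketch of the embedding extraction and raw scoring, assuming the SpeechBrain pretrained-model API (import paths may differ across SpeechBrain versions); the file paths are placeholders.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN from the SpeechBrain toolkit [23].
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    """192-dimensional speaker embedding for one utterance."""
    signal, _sr = torchaudio.load(path)
    return encoder.encode_batch(signal).squeeze()

def raw_score(enroll_path: str, test_path: str) -> float:
    """Cosine score between enrollment and test embeddings (before calibration)."""
    e, t = embed(enroll_path), embed(test_path)
    return float(torch.nn.functional.cosine_similarity(e, t, dim=0))
```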

Score Calibration

We assess our phonetic richness features, net speech, and their combinations in terms of their ability to improve calibration of the authentication score. We fit a logistic regression (LR) model [24], [25] on the combination of raw scores with net speech, CU, and WCU to classify matched from mismatched trials. We perform stratified 5-fold cross-validation with class weighting to fit the LR model and compute results on the test folds. Finally, the performance of the resulting calibrated scores is reported in terms of equal error rate (EER) as well as minCprimary, a performance metric defined by [26] as the average of the detection cost function at two operating points.
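The calibration step can be sketched as follows with scikit-learn; the feature arrays are placeholders, and the EER/minCprimary computation is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def calibrate_scores(raw, cu, wcu, log_ns, labels):
    """Stratified 5-fold calibration of raw ASV scores with quality measures."""
    X = np.column_stack([raw, cu, wcu, log_ns])
    y = np.asarray(labels)                      # 1 = matched trial, 0 = mismatched
    calibrated = np.zeros(len(y), dtype=float)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        lr = LogisticRegression(class_weight="balanced")
        lr.fit(X[train_idx], y[train_idx])
        # Use the decision function (log-odds) as the calibrated score on the held-out fold.
        calibrated[test_idx] = lr.decision_function(X[test_idx])
    return calibrated
```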

TABLE 2: Kendall's τ correlation between the ASV score and each quality measure across evaluation protocols.

Protocol            CU      WCU     Log NS
Voxceleb1           0.148   0.142   0.166
Voxceleb1-5s        0.033   0.028   0.032
Voxceleb1-2s        0.087   0.090   0.066
Voxceleb1-1s        0.168   0.163   0.153
Aplawd              0.201   0.197   0.202
Aplawd-Repetitive   0.633   0.608   0.348

Fig. 2. ASV score as a function of phonetic richness measures and net speech for Aplawd-Repetitive test utterances. Positive (blue) and negative (orange) pairs are plotted separately to observe class separation patterns, and Kendall’s τ is computed for each class.

2The Aplawd and Aplawd-Repetitive protocols are publicly available at https://doi.org/10.5281/zenodo.11663092

3https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb

Experiments and Results

Correlation Analysis

We assess our phonetic richness measures and net-speech in terms of their correlation with the ASV score, using Kendall's τ correlation coefficient [27]. Additionally, we visualize these relationships for the Aplawd-Repetitive protocol, since it is constructed to have lower correlation between net speech and phonetic richness.
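A short sketch of the per-class correlation computation, assuming the per-trial scores, quality measures, and trial labels are stored in arrays; scipy's kendalltau is used.

```python
import numpy as np
from scipy.stats import kendalltau

def tau_by_class(scores, measure, labels):
    """Kendall's tau between ASV score and a quality measure, computed per trial class."""
    scores, measure, labels = map(np.asarray, (scores, measure, labels))
    taus = {}
    for cls in (1, 0):   # positive (matched) and negative (mismatched) trials
        mask = labels == cls
        tau, _p_value = kendalltau(measure[mask], scores[mask])
        taus[cls] = tau
    return taus
```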

In Table II, we observe that the correlation between the ASV score and our phonetic richness measures is similar to that of net-speech, except on the Aplawd-Repetitive protocol. By design, the examples in Aplawd-Repetitive have lower correlation between phonetic richness and net speech, so the independent effect of each on the ASV score is easier to distinguish. We observe that phonetic richness is significantly more correlated with the ASV score than net speech, indicating that it is a more accurate measure of audio quality for speech with repetitive content, as shown in Fig. 2. In Fig. 2, we additionally observe that the separation between the positive and negative trials is better for utterances with high phonetic richness than for those with high net-speech.

Phonetic Richness-Based Score Calibration

Table III summarizes the results of score calibration using information about the test utterances' phonetic richness and net speech. We observe that score calibration with phonetic richness improves the EER on all but one of the benchmarks. Consistent with the findings of our correlation analysis, phonetic richness is significantly more useful for score calibration than net speech on Aplawd-Repetitive, which contains examples of repetitive speech. The EER on the Voxceleb1 evaluation reduces from 1.04% to 1.00% with score calibration using CU. When CU is combined with log net-speech (LNS), the EER reduces further to 0.98%. Notably, the combination of net speech and phonetic richness features outperforms net speech alone on all protocols except Voxceleb1-2s. Finally, we see that as the utterance length of Voxceleb1 decreases (see the Voxceleb1-2s and Voxceleb1-1s results), both the benefit of phonetic richness over no calibration and the benefit of adding phonetic richness over net speech alone increase, indicating that phonetic richness provides greater benefit for short-utterance verification.

Fig. 3. Distribution of learned phoneme-specific weights compared with the frequency with which each phoneme occurs in the data used for fitting the weights. A phoneme having a larger weight relative to its frequency suggests that it is more useful for carrying speaker-identifying information.

Analysis of Learned Phoneme-Weights

The WCU measure models the speaker verification score of same-speaker enrollment-test utterance pairs as an additive model, in which each phoneme in a given test utterance contributes a non-negative component to the resulting verification score. The learned phoneme-specific weights can therefore be interpreted to give insight into the importance of different phonemes for carrying speaker-identifying information. To facilitate this analysis, the learned weights are normalized to sum to one and compared against the training dataset's corresponding phoneme frequencies in Fig. 3. The normalized weights are often similar to the phoneme frequencies; cases where the weight differs greatly from the frequency may reflect the phoneme's ability to provide speaker information. A previous work that studied the utility of different syllable categories for identifying speakers found that the nasals (‘M’, ‘N’, ‘NG’) and affricates (‘ZH’, ‘CH’, ‘SH’, ‘R’, ‘J’, ‘Q’), combined with the vowels, are the most useful for carrying speaker-identifying information [28]. We observe that the learned weights for each of the nasal consonant phonemes, along with the affricates ‘CH’ and ‘JH’, are all significantly higher than their corresponding phoneme frequencies. Furthermore, the ‘N’ and ‘M’ phonemes received the largest and third-largest weights, respectively. These results support the previous findings in the literature.

Conclusions

In this paper, we investigate phonetic richness as a measure of utterance quality for speaker verification. We define the count-unique (CU) and phoneme-weighted count-unique (WCU) measures, which are positively correlated (τ = 0.6) with the cosine scores of the speaker verification system for positive examples in the short-utterance Aplawd-Repetitive dataset. Phonetic richness is more correlated with the speaker-match score for short utterances. The proposed CU and WCU measures are useful for score calibration on the Voxceleb1 benchmark as well as the Aplawd dataset. Based on the results, we conclude that phonetic richness is more helpful than net speech for calibration when speech contains repetitive content. Phonetic richness and net speech are complementary, yielding improved performance over net speech alone on all but one of our evaluation protocols. The phoneme-weighted phonetic richness measure is not found to be better for calibration on short utterances than the unweighted measure. In the future, we plan to explore other measures of phonetic richness that do not require explicit ASR transcription. We also plan to explore calibration based on the phonetic richness of the enrollments. Further study is needed to explore the efficacy of this method on multiple languages and cross-lingual speaker verification.


REFERENCES

  1. [1] H. Zeinali, K. A. Lee, J. Alam, and L. Burget, “Short-duration speaker verification (SdSV) challenge 2021: the challenge evaluation plan,” 12 2019. [Online]. Available: https://arxiv.org/abs/1912.06311v3
  2. [2] M. I. Mandasari, R. Saeidi, and D. A. Van Leeuwen, “Quality measures based calibration with duration and noise dependency for speaker recognition,” Speech Communication, vol. 72, pp. 126–137, 2015.
  3. [3] J. Thienpondt, B. Desplanques, and K. Demuynck, “The idlab voxsrc-20 submission: Large margin fine-tuning and quality-aware score calibration in dnn based speaker verification,” ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5814–5818, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:225041173
  4. [4] Z. N. Karam, W. M. Campbell, and N. Dehak, “Towards reduced false-alarms using cohorts,” in 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2011, pp. 4512–4515.
  5. [5] Z. Tan, M.-W. Mak, and B. K.-W. Mak, “DNN-based score calibration with multitask learning for noise robust speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 700–712, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:3402667
  6. [6] H. Rao, K. Phatak, and E. Khoury, “Improving speaker recognition with quality indicators,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 338–343.
  7. [7] G. Lavrentyeva, S. Novoselov, A. Shulipa, M. Volkova, and A. Kozlov, “Investigation of different calibration methods for deep speaker embedding based verification systems,” ArXiv, vol. abs/2203.15106, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID: 247778556
  8. [8] S. Cumani and S. Sarni, “A generative model for duration-dependent score calibration,” in Interspeech, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:239711242
  9. [9] M. I. Mandasari, R. Saeidi, M. McLaren, and D. A. van Leeuwen, “Quality measure functions for calibration of speaker recognition systems in various duration conditions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, pp. 2425–2438, 2013. [Online]. Available: https://api.semanticscholar.org/CorpusID:13979425
  10. [10] T. Hasan, R. Saeidi, J. H. L. Hansen, and D. A. van Leeuwen, “Duration mismatch compensation for i-vector based speaker recognition systems,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7663–7667.
  11. [11] A. Poddar, M. Sahidullah, and G. Saha, “Speaker verification with short utterances: a review of challenges, trends and opportunities,” IET Biom., vol. 7, pp. 91–101, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:3923424
  12. [12] A. Larcher, K. A. Lee, B. Ma, and H. Li, “The RSR2015: Database for Text-Dependent Speaker Verification using Multiple Pass-Phrases,” in Annual Conference of the International Speech Communication association (Interspeech), Portland, United States, Sep. 2012. [Online]. Available: https://hal.science/hal-01927726
  13. [13] A. Kanagasundaram, D. Dean, S. Sridharan, J. Gonzalez-Dominguez, J. Gonzalez-Rodriguez, and D. Ramos, “Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques,” Speech Communication, vol. 59, pp. 69–82, 2014. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639314000053
  14. [14] S. Yolchuyeva, G. Németh, and B. Gyires-Tóth, “Grapheme-to-phoneme conversion with convolutional neural networks,” Applied Sciences, vol. 9, no. 6, p. 1143, 2019.
  15. [15] K. Park and J. Kim, “g2pe,” https://github.com/Kyubyong/g2p, 2019.
  16. [16] J. S. Garofolo, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993, 1993.
  17. [17] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, “Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions,” 2019.
  18. [18] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
  19. [19] S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, and H. Kim, “Metalearning for short utterance speaker recognition with imbalance length pairs,” 2020.
  20. [20] R. Serwy, “Aplawd markings database,” https://github.com/serwy/ aplawdw, 2017.
  21. [21] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Interspeech 2020, H. Meng, B. Xu, and T. Fang Zheng, Eds. ISCA, 2020, pp. 3830–3834.
  22. [22] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
  23. [23] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong et al., “Speechbrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
  24. [24] S. Pigeon, P. Druyts, and P. Verlinde, “Applying logistic regression to the fusion of the nist’99 1-speaker submissions,” Digital Signal Processing, vol. 10, no. 1-3, pp. 237–248, 2000.
  25. [25] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. A. Van Leeuwen, P. Matejka, P. Schwarz, and A. Strasheim, “Fusion of heterogeneous speaker recognition systems in the stbu submission for the nist speaker recognition evaluation 2006,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2072–2084, 2007.
  26. [26] S. O. Sadjadi, C. S. Greenberg, E. Singer, D. A. Reynolds, and L. Mason, “Nist 2020 cts speaker recognition challenge evaluation plan,” 2020.
  27. [27] H. Abdi, “The kendall rank correlation coefficient,” Encyclopedia of measurement and statistics, vol. 2, pp. 508–510, 2007.
  28. [28] N. Fatima and T. F. Zheng, “Syllable category based short utterance speaker recognition,” in 2012 International Conference on Audio, Language and Image Processing, 2012, pp. 436–441.

Authentication 101: The Basics of Call Center Authentication

This is part one in a three-part series on authentication in the contact center. In this first session, we will explore a few key concepts around authentication: what authentication is, what it means to identify and verify someone, and the basics of multi-factor authentication. We will discuss the pros and cons of the different authentication factors that are recognized in the industry and take a look at the various tools and technologies that are being used for authentication today.
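To make the factor categories concrete ahead of the session, here is a minimal, hypothetical sketch (the factor names and the `verify_caller` helper are invented for this example and are not part of any Pindrop product) of how a contact center flow might require checks from at least two distinct factor categories before treating a caller as verified:

```python
# Hypothetical multi-factor check for a call center flow.
# Factor categories follow the common "know / have / are" breakdown.

KNOWLEDGE = "knowledge"    # something you know: PIN, security questions
POSSESSION = "possession"  # something you have: registered phone, one-time passcode
INHERENCE = "inherence"    # something you are: voice or other biometrics

def verify_caller(passed_checks: dict, required_factors: int = 2) -> bool:
    """Return True only if checks from at least `required_factors`
    distinct factor categories succeeded."""
    categories_passed = {category for category, ok in passed_checks.items() if ok}
    return len(categories_passed) >= required_factors

# Example: the caller answered a security question and called from a
# registered number, but no voice-biometric match was attempted.
checks = {KNOWLEDGE: True, POSSESSION: True, INHERENCE: False}
print(verify_caller(checks))  # True -- two distinct factor categories were satisfied
```

The point of the sketch is simply that the checks must come from different categories; two knowledge checks (say, a PIN plus a security question) still count as a single factor.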

Thank You for Registering!

The webinar details will be sent to your email address shortly so you can save the event to your calendar. We look forward to seeing you there.

In the meantime, please check out our 2024 Voice Intelligence and Security Report for the latest fraud trends and solutions.

If you have any questions, please reach out to [email protected].

 


Launch Recap and Q&A with Pindrop CMO Mark Horne

Pindrop has just announced an evolution of Pindrop® Protect, our industry-leading anti-fraud solution, which now extends its protection into the IVR and finds more fraud by leveraging Trace, our graph analysis technology.

Tomorrow, join Pindrop Chief Marketing Officer Mark Horne as he discusses the new technology and the future of graph analytics for predicting fraud. We will also open up time for Q&A.

What is Pindrop Trace? 

Trace connects seemingly unrelated activities to reveal patterns that indicate fraudulent activity (a simplified sketch of this linking idea follows the list below). It provides:

  • Increased accuracy, reduced false positives, and improved cross-channel fraud detection
  • A more complete view of your company’s “fraud universe”
  • Fraud predictions based on analyzing relationships between behaviors, accounts, and other parameters
  • Connections between seemingly disconnected activities across time and accounts
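To make the idea of connecting seemingly unrelated activities concrete, here is a minimal, hypothetical sketch (not Pindrop Trace itself; the event fields and linking rule are invented for illustration) that links call events sharing an identifier such as a phone number or account, then groups them into clusters so that one confirmed fraudulent call can surface related activity:

```python
# Hypothetical sketch of graph-style linking of call-center events.
# Event fields and the linking rule are invented for illustration only.
from collections import defaultdict
from itertools import combinations

events = [
    {"id": "call-1", "phone": "+15550100", "account": "A17"},
    {"id": "call-2", "phone": "+15550100", "account": "B42"},  # shares a phone with call-1
    {"id": "call-3", "phone": "+15550177", "account": "B42"},  # shares an account with call-2
    {"id": "call-4", "phone": "+15550199", "account": "C03"},  # unrelated
]

# Build an undirected graph: two events are linked if they share any identifier.
adjacency = defaultdict(set)
for a, b in combinations(events, 2):
    if a["phone"] == b["phone"] or a["account"] == b["account"]:
        adjacency[a["id"]].add(b["id"])
        adjacency[b["id"]].add(a["id"])

def cluster_of(start):
    """Collect every event reachable from `start` via shared identifiers."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen

visited, clusters = set(), []
for event in events:
    if event["id"] not in visited:
        group = cluster_of(event["id"])
        visited |= group
        clusters.append(sorted(group))

print(clusters)  # [['call-1', 'call-2', 'call-3'], ['call-4']]
```

A production system would weigh many more signals (device and voice characteristics, time windows, risk scores), but even this toy version shows the core idea: one confirmed fraud case can expose a wider cluster of related accounts and calls.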