
Robust spread spectrum speech watermarking using linear prediction and deep spectral shaping

David Looney, Nikolay D. Gaubitch

Pindrop Inc., London, UK
[email protected], [email protected]

Abstract

We consider the problem of robust watermarking of speech signals using the spread spectrum method. To date, it has primarily been applied to music signals. Here we discuss differences between speech and music, and the implications this has on the use of spread spectrum watermarking. Moreover, we propose enhancements to the watermarking of speech for the detection of deepfake attacks at call centers using classical signal processing techniques and deep learning.

Index Terms: watermarking, spread spectrum

1. Introduction

With the rise of generative AI, it is becoming increasingly difficult to validate the authenticity of audio and video. A possible solution is to apply a digital watermark to synthetically-generated media content, which can then be used to make users aware that the media is indeed synthetically generated [1, 2]. One area where synthetic speech in particular poses a risk is in call centers. However, synthetic speech is typically generated with high quality at high sampling rates, and by the time it reaches a call center, it inevitably undergoes a series of degradations, such as downsampling and compression. Further degradations include acoustic noise and reverberation if replayed through a loudspeaker. This creates a significant challenge to robust watermarking, which is the topic of this work.

Audio watermarking received attention when music streaming and file sharing platforms were becoming popular in the early 2000s, mostly to facilitate intellectual property protection. It is also in this period that much of the work on the topic was published [3]. The required characteristics of a robust watermark are that it should be imperceptible to a human listener (imperceptibility), capable of withstanding deliberate attacks or anticipated signal degradations (robustness), and able to carry information (capacity) [3, 4]. Furthermore, we are only concerned with blind-watermarking detection methods where the original signal is not available at the decoder. The main approaches for audio watermarking that satisfy these requirements include the insertion of time-domain echoes [5, 6], spread spectrum modulation [7, 8, 9], or quantization index modulation [10]. In recent works, such as [11], end-to-end neural watermarking schemes are proposed that appear promising. However, at this stage, they require long speech utterances and tend to be computationally inefficient. Many techniques make use of the operation of the human auditory system to achieve better imperceptibility [3, 6, 7, 12].

We consider the problem of robust watermarking of speech signals using a spread spectrum method based on [7]. While this method was applied previously to music signals, in Section 3 we discuss differences between speech and music signals and the implications this has on the use of spread spectrum watermarking. Moreover, we propose modifications in the form of spectral shaping, both with respect to the encoder and the decoder, to tailor the spread spectrum method to speech signals. In the case of encoding, in Section 4 we show how linear prediction coding (LPC) analysis can be used to dynamically adjust the watermarking sequence to yield increased imperceptibility with no tradeoff in robustness. In the case of decoding, in Section 5 we show how a deep learning model can replace the standard decoding operation to improve robustness by emphasizing spectral components in a data-driven fashion. The analysis is supported in experimental scenarios matching the call center use-case, where encoding is performed on high-quality speech, and decoding is performed after applying typical telephony degradations (additive noise, downsampling, codec).

2. Spread spectrum watermarking

We want to add a watermark w to a speech signal s(n). The watermark is typically a pseudo-random sequence, w ∈ {±1}^{N_w}, which is applied to the signal in some transform domain [7, 8]. Here, the watermark is added to the l-th frame in the log-spectral domain using the short-time Fourier transform (STFT):

X_dB(k, l) = S_dB(k, l) + δ w(k),        (1)

where S_dB(k, l) = 20 log10 |S(k, l)|, k is the frequency index, S(k, l) is the discrete Fourier transform (DFT) of the l-th frame of s(n), and δ is a scaling parameter that controls the watermark strength. The watermarked time-domain signal x(n) is reconstructed from |X(k, l)| and the phase of S(k, l).
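For illustration, a minimal numpy sketch of the per-frame embedding in (1) is given below. The absence of windowing and overlap-add, and the choice of bin range (bin 10 corresponds to 500 Hz for 20 ms frames at 16 kHz), are simplifying assumptions rather than the exact implementation used here.

```python
import numpy as np

def embed_frame(frame, w, delta, k_start):
    """Add the +/-1 watermark w to one frame's log-magnitude spectrum, as in (1)."""
    S = np.fft.rfft(frame)                        # DFT of the l-th frame
    mag_db = 20.0 * np.log10(np.abs(S) + 1e-12)   # S_dB(k, l)
    phase = np.angle(S)
    k = np.arange(k_start, k_start + len(w))      # bins carrying the watermark
    mag_db[k] += delta * w                        # X_dB(k, l) = S_dB(k, l) + delta * w(k)
    X = 10.0 ** (mag_db / 20.0) * np.exp(1j * phase)
    return np.fft.irfft(X, n=len(frame))          # reconstruct with the original phase

# Example: 20 ms frame at 16 kHz (50 Hz bins), N_w = 63, watermark starting at 500 Hz
rng = np.random.default_rng(0)
w = rng.choice([-1.0, 1.0], size=63)
x_frame = embed_frame(rng.standard_normal(320), w, delta=1.0, k_start=10)
```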

From detection theory, the optimal detector of the watermark is a matched filter:

r(l) = (1/N_w) Σ_k X_dB(k, l) w(k),        (2)

where N_w is the watermark length and σ_s is the standard deviation of the signal. The false alarm and false rejection probabilities are given by:

P_FA = (1/2) erfc( τ √N_w / (σ_s √2) ),    P_FR = (1/2) erfc( (δ − τ) √N_w / (σ_s √2) ),

where erfc(·) is the complementary error function and τ is a detection threshold parameter. An important metric for this work is the equal error rate (EER), when P_FA = P_FR, which is obtained by setting the threshold to τ = δ/2.
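A sketch of the detector in (2) and of the Gaussian-approximation error rates is given below; the normalization of the statistic and the example parameter values are assumptions consistent with the equations above, not a reproduction of the authors' code.

```python
import numpy as np
from scipy.special import erfc

def detect(x_db, w):
    """Matched-filter statistic from (2): normalized correlation of the
    received log-spectrum (at the watermark bins) with the watermark sequence."""
    return float(np.dot(x_db, w)) / len(w)

def error_rates(delta, sigma_s, n_w, tau):
    """False-alarm / false-rejection probabilities under the Gaussian model above."""
    p_fa = 0.5 * erfc(tau * np.sqrt(n_w) / (sigma_s * np.sqrt(2.0)))
    p_fr = 0.5 * erfc((delta - tau) * np.sqrt(n_w) / (sigma_s * np.sqrt(2.0)))
    return p_fa, p_fr

# EER: tau = delta / 2 makes P_FA equal P_FR (example values, not taken from the paper)
p_fa, p_fr = error_rates(delta=1.0, sigma_s=13.5, n_w=63, tau=0.5)
```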

3. Watermarking speech vs. music signals

It is known to be more challenging to achieve a balance between imperceptibility and robustness when adding a watermark to speech compared to music [3, 7] due to the more limited spectral content of the former. We view this problem from a different angle by considering the choice of frame-length L. This is important because L governs the length of the watermark signal, which in turn is related to robustness as seen in the equations above. For our experiments, we employed two objective audio quality metrics: the speech-specific perceptual evaluation of speech quality (PESQ) [13] and the open source implementation GstPEAQ [14] of perceptual evaluation of audio quality (PEAQ) [15, 16], which has been used previously for the evaluation of watermarking of music [17]. We used 200 randomly selected speech utterances from TIMIT [18] with a sampling rate of 16 kHz. We varied the frame-length between 20 ms and 200 ms, adding a watermark of the same length as the frame-length to each signal according to (1).
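The frame-length sweep can be scripted along the following lines using the reference PESQ implementation from the `pesq` package; here `embed` is a hypothetical handle to the Section 2 encoder and the list of frame lengths is illustrative.

```python
from pesq import pesq  # pip install pesq

def sweep_frame_lengths(s, fs, embed, frame_lengths_ms=(20, 50, 100, 150, 200)):
    """Embed with each frame length and score speech quality with wideband PESQ.

    `embed(s, fs, frame_ms)` is a hypothetical encoder handle (e.g. the Section 2
    method applied frame by frame); only the quality sweep itself is shown here.
    """
    scores = {}
    for frame_ms in frame_lengths_ms:
        x = embed(s, fs, frame_ms)               # watermarked signal
        scores[frame_ms] = pesq(fs, s, x, 'wb')  # wideband PESQ requires fs = 16000
    return scores
```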

We observed that as the frame-length increases, the watermark becomes more audible, and at the same time, the EER decreases. On the other hand, if we reduce the watermark strength by scaling δ, we can maintain constant perceptual quality but at the expense of an increasing EER.

Next, we performed the same experiment using TIMIT speech samples and 144 seven-second music excerpts from MUSDB [19], with sampling rates of 16 kHz and 44.1 kHz, measuring the audio quality with PEAQ. We observed that the degradation in speech quality with increasing frame-length is noticeable, whereas the music audio quality remains unaffected by the frame-length, independent of the sampling rate. From these results, we conclude that for speech signals the frame-length should be chosen between 20 ms and 30 ms for the best trade-off between imperceptibility and robustness. For music, the frame-length could be much longer, giving greater robustness without perceptual degradation.

4. LPC Weighting

It has been stated previously that the spread spectrum watermark should be added to the frequency components with the greatest energy to enable robustness to degradations [7, 8]. In the case of speech, there is a counterargument because frequency components with the greatest energy typically correspond to formant peaks, and their disruption will impact speech quality. We introduce a watermark weighting scheme to reduce the strength dynamically along frequency at each frame based on the LPC log-spectrum.

Taking the LPC log-spectrum components of 400 speech utterances (200 female) from the TIMIT corpus, we model the values as a Gaussian distribution with zero mean and a standard deviation of 13.5 (see Fig. 2 (left)). We use the Gaussian cumulative distribution function (CDF) F(x) to yield a weighting function γ(x) = (1 − F(x))^α, which reduces the watermark strength at high-energy spectral components; the parameter α controls the degree to which these components are attenuated. The revised watermark sequence, w_lpc, is created by obtaining the LPC log-spectrum, X_lpc(k, l), for each frame of speech and adjusting the watermark strength along frequency as:

w_lpc(k, l) = γ(X_lpc(k, l)) w(k).
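A sketch of this per-frame weighting is given below, assuming librosa's LPC routine, an LPC order of 16, and mean removal of the LPC log-spectrum; none of these choices are specified above.

```python
import numpy as np
import librosa
from scipy.signal import freqz
from scipy.stats import norm

def lpc_weighted_watermark(frame, w, k_start, alpha=0.35, order=16, sigma=13.5):
    """Weight the watermark by gamma(x) = (1 - F(x))**alpha evaluated on the
    LPC log-spectrum, attenuating it at high-energy (formant) regions.
    The LPC order and the zero-mean normalization are assumptions."""
    a = librosa.lpc(frame.astype(float), order=order)   # LPC coefficients
    _, h = freqz(1.0, a, worN=len(frame) // 2 + 1)       # LPC spectral envelope
    lpc_db = 20.0 * np.log10(np.abs(h) + 1e-12)
    lpc_db -= lpc_db.mean()                               # zero-mean log-spectrum

    k = np.arange(k_start, k_start + len(w))              # watermark bins
    gamma = (1.0 - norm.cdf(lpc_db[k], loc=0.0, scale=sigma)) ** alpha
    return gamma * w                                       # w_lpc for this frame
```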

4.1. Perceptibility Study

We first study the impact of the proposed scheme on perceptibility, as measured by speech quality. In Fig. 3, we show the narrowband (NB) and wideband (WB) PESQ scores for both the standard SS method and the proposed LPC-weighted one, obtained from 200 TIMIT utterances (100 female). The data is clean speech sampled at 16 kHz, and the watermark length N_w is 63 (element width: 50 Hz, watermark start frequency: 500 Hz, watermark end frequency: 3650 Hz, frame length: 20 ms (320 samples), each frame encoded with the same pseudo-random sequence). As expected, for the same value of δ, the PESQ scores are higher using the proposed scheme, as the watermark strength has been reduced in high-energy spectral regions. For instance, at δ = 4, the NB and WB PESQ scores are, respectively, 3.50 and 3.24 using the original scheme, and 4.07 using the proposed one. The gender imbalance is also improved.

4.2. Robustness Study

We encoded 200 clean utterances sampled at 16 kHz from the TIMIT corpus (100 female) using both the standard spread spectrum method and the proposed scheme with α = 0.35.

Based on the analysis presented in the previous section, different values for δ were explored such that the narrowband PESQ scores for each scheme would be similar. This allowed us to evaluate any gains in robustness for the same level of imperceptibility. Prior to applying the decoding algorithm, the watermarked and watermark-free utterances were subjected to degradations: additive white Gaussian noise at SNRs of 20 dB and 15 dB, and a downsampling operation to 8 kHz.
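The degradation chain can be approximated as follows; treating the noise and downsampling as a single chain and the use of polyphase resampling are our assumptions.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def add_awgn(x, snr_db, rng):
    """Additive white Gaussian noise at the target SNR (computed over the whole signal)."""
    noise = rng.standard_normal(len(x))
    target_noise_pow = np.mean(x ** 2) / (10.0 ** (snr_db / 10.0))
    return x + noise * np.sqrt(target_noise_pow / np.mean(noise ** 2))

def degrade(x, fs=16000, snr_db=20, fs_out=8000):
    """Approximate evaluation chain: AWGN followed by downsampling to 8 kHz."""
    y = add_awgn(x, snr_db, np.random.default_rng(0))
    g = gcd(fs_out, fs)
    return resample_poly(y, fs_out // g, fs // g), fs_out
```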

Fig. 4 shows the PESQ scores versus EER for the standard and proposed methods, where decoding has been performed using (2).

For 20 dB SNR and a NB PESQ of 4.4, the EERs using the standard and proposed methods are, respectively, 5.5% and 1.7% (69% reduction). For 15 dB SNR and the same NB PESQ, the EERs using the standard and proposed methods are 8.2% and 5.9% (28% reduction). Note that the robustness gains are greater at comparable WB PESQ scores; at a WB PESQ of 4.4, the reductions in EER are 83% and 62% for SNRs of 20 dB and 15 dB, respectively.

4.3. Comparison with Related Work

A spread spectrum approach presented in [9] might appear to contradict the objective of the proposed method, as it seeks to adjust the watermark spectrum to more closely match the spectral shape of speech based on the LPC model. We illustrate below how both methods achieve similar outcomes via their application to an example speech signal, the average spectrum of which is shown in Fig. 5 (a).

A core difference between the methods is the domain in which encoding and decoding are performed. In [9], the initial watermark signal is created from a filtered binary phase-shift keying (BPSK) sequence added to the speech signal in the time domain. To enable a comparison with our approach, we obtained the difference between the spectra of the speech signal and the BPSK-watermarked signal on a frame-by-frame basis. Fig. 5 (b) shows the average and standard deviation of that difference, i.e., the mean and standard deviation of the watermark spectrum. Note the flat spectral shape.

Fig. 5 (c) shows the spectral properties of the watermark signal post LPC-filtering. Observe how the spectral mean matches that of the speech signal, which the authors found improved the robustness of the watermark for a comparable speech quality [9].

In contrast, the spread spectrum method considered in this manuscript adds the sequence in the log-spectrum domain. Therefore, the watermark signal already matches the spectral shape of the speech signal by design. However, as shown in Fig. 5 (d), the standard deviation is large relative to the mean. This is expected, as the sequence must be strong in the (log-)spectral domain for robustness, but it can disrupt formant peaks and impact speech quality.

We see the spectral properties after applying the proposed LPC-weighting in Fig. 5 (e) and Fig. 5 (f) for α = 0.35 and α = 1.0, respectively. Note that for increasing α, we reduce the watermark spectral deviation, but this comes at the cost of causing the spectral mean to deviate from that of the speech signal.

As the approaches belong to fundamentally different classes of spread spectrum methods—where the encoding/decoding domains are time and log-spectrum—a direct performance comparison is outside the scope of this work. Nonetheless, our studies indicate better performance for speech when the spread spectrum method is implemented in the log-spectrum domain.

5. Deep Decoding

In the spread spectrum method, applying the dot product to the decoded spectrum, as in (2) and often in combination with a cepstral filter, is the optimal decoding solution in clean and simple degradation scenarios (e.g., added white Gaussian noise). However, in telephony use cases, the degradations are more complex. The speech signal may pass through a loudspeaker, introducing ambient noises and delays, and be subjected to filtering as it is acquired by a microphone. Furthermore, the telephony channel itself adds degradations such as downsampling, packet loss, and codec compression.

To address these challenges, we propose a deep-learning decoding strategy—“deep decoding”—which tailors the spread spectrum method to both the host signals (speech) and complex degradation environments.

We consider two low-complexity models, where the input layer operates on the frequency indices of the cepstral-filtered decoded spectral frame corresponding to the embedded watermark. In this study, we assume a watermark length of 63. The first model (Model A) comprises a pair of dense layers with 64 and 32 units, respectively, using ReLU activation functions, and a single-unit dense layer with linear activation as the output (Fig. 6, left). The second model (Model B) uses a 1D convolutional layer (16 filters, kernel width 3), a 16-unit dense layer, and a single-unit dense layer output (Fig. 6, right).
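A Keras sketch of the two decoder models is given below; the hidden activation and flattening step in Model B, and the training loss, are assumptions not fully specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_W = 63  # watermark length = input dimension (cepstral-filtered decoded frame)

# Model A: dense layers with 64 and 32 units (ReLU), single linear output unit.
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(N_W,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="linear"),
])

# Model B: 1D convolution (16 filters, kernel width 3), 16-unit dense, single output unit.
model_b = tf.keras.Sequential([
    tf.keras.Input(shape=(N_W, 1)),                        # frames reshaped to (N_W, 1)
    layers.Conv1D(16, kernel_size=3, activation="relu"),   # hidden activation assumed
    layers.Flatten(),                                       # flattening step assumed
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="linear"),
])

# Training setup (Section 5): RMSProp; cross-entropy from logits is our assumption.
for m in (model_a, model_b):
    m.compile(optimizer="rmsprop",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
```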

To yield a training dataset for the deep decoding models, encoding was performed on 4620 clean utterances (training partition of the TIMIT corpus, 462 speakers, 10 utterances per speaker) with a 16 kHz sampling frequency, a frame size of 20 ms, and a watermark length Nw of 63 (element width: 50 Hz, watermark start frequency: 500 Hz, watermark end frequency: 3650 Hz). The same utterances were used to generate watermark-free data. Two degradation scenarios were considered: (1) a downsampling operation to 8 kHz and (2) a downsampling operation to 8 kHz followed by encoding with the Opus codec at 8 kbps. The encoding watermark strength was δ = 1 for the first scenario and δ = 3 for the second. After degradation, a voice activity detector (VAD) was used to retain speech frames.
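One way to approximate the second degradation scenario is to round-trip the audio through the Opus codec at 8 kbps using the ffmpeg CLI, as sketched below; the specific encoder flags are assumptions for illustration only and require an ffmpeg build with libopus.

```python
import subprocess

def opus_8kbps_roundtrip(wav_in, wav_out, tmp_opus="tmp.opus"):
    """Downsample to 8 kHz and pass through the Opus codec at 8 kbps (flags assumed)."""
    subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-ar", "8000",
                    "-c:a", "libopus", "-b:a", "8k", tmp_opus], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", tmp_opus, "-ar", "8000", wav_out], check=True)
```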

Decoded frames were obtained from both the watermarked data and the watermark-free data as X_b = g(X̃_dB) ⊙ w, where X̃_dB is the log power spectrum of a speech frame at the indices where the watermark is encoded, g(·) is the cepstral filter, and ⊙ denotes element-wise multiplication.
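A sketch of how a decoded frame could be formed, assuming g(·) is a liftering operation that zeroes low-quefrency coefficients of the log-spectrum segment; the number of zeroed coefficients is an assumption.

```python
import numpy as np

def cepstral_filter(x_db, n_lifter=4):
    """Sketch of g(.): suppress the slowly varying spectral envelope by zeroing
    low-quefrency coefficients (the number zeroed, 4, is an assumption)."""
    c = np.fft.rfft(x_db)
    c[:n_lifter] = 0.0
    return np.fft.irfft(c, n=len(x_db))

def decoded_frame(x_db_band, w):
    """X_b = g(X~_dB) * w (element-wise); summing X_b recovers the dot-product statistic."""
    return cepstral_filter(x_db_band) * w
```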

For each degradation scenario, the models were trained over 25 epochs using the RMSProp method to distinguish between frames with and without the watermark. We evaluated the decoding models on frames obtained from 1680 utterances from the test partition of the TIMIT corpus, with the same encoding and decoding parameters and degradation conditions matching those in training. The standard decoding dot-product output was obtained by summing the elements of X_b.

We evaluated performance not only at the single-frame level but also by aggregating the outputs across speech frames within utterances. The single-frame EERs for the first degradation scenario are 27.7%, 25.8%, and 21.8% for the dot product and decoding models A and B, respectively, and 37.2%, 35.0%, and 30.0% for the second degradation scenario. Fig. 7 and Fig. 8 show the EERs versus the number of aggregated frames for each degradation scenario. The greater the number of aggregated frames, the more the deep decoding schemes yield performance gains compared to the dot product. Model B enables the largest reductions in EER, facilitating reductions of 70% (10 frames) and 98% (30 frames) for the first degradation scenario, and 59% (10 frames) and 86% (30 frames) for the second degradation scenario.
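Frame aggregation and the EER computation could look as follows; averaging the per-frame decoder outputs within a group is our assumption for the aggregation rule.

```python
import numpy as np

def aggregate(scores, n_frames):
    """Average per-frame decoder outputs over groups of n_frames within an utterance."""
    scores = np.asarray(scores, dtype=float)
    usable = len(scores) - len(scores) % n_frames
    return scores[:usable].reshape(-1, n_frames).mean(axis=1)

def eer(pos, neg):
    """Equal error rate from decoder scores for watermarked (pos) and clean (neg) data."""
    pos, neg = np.asarray(pos, dtype=float), np.asarray(neg, dtype=float)
    thresholds = np.sort(np.concatenate([pos, neg]))
    fars = np.array([(neg >= t).mean() for t in thresholds])  # false-alarm rates
    frrs = np.array([(pos < t).mean() for t in thresholds])   # false-rejection rates
    i = np.argmin(np.abs(fars - frrs))                        # closest operating point
    return 0.5 * (fars[i] + frrs[i])
```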

6. Conclusions

We have studied the popular spread spectrum watermarking method, typically applied to music, for speech signals. Our analysis has revealed that an encoding frame length of approximately 20 ms to 30 ms for speech achieves the optimal balance between watermark robustness and perceptibility. We have introduced extensions of the core method that address the encoding and decoding operations separately, enabling reductions in equal error rates without compromising speech quality. This work shows promise for applying watermarking to synthetic speech data to facilitate malicious-use detection, even in challenging environments such as call centers.

7. References

1. Ricketts, (2023). Ricketts introduces bill to combat deepfakes, require watermarks on A.I.-generated content. Available: https://www.ricketts.senate.gov/wp-content/uploads/2023/09/Advisory-for-AI-Generated-Content-Act.pdf

2. J. Davidson, (2024). Senate pursues action against AI deepfakes in election campaigns. Available: https://www.washingtonpost.com/politics/2024/04/26/senate-deepfakes-campaigns-ban/

3. M. A. Nematollahi and S. A. R. Al-Haddad, “An overview of digital speech watermarking,” Int. J. Speech Technol., vol. 16, no. 4, pp. 471–488, 2013.

4. M. Arnold, “Audio watermarking: Features, applications and algorithms,” in Proc. Intl. Conf. Multimedia and Expo (ICME), vol. 2, 2000, pp. 1013–1016.

5. D. Gruhl, A. Lu, and W. Bender, “Echo hiding,” in Information Hiding, R. Anderson, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 295–315.

6. G. Hua, J. Goh, and V. L. L. Thing, “Time-spread echo-based audio watermarking with optimized imperceptibility and robustness,” IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 2, pp. 227–239, 2015.

7. D. Kirovski and H. S. Malvar, “Spread-spectrum watermarking of audio signals,” IEEE Trans. Signal Process., vol. 51, no. 4, pp. 1020–1033, 2003.

8. H. S. Malvar and D. A. F. Florencio, “Improved spread spectrum: A new modulation technique for robust watermarking,” IEEE Trans. Signal Process., vol. 51, no. 4, pp. 898–905, 2003.

9. C. Qiang and J. Sorensen, “Spread spectrum signaling for speech watermarking,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), vol. 3, 2001, pp. 1337–1340.

10. B. Chen and G. W. Wornell, “Digital watermarking and information embedding using dither modulation,” in IEEE Workshop on Multimedia Signal Processing, 1998, pp. 273–278.

11. P. O’Reilly, Z. Jin, J. Su, and B. Pardo, “MaskMark: Robust neural watermarking for real and synthetic speech,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 4650–4654.

12. M. D. Swanson, B. Zhu, A. H. Tewfik, and L. Boney, “Robust audio watermarking using perceptual masking,” Signal Processing, vol. 66, no. 3, pp. 337–355, 1998.

13. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) – A new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), vol. 2, 2001, pp. 749–752.

14. M. Holters and U. Zölzer, “GstPEAQ – An open source implementation of the PEAQ algorithm,” in Proc. of the 18th Int. Conference on Digital Audio Effects (DAFx-15), 2015.

15. T. Thiede et al., “PEAQ – The ITU standard for objective measurement of perceived audio quality,” J. Audio Eng. Soc., vol. 48, no. 1/2, pp. 3–29, 2000.

16. P. M. Delgado and J. Herre, “Can we still use PEAQ? A performance analysis of the ITU standard for the objective assessment of perceived audio quality,” in 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6, 2020.

17. C. Neubauer and J. Herre, “Digital watermarking and its influence on audio quality,” in Proc. of the 105th Convention of the Audio Engineering Society, Sep. 1998.

18. J. S. Garofolo, “Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database,” National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, Technical Report, Dec. 1988.

19. Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017.
