On the Role of Room Acoustics in Audio Presentation Attack Detection

Nikolay D. Gaubitch and David Looney

Pindrop Inc., London, UK

ABSTRACT

Presentation attack detection (PAD) aims to determine if a speech signal observed at a microphone was produced by a live talker or if it was replayed through a loudspeaker. This is an important problem to address for secure human-computer voice interactions. One characteristic of presentation attacks where recording and replay occur within enclosed reverberant environments is that the observed speech in a live-talker scenario will undergo one acoustic impulse response (AIR) while there will be a pair of convolved AIRs in the replay scenario. We investigate how this physical fact may be used to detect a presentation attack. Drawing on established results in room acoustics, we show that the spectral standard deviation of an AIR is a promising feature for distinguishing between live and replayed speech. We develop a method based on convolutional neural networks (CNNs) to estimate the spectral standard deviation directly from a speech signal, leading to a zero-shot PAD approach. Several aspects of the detectability based on room acoustics alone are illustrated using data from ASVspoof2019 and ASVspoof2021.

Index Terms – Presentation attack detection, reverberation

1. INTRODUCTION

Automatic speaker verification (ASV) systems are becoming increasingly popular in our connected world and thus, there is a growing need to make them not only more accurate but also secure against potential misuse. One critical security aspect is the presentation attack, where a recording of the target voice is replayed through a loudspeaker to the ASV system. Research effort on this topic has been driven largely by the series of ASVspoof challenges [1, 3, 4], where the majority of the existing literature on the topic may also be found. Existing methods typically treat presentation attack detection (PAD) as a classification problem in which classifiers are trained on examples of replayed and bonafide recordings, using both traditional feature design and end-to-end deep learning approaches. There is also related work on one-class detection of modified or synthetic speech [5, 6].

There have been several indications that room acoustics plays an important role in the ability to detect a presentation attack, however, this has not been studied explicitly. In this paper, we focus on room reverberation where we analyze several qualities of the acoustic impulse response (AIR) and the impact this has on presentation attacks. Furthermore, we use the most promising parameter to train a convolutional neural network (CNN) for estimating that parameter from speech directly. We then demonstrate its ability to successfully separate bonafide from replayed speech using the ASVspoof2019 and ASVspoof2021 evaluation data sets; these results highlight some important aspects of the role that room acoustics play in PAD.

The remainder of this paper is organised as follows. In Section 2 we formulate the problem of a presentation attack from the point of view of room acoustics, specifically the convolution of two AIRs that results from a presentation attack. In Section 3 we summarise the key spectral and temporal differences between a single AIR and two convolved AIRs. We define metrics for measuring these properties from AIRs in Section 4, and in Section 5 we investigate which of the properties are suitable for separating a single AIR from two convolved AIRs. Subsequently, in Section 6 we define a CNN architecture to estimate the most promising parameter directly from a speech signal, and in Section 7 we demonstrate how this may be used for PAD, but also where the relevance of room acoustics ends. Finally, we summarise the key findings in Section 8.

2. PROBLEM FORMULATION

We assume that a speech signal s(n) is produced by a live talker and captured by a microphone at a distance from the talker at some location A. The observed signal xA(n) is:

xA(n) = s(n) ∗ hA(n) + νA(n), (1)

where ∗ denotes linear convolution while hA(n) and νA(n) denote the AIR and the ambient noise, respectively. Here ‘location’ refers to an acoustic space and some relative position between talker and microphone; this is the bonafide scenario. In the remainder of this work we assume that there is no additive noise, so that νA(n) = 0 in order to emphasise the effects of reverberation. In the case of a presentation attack, speech captured at location A is replayed from location B and the observed signal is given by

xAB(n) = xA(n) ∗ hB(n) = s(n) ∗ hAB(n), (2)

where hB(n) is the AIR of room B and hAB(n) = hA(n) ∗ hB(n) is the composite AIR of the two acoustic spaces. We investigate the effect of two convolved AIRs, the ability to separate hAB(n) from hA(n), and how to do this directly from the observed speech signals xA(n) or xAB(n) in order to detect a presentation attack.
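To make the two observation models concrete, the following minimal NumPy sketch (our illustration, not the authors' code; s, h_A, and h_B are assumed one-dimensional arrays holding the clean speech and the two AIRs) generates a bonafide observation per (1), with νA(n) = 0, and a replay observation per (2):

```python
import numpy as np
from scipy.signal import fftconvolve

def bonafide(s, h_A):
    """Live-talker observation xA(n) = s(n) * hA(n), noise-free case."""
    return fftconvolve(s, h_A)

def replay(s, h_A, h_B):
    """Presentation attack xAB(n): recording from room A replayed in room B."""
    h_AB = fftconvolve(h_A, h_B)  # composite AIR hAB(n) = hA(n) * hB(n)
    return fftconvolve(s, h_AB)
```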

3. REVIEW OF THE SPECTRAL AND TEMPORAL PROPERTIES OF TWO CONVOLVED AIRS

The effects of two convolved AIRs have been studied previously in the context of speech perception and intelligibility, with a comprehensive contribution in [7]. In this section we summarise the key theoretical results from [7], which will serve as the basis of our work.

3.1 Change in Pulse Density

The number of reflections in the AIR after t seconds for a shoebox room is given by [8]

NA(t) = 4πc³t³ / (3VA), (3)

where c is the speed of sound in metres per second and VA is the room volume in cubic metres. It was shown in [7] that the number of reflections for two convolved AIRs is

NAB(t) = (4πc³)² t⁶ / (180 VA VB). (4)

The number of reflections thus increases with the sixth power of time rather than the third power, as for a single room, so the influence of the sparse, strong early reflections is reduced.
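As a quick numerical check of (3) and (4), with illustrative values of our own choosing, consider two rooms of 100 m³ each and the first 50 ms of the response:

```python
import numpy as np

c = 343.0          # speed of sound [m/s]
V_A = V_B = 100.0  # room volumes [m^3], assumed for illustration
t = 0.05           # time after the direct sound [s]

N_A = 4 * np.pi * c**3 * t**3 / (3 * V_A)                # eq. (3): ~211 reflections
N_AB = (4 * np.pi * c**3)**2 * t**6 / (180 * V_A * V_B)  # eq. (4): ~2230 reflections
```

Already within the early part of the response, the convolved AIR is an order of magnitude denser, which is why the strong early reflections lose their prominence.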

3.2 Transient Distortion

The decay of the expected sound intensity for an acoustic space is governed by e^(-t/τA) where τA is the time constant associated with the absorption of the room boundaries and is proportional to the reverberation time. On the other hand, it was shown that the expected temporal envelope of hAB(n) is driven by the term (e^(-t/τB) – e^(-t/τA)). Thus, for two convolved AIRs, the intensity is governed by two exponential functions rather than one. The two additive exponentials have opposite signs, which leads to an initial rise of energy after the onset of the exponential decay. This is different from the typical behavior of a diffuse reverberation tail.

3.3 Change in Decay

The decay time of two convolved AIRs is observed to be longer than each of the decay times separately. Thus, the apparent reverberation time will increase. It was shown that this change will be dominated by the larger of the two decays. In other words, the late reverberation decay will be driven by the AIR with the longer reverberation time.

3.4 Modulation Transfer Function

The modulation transfer function (MTF) is related to the reverberation time and the temporal structure of the AIR. It has been observed that the MTF is lower for hAB(n) than for a single room, in particular at higher modulation frequencies. Again, this is in line with the increase in reverberation time.

3.5 Spectral Effects

One way to characterize the spectrum is by its modulation strength. The modulation strength for the log-spectrum resulting from the AIR has been shown to be [9]

σA,Lspec = 5.56 dB, (5)

which holds when the source-microphone distance is greater than the critical distance and for frequencies above the Schroeder frequency, given by [8]

fSch ≈ 2000√(T60/VA), (6)

where T60 denotes the reverberation time in seconds. When the source-microphone distance is below the critical distance – the distance at which the direct sound energy equals the reverberant sound energy – the spectral modulation strength decreases; this property was used in [9] to estimate the critical distance. The critical distance, dc, is related to the reverberation time and room volume by [8]

dc = (1/4)√(γSA/π) ≈ 0.057√(γVA/T60), (7)

where γ is the directivity of the source and SA is the total absorption surface of room A. Under the assumption that the two spectra are uncorrelated, it can be shown that the spectral modulation strength for two convolved AIRs is

σAB,Lspec = 8.28 dB. (8)
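To illustrate where these spectral results apply, the following sketch evaluates (6) and (7) for an assumed office-like room (the values are ours, chosen only for illustration):

```python
import numpy as np

V_A = 60.0   # room volume [m^3], assumed
T60 = 0.5    # reverberation time [s], assumed
gamma = 1.0  # directivity of an omnidirectional source

f_sch = 2000 * np.sqrt(T60 / V_A)         # eq. (6): ~183 Hz
d_c = 0.057 * np.sqrt(gamma * V_A / T60)  # eq. (7): ~0.62 m
```

Above roughly 183 Hz and beyond about 0.6 m from the source, the 5.56 dB result in (5) can thus be expected to hold for this room.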

4. AIR-BASED METRICS

We can summarize the findings in Section 3 into three main categories: temporal at the reflection level, temporal at the decay level, and spectral. While we could use some of the theoretical results directly for simulated acoustic environments, it is more practical to have metrics computed from the AIR itself. Consequently, we define one metric for each of these categories.

4.1 Energy Decay Curve

One way to measure and analyze the decay of an AIR is the energy decay curve (EDC), as described in [10]. The EDC is directly linked to the reverberation time and is often used to calculate it from an AIR. It is defined as:

EDC(t) = ∫_t^∞ h²(τ) dτ. (9)
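In discrete time, (9) reduces to a reversed cumulative sum over the squared AIR (Schroeder's backward integration [10]); a minimal sketch:

```python
import numpy as np

def energy_decay_curve(h, eps=1e-12):
    """EDC of an AIR h(n) in dB, via backward integration of h^2 (eq. (9))."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]  # energy remaining from sample n onward
    edc = edc / (edc[0] + eps)           # normalise to 0 dB at n = 0
    return 10 * np.log10(edc + eps)
```

The EDC slope used in Section 5 can then be obtained by a least-squares line fit to a portion of this curve, for instance the segment between -5 dB and -25 dB.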

4.2 Spectral Standard Deviation

The spectral characteristics of an AIR can be quantified using the spectral standard deviation (SSTD) [9, 11], defined as:

σL = √( (1/N) Σ_{k=0}^{N−1} [H(k) − H̄]² ), (10)

where H(k) is the log-spectral magnitude resulting from the N-point discrete Fourier transform (DFT) of h(n), and H̄ is the average of H(k) across frequency.
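A direct NumPy implementation of (10) might look as follows; the 200 Hz lower band edge mirrors the choice made in Section 6 and is our assumption here:

```python
import numpy as np

def sstd(h, fs, f_lo=200.0):
    """Spectral standard deviation (10) of an AIR h(n), in dB."""
    H = 20 * np.log10(np.abs(np.fft.rfft(h)) + 1e-12)  # log-spectral magnitude
    f = np.fft.rfftfreq(len(h), d=1.0 / fs)
    H = H[f >= f_lo]   # keep frequencies in the Schroeder regime
    return np.std(H)   # deviation around the mean across frequency
```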

4.3 Late Reverberation Onset

The echo density profile is a metric for estimating the onset of the diffuse reverberation tail in an AIR. Based on the discussion in Section 3.1, this onset can be expected to occur earlier for two convolved impulse responses. Here we use the echo density profile proposed in [12], defined as

η(n) = (1/erfc(1/√2)) Σ_{τ=n−Nw/2}^{n+Nw/2} w(τ)·1{|h(τ)| > σ(n)}, (11)

where w(τ) is a sliding Hamming window of length Nw, set at 20 ms and normalised to unit sum, σ(n) is the local standard deviation of h(n) within the window, erfc(·) is the complementary error function, and 1{·} is the indicator function, which returns one if the argument is true and zero otherwise. The late reverberation onset is defined as the earliest time for which η(n) ≥ 1.
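A sketch of the profile in (11), following [12] as reconstructed above (the window-handling details are our assumptions):

```python
import numpy as np
from scipy.signal import get_window
from scipy.special import erfc

def echo_density_profile(h, fs, win_ms=20.0):
    """Echo density eta(n) of an AIR h(n), per eq. (11)."""
    n_w = int(fs * win_ms / 1000)
    w = get_window("hamming", n_w)
    w = w / w.sum()                           # unit-sum window weights
    half = n_w // 2
    norm = 1.0 / erfc(1.0 / np.sqrt(2.0))     # Gaussian-reference normalisation
    eta = np.zeros(len(h))
    for n in range(half, len(h) - half):
        seg = h[n - half:n - half + n_w]
        sigma = np.sqrt(np.sum(w * seg**2))   # local standard deviation
        eta[n] = norm * np.sum(w * (np.abs(seg) > sigma))
    return eta
```

The late reverberation onset is then the first index n for which eta[n] ≥ 1.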

5. SEPARATING A SINGLE AND TWO CONVOLVED AIRS

We now investigate the three metrics presented in Section 4 using a set of 31 measured and 500 simulated AIRs. The objective is to study their ability to distinguish between a single AIR and two convolved AIRs. The measured impulse responses were taken from the first microphone of the ‘Lin8Ch’ array in the ACE database [13] and the first microphone of the binaural measurements (without dummy head) from the AIR database [14]. We simulated AIRs using the image method [11]. The room dimensions were chosen at random, drawn from a uniform distribution ranging between 2 m and 15 m for the length and width, and between 2.5 m and 4 m for the height. A randomly selected reverberation time between 0.1 s and 1.2 s was attributed to each room. A source and a microphone were positioned at randomly chosen locations within each room, constraining the distance from any surface to at least 0.5 m and the minimum source-microphone distance to 0.2 m. We considered sampling rates of 16 kHz and 48 kHz.
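One way to realise this simulation protocol is sketched below using pyroomacoustics, an available implementation of the image method [11]; this is our tooling choice, not necessarily the authors', and the 0.2 m minimum source-microphone distance check is omitted for brevity:

```python
import numpy as np
import pyroomacoustics as pra  # one implementation of the image method [11]

rng = np.random.default_rng(0)

def random_air(fs=16000, margin=0.5):
    """Draw one simulated AIR following the protocol described above."""
    dims = [rng.uniform(2, 15), rng.uniform(2, 15), rng.uniform(2.5, 4)]
    t60 = rng.uniform(0.1, 1.2)
    absorption, max_order = pra.inverse_sabine(t60, dims)
    room = pra.ShoeBox(dims, fs=fs, materials=pra.Material(absorption),
                       max_order=max_order)
    # Source and microphone at least `margin` metres from every surface.
    room.add_source([rng.uniform(margin, d - margin) for d in dims])
    room.add_microphone([rng.uniform(margin, d - margin) for d in dims])
    room.compute_rir()
    return np.asarray(room.rir[0][0])
```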

A randomly selected subset of 30 of the 531 AIRs, representing hA(n), was convolved pairwise with every other AIR in that subset to generate hAB(n); only a subset was used in order to keep the data balanced. For each hA(n) and hAB(n), we calculated the slope of the EDC, the SSTD, and the late reverberation onset time. Histograms of these values for the two cases of a single AIR and two convolved AIRs are shown in Fig. 1(a)-(c) for a sampling rate of 48 kHz and in Fig. 1(d)-(f) for a sampling rate of 16 kHz; each figure title shows the Kolmogorov-Smirnov (KS) test statistic. We can make the following observations:

– Clear separation between a single AIR and two convolved AIRs for the SSTD, independent of the sampling rate.

– SSTD centers around 5.6 dB for a single AIR and close to 8 dB for the convolved AIRs, as predicted by (5) and (8), respectively. The distributions will overlap when the talker or the replay occurs below the critical distance in (7).

– Late reverberation onset provides reasonable separation at a sampling rate of 48 kHz but less so at 16 kHz, which is largely due to the fact that impulses spread out in time at lower sampling rates.

– The EDC slope provides some level of separation, but overall there is a large overlap, which is not surprising since the reverberation time will be within reasonable limits for most realistic situations.

6. ESTIMATING SSTD FROM SPEECH

In Section 5, we demonstrated that the SSTD gives the best separation between a single AIR and two convolved AIRs. However, these measurements were made directly from the AIRs, which we rarely have access to in practice. Instead, we would like to estimate the SSTD from the observed reverberant speech so that we can perform further PAD studies. To this end, we devised a VGG-like [15] CNN architecture implemented using TensorFlow [16].

The input layer operates on the spectrogram obtained from 0.5 s of speech, which is passed through two 16-channel convolutional layers followed by max-pooling, and two 32-channel convolutional layers followed by max-pooling; the filter sizes are 3×3 and the pooling stride is (2, 2). The convolutional layer outputs are flattened and, following 25% dropout, fed to a 32-channel fully connected layer before the final one-channel output. All layers use ReLU as the activation function. The network was optimized using Adam with a learning rate of 0.001 and the mean absolute error (MAE) between the estimated and measured SSTD as the loss function. We used speech utterances from the training partition of TIMIT [17], sampled at 16 kHz, and the AIRs described in Section 5. A random selection of 80 speech utterances was drawn from TIMIT for each AIR.
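A Keras sketch consistent with this description is given below; the exact input shape (about 30 STFT frames by roughly 250 frequency bins above 200 Hz, following the framing described later in this section) and the use of 'same' padding are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sstd_cnn(input_shape=(30, 250, 1)):
    """VGG-like SSTD regressor; input shape and padding are assumptions."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="relu"),  # SSTD estimate in dB (non-negative)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mae")                # mean absolute error
    return model
```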

A pre-emphasis filter was applied in order to counter the inherent spectral decay of speech. This is a common pre-processing step in many speech applications [18], and it was found to be essential for estimating the SSTD. The pre-emphasis filter is defined as

y(n) = x(n) − αx(n−1), (12)

where 0 < α < 1 is the filter coefficient, set here to α = 0.9. The pre-emphasized reverberant speech signals were divided into non-overlapping frames of 0.5 s, and the spectrogram was calculated for each frame with a DFT frame size of 512 samples and 50% overlap. Only frequencies above 200 Hz were considered, which approximately satisfies the Schroeder frequency requirement discussed in Section 3.5. We used 25% of the training data for validation and the remaining 75% for training. The network was trained for 50 epochs. The frame estimates were averaged to produce a single SSTD estimate per utterance.
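The feature pipeline might be sketched as follows; the STFT window choice is our assumption, since the text specifies only the frame size and overlap:

```python
import numpy as np

def preemphasis(x, alpha=0.9):
    """Pre-emphasis y(n) = x(n) - alpha * x(n-1), eq. (12)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def utterance_features(x, fs=16000, frame_s=0.5, n_fft=512, f_lo=200.0):
    """Non-overlapping 0.5 s frames -> log-spectrograms above 200 Hz."""
    x = preemphasis(x)
    hop, n_frame = n_fft // 2, int(frame_s * fs)  # 50% DFT overlap
    keep = np.fft.rfftfreq(n_fft, 1.0 / fs) >= f_lo
    feats = []
    for start in range(0, len(x) - n_frame + 1, n_frame):
        seg = x[start:start + n_frame]
        wins = np.lib.stride_tricks.sliding_window_view(seg, n_fft)[::hop]
        spec = np.abs(np.fft.rfft(wins * np.hanning(n_fft), axis=1))
        feats.append(20 * np.log10(spec + 1e-12)[:, keep])
    return np.stack(feats)[..., None]  # (frames, time, freq, 1) for the CNN
```

The per-frame CNN outputs are then averaged to give the utterance-level SSTD estimate.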

We generated a test set with AIRs simulated for four rooms with volumes of 28, 58, 77, and 120 m³ and with reverberation times (RTs) ranging from 0.1 s to 0.7 s in steps of 0.1 s. For each AIR, 10 speech samples were drawn at random from the test portion of TIMIT, so none of the test data was seen in training. The estimation results on the test data are shown in the two-dimensional histogram in Fig. 2, where we see a good match between the estimated and true values; the correlation coefficient is 0.96 and the MAE is 0.29 dB.

7. SSTD FOR PRESENTATION ATTACK DETECTION

We have shown that the SSTD provides good separation between a single AIR and two convolved AIRs, and that we are able to estimate it directly from reverberant speech. As a final step, we investigated to what extent the CNN can be used as a zero-shot method for PAD. We used the estimated SSTD as a score for each speech utterance and explored different thresholds for separating bonafide from replayed speech. We utilized the ASVspoof2019 [3] and ASVspoof2021 [4] evaluation datasets. These datasets are suitable for the task because they specifically consider different reverberant scenarios in a controlled environment: the 2019 data uses simulated reverberation conditions, while the 2021 data contains real room recordings.
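A minimal sketch of the zero-shot scoring step: the SSTD estimate itself is the score, and replayed utterances are expected to score higher (the label convention 1 = replay, 0 = bonafide is assumed here):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER when thresholding SSTD scores, via a sweep over sorted scores."""
    order = np.argsort(scores)
    y = np.asarray(labels)[order]
    # Thresholding after the i-th sorted score: replays at or below it are
    # misses, bonafide utterances above it are false alarms.
    fnr = np.cumsum(y == 1) / max((y == 1).sum(), 1)
    fpr = 1.0 - np.cumsum(y == 0) / max((y == 0).sum(), 1)
    i = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[i] + fpr[i])
```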

The ASVspoof2019 dataset [3] contains 134,730 samples of bonafide and replayed speech. The data is divided into different categories as shown in Table 1. Each sample is annotated with a triplet (S, R, D_s) and a pair (Z, Q), forming different combinations of room sizes, RTs, and speaker-microphone distances. In addition to the complete dataset (annotated as ‘full’), we focused on subsets of the data that clearly illustrate different aspects of reverberation-driven PAD. We selected (c,a,a) to represent large rooms with short reverberation times and small source-microphone separation, where we would expect poor PAD performance; (b,c,c) for the most favorable conditions for reverberation-driven PAD; and (b,b,b) for a realistic office-like example. There are 4,990 samples in each subset.

The results for these experiments are shown by the detection error trade-off (DET) plots in Fig. 3(a), where we observe the expected outcome. The equal error rate (EER) for the complete dataset is 22.37% and improves progressively from the worst case (c,a,a) to the best case (b,c,c), with EERs of 33.43% and 2.24%, respectively. To put this in perspective, the best performing baseline of ASVspoof2019 [3] had an EER of 11.04%. Note that the contributions in the ASVspoof challenge, unlike our zero-shot approach, were trained on development data closely linked to the test partition.

We then focused on the scenario of small but realistic reverberant spaces (b,c,c) to study the effect of the attacker-to-talker separation and the replay device quality. We considered three cases—(A,A), (B,B), and (C,C)—representing increasing attacker-talker distance and decreasing replay device quality. The result is shown in Fig. 3(b), where we can clearly see that the most favorable condition for reverberation-driven PAD is given by the case (C,C). In other words, the best separation between bonafide speech and presentation attacks is achieved when both the recorded and the replayed speech are reverberant and at a sufficiently large distance from the microphone, at which point the EER reaches 1.04%. The distance will depend on the room volume and the reverberation time as seen in (7).

The ASVspoof2021 dataset contains real recordings from nine rooms at different source-microphone and attacker-talker distances. We used a similar approach as for the 2019 data and analyzed the complete dataset (indicated as ‘full’) and the cases of ‘d1’, the largest attacker-to-microphone distance of 2 m, combined with ‘c2’, ‘c3’, and ‘c4’, indicating attacker-to-talker distances of 1.5 m, 1 m, and 0.5 m, respectively. The results are shown in Fig. 4, where the effects of reverberation are clearly seen. The EER on the complete data is 36.28%, which is slightly lower than the best challenge baseline of 38.07%. When the reverberation conditions are favorable (largest attacker-talker and attacker-microphone distances), the EER decreases to 16.39%.

Interestingly, it was observed in [2] that features based on the frame-level log-energy greatly improved PAD performance; we believe that this could be partially explained by the reverberation analysis provided in this paper. While our focus was largely on the SSTD, including additional information such as the EDC slope and the late reverberation onset could further improve PAD, especially at higher sampling rates. Further studies combining the aforementioned ideas with more traditional PAD methods are left for future work.

8. CONCLUSIONS

We posed the problem of PAD as the separation between a single AIR and two convolved AIRs, and we summarized several differences derived from room acoustics theory. Our analysis showed that the most significant difference is observed with the SSTD. We used a CNN framework for accurate estimation of the SSTD from speech and applied it for zero-shot PAD. The method was evaluated using the ASVspoof2019 dataset, where we achieved an EER of 22.37% on the complete dataset and 1.04% on a portion of the data where we expect better discriminability. Similar trends were observed for the ASVspoof2021 dataset, where an EER of 36.28% was achieved on the complete dataset, 1.79% lower than the challenge baseline. Most importantly, we provided valuable insights into the relevance of room acoustics to PAD.

9. REFERENCES

[1] T. Kinnunen, Md Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. Aik Lee, “The ASVspoof 2017 Challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech, 2017, pp. 2–6.

[2] H. Delgado, M. Todisco, Md Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, “ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancements,” in Odyssey 2018 – The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, June 2018.

[3] M. Todisco, X. Wang, V. Vestman, Md Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. Aik Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Proc. Interspeech, Graz, Austria, Sep 2019.

[4] X. Liu, X. Wang, Md Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2507–2522, 2023.

[5] D. Looney and N. D. Gaubitch, “On the detection of pitch-shifted voice: Machines and human listeners,” in Proc. IEEE Intl. Conf. on Acoust., Speech, Signal Process. (ICASSP), Toronto, Canada, June 2021, pp. 5804–5808.

[6] F. Alegre, A. Amehraye, and N. Evans, “A one-class classification approach to generalized speaker verification spoofing countermeasures using local binary patterns,” in Proc. IEEE Int. Conf. Biometrics: Theory Applications and Systems (BTAS), Arlington, VA, USA, Sept. 2013.

[7] A. Haeussler and S. van de Par, “Crispness, speech intelligibility, and coloration of reverberant recordings played back in another reverberant room (Room-in-Room),” J. Acoust. Soc. Am., vol. 145, no. 2, pp. 931–942, Feb. 2019.

[8] H. Kuttruff, Room Acoustics, Taylor and Francis, London, U.K., 2000.

[9] J. J. Jetzt, “Critical distance measurement of rooms from the sound energy spectral response,” J. Acoust. Soc. Am., vol. 65, no. 5, pp. 1204–1211, May 1979.

[10] M. R. Schroeder, “New method of measuring reverberation time,” J. Acoust. Soc. Am., vol. 37, pp. 409–412, 1965.

[11] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, Apr. 1979.

[12] J. S. Abel and P. Huang, “A simple, robust measure of reverberation echo density,” in Proc. AES 121st Convention, San Francisco, USA, Oct. 2006.

[13] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 10, pp. 1681–1693, Oct. 2016.

[14] M. Jeub, M. Schafer, and P. Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Proc. Intl. Conf. Digital Signal Process., Jul 2009.

[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[16] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, Software available from tensorflow.org.

[17] J. S. Garofolo, “Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database,” Technical report, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, Dec. 1988.

[18] T. Backström, O. Räsänen, A. Zewoudie, P. P. Zarazaga, L. Koivusalo, S. Das, E. Gomez Mellado, M. Bouafif Mansali, D. Ramos, S. Kadiri, and P. Alku, Introduction to Speech Processing, https://speechprocessingbook.aalto.fi, 2nd edition, 2022.
