Abstract

Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit the analysis, sharing, and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach reverses waveform segments to distort speech content. A voice activity detection (VAD) and speech separation pipeline enhances this process by targeting speech more precisely.
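To make the core idea concrete, here is a minimal sketch of segment-wise time reversal. It is an illustration under stated assumptions, not the implementation from the paper: the function name `reverse_segments`, the fixed `segment_len`, and the per-segment boolean `speech_mask` (standing in for the output of a VAD stage) are all hypothetical, and the speech separation stage is not reproduced.

```python
import numpy as np

def reverse_segments(signal, segment_len, speech_mask=None):
    """Reverse each fixed-length segment of a 1-D signal in time.

    speech_mask is a hypothetical stand-in for a VAD decision: one boolean
    per segment, and only segments flagged True are reversed. When it is
    None, every segment is reversed.
    """
    signal = np.asarray(signal)
    out = signal.copy()
    n_segments = int(np.ceil(len(signal) / segment_len))
    for i in range(n_segments):
        if speech_mask is not None and not speech_mask[i]:
            continue  # leave non-speech segments untouched
        start = i * segment_len
        stop = min(start + segment_len, len(signal))  # last segment may be shorter
        out[start:stop] = signal[start:stop][::-1]
    return out

x = np.arange(8)
print(reverse_segments(x, 4))                 # [3 2 1 0 7 6 5 4]
print(reverse_segments(x, 4, [True, False]))  # [3 2 1 0 4 5 6 7]
```

In this sketch, restricting the reversal to flagged segments is what leaves the rest of the scene unchanged; the segment length that would be appropriate for real audio is not stated on this page.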

To demonstrate the effectiveness of the proposed approach, we use a three-part evaluation protocol that assesses:
1) Speech intelligibility using Word Error Rate (WER),
2) Sound source detectability using Sound Source Classification Accuracy-Drop (SCAD) from a widely used pre-trained model, and
3) Audio quality using the Fréchet Audio Distance (FAD), computed with our reference dataset that contains unaltered speech.
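As a reminder of how the first metric behaves, the sketch below computes WER with the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the speech recognizer and text normalization used in the paper are not described on this page, so this only illustrates the metric itself.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("a b c", ""))                   # 1.0
```

A WER close to 100% therefore means that almost none of the reference words survive in the transcript of the processed audio. Because insertions are counted, WER can also exceed 100%.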

Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves a satisfactory reduction in speech intelligibility (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.38). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method can make the algorithm more robust to attempts to recover the clean speech, at a slight cost in audio quality.
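The robustness argument can be illustrated with a toy example. Plain segment reversal is its own inverse when the segment boundaries are known, so applying it a second time restores the input. One possible reading of "random splicing", which is an assumption here and not a description of the paper's algorithm, is to additionally reorder the reversed segments at random. All function names below are hypothetical.

```python
import numpy as np

def split(signal, segment_len):
    """Cut a 1-D signal into consecutive segments (the last may be shorter)."""
    return [signal[i:i + segment_len] for i in range(0, len(signal), segment_len)]

def reverse_only(signal, segment_len):
    """Reverse every segment in place, keeping the segment order."""
    return np.concatenate([seg[::-1] for seg in split(signal, segment_len)])

def reverse_and_shuffle(signal, segment_len, rng):
    """Reverse every segment, then reorder the segments at random.

    The shuffle is an assumed interpretation of random splicing.
    """
    segments = [seg[::-1] for seg in split(signal, segment_len)]
    order = rng.permutation(len(segments))
    return np.concatenate([segments[k] for k in order])

x = np.arange(12)

# Reversal applied twice with the same segmentation restores the input.
twice = reverse_only(reverse_only(x, 4), 4)
print(np.array_equal(twice, x))  # True

# With a random reordering, the samples are all still present, but a second
# pass no longer undoes the first unless the permutation is known.
rng = np.random.default_rng(0)
spliced = reverse_and_shuffle(x, 4, rng)
print(spliced)
```

In this sketch the shuffle adds a secret (the permutation) that an attacker would need in order to undo the transformation, while also introducing more discontinuities, which is consistent with the reported trade-off against audio quality.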

This is the companion page for the paper: TBF

Please cite as: TBF

The experiment code is available in this GitHub repository.

Main Evaluation

Below are the audio excerpts generated by the different methods on two audio files: 61-70968-0035__11_004030.wav and 01_005499.wav.

Method Speech+Siren Speech+Music Speech+Chainsaw Dog
Original audio
Cohen-Adria et al.
Burkhardt et al.
Ours (Conv-TasNet)
Ours

As you can hear, because the dog audio resembles speech, the source separation and VAD stages of our method do not produce perfect results on it. Even so, our method preserves the sound scene better than the other methods.

Ablation Study

Below are the audio excerpts generated by the ablated variants of our method on two audio files: 61-70968-0035__11_004030.wav and 01_005499.wav.

Method Speech+Siren Speech+Music Speech+Chainsaw Dog
Ours
-w/o VAD
-w/o Source Sep.

Robustness Study

Below are the audio excerpts used to evaluate robustness to content recovery, for our method and its "mixframe" variant. The column 61-70968-0035__11_004030.wav (Anonymizedx2) contains the audio obtained by applying the privacy enforcement method a second time to the already privacy-enforced 61-70968-0035__11_004030 file.

Method 61-70968-0035__11_004030.wav 61-70968-0035__11_004030.wav (Anonymizedx2)
Ours
Ours (with mixframe)