Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre
Contact: modan.tailleur@ls2n.fr
We present Diffspec, a method that transforms fast third-octave data into audio waveforms. The algorithm first converts fast third-octave spectra into Mel spectrograms using a super-resolution model, then uses a vocoder to generate the final audio waveform. This approach makes it possible to listen to acoustic scenes captured only as fast third-octave data.
The experiment code is available in this GitHub repository.
Below, we present some audio examples. These are generated by first computing fast third-octave data from the original audio files, then reconstructing the audio waveforms using different methods.
The "Pseudo-inverse + Vocoder" method uses a basic signal processing approach: it estimates the Mel spectrogram from the fast third-octave data using a pseudo-inverse algorithm, then applies a vocoder to synthesize the audio.
This spectrogram super-resolution method does not involve any machine learning and serves as a baseline for comparison.
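To illustrate the idea behind this baseline, here is a minimal, hypothetical sketch: it assumes the fast third-octave data can be modeled as a fixed linear aggregation `T` of Mel bands, and recovers an approximate Mel spectrogram with the Moore-Penrose pseudo-inverse of `T`. The matrix shapes and band counts are illustrative, not those of the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 64 Mel bands aggregated into 29 third-octave bands.
n_mel, n_bands, n_frames = 64, 29, 100

# Toy aggregation matrix T: each third-octave band averages a block of Mel bins,
# so that fto = T @ mel. A real system would use measured filter responses.
T = np.zeros((n_bands, n_mel))
edges = np.linspace(0, n_mel, n_bands + 1).astype(int)
for b in range(n_bands):
    T[b, edges[b]:edges[b + 1]] = 1.0 / (edges[b + 1] - edges[b])

mel_true = rng.random((n_mel, n_frames))  # stand-in for a ground-truth Mel spectrogram
fto = T @ mel_true                        # simulated fast third-octave data

# Pseudo-inverse reconstruction: the minimum-norm Mel spectrogram
# consistent with the observed third-octave bands.
mel_est = np.linalg.pinv(T) @ fto
print(mel_est.shape)
```

Because the aggregation discards within-band detail, the recovered spectrogram is only a coarse estimate; the vocoder then synthesizes audio from this low-resolution Mel spectrogram.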
The "FAST-TO-WAV" method is our proposed approach, which uses a diffusion model to generate higher-quality Mel spectrograms before vocoding.
For reference, we also include the original audio.
| Scene | Pseudo-inverse + Vocoder | FAST-TO-WAV | Original |
|---|---|---|---|
| Birds + Voices + Traffic | | | |
| Crows + Owls | | | |
| Crowd | | | |
| Car | | | |
| Chainsaw | | | |
| Train | | | |
| Street Music | | | |