Abstract

We present Diffspec, a method that transforms fast third-octave (FTO) data into audio waveforms. The algorithm first converts FTO data into Mel spectrograms using a super-resolution model, then uses a vocoder to generate the final audio waveform. This approach makes it possible to listen to acoustic scenes captured as fast third-octave data.

Experiment code is available in this GitHub repository.

Audio examples

Below, we present some audio examples. These are generated by first computing fast third-octave data from the original audio files, then reconstructing the audio waveforms using different methods.
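For readers unfamiliar with the input representation, here is a rough NumPy sketch of how fast third-octave levels might be computed from a waveform. The 125 ms ("fast") frame length, the nominal centre frequencies fc = 1000 · 2^(k/3) with band edges at fc · 2^(±1/6), and all names below are illustrative assumptions, not the exact front end used here.

```python
import numpy as np

def fast_third_octave_levels(x, sr, fmin=20.0, fmax=12500.0, frame_s=0.125):
    """Illustrative sketch: band energies per 125 ms ("fast") frame,
    computed via an FFT per frame and grouping bins into third-octave bands."""
    n_frame = int(sr * frame_s)
    n_fft = int(2 ** np.ceil(np.log2(n_frame)))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # Nominal third-octave centre frequencies: fc = 1000 * 2^(k/3)
    k = np.arange(np.floor(3 * np.log2(fmin / 1000)),
                  np.ceil(3 * np.log2(fmax / 1000)) + 1)
    fc = 1000.0 * 2.0 ** (k / 3.0)
    lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)  # band edges
    n_hops = (len(x) - n_frame) // n_frame + 1
    levels = np.zeros((n_hops, len(fc)))
    for t in range(n_hops):
        frame = x[t * n_frame : t * n_frame + n_frame]
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
        for b in range(len(fc)):
            mask = (freqs >= lo[b]) & (freqs < hi[b])
            levels[t, b] = spec[mask].sum()
    return fc, levels

sr = 16000
x = np.random.default_rng(0).standard_normal(sr)  # 1 s of noise
fc, levels = fast_third_octave_levels(x, sr)
print(levels.shape)  # (8, 29): frames x bands
```

Note how coarse this representation is: roughly 29 band values per 125 ms frame, compared with hundreds of frequency bins and much finer hops in a typical Mel spectrogram, which is why reconstruction requires super-resolution.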

The "Pseudo-inverse + Vocoder" method uses a basic signal processing approach: it estimates the Mel spectrogram from the fast third-octave data using a pseudo-inverse algorithm, then applies a vocoder to synthesize the audio. This spectrogram super-resolution method does not involve any machine learning and serves as a baseline for comparison.
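The pseudo-inverse idea can be sketched as follows. Assuming the third-octave band energies B relate linearly to a fine-resolution power spectrogram S through an aggregation matrix T (B = T S), one can estimate S with the Moore-Penrose pseudo-inverse and then project the estimate onto a Mel filterbank. All matrices and sizes below are synthetic stand-ins, not the actual filterbanks used here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_freq, n_bands, n_mels, n_frames = 513, 29, 64, 10  # illustrative sizes

# T maps fine FFT-bin powers to third-octave band energies (0/1 grouping here)
edges = np.linspace(0, n_freq, n_bands + 1).astype(int)
T = np.zeros((n_bands, n_freq))
for b in range(n_bands):
    T[b, edges[b]:edges[b + 1]] = 1.0

M = np.abs(rng.standard_normal((n_mels, n_freq)))  # stand-in Mel filterbank

S_true = np.abs(rng.standard_normal((n_freq, n_frames)))  # fine spectrogram
B = T @ S_true                                            # observed band data

# Pseudo-inverse estimate of the fine spectrum, then projection to Mel
S_hat = np.clip(np.linalg.pinv(T) @ B, 0.0, None)
mel_hat = M @ S_hat
print(mel_hat.shape)  # (64, 10)
```

With a 0/1 grouping matrix, pinv(T) spreads each band's energy uniformly across its bins, which is why this baseline tends to produce blurry spectrograms.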

The "FAST-TO-WAV" method is our proposed approach, which uses a diffusion model to generate higher-quality spectrograms before vocoding.
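As a purely conceptual sketch of the diffusion idea (not the actual architecture used here), a DDPM-style sampler starts from noise and iteratively denoises a spectrogram while conditioning on the low-resolution band data. The `denoiser` below is a placeholder standing in for a trained neural network; every name and size is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mels, n_frames, n_steps = 64, 10, 50

# Standard DDPM noise schedule
betas = np.linspace(1e-4, 0.02, n_steps)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t, cond):
    # Placeholder epsilon-predictor; a real model would use `cond`
    # (the third-octave data) to guide the estimate.
    return 0.1 * x_t + 0.0 * cond.mean()

cond = rng.random((29, n_frames))            # fake third-octave conditioning
x = rng.standard_normal((n_mels, n_frames))  # start from pure noise
for t in reversed(range(n_steps)):
    eps = denoiser(x, t, cond)
    # DDPM posterior mean update
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
print(x.shape)  # (64, 10)
```

The result of such a sampler is a full-resolution Mel spectrogram, which is then passed to the vocoder exactly as in the baseline.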

For reference, we also include the original audio.

[Audio players] For each scene below, three clips are provided: Pseudo-inverse + Vocoder, FAST-TO-WAV, and the original recording.

Scenes: Birds + Voices + Traffic, Crows + Owls, Crowd, Car, Chainsaw, Train, Street Music.