Abstract

Environmental audio recordings offer rich information for soundscape analysis but are often too lengthy for practical listening. To address this, we introduce an automatic method for generating audio summaries, which we call audio skims, designed to highlight key characteristics of a full-length audio. Our approach balances two objectives: Temporal Consistency (TC), which reflects how closely the summary preserves the temporal structure of the full-length audio, and Source Diversity (SD), which reflects how well the summary represents the diversity of sound sources of the full-length audio. We define a controllable parameter z to modulate this trade-off and propose a two-stage audio skim generation pipeline: Main Events Identification (MEI) via K-means clustering of audio embeddings, and Sequencing of Main Events (SME) using a greedy algorithm optimized for the z-weighted balance between SD and TC. We validate our method on a dataset of 24-hour urban recordings by generating 1-minute audio skims. Through both objective and subjective evaluations, we show that our method produces high-quality audio skims and enables controllable trade-offs between TC and SD. This approach supports practical and customizable access to environmental recordings, with applications in soundscape research, urban planning, and public communication.
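To make the two-stage pipeline concrete, here is a minimal sketch of the Sequencing of Main Events (SME) stage: a greedy selection of segments under a z-weighted balance between Source Diversity and Temporal Consistency. The SD and TC scores below are illustrative proxies, not the paper's exact objective functions, and the MEI clustering stage is omitted.

```python
import numpy as np

def greedy_skim(embeddings, times, n_select, z):
    """Greedily pick `n_select` segments, trading off Source Diversity (SD)
    against Temporal Consistency (TC) via the user parameter z in [0, 1].
    SD and TC below are illustrative proxies, not the paper's definitions."""
    duration = times.max() - times.min()
    selected = [0]  # seed with the first segment
    while len(selected) < n_select:
        best_i, best_score = None, -np.inf
        for i in range(len(embeddings)):
            if i in selected:
                continue
            # SD proxy: distance to the closest already-selected embedding
            # (far from everything chosen so far = more diverse)
            sd = min(np.linalg.norm(embeddings[i] - embeddings[j])
                     for j in selected)
            # TC proxy: closeness to the next evenly spaced timestamp,
            # rewarding preservation of the recording's temporal layout
            ideal_t = times.min() + duration * len(selected) / (n_select - 1)
            tc = -abs(times[i] - ideal_t) / duration
            score = z * sd + (1.0 - z) * tc
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return sorted(selected, key=lambda i: times[i])  # skim keeps time order
```

With z = 0 the score reduces to the TC proxy and the selection approaches even temporal coverage; with z = 1 it reduces to the SD proxy and favors mutually distant embeddings.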

This is the companion page for the paper: TBF

Please cite as: TBF

The experiment code is available in this GitHub repository.

Audio Skims

Below, we present audio skims created from full-length recordings captured in four distinct environments: the city center of a large French city, a neighborhood close to the city center, a residential area, and a peri-urban area. For each environment, we first show the skims produced by our two baselines, random sampling and downsampling; neither allows a controllable trade-off between temporal consistency and source diversity. We then show the skims produced by our method and the expert reference produced by a sound engineer, both of which allow this trade-off to be controlled through the user-defined parameter z. Since time is not linear in the audio skims, a clock in each video displays the elapsed time.
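The two baselines can be sketched in a few lines; this is a minimal illustration under our own assumptions (2-second segments concatenated in temporal order), not the paper's exact implementation.

```python
import numpy as np

def downsample_skim(audio, sr, skim_s=60.0, seg_s=2.0):
    """Evenly spaced segments across the whole recording (the
    'downsampling' baseline): high temporal consistency by design."""
    n_seg = int(skim_s / seg_s)
    seg_len = int(seg_s * sr)
    starts = np.linspace(0, len(audio) - seg_len, n_seg).astype(int)
    return np.concatenate([audio[s:s + seg_len] for s in starts])

def random_skim(audio, sr, skim_s=60.0, seg_s=2.0, seed=0):
    """Randomly placed segments, concatenated in temporal order
    (the 'random sampling' baseline)."""
    rng = np.random.default_rng(seed)
    n_seg = int(skim_s / seg_s)
    seg_len = int(seg_s * sr)
    starts = np.sort(rng.integers(0, len(audio) - seg_len, size=n_seg))
    return np.concatenate([audio[s:s + seg_len] for s in starts])
```

Both produce a skim of the requested length, but neither exposes a parameter controlling the TC/SD trade-off.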

City Center

[Audio players — Random and Downsampling baselines (one skim each); Ours and Expert Reference at High TC, Low SD (z=0.0), Mid TC, Mid SD (z=0.5), and Low TC, High SD (z=1.0)]

City Neighborhood

[Audio players — Random and Downsampling baselines (one skim each); Ours and Expert Reference at High TC, Low SD (z=0.0), Mid TC, Mid SD (z=0.5), and Low TC, High SD (z=1.0)]

Residential Area

[Audio players — Random and Downsampling baselines (one skim each); Ours and Expert Reference at High TC, Low SD (z=0.0), Mid TC, Mid SD (z=0.5), and Low TC, High SD (z=1.0)]

Peri-Urban Area

[Audio players — Random and Downsampling baselines (one skim each); Ours and Expert Reference at High TC, Low SD (z=0.0), Mid TC, Mid SD (z=0.5), and Low TC, High SD (z=1.0)]

As you can hear, because the dog sounds resemble speech, the source separation and voice activity detection (VAD) stages of our method do not produce perfect results here. Nevertheless, our method preserves the sound scene better than the other methods.

Audio-Visual Examples

Audio skims are easier to interpret when accompanied by a representation of the time period, since elapsed time may be non-linear. However, using video solely to indicate elapsed time makes limited use of the medium. We therefore propose enhancing the video with additional information that visually guides the user in exploring the soundscape represented by the audio skim. Specifically, we display the detected sound sources using the top two class outputs of PANNs, an audio classifier similar to the BEATs model presented in the paper, computed over sliding windows. The size of each displayed word reflects the model's confidence in its prediction; only the most relevant classes are shown, while less important ones are filtered out.
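The sliding-window, top-two tagging step can be sketched as follows. Here `classify` is a hypothetical stand-in for a tagger such as PANNs (waveform in, class probabilities out); the window, hop, and confidence threshold values are our assumptions, not the paper's settings.

```python
import numpy as np

def top2_tags(audio, sr, classify, labels, win_s=1.0, hop_s=0.5, min_prob=0.1):
    """For each sliding window, keep the two most probable classes and
    drop low-confidence ones; the returned probability can drive the
    displayed word size. `classify` is a hypothetical tagger interface
    (waveform -> per-class probabilities), standing in for PANNs."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    out = []
    for start in range(0, len(audio) - win + 1, hop):
        probs = classify(audio[start:start + win])
        top = np.argsort(probs)[::-1][:2]  # indices of the two best classes
        tags = [(labels[i], float(probs[i])) for i in top if probs[i] >= min_prob]
        out.append((start / sr, tags))  # (window start time, kept tags)
    return out
```

Each returned entry pairs a window start time with up to two (label, confidence) tags, which is all the video overlay needs.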

[Audio-visual players — Ours at High TC, Low SD (z=0.0), Mid TC, Mid SD (z=0.5), and Low TC, High SD (z=1.0)]