Hearing and sight are the two most important senses through which humans perceive the world. Audio and visual signals usually co-occur and complement each other: for example, when we hear a dog bark or a siren wail, we expect to see a dog or an ambulance nearby.
Recently, audio-visual representation learning has attracted a lot of attention and spawned several interesting tasks, such as Audio-Visual Correspondence (AVC), Audio-Visual Event Localization (AVEL), Audio-Visual Video Parsing (AVVP), and Sound Source Localization (SSL).
In this study, we explore the Audio-Visual Segmentation (AVS) problem, which aims to generate a pixel-level segmentation map of the object(s) producing sound at the time of the image frame. An illustration of the AVS task is shown in Figure 1. AVS is a fine-grained audio-visual learning problem that requires learning pixel-level audio-visual correspondence.
To facilitate this research, we propose the AVSBench dataset (details are introduced in the next section). With AVSBench, we study two settings of AVS: 1) semi-supervised single sound source segmentation (S4); 2) fully-supervised multiple sound source segmentation (MS3).
AVSBench is an open pixel-level audio-visual segmentation dataset that provides ground-truth labels for sounding objects. We divide AVSBench into two subsets, depending on the number of sounding objects in the video (single-source or multi-source). [Note] In practice, the Multi-source subset may contain videos in which multiple objects are visible in the frame but not all of them are emitting sound. We consider such videos helpful because the model must still distinguish which object is producing sound and segment it from among the potential sound sources.
AVSBench statistics. The videos are split into train/valid/test. The asterisk (*) indicates that, for Single-source training, one annotation per video is provided; all other splits contain 5 annotations per video (since there are 5 clips per video, this corresponds to 1 annotation per clip). Together, these yield the total number of annotated frames.
For the Single-source subset, the detailed video categories and the number of videos in each category are displayed in Figure 2.
Existing audio-visual dataset statistics. Each benchmark is listed with its number of videos and annotated frames. The final column indicates whether the frames are labeled with categories, bounding boxes, or pixel-level masks.
Some video samples in the AVSBench dataset. These examples give a better sense of the dataset and the AVS task.
The CSV file containing the video IDs for downloading the raw YouTube videos, as well as the annotated ground-truth segmentation maps, can be downloaded from the following links:
The AVSBench dataset on this page is copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
To solve the AVS problem, we propose an end-to-end model that adopts a standard encoder-decoder architecture augmented with a new temporal pixel-wise audio-visual interaction (TPAVI) module, which better injects audio semantics to guide visual segmentation. We also propose a loss function that exploits the correlation of the audio and visual signals and further enhances segmentation performance. More details can be found in our paper.
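For intuition only, a loss of this flavor can be sketched as a per-pixel BCE term on the predicted mask plus an audio-visual correlation regularizer. The snippet below is a minimal PyTorch sketch under our own assumptions: the tensor shapes, the KL-based correlation term, and the weight `lam` are illustrative and are not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def avs_loss(pred_mask, gt_mask, visual_feat, audio_feat, lam=0.5):
    # pred_mask: (N, 1, H, W) logits; gt_mask: (N, 1, H, W) float in {0, 1}
    # visual_feat: (N, C, h, w); audio_feat: (N, C); lam is an illustrative weight
    # Per-pixel binary cross-entropy on the predicted sounding-object mask.
    bce = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)

    # Hypothetical correlation term: average the visual features inside the predicted
    # sounding region and push their distribution toward the audio feature distribution.
    prob = torch.sigmoid(pred_mask)
    prob = F.interpolate(prob, size=visual_feat.shape[-2:], mode="bilinear", align_corners=False)
    masked_visual = (visual_feat * prob).flatten(2).mean(-1)   # (N, C)
    kl = F.kl_div(F.log_softmax(masked_visual, dim=1),
                  F.softmax(audio_feat, dim=1), reduction="batchmean")
    return bce + lam * kl
```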
An overview of the proposed framework is illustrated in the figure below. It follows a hierarchical encoder-decoder pipeline. The encoder takes the video frames and the entire audio clip as inputs and outputs visual and audio features, denoted Fi and A, respectively. The visual feature map Fi at each stage is further sent to an ASPP module and then to our TPAVI module. ASPP provides different receptive fields for recognizing visual objects, while TPAVI models the temporal pixel-wise audio-visual interaction. The decoder progressively enlarges the fused feature maps over four stages and finally generates the output mask M of the sounding objects.
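To make the fusion step concrete, a temporal pixel-wise audio-visual interaction in the spirit of TPAVI can be viewed as cross-modal attention in which every visual pixel across the T frames attends to the T audio frame features. The module below is only our hedged approximation: the class name, shapes, and single-head attention are assumptions, not the official TPAVI implementation.

```python
import torch
import torch.nn as nn

class PixelWiseAVFusion(nn.Module):
    """Sketch: visual pixels (over all frames) attend to per-frame audio features."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv3d(dim, dim, kernel_size=1)   # queries from visual pixels
        self.k = nn.Linear(dim, dim)                  # keys from audio frames
        self.v = nn.Linear(dim, dim)                  # values from audio frames
        self.out = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, vis, aud):
        # vis: (B, C, T, H, W) visual features; aud: (B, T, C) audio features
        B, C, T, H, W = vis.shape
        q = self.q(vis).flatten(2).transpose(1, 2)                        # (B, T*H*W, C)
        k, v = self.k(aud), self.v(aud)                                   # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * C ** -0.5, dim=-1)   # (B, T*H*W, T)
        fused = (attn @ v).transpose(1, 2).reshape(B, C, T, H, W)         # back to a feature map
        return vis + self.out(fused)                                      # residual connection
```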
We first compare the proposed AVS baseline with several methods from related tasks, namely Sound Source Localization (SSL), Video Object Segmentation (VOS), and Salient Object Detection (SOD). The quantitative results are shown in the table below. Our AVS method surpasses all of these methods. Please refer to our paper for more analysis and qualitative results.
We then conduct ablation studies to explore the impact of the audio signal and the TPAVI module. The results are shown below. The middle row corresponds to directly adding the audio features to the visual features, which already improves performance under the MS3 setting. The TPAVI module further improves the results across all settings and backbones.
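For reference, the "directly adding" baseline can be sketched by broadcasting each per-frame audio feature over the spatial dimensions and summing it with the visual feature map. This is our reading of the ablation's simplest fusion variant, not the exact implementation; the TPAVI module replaces this step with attention-based interaction.

```python
# Sketch of the simplest fusion baseline (our assumption of "directly adding"):
# broadcast the per-frame audio feature over H and W and add it to the visual map.
def add_fusion(vis, aud):
    # vis: (B, C, T, H, W) visual features; aud: (B, T, C) audio features
    return vis + aud.permute(0, 2, 1)[..., None, None]
```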
We also display some qualitative examples under the semi-supervised S4 and fully-supervised MS3 settings. As shown in Figures 3 and 4, these results indicate that the audio signal provides positive guidance for segmenting the correct sounding object and outlining its shape more accurately.
In our paper, we provide more experimental results, such as an additional comparison with a two-stage baseline, visualization of the audio-visual attention maps, and segmentation of unseen objects. Please refer to the paper for more details.
Some video segmentation demos. The segmentation maps are generated by the PVT-v2 based AVS model.
If you find our work useful in your research, please cite our ECCV 2022 paper:
@inproceedings{zhou2022avs,
title = {Audio-Visual Segmentation},
author = {Zhou, Jinxing and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
booktitle = {European Conference on Computer Vision},
year = {2022}
}