Audio-Visual Segmentation

[ECCV 2022]

    *Jinxing Zhou1,2, *Jianyuan Wang3, Jiayi Zhang2,4, Weixuan Sun2,3,
    Jing Zhang3, Stan Birchfield5, Dan Guo1, Lingpeng Kong6,7,
    📧Meng Wang1, 📧Yiran Zhong2,7
      1Hefei University of Technology, 2SenseTime Research,
      3Australian National University, 4Beihang University, 5NVIDIA,
      6The University of Hong Kong, 7Shanghai Artificial Intelligence Laboratory
      (*Equal contribution, 📧Corresponding author)

[Paper]  [Dataset]  [Code]


Update

  • 18 Oct 2022: We have completed the collection and annotation of AVSBench-V2. It contains ~7k multi-source videos covering 70 categories, and the ground truths are provided as multi-label semantic maps (the labels of V1 have also been updated). We will release it as soon as possible.
  • 13 Jul 2022: We are preparing AVSBench-V2, which is much larger than AVSBench and pays more attention to multi-source situations.
  • 11 Jul 2022: The dataset has been uploaded to Google Drive and Baidu Netdisk (password: shsr); you are welcome to download and use it!
  • 10 Jul 2022: The AVSBench dataset has been released, please see Download for details.
  • 10 Jul 2022: Code has been released here!
  • 08 Jul 2022: Our paper has been accepted to ECCV 2022. The camera-ready version and code will be released soon!

Audio-Visual Segmentation task

Hearing and sight are the two most important senses for humans to perceive the world. Audio and visual signals usually coexist and complement each other. For example, when we hear a dog bark or a siren wail, we expect to see a dog or an ambulance nearby. Recently, audio-visual representation learning has attracted a lot of attention and spawned several interesting tasks, such as Audio-Visual Correspondence (AVC), Audio-Visual Event Localization (AVEL), Audio-Visual Video Parsing (AVVP), and Sound Source Localization (SSL).
In this study, we explore the Audio-Visual Segmentation (AVS) problem, which aims to generate a pixel-level segmentation map of the object(s) producing sound at the time of the image frame. An illustration of the AVS task is shown in Figure 1. It is a fine-grained audio-visual learning problem that requires the model to learn pixel-level audio-visual correspondence. To facilitate this research, we propose the AVSBench dataset (details are introduced in the next section). With AVSBench, we study two settings of AVS: 1) semi-supervised single sound source segmentation (S4); 2) fully-supervised multiple sound source segmentation (MS3).
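
To make the setup concrete, the following minimal sketch (in PyTorch) illustrates the inputs a model receives and the masks it is expected to predict under the two settings. The tensor shapes, spectrogram size, and variable names are illustrative assumptions, not code from our repository.

    import torch

    # Illustrative input/output format for the AVS task. The number of
    # clips per video follows the AVSBench setup; everything else here
    # (resolution, spectrogram size, names) is a placeholder.
    T, H, W = 5, 224, 224                      # 5 one-second clips per video

    frames = torch.randn(T, 3, H, W)           # one RGB frame per clip
    audio_logmel = torch.randn(T, 1, 96, 64)   # per-clip log-mel spectrogram (illustrative size)

    # The model should output one mask per frame, marking the pixels that
    # belong to the sounding object(s).
    pred_masks = torch.sigmoid(torch.randn(T, 1, H, W))

    # Supervision differs between the two settings:
    #  - S4 (semi-supervised, single source): during training only the
    #    first frame of each video carries a ground-truth mask.
    #  - MS3 (fully supervised, multiple sources): all T frames are labeled.
    gt_mask_s4   = torch.randint(0, 2, (1, 1, H, W)).float()   # first frame only
    gt_masks_ms3 = torch.randint(0, 2, (T, 1, H, W)).float()   # every frame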

Figure 1. Comparison of the proposed AVS task with the SSL task. Sound source localization (SSL) estimates a rough, patch-level location of the sounding objects in the visual frame. In contrast, AVS estimates pixel-wise segmentation masks for all sounding objects, regardless of how many are visible. Left: video of a dog barking. Right: video with two sound sources (a man and a piano).

AVSBench Dataset

statistics and samples of our dataset and annotations

AVSBench is an open pixel-level audio-visual segmentation dataset that provides ground truth labels for sounding objects. We divide AVSBench into two subsets, depending on the number of sounding objects in the video (single- or multi-source). [Note] In practice, the Multi-source subset may contain some videos in which multiple objects are visible in the frame but not all of them are emitting sound. We consider these videos still helpful, because the model is then required to distinguish which object is producing sound and to segment the correct one among the multiple potential sound sources.

AVSBench statistics. The videos are split into train/valid/test. The asterisk (*) indicates that, for Single-source training, one annotation per video is provided, while all other splits contain 5 annotations per video (since there are 5 clips per video, this amounts to one annotation per clip). Together, these yield the total number of annotated frames.

For the Single-source subset, the detailed video categories and video numbers of each category are displayed in Figure 2.

Figure 2. Statistics of the whole Single-source subset of AVSBench. The text labels denote the category names; for example, the 'helicopter' category contains 311 video samples.

Existing audio-visual dataset statistics. Each benchmark is shown with the number of videos and the number of annotated frames. The final column indicates whether the frames are labeled with categories, bounding boxes, or pixel-level masks.


Some video samples from the AVSBench dataset. These examples give a better sense of the dataset and the AVS task.

Download

dataset publicly available for research purposes

Data and Download


The csv file that contains the video ids for downloading the raw YouTube videos, as well as the annotated ground truth segmentation maps, can be downloaded from the following links:

For the processed videos and audio:
    - Please send an email to zhoujxhfut@gmail.com with your name and institution.
    - We also provide scripts to process the raw video data and extract the frame and mel-spectrogram features in our GitHub repository; a minimal sketch of this preprocessing is shown below.
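
For reference, here is a minimal sketch of this kind of preprocessing: extracting one frame per second with ffmpeg and computing a log-mel spectrogram with librosa. The file names, sampling rate, and mel parameters are placeholders; the repository scripts define the exact settings used for AVSBench.

    import os
    import subprocess
    import numpy as np
    import librosa

    video_path = "sample_video.mp4"            # placeholder path
    os.makedirs("frames", exist_ok=True)

    # Extract one frame per second with ffmpeg (assumes ffmpeg is installed).
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", "frames/frame_%d.png"],
        check=True,
    )

    # Extract a mono 16 kHz audio track, then compute a log-mel spectrogram.
    # The sampling rate and mel parameters here are illustrative choices.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
        check=True,
    )
    waveform, sr = librosa.load("audio.wav", sr=16000)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    np.save("sample_video_logmel.npy", log_mel)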



Copyright

The AVSBench dataset on this page is copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

A Simple Baseline for AVS Task

the proposed audio-visual segmentation method, experimental results and simple analysis

To solve the AVS problem, we propose an end-to-end model that uses a standard encoder-decoder architecture together with a new temporal pixel-wise audio-visual interaction (TPAVI) module, which better injects audio semantics to guide visual segmentation. We also propose a loss function that exploits the correlation between audio and visual signals, further enhancing segmentation performance. More details can be found in our paper.
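
As a rough illustration of how such an objective could be assembled, the sketch below combines a standard per-pixel binary cross-entropy term with a simple regularizer that encourages the mask-pooled visual feature of each clip to match its audio feature. The exact form of the regularizer, the feature shapes, and the weighting factor here are illustrative assumptions; the actual loss is defined in the paper.

    import torch
    import torch.nn.functional as F

    def avs_loss_sketch(pred_mask, gt_mask, visual_feat, audio_feat, lam=0.5):
        """Illustrative AVS training objective (not the exact released loss).

        pred_mask:   (T, 1, H, W) predicted mask logits
        gt_mask:     (T, 1, H, W) binary ground-truth masks
        visual_feat: (T, C, H, W) visual features from one decoder stage
        audio_feat:  (T, C)       per-clip audio features
        """
        # Standard per-pixel supervision on the predicted masks.
        bce = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)

        # Mask-weighted average pooling of the visual features, so the pooled
        # vector mostly describes the (predicted) sounding region.
        mask = torch.sigmoid(pred_mask)                                         # (T, 1, H, W)
        pooled = (visual_feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)  # (T, C)

        # Simple correlation regularizer: match the distributions of the
        # pooled visual feature and the audio feature (illustrative form).
        kl = F.kl_div(F.log_softmax(pooled, dim=1),
                      F.softmax(audio_feat, dim=1),
                      reduction="batchmean")

        return bce + lam * kl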

Audio-Visual Segmentation Framework

An overview of the proposed framework is illustrated in the figure below. It follows a hierarchical encoder-decoder pipeline. The encoder takes the video frames and the entire audio clip as inputs and outputs visual and audio features, denoted as Fi and A, respectively. The visual feature map Fi at each stage is further sent to an ASPP module and then to our TPAVI module. ASPP provides different receptive fields for recognizing visual objects, while TPAVI focuses on temporal pixel-wise audio-visual interaction. The decoder progressively enlarges the fused feature maps over four stages and finally generates the output mask M for the sounding objects.
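
The snippet below is a simplified sketch (in PyTorch) of the kind of temporal pixel-wise audio-visual attention that TPAVI performs: each visual pixel in every frame attends to the audio features of all T clips, and the aggregated audio context is added back to the visual feature map as a residual. The class name, projection dimensions, and normalization details are illustrative simplifications rather than the released implementation.

    import torch
    import torch.nn as nn

    class TPAVISketch(nn.Module):
        """Simplified temporal pixel-wise audio-visual interaction (illustrative)."""

        def __init__(self, vis_dim, aud_dim, hid_dim=256):
            super().__init__()
            self.q = nn.Conv2d(vis_dim, hid_dim, kernel_size=1)  # visual queries
            self.k = nn.Linear(aud_dim, hid_dim)                 # audio keys
            self.v = nn.Linear(aud_dim, vis_dim)                 # audio values
            self.scale = hid_dim ** -0.5

        def forward(self, vis_feat, aud_feat):
            # vis_feat: (T, C, H, W) visual features of one stage
            # aud_feat: (T, D)       per-clip audio features
            T, C, H, W = vis_feat.shape
            q = self.q(vis_feat).flatten(2).permute(0, 2, 1)        # (T, H*W, hid)
            k = self.k(aud_feat)                                    # (T, hid)
            v = self.v(aud_feat)                                    # (T, C)

            # Each pixel of every frame attends to the audio of all T clips.
            attn = torch.einsum("nph,th->npt", q, k) * self.scale   # (T, H*W, T)
            attn = attn.softmax(dim=-1)
            ctx = torch.einsum("npt,tc->npc", attn, v)              # (T, H*W, C)
            ctx = ctx.permute(0, 2, 1).reshape(T, C, H, W)

            # Residual connection: audio-aware context added to the visual map.
            return vis_feat + ctx

In the full model, this kind of interaction is applied to the ASPP output of each of the four encoder stages before decoding.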


Experiments

We first compare the proposed AVS baseline with several methods from related tasks, namely Sound Source Localization (SSL), Video Object Segmentation (VOS), and Salient Object Detection (SOD). The quantitative results are shown in the table below. Our AVS method surpasses all of these methods. Please refer to our paper for further analysis and qualitative results.

We then conduct ablation studies to explore the impact of the audio signal and the TPAVI module. The results are shown below. The middle row corresponds to directly adding the audio and visual features, which already improves performance under the MS3 setting. The TPAVI module further enhances the results across all settings and backbones.

We also display some qualitative examples under the semi-supervised S4 and fully-supervised MS3 settings. As shown in Figures 3 and 4, these results indicate that the audio signal provides positive support for segmenting the correct sounding object and for outlining its shape more accurately.

Figure 3. Qualitative results under the semi-supervised S4 setting. Predictions are generated by the ResNet50-based AVS model. Introducing the audio signal (TPAVI) brings two benefits: 1) learning the shape of the sounding object, e.g., the guitar in the video (Left); 2) segmenting according to the correct sound source, e.g., the gun rather than the man (Right).
Figure 4. Qualitative results under the fully-supervised MS3 setting. The predictions are obtained by the PVT-v2 based AVS model. Note that AVS with TPAVI leverages the audio information to 1) filter out distracting visual pixels that do not correspond to the audio, i.e., the human hands (Left); and 2) more accurately segment the sound source that matches the audio, i.e., the singing person (Right).

In our paper, we provide more experimental results, such as an additional comparison with a two-stage baseline, visualizations of the audio-visual attention maps, and segmentation of unseen objects. Please refer to the paper for more details.

Some video segmentation demos. The segmentation maps are generated by the PVT-v2 based AVS model.

Citation

If you find our work useful in your research, please cite our ECCV 2022 paper:

            
    @inproceedings{zhou2022avs,
      title     = {Audio-Visual Segmentation},
      author    = {Zhou, Jinxing and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
      booktitle = {European Conference on Computer Vision},
      year      = {2022}
    }

Acknowledgement

  • Thanks to all the co-authors for the helpful discussion and suggestions.
  • Thanks to all the anonymous reviewers for their valuable suggestions and feedback.
  • Thanks to SenseTime Research for providing access to the GPUs used for the experiments.
  • Thanks to Guangyao for sharing this website template.