A Simple and Powerful Global Optimization for Unsupervised Video Object Segmentation

WACV 2023

Georgy Ponimatkin¹, Nermin Samet¹, Yang Xiao¹, Yuming Du¹,
Renaud Marlet¹,², Vincent Lepetit¹

¹LIGM, École des Ponts, Univ Gustave Eiffel, CNRS; ²Valeo.ai


(Left) Segmentation obtained by our spectral clustering formulation on the self-supervised image features from DINO in a single frame. (Center) Segmentation obtained using the same self-supervised image features and optical flow from ARFlow, but still from a single frame. (Right) Final segmentation obtained with our complete method, after optimization on the full frame sequence, using the same features and optical flow estimated by ARFlow.

Abstract

We propose a simple yet powerful approach for unsupervised object segmentation in videos. We introduce an objective function whose minimum represents the mask of the main salient object over the input sequence. It relies only on independent image features and optical flows, which can be obtained using off-the-shelf self-supervised methods. It scales with the length of the sequence with no need for superpixels or sparsification, and it generalizes to different datasets without any specific training. This objective function can be derived from a form of spectral clustering applied to the entire video. Our method achieves performance on par with the state of the art on standard benchmarks (DAVIS2016, SegTrack-v2, FBMS59), while being conceptually and practically much simpler.

Approach Overview

SSL-VOS overview. Given a video sequence, starting from first estimates for the object masks obtained by spectral clustering on each image independently, we optimize the masks so that they remain close to the first estimates while being consistent with the optical flow. The objective function we optimize to retrieve the masks can be derived from spectral clustering applied to the video sequence. Our method can rely on self-supervised visual features only.
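
To make the two stages above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: `spectral_mask` performs a generic two-way spectral cut on per-patch features of one frame (any descriptor array stands in for DINO features), and `refine_masks` minimizes a simple quadratic stand-in objective that keeps masks close to their initial estimates while penalizing disagreement along optical-flow correspondences. The function names, the cosine affinity, the "smaller segment is the object" heuristic, the weight `lam`, and the integer correspondence array `flow_idx` are all hypothetical choices made for this example; the actual objective in the paper is derived from spectral clustering on the whole video and may differ.

```python
import numpy as np

def spectral_mask(features, height, width):
    """Two-way spectral cut of one frame. features: (height*width, D) array."""
    # Cosine-similarity affinity, clipped to be non-negative (assumed choice).
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    W = np.clip(f @ f.T, 0.0, None)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1) + 1e-8
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    # Eigenvector of the second-smallest eigenvalue, mapped back by D^{-1/2}.
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    mask = fiedler > 0
    # Assumed heuristic: the smaller segment is the salient object.
    if mask.sum() > mask.size / 2:
        mask = ~mask
    return mask.reshape(height, width)

def refine_masks(init, flow_idx, lam=2.0, step=0.05, iters=500):
    """Gradient descent on a stand-in objective:
        sum_t ||m_t - init_t||^2 + lam * sum_t ||m_{t+1}[flow_idx[t]] - m_t||^2
    init: (T, N) soft masks. flow_idx: (T-1, N) ints, where pixel i of
    frame t corresponds to pixel flow_idx[t, i] of frame t+1.
    """
    init = init.astype(float)
    m = init.copy()
    for _ in range(iters):
        grad = 2.0 * (m - init)  # pull toward the per-frame estimates
        for t in range(m.shape[0] - 1):
            diff = m[t + 1, flow_idx[t]] - m[t]  # disagreement along the flow
            grad[t] -= 2.0 * lam * diff
            np.add.at(grad[t + 1], flow_idx[t], 2.0 * lam * diff)
        m -= step * grad
    return m
```

In this sketch, one would call `spectral_mask` on each frame, flatten and stack the results into a `(T, N)` array, pass it to `refine_masks` together with correspondences computed from the optical flow, and threshold the output at 0.5. The fixed step size is only suitable for small `lam` and near one-to-one correspondences; a real implementation would more likely solve the resulting linear system directly.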

BibTeX

@inproceedings{ponimatkin2023sslvos,
      title={{A Simple and Powerful Global Optimization for Unsupervised Video Object Segmentation}},
      author={G. {Ponimatkin} and N. {Samet} and Y. {Xiao} and Y. {Du} and R. {Marlet} and V. {Lepetit}},
      booktitle={Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
      year={2023}
}

Acknowledgements

This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012896 made by GENCI and supported in part by the Chistera IPalm project.