
SSL-VOS overview. Given a video sequence, we first estimate the object masks by applying spectral clustering to each frame independently. We then optimize the masks so that they remain close to these initial estimates while being consistent with the optical flow. The objective function we optimize to recover the masks can be derived from spectral clustering applied to the whole video sequence. Our method can rely on self-supervised visual features only.
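
To make the refinement step concrete, the following is a minimal sketch and not the exact SSL-VOS objective. It assumes soft masks x_t with N pixels per frame, hypothetical warping matrices P_t built from the optical flow so that P_t x_t predicts x_{t+1}, and a hypothetical trade-off weight `lam`. Under these assumptions, the energy sum_t ||x_t - x0_t||^2 + lam * sum_t ||x_{t+1} - P_t x_t||^2 is quadratic, so its minimizer can be obtained from a single linear solve. The function name `refine_masks` and the toy data are illustrative only.

```python
import numpy as np

def refine_masks(x0, warps, lam=1.0):
    """Refine per-frame masks under a flow-consistency penalty (illustrative sketch).

    x0    : (T, N) array of initial soft masks, one row per frame.
    warps : list of T-1 arrays of shape (N, N); warps[t] @ x[t] predicts x[t+1].
            These stand in for optical-flow warping and are an assumption.
    lam   : weight of the flow-consistency term (hypothetical parameter).
    """
    T, N = x0.shape
    # D stacks the residuals x[t+1] - P_t x[t] for all consecutive frame pairs.
    D = np.zeros(((T - 1) * N, T * N))
    for t, P in enumerate(warps):
        rows = slice(t * N, (t + 1) * N)
        D[rows, t * N:(t + 1) * N] = -P
        D[rows, (t + 1) * N:(t + 2) * N] = np.eye(N)
    # Setting the gradient of ||x - x0||^2 + lam * ||D x||^2 to zero gives
    # (I + lam * D^T D) x = x0.
    A = np.eye(T * N) + lam * D.T @ D
    return np.linalg.solve(A, x0.reshape(-1)).reshape(T, N)

# Toy example: static scene (identity warps), with one pixel of the
# middle frame wrongly labeled as background in the initial estimate.
x0 = np.array([[1.0, 1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0, 0.0],
               [1.0, 1.0, 0.0, 0.0]])
refined = refine_masks(x0, [np.eye(4), np.eye(4)], lam=5.0)
print((refined > 0.5).astype(int))  # every row thresholds to [1 1 0 0]
```

In this toy case the flow term pulls the inconsistent pixel back toward its temporal neighbors, while `lam=0` would return the initial estimates unchanged. A dense solve is used only for readability; for realistic frame sizes a sparse or iterative solver would be needed.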