Authored By: Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, Marcelo Ang

Self-supervised Motion Learning from Static Images

Jan 26, 2024

Abstract

Motion is an integral part of video. It appears as the movement of pixels, and actions are essentially patterns of inconsistent motion between the foreground and the background. To distinguish these actions well, especially those with complicated spatio-temporal interactions, it is crucial to locate the areas of significant motion. Unfortunately, most motion information in existing videos is difficult to annotate, and training a model with precise motion representations requires a significant amount of human labour. This article proposes to address the problem through self-supervised learning. Specifically, the aim is to learn Motion from Static Images (MoSI). MoSI encourages the model to encode motion information by classifying pseudo motions.

Introduction

Understanding motion patterns is a key challenge in video understanding problems such as action recognition, action localization, and action detection. An appropriate way to encode motion can significantly boost performance in these tasks. Early works represent motion using hand-crafted features based on dense trajectories or optical flow. Later works apply deep neural networks, but 3D convolutional models require a large amount of manually labeled video to generalize well. Self-supervised learning has emerged as a powerful technique for training models without labeled data, in both the image and the video paradigms.

Pseudo Motions

Pseudo motions are created by cropping a sequence of images from a single source image, with the crop window shifting from one frame to the next. The resulting pseudo motion sequence is then fed to the video model, which classifies the motion. To generate samples with different speeds, the distance moved from the start to the end of the pseudo motion sequence is specified for each sample. The model is trained with a cross-entropy loss.
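The cropping procedure can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the pixel-based distances, the linear interpolation of the crop window, and the choice to restrict motion to one axis at a time are all assumptions made for illustration.

```python
import random


def make_label_pool(max_step):
    """Enumerate hypothetical motion classes as (dx, dy) total displacements.

    Assumption: motion runs along one axis at a time, so the pool holds
    2 * (2 * max_step + 1) - 1 classes, including the static class (0, 0).
    """
    steps = range(-max_step, max_step + 1)
    pool = [(dx, 0) for dx in steps]
    pool += [(0, dy) for dy in steps if dy != 0]  # (0, 0) is already present
    return pool


def crop_positions(start, total_shift, num_frames):
    """Move the crop window's top-left corner linearly from start to start + total_shift."""
    x0, y0 = start
    dx, dy = total_shift
    last = max(num_frames - 1, 1)
    return [
        (x0 + round(dx * t / last), y0 + round(dy * t / last))
        for t in range(num_frames)
    ]


def pseudo_motion(image, crop_size, total_shift, num_frames, rng=random):
    """Crop num_frames square windows from one image (a list of pixel rows)."""
    height, width = len(image), len(image[0])
    dx, dy = total_shift
    # Restrict the start position so that every crop stays inside the image.
    x_min, x_max = max(0, -dx), width - crop_size - max(0, dx)
    y_min, y_max = max(0, -dy), height - crop_size - max(0, dy)
    if x_max < x_min or y_max < y_min:
        raise ValueError("image too small for this crop size and shift")
    start = (rng.randint(x_min, x_max), rng.randint(y_min, y_max))
    frames = []
    for x, y in crop_positions(start, total_shift, num_frames):
        frames.append([row[x:x + crop_size] for row in image[y:y + crop_size]])
    return frames
```

In this sketch, the classification target for a sample would be the index of its `total_shift` in the label pool, which is what a cross-entropy loss would be computed against.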

Static Masks

By correctly classifying pseudo motions with different directions and magnitudes, the model learns to recognize different motion patterns. A second crucial component is the static mask. Static masks divide the spatial locations into two groups: a masked area and an unmasked area. The masked area is regarded as the background, and the motions within it are removed. Meanwhile, the original contents (i.e., the motions) are retained in the unmasked area of the image sequence.
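A static mask of this kind might be applied as in the sketch below. The text does not say how motion is removed from the masked area, so this example assumes one simple option: every frame's masked pixels are overwritten with those of a single reference frame. The function name and parameters are hypothetical.

```python
def apply_static_mask(frames, top, left, size, ref_index=0):
    """Keep motion only inside a square unmasked window.

    frames is a list of frames, each a list of pixel rows.
    Assumption: the masked (background) area is made static by copying
    its pixels from frames[ref_index] into every frame.
    """
    reference = frames[ref_index]
    masked_frames = []
    for frame in frames:
        new_frame = []
        for y, row in enumerate(frame):
            new_row = []
            for x, value in enumerate(row):
                inside = top <= y < top + size and left <= x < left + size
                # Unmasked area keeps the moving content; the rest is frozen.
                new_row.append(value if inside else reference[y][x])
            new_frame.append(new_row)
        masked_frames.append(new_frame)
    return masked_frames
```

After masking, only the unmasked window changes from frame to frame, which gives the model a foreground that moves against a still background.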

Data Preparations & Augmentations

The source image is resized, and a square area is randomly cropped from it. Both the location and the side length of the unmasked area in the static mask are randomly sampled as well. With this randomization, the model is expected to learn not only global motion patterns, but also inconsistent motions between the foreground and the background.
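The random sampling of the unmasked area could look like the following sketch. The uniform sampling and the `min_side` and `max_side` bounds are assumptions made for illustration; the text does not state the actual ranges.

```python
import random


def sample_unmasked_window(frame_size, min_side, max_side, rng=random):
    """Sample (top, left, side) for a square unmasked window inside a square frame.

    Assumption: the side length and the position are drawn uniformly at random.
    """
    if not 0 < min_side <= max_side <= frame_size:
        raise ValueError("need 0 < min_side <= max_side <= frame_size")
    side = rng.randint(min_side, max_side)
    # Keep the whole window inside the frame.
    top = rng.randint(0, frame_size - side)
    left = rng.randint(0, frame_size - side)
    return top, left, side
```

Drawing a fresh window for each training sample keeps the model from relying on a fixed foreground position or size.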

Experiments and Results

The proposed MoSI reached new state-of-the-art results for learning video representations using the RGB modality. Grad-CAM visualizations indicated that models trained with MoSI were able to locate significant motion areas, and even to transfer this knowledge to real videos with more complex spatio-temporal relations, where they locate areas with large motion across frames.

Conclusion

In conclusion, self-supervised learning makes it possible to train models to understand motion patterns from static images, without motion annotations. MoSI not only identifies global motion patterns but also locates significant motion areas, which makes it well suited to tasks that require understanding complex scenes and motions, such as action recognition.