Authored By: Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui

Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

Jan 24, 2024

Introduction

Online video platforms such as YouTube, Vimeo, and Facebook Watch have driven an immense accumulation of video data. As of 2019, more than 500 hours of video were uploaded to YouTube every minute, so efficiently browsing or retrieving essential information from this enormous volume becomes a daunting task. To address this difficulty, numerous video summarization techniques have been developed in recent years.

Video summarization aims to automatically generate a concise version of a video that can expedite large-scale video retrieval and browsing. However, most existing methods summarize each video in isolation, ignoring the correlations among similar videos. These correlations are valuable for video understanding and, in turn, for video summarization.

Video Joint Modelling for Co-summarization

We propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization to address this limitation. VJMHT incorporates the semantic dependencies across videos, so that cross-video high-level patterns are explicitly modelled and learned for the summarization of individual videos. VJMHT consists of two layers of Transformer: the first layer extracts the semantic representation of individual shots of similar videos, while the second layer performs shot-level video joint modelling to aggregate cross-video semantic information. In addition, Transformer-based video representation reconstruction is performed to maximize the high-level similarity between the summary and the original video.
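To make the two-layer structure concrete, here is a minimal NumPy sketch of the hierarchical idea: attention over frames within each shot yields shot embeddings (first layer), and attention over the pooled shots of several similar videos aggregates cross-video context (second layer). The function names, the parameter-free attention, and the mean-pooling are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

def attention(x):
    """Single-head scaled dot-product self-attention over rows of x.
    No learned projections -- a toy stand-in for a Transformer layer."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def shot_embeddings(video):
    """First layer: attend over the frames of each shot, then mean-pool
    to one embedding per shot. `video` is a list of (frames, d) arrays."""
    return np.stack([attention(shot).mean(axis=0) for shot in video])

def joint_model(videos):
    """Second layer: concatenate shot embeddings from all similar videos
    and attend across them, so each shot aggregates cross-video context."""
    shots = np.concatenate([shot_embeddings(v) for v in videos])
    return attention(shots)

# Two hypothetical "similar" videos: 3 and 2 shots of 4 frames, 8-dim features.
rng = np.random.default_rng(0)
videos = [[rng.standard_normal((4, 8)) for _ in range(3)],
          [rng.standard_normal((4, 8)) for _ in range(2)]]
out = joint_model(videos)
print(out.shape)  # (5, 8): one cross-video-aware embedding per shot
```

In the real model each "attention" would be a full Transformer block with learned weights, and the cross-video-aware shot embeddings would feed the summary-selection head.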

Unsupervised Video Summarization

Unsupervised methods for video summarization rely on manually designed criteria, such as the representativeness of the summary with respect to the original video and the diversity of the frames/shots in the summary. Classic approaches apply machine-learning techniques such as clustering or dictionary learning, while deep-learning-based unsupervised methods have recently been attracting more attention.
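The two criteria above can be written down directly. The sketch below scores a candidate summary by representativeness (how close every shot is to its nearest selected shot) and diversity (how far apart the selected shots are from each other); the exact distance-based formulations are a common choice, assumed here for illustration.

```python
import numpy as np

def representativeness(features, selected):
    """Negative mean distance from every shot to its nearest selected shot.
    Higher (closer to 0) means the summary covers the video better."""
    d = np.linalg.norm(features[:, None] - features[selected][None], axis=-1)
    return -d.min(axis=1).mean()

def diversity(features, selected):
    """Mean pairwise distance among selected shots: higher means a more
    varied, less redundant summary."""
    s = features[selected]
    d = np.linalg.norm(s[:, None] - s[None], axis=-1)
    n = len(selected)
    return d.sum() / (n * (n - 1))

rng = np.random.default_rng(1)
feats = rng.standard_normal((10, 16))      # 10 shots, 16-dim features
summary = [0, 4, 8]                        # hypothetical selected shots
print(representativeness(feats, summary), diversity(feats, summary))
```

An unsupervised summarizer would search for (or learn to output) the subset of shots that maximizes a weighted combination of these two scores.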

Supervised Video Summarization

Recurrent Neural Network (RNN)-based models dominate the supervised methods for video summarization. Variants of RNNs have been designed to encode and summarize videos while accounting for their hierarchical structure, since RNNs capture the temporal dependencies within a video very effectively.
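The core of such supervised methods is an RNN that reads frame features in order and emits a per-frame importance score, trained against human-annotated summaries. A minimal vanilla-RNN sketch (random, unlearned weights, purely illustrative) looks like this:

```python
import numpy as np

def rnn_importance_scores(frames, W_x, W_h, w_out):
    """Vanilla RNN (tanh) over frame features with a sigmoid readout that
    assigns each frame an importance score in (0, 1). Weights are random
    here; a real model would learn them from annotated summaries."""
    h = np.zeros(W_h.shape[0])
    scores = []
    for x in frames:
        h = np.tanh(W_x @ x + W_h @ h)            # recurrent state update
        scores.append(1 / (1 + np.exp(-(w_out @ h))))  # per-frame score
    return np.array(scores)

rng = np.random.default_rng(2)
d, hdim, T = 8, 16, 20
frames = rng.standard_normal((T, d))              # 20 frames, 8-dim features
scores = rnn_importance_scores(frames,
                               rng.standard_normal((hdim, d)) * 0.1,
                               rng.standard_normal((hdim, hdim)) * 0.1,
                               rng.standard_normal(hdim) * 0.1)
print(scores.shape)  # (20,)
```

Hierarchical variants stack two such recurrences, one over frames within a shot and one over shots, mirroring the structure of the video.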

Inter-Video Communication

Besides the temporal dependencies within a single video, communication across videos is also important to consider. Inter-video communication has been exploited in various computer vision fields, such as action localization, video object detection, and video person re-identification.

Conclusion

This new framework takes into consideration the semantic dependencies across videos and the internal hierarchical structure of videos to obtain high-level video representations. Video joint modelling can be a great way to uncover more information from multiple sources and help in improving the video summarization process.