Jul 24, 2022
Zhi Li, Lu He, Huijuan Xu
In the era of fine granularity, action understanding has evolved considerably, as most human behaviors in real life differ only in minor details. To capture these subtle intricacies of fine-grained actions, we introduce a weakly-supervised temporal action detection model for videos built on a four-level visual representation hierarchy: clip level, atomic action level, fine action class level, and coarse action class level. Our experiments show that such a model efficiently captures both the commonality and the individuality of fine-grained actions, achieving state-of-the-art results in action detection.
Introduction
Fine-grained actions, such as mopping the floor versus sweeping the floor, exhibit only minor visual differences and are closer to the distribution of actions in real life. Algorithms that detect such actions would greatly aid people in acquiring new skills, since people often learn from continuous instructional videos. Additionally, fine-grained action detection algorithms could perceive actions happening in homes and assist appropriately, catering to specific needs.
Challenges
Annotating start and end frames for temporal action detection is costly because of the large volume of videos, and annotations are often inconsistent across annotators. The distinctions between fine-grained actions are subtle, and the temporal granularity of the annotations is finer, making the task more challenging. For these reasons, we propose weakly-supervised action detection for fine-grained videos that relies only on video-level action labels, without temporal annotations of when the actions take place.
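To make the weakly-supervised setup concrete, here is a minimal sketch of the standard top-k Multiple Instance Learning pooling used in this line of work: per-clip class scores are aggregated into a single video-level score that can be trained against the video-level label alone. This is an illustrative simplification, not the paper's exact model; the function name, the toy scores, and k=3 are all assumptions for the example.

```python
import numpy as np

def mil_video_scores(clip_scores, k=3):
    """Aggregate per-clip class scores into video-level scores by
    averaging the k highest-scoring clips per class (top-k MIL pooling).
    clip_scores: array of shape (num_clips, num_classes)."""
    topk = np.sort(clip_scores, axis=0)[-k:]  # k largest scores per class column
    return topk.mean(axis=0)                  # shape (num_classes,)

# Toy example: 6 clips, 2 classes; class 0 fires strongly in clips 1-3.
scores = np.array([
    [0.1, 0.2],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.2, 0.3],
    [0.1, 0.2],
])
video_scores = mil_video_scores(scores, k=3)  # class 0 dominates
```

Only the video-level score is supervised, yet the per-clip scores that feed into it can later be thresholded to localize the action in time, which is exactly why MIL is the default backbone for weakly-supervised detection.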
Solution
Traditional weakly-supervised action detection models are based on Multiple Instance Learning (MIL), which may fall short in the fine-grained setting because the differences between actions manifest in small details. To address this, we introduce atomic actions, defined as short temporal parts that each represent a single semantically meaningful component of an action. Since the boundaries of atomic actions are hard to define in advance, we propose to discover them automatically via self-supervised clustering at the feature level. These atomic actions capture the subtle differences between fine-grained actions, thereby facilitating fine-action classification.
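The clustering step can be pictured with a plain k-means pass over clip features, where each resulting cluster plays the role of one discovered atomic action. This is a hypothetical stand-in sketched under simplifying assumptions, not the paper's actual self-supervised procedure; the function name, the explicit initialization indices, and the toy features are all invented for illustration.

```python
import numpy as np

def discover_atomic_actions(features, init_idx, n_iters=10):
    """Plain k-means over clip-level features; each cluster stands in
    for one discovered 'atomic action' (illustrative only).
    features: (num_clips, dim); init_idx: clip indices used as initial centers."""
    centers = features[list(init_idx)].copy()
    for _ in range(n_iters):
        # assign every clip to its nearest cluster center
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned clips
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels, centers

# Toy example: two well-separated groups of clip features,
# i.e. two distinct "atomic actions".
feats = np.vstack([np.full((4, 8), 0.0), np.full((4, 8), 5.0)])
feats += np.linspace(0, 0.1, feats.size).reshape(feats.shape)  # tiny perturbation
labels, _ = discover_atomic_actions(feats, init_idx=[0, 4])
```

Once every clip carries a cluster label, the labels form a discrete vocabulary of reusable parts that a higher level of the hierarchy can map to fine and coarse action classes.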
Results
Our proposed Hierarchical Atomic Action Network (HAAN) models the commonality and individuality of fine-grained actions within the MIL framework, enabling weakly-supervised fine-grained temporal action detection. We conducted extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, and the results demonstrate the benefit and efficacy of our proposed weakly-supervised model for fine-grained action detection. Our method consistently achieves state-of-the-art results.
Conclusion
Our level-wise approach and the introduction of atomic actions as reusable visual concepts offer promising advances in weakly-supervised temporal action detection for fine-grained videos. By mapping these visual concepts to fine and coarse action labels, our approach achieves label efficiency while improving the accuracy of identifying action start and end frames in videos. Looking forward, we hope to benchmark previous weakly-supervised approaches for general action detection on fine-grained datasets and to refine the atomic action discovery process, which we believe has great potential for learning and detecting fine action details.