Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions

Jul 24, 2022

Zhi Li, Lu He, Huijuan Xu

In the era of fine granularity, action understanding has evolved considerably, as most human behaviors in real life differ only in minor details. To capture the subtle intricacies of fine-grained actions, we introduce a novel weakly-supervised temporal action detection model that represents videos with a four-level visual hierarchy: clip level, atomic action level, fine action class level, and coarse action class level. Our experiments show that such a model can efficiently capture the commonality and individuality of fine-grained actions, achieving state-of-the-art results in action detection.

Introduction

Fine-grained actions, such as mopping the floor versus sweeping the floor, exhibit only minor visual differences and are closer to the distribution of actions in real life. Algorithms that detect such actions would greatly aid skill acquisition, as people usually learn from continuous instructional videos. Additionally, fine-grained action detection algorithms could perceive actions happening in homes and assist appropriately, catering to specific needs.

Challenges

Annotating start and end frames for temporal action detection incurs high cost due to the large volume of videos and suffers from consistency issues across annotators. Distinctions between fine-grained actions are subtle, and the temporal granularity of annotation is finer, making the task even more challenging. For these reasons, we propose weakly-supervised action detection in fine-grained videos that relies only on video-level action labels, without temporal annotations of when the actions take place.

Solution

Traditional weakly-supervised action detection models are based on Multiple Instance Learning (MIL), which may fall short in the fine-grained setting because the differences between actions manifest in small details. To tackle this, we introduce atomic actions, defined as short temporal segments each representing a single semantically meaningful component of an action. Since the boundaries of atomic actions are hard to define in advance, we propose to discover them automatically using self-supervised clustering at the feature level. These atomic actions capture subtle differences between fine-grained actions, thereby facilitating fine-action classification.
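The two ingredients above can be sketched in a toy example: clustering clip-level features to discover atomic-action groups, then aggregating per-clip class scores into a video-level prediction in the MIL style. This is a minimal illustrative sketch, not the authors' HAAN implementation; the feature dimensions, the number of atoms, and the top-k pooling choice are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(features, k, iters=20):
    """Simple k-means: assign each clip feature to an atomic-action cluster.

    (Illustrative stand-in for the paper's self-supervised clustering.)
    """
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each clip to its nearest cluster center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Update each center as the mean of its assigned clips.
        for c in range(k):
            if (assign == c).any():
                centers[c] = features[assign == c].mean(axis=0)
    return assign, centers

def mil_video_score(clip_scores, top_k=3):
    """MIL aggregation: average the top-k clip scores per class to form a
    video-level score, so training needs only video-level labels."""
    topk = np.sort(clip_scores, axis=0)[-top_k:]
    return topk.mean(axis=0)

# Toy data: 12 clips with 16-dim features, 4 atomic actions, 2 fine classes.
features = rng.normal(size=(12, 16))
assign, centers = kmeans(features, k=4)          # atomic-action assignment per clip
clip_scores = rng.random(size=(12, 2))           # per-clip class scores (stand-in)
video_score = mil_video_score(clip_scores)       # video-level class scores, shape (2,)
```

In a full model, the clip scores would come from a classifier over the discovered atomic-action representations rather than random numbers, and the video-level scores would be trained against the video-level labels.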

Results

Our proposed Hierarchical Atomic Action Network (HAAN) models the commonality and individuality of fine-grained actions within the MIL framework, enabling weakly-supervised fine-grained temporal action detection. We conducted experiments on two large-scale fine-grained video datasets, FineAction and FineGym, and the results highlight the benefit and efficacy of our weakly-supervised model, which consistently achieves state-of-the-art results for fine-grained action detection.

Conclusion

Our level-wise approach and the introduction of atomic actions as reusable visual concepts offer promising advancements in weakly-supervised temporal action detection for fine-grained videos. By mapping these visual concepts to fine and coarse action labels, our approach ensures label efficiency and also improves accuracy in identifying start and end frames in the videos. Looking forward, we hope to benchmark previous weakly-supervised approaches for general action detection on fine-grained datasets and refine the atomic action building process, which we believe has enormous potential for the learning and detection of fine action details.
