Improving Human Action Recognition by Non-action Classification

Apr 21, 2016

Yang Wang, Minh Hoai

In realistic video, recognizing human actions can be tremendously challenging due to the dominance of irrelevant factors that obscure subtle human actions. This article substantiates the benefits of removing non-action video segments, i.e., segments that do not portray any human action at all. To this end, we learn a non-action classifier and explore how it can also be used to down-weight irrelevant video segments. The non-action classifier is trained on ActionThread, a dataset with shot-level annotations for the presence or absence of human action.

1. Introduction

Recognizing human actions in video has many potential applications, in fields ranging from entertainment and robotics to security and healthcare. However, most current recognition systems struggle to separate human actions from the irrelevant factors that dominate realistic video. This is particularly problematic for human action recognition in TV material, where a single human action may be portrayed sparsely within a video clip that also includes shots for setting the scene and advancing dialog. With this in mind, we first present our findings on the benefits of using purified action clips, in which irrelevant video shots have been removed.

2. Related Works

In this research, we propose to learn a non-action classifier that predicts whether a video subsequence is an action instance. This is related to prior work on visual tasks such as object proposals and action recognition.

3. Benefits of Pruning Irrelevant Shots

We now present our findings on the statistics of non-action shots in a typical human action dataset and the benefits of removing them for human action recognition. We use the ActionThread dataset for our study. On average, a video contains about six shots, 60% of which are non-action shots. We observed that the ability to prune non-action shots leads to a significant performance gain: the average gain is 13.7%, and it is as high as 34.1% for the DriveCar class.
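
To make the pruning step concrete, here is a minimal sketch (not the paper's code) of how shot-level pruning could be applied before pooling shot features into a video-level representation. The feature dimensions, the 0.5 threshold, and the function name are illustrative assumptions.

```python
import numpy as np

def prune_and_pool(shot_features, non_action_scores, threshold=0.5):
    """Drop shots deemed non-action, then average-pool the remaining shots.

    shot_features: (num_shots, feat_dim) array, one row per shot.
    non_action_scores: (num_shots,) array in [0, 1]; higher means the shot
        is more likely to be a non-action shot.
    threshold: shots scoring above this are pruned (assumed value).
    """
    keep = non_action_scores <= threshold
    if not keep.any():
        # Fall back to all shots so the video still gets a representation.
        keep = np.ones(len(shot_features), dtype=bool)
    return shot_features[keep].mean(axis=0)

# Toy example: six shots, of which roughly 60% are non-action,
# matching the average statistics reported for ActionThread.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 128))
scores = np.array([0.9, 0.8, 0.2, 0.7, 0.1, 0.85])
video_repr = prune_and_pool(feats, scores)
print(video_repr.shape)  # (128,)
```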

4. Non-Action Classification

Having confirmed that removing non-action shots leads to a large performance gain in action recognition, we describe in this section our approach for learning a classifier to differentiate action shots from non-action shots. We combine Dense Trajectory descriptors and deep-learned features from a two-stream ConvNet to identify non-action shots. The temporal feature vectors are computed in the same way as the spatial feature vectors, and the features are combined so that a single feature vector is obtained for each shot.
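
A minimal sketch of how such a shot classifier could be assembled, assuming per-shot Dense Trajectory and two-stream ConvNet features have already been extracted (the extraction pipelines are outside this snippet). The L2-normalize-then-concatenate step and the linear SVM are common choices for this kind of setup, not necessarily the paper's exact configuration, and all dimensions below are toy values.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l2_normalize(x, eps=1e-8):
    """Row-wise L2 normalization so each feature type contributes comparably."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def combine_features(dt_feats, spatial_feats, temporal_feats):
    """Concatenate normalized Dense Trajectory, spatial-stream, and
    temporal-stream features into a single vector per shot."""
    return np.hstack([l2_normalize(f) for f in (dt_feats, spatial_feats, temporal_feats)])

# Toy data; the real descriptors are much higher-dimensional.
rng = np.random.default_rng(0)
n = 200
X = combine_features(rng.normal(size=(n, 64)),
                     rng.normal(size=(n, 32)),
                     rng.normal(size=(n, 32)))
y = rng.integers(0, 2, size=n)  # 1 = action shot, 0 = non-action shot

clf = LinearSVC(C=1.0).fit(X, y)
scores = clf.decision_function(X)  # signed distance; higher = more action-like
```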

5. Experiments

We performed experiments with several action recognition methods and measured the performance gains across multiple tests. The mean improvement across tests was 13.7%. Furthermore, we verified that the benefits of the non-action classifier generalize to action categories that do not appear in the classifier's training set.
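
The down-weighting variant mentioned in the abstract can be sketched as a soft alternative to hard pruning: rather than discarding shots, each shot's contribution to the video-level representation is scaled by how action-like the classifier believes it to be. The sigmoid squashing of raw decision values and the weight floor below are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def weighted_pool(shot_features, action_scores, floor=0.05):
    """Soft down-weighting: pool shots weighted by action-likeness.

    action_scores: raw classifier outputs (e.g. SVM decision values);
        higher means the shot is more likely to contain a human action.
    floor: minimum weight so no shot is discarded entirely (assumed).
    """
    w = np.maximum(sigmoid(action_scores), floor)
    w = w / w.sum()
    return (w[:, None] * shot_features).sum(axis=0)

# Toy usage: reuse the decision values from a shot-level classifier.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 128))
decisions = np.array([-1.2, -0.8, 2.0, -0.5, 1.5, -1.0])  # higher = action-like
video_repr = weighted_pool(feats, decisions)
```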

6. Conclusion

We have made a significant step towards human action recognition in realistic videos by studying the benefits of removing non-action shots and proposing a method for detecting them. Even though the non-action classifier is far from perfect, it marks a promising start towards more advanced solutions in future research.
