STAIR Actions: A Video Dataset of Everyday Home Actions

Apr 12, 2018

Yuya Yoshikawa, Jiaqing Lin, Akikazu Takeuchi

1. Introduction

In recent years, human action recognition - identifying what actions people are performing in a given video - has gained significant attention as one of the major themes in video analysis. With cameras being integrated into smartphones, robots, cars, home appliances, and more, action recognition technology is expected to progressively take over recognition tasks currently performed by humans. Most current approaches rely on deep neural networks (DNNs), whose performance depends heavily on the size of the training dataset and on how the action labels are selected. However, existing datasets for human action recognition are generally built without a specific target task in mind, focusing instead on the convenience of data collection.

In light of this, we introduce STAIR Actions, a new large-scale video dataset for human action recognition. The dataset covers 100 categories of fine-grained everyday home actions, targeting applications in healthcare, caregiving, security, and more. Each action category contains around 1,000 videos, each sourced from YouTube or produced by crowdsourcing workers and carrying a single action label. Every video lasts between 3 and 10 seconds, with an average duration of 5 to 6 seconds. In this paper, we describe how STAIR Actions was constructed and how it differs from existing datasets for human action recognition.
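
The paper does not specify an on-disk layout for the released videos; the sketch below assumes a hypothetical directory structure of `stair_actions/<action_label>/<clip_id>.mp4` and simply tallies clips per category, which should sit around 1,000 for each of the 100 labels if the balance described above holds.

```python
from collections import Counter
from pathlib import Path

# Hypothetical layout (not the official distribution format):
#   stair_actions/<action_label>/<clip_id>.mp4
DATASET_ROOT = Path("stair_actions")

def count_clips_per_label(root: Path) -> Counter:
    """Count video clips under each action-label directory."""
    counts = Counter()
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[label_dir.name] = sum(1 for _ in label_dir.glob("*.mp4"))
    return counts

if __name__ == "__main__":
    counts = count_clips_per_label(DATASET_ROOT)
    print(f"{len(counts)} action categories")
    for label, n in counts.most_common(5):
        print(f"{label}: {n} clips")
```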

2. Related Work

A number of action video datasets have been produced over the last two decades. Early datasets comprised only tens to hundreds of videos, whereas recent datasets are far larger. For example, the Kinetics dataset contains 400 human action categories with at least 400 video clips per action; each clip lasts around 10 seconds and is taken from a different YouTube video. The AVA 2.0 dataset provides densely annotated atomic visual actions, resulting in 740,000 action labels in total.

STAIR Actions is comparable in size, in both the number of videos and the number of action labels, to ActivityNet, Kinetics, and AVA. In addition, STAIR Actions has a sharper focus: its videos depict only everyday home actions.

3. Data Collection

We describe the detailed process of constructing the STAIR Actions dataset. First, we selected 100 action labels for STAIR Actions from the Japanese basic verb list, concentrating on verbs associated with everyday home and office actions, as well as those specific to rooms like the bathroom, kitchen, and living room.

Many verbs change their meaning depending on the object they act on. For instance, the verb “open” corresponds to quite different actions such as “opening a door,” “opening a refrigerator door,” and “opening a bottle.” For such cases, we defined action labels in the form “verb + object.” The chosen 100 action labels are listed in Table 1. Note that the labels were selected as actions that need to be recognized in the home and office, in contrast to other datasets, whose labels are often chosen as keywords that return a large number of videos on YouTube.
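
As a small illustration of the “verb + object” convention, the sketch below splits such a label into its verb and object parts; the example strings are hypothetical and not taken from Table 1.

```python
from typing import NamedTuple, Optional

class ActionLabel(NamedTuple):
    """An action label in "verb + object" form, e.g. "opening a door"."""
    verb: str
    obj: Optional[str]  # None for labels that are a bare verb

def parse_label(label: str) -> ActionLabel:
    """Split a "verb + object" label into its verb and object parts."""
    verb, _, obj = label.partition(" ")
    return ActionLabel(verb=verb, obj=obj or None)

# Hypothetical examples in the "verb + object" format described above.
for text in ["opening a door", "opening a refrigerator door", "washing dishes", "sleeping"]:
    print(parse_label(text))
```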

4. Videos from YouTube

Approximately half of the videos in STAIR Actions come from YouTube. These videos were labeled in four steps: gathering videos from YouTube, extracting 5-second clips, annotating the clips with action labels, and reviewing the quality of the annotated labels.

For the first step, we restricted the YouTube search to videos less than 4 minutes long. Animations, slide-shows, and scenes without humans were then removed. The remaining videos were divided into 5-second clips, which keeps the data manageable while still providing enough time to recognize an action. Next, the action labels were annotated by crowdsourcing workers, who were first given an overview of the annotation guidelines and tested on their comprehension of them. To make the annotation work efficient, we developed a web-based system in which workers are shown a video and asked to select one label from a set of ten candidates.
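
The paper does not say which tool was used to cut the source videos into 5-second clips; the following sketch assumes `ffmpeg` is installed and uses its segment muxer to split a downloaded video into consecutive clips of roughly that length.

```python
import subprocess
from pathlib import Path

def split_into_clips(src: Path, out_dir: Path, clip_seconds: int = 5) -> None:
    """Cut a source video into consecutive fixed-length clips with ffmpeg.

    Stream-copies the video (-c copy), so splits happen at keyframes and
    clip lengths are only approximately clip_seconds.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(src),
            "-c", "copy",                       # avoid re-encoding
            "-f", "segment",                    # fixed-length segment muxer
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            str(out_dir / f"{src.stem}_%03d.mp4"),
        ],
        check=True,
    )

# Example: split one downloaded video into 5-second clips.
# split_into_clips(Path("raw/abc123.mp4"), Path("clips/abc123"))
```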

In conclusion, STAIR Actions is a large-scale video dataset of everyday home actions with a balanced distribution of videos across its 100 action categories, large enough to train deep models to competitive performance. With STAIR Actions, researchers and developers can build systems that recognize fine-grained everyday actions in home settings.
