Zero-Shot Task Transfer

Authored By: Arghya Pal, Vineeth N Balasubramanian

Jan 25, 2024

1. Introduction

The major driving force behind modern computer vision, machine learning, and deep neural network models is the availability of large amounts of curated, labeled data. Deep models have shown state-of-the-art performance on a variety of vision tasks, but effective models that work in practice require very large labeled datasets owing to their large parameter spaces. Expecting large-scale hand-annotated datasets to be available for every vision task is not practical: some tasks require extensive domain expertise, long hours of human labor, or expensive data collection sensors, which collectively make the overall process very expensive. Even when data annotation is carried out through crowdsourcing (e.g., Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. As a result, many vision tasks are considered expensive, and practitioners either avoid them or make do with smaller amounts of data, which can lead to poorly performing models. In this work, we seek to address this problem by building an alternative approach that can obtain model parameters for tasks without any labeled data. Extending the definition of zero-shot learning beyond the basic recognition setting, we call our work Zero-Shot Task Transfer.

2. Related Work

We divide our discussion of related work into subsections that capture earlier efforts related to ours from different perspectives. Some multi-task learning methods assume a prior and then iterate to learn a joint space of tasks, while other methods learn a joint task space during training without a prior. Distributed multi-task learning methods address the same objective when tasks are distributed across a network. However, unlike our method, a common thread binding all these methods is the explicit need for labeled data for every task in the setup; these methods cannot solve a zero-shot target task without labeled samples.

3. Methodology

The primary objective of our methodology is to learn a meta-learner that regresses near-optimal parameters for a novel task for which no ground truth (data or labels) is available. To this end, our meta-learner learns from the model parameters of known tasks (with ground truth) in order to adapt to a novel zero-shot task.
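To make this regression idea concrete, the following is a minimal numpy sketch. All names, the linear form of the meta-map, and the least-squares fit are illustrative assumptions for exposition, not the paper's actual meta-network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative meta-learning setup: the meta-learner sees the model
# parameters of m known tasks and learns a mapping W that regresses
# one task's parameters from the others'. Dimensions, the linear
# form, and the least-squares fit are assumptions for this sketch.
m, d = 5, 8
task_params = rng.standard_normal((m, d))    # parameters of m known tasks

# Build (input, target) pairs by holding out each known task in turn:
# inputs are the remaining tasks' concatenated parameters, the target
# is the held-out task's parameters ("learn from known tasks").
X = np.stack([np.delete(task_params, j, axis=0).reshape(-1) for j in range(m)])
Y = task_params

# Fit the linear meta-map W by least squares: X @ W ~= Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At meta-test time, W regresses a task's parameter vector from the
# parameter vectors of the other tasks.
pred = X[0] @ W
print(pred.shape)   # (8,)
```

The hold-one-task-out construction mirrors the intuition that a meta-learner trained to reconstruct known tasks from one another can plausibly generalize to an unseen task.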

4. Results

To evaluate our proposed framework, we consider the vision tasks defined in [42]. In this section, we treat four of these tasks as unknown, or zero-shot: surface normal estimation, depth estimation, room layout estimation, and camera pose estimation. We curated this list based on the complexity of data acquisition and the complexity of learning each task with a deep network.

4.1. Dataset

We evaluated TTNet on the Taskonomy dataset [42], a publicly available dataset comprising more than 150K RGB images of indoor scenes. It provides ground truth for 26 tasks on the same RGB images, which is the main reason for choosing this dataset. We used 120K images for training, 16K images for validation, and 17K images for testing.
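The split sizes above can be reproduced with a simple index partition; the total image count and the shuffling seed below are assumptions for this sketch:

```python
import numpy as np

# Illustrative reproduction of the reported split sizes
# (120K train / 16K val / 17K test). The total count of 153K images
# and the random seed are assumptions for the sketch.
n_total = 153_000
rng = np.random.default_rng(42)
indices = rng.permutation(n_total)

train_idx = indices[:120_000]
val_idx   = indices[120_000:136_000]
test_idx  = indices[136_000:153_000]

print(len(train_idx), len(val_idx), len(test_idx))  # 120000 16000 17000
```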

4.2. Implementation Details

Network Architecture: Following Section 3, each data network is an autoencoder that closely follows the model architecture of [42]. The encoder is a fully convolutional ResNet-50 without pooling. For pixel-to-pixel tasks (e.g., surface normal estimation), the decoder comprises 15 fully convolutional layers; for low-dimensional tasks (e.g., vanishing point estimation), it consists of 2-3 fully connected layers.
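A quick shape walk-through of the pixel-to-pixel data network described above may help: a fully convolutional ResNet-50 encoder has an overall stride of 32, and the decoder must upsample back to the input resolution. The 256x256 input size and the five 2x upsampling steps are assumptions for illustration, not figures from the paper:

```python
# Shape walk-through of the data-network autoencoder: a fully
# convolutional ResNet-50 encoder (overall stride 32, no pooling
# head) followed by a convolutional decoder for pixel-to-pixel
# tasks. Input size 256x256 is an assumption for this sketch.

def encoder_output_size(h, w, stride=32):
    """Spatial size after a fully convolutional ResNet-50 encoder."""
    return h // stride, w // stride

def decoder_output_size(h, w, n_upsample=5):
    """Spatial size after a decoder that upsamples 2x five times."""
    return h * 2**n_upsample, w * 2**n_upsample

h, w = encoder_output_size(256, 256)   # bottleneck feature map
print((h, w))                          # (8, 8)
print(decoder_output_size(h, w))       # (256, 256) -- pixel-to-pixel output
```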

Regressing Zero-Shot Task Parameters:

Once we learn the optimal parameters W for F(·) using Algorithm 1, we use them to regress the zero-shot task parameters, i.e., F_W((E_1, Γ_(1,j)), ..., (E_m, Γ_(m,j))) for all j = m+1, ..., τ. (We note that the implementation of Algorithm 1 was found to be independent of the ordering of the tasks τ_1, ..., τ_m.) At inference time (for zero-shot task transfer), F(·) operates in transfer mode.
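The transfer-mode call above can be sketched in numpy as follows. The linear form of F_W, all dimensions, and the variable names (E, Gamma) are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative transfer-mode call: for each zero-shot task j, the
# learned meta-function F_W consumes pairs of known-task encoder
# parameters E_i and task correlations Gamma[i, j]. The linear form
# of F_W and all dimensions are assumptions for this sketch.
m, n_tasks, d = 4, 6, 16                 # tasks m+1..n_tasks are zero-shot
E = rng.standard_normal((m, d))          # known-task encoder parameters
Gamma = rng.standard_normal((n_tasks, n_tasks))  # task correlation matrix
W = rng.standard_normal((d, m * (d + 1))) * 0.01

def F_W(W, pairs):
    """Transfer mode: regress zero-shot parameters from (E_i, Gamma_ij) pairs."""
    x = np.concatenate([np.append(E_i, g_ij) for E_i, g_ij in pairs])
    return W @ x

# Regress parameters for every zero-shot task j = m+1, ..., n_tasks.
zero_shot_params = {
    j: F_W(W, [(E[i], Gamma[i, j]) for i in range(m)])
    for j in range(m, n_tasks)
}
print(sorted(zero_shot_params), zero_shot_params[m].shape)
```

Each regressed vector plays the role of the parameters of the zero-shot task's network, obtained without any labeled data for that task.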