Authored By: Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang

Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together

Jan 26, 2024

Introduction

Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have long been the preferred choice as context fusion modules for many natural language processing (NLP) tasks, and augmenting them with attention mechanisms has proven highly effective for both contextual feature modeling and syntactic dependency modeling. Self-attention mechanisms push this further: they require fewer parameters and less computation, while often delivering better empirical performance.

Self-attention mechanisms can be loosely categorized into two types, depending on the kind of dependency they aim to model. The first, token2token self-attention, captures the syntactic dependency between every pair of tokens in a sequence. The second, source2token self-attention, captures the global dependency of each token on the task at hand. Despite their widespread use, neither mechanism on its own reliably reaches state-of-the-art performance, because neither models pairwise and global dependencies together.
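
To make the distinction concrete, here is a minimal NumPy sketch of the two families, with shapes and parameter names chosen purely for illustration (they are not taken from any particular implementation): token2token attention produces one scalar score per token pair, while multi-dim source2token attention produces one score per token per feature and summarizes the whole sequence into a single vector.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token2token_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: one scalar score per token pair,
    modeling pairwise (syntactic) dependencies. X: (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise alignment scores
    return softmax(scores, axis=-1) @ V       # (n, d) contextualized tokens

def source2token_attention(X, W1, b1, W2):
    """Multi-dim source2token self-attention: one score per token *per feature*,
    modeling each token's global contribution to the task. X: (n, d)."""
    scores = np.tanh(X @ W1 + b1) @ W2        # (n, d) feature-wise scores
    probs = softmax(scores, axis=0)           # normalize over tokens, per feature
    return (probs * X).sum(axis=0)            # (d,) global sequence encoding

# Tiny usage example with random inputs (shapes only, no trained weights).
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv, W1, W2 = (rng.normal(size=(d, d)) for _ in range(5))
b1 = np.zeros(d)
print(token2token_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
print(source2token_attention(X, W1, b1, W2).shape)  # (8,)
```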

Introducing Multi-mask Tensorized Self-Attention (MTSA)

This blog introduces a novel attention mechanism, termed multi-mask tensorized self-attention (MTSA), that can efficiently model both pairwise and global dependencies. MTSA offers the computation and memory efficiency of a CNN, yet significantly outperforms previous CNN-, RNN-, and attention-based models.

MTSA uses a tensor to represent the feature-wise alignment scores, which gives it superior expressive power while requiring nothing more than parallelizable matrix multiplications. It also combines multi-head with multi-dimensional (multi-dim) attention, applying a distinct positional mask to each head.
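
As a rough, simplified illustration of the per-head masking idea (the mask names and construction below are our own, not the authors' exact formulation), each head adds its own additive positional mask to its pairwise scores before the softmax:

```python
import numpy as np

def positional_mask(n, kind):
    """Additive (n, n) mask: 0 keeps a token pair, -inf removes it."""
    m = np.zeros((n, n))
    if kind == "forward":      # position i may only attend to earlier positions j < i
        m[np.triu_indices(n, k=0)] = -np.inf
    elif kind == "backward":   # position i may only attend to later positions j > i
        m[np.tril_indices(n, k=0)] = -np.inf
    return m                   # "unmasked": every pair is allowed

n = 4
head_kinds = ["forward", "backward", "unmasked"]      # one kind of mask per head
masks = [positional_mask(n, kind) for kind in head_kinds]
# Each head adds its own mask to its (n, n) pairwise score matrix before the
# softmax, so different heads end up modeling different directional dependencies.
```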

Utility of MTSA

CNN/RNN-free models based on MTSA achieve state-of-the-art performance on nine NLP benchmarks while remaining efficient in both time and memory. In particular, MTSA keeps memory consumption manageable as sequence length grows.

Importantly, MTSA integrates the compatibility functions of two different self-attention mechanisms. First, scaled dot-product self-attention captures the dependency between every pair of tokens. Second, multi-dim source2token self-attention estimates the contribution of each token to the given task on each feature dimension. Together, these model both pairwise and global dependencies.
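
The sketch below shows a deliberately naive way to combine these two compatibility functions into a feature-wise score tensor for a single head; the shapes and names are illustrative assumptions rather than the reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def naive_tensorized_attention(Q, K, V, G):
    """Q, K, V: (n, d_h) per-head projections; G: (n, d_h) multi-dim
    source2token scores. Builds the full (n, n, d_h) score tensor."""
    dh = Q.shape[-1]
    pair = Q @ K.T / np.sqrt(dh)            # (n, n) scalar token2token scores
    S = pair[:, :, None] + G[None, :, :]    # (n, n, d_h) feature-wise score tensor
    P = softmax(S, axis=1)                  # attend over source tokens j, per feature
    return np.einsum("ijk,jk->ik", P, V)    # (n, d_h) output, one weighting per feature
```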

Furthermore, an efficient computation scheme for MTSA avoids any high-rank tensor computation. This is critical because attention scores and probabilities stored as n × n × d_h tensors (for n tokens and per-head dimension d_h) would cause memory explosion and a computational bottleneck on long sequences.
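
Assuming the combined score is additive as in the naive sketch above, its exponential factorizes, so the feature-wise attention reduces to two ordinary matrix multiplications and the n × n × d_h tensor is never materialized. The following sketch illustrates that memory-saving trick under this assumption:

```python
import numpy as np

def efficient_tensorized_attention(Q, K, V, G):
    """Same inputs and output as the naive version above, but the (n, n, d_h)
    score tensor is never materialized; peak memory is O(n^2 + n*d_h) per head."""
    dh = Q.shape[-1]
    pair = Q @ K.T / np.sqrt(dh)                   # (n, n) token2token scores
    pair -= pair.max(axis=1, keepdims=True)        # stabilize the exponentials;
    G = G - G.max(axis=0, keepdims=True)           #   the final ratio is unchanged
    E, F = np.exp(pair), np.exp(G)                 # factorized exponentials
    return (E @ (F * V)) / (E @ F)                 # (n, d_h) feature-wise output

# With the same random Q, K, V, G, this matches the naive construction:
# np.allclose(efficient_tensorized_attention(Q, K, V, G),
#             naive_tensorized_attention(Q, K, V, G))  -> True
```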

In conclusion, with its combination of parallelizable computation, a lightweight structure, and the ability to capture both long-range and local dependencies, MTSA is a valuable tool for boosting the expressive power and performance of neural networks used in natural language processing tasks.