Tensor Product Attention Is All You Need

Yifan Zhang; Yifeng Liu; Huizhuo Yuan; Zhen Qin; Yang Yuan; Quanquan Gu; Andrew Chi-Chih Yao

Abstract

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. We propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new architecture for sequence modeling. Across language-modeling tasks, T6 outperforms standard Transformer baselines including MHA, MQA, GQA, and MLA on perplexity and a range of benchmarks. Notably, TPA’s memory efficiency enables processing significantly longer sequences under fixed resource constraints.

Tensor Product Attention

We propose Tensor Product Attention (TPA), which factorizes Q, K, and V activations using contextual tensor decompositions, achieving a 10× or greater reduction in inference-time KV cache size relative to standard attention [Vaswani et al., 2017] with improved performance over MHA, MQA, GQA, and MLA.
We unify existing attention mechanisms by revealing that MHA, MQA, and GQA all arise naturally as non-contextual variants of TPA.
We introduce the Tensor ProducT ATTenTion Transformer (T6), a TPA-based architecture that consistently improves validation perplexity and downstream performance with reduced KV cache size.
We show TPA integrates seamlessly with RoPE [Su et al., 2024], easing adoption in foundation-model architectures such as LLaMA and Gemma.
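The contextual factorization summarized in the points above can be illustrated with a minimal sketch. It assumes that each token's key (and likewise its query and value) is the average of R outer products between a head-dimension factor of length h and a token-dimension factor of length d_h, both computed from the token embedding by linear maps, and that the cache stores the factors rather than the full keys and values. The function names, the 1/R scaling, the toy shapes, and the rank choice R_K = R_V = 2 are assumptions made for illustration, not details taken from this page; RoPE is omitted.

```python
import random

def linear(x, weight):
    """Dense map; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def contextual_factors(x, w_a, w_b, rank, num_heads, head_dim):
    """Compute `rank` factor pairs (a_r, b_r) from the token embedding x.

    a_r has length num_heads and b_r has length head_dim. Both depend on x,
    which is what makes the factorization contextual.
    """
    a_flat = linear(x, w_a)  # length rank * num_heads
    b_flat = linear(x, w_b)  # length rank * head_dim
    a = [a_flat[r * num_heads:(r + 1) * num_heads] for r in range(rank)]
    b = [b_flat[r * head_dim:(r + 1) * head_dim] for r in range(rank)]
    return a, b

def reconstruct(a, b):
    """Average of the outer products of a_r and b_r: a (num_heads, head_dim) matrix."""
    rank = len(a)
    return [[sum(a[r][i] * b[r][j] for r in range(rank)) / rank
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]

def cache_floats_per_token(num_heads, head_dim, rank_k, rank_v):
    """Per-token, per-layer cache entries: full K and V versus stored factors."""
    mha = 2 * num_heads * head_dim
    tpa = (rank_k + rank_v) * (num_heads + head_dim)
    return mha, tpa

# Toy example: factorize the key of one token (hypothetical sizes).
rng = random.Random(0)
d_model, num_heads, head_dim, rank_k = 16, 4, 8, 2
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
w_a = random_matrix(rank_k * num_heads, d_model, rng)
w_b = random_matrix(rank_k * head_dim, d_model, rng)
a_k, b_k = contextual_factors(x, w_a, w_b, rank_k, num_heads, head_dim)
k_t = reconstruct(a_k, b_k)  # 4 x 8 key matrix, one row per head

# Cache comparison for an assumed configuration: 32 heads, head_dim 128.
mha, tpa = cache_floats_per_token(32, 128, 2, 2)
print(mha, tpa, mha / tpa)  # 8192 640 12.8
```

Under these assumed sizes the stored factors take 640 entries per token per layer against 8192 for full keys and values, which is in line with the "10× or greater" figure quoted above. The actual ratio depends on the ranks and head configuration chosen.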

Tensor Factorization of Queries, Keys, and Values

KV Caching and Memory Reduction

Experimental Results

Training loss of medium (353M), large (773M), and XL (1.5B) models with different attention mechanisms on FineWeb-Edu 100B.

Validation loss of medium (353M), large (773M), and XL (1.5B) models on FineWeb-Edu 100B.

Medium-size (353M) evaluation — Downstream evaluation (zero-shot and two-shot) on ARC, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, and MMLU via the lm-evaluation-harness.

Large-size (773M) evaluation — Downstream evaluation (zero-shot and two-shot) on ARC, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, and MMLU via the lm-evaluation-harness.

Citation

If you use TPA or the T6 Transformer, please cite:

@article{zhang2026tensor,
  title={Tensor product attention is all you need},
  author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Qin, Zhen and Yuan, Yang and Gu, Quanquan and Yao, Andrew},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  pages={112206--112251},
  year={2026}
}