Abstract
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. We propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new architecture for sequence modeling. Across language-modeling tasks, T6 exceeds standard Transformer baselines including MHA, MQA, GQA, and MLA on perplexity and a range of benchmarks. Notably, TPA’s memory efficiency enables processing significantly longer sequences under fixed resource constraints.
Tensor Product Attention

- We propose Tensor Product Attention (TPA), factorizing Q, K, and V activations using contextual tensor decompositions to achieve 10× or more reduction in inference-time KV cache size relative to standard attention [Vaswani et al., 2017], with improved performance over MHA, MQA, GQA, and MLA.
- We unify existing attention mechanisms by revealing that MHA, MQA, and GQA all arise naturally as non-contextual variants of TPA.
- We introduce the Tensor ProducT ATTenTion Transformer (T6), a TPA-based architecture that consistently improves validation perplexity and downstream performance with reduced KV cache size.
- We show TPA integrates seamlessly with RoPE [Su et al., 2024], easing adoption in foundation-model architectures such as LLaMA and Gemma.
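
To make the contributions above concrete, here is a minimal pure-Python sketch of a contextual low-rank factorization for one token. It is an illustration under stated assumptions, not the reference implementation. It assumes each of Q, K, and V is rebuilt as a 1/R-scaled sum of R outer products between a head-side factor and a feature-side factor, both computed from the token by linear maps. The names `ContextualFactors`, `reconstruct`, and `cache_floats_per_token` are hypothetical, and every size used is illustrative rather than taken from the experiments.

```python
import random


def linear(x, W):
    """Return x @ W for a vector x (len n) and a matrix W (n rows, m cols)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for learned projection matrices."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def chunk(flat, width):
    """Split a flat list into consecutive rows of length `width`."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]


class ContextualFactors:
    """Hypothetical rank-R factor generator for one of Q, K, or V."""

    def __init__(self, d_model, n_heads, d_head, rank, seed=0):
        rng = random.Random(seed)
        self.n_heads, self.d_head, self.rank = n_heads, d_head, rank
        self.W_a = rand_matrix(d_model, rank * n_heads, rng)  # head-side factors
        self.W_b = rand_matrix(d_model, rank * d_head, rng)   # feature-side factors

    def __call__(self, x):
        # Both factors depend on the token x, hence "contextual".
        A = chunk(linear(x, self.W_a), self.n_heads)  # R rows of length n_heads
        B = chunk(linear(x, self.W_b), self.d_head)   # R rows of length d_head
        return A, B


def reconstruct(A, B):
    """Assumed form: out[h][d] = (1/R) * sum_r A[r][h] * B[r][d]."""
    R, n_heads, d_head = len(A), len(A[0]), len(B[0])
    return [[sum(A[r][h] * B[r][d] for r in range(R)) / R for d in range(d_head)]
            for h in range(n_heads)]


def cache_floats_per_token(n_heads, d_head, rank_k=None, rank_v=None):
    """Floats cached per token per layer: full K and V if ranks are None, else factors."""
    if rank_k is None or rank_v is None:
        return 2 * n_heads * d_head
    return (rank_k + rank_v) * (n_heads + d_head)


# Illustrative sizes only (assumptions, not values from the source).
d_model, n_heads, d_head, rank = 64, 4, 16, 2
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

key_factors = ContextualFactors(d_model, n_heads, d_head, rank)
A_k, B_k = key_factors(x)   # this pair is what would be cached for the token
K = reconstruct(A_k, B_k)   # n_heads x d_head keys, rebuilt on demand

full = cache_floats_per_token(32, 128)
factored = cache_floats_per_token(32, 128, rank_k=2, rank_v=2)
print(len(K), len(K[0]), full, factored, full / factored)
```

Under these assumed sizes (32 heads, head dimension 128, rank 2 for both keys and values), caching the factors stores 640 floats per token per layer instead of 8192, a 12.8× reduction, which is in line with the "10× or more" figure above. In the same sketch, replacing the contextual head-side factors with a fixed scaled identity (rank equal to the number of heads) makes each head's key equal to one row of `B`, which is one way to read the claim that MHA arises as a non-contextual variant.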
Tensor Factorization of Queries, Keys, and Values

KV Caching and Memory Reduction

Experimental Results


Citation
If you use TPA or the T6 Transformer, please cite:
@article{zhang2026tensor,
title={Tensor product attention is all you need},
author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Qin, Zhen and Yuan, Yang and Gu, Quanquan and Yao, Andrew},
journal={Advances in Neural Information Processing Systems},
volume={38},
pages={112206--112251},
year={2026}
}