Tensor Product Attention

Tensor Product Attention Is All You Need
A tensor-decomposition attention that shrinks the KV cache 10× or more while improving quality — and the T6 Transformer built on it.

Yifan Zhang*◇ · Yifeng Liu* · Huizhuo Yuan · Zhen Qin · Yang Yuan · Quanquan Gu · Andrew C Yao†
IIIS, Tsinghua · Shanghai Qi Zhi · UCLA · TapTap  ·  NeurIPS 2025 Spotlight · arXiv:2501.06425
* Equal contribution  ·  ◇ Project lead  ·  † Corresponding author
Attention · KV Cache · Tensor Decomposition · RoPE

Abstract

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. We propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new architecture for sequence modeling. Across language-modeling tasks, T6 exceeds standard Transformer baselines including MHA, MQA, GQA, and MLA on perplexity and a range of benchmarks. Notably, TPA’s memory efficiency enables processing significantly longer sequences under fixed resource constraints.

Tensor Product Attention

TPA / T6 architecture
  • We propose Tensor Product Attention (TPA), factorizing Q, K, and V activations using contextual tensor decompositions to achieve 10× or more reduction in inference-time KV cache size relative to standard attention [Vaswani et al., 2017], with improved performance over MHA, MQA, GQA, and MLA.
  • We unify existing attention mechanisms by revealing that MHA, MQA, and GQA all arise naturally as non-contextual variants of TPA.
  • We introduce the Tensor ProducT ATTenTion Transformer (T6), a TPA-based architecture that consistently improves validation perplexity and downstream performance with reduced KV cache size.
  • We show TPA integrates seamlessly with RoPE [Su et al., 2024], easing adoption in foundation-model architectures such as LLaMA and Gemma.
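The factorization and RoPE points above can be illustrated with a minimal NumPy sketch for a single token. It assumes each factor is a linear map of the token's hidden state and that RoPE rotates adjacent dimension pairs; the function names, weight shapes, and sizes are hypothetical choices for illustration, not the released T6 implementation.

```python
import numpy as np

def tpa_factors(x, W_a, W_b, rank, n_heads, d_head):
    """Contextual factors for one token (assumed linear in x).

    Returns A with shape (rank, n_heads) and B with shape (rank, d_head).
    """
    A = (x @ W_a).reshape(rank, n_heads)
    B = (x @ W_b).reshape(rank, d_head)
    return A, B

def reconstruct(A, B):
    """Average of rank-1 outer products a_r (x) b_r, shape (n_heads, d_head)."""
    return A.T @ B / A.shape[0]

def rope_rotate(M, pos, base=10000.0):
    """Rotate adjacent pairs along the last axis by position-dependent angles."""
    d = M.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(M)
    out[..., 0::2] = M[..., 0::2] * cos - M[..., 1::2] * sin
    out[..., 1::2] = M[..., 0::2] * sin + M[..., 1::2] * cos
    return out

# Illustrative sizes only (not taken from the paper's configurations).
rng = np.random.default_rng(0)
d_model, n_heads, d_head, rank = 64, 4, 16, 2
x = rng.standard_normal(d_model)
W_a = rng.standard_normal((d_model, rank * n_heads)) / np.sqrt(d_model)
W_b = rng.standard_normal((d_model, rank * d_head)) / np.sqrt(d_model)

A_k, B_k = tpa_factors(x, W_a, W_b, rank, n_heads, d_head)
K = reconstruct(A_k, B_k)  # full per-head keys, shape (n_heads, d_head)

# RoPE acts only on the head dimension, so rotating the B factor
# gives the same result as rotating the reconstructed keys.
K_rot_full = rope_rotate(K, pos=7)
K_rot_fact = reconstruct(A_k, rope_rotate(B_k, pos=7))
print(np.allclose(K_rot_full, K_rot_fact))
```

In this sketch only `A_k` and the rotated `B_k` would need to be cached for the token, which is `rank * (n_heads + d_head)` numbers rather than the `n_heads * d_head` of the full key matrix.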

Tensor Factorization of Queries, Keys, and Values

Tensor factorization of Q, K, V

KV Caching and Memory Reduction

KV caching and memory reduction
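One way to see where a reduction of this size can come from is to count cached numbers per token per layer. The sketch below assumes standard multi-head attention stores full keys and values (2 · h · d_h numbers), while TPA stores only the key and value factors ((R_K + R_V) · (h + d_h) numbers). The head count, head dimension, and ranks are illustrative values, not configurations taken from the experiments.

```python
def mha_cache_per_token(n_heads, d_head):
    # Full keys and values: 2 * h * d_h numbers per token per layer.
    return 2 * n_heads * d_head

def tpa_cache_per_token(n_heads, d_head, rank_k, rank_v):
    # Factors only: each rank contributes one head-side vector (h)
    # and one feature-side vector (d_h), for both K and V.
    return (rank_k + rank_v) * (n_heads + d_head)

# Hypothetical model shape and ranks, chosen for illustration.
h, d_h = 32, 128
mha = mha_cache_per_token(h, d_h)          # 8192
tpa = tpa_cache_per_token(h, d_h, 2, 2)    # 640
ratio = mha / tpa                          # 12.8
print(mha, tpa, ratio)
```

Under these assumed values the factorized cache is 12.8 times smaller; the ratio grows with the head count and head dimension and shrinks as the ranks increase.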

Experimental Results

Training loss of medium (353M), large (773M), and XL (1.5B) models with different attention mechanisms on FineWeb-Edu 100B.
Validation loss of medium (353M), large (773M), and XL (1.5B) models on FineWeb-Edu 100B.
Medium-size (353M) evaluation · Large-size (773M) evaluation · XL-size (1.5B) evaluation
Downstream evaluation (zero-shot and two-shot) on ARC, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, and MMLU via the lm-evaluation-harness.

Citation

If you use TPA or the T6 Transformer, please cite:

@article{zhang2026tensor,
  title={Tensor product attention is all you need},
  author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Qin, Zhen and Yuan, Yang and Gu, Quanquan and Yao, Andrew},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  pages={112206--112251},
  year={2026}
}