Traditional transformer architectures, while powerful for many NLP tasks, face significant computational and memory constraints as sequence lengths grow. Standard self-attention scales as O(n²) in sequence length, and a model like the 175-billion-parameter GPT-3 consumes over 800 GB of memory during training while being capped at a 2,048-token context window.

The fundamental challenge lies in maintaining the transformer's ability to capture long-range dependencies while reducing its quadratic computational and memory requirements.

This page brings together solutions from recent research—including linear attention mechanisms, sparse transformers, state space models, and structured state approaches. These and other approaches aim to achieve sub-quadratic scaling while preserving the modeling capacity that made transformers successful.
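
As a minimal illustration of the trade-off these methods target, the sketch below contrasts standard softmax attention, which materializes an n × n score matrix, with a kernelized linear-attention approximation that never does. The elu-based feature map and function names are illustrative assumptions, not taken from any document in this collection.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes an (n, n) score matrix -> O(n^2)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (n, d)

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized approximation: phi(Q) (phi(K)^T V) -> O(n * d^2)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 feature map
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                         # (d, d), independent of n
    z = Qf @ Kf.sum(axis=0)                               # normalizer, shape (n,)
    return (Qf @ kv) / (z[:, None] + eps)

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) * 0.1 for _ in range(3))
out_quadratic = softmax_attention(Q, K, V)   # O(n^2) memory: a 4096 x 4096 score matrix
out_linear = linear_attention(Q, K, V)       # O(n) memory in sequence length
```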

1. Distilled Sentence Embedding Model with Knowledge Distillation for Efficient Similarity Calculation

MICROSOFT TECHNOLOGY LICENSING LLC, 2025

Training a lightweight sentence embedding model, called Distilled Sentence Embedding (DSE), that calculates similarities between sentences far more cheaply than heavy transformer models. DSE is trained by decoupling sentence embedding from similarity analysis using knowledge distillation: a student model learns to match the sentence embeddings generated by a teacher transformer model. Corpus embeddings can then be precomputed once with DSE and quickly compared against new input sentences, without scoring entire sentence pairs through the transformer.
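
A minimal sketch of the two stages described above, assuming hypothetical `teacher.encode` and `student.encode` interfaces (the patent does not specify an API): a distillation step that regresses student embeddings onto the teacher's, and a serving-time lookup that compares one query embedding against precomputed corpus embeddings.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, sentences, optimizer):
    """One knowledge-distillation step: match the teacher's embedding space."""
    with torch.no_grad():
        target = teacher.encode(sentences)    # heavy teacher transformer embeddings
    pred = student.encode(sentences)          # lightweight DSE-style embeddings
    loss = F.mse_loss(pred, target)           # regress student onto teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def top_k_similar(query_emb, corpus_embs, k=5):
    """Serving time: one cheap embedding + a matrix of precomputed corpus
    embeddings replaces scoring every (query, sentence) pair end to end."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), corpus_embs)  # (N,)
    return sims.topk(k)
```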

2. Nested Hierarchical Transformer Architecture with Full Self-Attention for Image Feature Extraction

GOOGLE LLC, 2025

Nested hierarchical transformers for efficient and interpretable vision tasks such as image classification. Rather than reducing the self-attention range, the method stacks nested hierarchies of transformers: each hierarchy generates higher-order features from image patches and then aggregates them into smaller representations. This maintains the full self-attention range for feature extraction while shrinking the overall feature size. The nested hierarchies improve accuracy and data efficiency over conventional transformers, and the architecture also aids interpretability by decoupling feature learning from abstraction.

[Patent drawing: US12327395B2]
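
A minimal sketch of one such hierarchy level, assuming average pooling as the aggregation step; the patent describes aggregation generically, so the shapes and pooling choice here are illustrative only.

```python
import torch
import torch.nn as nn

class NestedLevel(nn.Module):
    """One hierarchy level: full self-attention inside each local block of
    patch tokens, then aggregation that shrinks the token count for the
    next level. Blocks are treated as batch entries, so attention stays
    full-range within a block without an image-wide (n, n) matrix."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=4, stride=4)   # 4 tokens -> 1

    def forward(self, x):                  # x: (num_blocks, block_len, dim)
        x, _ = self.attn(x, x, x)          # full self-attention per block
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)    # aggregate
        return x

tokens = torch.randn(16, 64, 128)   # 16 blocks of 64 patch tokens, dim 128
out = NestedLevel(dim=128)(tokens)  # (16, 16, 128): smaller representation
```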

3. Guided Evolutionary Growth Method for Artificial Neural Network Architecture Development

BLAIZE INC, 2025

Discovering novel artificial neural network architectures through guided evolutionary growth. The method starts from a minimal network configuration and iteratively evolves it by adding layers. Evolution is guided by pairwise similarities computed between the evolving networks and existing networks; these similarities add terms to the fitness function that discourage copying existing architectures. The evolving networks are trained, evaluated, and scored on both performance and dissimilarity to existing networks, steering the search toward previously unexplored regions of the architecture space.

[Patent drawing: US12307371B2]
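
A toy sketch of the similarity-penalized fitness described above. The Jaccard similarity over layer-type sets stands in for whatever pairwise metric the patent actually uses; the names and weighting are assumptions.

```python
def architecture_similarity(arch_a, arch_b):
    """Toy pairwise similarity between layer-type sequences (Jaccard over
    layer-type sets); the patent's actual metric is not specified here."""
    a, b = set(arch_a), set(arch_b)
    return len(a & b) / max(len(a | b), 1)

def fitness(evolved_arch, val_accuracy, existing_archs, novelty_weight=0.5):
    """Performance term minus a penalty that discourages copying known nets."""
    max_sim = max(architecture_similarity(evolved_arch, e) for e in existing_archs)
    return val_accuracy - novelty_weight * max_sim

existing = [["conv", "conv", "pool", "fc"], ["conv", "attn", "fc"]]
candidate = ["conv", "attn", "pool", "fc"]
score = fitness(candidate, val_accuracy=0.91, existing_archs=existing)
```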

4. Neural Network Architecture with Gated Attentive Layer for Linear Complexity Long Sequence Processing

GOOGLE LLC, 2025

Neural network architecture for efficient long-sequence processing that uses an attentive layer with a gating mechanism to reduce the computational and memory cost of applying self-attention to long inputs. The gating mechanism lets the layer substitute an approximation of the Transformer's quadratic self-attention that runs with linear complexity in the sequence length, yielding a variant that scales linearly with context size and avoids memory bottlenecks.

[Patent drawing: US2025139431A1]
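
One plausible reading of such a layer, sketched below in the spirit of gated linear-attention units: a linear-complexity attention approximation whose output is modulated by a learned sigmoid gate. This is an illustration of the general technique, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class GatedLinearAttention(nn.Module):
    """Illustrative gated attentive layer: linear-complexity attention
    (softmax applied separately to queries and keys) whose output is
    modulated by a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, n, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        kv = torch.einsum("bnd,bne->bde", k, v)  # (dim, dim): no (n, n) matrix
        attn = torch.einsum("bnd,bde->bne", q, kv)
        g = torch.sigmoid(self.gate(x))          # gating mechanism
        return self.out(g * attn)

x = torch.randn(2, 8192, 256)                    # long sequence, linear cost
y = GatedLinearAttention(256)(x)
```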

5. Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment

Yuhao Ji, Chao Fang, Shaobo Ma, 2024

Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. While binarized Transformers offer a promising solution by significantly reducing model size, existing approaches suffer from algorithm-hardware mismatches with limited co-design exploration, leading to suboptimal performance on edge devices. Hence, we propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization. First, we propose BMT, a novel hardware-friendly binarized Transformer with optimized quantization methods and components, and we further enhance its model accuracy by leveraging the weighted ternary weight splitting training technique. Second, we develop a streaming processor mixed binarized Transformer accelerator, namely BAT, which is equipped with specialized units and scheduling pipelines for efficient inference of binarized Transformers. Finally, we co-optimize the algorithm and hardware through a design space exploration approach…
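
Weight binarization itself is a generic building block, sketched below with a sign quantizer and a straight-through estimator so gradients still reach the real-valued weights. This is not BMT's actual quantization method, only a minimal example of the technique such accelerators exploit.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign-binarize weights in the forward pass; pass gradients straight
    through (clipped to |w| <= 1) in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                       # weights in {-1, 0, +1}

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # straight-through estimator

w = torch.randn(256, 256, requires_grad=True)
w_bin = BinarizeSTE.apply(w)      # 1-bit weights: ~32x smaller than fp32
loss = (w_bin @ torch.randn(256, 16)).pow(2).mean()
loss.backward()                   # gradients flow back to real-valued w
```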

6. HPTA: A High Performance Transformer Accelerator Based on FPGA

Yuntao Han, Qiang Liu - IEEE, 2023

The transformer neural networks have achieved remarkable performance in both Natural Language Processing (NLP) and Computer Vision (CV) applications, with encoder-decoder architecture based on attention layers. However, implementing transformers on resource-constrained devices presents challenges due to the super-large network structures and nontrivial dataflows. Field-Programmable Gate Arrays (FPGA) have been a promising platform for Neural Network (NN) acceleration due to their design flexibility and customization. Existing FPGA-based implementations of transformers face efficiency and generality issues. This paper proposes HPTA, a high-performance accelerator for implementing transformers on FPGA. We analyze the structural features of transformer networks and design the accelerator with configurable processing elements, optimized data selection and arrangement, and an efficient memory subsystem to support various transformers. We evaluate the performance of HPTA with BERT and Swin Transformer, the typical transformer models in NLP and CV. HPTA achieves up to 44 and 29 inference time…

7. EFA-Trans: An Efficient and Flexible Acceleration Architecture for Transformers

Xin Yang, Tao Su - MDPI AG, 2022

Transformers are rapidly emerging as one of the most important primitives in neural networks. Unfortunately, most hardware designs for transformers are deficient, either hardly considering the configurability of the design or failing to realize the complete inference process of transformers. Specifically, few studies have paid attention to the compatibility of different computing paradigms. Thus, this paper presents EFA-Trans, a highly efficient and flexible hardware accelerator architecture for transformers. To reach high performance, we propose a configurable matrix computing array and leverage on-chip memory optimizations. In addition, with the design of nonlinear modules and fine-grained scheduling, our architecture can perform complete transformer inference. EFA-Trans is also compatible with dense and sparse patterns, which further expands its application scenarios. Moreover, a performance analytic model is abstracted to guide the determination of architecture parameter sets. Finally, our designs are developed in RTL and evaluated on a Xilinx ZCU102. Experimental…
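
The paper's analytic model is not reproduced here, but a roofline-style sketch conveys the general idea: estimate latency as the maximum of compute-bound and memory-bound time for a candidate parameter set, then pick the cheapest configuration. All constants and parameter names below are hypothetical.

```python
def estimate_latency(macs, bytes_moved, pe_array_size,
                     freq_hz=200e6, bw_bytes_per_s=19.2e9):
    """Roofline-style estimate: latency is bounded by whichever of compute
    or memory traffic dominates for a given PE-array configuration."""
    compute_s = macs / (pe_array_size * freq_hz)   # one MAC per PE per cycle
    memory_s = bytes_moved / bw_bytes_per_s
    return max(compute_s, memory_s)

# Sweep candidate PE-array sizes for one matrix multiply (toy numbers).
layer_macs, layer_bytes = 512 * 512 * 512, 3 * 512 * 512 * 2
best = min((estimate_latency(layer_macs, layer_bytes, p), p)
           for p in (64, 128, 256, 512))
print(f"best latency {best[0] * 1e6:.1f} us with {best[1]} PEs")
```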

8. SAUST: A Scheme for Acceleration of Unstructured Sparse Transformer

Yifan Song, Shunpeng Zhao, Song Chen - IEEE, 2022

Transformers achieve impressive results on many AI tasks but introduce a huge amount of computation. Pruning is a promising method to reduce the computation load by generating sparse transformer models. To avoid load imbalance caused by computation involving zero elements, previous works explore structured pruning combined with hardware acceleration. However, the tight constraints of structured pruning usually make training much harder and yield a lower final sparsity level. This paper proposes SAUST, a scheme that exploits the high sparsity level of unstructured pruning and addresses the load imbalance problem using both hardware and software methods. FPGA implementation shows that SAUST can achieve 3.35x and 2.76x execution time speedup compared to two state-of-the-art references on hardware accelerators.
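
The load-imbalance problem itself is easy to reproduce: after unstructured magnitude pruning, nonzero counts vary widely across rows, so a naive row-per-processing-element mapping stalls on the densest rows. The sketch below shows only the imbalance; SAUST's hardware/software balancing scheme is not reproduced here.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Unstructured pruning: zero the smallest-magnitude weights globally."""
    k = int(w.size * sparsity)
    thresh = np.partition(np.abs(w).ravel(), k)[k]   # k-th smallest magnitude
    return np.where(np.abs(w) >= thresh, w, 0.0)

w = magnitude_prune(np.random.randn(512, 512), sparsity=0.9)
nnz_per_row = (w != 0).sum(axis=1)        # work assigned to each PE row
print("nonzeros per row: min", nnz_per_row.min(), "max", nnz_per_row.max())
# Rows end up with very different nonzero counts, so a naive row-per-PE
# mapping leaves most PEs idle while the densest rows finish.
```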
