Alternatives to Transformer Architecture
Traditional transformer architectures, while powerful for many NLP tasks, face steep computational and memory costs as sequence lengths grow. Self-attention scales quadratically, requiring O(n²) compute and memory in the sequence length; models like GPT-3 175B consume over 800GB of memory during training and struggle with context windows beyond 2,048 tokens.
The fundamental challenge lies in maintaining the transformer's ability to capture long-range dependencies while reducing its quadratic computational and memory requirements.
This page brings together solutions from recent research—including linear attention mechanisms, sparse transformers, state space models, and structured state approaches. These and other approaches aim to achieve sub-quadratic scaling while preserving the modeling capacity that made transformers successful.
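To make the scaling contrast concrete, here is a minimal NumPy sketch comparing standard O(n²) softmax attention with an O(n) linear-attention approximation. The feature map, sizes, and naming are illustrative assumptions, not taken from any paper listed below.

```python
# Minimal sketch (NumPy): quadratic softmax attention vs. a linear-attention
# approximation. Feature map phi and dimensions are illustrative assumptions.
import numpy as np

def softmax_attention(Q, K, V):
    # Scores form an (n, n) matrix: compute and memory grow quadratically with n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1.0):
    # Associativity lets us form phi(K)^T V (a d x d matrix) once,
    # so the cost scales linearly with sequence length n.
    KV = phi(K).T @ V                      # (d, d)
    Z = phi(Q) @ phi(K).sum(axis=0)        # (n,) normalizer
    return (phi(Q) @ KV) / Z[:, None]

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The linear variant trades the exact softmax weighting for a kernel feature map, which is the basic compromise most sub-quadratic attention schemes make.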
1. Optimizing the Structures of Transformer Neural Networks Using Parallel Simulated Annealing
Maciej Trzciński, Szymon Łukasik, Amir H. Gandomi - Walter de Gruyter GmbH, 2024
The Transformer is an important addition to the rapidly growing list of Artificial Neural Networks (ANNs) suited to extremely complex automation tasks. It has already become the tool of choice for automatic translation in many business solutions. In this paper, we present an automated approach to optimizing the Transformer structure based on Simulated Annealing, an algorithm widely recognized for both its simplicity and its usability in optimization tasks where the search space may be highly complex. The proposed method allows for parallel computing and time-efficient optimization by modifying the structure while the network is training, rather than performing the two steps one after another. The algorithm does not reset the weights after changes to the transformer structure; instead, it continues the training process so that the results can adapt without re-randomizing all the training parameters. The algorithm showed promising performance in experiments compared to traditional training methods without structural modifications…
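As a rough, hedged illustration of the idea described in this abstract (not the paper's actual training loop), the sketch below runs a Metropolis-style simulated-annealing search over a toy "structure" while keeping the current state between proposals. The objective function is a made-up surrogate standing in for validation loss after continued training.

```python
# Hedged sketch: simulated annealing over a structure (heads, layers) without
# resetting state between proposals. The toy objective is an assumption.
import math, random

random.seed(0)

def validation_loss(structure):
    # Toy surrogate: pretend 8 heads and 6 layers is optimal.
    heads, layers = structure
    return (heads - 8) ** 2 + (layers - 6) ** 2 + random.gauss(0, 0.1)

def propose(structure):
    heads, layers = structure
    if random.random() < 0.5:
        heads = max(1, heads + random.choice([-1, 1]))
    else:
        layers = max(1, layers + random.choice([-1, 1]))
    return (heads, layers)

structure = (2, 2)
loss = validation_loss(structure)
temperature = 1.0
for step in range(200):
    candidate = propose(structure)
    cand_loss = validation_loss(candidate)  # in the paper, training continues here
    # Metropolis acceptance: always take improvements, occasionally accept worse moves.
    if cand_loss < loss or random.random() < math.exp((loss - cand_loss) / temperature):
        structure, loss = candidate, cand_loss
    temperature *= 0.98                     # cooling schedule

print("best structure found:", structure)
```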
2. Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
Yuhao Ji, Chao Fang, Shaobo Ma, 2024
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. While binarized Transformers offer a promising solution by significantly reducing model size, existing approaches suffer from algorithm-hardware mismatches with limited co-design exploration, leading to suboptimal performance on edge devices. Hence, we propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization. First, we propose BMT, a novel hardware-friendly binarized Transformer with optimized quantization methods and components, and we further enhance its model accuracy by leveraging the weighted ternary weight splitting training technique. Second, we develop a streaming processor mixed binarized Transformer accelerator, namely BAT, which is equipped with specialized units and scheduling pipelines for efficient inference of binarized Transformers. Finally, we co-optimize the algorithm and hardware through a design space exploration approach…
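For readers unfamiliar with binarization, here is a generic sketch of weight binarization with a per-tensor scale (sign of the weights times their mean absolute value). This is a common textbook scheme, not the BMT quantizer itself.

```python
# Generic weight binarization sketch: W is approximated by alpha * sign(W).
import numpy as np

def binarize(W):
    alpha = np.abs(W).mean()         # scale preserving average magnitude
    B = np.where(W >= 0, 1.0, -1.0)  # weights collapse to {-1, +1}
    return alpha, B

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
alpha, B = binarize(W)
W_hat = alpha * B                    # dequantized approximation of W
print("reconstruction error:", np.abs(W - W_hat).mean())
```

Storing one bit per weight plus a single scale is what yields the large model-size reductions the abstract refers to.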
3. HPTA: A High Performance Transformer Accelerator Based on FPGA
Yuntao Han, Qiang Liu - IEEE, 2023
Transformer neural networks have achieved remarkable performance in both Natural Language Processing (NLP) and Computer Vision (CV) applications, with an encoder-decoder architecture based on attention layers. However, implementing transformers on resource-constrained devices presents challenges due to their very large network structures and nontrivial dataflows. Field-Programmable Gate Arrays (FPGAs) have been a promising platform for Neural Network (NN) acceleration due to their design flexibility and customization. Existing FPGA-based implementations of transformers face efficiency and generality issues. This paper proposes HPTA, a high-performance accelerator for implementing transformers on FPGA. We analyze the structural features of transformer networks and design the accelerator with configurable processing elements, optimized data selection and arrangement, and an efficient memory subsystem to support various transformers. We evaluate the performance of HPTA with BERT and Swin Transformer, the typical transformer models in NLP and CV. HPTA achieves up to 44× and 29× inference time…
4. Parameter Design Approaches based on AI Techniques for Transformer Neural Network Optimization
Gurpreet Singh, Mansi Chaudhary, Jaspreet Singh - IEEE, 2023
The Transformer neural network has attracted attention for its capacity to process data sequences with unmatched efficacy, built on a self-attention mechanism and powerful feedforward layers. It is widely admired for its performance in multimodal tasks, language interpretation, and image processing. As artificial intelligence (AI) continues to advance rapidly, the preparation, operation, preservation, and commercialization of energy infrastructure have become hotspots for research approaches based on data-driven AI technologies, and the transformer also performs remarkably well in several areas of image analysis and evaluation, including self-driving automobiles. This paper presents a brief overview of the transformer neural network and its input definition. Next, a transformer neural network design and accompanying model for natural language processing tasks is outlined. Following that, papers on distinct transformer neural network types together with distinct techniques…
5. TiC-SAT
Alireza Amirshahi, Joshua Klein, Giovanni Ansaloni - ACM, 2023
Transformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementations on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as dedicated functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full system simulations across various BERT and Vision-Transformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework.
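The data-reuse idea behind such tightly-coupled designs can be illustrated in software with a blocked matrix multiply, where a small tile stands in for the systolic array or cache-resident working set. This is a generic sketch under that analogy, with arbitrary tile sizes, not TiC-SAT's actual parameters.

```python
# Blocked (tiled) matmul sketch: each tile of A and B is loaded once and
# reused across many multiply-accumulates, which is the reuse pattern
# systolic arrays and cache-aware scheduling exploit.
import numpy as np

def tiled_matmul(A, B, tile=8):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each (tile x tile) block participates in tile**3 MACs.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((32, 32)), rng.standard_normal((32, 32))
print(np.allclose(tiled_matmul(A, B), A @ B))
```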
6. A Survey of Techniques for Optimizing Transformer Inference
Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, 2023
Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformers (BERT), the Generative Pretrained Transformer (GPT) and the Vision Transformer (ViT), has shown its effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of the general public. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformer…
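Of the algorithmic techniques this survey covers, quantization is perhaps the simplest to show in code. Below is a minimal sketch of symmetric post-training int8 quantization of a weight tensor; the scheme and values are generic textbook choices, not tied to any specific paper.

```python
# Symmetric post-training int8 quantization sketch: map the largest weight
# magnitude to 127 and round everything else onto the int8 grid.
import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(W)
print("max abs error:", np.abs(W - dequantize(q, scale)).max())
```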
7. EFA-Trans: An Efficient and Flexible Acceleration Architecture for Transformers
Xin Yang, Tao Su - MDPI AG, 2022
Transformers are rapidly emerging as one of the most important primitives in neural networks. Unfortunately, most hardware designs for transformers are deficient, either hardly considering the configurability of the design or failing to realize the complete inference process of transformers. Specifically, few studies have paid attention to the compatibility of different computing paradigms. Thus, this paper presents EFA-Trans, a highly efficient and flexible hardware accelerator architecture for transformers. To reach high performance, we propose a configurable matrix computing array and leverage on-chip memory optimizations. In addition, with the design of nonlinear modules and fine-grained scheduling, our architecture can perform complete transformer inference. EFA-Trans is also compatible with dense and sparse patterns, which further expands its application scenarios. Moreover, a performance analytic model is abstracted to guide the determination of architecture parameter sets. Finally, our designs are developed in RTL and evaluated on a Xilinx ZCU102. Experimental…
8. SAUST: A Scheme for Acceleration of Unstructured Sparse Transformer
Yifan Song, Shunpeng Zhao, Song Chen - IEEE, 2022
The Transformer achieves impressive results on many AI tasks. However, it also incurs a huge amount of computation. Pruning is a promising method to reduce the computation load by generating sparse transformer models. To avoid the load imbalance caused by computations involving zero elements, previous works explore structured pruning combined with hardware acceleration. However, the tight constraints of structured pruning usually make training much harder and reach a lower sparsity level in the end. This paper proposes SAUST, a scheme that exploits the high sparsity level of unstructured pruning and addresses the load imbalance problem using both hardware and software methods. FPGA implementation shows that SAUST can achieve 3.35x and 2.76x execution time speedup compared to two state-of-the-art references on hardware accelerators.
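To see why unstructured sparsity creates the load-imbalance problem SAUST targets, consider this small sketch of magnitude pruning: the smallest-magnitude weights are zeroed anywhere in the tensor, so the number of non-zeros per row varies. The threshold and shapes are illustrative.

```python
# Unstructured magnitude pruning sketch: zero the smallest weights globally,
# leaving an irregular sparsity pattern with uneven work per row.
import numpy as np

def magnitude_prune(W, sparsity=0.75):
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) > threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W_sparse, mask = magnitude_prune(W, sparsity=0.75)
# Non-zeros per row vary, which is exactly the imbalance a hardware
# scheduler has to smooth out.
print("non-zeros per row:", mask.sum(axis=1))
```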
9. Galvatron
Xupeng Miao, Y X Wang, Youhe Jiang - Association for Computing Machinery (ACM), 2022
Transformer models have achieved state-of-the-art performance in various application domains and are gradually becoming the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging due to the large number of parallelism choices. Existing DL systems either rely on manual effort to make distributed training plans or apply parallelism combinations within a very limited search space. In this paper, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such a huge search space, we 1) use a decision tree for decomposition and pruning based on reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. Among all evaluated scenarios, Galvatron…
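As a hedged sketch of the layer-wise dynamic-programming idea (not Galvatron's actual cost model or search space), the snippet below picks one parallelism strategy per layer so that total time is minimized while total memory stays within a budget. The strategy names and costs are made up for illustration.

```python
# Toy DP over per-layer parallelism choices under a memory budget.
STRATEGIES = {            # (time cost, memory cost) per layer, arbitrary units
    "data_parallel":     (1.0, 4.0),
    "tensor_parallel":   (1.4, 2.0),
    "pipeline_parallel": (1.8, 1.0),
}

def search(num_layers, memory_budget):
    # dp maps used-memory -> (total time, chosen plan), extended layer by layer.
    dp = {0.0: (0.0, [])}
    for _ in range(num_layers):
        nxt = {}
        for mem, (time, plan) in dp.items():
            for name, (t, m) in STRATEGIES.items():
                new_mem = round(mem + m, 6)
                if new_mem > memory_budget:
                    continue
                cand = (time + t, plan + [name])
                if new_mem not in nxt or cand[0] < nxt[new_mem][0]:
                    nxt[new_mem] = cand
        dp = nxt
    return min(dp.values(), key=lambda x: x[0]) if dp else None

print(search(num_layers=4, memory_budget=10.0))
```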
10. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Zhengda Bian, Hongxin Liu, Boxiang Wang, 2021
The success of Transformer models has pushed the deep learning model scale to billions of parameters. However, due to the limited memory resources of a single GPU, the best practice for choosing the optimal parallel strategy is still lacking, since it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addresses this challenge by introducing a unified interface to scale sequential model-training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with the zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
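To illustrate one of the parallelism dimensions mentioned above, here is a toy NumPy simulation of tensor parallelism: a linear layer's weight matrix is split column-wise across "devices", each computes a partial output, and the results are concatenated. This is a conceptual sketch, not Colossal-AI's API.

```python
# Tensor-parallelism sketch: column-split a weight matrix across devices and
# recombine the partial outputs; the result matches the single-device matmul.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))        # batch of activations
W = rng.standard_normal((16, 32))       # full weight matrix

num_devices = 4
shards = np.split(W, num_devices, axis=1)       # each device holds a 16 x 8 shard
partial_outputs = [x @ shard for shard in shards]
y_parallel = np.concatenate(partial_outputs, axis=1)

print(np.allclose(y_parallel, x @ W))   # True: same result as one device
```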