Memory-Efficient Backpropagation Techniques for Deep Learning
Training large neural networks requires substantial memory, with intermediate activations often consuming more memory than the model parameters themselves. A typical transformer model with 1 billion parameters can require over 100GB of memory during training, primarily because forward-pass activations must be stored for backpropagation. This memory bottleneck limits the size of models that can be trained on available hardware.
The fundamental challenge lies in balancing computational efficiency against memory usage while maintaining numerical stability and convergence properties during training.
This page brings together solutions from recent research—including gradient checkpointing, reversible layer architectures, activation recomputation strategies, and memory-efficient optimizer implementations. These and other approaches enable training of larger models on limited hardware resources while preserving training dynamics and model performance.
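To ground the recomputation idea before the individual papers and patents, here is a minimal gradient-checkpointing sketch in PyTorch; the toy model, layer sizes, and segment count are illustrative assumptions, not drawn from any entry below:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy deep MLP; depth and width are arbitrary.
blocks = [nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(16)]
model = nn.Sequential(*blocks)

x = torch.randn(32, 512, requires_grad=True)

# Keep activations only at 4 segment boundaries; everything inside a
# segment is recomputed on the fly during the backward pass.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```

The trade is extra forward computation inside each segment against activation memory that now grows with the number of segments rather than the number of layers.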
1. A Novel Approach to Gradient Evaluation and Efficient Deep Learning: A Hybrid Method
Bogdan Dorneanu, Vasileios Mappas, Harvey Arellano-Garcia - PSE Press, 2025
Deep learning faces significant challenges in efficiently training large-scale models. These issues are closely linked, as efficient training often depends on precise and computationally feasible gradient calculations. This work introduces innovative methodologies to improve the training of deep learning networks (DLNs) for complex systems. A novel approach to DLN training is proposed by adapting the block coordinate descent (BCD) method, which optimizes individual layers sequentially. This is combined with traditional batch-based training to create a hybrid method that harnesses the strengths of both techniques. Additionally, the study explores Iterated Control Random Search (ICRS) for initializing parameters and applies quasi-Newton methods such as L-BFGS with restricted iterations to enhance optimization. By tackling gradient evaluation and training efficiency together, this contribution offers a comprehensive framework addressing key challenges in modern machine learning. The framework shows scalability and effectiveness, especially in handling real-world problems. Examples from Process Systems Engineering illustrate how these advancements can be applied directly.
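The paper's exact hybrid is not reproduced here, but a minimal sketch of the underlying block coordinate descent idea, assuming a toy PyTorch network and generic L-BFGS steps with restricted iterations, might look like:

```python
import torch
import torch.nn as nn

# Toy data and network; this is a generic BCD sketch, not the paper's
# exact combination of BCD, ICRS initialization, and batch training.
layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])
x, y = torch.randn(64, 10), torch.randn(64, 10)
loss_fn = nn.MSELoss()

def forward(inp):
    h = inp
    for layer in layers[:-1]:
        h = torch.relu(layer(h))
    return layers[-1](h)

for epoch in range(5):
    for block in layers:                  # one coordinate block = one layer
        for p in layers.parameters():     # freeze everything...
            p.requires_grad_(False)
        for p in block.parameters():      # ...except the active layer
            p.requires_grad_(True)
        opt = torch.optim.LBFGS(block.parameters(), max_iter=5)  # restricted iterations

        def closure():
            opt.zero_grad()
            loss = loss_fn(forward(x), y)
            loss.backward()
            return loss

        opt.step(closure)
```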
2. CaDCR: An Efficient Cascaded Dynamic Collaborative Reasoning Framework for Intelligent Recognition Systems
Bowen Li, Xudong Cao, Jun Li - Multidisciplinary Digital Publishing Institute, 2025
To address the challenges of high computational cost and energy consumption posed by deep neural networks in embedded systems, this paper presents CaDCR, a lightweight dynamic collaborative reasoning framework. By integrating a feature discrepancy-guided skipping mechanism with a depth-sensitive early exit mechanism, the framework establishes hierarchical decision logic: it dynamically selects execution paths through network blocks based on the complexity of input samples and enables early exit for simple samples through shallow confidence assessment, thereby forming an adaptive resource allocation strategy. CaDCR can both constantly suppress unnecessary computation and satisfy hard constraints by forcibly terminating the inference process for all samples. Based on this framework, we design a cascaded system tailored for deployment to tackle practical challenges. Experiments on the CIFAR-10/100 and SpeechCommands datasets demonstrate that CaDCR maintains accuracy comparable to or higher than baseline models while significantly reducing computation by approximately 40–70% within a controllable accuracy loss margin. In tests on the STM32 platform, the framework's performance matches theoretical expectations, further verifying…
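A heavily simplified sketch of a confidence-gated early exit of the kind this abstract describes; the two-stage model, layer sizes, and the 0.9 threshold are assumptions, not CaDCR's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-stage classifier with one early-exit head.
shallow = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
exit_head = nn.Linear(64, 10)     # early-exit classifier
deep = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
final_head = nn.Linear(64, 10)

def infer(x, conf_threshold=0.9):
    h = shallow(x)
    early_logits = exit_head(h)
    conf, pred = F.softmax(early_logits, dim=-1).max(dim=-1)
    if conf.item() >= conf_threshold:   # confident: exit early and skip
        return pred, "early"            # the deeper blocks entirely
    return final_head(deep(h)).argmax(dim=-1), "deep"

pred, path = infer(torch.randn(1, 32))
print(pred.item(), path)
```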
3. Theoretical Limits of Feedback Alignment in Preference-based Fine-tuning of AI Models
Zhenyu Gao, 2025
Feedback alignment (FA) has emerged as an alternative to backpropagation for training deep networks by using fixed random feedback weights. While FA shows promise in supervised tasks, its extension to preference-based fine-tuning (PFT) of large language models, which relies on human or learned preference signals, remains underexplored. In this work, we analyze the theoretical limitations of FA applied to PFT objectives. We derive error propagation bounds, characterize convergence conditions for paired-FA updates, and quantify the impact of noise and mismatch on stability. By integrating recent advances in meta-reinforcement learning and prompt compression, we highlight trade-offs between complexity and efficiency, offering practical guidelines for hybrid FA-backprop architectures in large-scale optimization.
4. Matrix Accelerator with Multi-Stage Systolic Array and Output Sparsity Metadata for Parallel Processors
INTEL CORP, 2025
Matrix accelerator for parallel processors such as GPUs that improves the efficiency of matrix operations in machine learning workloads. The accelerator uses a multi-stage systolic array with sparsity support. It receives output sparsity metadata indicating which outputs can bypass multiply-accumulate operations. This accelerates the backward propagation pass in training by avoiding unnecessary computations for zeroed-out weights. The accelerator can also power-gate multipliers and adders based on output sparsity, independently of input sparsity.
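A software analogue may help picture the bypass logic; this sketch assumes a hypothetical output mask and skips the multiply-accumulate loop for masked positions, which is only a rough model of what the hardware does:

```python
import numpy as np

# Output-sparsity bypass, modeled in software: a metadata mask marks
# output positions known to be zero, and the MAC loop skips them.
def masked_matmul(A, B, out_mask):
    m, _ = A.shape
    _, n = B.shape
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            if out_mask[i, j]:          # bypass: leave output at zero
                continue
            out[i, j] = A[i] @ B[:, j]  # compute surviving outputs only
    return out

A, B = np.random.randn(4, 8), np.random.randn(8, 4)
mask = np.random.rand(4, 4) < 0.5       # ~50% of outputs bypassed
print(masked_matmul(A, B, mask))
```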
5. FairQuanti: Enhancing Fairness in Deep Neural Network Quantization via Neuron Role Contribution
Jinyin Chen, Zhiqi Cao, Xiaojuan Wang - Association for Computing Machinery, 2025
The increasing complexity of deep neural networks (DNNs) poses significant resource challenges for edge devices, prompting the development of compression technologies like model quantization. However, while improving efficiency, quantization can introduce or perpetuate the original model's bias. Existing debiasing methods for quantized models often incur additional costs. To address this issue, we propose FairQuanti, a novel approach that leverages neuron role contribution to achieve fairness. By distinguishing between biased and normal neurons, it employs mixed precision to mitigate bias during the quantization process. FairQuanti has four key differences from previous studies: (1) Neuron Roles - it formally defines neuron roles, establishing a framework for feasible bias mitigation; (2) Effectiveness - it introduces a fair quantization strategy that discriminatively quantizes neurons, balancing accuracy and fairness through Bayesian optimization; (3) Generality - it applies to both structured and unstructured data across various bit levels; (4) Robustness - it demonstrates resilience against adaptive attacks. Extensive experiments on five datasets (three structured and two unstructured) using different…
6. Sparse Neural Network Training with In-Situ Synapse Pruning and Compact Indexing Scheme
NANO DIMENSION TECHNOLOGIES LTD, 2025
Generating sparse neural networks during training and representing them in a compact format to improve efficiency and reduce memory requirements. The method prunes synapse connections during training rather than only as a post-processing step. It also uses a unique indexing scheme for the sparse weights that avoids storing and processing disconnected synapses. This allows faster prediction and training, as computation and storage scale linearly with the number of remaining connections.
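A CSR-style compact format illustrates the general idea of indexing only surviving connections; the patent's actual indexing scheme is not specified here, so this is a generic sketch:

```python
import numpy as np

# CSR-like storage: only nonzero (connected) weights are kept.
dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 0.0, 3.0]])

row_ptr, col_idx, values = [0], [], []
for row in dense:
    for j, w in enumerate(row):
        if w != 0.0:                 # disconnected synapses are never stored
            col_idx.append(j)
            values.append(w)
    row_ptr.append(len(values))

def spmv(x):
    # Work scales with the number of stored connections, not dense size.
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

print(spmv(np.array([1.0, 2.0, 3.0])))
```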
7. Neural Network Training System with In-Situ Weight Quantization Using Ensemble Kalman Filter
TDK CORP, 2025
Online learning program and learner for neural networks that quantizes weights during training to reduce computation and memory usage, especially for edge devices with resource constraints. The method uses an ensemble Kalman filter to estimate weight updates in a shorter bit representation and then quantizes that representation to the final quantized weight. This allows quantizing weights during learning instead of only during inference.
8. A Topological Improvement of the Overall Performance of Sparse Evolutionary Training: Motif-Based Structural Optimization of Sparse MLPs Project
Xiaotian Chen, Hongyun Liu, Seyed Sahand Mohammadi Ziabari, 2025
Deep Neural Networks (DNNs) have been proven to be exceptionally effective and applied across diverse domains within deep learning. However, as DNN models increase in complexity, the demand for reduced computational costs and memory overheads has become increasingly urgent. Sparsity has emerged as a leading approach in this area. The robustness of sparse Multi-Layer Perceptrons (MLPs) in supervised feature selection, along with the application of Sparse Evolutionary Training (SET), illustrates the feasibility of reducing computational costs without compromising accuracy. Moreover, it is believed that the SET algorithm can still be improved through a structural optimization called motif-based optimization, with potential efficiency gains exceeding 40% and a performance decline under 4%. This research investigates whether motif-based structural optimization applied to SET-trained MLPs (SET-MLP) can enhance performance, and to what extent such improvement can be achieved.
9. Neural Network Training Method Utilizing Precompiled Code Reuse for Computational Graph Execution
HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD, 2025
Method for training neural networks with improved efficiency by reusing compiled code from prior training rounds. The method checks whether compiled code for the current computational graph already exists in the system before executing it. If it exists, the compiled code is executed directly instead of being regenerated. This leverages previous compilations to avoid redundant compilation steps and reduces resource usage compared to regenerating the compiled code every round.
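In spirit, the mechanism resembles a compile cache keyed by a fingerprint of the graph; the sketch below is a generic illustration with hypothetical names, not Huawei's implementation:

```python
import hashlib

# Compile-once/reuse: cache compiled artifacts keyed by a graph fingerprint.
_compile_cache = {}

def graph_key(graph_ir: str) -> str:
    return hashlib.sha256(graph_ir.encode()).hexdigest()

def compile_graph(graph_ir: str):
    print("compiling...")               # the expensive step we want to avoid
    return f"<binary for {len(graph_ir)}-byte graph>"

def get_or_compile(graph_ir: str):
    key = graph_key(graph_ir)
    if key not in _compile_cache:       # only compile on a cache miss
        _compile_cache[key] = compile_graph(graph_ir)
    return _compile_cache[key]

ir = "matmul -> relu -> matmul -> softmax"
get_or_compile(ir)   # round 1: compiles
get_or_compile(ir)   # round 2: reuses the compiled code, no recompilation
```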
10. AI Model Training Method with Selective Computation Omission Based on Dynamic Confidence Threshold
KOREA UNIVERSITY RESEARCH AND BUSINESS FOUNDATION, 2025
Reducing computation and energy for training AI models by selectively omitting computations for images with high prediction confidence. The method determines a confidence threshold for images during training. If an image's confidence exceeds the threshold, the weights are only partially updated for that image instead of running a full backward propagation and weight update. Some computations can be omitted for images with low error because the noise tolerance of mini-batch gradient descent allows the weight change to be approximated. A dynamic threshold is set based on the allowable error to balance omissions against learning quality.
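A rough sketch of confidence-gated training in PyTorch; note that it fully skips backpropagation for confident samples rather than partially updating them as the patent describes, and the fixed threshold stands in for the patent's dynamic rule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(20, 5)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 20), torch.randint(0, 5, (64,))
threshold = 0.95    # would be adapted from an allowable-error budget

logits = model(x)
conf = F.softmax(logits, dim=-1).max(dim=-1).values
keep = conf < threshold             # confident samples skip backprop

if keep.any():
    loss = F.cross_entropy(logits[keep], y[keep])
    opt.zero_grad()
    loss.backward()                 # gradient over the kept subset only
    opt.step()
```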
11. Energy-Aware Machine Learning Algorithm Design
Dheeraj Vaddepally - International Journal for Multidisciplinary Research (IJFMR), 2025
The exponential increase in machine learning (ML) use on mobile and edge devices indicates a necessity to adopt efficient algorithm design that conserves energy for sustainable consumption. Power reduction on energy-constrained platforms such as smartphones, Internet of Things devices, and autonomous cars, at both training and inference time, is of critical importance. This work discusses techniques for energy-conscious algorithms, specifically CPU and GPU profiling and reducing power usage with software techniques. Profiling tools are discussed for determining the energy requirements of various models; pruning, quantization, knowledge distillation, and low-precision inference are covered for minimizing inference-time usage. For training, efficient backpropagation, optimizers, and distributed approaches are taken into account. The work also covers efficiency-performance trade-offs and the promise of energy-aware NAS and dynamic resource management. The influence of these techniques is shown through examples from IoT devices, edge computing, and data center applications. Last but not least, hardware constraints and scalability issues are presented, and directions for designing more energy-efficient ML systems are provided.
12. Neural Network Architecture with Locality-Sensitive Hashing Attention and Reversible Residual Connections for Sequential Data
GOOGLE LLC, 2025
Efficiently performing machine learning tasks on sequential data using neural networks by leveraging locality-sensitive hashing (LSH) attention and reversible residual connections. The LSH attention mechanism restricts the set of positions a query can attend to based on similarity, reducing computational cost compared to full dot-product attention. Reversible residual connections allow intermediate layer activations to be recovered during backpropagation, eliminating the need to store every layer's activations for training.
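The activation-recovery property is easy to demonstrate with a minimal RevNet-style pair of residual functions; the two Linear+Tanh blocks below are arbitrary choices, not the patent's architecture:

```python
import torch
import torch.nn as nn

# Reversible residual pair: inputs are exactly reconstructable from
# outputs, so intermediate activations need not be stored for backprop.
F_fn = nn.Sequential(nn.Linear(16, 16), nn.Tanh())
G_fn = nn.Sequential(nn.Linear(16, 16), nn.Tanh())

def forward(x1, x2):
    y1 = x1 + F_fn(x2)
    y2 = x2 + G_fn(y1)
    return y1, y2

def inverse(y1, y2):
    # Exact recovery of the inputs from the outputs.
    x2 = y2 - G_fn(y1)
    x1 = y1 - F_fn(x2)
    return x1, x2

x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
with torch.no_grad():
    y1, y2 = forward(x1, x2)
    r1, r2 = inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))
```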
13. Method for Selecting Operators with Evaluation Parameters for Recomputation in Deep Learning Models
BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO LTD, 2025
Method to optimize recomputation in deep learning models by intelligently selecting which operators participate in recomputation, improving computational efficiency. The method determines a recomputation evaluation parameter for each operator based on its storage cost and computation time; operators with high evaluation values are selected to participate in recomputation. The intermediate results of the selected operators are discarded during the forward pass and recomputed during the backward pass, trading extra computation for freed graphics memory. By choosing operators intelligently, the recomputation overhead stays low while more graphics memory becomes available, improving overall performance.
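One plausible reading of the evaluation parameter is memory saved per unit of recompute time; the greedy sketch below assumes that scoring and made-up operator costs, since the patent does not spell out the formula:

```python
# Greedy recomputation selection: score each operator by bytes freed per
# millisecond of recompute time, then pick the best until a budget is met.
ops = [
    # (name, activation_bytes, recompute_ms) -- illustrative numbers
    ("layernorm", 64_000_000, 0.3),
    ("gelu",      64_000_000, 0.2),
    ("matmul",    64_000_000, 4.0),
    ("dropout",   64_000_000, 0.1),
]

budget_bytes = 128_000_000   # memory we need to free
scored = sorted(ops, key=lambda o: o[1] / o[2], reverse=True)

selected, freed = [], 0
for name, nbytes, _ in scored:
    if freed >= budget_bytes:
        break
    selected.append(name)    # this op's output is dropped and recomputed
    freed += nbytes

print(selected)   # cheap-to-recompute ops are chosen first
```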
14. Optimizing Deep Learning Models for Resource‐Constrained Environments With Cluster‐Quantized Knowledge Distillation
Niaz Ashraf Khan, A. M. Saadman Rafat - Wiley, 2025
Deep convolutional neural networks (CNNs) are highly effective in computer vision tasks but remain challenging to deploy in resource-constrained environments due to their high computational and memory requirements. Conventional model compression techniques, such as pruning and post-training quantization, often compromise accuracy by decoupling compression from training. Furthermore, traditional knowledge distillation approaches rely on full-precision teacher models, limiting their effectiveness in compressed settings. To address these issues, we propose Cluster-Quantized Knowledge Distillation (CQKD), a novel framework that integrates structured compression with distillation, incorporating cluster-based weight quantization directly into the training loop. Unlike existing methods, CQKD applies quantization to both teacher and student, ensuring more consistent transfer of knowledge. By leveraging layer-wise K-means clustering, our approach achieves extreme compression while maintaining accuracy. Experimental results on CIFAR-10 and CIFAR-100 demonstrate the effectiveness of CQKD, achieving compression ratios of up to 34,000 while preserving competitive accuracy (97.9% on CIFAR-10 and 91.2% on CIFAR-100). These results highlight…
15. The Hessian by blocks for neural network by backward propagation
Radhia Bessi, Nabil Gmati - Informa UK Limited, 2024
The back-propagation algorithm used with a stochastic gradient and the increase in computer performance are at the origin of the recent deep learning trend. For some problems, however, the convergence of gradient methods is still very slow. Newton's method offers potential advantages in terms of faster convergence. This method uses the Hessian matrix to guide the optimization process but increases the computational cost at each iteration. Indeed, although the expression of the Hessian matrix is explicitly known, previous work did not propose an efficient algorithm for its fast computation. In this work, we first propose a backward algorithm to compute the exact Hessian matrix. In addition, the introduction of original operators for the calculation of second derivatives facilitates readability and allows parallelization of the backward algorithm. To study the practical performance of Newton's method, we apply the proposed algorithm to train two classical neural networks for regression and classification problems and display the associated numerical results.
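This is not the paper's block-wise backward algorithm, but a reference computation of the exact Hessian via automatic differentiation, useful as a correctness baseline; the tiny least-squares loss is an illustrative assumption:

```python
import torch
from torch.autograd.functional import hessian

w = torch.randn(3)
X, y = torch.randn(8, 3), torch.randn(8)

def loss(w):
    return ((X @ w - y) ** 2).mean()

H = hessian(loss, w)      # exact 3x3 Hessian of the loss in the weights
print(H)
# For a mean-squared-error loss the Hessian is (2/N) X^T X, so we can
# sanity-check the automatic-differentiation result against closed form:
print(torch.allclose(H, 2 * X.T @ X / len(y), atol=1e-5))
```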
16. Efficient Implementation of Multilayer Perceptrons: Reducing Execution Time and Memory Consumption
Francisco Cedrón, Sara Alvarez-Gonzalez, Ana Ribas-Rodriguez - MDPI AG, 2024
A technique is presented that reduces the required memory of neural networks through improving weight storage. In contrast to traditional methods, which have an exponential memory overhead with the increase in network size, the proposed method stores only the number of connections between neurons. The proposed method is evaluated on feedforward networks and demonstrates memory saving capabilities of up to almost 80% while also being more efficient, especially with larger architectures.
17. Loss of plasticity in deep continual learning
Shibhansh Dohare, Juan Hernandez-Garcia, Qingfeng Lan - Springer Science and Business Media LLC, 2024
Artificial neural networks, deep-learning methods and the backpropagation algorithm form the foundation of modern machine learning and artificial intelligence…
18. Gradient-free training of recurrent neural networks using random perturbations
Jesús García Fernández, Sander W. Keemink, Marcel van Gerven - Frontiers Media SA, 2024
Recurrent neural networks (RNNs) hold immense potential for computations due to their Turing completeness and sequential processing capabilities, yet existing methods for their training encounter efficiency challenges. Backpropagation through time (BPTT), the prevailing method, extends the backpropagation (BP) algorithm by unrolling the RNN over time. However, this approach suffers from significant drawbacks, including the need to interleave forward and backward phases and store exact gradient information. Furthermore, BPTT has been shown to struggle to propagate gradient information for long sequences, leading to vanishing gradients. An alternative strategy to using gradient-based methods like BPTT involves stochastically approximating gradients through perturbation-based methods. This learning approach is exceptionally simple, necessitating only forward passes in the network and a global reinforcement signal as feedback. Despite its simplicity, the random nature of its updates typically leads to inefficient optimization, limiting its effectiveness in training neural networks…
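A toy forward-only training loop using central-difference weight perturbations illustrates the family of methods discussed; the noise scale, learning rate, and least-squares problem are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)

def loss(v):
    return np.mean((X @ v - y) ** 2)

sigma, lr = 1e-3, 0.05
for _ in range(300):
    eps = rng.normal(size=w.shape) * sigma
    # Gradient estimate from two forward passes only (no backward pass):
    # E[(L(w+eps) - L(w-eps)) / (2*sigma^2) * eps] ~= grad L(w).
    g_hat = (loss(w + eps) - loss(w - eps)) / (2 * sigma**2) * eps
    w -= lr * g_hat

print(loss(w))   # decreases toward the least-squares optimum
```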
19. Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation
Yuchen Yang, Yingdong Shi, Cheems Wang, 2024
Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce…
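A simplified sketch of the activation-function idea: keep the exact GELU forward but backpropagate through a ReLU-style surrogate derivative, so only a boolean mask needs to be saved instead of the full float input. The paper's actual alternatives use combinations of shifted ReLU derivatives; this single-step variant is an illustrative assumption:

```python
import torch

class ApproxGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x > 0)            # boolean mask, not float x
        return torch.nn.functional.gelu(x)      # exact GELU forward

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask                  # ReLU-derivative surrogate

x = torch.randn(4, 8, requires_grad=True)
y = ApproxGELU.apply(x)
y.sum().backward()
print(x.grad)
```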
20. Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training
Zeliang Zhang, Jinyang Jiang, Zhuo Liu, 2024
Efficient and biologically plausible alternatives to backpropagation in neural network training remain a challenge due to issues such as high computational complexity and additional assumptions about neural networks, which limit scalability to deeper networks. The likelihood ratio method offers a promising gradient estimation strategy but is constrained by significant memory consumption, especially when deploying multiple copies of data to reduce estimation variance. In this paper, we introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation. By exploiting the natural parallelism during the backward pass using LR, we further provide a high-performance training strategy, which pipelines both the forward and backward pass, to make it more suitable for the computation on specialized hardware. Extensive experiments demonstrate the effectiveness of the approximation technique in neural network training. This work underscores the potential of the likelihood ratio method in achieving high-performance neural…