Memory-Efficient Backpropagation Techniques for Deep Learning
Training large neural networks requires substantial memory, with intermediate activations often consuming more memory than the model parameters. A typical transformer model with 1 billion parameters can require over 100GB of memory during training, primarily because activations must be stored for backpropagation. This memory bottleneck limits the size of models that can be trained on available hardware.
The fundamental challenge lies in balancing computational efficiency against memory usage while maintaining numerical stability and convergence properties during training.
This page brings together solutions from recent research, including gradient checkpointing, reversible layer architectures, activation recomputation strategies, and memory-efficient optimizer implementations. These and other approaches enable training of larger models on limited hardware resources while preserving training dynamics and model performance.
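As a rough illustration of gradient checkpointing, the sketch below compares a backward pass that stores every layer input with one that keeps only every fourth input and recomputes the rest segment by segment. The toy chain of `tanh` layers and all function names are assumptions made for this example; they do not come from any of the documents listed here.

```python
import math

NUM_LAYERS = 8  # toy chain: every layer is y = tanh(x)

def layer(x):
    return math.tanh(x)

def layer_grad(x):
    return 1.0 - math.tanh(x) ** 2

def grad_store_all(x):
    """Baseline: keep every layer input, then backpropagate."""
    inputs = []
    for _ in range(NUM_LAYERS):
        inputs.append(x)
        x = layer(x)
    grad = 1.0
    for a in reversed(inputs):
        grad *= layer_grad(a)
    return grad, len(inputs)

def grad_checkpointed(x, every=4):
    """Keep only every `every`-th layer input; recompute the others per segment."""
    checkpoints = {}
    for i in range(NUM_LAYERS):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(x)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        # Recompute the inputs of this segment from its checkpoint.
        a, inputs = checkpoints[start], []
        for _ in range(start, min(start + every, NUM_LAYERS)):
            inputs.append(a)
            a = layer(a)
        # Upper bound on values held at once: checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for a_in in reversed(inputs):
            grad *= layer_grad(a_in)
    return grad, peak

full, stored_full = grad_store_all(0.5)
ckpt, stored_ckpt = grad_checkpointed(0.5, every=4)
```

Both functions return the same gradient. The checkpointed version holds fewer values at a time, at the price of running each layer's forward computation a second time.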
1. Deep Learning Model Training with Threshold-Based Intermediate Result Storage to Minimize Memory Access Time During Backpropagation
PREFERRED NETWORKS INC, 2025
Efficiently training deep learning models by reducing the memory access time during backpropagation without requiring a faster memory or interface. The method stores an intermediate result of the forward pass in memory if it exceeds a threshold based on its computation cost and size. This avoids fetching those results from memory during backpropagation, reducing memory access time. However, the higher computation cost of the forward pass increases the total training time.
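A threshold rule of this kind might look like the following sketch. The scoring formula (recomputation cost per byte), the `Intermediate` type, and the numbers are all hypothetical; the summary above does not state how the threshold is actually derived.

```python
from dataclasses import dataclass

@dataclass
class Intermediate:
    name: str
    recompute_cost: float  # assumed unit: milliseconds to recompute the result
    size_bytes: int        # memory needed to keep the result

def plan_storage(intermediates, threshold):
    """Split results into stored and recomputed lists by cost per byte (assumed score)."""
    stored, recomputed = [], []
    for t in intermediates:
        score = t.recompute_cost / t.size_bytes
        (stored if score > threshold else recomputed).append(t.name)
    return stored, recomputed

results = [
    Intermediate("conv1", recompute_cost=8.0, size_bytes=4_000_000),
    Intermediate("relu1", recompute_cost=0.1, size_bytes=4_000_000),
    Intermediate("matmul", recompute_cost=5.0, size_bytes=1_000_000),
]
stored, recomputed = plan_storage(results, threshold=1e-6)
```

With these made-up numbers, the cheap-to-recompute `relu1` output falls below the threshold, while the two expensive results are kept.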
2. Sparse Data Storage System Utilizing Non-Zero Value Retention and Offset Calculation for Efficient Transmission
SHENZHEN CORERAIN TECHNOLOGIES CO LTD, 2025
Sparse data storage technique for deep learning to reduce the amount of data that needs to be transmitted during deep learning operations to lower the bandwidth requirements and improve efficiency. It stores only the non-zero values in memory and calculates the offset between them when transmitting. This avoids sending zeros and allows effective reduction of data size and bandwidth demand for deep learning applications.
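The idea of keeping only non-zero values plus the offsets between them can be sketched as below. The exact encoding in the patent is not described here, so the layout (gap from the previous non-zero index, plus the original length) is an assumption for illustration.

```python
def encode_sparse(values):
    """Keep non-zero values plus the gap (offset) from the previous non-zero index."""
    nonzeros, offsets, prev = [], [], -1
    for i, v in enumerate(values):
        if v != 0:
            nonzeros.append(v)
            offsets.append(i - prev)
            prev = i
    return nonzeros, offsets, len(values)

def decode_sparse(nonzeros, offsets, length):
    """Rebuild the dense list by walking the offsets."""
    dense, pos = [0] * length, -1
    for v, off in zip(nonzeros, offsets):
        pos += off
        dense[pos] = v
    return dense

data = [0, 0, 3, 0, 0, 0, 7, 1, 0]
nonzeros, offsets, length = encode_sparse(data)
restored = decode_sparse(nonzeros, offsets, length)
```

Here nine dense entries shrink to three values and three small offsets, and decoding restores the original list exactly.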
3. Computing Resource Scheduling for Deep Learning Graphs with Integrated Forward and Backward Propagation Cycles
DELL PRODUCTS LP, 2025
Efficiently scheduling computing resources for deep learning models with backpropagation without needing a split into separate forward and backward graphs. The scheduling method involves handling the computing graph with cycles that arise in backpropagation. The graph contains nodes representing operators for forward propagation and their corresponding gradient operators for backpropagation. Instead of splitting into separate DAGs, the full graph is used for scheduling. This preserves operator correlations and avoids unnecessary resource duplication and data transfer between devices.
4. DRAM-Based In-Memory Computing with Bit-Serial XNOR Operations for Neural Network Acceleration
MICRON TECH INC, 2025
In-memory computing using dynamic random-access memory (DRAM) for deep neural network acceleration. The technique involves performing bit-serial XNOR computations directly in the charge domain of the DRAM cells. This allows executing multiplication-accumulation operations for binary neural networks inside the memory device itself. It reduces power consumption and improves throughput compared to using a processor to read/write from DRAM for every computation step. The technique leverages the charge storage capability of DRAM cells to compute binary operations internally without requiring frequent read/write cycles.
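The arithmetic behind XNOR-based multiply-accumulate for binary networks can be shown in software. This sketch only illustrates the math, assuming the common encoding of +1 as bit 1 and -1 as bit 0; it says nothing about how the charge-domain DRAM circuit works.

```python
def xnor_dot_bit_serial(a_bits, w_bits):
    """Dot product of two {-1,+1} vectors encoded as bits (1 -> +1, 0 -> -1)."""
    matches = 0
    for a, w in zip(a_bits, w_bits):  # one bit position per step (bit-serial)
        matches += 1 - (a ^ w)        # XNOR: 1 when the two bits agree
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - len(a_bits)

def reference_dot(a_bits, w_bits):
    """Same product computed directly on the decoded {-1,+1} values."""
    return sum((2 * a - 1) * (2 * w - 1) for a, w in zip(a_bits, w_bits))

activations = [1, 0, 1, 1, 0, 1]
weights = [1, 1, 0, 1, 0, 0]
result = xnor_dot_bit_serial(activations, weights)
```

Counting agreements and rescaling gives the same value as multiplying the decoded values, which is why a binary network's multiply-accumulate can be reduced to XNOR plus a population count.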
5. Message-Based Multi-Processor System with Coordinate-Driven Message Routing in Neural Network Configuration
SNAP INC, 2025
A message-based multi-processor system that can be configured as a deep neural network while requiring less memory compared to traditional neural networks. The system uses a message exchange network connecting clusters of processors. Each processor cluster has elements that can transmit messages to destinations. The system uses a logic module in each cluster to determine if a message should be sent to a specific destination based on computed coordinate ranges. This avoids needing a separate lookup table for each destination core. The logic module computes potential destination ranges from the processor cluster's own coordinates, then checks if the computed ranges overlap the destination's ranges. If so, it enables message transmission.
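The range-overlap check described above can be sketched in a few lines. The rule that derives a cluster's destination window from its own coordinates (a square of radius `reach`) is a hypothetical stand-in, since the summary does not give the actual computation.

```python
def destination_range(own_xy, reach):
    """Hypothetical rule: a cluster targets a square window around its own coordinates."""
    x, y = own_xy
    return (x - reach, x + reach), (y - reach, y + reach)

def overlaps(a, b):
    """True when closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def should_send(own_xy, reach, dest_ranges):
    """Enable transmission only if the computed ranges overlap the destination's ranges."""
    rx, ry = destination_range(own_xy, reach)
    dx, dy = dest_ranges
    return overlaps(rx, dx) and overlaps(ry, dy)

dest = ((6, 8), (4, 6))  # destination's x range and y range
near = should_send((5, 5), 1, dest)
far = should_send((0, 0), 1, dest)
```

Because the decision is computed from coordinates, no per-destination lookup table is needed in this sketch.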
6. A Novel Approach to Gradient Evaluation and Efficient Deep Learning: A Hybrid Method
Bogdan Dorneanu, Vasileios Mappas, Harvey Arellano-Garcia - PSE Press, 2025
Deep learning faces significant challenges in efficiently training large-scale models. These issues are closely linked, as efficient training often depends on precise and computationally feasible gradient calculations. This work introduces innovative methodologies to improve deep learning network (DLN) training for complex systems. A novel approach to DLN training is proposed by adapting the block coordinate descent (BCD) method, which optimizes individual layers sequentially. This is combined with traditional batch-based training to create a hybrid method that harnesses the strengths of both techniques. Additionally, the study explores Iterated Control Random Search (ICRS) for initializing parameters and applies quasi-Newton methods like L-BFGS with restricted iterations to enhance optimization. By tackling training efficiency, this contribution offers a comprehensive framework to address key challenges in modern machine learning. The work addresses scalability and effectiveness, especially in handling real-world problems. Examples from Process Systems Engineering illustrate how these advancements can directly...
7. CaDCR: An Efficient Cascaded Dynamic Collaborative Reasoning Framework for Intelligent Recognition Systems
Bowen Li, Xudong Cao, Jun Li - Multidisciplinary Digital Publishing Institute, 2025
To address the challenges of high computational cost and energy consumption posed by deep neural networks in embedded systems, this paper presents CaDCR, a lightweight dynamic collaborative reasoning framework. By integrating a feature discrepancy-guided skipping mechanism with a depth-sensitive early exit mechanism, the framework establishes a hierarchical decision logic: it dynamically selects the execution paths of network blocks based on the complexity of input samples and enables early exit for simple samples through shallow confidence assessment, thereby forming an adaptive resource allocation strategy. CaDCR can both constantly suppress unnecessary computational cost and satisfy hard resource constraints by forcibly terminating the inference process for all samples. Based on this framework, we design a cascaded inference system tailored for embedded deployment to tackle practical challenges. Experiments on the CIFAR-10/100 and SpeechCommands datasets demonstrate that CaDCR maintains accuracy comparable to or higher than baseline models while significantly reducing computational cost by approximately 40 to 70% within a controllable accuracy loss margin. In tests on the STM32 platform, the framework's performance matches theoretical expectations, further verifyin...
8. Theoretical Limits of Feedback Alignment in Preference-based Fine-tuning of AI Models
Zhenyu Gao, 2025
Feedback alignment (FA) has emerged as an alternative to backpropagation for training deep networks by using fixed random feedback weights. While FA shows promise in supervised tasks, its extension to preference-based fine-tuning (PFT) of large language models, which relies on human or learned preference signals, remains underexplored. In this work, we analyze the theoretical limitations of FA applied to PFT objectives. We derive error propagation bounds, characterize convergence conditions for paired-FA updates, and quantify the impact of noise and mismatch on stability. By integrating recent advances in meta-reinforcement learning and prompt compression, we highlight trade-offs between complexity and efficiency, offering practical guidelines for hybrid FA-backprop architectures in large-scale optimization.
9. Matrix Accelerator with Multi-Stage Systolic Array and Output Sparsity Metadata for Parallel Processors
INTEL CORP, 2025
Matrix accelerator for parallel processors like GPUs to improve efficiency of matrix operations in machine learning workloads. The accelerator uses a multi-stage systolic array with sparsity support. It receives output sparsity metadata that indicates the outputs for which multiply-accumulate operations can be bypassed. This allows accelerating the backward propagation pass in training by avoiding unnecessary computations for zeroed-out weights. The accelerator can power gate multipliers and adders based on output sparsity, independent of input sparsity.
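In software terms, output sparsity metadata acts like a mask over the result matrix: masked-out outputs never issue multiply-accumulate operations. The sketch below is a plain-Python analogy with an assumed 0/1 mask format, not a model of the systolic array hardware.

```python
def masked_matmul(a, b, output_mask):
    """Compute C = A @ B, skipping every output whose mask bit is 0."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            if not output_mask[i][j]:
                continue  # bypass: no multiply-accumulate issued for this output
            for k in range(inner):
                c[i][j] += a[i][k] * b[k][j]
                macs += 1
    return c, macs

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
mask = [[1, 0], [0, 1]]  # hypothetical metadata: only the diagonal outputs are needed
c, macs = masked_matmul(a, b, mask)
```

With half of the outputs masked, the example performs 4 multiply-accumulates instead of 8, regardless of whether the inputs themselves contain zeros.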
10. FairQuanti: Enhancing Fairness in Deep Neural Network Quantization via Neuron Role Contribution
Jinyin Chen, Zhiqi Cao, Xiaojuan Wang - Association for Computing Machinery, 2025
The increasing complexity of deep neural networks (DNNs) poses significant resource challenges for edge devices, prompting the development of compression technologies like model quantization. However, while improving efficiency, quantization can introduce or perpetuate the original model's bias. Existing debiasing methods for quantized models often incur additional costs. To address this issue, we propose FairQuanti, a novel approach that leverages neuron role contribution to achieve fairness. By distinguishing between biased and normal neurons, FairQuanti employs mixed precision quantization to mitigate bias during the quantization process. FairQuanti has four key differences from previous studies: (1) Neuron Roles - It formally defines neuron roles, establishing a framework for feasible bias mitigation; (2) Effectiveness - It introduces a fair quantization strategy that discriminatively quantizes neuron roles, balancing accuracy and fairness through Bayesian optimization; (3) Generality - It applies to both structured and unstructured data across various quantization bit levels; (4) Robustness - It demonstrates resilience against adaptive attacks. Extensive experiments on five datasets (three structured and two unstructured) using different valida...
11. Sparse Neural Network Training with In-Situ Synapse Pruning and Compact Indexing Scheme
NANO DIMENSION TECHNOLOGIES LTD, 2025
Generating sparse neural networks during training and representing them in a compact format to improve efficiency and reduce memory requirements. The method involves pruning synapse connections during training instead of just post-processing. It also uses a unique indexing scheme for the sparse weights that eliminates storing and processing disconnected synapses. This allows faster prediction and training as the computation and storage scales linearly with sparsity.
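A compact representation that skips disconnected synapses can be sketched with coordinate triplets. Both the magnitude-based pruning criterion and the `(row, col, weight)` format are common stand-ins chosen for illustration; the patent's own indexing scheme is not described in the summary above.

```python
def prune_to_triplets(weights, threshold):
    """Keep only synapses with |w| >= threshold as (row, col, weight) triplets."""
    return [(i, j, w)
            for i, row in enumerate(weights)
            for j, w in enumerate(row)
            if abs(w) >= threshold]

def sparse_forward(triplets, x, n_out):
    """y = W @ x using only the surviving synapses."""
    y = [0.0] * n_out
    for i, j, w in triplets:  # work scales with the number of kept synapses
        y[i] += w * x[j]
    return y

weights = [[0.9, 0.01, 0.0],
           [-0.02, 0.0, -0.5]]
triplets = prune_to_triplets(weights, threshold=0.1)
y = sparse_forward(triplets, [1.0, 2.0, 4.0], n_out=2)
```

Only two of the six weights survive pruning here, so both storage and the forward pass touch two entries instead of six.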
12. Neural Network Training System with In-Situ Weight Quantization Using Ensemble Kalman Filter
TDK CORP, 2025
Online learning program and learner for neural networks that quantizes weights during training to reduce computation and memory usage, especially for edge devices with resource constraints. The learning involves an ensemble Kalman filter to estimate weight updates in a shorter bit representation, then quantizing that representation to the final quantized weight. This allows quantizing weights during learning instead of just during inference.
13. A Topological Improvement of the Overall Performance of Sparse Evolutionary Training: Motif-Based Structural Optimization of Sparse MLPs Project
Xiaotian Chen, Hongyun Liu, Seyed Sahand Mohammadi Ziabari, 2025
Deep Neural Networks (DNNs) have been proven to be exceptionally effective and have been applied across diverse domains within deep learning. However, as DNN models increase in complexity, the demand for reduced computational costs and memory overheads has become increasingly urgent. Sparsity has emerged as a leading approach in this area. The robustness of sparse Multi-layer Perceptrons (MLPs) for supervised feature selection, along with the application of Sparse Evolutionary Training (SET), illustrates the feasibility of reducing computational costs without compromising accuracy. Moreover, it is believed that the SET algorithm can still be improved through a structural optimization method called motif-based optimization, with potential efficiency gains exceeding 40% and a performance decline of under 4%. This research investigates whether the structural optimization of Sparse Evolutionary Training applied to Multi-layer Perceptrons (SET-MLP) can enhance performance and to what extent such improvement can be achieved.
14. Neural Network Training Method Utilizing Precompiled Code Reuse for Computational Graph Execution
HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD, 2025
Method for training neural networks with improved efficiency by reusing compiled code from prior training rounds. The method involves determining if a compiled code for a current computational graph exists in the system before executing it. If the compiled code exists, it is directly executed instead of generating a new one. This leverages previous compilations to avoid redundant steps and reduce resource usage compared to always regenerating the compiled code.
15. AI Model Training Method with Selective Computation Omission Based on Dynamic Confidence Threshold
KOREA UNIVERSITY RESEARCH AND BUSINESS FOUNDATION, 2025
Reducing computation and energy for training AI models by selectively omitting computations for images with high prediction confidence. The method involves determining a threshold confidence level for images during training. If an image's confidence exceeds the threshold, the weights are only partially updated for that image instead of performing a full backward propagation and weight update. This allows omitting some computations for images with low error, because the noise tolerance of mini-batch gradient descent allows the weight change to be approximated. A dynamic threshold is set based on allowable error to balance omissions and learning quality.
16. Energy-Aware Machine Learning Algorithm Design
Dheeraj Vaddepally - International Journal for Multidisciplinary Research (IJFMR), 2025
The exponential increase in machine learning (ML) use on mobile and edge devices indicates a necessity to adopt efficient algorithm design to conserve energy for future consumption sustainability. Power reduction on energy-constrained platforms like smartphones, Internet of Things devices, and autonomous cars, at both training and inference, is of critical importance. This book discusses techniques for energy-conscious algorithms, specifically CPU and GPU profiling and reducing the power usage with optimization techniques. Profiling tools are discussed to find out the energy requirements of various algorithms. Model pruning, quantization, knowledge distillation, and low-precision inference are discussed for minimizing energy usage. For training, energy-efficient backpropagation, optimizers, and distributed training are taken into account. The work also discusses efficiency-performance trade-offs and the promise of energy-aware NAS and dynamic resource management. The influence is shown through examples of IoT device, edge computing, and data center applications. Last but not least, hardware constraints and scalability issues are presented, and directions for designing more energy-efficient ML systems are provided.
17. Neural Network Architecture with Locality-Sensitive Hashing Attention and Reversible Residual Connections for Sequential Data
GOOGLE LLC, 2025
Efficiently performing machine learning tasks on sequential data using neural networks by leveraging locality-sensitive hashing (LSH) attention and reversible residual connections. The LSH attention mechanism restricts the set of positions a query can attend to based on similarity, reducing computational costs compared to dot-product attention. Reversible residual connections allow recovering intermediate layer activations during backpropagation without storing all activations. This eliminates the need to save all layer activations for training.
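The reason reversible residual connections remove the need to store activations is that a block's inputs can be reconstructed exactly from its outputs. The sketch below uses the standard two-stream reversible formulation (y1 = x1 + F(x2), y2 = x2 + G(y1)); the particular functions `F` and `G` are arbitrary placeholders, not the attention and feed-forward layers of the patented architecture.

```python
import math

def F(v):
    return [math.tanh(t) for t in v]  # placeholder sub-layer

def G(v):
    return [0.5 * t * t for t in v]   # placeholder sub-layer

def rev_forward(x1, x2):
    """y1 = x1 + F(x2); y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, undoing the two additions in reverse order."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2], [0.8, 0.1]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
```

During backpropagation, each block can call the inverse to regenerate its inputs on the fly, so only the final layer's outputs need to be kept in memory. The recovered values match the originals up to floating-point rounding.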
18. Method for Selecting Operators with Evaluation Parameters for Recomputation in Deep Learning Models
BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO LTD, 2025
Method to optimize recomputation in deep learning models by intelligently selecting operators to participate in recomputation to improve computational efficiency. The method involves determining a recomputation evaluation parameter for each operator based on storage and computation time. Operators with high evaluation are selected to participate in recomputation. This allows sacrificing graphics memory for computation by storing intermediate results of selected operators. By intelligently choosing operators, it enables more computation in graphics memory to improve performance.
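One way to picture operator selection is a greedy ranking by an evaluation score. This sketch assumes the usual meaning of recomputation, where the outputs of selected operators are dropped after the forward pass and rebuilt during the backward pass. The score (bytes freed per millisecond of recompute time), the operator list, and the numbers are hypothetical and may differ from the patent's evaluation parameter.

```python
def select_for_recompute(ops, bytes_to_free):
    """ops: list of (name, output_bytes, compute_ms). Greedy by bytes freed per ms."""
    ranked = sorted(ops, key=lambda o: o[1] / o[2], reverse=True)
    chosen, freed = [], 0
    for name, size, _ in ranked:
        if freed >= bytes_to_free:
            break
        chosen.append(name)
        freed += size
    return chosen, freed

ops = [
    ("gelu", 8_000_000, 0.2),
    ("attention", 16_000_000, 6.0),
    ("layernorm", 8_000_000, 0.4),
    ("matmul", 4_000_000, 3.0),
]
chosen, freed = select_for_recompute(ops, bytes_to_free=12_000_000)
```

Cheap element-wise operators with large outputs rank first, so the memory target is met without recomputing the expensive attention or matmul operators.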
19. Optimizing Deep Learning Models for Resource‐Constrained Environments With Cluster‐Quantized Knowledge Distillation
Niaz Ashraf Khan, A M Saadman Rafat - Wiley, 2025
Deep convolutional neural networks (CNNs) are highly effective in computer vision tasks but remain challenging to deploy in resource-constrained environments due to their high computational and memory requirements. Conventional model compression techniques, such as pruning and post-training quantization, often compromise accuracy by decoupling compression from training. Furthermore, traditional knowledge distillation approaches rely on full-precision teacher models, limiting their effectiveness in compressed settings. To address these issues, we propose Cluster-Quantized Knowledge Distillation (CQKD), a novel framework that integrates structured pruning with knowledge distillation, incorporating cluster-based weight quantization directly into the training loop. Unlike existing methods, CQKD applies quantization to both the teacher and student models, ensuring a more effective transfer of compressed knowledge. By leveraging layer-wise K-means clustering, our approach achieves extreme compression while maintaining high accuracy. Experimental results on CIFAR-10 and CIFAR-100 demonstrate the effectiveness of CQKD, achieving compression ratios of 34,000 while preserving competitive accuracy: 97.9% on CIFAR-10 and 91.2% on CIFAR-100. These highlight ...
20. The Hessian by blocks for neural network by backward propagation
Radhia Bessi, Nabil Gmati - Informa UK Limited, 2024
The back-propagation algorithm used with a stochastic gradient and the increase in computer performance are at the origin of the recent Deep learning trend. For some problems, however, the convergence of gradient methods is still very slow. Newton's method offers potential advantages in terms of faster convergence. This method uses the Hessian matrix to guide the optimization process but increases the computational cost at each iteration. Indeed, although the expression of the Hessian matrix is explicitly known, previous work did not propose an efficient algorithm for its fast computation. In this work, we first propose a backward algorithm to compute the exact Hessian matrix. In addition, the introduction of original operators, for the calculation of second derivatives, facilitates the reading and allows the parallelization of the backward-looking algorithm. To study the practical performance of Newton's method, we apply the proposed algorithm to train two classical neural networks for regression and classification problems and display the associated numerical results.


