Memory-Efficient Backpropagation Techniques for Deep Learning
Training large neural networks requires substantial memory, with intermediate activations often consuming more memory than the model parameters themselves. A typical transformer with 1 billion parameters can require over 100GB of memory during training, primarily because intermediate activations must be stored for the backward pass. This memory bottleneck limits the size of models that can be trained on available hardware.
The fundamental challenge lies in balancing computational efficiency against memory usage while maintaining numerical stability and convergence properties during training.
This page brings together solutions from recent research—including gradient checkpointing, reversible layer architectures, activation recomputation strategies, and memory-efficient optimizer implementations. These and other approaches enable training of larger models on limited hardware resources while preserving training dynamics and model performance.
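Gradient checkpointing, the first of these techniques, is worth making concrete. The sketch below assumes PyTorch and its built-in torch.utils.checkpoint utility (it is not tied to any particular document in this collection): only segment-boundary activations are kept, and interior activations are recomputed during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep MLP whose activations would normally all be stored for backprop.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])

x = torch.randn(32, 1024, requires_grad=True)

# Split the 16 blocks into 4 checkpointed segments: only the segment-boundary
# activations are kept; interior activations are recomputed during backward.
y = checkpoint_sequential(model, 4, x, use_reentrant=False)
loss = y.sum()
loss.backward()
```

For L layers split into k segments, activation memory drops roughly from O(L) to O(L/k + k), at the cost of one extra forward pass per segment.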
1. Neural Network Training Method Utilizing Precompiled Code Reuse for Computational Graph Execution
HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD, 2025
Method for training neural networks with improved efficiency by reusing compiled code from prior training rounds. The method checks whether compiled code for the current computational graph already exists in the system before execution. If it exists, the compiled code is executed directly instead of being regenerated. This reuses prior compilation work, avoiding redundant steps and reducing resource usage compared to regenerating compiled code on every round.
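The patent text does not give an implementation, but the caching pattern it describes can be sketched in a few lines. The class name, hashing choice, and compile_fn hook below are hypothetical illustrations, not the patented mechanism itself.

```python
import hashlib

class CompileCache:
    """Hypothetical cache: reuse compiled code when the same
    computational graph recurs across training rounds."""
    def __init__(self, compile_fn):
        self.compile_fn = compile_fn   # expensive: graph IR -> executable
        self._cache = {}

    def get_executable(self, graph_ir: str):
        # Key on a stable hash of the graph's serialized representation.
        key = hashlib.sha256(graph_ir.encode()).hexdigest()
        if key not in self._cache:          # compile only on first sight
            self._cache[key] = self.compile_fn(graph_ir)
        return self._cache[key]             # reuse on every later round

# Usage: a second lookup of an identical graph skips compilation.
cache = CompileCache(compile_fn=lambda ir: (lambda *inputs: None))
step = cache.get_executable("matmul(add(x, b), W)")
```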
2. AI Model Training Method with Selective Computation Omission Based on Dynamic Confidence Threshold
KOREA UNIVERSITY RESEARCH AND BUSINESS FOUNDATION, 2025
Reducing computation and energy for training AI models by selectively omitting computations for images with high prediction confidence. The method determines a threshold confidence level for images during training. If an image's confidence exceeds the threshold, its weight update is only partially computed instead of performing a full backward pass and weight update. Computation can be omitted for images with low error because the noise tolerance of mini-batch gradient descent allows the weight change to be approximated. A dynamic threshold, set from the allowable error, balances omissions against learning quality.
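A rough reading of this idea in code, assuming PyTorch: the threshold below is a fixed constant for brevity, whereas the patent derives a dynamic threshold from the allowable error, and skipped samples here simply drop out of the loss rather than receiving the patent's partial update.

```python
import torch
import torch.nn.functional as F

def selective_step(model, optimizer, images, labels, threshold=0.95):
    """Sketch: exclude confidently correct samples from the backward pass."""
    logits = model(images)
    conf, pred = F.softmax(logits, dim=1).max(dim=1)
    keep = ~((conf > threshold) & (pred == labels))  # hard or wrong samples
    if keep.any():
        loss = F.cross_entropy(logits[keep], labels[keep])
        optimizer.zero_grad()
        loss.backward()      # gradient contributions only from kept samples
        optimizer.step()
```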
3. AI at Scale: The Infrastructure Revolution Enabling GPT-Class Large Language Models
Sravankumar Nandamuri - Al-Kindi Center for Research and Development, 2025
The extraordinary capabilities of Large Language Models (LLMs) like GPT-4 and Llama 3 have redefined the boundaries of artificial intelligence, yet their transformative power rests upon a foundation of breakthrough infrastructure innovations largely invisible to end users. This article examines the critical technological underpinnings enabling today's frontier models, focusing on memory-efficient parallelism strategies that optimize computational resources, high-throughput interconnect technologies that facilitate massive distributed training, and advanced model sharding techniques, including 4D parallelism, that distribute model components across resources. By exploring the integration of these elements, from specialized hardware accelerators to sophisticated software orchestration systems, we provide insight into how the AI community has overcome seemingly insurmountable barriers to training at unprecedented scale. This understanding offers valuable perspective on both current practice and future directions as the field continues its rapid evolution toward increasingly capable systems.
4. Energy-Aware Machine Learning Algorithm Design
Dheeraj Vaddepally - International Journal for Multidisciplinary Research (IJFMR), 2025
The exponential increase in machine learning (ML) use on mobile and edge devices indicates a need for energy-efficient algorithm design in the interest of sustainability. Reducing power consumption on energy-constrained platforms such as smartphones, Internet of Things devices, and autonomous cars, during both training and inference, is of critical importance. This work discusses techniques for energy-conscious algorithm design, in particular CPU and GPU profiling for reducing power usage. Profiling tools for determining the power requirements of various models are discussed, along with pruning, quantization, knowledge distillation, and low-precision inference for minimizing energy usage. For training, backpropagation, optimizers, and distributed approaches are taken into account. The work also examines efficiency-performance trade-offs and the promise of energy-aware NAS and dynamic resource management. The influence of these techniques is shown through examples from IoT devices, edge computing, and data center applications. Finally, hardware constraints and scalability issues are presented, and directions for designing more energy-efficient ML systems are provided.
5. Neural Network Architecture with Locality-Sensitive Hashing Attention and Reversible Residual Connections for Sequential Data
GOOGLE LLC, 2025
Efficiently performing machine learning tasks on sequential data using neural networks that combine locality-sensitive hashing (LSH) attention with reversible residual connections. The LSH attention mechanism restricts the set of positions a query can attend to based on similarity, reducing computational cost compared to full dot-product attention. Reversible residual connections allow intermediate layer activations to be recovered during backpropagation rather than stored, eliminating the need to save all layer activations for training.
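The reversible-residual idea generalizes beyond this patent (it originates in RevNets and underlies the Reformer-style architecture this abstract describes). A minimal PyTorch sketch of the coupling and its inverse:

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Reversible residual pair:
       y1 = x1 + F(x2),  y2 = x2 + G(y1).
    Inputs can be reconstructed exactly from outputs, so activations
    need not be stored for backpropagation."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)   # recover x2 without stored activations
        x1 = y1 - self.f(x2)   # then recover x1
        return x1, x2

# Round-trip check on random features.
blk = ReversibleBlock(nn.Linear(64, 64), nn.Linear(64, 64))
x1, x2 = torch.randn(8, 64), torch.randn(8, 64)
with torch.no_grad():
    y1, y2 = blk(x1, x2)
    r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```

During backpropagation, each block's inputs are recomputed from its outputs via inverse(), so activation memory stays constant in depth.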
6. Method for Selecting Operators with Evaluation Parameters for Recomputation in Deep Learning Models
BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO LTD, 2025
Method to optimize recomputation in deep learning models by intelligently selecting which operators participate in recomputation, improving computational efficiency. The method determines a recomputation evaluation parameter for each operator based on its storage footprint and computation time; operators with favorable evaluation parameters are selected for recomputation. This trades computation for graphics memory: outputs of selected operators are discarded after the forward pass and recomputed during the backward pass, freeing memory that would otherwise hold intermediate results. Intelligent operator selection keeps the recomputation overhead low relative to the memory reclaimed.
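The evaluation parameter is the heart of this method. The scoring rule below (memory freed per unit of recompute time, chosen greedily against a budget) is a hypothetical stand-in for the patent's actual formula:

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    activation_bytes: int   # memory freed if this op's output is recomputed
    recompute_ms: float     # extra time to rerun the op in the backward pass

def select_for_recompute(ops, memory_budget_bytes):
    """Hypothetical selection: rank ops by memory saved per unit of
    recompute time and pick greedily until enough memory is freed."""
    ranked = sorted(ops,
                    key=lambda o: o.activation_bytes / max(o.recompute_ms, 1e-9),
                    reverse=True)
    chosen, freed = [], 0
    for op in ranked:
        if freed >= memory_budget_bytes:
            break
        chosen.append(op)
        freed += op.activation_bytes
    return chosen

ops = [Op("matmul_3", 512 << 20, 4.0), Op("gelu_3", 512 << 20, 0.3),
       Op("softmax_2", 256 << 20, 0.5)]
print([o.name for o in select_for_recompute(ops, 700 << 20)])
```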
7. Optimizing Deep Learning Models for Resource‐Constrained Environments With Cluster‐Quantized Knowledge Distillation
Niaz Ashraf Khan, A M Saadman Rafat - Wiley, 2025
ABSTRACT Deep convolutional neural networks (CNNs) are highly effective in computer vision tasks but remain challenging to deploy in resource-constrained environments due to their high computational and memory requirements. Conventional model compression techniques, such as pruning and post-training quantization, often compromise accuracy by decoupling compression from training. Furthermore, traditional knowledge distillation approaches rely on full-precision teacher models, limiting their effectiveness in compressed settings. To address these issues, we propose Cluster-Quantized Knowledge Distillation (CQKD), a novel framework that integrates structured compression with distillation, incorporating cluster-based weight quantization directly into the training loop. Unlike existing methods, CQKD applies quantization to both teacher and student, ensuring more faithful transfer of knowledge. By leveraging layer-wise K-means clustering, our approach achieves extreme compression while maintaining accuracy. Experimental results on CIFAR-10 and CIFAR-100 demonstrate the effectiveness of CQKD, achieving compression ratios of up to 34,000 while preserving competitive accuracy (97.9% on CIFAR-10 and 91.2% on CIFAR-100).
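The cluster-based weight quantization at the core of CQKD can be illustrated in isolation. This sketch shows only the layer-wise K-means step, using scikit-learn, outside the distillation loop the paper integrates it into:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_quantize_layer(weights: np.ndarray, k: int = 16):
    """Sketch of layer-wise K-means weight sharing: each weight is
    replaced by its cluster centroid, so only k floats plus a small
    per-weight index need to be stored."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10).fit(flat)
    centroids = km.cluster_centers_.ravel()      # the k shared values
    indices = km.labels_.astype(np.uint8)        # log2(k) bits per weight
    quantized = centroids[indices].reshape(weights.shape)
    return quantized, centroids, indices

w = np.random.randn(256, 256).astype(np.float32)
wq, c, idx = cluster_quantize_layer(w, k=16)
print(np.abs(w - wq).mean())   # mean quantization error
```

With k = 16 clusters, each weight needs only a 4-bit index plus a shared table of 16 centroids, which is one source of the compression the paper reports.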
8. The Hessian by blocks for neural network by backward propagation
Radhia Bessi, Nabil Gmati - Informa UK Limited, 2024
The back-propagation algorithm used with stochastic gradients, together with the increase in computer performance, is at the origin of the recent deep learning trend. For some problems, however, the convergence of gradient methods is still very slow. Newton's method offers potential advantages in terms of faster convergence. This method uses the Hessian matrix to guide the optimization process but increases the computational cost at each iteration. Indeed, although the expression of the Hessian matrix is explicitly known, previous work did not propose an efficient algorithm for its fast computation. In this work, we first propose a backward algorithm to compute the exact Hessian matrix. In addition, the introduction of original operators for the calculation of second derivatives improves readability and allows parallelization of the backward algorithm. To study the practical performance of Newton's method, we apply the proposed algorithm to train two classical neural networks for regression and classification problems and report the associated numerical results.
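For comparison with the paper's dedicated backward algorithm, PyTorch can already form the exact Hessian of a small network via generic double backpropagation. This reference sketch flattens the parameters into one vector; it is far less efficient than a purpose-built block-wise scheme:

```python
import torch
from torch.autograd.functional import hessian

# Tiny regression network with parameters flattened into one vector,
# so the exact Hessian of the loss can be formed for illustration.
torch.manual_seed(0)
X, y = torch.randn(32, 4), torch.randn(32, 1)
n1 = 4 * 8   # number of entries in the first weight matrix

def loss_fn(theta):
    W1 = theta[:n1].reshape(4, 8)
    W2 = theta[n1:].reshape(8, 1)
    pred = torch.tanh(X @ W1) @ W2
    return ((pred - y) ** 2).mean()

theta = torch.randn(n1 + 8)
H = hessian(loss_fn, theta)     # exact (n_params x n_params) Hessian
print(H.shape, torch.allclose(H, H.T, atol=1e-6))
```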
9. Efficient Implementation of Multilayer Perceptrons: Reducing Execution Time and Memory Consumption
Francisco Cedrón, Sara Alvarez-Gonzalez, Ana Ribas-Rodriguez - MDPI AG, 2024
A technique is presented that reduces the memory required by neural networks through improved weight storage. In contrast to traditional methods, which incur an exponential memory overhead as network size increases, the proposed method stores only the number of connections between neurons. The method is evaluated on feedforward networks and demonstrates memory savings of up to almost 80%, while also being more efficient, especially for larger architectures.
10. Loss of plasticity in deep continual learning
Shibhansh Dohare, Juan Hernandez-Garcia, Qingfeng Lan - Springer Science and Business Media LLC, 2024
Artificial neural networks, deep-learning methods and the backpropagation algorithm form the foundation of modern machine learning and artificial intelligence.
11. Gradient-free training of recurrent neural networks using random perturbations
Jesús García Fernández, Sander W. Keemink, Marcel van Gerven - Frontiers Media SA, 2024
Recurrent neural networks (RNNs) hold immense potential for computations due to their Turing completeness and sequential processing capabilities, yet existing methods for their training encounter efficiency challenges. Backpropagation through time (BPTT), the prevailing method, extends the backpropagation (BP) algorithm by unrolling the RNN over time. However, this approach suffers from significant drawbacks, including the need to interleave forward and backward phases and store exact gradient information. Furthermore, BPTT has been shown to struggle to propagate gradient information for long sequences, leading to vanishing gradients. An alternative strategy to using gradient-based methods like BPTT involves stochastically approximating gradients through perturbation-based methods. This learning approach is exceptionally simple, necessitating only forward passes in the network and a global reinforcement signal as feedback. Despite its simplicity, the random nature of its updates typically leads to inefficient optimization, limiting its effectiveness in training neural networks.
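A minimal NumPy sketch of perturbation-based learning, assuming simple Gaussian weight perturbation (one of several variants in this literature): two forward passes and a scalar loss difference replace the backward pass entirely.

```python
import numpy as np

def perturbation_update(loss_fn, params, sigma=1e-3, lr=1e-2, rng=None):
    """Sketch of weight-perturbation learning: the global scalar loss
    difference acts as the reinforcement signal; no gradients needed."""
    rng = rng or np.random.default_rng()
    noise = [sigma * rng.standard_normal(p.shape) for p in params]
    base = loss_fn(params)
    perturbed = loss_fn([p + n for p, n in zip(params, noise)])
    score = (perturbed - base) / sigma**2   # global reinforcement signal
    return [p - lr * score * n for p, n in zip(params, noise)]

# Toy usage: fit a linear map using forward evaluations only.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
W_true = rng.standard_normal((5, 2))
Y = X @ W_true
params = [np.zeros((5, 2))]
loss = lambda ps: float(((X @ ps[0] - Y) ** 2).mean())
for _ in range(2000):
    params = perturbation_update(loss, params, rng=rng)
print(loss(params))
```

The update is approximately unbiased in expectation but high-variance, which is the optimization inefficiency the abstract refers to.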
12. Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation
Yuchen Yang, Yingdong Shi, Cheems Wang, 2024
Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce the memory overhead of fine-tuning.
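The decoupling of forward and backward passes can be made concrete with a custom autograd function. This sketch keeps GELU in the forward pass but substitutes a single ReLU derivative in the backward pass, so only a one-bit sign mask is saved per activation; the paper's actual surrogates combine several ReLU derivatives and are more faithful than this simplified version.

```python
import torch

class ApproxGELU(torch.autograd.Function):
    """Sketch of the Approx-BP idea: exact GELU forward, ReLU-style
    backward, so only a boolean mask (not the input tensor) is saved."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x > 0)             # 1-bit mask, not x itself
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask                   # ReLU derivative stand-in

x = torch.randn(4, 8, requires_grad=True)
y = ApproxGELU.apply(x)
y.sum().backward()
print(x.grad.unique())   # gradients are 0 or 1 under the ReLU surrogate
```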
13. Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training
Zeliang Zhang, Jinyang Jiang, Zhuo Liu, 2024
Efficient and biologically plausible alternatives to backpropagation in neural network training remain a challenge due to issues such as high computational complexity and additional assumptions about neural networks, which limit scalability to deeper networks. The likelihood ratio method offers a promising gradient estimation strategy but is constrained by significant memory consumption, especially when deploying multiple copies of data to reduce estimation variance. In this paper, we introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation. By exploiting the natural parallelism during the backward pass using LR, we further provide a high-performance training strategy, which pipelines both the forward and backward pass, to make it more suitable for computation on specialized hardware. Extensive experiments demonstrate the effectiveness of the approximation technique in neural network training. This work underscores the potential of the likelihood ratio method in achieving high-performance neural network training.
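The likelihood-ratio estimator itself is compact. This NumPy sketch estimates a linear layer's weight gradient purely from noisy forward passes; the multiple noise copies averaged here are exactly the memory cost the paper's approximation targets.

```python
import numpy as np

def lr_gradient(W, X, loss_fn, sigma=0.1, copies=64, rng=None):
    """Likelihood-ratio (score-function) gradient sketch: inject Gaussian
    noise into the layer's pre-activations and weight the score by the
    loss; averaging over copies reduces the estimator's variance."""
    rng = rng or np.random.default_rng()
    grads = np.zeros_like(W)
    for _ in range(copies):
        eps = sigma * rng.standard_normal((X.shape[0], W.shape[1]))
        z = X @ W + eps                       # noisy forward pass only
        L = loss_fn(z)                        # scalar batch loss
        grads += L * (X.T @ eps) / sigma**2   # score-function estimate
    return grads / copies

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))
W = rng.standard_normal((5, 3))
target = rng.standard_normal((32, 3))
loss = lambda z: float(((z - target) ** 2).mean())
print(lr_gradient(W, X, loss, rng=rng).shape)
```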
14. Moonwalk: Inverse-Forward Differentiation
Dmitrii Krylov, Armin Karamzade, Roy Fox, 2024
Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward-mode differentiation, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible networks with computation time comparable to backpropagation.
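Moonwalk's vector-inverse-Jacobian trick is specific to invertible networks and is not reproduced here, but the forward-mode building block it rests on is easy to show. Assuming PyTorch 2's torch.func.jvp, a single forward pass yields a directional derivative with no stored activations:

```python
import torch
from torch.func import jvp

torch.manual_seed(0)
x = torch.randn(16, 8)
w = torch.randn(8, 4)

def f(w):
    return torch.tanh(x @ w).sum()

# One forward pass computes the directional derivative of f along v,
# with no activation storage for a backward pass.
v = torch.randn_like(w)
value, directional = jvp(f, (w,), (v,))
print(value.item(), directional.item())
```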
15. Perturbation-based Learning for Recurrent Neural Networks
Jesús García Fernández, Sander W. Keemink, Marcel van Gerven, 2024
Recurrent neural networks (RNNs) hold immense potential for computations due to their Turing completeness and sequential processing capabilities, yet existing methods for their training encounter efficiency challenges. Backpropagation through time (BPTT), the prevailing method, extends the backpropagation (BP) algorithm by unrolling the RNN over time. However, this approach suffers from significant drawbacks, including the need to interleave forward and backward phases and store exact gradient information. Furthermore, BPTT has been shown to struggle with propagating gradient information for long sequences, leading to vanishing gradients. An alternative strategy to using gradient-based methods like BPTT involves stochastically approximating gradients through perturbation-based methods. This learning approach is exceptionally simple, necessitating only forward passes in the network and a global reinforcement signal as feedback. Despite its simplicity, the random nature of its updates typically leads to inefficient optimization, limiting its effectiveness in training neural networks.
16. Efficient Deep Learning with Decorrelated Backpropagation
Sander Dalm, Joshua Offergeld, Nasir Ahmad, 2024
The backpropagation algorithm remains the dominant and most successful method for training deep neural networks (DNNs). At the same time, training DNNs at scale comes at a significant computational cost and therefore a high carbon footprint. Converging evidence suggests that input decorrelation may speed up deep learning. However, to date, this has not yet translated into substantial improvements in training efficiency in large-scale DNNs. This is mainly caused by the challenge of enforcing fast and stable network-wide decorrelation. Here, we show for the first time that much more efficient training of very deep neural networks using decorrelated backpropagation is feasible. To achieve this goal we made use of a novel algorithm which induces network-wide input decorrelation using minimal computational overhead. By combining this algorithm with careful optimizations, we obtain a more than two-fold speed-up and higher test accuracy compared to backpropagation when training an 18-layer deep residual network. This demonstrates that decorrelation provides exciting prospects for efficient deep learning.
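The paper's network-wide decorrelation algorithm is not spelled out in this abstract, so the following is a hedged sketch of a generic decorrelation rule: transform inputs with a learned matrix R and nudge R until the transformed features' off-diagonal covariance vanishes.

```python
import numpy as np

def decorrelation_step(R, X, lr=1e-2):
    """Generic decorrelation sketch: shrink off-diagonal covariance
    of z = R x while leaving the per-feature variances untouched."""
    z = X @ R.T
    C = (z.T @ z) / len(z)              # empirical covariance
    off = C - np.diag(np.diag(C))       # off-diagonal part only
    return R - lr * off @ R

rng = np.random.default_rng(0)
base = rng.standard_normal((4096, 8))
X = base + 0.5 * base[:, [0]]           # inputs share a common component
R = np.eye(8)
for _ in range(1000):
    R = decorrelation_step(R, X)
Z = X @ R.T
C = (Z.T @ Z) / len(Z)
print(np.abs(C - np.diag(np.diag(C))).max())   # off-diagonals near zero
```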
17. PARMESAN: Parameter-Free Memory Search and Transduction for Dense Prediction Tasks
Philip Matthias Winter, M Wimmer, David Major, 2024
In this work we address flexibility in deep learning by means of transductive reasoning. For adaptation to new tasks or new data, existing methods typically involve tuning of learnable parameters or even complete re-training from scratch, rendering such approaches inflexible in practice. We argue that the notion of separating computation from memory by means of transduction can act as a stepping stone for solving these issues. We therefore propose PARMESAN (parameter-free memory search and transduction), a scalable transduction method which leverages a memory module for solving dense prediction tasks. At inference, hidden representations in memory are searched to find corresponding examples. In contrast to other methods, PARMESAN learns without the requirement for any continuous training or fine-tuning of learnable parameters, simply by modifying the memory content. Our method is compatible with commonly used neural architectures and canonically transfers to 1D, 2D, and 3D grid-based data.
18. Backpropagation and Optimization in Deep Learning: Tutorial and Survey
Benyamin Ghojogh, Ali Ghodsi - Center for Open Science, 2024
This is a tutorial and survey paper on backpropagation and optimization in neural networks. It starts with gradient descent, line-search, momentum, and steepest descent. Then, backpropagation is introduced. Afterwards, stochastic gradient descent, mini-batch stochastic gradient descent, and their convergence rates are discussed. Adaptive learning rate methods, including AdaGrad, RMSProp, and Adam, are explained. Then, algorithms for sharpness-aware minimization are introduced. Finally, convergence guarantees for optimization in over-parameterized neural networks are discussed.
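As a worked example of the adaptive methods the tutorial covers, here is the textbook Adam update in NumPy, including the bias-correction step:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update for parameters w at step t >= 1."""
    m = b1 * m + (1 - b1) * grad               # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2            # second-moment estimate
    m_hat = m / (1 - b1**t)                    # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2 from a random start; gradient is 2w.
w = np.random.randn(4)
m, v = np.zeros(4), np.zeros(4)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(np.abs(w).max())   # close to zero
```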
19. VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections
Roy Miles, Pradyumna Reddy, Ismail Elezi, 2024
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA.
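A hedged sketch of the rank-1 sub-token projection described here, assuming the fixed 1-D subspace is a random unit vector (the paper's choice of projection and the surrounding update rules are not reproduced):

```python
import torch

def compress(act, u):
    """Split each token's features into sub-tokens of length len(u)
    and keep only their projection onto the fixed unit vector u."""
    g = u.numel()
    sub = act.reshape(*act.shape[:-1], -1, g)   # (..., n_sub, g)
    return sub @ u                               # one scalar per sub-token

def reconstruct(coeffs, u, orig_shape):
    """Coarse rank-1 reconstruction used in the backward pass."""
    sub = coeffs.unsqueeze(-1) * u
    return sub.reshape(orig_shape)

torch.manual_seed(0)
act = torch.randn(2, 16, 64)                     # (batch, tokens, dim)
u = torch.randn(8); u = u / u.norm()             # fixed 1-D subspace
c = compress(act, u)                             # 8x fewer stored values
rec = reconstruct(c, u, act.shape)
print(act.numel(), c.numel(), (rec - act).norm().item())
```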
20. FLOPS: Forward Learning with OPtimal Sampling
Tao Ren, Zishi Zhang, Jinyang Jiang, 2024
Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained focus for learning with only forward passes, also referred to as queries. Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance with minimum cost? We propose to allocate the optimal number of queries over each data point in a batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. Theoretical analysis verifies its optimality. We conduct extensive experiments for fine-tuning Vision Transformers.
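The allocator below is a hypothetical illustration of the paper's premise that not all data points deserve equal queries: it splits a fixed query budget in proportion to a per-sample variance proxy (here, the current loss). The real method derives its allocation from a proxy objective with optimality guarantees.

```python
import numpy as np

def allocate_queries(per_sample_scores, total_queries, min_queries=1):
    """Hypothetical allocator: give each sample a share of the query
    budget proportional to a variance proxy, so hard samples get more
    Monte Carlo queries. The result approximately sums to the budget."""
    p = per_sample_scores / per_sample_scores.sum()
    return np.maximum(min_queries,
                      np.floor(p * total_queries)).astype(int)

losses = np.array([0.1, 0.5, 2.0, 0.05])
print(allocate_queries(losses, total_queries=64))
```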