Training large neural networks requires substantial memory resources, with intermediate activations often consuming more memory than the model parameters themselves. A typical transformer model with 1 billion parameters can require over 100GB of memory during training, primarily due to the intermediate activations that must be retained for backpropagation. This memory bottleneck limits the size of models that can be trained on available hardware.

The fundamental challenge lies in balancing computational efficiency against memory usage while maintaining numerical stability and convergence properties during training.

This page brings together solutions from recent research—including gradient checkpointing, reversible layer architectures, activation recomputation strategies, and memory-efficient optimizer implementations. These and other approaches enable training of larger models on limited hardware resources while preserving training dynamics and model performance.
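
Several of the techniques collected below trade extra computation for lower activation memory. As a point of reference, here is a minimal PyTorch sketch of one of them, gradient checkpointing; the model and sizes are illustrative and not drawn from any specific entry:

```python
# Gradient checkpointing: activations inside each checkpointed block are
# discarded after the forward pass and recomputed on demand during backprop,
# trading compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, width=4096, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the recommended modern code path
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(32, 4096, requires_grad=True)
model(x).sum().backward()  # block activations are recomputed here
```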

1. Deep Learning Model Training with Threshold-Based Intermediate Result Storage to Minimize Memory Access Time During Backpropagation

PREFERRED NETWORKS INC, 2025

Efficiently training deep learning models by reducing memory access time during backpropagation without requiring a faster memory or interface. The method stores an intermediate result of the forward pass in memory only if it exceeds a threshold based on its computation cost and size; cheaper results are recomputed during backpropagation instead of being fetched from memory, reducing memory access time. The threshold balances this against recomputation cost, since redoing expensive forward-pass computations would increase the total training time.
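
A minimal sketch of the kind of threshold rule this abstract describes, under the assumption that the decision weighs estimated recomputation cost against tensor size; the cost model, names, and numbers here are hypothetical:

```python
# Hypothetical threshold rule: keep a forward intermediate in memory only when
# recomputing it later would cost more than a threshold scaled by its size;
# otherwise discard it and recompute during backprop.
def should_store(recompute_flops: float, size_bytes: int,
                 flops_per_byte_threshold: float = 10.0) -> bool:
    # Expensive-to-recompute, small tensors are worth keeping;
    # cheap, bulky ones are recomputed on demand.
    return recompute_flops / max(size_bytes, 1) > flops_per_byte_threshold

stored, recomputed = [], []
for name, flops, nbytes in [   # (name, est. recompute FLOPs, size) -- illustrative
    ("attn_scores", 2e9, 64 << 20),
    ("gelu_out",    1e6, 256 << 20),
]:
    (stored if should_store(flops, nbytes) else recomputed).append(name)
print(stored, recomputed)      # ['attn_scores'] ['gelu_out']
```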

2. Sparse Data Storage System Utilizing Non-Zero Value Retention and Offset Calculation for Efficient Transmission

SHENZHEN CORERAIN TECHNOLOGIES CO LTD, 2025

Sparse data storage technique for deep learning to reduce the amount of data that needs to be transmitted during deep learning operations to lower the bandwidth requirements and improve efficiency. It stores only the non-zero values in memory and calculates the offset between them when transmitting. This avoids sending zeros and allows effective reduction of data size and bandwidth demand for deep learning applications.
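
The storage scheme can be illustrated in a few lines of NumPy: only the non-zero values plus the gaps (offsets) between their positions are kept, and the dense tensor is rebuilt on the receiving side. This is a generic gap-encoding sketch, not the patent's exact format:

```python
# Transmit only non-zero values and the gaps between their positions,
# rather than the dense tensor full of zeros.
import numpy as np

def encode_sparse(dense: np.ndarray):
    flat = dense.ravel()
    idx = np.flatnonzero(flat)
    offsets = np.diff(idx, prepend=-1)   # gap from the previous non-zero
    return flat[idx], offsets, dense.shape

def decode_sparse(values, offsets, shape):
    positions = np.cumsum(offsets) - 1   # invert the gap encoding
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[positions] = values
    return out.reshape(shape)

x = np.array([[0., 3., 0., 0.], [5., 0., 0., 7.]])
vals, offs, shape = encode_sparse(x)
assert (decode_sparse(vals, offs, shape) == x).all()
```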

3. Computing Resource Scheduling for Deep Learning Graphs with Integrated Forward and Backward Propagation Cycles

DELL PRODUCTS LP, 2025

Efficiently scheduling computing resources for deep learning models with backpropagation without needing a split into separate forward and backward graphs. The scheduling method involves handling the computing graph with cycles that arise in backpropagation. The graph contains nodes representing operators for forward propagation and their corresponding gradient operators for backpropagation. Instead of splitting into separate DAGs, the full graph is used for scheduling. This preserves operator correlations and avoids unnecessary resource duplication and data transfer between devices.

4. DRAM-Based In-Memory Computing with Bit-Serial XNOR Operations for Neural Network Acceleration

MICRON TECH INC, 2025

In-memory computing using dynamic random-access memory (DRAM) for deep neural network acceleration. The technique involves performing bit-serial XNOR computations directly in the charge domain of the DRAM cells. This allows executing multiplication-accumulation operations for binary neural networks inside the memory device itself. It reduces power consumption and improves throughput compared to using a processor to read/write from DRAM for every computation step. The technique leverages the charge storage capability of DRAM cells to compute binary operations internally without requiring frequent read/write cycles.
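
The arithmetic identity this relies on is easy to verify in software: for vectors of +1/-1 values packed as bits, the dot product equals 2·popcount(XNOR(a, b)) - n. The sketch below demonstrates the identity on plain Python integers; the in-DRAM charge-domain execution is of course not modeled here:

```python
# Binary dot product via XNOR/popcount: matches minus mismatches.
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask       # 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n

# +1 encoded as bit 1, -1 as bit 0; e.g. a = [+1,-1,+1,+1], b = [+1,+1,-1,+1]
a, b = 0b1011, 0b1101
print(binary_dot(a, b, 4))   # -> 0, same as (+1)(+1)+(-1)(+1)+(+1)(-1)+(+1)(+1)
```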

5. Message-Based Multi-Processor System with Coordinate-Driven Message Routing in Neural Network Configuration

SNAP INC, 2025

A message-based multi-processor system that can be configured as a deep neural network while requiring less memory compared to traditional neural networks. The system uses a message exchange network connecting clusters of processors. Each processor cluster has elements that can transmit messages to destinations. The system uses a logic module in each cluster to determine if a message should be sent to a specific destination based on computed coordinate ranges. This avoids needing a separate lookup table for each destination core. The logic module computes potential destination ranges from the processor cluster's own coordinates, then checks if the computed ranges overlap the destination's ranges. If so, it enables message transmission.
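
A rough sketch of the routing predicate, assuming a 2-D coordinate grid and a hypothetical fan-out rule for the ranges a cluster can reach (the patent's actual range computation is not specified in this summary):

```python
# Instead of a per-destination lookup table, a cluster computes its potential
# destination coordinate range and forwards a message only if that range
# overlaps the destination's range.
def ranges_overlap(a_min, a_max, b_min, b_max) -> bool:
    return a_min <= b_max and b_min <= a_max

def should_send(cluster_xy, fan_out, dest_range):
    # Hypothetical rule: a cluster at (x, y) can reach coordinates within
    # `fan_out` hops of itself; compare per axis against the destination range.
    (dx_min, dx_max), (dy_min, dy_max) = dest_range
    x, y = cluster_xy
    return (ranges_overlap(x - fan_out, x + fan_out, dx_min, dx_max) and
            ranges_overlap(y - fan_out, y + fan_out, dy_min, dy_max))

print(should_send((3, 3), 2, ((4, 6), (1, 2))))  # True: [1,5]x[1,5] overlaps
```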

6. A Novel Approach to Gradient Evaluation and Efficient Deep Learning: A Hybrid Method

Bogdan Dorneanu, Vasileios Mappas, Harvey Arellano-Garcia - PSE Press, 2025

Deep learning faces significant challenges in efficiently training large-scale models. These issues are closely linked, as efficient training often depends on precise and computationally feasible gradient calculations. This work introduces innovative methodologies to improve deep learning network (DLN) training for complex systems. A novel approach is proposed by adapting the block coordinate descent (BCD) method, which optimizes individual layers sequentially; combined with traditional batch-based training, this creates a hybrid method that harnesses the strengths of both techniques. Additionally, the study explores Iterated Control Random Search (ICRS) for initializing parameters and applies quasi-Newton methods like L-BFGS with restricted iterations to enhance optimization. By tackling gradient evaluation and efficiency together, this contribution offers a comprehensive framework addressing key problems in modern machine learning, with scalability and effectiveness especially in handling real-world problems. Examples from Process Systems Engineering illustrate how these advancements can directly...

7. CaDCR: An Efficient Cascaded Dynamic Collaborative Reasoning Framework for Intelligent Recognition Systems

Bowen Li, Xudong Cao, Jun Li - Multidisciplinary Digital Publishing Institute, 2025

To address the challenges of high computational cost and energy consumption posed by deep neural networks in embedded systems, this paper presents CaDCR, a lightweight dynamic collaborative reasoning framework. By integrating a feature discrepancy-guided skipping mechanism with a depth-sensitive early exit mechanism, the framework establishes hierarchical decision logic: it dynamically selects execution paths through network blocks based on the complexity of input samples and enables early exits for simple samples through shallow confidence assessment, thereby forming an adaptive resource allocation strategy. CaDCR can both suppress unnecessary computation and satisfy hard constraints by forcibly terminating the inference process for all samples. Based on the framework, we design a cascaded system tailored for deployment to tackle practical challenges. Experiments on the CIFAR-10/100 and SpeechCommands datasets demonstrate that CaDCR maintains accuracy comparable to or higher than baseline models while reducing computation by approximately 40–70% within a controllable loss margin. In tests on an STM32 platform, the framework's performance matches theoretical expectations, further verifying...

8. Theoretical Limits of Feedback Alignment in Preference-based Fine-tuning of AI Models

Zhenyu Gao, 2025

Feedback alignment (FA) has emerged as an alternative to backpropagation for training deep networks by using fixed random feedback weights. While FA shows promise in supervised tasks, its extension to preference-based fine-tuning (PFT) of large language models, which relies on human or learned preference signals, remains underexplored. In this work, we analyze the theoretical limitations of FA applied to PFT objectives. We derive error propagation bounds, characterize convergence conditions for paired-FA updates, and quantify the impact of noise and mismatch on stability. By integrating recent advances in meta-reinforcement learning and prompt compression, we highlight trade-offs between complexity and efficiency, offering practical guidelines for hybrid FA-backprop architectures in large-scale optimization.

9. Matrix Accelerator with Multi-Stage Systolic Array and Output Sparsity Metadata for Parallel Processors

INTEL CORP, 2025

Matrix accelerator for parallel processors like GPUs to improve efficiency of matrix operations in machine learning workloads. The accelerator uses a multi-stage systolic array with sparsity support. It receives output sparsity metadata that indicates which outputs to bypass multiply-accumulate operations. This allows accelerating the backward propagation pass in training by avoiding unnecessary computations for zeroed-out weights. The accelerator can power gate multipliers and adders based on output sparsity, independent of input sparsity.
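
A NumPy sketch of the gating idea: a metadata mask marks output columns known to be zero, and the multiply-accumulate work for those columns is skipped entirely, standing in for the accelerator's power-gated multipliers and adders. This is a behavioral model only:

```python
# Output-sparsity gating: out_mask[j] == False means column j of the result
# is bypassed entirely, never computed.
import numpy as np

def masked_matmul(A, B, out_mask):
    out = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)
    for j in np.flatnonzero(out_mask):   # only unmasked output columns
        out[:, j] = A @ B[:, j]
    return out

A = np.random.randn(4, 8)
B = np.random.randn(8, 6)
mask = np.array([1, 0, 1, 0, 0, 1], dtype=bool)  # only 3 of 6 columns computed
C = masked_matmul(A, B, mask)
```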

10. FairQuanti: Enhancing Fairness in Deep Neural Network Quantization via Neuron Role Contribution

Jinyin Chen, Zhiqi Cao, Xiaojuan Wang - Association for Computing Machinery, 2025

The increasing complexity of deep neural networks (DNNs) poses significant resource challenges for edge devices, prompting the development of compression technologies like model quantization. However, while improving efficiency, quantization can introduce or perpetuate the original model's bias. Existing debiasing methods for quantized models often incur additional costs. To address this issue, we propose FairQuanti, a novel approach that leverages neuron role contribution to achieve fairness. By distinguishing between biased and normal neurons, it employs mixed precision to mitigate bias during the quantization process. It has four key differences from previous studies: (1) Neuron Roles - it formally defines neuron roles, establishing a framework for feasible bias mitigation; (2) Effectiveness - it introduces a fair quantization strategy that discriminatively quantizes neurons, balancing accuracy and fairness through Bayesian optimization; (3) Generality - it applies to both structured and unstructured data across various bit levels; (4) Robustness - it demonstrates resilience against adaptive attacks. Extensive experiments on five datasets (three structured and two unstructured) using different valida...

11. Sparse Neural Network Training with In-Situ Synapse Pruning and Compact Indexing Scheme

NANO DIMENSION TECHNOLOGIES LTD, 2025

Generating sparse neural networks during training and representing them in a compact format to improve efficiency and reduce memory requirements. The method involves pruning synapse connections during training instead of just post-processing. It also uses a unique indexing scheme for the sparse weights that eliminates storing and processing disconnected synapses. This allows faster prediction and training as the computation and storage scales linearly with sparsity.
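
A PyTorch sketch of the general idea, assuming simple magnitude-based pruning and a flat (index, value) format; the patent's specific indexing scheme may differ:

```python
# In-training pruning with compact indexing: after a step, low-magnitude
# weights are dropped and only (index, value) pairs of surviving synapses
# are kept, so storage scales with the number of connections.
import torch

def prune_and_pack(weight: torch.Tensor, keep_ratio: float = 0.1):
    flat = weight.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)       # surviving synapses
    return idx, flat[idx], weight.shape      # compact representation

def unpack(idx, vals, shape):
    out = torch.zeros(torch.Size(shape).numel())
    out[idx] = vals
    return out.view(shape)

w = torch.randn(256, 256)
idx, vals, shape = prune_and_pack(w, keep_ratio=0.05)
w_sparse = unpack(idx, vals, shape)          # 95% of entries removed
```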

12. Neural Network Training System with In-Situ Weight Quantization Using Ensemble Kalman Filter

TDK CORP, 2025

Online learning program and learner for neural networks that quantizes weights during training to reduce computation and memory usage, especially for edge devices with resource constraints. The learning involves an ensemble Kalman filter to estimate weight updates in a shorter bit representation, then quantizing that representation to the final quantized weight. This allows quantizing weights during learning instead of just during inference.

13. A Topological Improvement of the Overall Performance of Sparse Evolutionary Training: Motif-Based Structural Optimization of Sparse MLPs Project

Xiaotian Chen, Hongyun Liu, Seyed Sahand Mohammadi Ziabari, 2025

Deep Neural Networks (DNNs) have been proven to be exceptionally effective and are applied across diverse domains within deep learning. However, as DNN models increase in complexity, the demand for reduced computational costs and memory overheads has become increasingly urgent. Sparsity has emerged as a leading approach in this area. The robustness of sparse Multi-layer Perceptrons (MLPs) in supervised feature selection, along with the application of Sparse Evolutionary Training (SET), illustrates the feasibility of reducing complexity without compromising accuracy. Moreover, it is believed that the SET algorithm can still be improved through a structural optimization called motif-based optimization, with potential efficiency gains exceeding 40% at a performance decline under 4%. This research investigates whether motif-based structural optimization applied to sparse MLPs (SET-MLP) can enhance performance and to what extent such improvement can be achieved.

14. Neural Network Training Method Utilizing Precompiled Code Reuse for Computational Graph Execution

HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD, 2025

Method for training neural networks with improved efficiency by reusing compiled code from prior training rounds. The method involves determining if a compiled code for a current computational graph exists in the system before executing it. If the compiled code exists, it is directly executed instead of generating a new one. This leverages previous compilations to avoid redundant steps and reduce resource usage compared to always regenerating the compiled code.
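
The caching pattern reduces to a lookup keyed on a graph fingerprint. A minimal sketch, with `graph_ops` and `compile_fn` as hypothetical placeholders for the framework's graph representation and compiler:

```python
# Compile cache: key on a fingerprint of the computational graph so later
# training rounds reuse the compiled artifact instead of recompiling.
import hashlib

_compile_cache: dict[str, object] = {}

def graph_fingerprint(graph_ops: list[str]) -> str:
    return hashlib.sha256("|".join(graph_ops).encode()).hexdigest()

def get_or_compile(graph_ops, compile_fn):
    key = graph_fingerprint(graph_ops)
    if key not in _compile_cache:            # first round: compile and store
        _compile_cache[key] = compile_fn(graph_ops)
    return _compile_cache[key]               # later rounds: direct reuse
```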

15. AI Model Training Method with Selective Computation Omission Based on Dynamic Confidence Threshold

KOREA UNIVERSITY RESEARCH AND BUSINESS FOUNDATION, 2025

Reducing computation and energy for training AI models by selectively omitting computations for images with high prediction confidence. The method involves determining a threshold confidence level for images during training. If an image's confidence exceeds the threshold, only the weights of that image are partially updated instead of doing a full backward propagation and weight update. This allows omitting some computations for images with low error, as the noise tolerance of mini-batch gradient descent allows approximating the weight change. A dynamic threshold is set based on allowable error to balance omissions and learning quality.
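
A PyTorch sketch of the gating step, using a fixed threshold where the patent describes a dynamic one, and skipping the backward pass entirely for confident samples rather than applying a partial update:

```python
# Confidence-gated training: samples whose prediction confidence exceeds the
# threshold are excluded from the loss, so no backward work is spent on them.
import torch
import torch.nn.functional as F

def confident_skip_loss(logits, labels, threshold=0.95):
    with torch.no_grad():
        conf = F.softmax(logits, dim=-1).max(dim=-1).values
        keep = conf <= threshold             # only low-confidence samples train
    if keep.sum() == 0:
        return logits.sum() * 0.0            # nothing to learn from this batch
    return F.cross_entropy(logits[keep], labels[keep])

logits = torch.randn(64, 10, requires_grad=True)
labels = torch.randint(0, 10, (64,))
confident_skip_loss(logits, labels).backward()
```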

16. Energy-Aware Machine Learning Algorithm Design

Dheeraj Vaddepally - International Journal for Multidisciplinary Research (IJFMR), 2025

The exponential increase in machine learning (ML) use on mobile and edge devices has created a need to adopt energy-efficient algorithm design for sustainable future consumption. Power reduction on energy-constrained platforms like smartphones, Internet of Things devices, and autonomous cars, at both training and inference, is of critical importance. This work discusses techniques for energy-conscious algorithms, specifically CPU and GPU profiling and techniques for reducing power usage. Profiling tools are discussed for finding the power requirements of various models; pruning, quantization, knowledge distillation, and low-precision inference minimize usage at inference time. For training, backpropagation, optimizers, and distributed training are taken into account. The work also covers efficiency-performance trade-offs and the promise of energy-aware NAS and dynamic resource management. The influence is shown through examples from IoT device, edge computing, and data center applications. Last but not least, hardware constraints and scalability issues are presented, and directions for designing more energy-efficient ML systems are provided.

17. Neural Network Architecture with Locality-Sensitive Hashing Attention and Reversible Residual Connections for Sequential Data

GOOGLE LLC, 2025

Efficiently performing machine learning tasks on sequential data using neural networks by leveraging locality-sensitive hashing (LSH) attention and reversible residual connections. The LSH attention mechanism restricts the set of positions a query can attend to based on similarity, reducing computational costs compared to dot-product attention. Reversible residual connections allow recovering intermediate layer activations during backpropagation without storing all activations. This eliminates the need to save all layer activations for training.
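
The reversible-residual coupling is compact enough to show directly. In the sketch below (plain PyTorch, with arbitrary `f` and `g` sub-networks), the block's inputs are recovered exactly from its outputs, which is what lets backpropagation proceed without stored activations:

```python
# Reversible residual coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
# The inverse recovers the inputs exactly, so nothing is cached for backprop.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):               # run during backprop instead of caching
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

blk = ReversibleBlock(nn.Linear(16, 16), nn.Linear(16, 16))
x1, x2 = torch.randn(2, 16), torch.randn(2, 16)
y1, y2 = blk(x1, x2)
r1, r2 = blk.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-6) and torch.allclose(r2, x2, atol=1e-6)
```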

18. Method for Selecting Operators with Evaluation Parameters for Recomputation in Deep Learning Models

BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO LTD, 2025

Method to optimize recomputation in deep learning models by intelligently selecting operators to participate in recomputation to improve computational efficiency. The method involves determining a recomputation evaluation parameter for each operator based on storage and computation time. Operators with high evaluation are selected to participate in recomputation. This allows sacrificing graphics memory for computation by storing intermediate results of selected operators. By intelligently choosing operators, it enables more computation in graphics memory to improve performance.
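
A sketch of one plausible reading of the selection rule: score each operator by memory freed per unit of recomputation time and greedily pick high scorers until a memory budget is met. The operator statistics are illustrative placeholders:

```python
# Score operators by bytes freed per millisecond of recomputation and select
# the best candidates for recomputation under a memory budget.
def select_for_recompute(ops, budget_bytes):
    scored = sorted(ops, key=lambda op: op["bytes"] / op["recompute_ms"], reverse=True)
    chosen, freed = [], 0
    for op in scored:
        if freed >= budget_bytes:
            break
        chosen.append(op["name"])
        freed += op["bytes"]
    return chosen, freed

ops = [
    {"name": "softmax",   "bytes": 512 << 20, "recompute_ms": 3.0},
    {"name": "layernorm", "bytes": 128 << 20, "recompute_ms": 0.5},
    {"name": "matmul",    "bytes": 256 << 20, "recompute_ms": 9.0},
]
print(select_for_recompute(ops, budget_bytes=600 << 20))
```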

19. Optimizing Deep Learning Models for Resource‐Constrained Environments With Cluster‐Quantized Knowledge Distillation

Niaz Ashraf Khan, A M Saadman Rafat - Wiley, 2025

Deep convolutional neural networks (CNNs) are highly effective in computer vision tasks but remain challenging to deploy in resource-constrained environments due to their high computational and memory requirements. Conventional model compression techniques, such as pruning and post-training quantization, often compromise accuracy by decoupling compression from training. Furthermore, traditional knowledge distillation approaches rely on full-precision teacher models, limiting their effectiveness in compressed settings. To address these issues, we propose Cluster-Quantized Knowledge Distillation (CQKD), a novel framework that integrates structured compression with distillation, incorporating cluster-based weight quantization directly into the training loop. Unlike existing methods, CQKD applies quantization to both teacher and student, ensuring more effective transfer of knowledge. By leveraging layer-wise K-means clustering, our approach achieves extreme compression while maintaining accuracy. Experimental results on CIFAR-10 and CIFAR-100 demonstrate the effectiveness of CQKD, achieving compression ratios of up to 34,000 while preserving competitive accuracy: 97.9% on CIFAR-10 and 91.2% on CIFAR-100. These results highlight...

20. The Hessian by blocks for neural network by backward propagation

Radhia Bessi, Nabil Gmati - Informa UK Limited, 2024

The back-propagation algorithm used with a stochastic gradient and the increase in computer performance are at the origin of the recent Deep learning trend. For some problems, however, the convergence of gradient methods is still very slow. Newton's method offers potential advantages in terms of faster convergence. This method uses the Hessian matrix to guide the optimization process but increases the computational cost at each iteration. Indeed, although the expression of the Hessian matrix is explicitly known, previous work did not propose an efficient algorithm for its fast computation. In this work, we first propose a backward algorithm to compute the exact Hessian matrix. In addition, the introduction of original operators, for the calculation of second derivatives, facilitates the reading and allows the parallelization of the backward-looking algorithm. To study the practical performance of Newton's method, we apply the proposed algorithm to train two classical neural networks for regression and classification problems and display the associated numerical results.

21. Efficient Implementation of Multilayer Perceptrons: Reducing Execution Time and Memory Consumption

Francisco Cedrón, Sara Alvarez-Gonzalez, Ana Ribas-Rodriguez - MDPI AG, 2024

A technique is presented that reduces the required memory of neural networks through improving weight storage. In contrast to traditional methods, which have an exponential memory overhead with the increase in network size, the proposed method stores only the number of connections between neurons. The proposed method is evaluated on feedforward networks and demonstrates memory saving capabilities of up to almost 80% while also being more efficient, especially with larger architectures.

22. Loss of plasticity in deep continual learning

Shibhansh Dohare, Juan Hernandez-Garcia, Qingfeng Lan - Springer Science and Business Media LLC, 2024

Artificial neural networks, deep-learning methods and the backpropagation algorithm

23. Gradient-free training of recurrent neural networks using random perturbations

Jesús García Fernández, Sander W. Keemink, Marcel van Gerven - Frontiers Media SA, 2024

Recurrent neural networks (RNNs) hold immense potential for computations due to their Turing completeness and sequential processing capabilities, yet existing methods for their training encounter efficiency challenges. Backpropagation through time (BPTT), the prevailing method, extends the backpropagation (BP) algorithm by unrolling the RNN over time. However, this approach suffers from significant drawbacks, including the need to interleave forward and backward phases and store exact gradient information. Furthermore, BPTT has been shown to struggle to propagate gradient information for long sequences, leading to vanishing gradients. An alternative strategy to using gradient-based methods like BPTT involves stochastically approximating gradients through perturbation-based methods. This learning approach is exceptionally simple, necessitating only forward passes in the network and a global reinforcement signal as feedback. Despite its simplicity, the random nature of its updates typically leads to inefficient optimization, limiting its effectiveness in training neural networks. In th...

24. Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation

Yuchen Yang, Yingdong Shi, Cheems Wang, 2024

Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce...
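
A minimal sketch of the activation-function side of this idea in PyTorch: keep GELU's forward pass but substitute ReLU's derivative in the backward pass, so only a 1-bit sign mask needs to be saved instead of the full-precision input. This follows the paper's description at a high level; details of the authors' implementation may differ:

```python
# Approx-BP-style activation: GELU forward, ReLU derivative as the backward
# surrogate. Only a boolean mask is saved, not the activations themselves.
import torch

class ApproxGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x > 0)         # boolean mask, not the input
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (positive,) = ctx.saved_tensors
        return grad_out * positive           # ReLU's derivative as surrogate

x = torch.randn(8, 32, requires_grad=True)
ApproxGELU.apply(x).sum().backward()
```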

25. Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training

Zeliang Zhang, Jinyang Jiang, Zhuo Liu, 2024

Efficient and biologically plausible alternatives to backpropagation in neural network training remain a challenge due to issues such as high computational complexity and additional assumptions about neural networks, which limit scalability to deeper networks. The likelihood ratio method offers a promising gradient estimation strategy but is constrained by significant memory consumption, especially when deploying multiple copies of data to reduce estimation variance. In this paper, we introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation. By exploiting the natural parallelism during the backward pass using LR, we further provide a high-performance training strategy, which pipelines both the forward and backward pass, to make it more suitable for the computation on specialized hardware. Extensive experiments demonstrate the effectiveness of the approximation technique in neural network training. This work underscores the potential of the likelihood ratio method in achieving high-performance neural...

26. Moonwalk: Inverse-Forward Differentiation

Dmitrii Krylov, Armin Karamzade, Roy Fox, 2024

Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to com...

27. Perturbation-based Learning for Recurrent Neural Networks

Jesús García Fernández, Sander W. Keemink, Marcel van Gerven, 2024

Recurrent neural networks (RNNs) hold immense potential for computations due to their Turing completeness and sequential processing capabilities, yet existing methods for their training encounter efficiency challenges. Backpropagation through time (BPTT), the prevailing method, extends the backpropagation (BP) algorithm by unrolling the RNN over time. However, this approach suffers from significant drawbacks, including the need to interleave forward and backward phases and store exact gradient information. Furthermore, BPTT has been shown to struggle with propagating gradient information for long sequences, leading to vanishing gradients. An alternative strategy to using gradient-based methods like BPTT involves stochastically approximating gradients through perturbation-based methods. This learning approach is exceptionally simple, necessitating only forward passes in the network and a global reinforcement signal as feedback. Despite its simplicity, the random nature of its updates typically leads to inefficient optimization, limiting its effectiveness in training neural networks. I...

28. Efficient Deep Learning with Decorrelated Backpropagation

Sander Dalm, Joshua Offergeld, Nasir Ahmad, 2024

The backpropagation algorithm remains the dominant and most successful method for training deep neural networks (DNNs). At the same time, training DNNs at scale comes at a significant computational cost and therefore a high carbon footprint. Converging evidence suggests that input decorrelation may speed up deep learning. However, to date, this has not yet translated into substantial improvements in training efficiency in large-scale DNNs. This is mainly caused by the challenge of enforcing fast and stable network-wide decorrelation. Here, we show for the first time that much more efficient training of very deep neural networks using decorrelated backpropagation is feasible. To achieve this goal we made use of a novel algorithm which induces network-wide input decorrelation using minimal computational overhead. By combining this algorithm with careful optimizations, we obtain a more than two-fold speed-up and higher test accuracy compared to backpropagation when training an 18-layer deep residual network. This demonstrates that decorrelation provides exciting prospects for efficient d...

29. Backpropagation and Optimization in Deep Learning: Tutorial and Survey

Benyamin Ghojogh, Ali Ghodsi - Center for Open Science, 2024

This is a tutorial and survey paper on backpropagation and optimization in neural networks. It starts with gradient descent, line-search, momentum, and steepest descent. Then, backpropagation is introduced. Afterwards, stochastic gradient descent, mini-batch stochastic gradient descent, and their convergence rates are discussed. Adaptive learning rate methods, including AdaGrad, RMSProp, and Adam, are explained. Then, algorithms for sharpness-aware minimization are introduced. Finally, convergence guarantees for optimization in over-parameterized neural networks are discussed.

30. VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

Roy Miles, Pradyumna Reddy, Ismail Elezi, 2024

Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA ...

31. FLOPS: Forward Learning with OPtimal Sampling

Tao Ren, Zishi Zhang, Jinyang Jiang, 2024

Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained focus for learning with only forward passes, also referred to as queries. Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving the forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance with minimum cost? For this, we propose to allocate the optimal number of queries over each data in one batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. Theoretical results are carried out to verify its optimality. We conduct extensive experiments for fine-tuning Vision Transformer...

32. Adaptive Stochastic Conjugate Gradient Optimization for Backpropagation Neural Networks

Mohamed Hashem, Fadele Ayotunde Alaba, Muhammad Haruna Jumare - Institute of Electrical and Electronics Engineers (IEEE), 2024

Backpropagation neural networks are commonly utilized to solve complicated issues in various disciplines. However, optimizing their settings remains a significant task. Traditional gradient-based optimization methods, such as stochastic gradient descent (SGD), often exhibit slow convergence and hyperparameter sensitivity. An adaptive stochastic conjugate gradient (ASCG) optimization strategy for backpropagation neural networks is proposed in this research. ASCG combines the advantages of stochastic optimization and conjugate gradient techniques to increase training efficiency and convergence speed. Based on the observed gradients, the algorithm adaptively calculates the learning rate and search direction at each iteration, allowing for quicker convergence and greater generalization. Experimental findings on benchmark datasets show that ASCG optimization outperforms standard optimization techniques regarding convergence time and model performance. The proposed ASCG algorithm provides a viable method for improving the training process of backpropagation neural networks, making them more succe...

33. Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukáš König, 2024

The increasing size of deep learning models has created the need for more efficient alternatives to the standard error backpropagation algorithm, that make better use of asynchronous, parallel and distributed computing. One major shortcoming of backpropagation is the interlocking between the forward phase of the algorithm, which computes a global loss, and the backward phase where the loss is backpropagated through all layers to compute the gradients, which are used to update the network parameters. To address this problem, we propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. Furthermore, since we observe that the forward pass is often much faster than the backward pass, we use separate threads for the forward and backward pass calculations, which allows us to use a higher ratio of forward to backward threads than the usual 1:1 ratio, reducing the overall staleness of the parameters. Thus, our approach performs asynchronous stochastic gradient descent using separate threads for the loss (forward) and gra...

34. Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Yibo Yang, Xiaojie Li, Motasem Alfarra, 2024

Relieving the reliance of neural network training on a global back-propagation (BP) has emerged as a notable research topic due to the biological implausibility and huge memory consumption caused by BP. Among the existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has been proved to be effective even on large-scale datasets. However, the reconciliation among local errors has never been investigated. In this paper, we first theoretically study non-greedy layer-wise training and show that the convergence cannot be assured when the local gradient in a module w.r.t. its input is not reconciled with the local gradient in the previous module w.r.t. its output. Inspired by the theoretical result, we further propose a local training strategy that successively regularizes the gradient reconciliation between neighboring modules without breaking gradient isolation or introducing any learnable parameters. Our method can be integrated into both local-BP and BP-free settings. In experiments, we achieve significant performance improvement...

35. Efficient Backpropagation with Variance-Controlled Adaptive Sampling

Ziteng Wang, Jianfei Chen, Jun Zhu, 2024

Sampling-based algorithms, which eliminate "unimportant" computations during forward and/or back propagation (BP), offer potential solutions to accelerate neural network training. However, since sampling introduces approximations to training, such algorithms may not consistently maintain accuracy across various tasks. In this work, we introduce a variance-controlled adaptive sampling (VCAS) method designed to accelerate BP. VCAS computes an unbiased stochastic gradient with fine-grained layerwise importance sampling in data dimension for activation gradient calculation and leverage score sampling in token dimension for weight gradient calculation. To preserve accuracy, we control the additional variance by learning the sample ratio jointly with model parameters during training. We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process. T...

36. Forward Gradient-Based Frank-Wolfe Optimization for Memory Efficient Deep Neural Network Training

Mohammad Rostami, S. S. Kia, 2024

Training a deep neural network using gradient-based methods necessitates the calculation of gradients at each level. However, using backpropagation, or reverse-mode differentiation, to calculate the gradients necessitates significant memory consumption, rendering backpropagation an inefficient method for computing gradients. This paper focuses on analyzing the performance of the well-known Frank-Wolfe algorithm, a.k.a. conditional gradient algorithm, by having access to the forward mode of automatic differentiation to compute gradients. We provide in-depth technical details that show the proposed algorithm does converge to the optimal solution with a sub-linear rate of convergence by having access to the noisy estimate of the true gradient obtained in the forward mode of automated differentiation, referred to as the Projected Forward Gradient. In contrast, the standard Frank-Wolfe algorithm, when provided with access to the Projected Forward Gradient, fails to converge to the optimal solution. We demonstrate the convergence attributes of our proposed algorithms using a numerical example...

37. HLQ: Fast and Efficient Backpropagation via Hadamard Low-rank Quantization

S.P. Kim, Eunhyeok Park, 2024

With the rapid increase in model size and the growing importance of various fine-tuning applications, lightweight training has become crucial. Since the backward pass is twice as expensive as the forward pass, optimizing backpropagation is particularly important. However, modifications to this process can lead to suboptimal convergence, so training optimization should minimize perturbations, which is a highly challenging task. In this study, we introduce a novel optimization strategy called Hadamard Low-rank Quantization (HLQ), focusing on reducing the cost of backpropagation in convolutional and linear layers. We first analyze the sensitivity of gradient computation with respect to activation and weight, and judiciously design the HLQ pipeline to apply 4-bit Hadamard quantization to the activation gradient and Hadamard low-rank approximation to the weight gradient. This combination was found to be the best for maximizing benefits, and our extensive experiments demonstrate the outstanding performance of HLQ in both training from scratch and fine-tuning, achieving significant memory s...

38. Modified Backpropagation Algorithm with Multiplicative Calculus in Neural Networks

Serkan Özbay - Kaunas University of Technology (KTU), 2023

Backpropagation is one of the most widely used algorithms for training feedforward deep neural networks. The algorithm requires a differentiable activation function and it performs computations of the gradient proceeding backwards through the feedforward deep neural network from the last layer through to the first layer. In order to calculate the gradient at a specific layer, the gradients of all layers are combined via the chain rule of calculus. One of the biggest disadvantages of the backpropagation is that it requires a large amount of training time. To overcome this issue, this paper proposes a modified backpropagation algorithm with multiplicative calculus. Multiplicative calculus provides an alternative to the classical calculus and it defines new kinds of derivative and integral forms in multiplicative form rather than addition and subtraction forms. The performance analyses are discussed in various case studies and the results are given comparatively with the classical backpropagation algorithm. It is found that the proposed modified backpropagation algorithm converges in less t...

39. Decoupled neural network training with re-computation and weight prediction

Jiawei Peng, Yicheng Xu, Zhiping Lin - Public Library of Science (PLoS), 2023

To break the three lockings during backpropagation (BP) process for neural network training, multiple decoupled learning methods have been investigated recently. These methods either lead to significant drop in accuracy performance or suffer from dramatic increase in memory usage. In this paper, a new form of decoupled learning, named decoupled neural network training scheme with re-computation and weight prediction (DTRP) is proposed. In DTRP, a re-computation scheme is adopted to solve the memory explosion problem, and a weight prediction scheme is proposed to deal with the weight delay caused by re-computation. Additionally, a batch compensation scheme is developed, allowing the proposed DTRP to run faster. Theoretical analysis shows that DTRP is guaranteed to converge to critical points under certain conditions. Experiments are conducted by training various convolutional neural networks on several classification datasets, showing comparable or better results than the state-of-the-art methods and BP. These experiments also reveal that adopting the proposed method, the memory explosi...

40. One Forward is Enough for Neural Network Training via Likelihood Ratio Method

Jinyang Jiang, Zeliang Zhang, Chenliang Xu, 2023

While backpropagation (BP) is the mainstream approach for gradient computation in neural network training, its heavy reliance on the chain rule of differentiation constrains the designing flexibility of network architecture and training pipelines. We avoid the recursive computation in BP and develop a unified likelihood ratio (ULR) method for gradient estimation with just one forward propagation. Not only can ULR be extended to train a wide variety of neural network architectures, but the computation flow in BP can also be rearranged by ULR for better device adaptation. Moreover, we propose several variance reduction techniques to further accelerate the training process. Our experiments offer numerical results across diverse aspects, including various neural network training scenarios, computation flow rearrangement, and fine-tuning of pre-trained models. All findings demonstrate that ULR effectively enhances the flexibility of neural network training by permitting localized module training without compromising the global objective and significantly boosts the network robustness.

41. TinyProp -- Adaptive Sparse Backpropagation for Efficient TinyML On-device Learning

Marcus Rüb, Daniel Maier, Daniel Mueller-Gritschneder, 2023

Training deep neural networks using backpropagation is very memory and computationally intensive. This makes it difficult to run on-device learning or fine-tune neural networks on tiny, embedded devices such as low-power micro-controller units (MCUs). Sparse backpropagation algorithms try to reduce the computational load of on-device learning by training only a subset of the weights and biases. Existing approaches use a static number of weights to train. A poor choice of this so-called backpropagation ratio limits either the computational gain or can lead to severe accuracy losses. In this paper we present TinyProp, the first sparse backpropagation method that dynamically adapts the back-propagation ratio during on-device training for each training step. TinyProp induces a small calculation overhead to sort the elements of the gradient, which does not significantly impact the computational gains. TinyProp works particularly well on fine-tuning trained networks on MCUs, which is a typical use case for embedded applications. For three typical datasets, MNIST, DCASE2020 and...

42. Bridging Discrete and Backpropagation: Straight-Through and Beyond

Liyuan Liu, Chengyu Dong, Xiaodong Liu, 2023

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables. First, we examine the widely used Straight-Through (ST) heuristic and demonstrate that it works as a first-order approximation of the gradient. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs. ReinMax does not require Hessian or other second-order derivatives, thus having negligible computation overheads. Extensive experimental results on various tasks demonstrate the superiority of ReinMax over the state of the art. Implementations are released at https://github.com/microsoft/ReinMax.

43. Block-local learning with probabilistic latent representations

David Kappel, Khaleelulla Khan Nazeer, Cabrel Teguemne Fokam, 2023

The ubiquitous backpropagation algorithm requires sequential updates through the network introducing a locking problem. In addition, back-propagation relies on the transpose of forward weight matrices to compute updates, introducing a weight transport problem across the network. Locking and weight transport are problems because they prevent efficient parallelization and horizontal scaling of the training process. We propose a new method to address both these problems and scale up the training of large models. Our method works by dividing a deep neural network into blocks and introduces a feedback network that propagates the information from the targets backwards to provide auxiliary local losses. Forward and backward propagation can operate in parallel and with different sets of weights, addressing the problems of locking and weight transport. Our approach derives from a statistical interpretation of training that treats output activations of network blocks as parameters of probability distributions. The resulting learning framework uses these parameters to evaluate the agreement bet...

44. PaReprop: Fast Parallelized Reversible Backpropagation

Tyler Zhu, Karttikeya Mangalam, 2023

The growing size of datasets and deep learning models has made faster and memory-efficient training crucial. Reversible transformers have recently been introduced as an exciting new method for extremely memory-efficient training, but they come with an additional computation overhead of activation re-computation in the backpropagation phase. We present PaReprop, a fast Parallelized Reversible Backpropagation algorithm that parallelizes the additional activation re-computation overhead in reversible training with the gradient computation itself in backpropagation phase. We demonstrate the effectiveness of the proposed PaReprop algorithm through extensive benchmarking across model families (ViT, MViT, Swin and RoBERTa), data modalities (Vision & NLP), model sizes (from small to giant), and training batch sizes. Our empirical results show that PaReprop achieves up to 20% higher training throughput than vanilla reversible training, largely mitigating the theoretical overhead of 25% lower throughput from activation recomputation in reversible training. Project page: https://tylerzhu.com/pa...

45. Can Forward Gradient Match Backpropagation?

Louis Fournier, Stéphane Rivaud, Eugene Belilovsky, 2023

Forward Gradients - the idea of using directional derivatives in forward differentiation mode - have recently been shown to be utilizable for neural network training while avoiding problems generally associated with backpropagation gradient computation, such as locking and memorization requirements. The cost is the requirement to guess the step direction, which is hard in high dimensions. While current solutions rely on weighted averages over isotropic guess vector distributions, we propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks. For a standard computer vision neural network, we conduct a rigorous study systematically covering a variety of combinations of gradient targets and gradient guesses, including those previously presented in the literature. We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.
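
For context, a bare-bones forward-gradient step (the baseline these papers build on) can be written with `torch.func.jvp`: a single forward-mode pass yields the directional derivative along a guess vector v, and (grad·v)v is an unbiased gradient estimate for isotropic v. The model and data here are toy placeholders:

```python
# Forward-gradient update: one forward-mode pass, no backpropagation.
import torch
from torch.func import jvp

def loss_fn(w, x, y):
    return ((x @ w - y) ** 2).mean()

w = torch.randn(16, 1)
x, y = torch.randn(64, 16), torch.randn(64, 1)

v = torch.randn_like(w)                          # random guess direction
_, dir_deriv = jvp(lambda w_: loss_fn(w_, x, y), (w,), (v,))
g_hat = dir_deriv * v                            # forward-gradient estimate
w = w - 0.01 * g_hat                             # one SGD-style update
```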

46. A Novel Method for improving accuracy in neural network by reinstating traditional back propagation technique

R Gokulprasath, 2023

Deep learning has revolutionized industries like computer vision, natural language processing, and speech recognition. However, back propagation, the main method for training deep neural networks, faces challenges like computational overhead and vanishing gradients. In this paper, we propose a novel instant parameter update methodology that eliminates the need for computing gradients at each layer. Our approach accelerates learning, avoids the vanishing gradient problem, and outperforms state-of-the-art methods on benchmark data sets. This research presents a promising direction for efficient and effective deep neural network training.

47. SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks

Mahdi Nikdan, Tommaso Pegolotti, Eugenia Iofinova, 2023

We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse. Our algorithm is general, as it applies to arbitrary (unstructured) sparsity and common layer types (e.g., convolutional or linear). We provide a fast vectorized implementation on commodity CPUs, and show that it can yield speedups in end-to-end runtime experiments, both in transfer learning using already-sparsified networks, and in training sparse networks from scratch. Thus, our results provide the first support for sparse training on commodity hardware.

48. Entropy Based Regularization Improves Performance in the Forward-Forward Algorithm

Matteo Pardi, Domenico Tortorella, Alessio Micheli - Ciaco - i6doc.com, 2023

The forward-forward algorithm (FFA) is a recently proposed alternative to end-to-end backpropagation in deep neural networks. FFA builds networks greedily layer by layer, thus being of particular interest in applications where memory and computational constraints are important. In order to boost layers' ability to transfer useful information to subsequent layers, in this paper we propose a novel regularization term for the layerwise loss function that is based on Renyi's quadratic entropy. Preliminary experiments show accuracy is generally significantly improved across all network architectures. In particular, smaller architectures become more effective in addressing our classification tasks compared to the original FFA.

49. Beam Tree Recursive Cells

Jishnu Ray Chowdhury, Cornelia Caragea, 2023

We propose Beam Tree Recursive Cell (BT-Cell) - a backpropagation-friendly framework to extend Recursive Neural Networks (RvNNs) with beam search for latent structure induction. We further extend this framework by proposing a relaxation of the hard top-k operators in beam search for better propagation of gradient signals. We evaluate our proposed models in different out-of-distribution splits in both synthetic and realistic data. Our experiments show that BT-Cell achieves near-perfect performance on several challenging structure-sensitive synthetic tasks like ListOps and logical inference while maintaining comparable performance in realistic data against other RvNN-based models. Additionally, we identify a previously unknown failure case for neural models in generalization to unseen number of arguments in ListOps. The code is available at: https://github.com/JRC1995/BeamTreeRecursiveCells.

50. Efficient Real Time Recurrent Learning through combined activity and parameter sparsity

Anand Subramoney, 2023

Backpropagation through time (BPTT) is the standard algorithm for training recurrent neural networks (RNNs), which requires separate simulation phases for the forward and backward passes for inference and learning, respectively. Moreover, BPTT requires storing the complete history of network states between phases, with memory consumption growing proportional to the input sequence length. This makes BPTT unsuited for online learning and presents a challenge for implementation on low-resource real-time systems. Real-Time Recurrent Learning (RTRL) allows online learning, and the growth of required memory is independent of sequence length. However, RTRL suffers from exceptionally high computational costs that grow proportional to the fourth power of the state size, making RTRL computationally intractable for all but the smallest of networks. In this work, we show that recurrent networks exhibiting high activity sparsity can reduce the computational cost of RTRL. Moreover, combining activity and parameter sparsity can lead to significant enough savings in computational and memory costs to...

51. Selective Path Automatic Differentiation: beyond uniform distribution on backpropagation dropout

52. Human Inspired Memory Module for Memory Augmented Neural Networks

53. Tidbits on Neural Network Training

54. An In-depth Study of Stochastic Backpropagation

55. A comparative study of back propagation and its alternatives on multilayer perceptrons
