Apple's Large Language Model Breakthroughs
36 patents in this list
Large language models require significant computational resources, with state-of-the-art systems processing billions of parameters across distributed hardware architectures. Apple's approach focuses on optimizing these models for on-device deployment, where processing power, memory, and energy constraints create natural boundaries for model size and complexity.
The fundamental challenge lies in balancing model capabilities and resource efficiency while maintaining user privacy and real-time performance on mobile devices.
This page brings together solutions from recent research—including efficient memory addressing techniques, hardware-specific neural processors, asymmetric model retraining approaches, and specialized compiler optimizations. These and other approaches demonstrate how large language models can be practically implemented within the constraints of mobile computing environments.
1. Memory Addressing System Utilizing Multi-Level Granular Hashing for Device Distribution
Apple Inc., 2023
Memory addressing technique for distributing a large memory address space across multiple memory devices in a computer system. The technique uses hashing to select the memory device based on subsets of address bits at multiple levels of granularity. This allows flexible mapping of addresses to devices while optimizing performance by distributing accesses across devices. It also enables dynamic disabling of devices for maintenance while preserving frequently accessed data.
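For intuition, here is a hypothetical Python sketch of multi-level granular hashing, assuming a simple XOR-fold hash, a four-device system, and arbitrary bit-field choices; the patent's actual hash functions and bit selections are not specified here.

```python
# Hypothetical sketch of multi-level granular address hashing; the hash
# (XOR-fold), device count, and bit-field choices are all assumptions.

NUM_DEVICES = 4

def xor_fold(bits: int) -> int:
    """Fold a bit field down to a single parity bit."""
    parity = 0
    while bits:
        parity ^= bits & 1
        bits >>= 1
    return parity

def select_device(addr: int) -> int:
    """Pick a device from hashes of coarse- and fine-granularity bit subsets."""
    coarse = (addr >> 20) & 0xFFF   # page-level bits: spread large regions
    fine = (addr >> 6) & 0x3FFF     # cache-line bits: spread hot lines
    # Combine one hash bit per granularity level into a device index.
    index = (xor_fold(coarse) << 1) | xor_fold(fine)
    return index % NUM_DEVICES

if __name__ == "__main__":
    for addr in (0x0000_0040, 0x0010_0040, 0x0010_1040, 0xFFF0_3FC0):
        print(f"address {addr:#010x} -> device {select_device(addr)}")
```

Hashing at two granularities spreads both large regions and neighboring cache lines across devices, which is what distributes accesses in the description above.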
2. Hardware Accelerator with Power-Gated Local Memory and Non-Volatile Data Retention
Apple Inc., 2023
Hardware accelerators with power-gated local memory to reduce power consumption. The accelerator and its memory can be powered down between iterations to save power. Reusable data such as constants and instruction words is stored in non-volatile memory, while volatile memory holds data that varies between iterations. The non-volatile memory is initialized once and reused without reloading each iteration; the volatile memory is simply powered off between iterations.
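A software model of the volatile/non-volatile split can make the iteration flow concrete. Everything below, the NVRegion and VolatileRegion classes and the run_iteration helper, is invented for illustration.

```python
# Illustrative model only: constants live in a region that survives
# power-gating, while per-iteration data lives in a region that is lost.

class NVRegion:
    """Survives power-gating; initialized once with constants/instructions."""
    def __init__(self, constants):
        self.constants = constants  # retained between iterations

class VolatileRegion:
    """Powered off between iterations; contents are lost each time."""
    def __init__(self):
        self.data = None

    def power_off(self):
        self.data = None  # model loss of state on power-down

def run_iteration(nv: NVRegion, inputs):
    vol = VolatileRegion()            # powered on for this iteration
    vol.data = [x * nv.constants["scale"] for x in inputs]
    result = sum(vol.data)
    vol.power_off()                   # volatile state discarded to save power
    return result

nv = NVRegion({"scale": 2})           # loaded once, reused every iteration
print(run_iteration(nv, [1, 2, 3]))   # 12
print(run_iteration(nv, [4, 5]))      # 18 -- no reload of constants needed
```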
3. Neural Processor with Integrated Neural and Planar Engine Circuits for Input Management
Apple Inc., 2023
A neural processor with a combination of neural engine circuits for specialized neural network computations and planar engine circuits for more general computations. The planar engine circuits can handle a larger number of inputs than the neural engine circuits, and they can efficiently process combined input data by separating and duplicating values. This allows compact storage of multiple inputs and reduces computation cycles for operations involving many sources.
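A speculative numpy illustration of the combined-input handling: two source tensors are stored packed in one buffer, then separated by slicing and duplicated to align shapes before an elementwise operation. The packing layout is an assumption, not the patent's format.

```python
# Speculative illustration: pack two inputs compactly, then separate and
# duplicate values so they can be combined elementwise.

import numpy as np

a = np.arange(12.0).reshape(3, 4)        # source 1
b = np.array([10.0, 20.0, 30.0])         # source 2 (per-row scalar)

packed = np.concatenate([a.ravel(), b])  # compact storage of both inputs

a2 = packed[:12].reshape(3, 4)           # separate source 1
b2 = packed[12:].reshape(3, 1)           # separate source 2...
result = a2 + np.repeat(b2, 4, axis=1)   # ...and duplicate across columns

print(result[0])                          # [10. 11. 12. 13.]
```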
4. Digital System Hardware Accelerators with Independently Power-Gated Local Memory Sections
Apple Inc., 2022
Hardware accelerators in digital systems with local memories that can be power gated to reduce power consumption. The local memory is divided into independently powerable sections. The accelerators receive instructions that specify the amount of memory needed for that instruction. The power control circuit powers on the necessary sections for each instruction while powering off the unused sections to save power. This allows using smaller, lower-leakage local memories while still providing enough memory for each instruction.
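A minimal sketch of the per-instruction power plan, assuming equal-sized sections and an instruction that declares its memory requirement up front; the section size and power interface are invented.

```python
# Minimal sketch: power on only as many memory sections as the current
# instruction declares it needs; the sizes here are assumptions.

import math

SECTION_BYTES = 4096
NUM_SECTIONS = 8

def sections_needed(required_bytes: int) -> int:
    return math.ceil(required_bytes / SECTION_BYTES)

def power_plan(required_bytes: int) -> list[bool]:
    """Return per-section power states for one instruction."""
    on = sections_needed(required_bytes)
    if on > NUM_SECTIONS:
        raise ValueError("instruction exceeds local memory capacity")
    return [i < on for i in range(NUM_SECTIONS)]

# An instruction needing 10 KB powers on 3 of 8 sections; the rest stay off.
print(power_plan(10 * 1024))
```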
5. Multi-Layer Convolution Algorithm with Block-Wise VPU Width Matching for Vector Processors
Apple Inc., 2022
Efficient multi-layer convolution algorithm for vector processors that maximizes utilization of the vector processing unit (VPU) for convolution over multi-channel input. The algorithm selects how to process each output block according to its width relative to the VPU's width: blocks whose width is a multiple of the VPU width are processed one channel at a time, while narrower blocks pack multiple channels into the VPU simultaneously. This keeps the VPU's data paths fully utilized. The algorithm divides the output into blocks, determines each block's width, and assigns blocks to threads that process them with vector instructions optimized for that width.
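The block-assignment logic might look like the following schematic sketch, assuming a VPU with a fixed number of lanes; the actual scheduling and vector intrinsics are not reproduced here.

```python
# Schematic sketch of block-wise width matching; VPU_WIDTH and the packing
# rule are assumptions for illustration.

VPU_WIDTH = 8

def plan_blocks(output_width: int, num_channels: int):
    """Yield (block_width, channels_per_vector) work items."""
    full_blocks, remainder = divmod(output_width, VPU_WIDTH)
    for _ in range(full_blocks):
        # Block width matches the VPU: one channel fills all lanes.
        yield VPU_WIDTH, 1
    if remainder:
        # Narrow block: pack several channels side by side to fill the lanes.
        packed = min(num_channels, VPU_WIDTH // remainder)
        yield remainder, max(packed, 1)

for width, packed in plan_blocks(output_width=18, num_channels=16):
    lanes_used = width * packed
    print(f"block width {width}: {packed} channel(s)/vector, "
          f"{lanes_used}/{VPU_WIDTH} lanes busy")
```

With an output width of 18 and 8 lanes, the two full blocks use one channel each, and the leftover width-2 block packs four channels to keep all 8 lanes busy.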
6. Cloud-Integrated Adaptive Training and Deployment of Large Language Models on Resource-Constrained Edge Devices
Xidian University, 2024
Training and deploying large language models for natural language processing on resource-limited edge devices such as smartphones. The method uses cloud-integrated training that adaptively adjusts model parameters based on real-time feedback from the edge devices, combining optimization strategies from both the edge and server sides to balance speed and accuracy. During inference at the edge, performance metrics such as speed and accuracy are monitored and fed back to the server, which dynamically adjusts model parameters to fit the device's specific resource constraints. This enables flexible optimization of large language models across both training and inference.
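A toy version of the feedback loop, under assumed names and metrics: the edge reports latency and accuracy, and the server grows or shrinks a model-width parameter. The concrete metrics and adjustment policy here are illustrative only.

```python
# Toy edge/server feedback loop; edge_inference and server_adjust are
# invented stand-ins, not the patented method.

def edge_inference(model_width: int) -> dict:
    # Stand-in metrics: wider models are slower but more accurate.
    return {"latency_ms": model_width * 2.0, "accuracy": 0.70 + model_width * 0.01}

def server_adjust(width: int, metrics: dict, latency_budget_ms: float) -> int:
    if metrics["latency_ms"] > latency_budget_ms:
        return max(width - 4, 4)      # too slow: shrink the deployed model
    return width + 2                  # headroom left: grow for accuracy

width = 32
for step in range(5):
    metrics = edge_inference(width)          # measured on the edge device
    width = server_adjust(width, metrics, latency_budget_ms=50.0)  # server side
    print(f"step {step}: {metrics} -> next width {width}")
```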
7. Data Processing Method for Neural Network Model Scaling Using Externally Stored Network Units
Huawei Technologies Co., Ltd., 2024
Data processing method to increase the size of large scale models by splitting them into network units stored externally to the computing unit. The method involves determining the network units corresponding to a target word vector based on a mapping table. These units are fetched from external storage and used to construct the neural network during training. This allows the model to scale beyond the memory of the computing unit. The external storage can be accessed between training iterations.
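A minimal sketch of the mapping-table lookup, modeling the external storage as an in-memory dictionary of per-unit weight arrays; the layout and unit names are assumptions, not the patent's format.

```python
# Minimal sketch: only the network units needed for the current tokens are
# fetched from (simulated) external storage.

import numpy as np

# Mapping table: token id -> network-unit key held in external storage.
mapping_table = {0: "unit_common", 1: "unit_common", 2: "unit_rare"}

# External storage stand-in (would be disk/flash in practice).
external_store = {
    "unit_common": np.ones((4, 4), dtype=np.float32),
    "unit_rare": np.full((4, 4), 2.0, dtype=np.float32),
}

def fetch_units(token_ids):
    """Fetch only the network units needed for this batch of tokens."""
    keys = {mapping_table[t] for t in token_ids}
    return {k: external_store[k] for k in keys}

units = fetch_units([0, 2])          # loads two units, skips the rest
print(sorted(units))                 # ['unit_common', 'unit_rare']
```

Because units are fetched per batch, the full model never has to reside in the computing unit's memory at once, which is the scaling trick described above.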
8. Pre-trained Language Model Migration via Redundant Module Removal and Adapter Short-Circuiting Using Reinforcement Learning
Xiamen University, 2023
Method to efficiently migrate large-scale pre-trained language models to downstream tasks with reduced computational and memory overheads. The method involves identifying and removing redundant modules from the pre-trained model using reinforcement learning. A lightweight adapter is then short-circuited onto the identified redundant modules to bypass them during inference. This reduces the model size and improves deployment efficiency compared to full fine-tuning or parameter efficient methods that add additional parameters.
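A hedged PyTorch-style sketch of the short-circuit itself: a heavy block is replaced by a small bottleneck adapter on the residual path. The reinforcement-learning selection of redundant modules is not shown; the `redundant` flag is set by hand.

```python
# Sketch of adapter short-circuiting; the RL policy that decides which
# modules are redundant is omitted, and layer sizes are placeholders.

import torch
import torch.nn as nn

class AdapterShortCircuit(nn.Module):
    """Bypass a heavy block with a lightweight bottleneck adapter."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

dim = 64
heavy_block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
redundant = True                      # would come from the RL policy

layer = AdapterShortCircuit(dim) if redundant else heavy_block
x = torch.randn(2, 10, dim)
print(layer(x).shape)                 # torch.Size([2, 10, 64])
```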
9. Neural Network-Based Context-Aware Text Prediction System with Separate Context Extraction, Prediction, and Relevance Assessment Networks
Apple Inc., 2023
Generating word and phrase predictions for text completion in digital assistants, providing efficient, context-relevant suggestions. The method uses a neural network system to determine predictions based on both the text being completed and its surrounding context. This allows the digital assistant to offer context-aware predictions of varying lengths, improving completion accuracy over predicting from the text alone. The system has separate networks for extracting the context, determining text predictions, and assessing relevance.
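Schematically, the three cooperating networks could be wired as below; the layer types and sizes are placeholders, not the patented architecture.

```python
# Placeholder wiring of the three networks named above: context extraction,
# prediction, and relevance scoring. All dimensions are invented.

import torch
import torch.nn as nn

class ContextExtractor(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
    def forward(self, ctx):
        _, h = self.rnn(ctx)
        return h[-1]                      # context summary vector

class Predictor(nn.Module):
    def __init__(self, dim=32, vocab=100):
        super().__init__()
        self.head = nn.Linear(dim * 2, vocab)
    def forward(self, text_vec, ctx_vec):
        return self.head(torch.cat([text_vec, ctx_vec], dim=-1))

class RelevanceScorer(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)
    def forward(self, ctx_vec):
        return torch.sigmoid(self.score(ctx_vec))  # keep/suppress prediction

ctx = torch.randn(1, 5, 32)               # surrounding context tokens
text = torch.randn(1, 32)                 # text being completed
ctx_vec = ContextExtractor()(ctx)
logits = Predictor()(text, ctx_vec)
if RelevanceScorer()(ctx_vec).item() > 0.5:
    print("suggest token", logits.argmax(dim=-1).item())
else:
    print("suppress low-relevance prediction")
```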
10. Large Language Model Decomposition into Base and Low-Rank Task-Specific Adapter Modules
Jiangsu Weihao Intelligent Technology Co., Ltd., 2023
Optimizing and fine-tuning large pre-trained language models with reduced computational and storage requirements. The method involves breaking down a large pre-trained language model into a task-independent base model and task-specific adapter modules. The adapter modules are trained on specific tasks while constrained by low-rank decomposition to reduce the number of trainable parameters. This allows efficient adaptation of the large pre-trained model to new tasks with less computational overhead compared to directly fine-tuning the entire model.
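A minimal low-rank adapter sketch in PyTorch, in the spirit of the base-plus-adapter split described above; the rank, initialization, and absence of scaling are illustrative choices.

```python
# Minimal low-rank adapter: frozen base weight W plus a trainable rank-r
# update B @ A. Rank and init are arbitrary illustrative choices.

import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen task-independent base plus a trainable rank-r update."""
    def __init__(self, dim_in: int, dim_out: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        self.base.weight.requires_grad_(False)   # base model stays fixed
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, dim_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim_out, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LowRankAdapter(128, 128, rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")   # 1024 of 17536
```

The low-rank constraint is what keeps the per-task cost small: only the A and B factors are trained and stored per task, while the base is shared.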
11. Digital Assistant Interface with Suggested User Input Affordances for Task Discovery and Execution
Apple Inc., 2023
Providing suggested user inputs for triggering digital assistant tasks to help users discover and request useful functions. The digital assistant analyzes user requests for tasks and device content to determine text representations of utterances that can be used to perform the tasks. These suggested utterances are then displayed as affordances on the user interface to allow easy selection and execution of the tasks. The affordances provide a way for users to discover and request tasks they may not know how to ask for, making the digital assistant more accessible and efficient.
12. Joint Training Method for Large Language Models with Task-Specific Smaller Models
Ping An Technology (Shenzhen) Co., Ltd., 2023
Improving the generation quality and accuracy of large language models like GPT-3 on tasks such as medical report generation by leveraging the knowledge of smaller models. The method jointly trains the large language model with a smaller model that already performs the task, allowing the large model to learn the smaller model's output format and logic. The large model can then generate more accurate and appropriately formatted outputs when used on its own.
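A toy joint-training step under assumed losses: the large model is trained on both task labels and the small model's outputs, so it absorbs the smaller model's format and logic. Both models below are stand-in linear layers.

```python
# Toy joint objective: task loss plus imitation of a frozen small model.
# Models, loss weighting, and data are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

large = nn.Linear(16, 8)                  # stand-in for the large LM head
small = nn.Linear(16, 8)                  # stand-in for the trained task model
small.requires_grad_(False)

opt = torch.optim.SGD(large.parameters(), lr=0.1)
x = torch.randn(32, 16)
labels = torch.randint(0, 8, (32,))

for _ in range(3):
    logits = large(x)
    with torch.no_grad():
        teacher = F.softmax(small(x), dim=-1)      # small model's "logic"
    task_loss = F.cross_entropy(logits, labels)
    imitation = F.kl_div(F.log_softmax(logits, -1), teacher, reduction="batchmean")
    loss = task_loss + 0.5 * imitation             # joint objective
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"loss {loss.item():.3f}")
```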
13. Digital Assistant with Multi-Stage Contextual Speech Analysis for Continuous Dialog
Apple Inc., 2023
Continuous dialog with a digital assistant that allows more natural and seamless interactions between users and digital assistants. The system uses a multi-stage speech analysis process to handle user follow-up speech: initial speech values are analyzed first, and second values are then analyzed in light of the dialog context. This lets the system better understand and respond to follow-up requests and statements, and allows users to interrupt and correct the assistant when needed. The multi-stage analysis is performed by the digital assistant itself rather than relying solely on speech-related cues.
14. System for Asymmetric Retraining of Machine Learning Models with Input Data Distribution Constraints
Apple Inc., 2023
Allowing asymmetric retraining of upstream and downstream machine learning models without affecting consistency of output. The technique involves specifying constraints on the input data distribution that allow the downstream model to be retrained independently without requiring retraining of the upstream model. The constraints preserve the semantics of the values while allowing the distribution to change over time. This allows the downstream model to adapt to new data without affecting the accuracy of the upstream model's predictions.
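One way to read the constraint idea, sketched with invented names: each feature declares a fixed value domain, and the downstream model may retrain on any new data whose values stay inside that domain, so upstream output semantics are preserved while the distribution drifts.

```python
# Illustrative contract check; CONSTRAINTS and the record format are
# invented, not the patent's specification.

CONSTRAINTS = {
    "confidence": lambda v: 0.0 <= v <= 1.0,   # semantics fixed; distribution free
    "category": lambda v: v in {"a", "b", "c"},
}

def satisfies_contract(record: dict) -> bool:
    return all(check(record[k]) for k, check in CONSTRAINTS.items())

new_training_data = [
    {"confidence": 0.93, "category": "a"},   # shifted distribution, same domain
    {"confidence": 0.88, "category": "c"},
]
assert all(satisfies_contract(r) for r in new_training_data)
print("downstream model may retrain without touching the upstream model")
```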
15. Method for Associating Actions with Augmented User Utterances in Natural Language Models
Apple Inc., 2023
Creating and updating natural language models for digital assistants that allows more efficient and accurate interpretation of user requests. The method involves associating actions with user requests, determining augmented utterances based on the original request, and creating the model by mapping the augmented utterances to the associated actions. This allows the model to handle variations and unknown words in new requests. It also involves sharing learned models between applications.
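An illustrative sketch assuming naive augmentation by synonym substitution (the patent's augmentation method is not specified): every augmented utterance maps to the same action.

```python
# Illustrative augmentation-and-mapping sketch; the synonym table and
# action id are invented.

SYNONYMS = {"turn on": ["enable", "switch on"], "lights": ["lamps"]}

def augment(utterance: str) -> set[str]:
    variants = {utterance}
    for phrase, alts in SYNONYMS.items():
        for u in list(variants):
            if phrase in u:
                variants.update(u.replace(phrase, alt) for alt in alts)
    return variants

# Build the model: every augmented utterance maps to the same action.
model = {u: "action.lights_on" for u in augment("turn on the lights")}
print(model.get("enable the lamps"))   # action.lights_on
```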
16. Neural Processor Compiler with Local Data Buffer for Minimizing External Memory Access
Apple Inc., 2023
Compiler for neural processors that reduces data fetches between memory external to the neural processor and the neural engines inside it. The compiler stores input, output, and intermediate data in a local data buffer inside the neural processor rather than in system memory, enabling more efficient processing by avoiding most external memory reads and writes.
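A hedged sketch of the placement decision such a compiler might make, with invented buffer sizes: intermediates that fit in the local buffer stay on-chip, and only overflow falls back to system memory.

```python
# Greedy placement sketch; the buffer size, tensor names, and greedy policy
# are assumptions, not the patented compiler algorithm.

LOCAL_BUFFER_BYTES = 1 << 20          # 1 MiB on-chip buffer (assumed size)

def place_tensors(tensor_sizes: dict[str, int]) -> dict[str, str]:
    """Smallest-first placement: local buffer first, system memory on overflow."""
    placement, used = {}, 0
    for name, size in sorted(tensor_sizes.items(), key=lambda kv: kv[1]):
        if used + size <= LOCAL_BUFFER_BYTES:
            placement[name] = "local_buffer"
            used += size
        else:
            placement[name] = "system_memory"   # costly external access
    return placement

print(place_tensors({"input": 300_000, "act1": 500_000, "act2": 400_000}))
```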
17. Neural Processor Circuit with Integrated Binary Comparator and Dimensional Reduction Engines
Apple Inc., 2023
Neural processor circuit for performing binary comparison and reduction operations in a neural network accelerator without software control. The circuit has a neural engine to perform convolutions and a separate planar engine with a binary comparator and filter. The binary comparator applies Boolean operations to output tensors to generate conditional tensors. The filter reduces dimensions of the conditional tensors to generate channel-wise values. This allows implementing conditional operations and reducing tensor sizes directly in hardware, avoiding software for these tasks.
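A numpy model of the two hardware stages: a Boolean comparison produces a conditional tensor, and a reduction collapses its spatial dimensions into channel-wise values. The specific operation and reduction here are chosen for illustration.

```python
# Software model of the comparator + filter stages; the greater-than op and
# sum reduction are illustrative choices.

import numpy as np

activations = np.random.randn(2, 4, 8, 8)        # (batch, channel, h, w)
threshold = np.zeros_like(activations)

conditional = activations > threshold            # binary comparator stage
channel_counts = conditional.sum(axis=(2, 3))    # filter: reduce h, w dims

print(channel_counts.shape)                      # (2, 4) channel-wise values
```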
18. Content Grouping System Using Semantically Equivalent Topic Tags Across Languages
Apple Inc., 2023
Grouping and presenting content items by semantically equivalent topics instead of strict grouping by topic tags to prevent confusion when topic tags are different in different languages. The technique involves determining semantically equivalent topic tags across languages and grouping content items based on these equivalent topics. This allows presenting all politics-related content together regardless of whether the topic tag is "politics" in English or "politique" in French.
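A minimal sketch in which a canonical topic id bridges the language-specific tags; the tag table is invented.

```python
# Minimal grouping sketch; CANONICAL is an invented tag table mapping
# language-specific tags to a shared topic id.

CANONICAL = {"politics": "topic.politics", "politique": "topic.politics",
             "sport": "topic.sports", "sports": "topic.sports"}

def group_by_topic(items):
    groups = {}
    for title, tag in items:
        groups.setdefault(CANONICAL.get(tag, tag), []).append(title)
    return groups

items = [("Election recap", "politics"), ("Débat présidentiel", "politique")]
print(group_by_topic(items))   # both land under 'topic.politics'
```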
19. Neural Network for Unsupervised Grammatical Error Detection and Correction
Apple Inc., 2023
Intelligent detection and correction of grammatical errors in user input using a neural network trained with unsupervised learning. The network takes a set of words containing a grammatical error along with a reference set of words, transforms the error set, and reconstructs the reference set; comparing the transformed and reconstructed sets determines whether the input is grammatical. Training requires no labeled examples: error sets are generated synthetically, transformed, and compared. This allows efficient training of a grammar checker on unsupervised data.
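An illustrative generator for the unsupervised setup: grammatical sentences serve as references and corrupted copies serve as error examples, with no human labels. The corruption rule here (dropping a word) is a placeholder.

```python
# Illustrative synthetic-error generator; the corruption rule is a
# placeholder, not the patent's method.

import random

def corrupt(sentence: str) -> str:
    words = sentence.split()
    if len(words) > 2:
        del words[random.randrange(len(words))]   # synthetic grammatical error
    return " ".join(words)

references = ["she walks to the store", "the cat sleeps on the mat"]
pairs = [(corrupt(s), s) for s in references]     # (error set, reference set)
for err, ref in pairs:
    print(f"error: {err!r}  reference: {ref!r}")
```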
20. Neural Processor with Dual-Mode Tensor Processing and Reduction Circuitry
Apple Inc., 2022
A neural processor with multiple modes for processing large tensors efficiently. The processor has both neural engine circuits for convolution operations and planar engine circuits that support a reduction mode for aggregating tensor values. The planar engines reduce large tensors in multiple cycles and accumulate results in buffers. This allows processing tensors larger than the engine capacity. The reduction mode also has optimized post-processing circuits for common reduction operations.
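A numpy sketch of the multi-cycle reduction: a tensor larger than the per-cycle capacity is reduced chunk by chunk into an accumulator. The chunk size stands in for the unspecified engine capacity.

```python
# Chunked-reduction sketch; ENGINE_CAPACITY is an assumed stand-in for the
# engine's per-cycle capacity.

import numpy as np

ENGINE_CAPACITY = 1024                 # elements reducible per cycle (assumed)

def reduce_sum(tensor: np.ndarray) -> float:
    flat = tensor.ravel()
    acc = 0.0                          # accumulation buffer
    for start in range(0, flat.size, ENGINE_CAPACITY):
        acc += flat[start:start + ENGINE_CAPACITY].sum()   # one cycle's work
    return acc

x = np.ones((64, 100))                 # 6400 elements -> 7 cycles
print(reduce_sum(x), x.sum())          # both 6400.0
```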
Request the full report with complete details of these +16 patents for offline reading.