Heterogeneous computing architectures face significant performance barriers when moving data between dissimilar processing elements. Traditional systems stage each transfer through host memory, and the resulting extra copies add 150-300 microseconds of latency per transfer while consuming valuable memory bandwidth, often more than 20 GB/s during intensive workloads. The challenge intensifies in multi-processor environments, where each additional processing element further swells routing tables and complicates memory coherence protocols.

The fundamental engineering tradeoff in heterogeneous computing architectures lies in balancing the flexibility of diverse processing elements against the overhead of inter-processor communication and memory management.

This page brings together solutions from recent research—including virtual memory address mapping for direct data transfers between processors, protocol-adaptive external networks for reconfigurable processing units, dynamic data access selection mechanisms, and unified computing systems with direct node-cluster links. These and other approaches demonstrate practical implementations that minimize data movement costs while preserving the computational advantages of specialized processing elements.

1. Unified Computing System with Direct Node-Cluster Links and Localized Routing Tables

SHANGHAI BIREN TECHNOLOGY CO LTD, 2025

A unified computing system with direct links between computing nodes and clusters, enabling efficient data transfer and processing. Each computing processor maintains a local routing table to determine the next hop for data packets, eliminating the need for centralized routing and reducing latency. Direct node-to-cluster communication supports collaborative processing and optimized data transfer.
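
Conceptually, each node's forwarding decision reduces to a local table lookup. A minimal C sketch, assuming a fixed-size table indexed by destination node ID; the table layout and port encoding are illustrative, not taken from the patent:

```c
/* Per-processor next-hop lookup: each node consults only its own table,
 * so no centralized router sits on the data path. */
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES    64
#define INVALID_PORT 0xFF

typedef struct {
    uint8_t next_hop_port[MAX_NODES]; /* output port toward each destination */
} routing_table_t;

static uint8_t route_packet(const routing_table_t *rt, uint8_t dest_node) {
    if (dest_node >= MAX_NODES) return INVALID_PORT;
    return rt->next_hop_port[dest_node];
}

int main(void) {
    routing_table_t rt = { .next_hop_port = { [3] = 1, [7] = 2 } };
    printf("packet for node 7 leaves via port %u\n",
           (unsigned)route_packet(&rt, 7));
    return 0;
}
```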

2. Processor Architecture with Centralized Instruction Broadcast and Data Distribution for Vector Processing

BEIJING YOUZHUJU NETWORK TECHNOLOGY CO LTD, 2025

A processor architecture for efficient vector processing, comprising multiple processor cores with separate instruction and data caches, and a centralized distributor that broadcasts instructions and distributes data to the cores. The distributor enables efficient data processing by eliminating cache coherence overhead and allowing for simultaneous data access and processing across multiple cores.
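
The distributor can be pictured as one instruction broadcast plus a scatter of disjoint data slices. A hedged C sketch in which the mailbox layout and the slicing policy are assumptions made for illustration:

```c
/* Central distributor: broadcast one opcode to all cores and hand each
 * core a disjoint slice of the input data. Because no two cores share
 * writable data, no inter-core cache-coherence traffic is required. */
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 4

typedef struct {
    uint32_t     opcode;  /* same instruction for every core */
    const float *data;    /* private data slice for this core */
    size_t       len;
} core_mailbox_t;

static core_mailbox_t mailbox[NUM_CORES];

void distribute(uint32_t opcode, const float *data, size_t n) {
    size_t chunk = n / NUM_CORES;
    for (int c = 0; c < NUM_CORES; c++) {
        mailbox[c].opcode = opcode;
        mailbox[c].data   = data + (size_t)c * chunk;
        mailbox[c].len    = (c == NUM_CORES - 1) ? n - (size_t)c * chunk
                                                 : chunk;
    }
}
```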

3. Reconfigurable Processor Direct Data Transfer via Virtual Memory Address Mapping

SAMBANOVA SYSTEMS INC, 2025

Enabling direct data transfer between reconfigurable processors in distributed systems without involving the host CPUs. The technique bypasses host memory and CPUs for inter-processor communication by using virtual memory addressing: one processor implements a virtual memory function that maps virtual addresses to physical addresses, and a peer processor uses those virtual addresses to initiate direct memory access operations against the first processor's memory or reconfigurable-processor memory. This avoids the latency and bandwidth cost of staging transfers through host memory and CPUs.
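
The mechanism pairs a translation step on the exporting processor with a peer-programmed DMA descriptor. A minimal C sketch; the window-based translation and all structure names are assumptions for illustration:

```c
/* One processor exports a VA->PA window; a peer resolves a virtual
 * address through it and builds a DMA descriptor aimed directly at the
 * exporting processor's memory, with the host CPU never consulted. */
#include <stdint.h>

typedef struct { uint64_t va_base, pa_base, len; } vm_window_t;

static uint64_t vm_translate(const vm_window_t *w, uint64_t va) {
    return (va >= w->va_base && va < w->va_base + w->len)
               ? w->pa_base + (va - w->va_base)
               : 0; /* 0 = no mapping */
}

typedef struct { uint64_t src_pa, dst_pa, bytes; } dma_desc_t;

/* Peer side: descriptor for a read that bypasses host memory entirely. */
dma_desc_t make_p2p_read(const vm_window_t *remote, uint64_t remote_va,
                         uint64_t local_pa, uint64_t bytes) {
    dma_desc_t d = { vm_translate(remote, remote_va), local_pa, bytes };
    return d;
}
```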

4. Dynamic Data Access Selection Mechanism for Heterogeneous Systems with Multi-Method Transfer Options

SAMBANOVA SYSTEMS INC, 2025

Selecting the most efficient data access method between processors in a heterogeneous system based on latency and bandwidth requirements. The method dynamically chooses among three options: 1) two memory-to-memory transfers through a common buffer on the CPU, 2) direct memory access (DMA) without involving the CPU, and 3) memory extension, where the CPU maps virtual addresses to physical addresses and lets the processors directly access each other's memory. This hides the underlying details of data passing from the application and selects the optimal method for each device stage.

[Patent drawing: US12229057B2]
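
The selection logic amounts to a small cost-model dispatch over the three options above. A hedged C sketch; the thresholds and the cost model are assumptions, not values from the patent:

```c
/* Pick one of the three transfer methods based on size and criticality. */
#include <stdint.h>

typedef enum {
    XFER_HOST_BOUNCE,   /* two memory-to-memory copies via a CPU buffer */
    XFER_P2P_DMA,       /* direct DMA, CPU not on the data path */
    XFER_MEM_EXTENSION  /* CPU maps VAs so peers load/store remotely */
} xfer_method_t;

xfer_method_t pick_method(uint64_t bytes, int p2p_dma_available,
                          int latency_critical) {
    /* Small, latency-critical accesses favor direct loads and stores. */
    if (latency_critical && bytes <= 4096)
        return XFER_MEM_EXTENSION;
    /* Bulk moves favor DMA when the fabric supports peer-to-peer. */
    if (p2p_dma_available)
        return XFER_P2P_DMA;
    /* Fallback that works on any topology: bounce through host memory. */
    return XFER_HOST_BOUNCE;
}
```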

5. Memory Access Method in Heterogeneous Processing System with Inter-Processor Memory Mapping via Switch and Bus Circuitry

SAMBANOVA SYSTEMS INC, 2025

Method and apparatus for accessing data in a heterogeneous processing system with multiple processors using a memory extension operation. The system includes a host processor, a first processor coupled to a first memory, a second processor coupled to a second memory, and switch and bus circuitry that communicatively couples the host processor, the first processor, and the second processor. The host processor maps virtual addresses of the second memory to physical addresses of the switch and bus circuitry and configures the first processor to directly access the second memory through the mapped physical addresses.
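
The pivotal step is the host's mapping of the second memory into the switch's physical address space. A minimal C sketch of that mapping and the resulting direct access; the field names and window scheme are assumptions:

```c
/* Host maps the second memory into a window decoded by the switch and
 * bus circuitry; afterwards processor 1 reaches processor 2's memory
 * with ordinary loads, and the host CPU is off the data path. */
#include <stdint.h>

typedef struct {
    uint64_t bus_window_base;  /* PA range decoded by the switch */
    uint64_t mem2_size;
} ext_mapping_t;

/* Host-side setup: offset in second memory -> PA inside the window. */
uint64_t host_map_mem2(const ext_mapping_t *m, uint64_t mem2_offset) {
    return (mem2_offset < m->mem2_size)
               ? m->bus_window_base + mem2_offset
               : 0; /* 0 = out of range */
}

/* Processor 1 then dereferences the mapped address directly. */
uint32_t proc1_remote_read(volatile uint32_t *mapped_addr) {
    return *mapped_addr;
}
```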

6. Dataflow Processor Pipeline Execution Method with Latency-Based Buffer Removal in Reconfigurable Architectures

SAMBANOVA SYSTEMS INC, 2025

A method and system for optimizing pipeline execution in dataflow processors with reconfigurable architectures. The method involves generating a pipeline of computational nodes derived from a dataflow graph, interleaved with buffers on the array of reconfigurable units, and removing a buffer from the pipeline based on a comparison of the latencies of the computational nodes. The system comprises a compiler configured to generate the pipeline and remove buffers based on that latency comparison.

[Patent drawing: US12189570B2]
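
The removal decision reduces to comparing stage latencies on either side of a buffer. A hedged C sketch of such a compiler pass over a linked-list pipeline IR; the IR types and the slack threshold are assumptions:

```c
/* Drop a buffer whose neighboring compute stages are already balanced:
 * it adds area and latency without hiding any imbalance. */
#include <stdlib.h>

typedef struct node {
    int is_buffer;
    int latency_cycles;   /* stage latency for compute nodes */
    struct node *next;
} node_t;

void remove_balanced_buffers(node_t *head, int slack) {
    for (node_t *n = head; n && n->next && n->next->next; n = n->next) {
        node_t *buf = n->next;
        if (!buf->is_buffer)
            continue;
        if (abs(n->latency_cycles - buf->next->latency_cycles) <= slack)
            n->next = buf->next;   /* unlink the buffer from the pipeline */
    }
}
```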

7. Inter-Die Communication System with Protocol-Adaptive External Network for Reconfigurable Processing Units

SAMBANOVA SYSTEMS INC, 2024

A system and method for inter-die communication between reconfigurable processing units (RPUs) with different internal networks and protocols. The system enables peer-to-peer transactions between RPUs through a common external network, using a protocol that adapts to the internal network protocols of each RPU. The method involves routing requests through internal networks to reach the external network, where they are processed and forwarded to the target RPU. The system includes configurable units, internal networks, and external interfaces that enable efficient and flexible communication between RPUs.

[Patent drawing: US12143298B2]
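
The adaptation layer can be pictured as per-RPU rewriters that normalize native requests into one shared external packet format. A minimal C sketch; every field layout here is an assumption made for illustration:

```c
/* Two internal request formats, one common external packet: each RPU's
 * interface rewrites its native request before it crosses the die. */
#include <stdint.h>

typedef struct {
    uint16_t dst_rpu;
    uint8_t  opcode;
    uint64_t addr;
    uint32_t len;
} ext_pkt_t;

/* Internal protocol A orders fields address-first; protocol B differs. */
typedef struct { uint64_t addr; uint32_t len; uint8_t op; } int_a_req_t;
typedef struct { uint8_t op; uint32_t len; uint64_t addr; } int_b_req_t;

ext_pkt_t from_a(uint16_t dst, const int_a_req_t *r) {
    ext_pkt_t p = { dst, r->op, r->addr, r->len };
    return p;
}

ext_pkt_t from_b(uint16_t dst, const int_b_req_t *r) {
    ext_pkt_t p = { dst, r->op, r->addr, r->len };
    return p;
}
```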

8. Data Transfer Mechanism Utilizing Dual Memory-to-Memory Operations in Heterogeneous Processor Systems

SAMBANOVA SYSTEMS INC, 2024

Method and apparatus for transferring data between accessible memories of multiple processors in a heterogeneous processing system using two memory-to-memory transfer operations. The system includes a host processor, a first processor, a second processor, and multiple data transfer resources. The method involves allocating buffer space in the host memory, programming two data transfer resources to transfer data between the first memory and the host memory, and between the host memory and the second memory, and executing the data transfer operations.
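
The two-copy pattern is easy to sketch. In this hedged C version, memcpy stands in for the two programmed DMA resources; only the staging structure reflects the abstract:

```c
/* Move n bytes from processor 1's memory to processor 2's memory via a
 * bounce buffer allocated in host memory (copy 1: mem1 -> host,
 * copy 2: host -> mem2). */
#include <stdlib.h>
#include <string.h>

int host_bounce_xfer(void *mem2_dst, const void *mem1_src, size_t n) {
    void *bounce = malloc(n);        /* buffer space in host memory */
    if (!bounce) return -1;
    memcpy(bounce, mem1_src, n);     /* first memory-to-memory transfer */
    memcpy(mem2_dst, bounce, n);     /* second memory-to-memory transfer */
    free(bounce);
    return 0;
}
```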

9. Memory Extension Operation for Direct Access in Heterogeneous Multi-Processor Systems

SAMBANOVA SYSTEMS INC, 2024

Method and apparatus for accessing data in a heterogeneous processing system with multiple processors using a memory extension operation. The system includes a host processor, a first processor coupled to a first memory, a second processor coupled to a second memory, and switch and bus circuitry that communicatively couples the host processor, the first processor, and the second processor. The host processor maps virtual addresses of the second memory to physical addresses of the switch and bus circuitry and configures the first processor to directly access the second memory through the mapped physical addresses.

[Patent drawing: US2024248853A1]

10. Instruction Processing Apparatus with Trigger-Based Execution and Shared Buffer Management for Multiple Input Channels

ARM LTD, 2024

Apparatus for processing instructions in a triggered instruction architecture, comprising multiple processing elements and input channels, where each processing element executes instructions based on trigger conditions and can access multiple input channels through a shared buffer management system. The system allocates input data to buffers based on tag values, enabling efficient triggering and computation operations.
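
Tag-directed buffering and the trigger check can be sketched as a small slot table. A hedged C illustration; the slot structure and the tag-modulo placement are assumptions:

```c
/* Arriving data is steered to a buffer slot by its tag; an instruction
 * fires only once every input channel it names holds matching data. */
#include <stdint.h>

#define SLOTS 8

typedef struct { uint32_t tag; uint32_t value; int valid; } slot_t;

static slot_t shared_buf[SLOTS];

void accept_input(uint32_t tag, uint32_t value) {
    slot_t *s = &shared_buf[tag % SLOTS];   /* tag selects the slot */
    s->tag = tag;
    s->value = value;
    s->valid = 1;
}

/* Trigger condition: both operand tags are present in the shared buffer. */
int ready_to_fire(uint32_t tag_a, uint32_t tag_b) {
    const slot_t *a = &shared_buf[tag_a % SLOTS];
    const slot_t *b = &shared_buf[tag_b % SLOTS];
    return a->valid && a->tag == tag_a && b->valid && b->tag == tag_b;
}
```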

11. Decentralized Chip-to-Chip Interface with Network-on-Chip Bridge for Packetized Memory-Mapped Traffic

XILINX INC, 2024

A decentralized chip-to-chip interface architecture for transporting memory-mapped traffic between heterogeneous integrated circuit devices in a packetized, scalable, and configurable manner. The architecture enables virtualized communication between devices through a network-on-chip (NoC) interface and a NoC inter-chip bridge (NICB) that packetizes and routes memory-mapped traffic between devices over chip-to-chip interconnections.

[Patent drawing: US12019576B2]
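
Packetizing a memory-mapped access is the bridge's central step. A minimal C sketch with an assumed header layout; the real NICB format is not disclosed in this abstract:

```c
/* Wrap a memory-mapped write in a routable packet for the chip-to-chip
 * link; the bridge on the far side unpacks and replays it. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t dst_chip;   /* device the bridge should deliver to */
    uint16_t seq;        /* sequence number for ordering */
    uint64_t mm_addr;    /* original memory-mapped address */
    uint32_t len;
    uint8_t  payload[64];
} nicb_pkt_t;

nicb_pkt_t packetize_write(uint16_t dst_chip, uint16_t seq,
                           uint64_t mm_addr, const void *data,
                           uint32_t len) {
    nicb_pkt_t p = { dst_chip, seq, mm_addr, 0, {0} };
    if (len > sizeof p.payload)
        len = (uint32_t)sizeof p.payload;  /* one payload per packet here */
    memcpy(p.payload, data, len);
    p.len = len;
    return p;
}
```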

12. Processor System for Asynchronous Parallel Execution of Grouped Matrix Multiply-Accumulate Operations Across Streaming Multiprocessors

NVIDIA CORP, 2024

A processor and system that enables efficient parallel processing of tensor operations by grouping multiple matrix multiply-accumulate (MMA) operations together and executing them asynchronously across multiple streaming multiprocessors (SMs). The processor receives a first instruction to group multiple MMA operations and a second instruction to execute the grouped operations, which are then performed by multiple accelerators in the SMs. This approach enables larger tensor operations to be processed in parallel, overcoming the limitations of traditional single-threaded MMA execution.
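
The two-instruction pattern (group, then launch asynchronously) can be sketched as a descriptor queue. A hedged C illustration; the queue and the completion flag are software stand-ins for hardware state:

```c
/* Instruction 1 accumulates MMA descriptors into a group; instruction 2
 * hands the whole group to the accelerators and returns immediately,
 * letting the issuer poll for completion instead of blocking. */
#include <stdint.h>

#define MAX_GROUP 16

typedef struct { const float *a, *b; float *c; int m, n, k; } mma_desc_t;

typedef struct {
    mma_desc_t   ops[MAX_GROUP];
    int          count;
    volatile int done;   /* completion flag polled by the issuing thread */
} mma_group_t;

int mma_group_add(mma_group_t *g, mma_desc_t op) {
    if (g->count >= MAX_GROUP) return -1;
    g->ops[g->count++] = op;
    return 0;
}

void mma_group_launch_async(mma_group_t *g) {
    g->done = 0;
    /* hardware dispatch across the SMs' accelerators would happen here */
}
```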

13. Research and Development of Algorithms for Dispatching Tasks in Distributed Computing Systems

Si Thu Thant Sin, Evgeni M. Portnov, A. M. Bain - IEEE, 2024

This article delves into heterogeneous computing systems, which employ multi-core processors (CPUs) and graphics processing units (GPUs) concurrently, facilitating efficient handling of resource-intensive tasks demanding substantial computing power. Heterogeneous systems primarily serve to judiciously allocate resources among users and computational processes. The veracity and dependability of the work's outcomes hinge upon the accurate utilization of mathematical tools, findings from experimental studies on heterogeneous SoCs, and practical validation. These systems represent a paradigm shift in computing, leveraging the strengths of both CPUs and GPUs to tackle diverse workloads effectively. By harnessing the parallel processing capabilities of GPUs alongside the general-purpose computing prowess of CPUs, heterogeneous systems optimize performance and throughput, catering to the burgeoning demands of modern computing applications. Through meticulous resource allocation and rigorous validation methodologies, these systems ensure the delivery of reliable and consistent results across... Read More

14. Execution System Utilizing Auto-Discovery Module for Application Allocation on Heterogeneous Reconfigurable Processor Pool

SAMBANOVA SYSTEMS INC, 2024

Executing an application on a pool of reconfigurable processors that includes first and second pluralities of reconfigurable processors with different architectures. An auto-discovery module determines whether the application should execute on the first or second processors, and a runtime processor allocates and configures the selected processors accordingly.

15. Processor Architecture with Dataflow Execution for Predictable Control Flow and Regular Data Access Patterns

ADVANCED MICRO DEVICES INC, 2024

A processor architecture that enables dataflow execution for workloads with predictable control flow and regular data access patterns, such as tensor algebra and dense neural networks. The architecture includes a decoder to interpret dataflow instructions, a setup circuit to configure dataflow circuitry, and execution circuitry to execute the dataflow operations. This approach eliminates the need for dynamic scheduling and dependency tracking, reducing energy consumption and improving performance for workloads that can be statically mapped to the hardware.
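
Because the schedule is fixed when the dataflow circuitry is configured, execution reduces to a plain loop. A minimal C sketch of the decode/setup/execute split; all types and the opcode encoding are assumptions:

```c
/* Statically scheduled execution: no scoreboard, reorder logic, or
 * dependency tracking, because the compiler fixed the order ahead
 * of time. */
#include <stddef.h>

typedef struct { int op; int src0, src1, dst; } df_instr_t;

typedef struct {
    const df_instr_t *program;  /* configured once at setup time */
    size_t length;
    float regs[32];
} df_context_t;

void df_run(df_context_t *ctx) {
    for (size_t i = 0; i < ctx->length; i++) {
        const df_instr_t *in = &ctx->program[i];
        if (in->op == 0)  /* 0: multiply-accumulate, the common tensor op */
            ctx->regs[in->dst] += ctx->regs[in->src0] * ctx->regs[in->src1];
    }
}
```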

16. High-performance computing: Transitioning from Instruction-Level Parallelism to heterogeneous hybrid architectures

Mingtao Zhang - EWA Publishing, 2024

This paper examines the shift from Instruction-Level Parallelism (ILP) to heterogeneous hybrid parallel computing in the pursuit of optimized processing performance. It reviews the constraints of ILP and how those shortcomings catalyzed the move toward the more adaptable and capable framework of heterogeneous hybrid computing, then surveys the benefits across diverse applications, notably deep learning, cloud computing, data centers, and mobile SoCs. The study also covers emerging architectures and innovations of this era, including many-core processors, FPGA-driven accelerators, and an assortment of software tools and libraries. While heterogeneous hybrid computing offers a promising horizon, it is not without challenges; the paper highlights restricted adaptability, steep development costs, software compatibility hurdles, the absence of a standardized programming model, and vendor reliance, aiming at a holistic snapshot of the field's present state and potential future.

17. Hybrid Processing Architecture Integrating Coarse-Grained Reconfigurable Arrays with Intelligent Network Interface Components for Dataflow Graph Execution

SAMBANOVA SYSTEMS INC, 2024

A system for accelerating deep learning applications through a hybrid processing architecture that combines Coarse-Grained Reconfigurable Arrays (CGRAs) with intelligent network interface components. The system enables efficient execution of dataflow-based processing graphs by partitioning the graph between CGRAs and network interface components, with the CGRAs performing compute-intensive tasks and the network interface components handling inter-node communication and synchronization. The system achieves high scalability and throughput by eliminating the need for centralized control and synchronization, enabling efficient parallelization of deep learning workloads across multiple nodes.
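
The partitioning step can be approximated as classifying each graph node by its dominant cost. A hedged C sketch; the node model and the classification rule are assumptions, not the system's actual partitioner:

```c
/* Place compute-heavy nodes on the CGRA and communication-heavy nodes
 * on the intelligent network interface. */
typedef enum { TARGET_CGRA, TARGET_NIC } target_t;

typedef struct {
    int flops;        /* arithmetic work inside the node */
    int comm_bytes;   /* data exchanged with other cluster nodes */
} graph_node_t;

target_t place_node(const graph_node_t *n) {
    return (n->comm_bytes > n->flops) ? TARGET_NIC : TARGET_CGRA;
}
```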

18. Array Interface with Integrated Interface Tiles and Direct Memory Access Circuits

XILINX INC, 2023

An array interface for a data processing array, comprising interface tiles with multiple direct memory access (DMA) circuits, enabling efficient data transfer between the array and external memory. The interface tiles are integrated into the array, which comprises a configurable mix of compute and memory tiles, allowing for flexible configuration and operation of the array.

[Patent drawing: US2023376437A1]

19. Runtime Virtualization System for Coarse-Grained Reconfigurable Array Processors with Unified Interface and Dynamic Resource Allocation

SAMBANOVA SYSTEMS INC, 2023

Runtime virtualization of reconfigurable architectures enables efficient sharing and isolation of coarse-grained reconfigurable array (CGRA) processors in cloud environments. The technology provides a unified interface to manage multiple CGRA devices, transfer resources, and storage resources, allowing for dynamic allocation and execution of application graphs across the reconfigurable devices. A common device driver coordinates execution across the devices, presenting a single virtual integrated circuit to user applications. The system supports multi-client and dynamic-workload scenarios, enabling efficient utilization of reconfigurable resources in cloud computing environments.
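
The single-virtual-device idea can be sketched as a driver-level table that maps client submissions onto free physical CGRAs. A hedged C illustration; the allocation policy and the handle scheme are assumptions:

```c
/* Clients see one device; the driver picks a free physical CGRA and
 * records the owner so concurrent workloads stay isolated. */
#define NUM_CGRAS 4

typedef struct { int busy; int client_id; } cgra_dev_t;

static cgra_dev_t pool[NUM_CGRAS];

int vdev_submit_graph(int client_id) {
    for (int i = 0; i < NUM_CGRAS; i++) {
        if (!pool[i].busy) {
            pool[i].busy = 1;
            pool[i].client_id = client_id;
            return i;       /* handle used later to release the device */
        }
    }
    return -1;              /* no capacity; the caller queues and retries */
}

void vdev_release(int handle) {
    if (handle >= 0 && handle < NUM_CGRAS)
        pool[handle].busy = 0;
}
```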

20. Neural Network Partitioning and Merging Method for Heterogeneous Computing Platforms

SAMSUNG ELECTRONICS CO LTD, 2023

A method for implementing neural networks on heterogeneous computing platforms, comprising partitioning a neural network model into sub-models according to a partitioning criterion, merging sub-models based on their characteristics, and deploying the merged sub-models. The method enables efficient execution of neural networks on diverse hardware architectures by dynamically adapting the network structure to the available processing resources.

[Patent drawing: US11803733B2]
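
The merge step can be sketched as collapsing adjacent sub-models that were placed on the same device. A minimal C illustration; the cost and device fields are assumptions made for the sketch:

```c
/* Collapse neighboring sub-models on the same device so each deployed
 * unit crosses a device boundary at most once; returns the new count. */
#include <stddef.h>

typedef struct { int device; size_t cost; } submodel_t;

size_t merge_submodels(submodel_t *s, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && s[out - 1].device == s[i].device)
            s[out - 1].cost += s[i].cost;   /* absorb into predecessor */
        else
            s[out++] = s[i];
    }
    return out;
}
```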

21. Simultaneous and Heterogenous Multithreading

22. On-Chip Heterogeneous AI Processor with Multi-Architecture Computation Units and Dynamic Task Distribution

23. Programmable Spatial Array Processor with Two-Dimensional Upper Triangular Processing Element Array for Matrix Decomposition

24. Integrated Device Platform with Heterogeneous Subsystems and User-Defined Data Path Configuration

25. Special Issue: 19th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'21)

Get Full Report

Access our comprehensive collection of 95 documents related to this technology