Heterogeneous computing architectures face significant performance barriers when managing data movement across dissimilar processing elements. Traditional systems stage data through host memory, requiring multiple copies that introduce latencies of 150-300 microseconds per transfer and consume substantial memory bandwidth, often exceeding 20 GB/s during intensive workloads. The challenge intensifies in multi-processor environments, where each additional processing element multiplies the routing-table state and memory-coherence traffic the system must manage.

The fundamental engineering tradeoff in heterogeneous computing architectures lies in balancing the flexibility of diverse processing elements against the overhead of inter-processor communication and memory management.

This page brings together solutions from recent research—including virtual memory address mapping for direct data transfers between processors, protocol-adaptive external networks for reconfigurable processing units, dynamic data access selection mechanisms, and unified computing systems with direct node-cluster links. These and other approaches demonstrate practical implementations that minimize data movement costs while preserving the computational advantages of specialized processing elements.

1. Unified Computing System with Direct Node-Cluster Links and Localized Routing Tables

SHANGHAI BIREN TECHNOLOGY CO LTD, 2025

A unified computing system with direct links between computing nodes and clusters. Each computing processor maintains a local routing table to determine the next hop for data packets, eliminating the need for centralized routing and reducing latency. Direct node-to-cluster communication enables collaborative processing and optimized data transfer.
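
A minimal sketch of the localized-routing idea described above, assuming a toy topology (the node and cluster names, and the table contents, are invented for illustration): each processor consults only its own next-hop table, so packets are forwarded hop by hop without any centralized routing service.

```python
# Hypothetical illustration of per-node next-hop routing (not the patented design).
# Each node keeps a local table: destination -> next hop.

ROUTING_TABLES = {
    "node_a": {"node_b": "node_b", "cluster_1": "node_b"},
    "node_b": {"node_a": "node_a", "cluster_1": "cluster_1"},
    "cluster_1": {"node_a": "node_b", "node_b": "node_b"},
}

def forward(source: str, destination: str) -> list[str]:
    """Follow local next-hop tables from source to destination."""
    path, current = [source], source
    while current != destination:
        next_hop = ROUTING_TABLES[current].get(destination)
        if next_hop is None:
            raise ValueError(f"{current} has no route to {destination}")
        path.append(next_hop)
        current = next_hop
    return path

print(forward("node_a", "cluster_1"))  # ['node_a', 'node_b', 'cluster_1']
```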

2. Processor Architecture with Centralized Instruction Broadcast and Data Distribution for Vector Processing

BEIJING YOUZHUJU NETWORK TECHNOLOGY CO LTD, 2025

A processor architecture for efficient vector processing, comprising multiple processor cores with separate instruction and data caches, and a centralized distributor that broadcasts instructions and distributes data to the cores. The distributor enables efficient data processing by eliminating cache coherence overhead and allowing for simultaneous data access and processing across multiple cores.

3. Reconfigurable Processor Direct Data Transfer via Virtual Memory Address Mapping

SAMBANOVA SYSTEMS INC, 2025

Enabling direct data transfer between reconfigurable processors in distributed systems without involving the host CPUs. The technique bypasses host memory and CPUs for inter-processor communication by using virtual memory addressing: one processor implements a virtual memory function that maps virtual addresses to physical addresses, and the other processor uses those virtual addresses to initiate direct memory access operations against the first processor's memory or reconfigurable-processor memory. This avoids the latency and bandwidth cost of staging data through host memory and CPUs.
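
A highly simplified model of the address-mapping idea, with Python dictionaries and bytearrays standing in for page tables and device memories (the window base address and page size are illustrative assumptions): a write through the virtual window lands directly in the peer device's memory, never touching a host buffer.

```python
# Hypothetical sketch: virtual-to-physical mapping lets one device DMA directly
# into another device's memory, bypassing the host buffer (names illustrative).

PAGE = 4096

class DeviceMemory:
    def __init__(self, size: int):
        self.data = bytearray(size)

# Processor B publishes a virtual window onto its physical memory.
dev_b = DeviceMemory(8 * PAGE)
virt_map = {0x1000_0000 + i * PAGE: (dev_b, i * PAGE) for i in range(8)}

def dma_write(virt_addr: int, payload: bytes) -> None:
    """Processor A writes through the virtual address; no host copy occurs."""
    base = virt_addr & ~(PAGE - 1)
    device, phys = virt_map[base]
    offset = phys + (virt_addr - base)
    device.data[offset : offset + len(payload)] = payload

dma_write(0x1000_0000 + 16, b"tensor-tile")
print(bytes(dev_b.data[16:27]))  # b'tensor-tile'
```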

4. Dynamic Data Access Selection Mechanism for Heterogeneous Systems with Multi-Method Transfer Options

SAMBANOVA SYSTEMS INC, 2025

Selecting the most efficient data access method between processors in a heterogeneous system based on latency and bandwidth requirements. The method dynamically chooses among three options: 1) two memory-to-memory transfers through a common buffer on the CPU, 2) direct memory access (DMA) without using the CPU, and 3) memory extension, where the CPU maps virtual addresses to physical addresses and lets the processors directly access each other's memory. This hides the underlying details of data passing from the application and selects the optimal method for each stage of the device pipeline.

Patent drawing: US12229057B2
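
A sketch of what such a selection policy could look like; the three options mirror the list above, but the thresholds and decision rules are invented for illustration and are not taken from the patent.

```python
# Hypothetical policy sketch for choosing a transfer path (thresholds invented):
# the application only calls the selector; the runtime hides the mechanism.

from enum import Enum, auto

class Method(Enum):
    HOST_BUFFER = auto()       # two memory-to-memory copies via a CPU buffer
    P2P_DMA = auto()           # direct DMA, no CPU involvement
    MEMORY_EXTENSION = auto()  # CPU maps addresses; devices access each other

def select_method(size_bytes: int, latency_budget_us: float,
                  p2p_available: bool) -> Method:
    if p2p_available and size_bytes >= 1 << 20:
        return Method.P2P_DMA            # bulk transfers favor bandwidth
    if latency_budget_us < 50:
        return Method.MEMORY_EXTENSION   # fine-grained, latency-sensitive access
    return Method.HOST_BUFFER            # portable fallback

print(select_method(4 << 20, 500.0, p2p_available=True))   # Method.P2P_DMA
print(select_method(256, 10.0, p2p_available=False))       # Method.MEMORY_EXTENSION
```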

5. Memory Access Method in Heterogeneous Processing System with Inter-Processor Memory Mapping via Switch and Bus Circuitry

SAMBANOVA SYSTEMS INC, 2025

Method and apparatus for accessing data in a heterogeneous processing system with multiple processors using a memory extension operation. The system includes a host processor, a first processor coupled to a first memory, a second processor coupled to a second memory, and switch and bus circuitry that communicatively couples the host processor, the first processor, and the second processor. The host processor maps virtual addresses of the second memory to physical addresses of the switch and bus circuitry and configures the first processor to directly access the second memory using the mapped physical addresses according to the memory extension operation.

6. Dataflow Processor Pipeline Execution Method with Latency-Based Buffer Removal in Reconfigurable Architectures

SAMBANOVA SYSTEMS INC, 2025

A method and system for optimizing pipeline execution in dataflow processors with reconfigurable architectures. The method involves generating a pipeline of computational nodes derived from a dataflow graph, interleaved with buffers on the array of reconfigurable units, and removing a buffer from the pipeline based on a comparison of the latencies of the computational nodes. The system comprises a compiler configured to generate the pipeline and remove buffers based on that latency comparison.

Patent drawing: US12189570B2
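
A toy compiler pass illustrating the latency-comparison idea, under the assumption that stage latencies are known at compile time; the tolerance value and the alternating node/buffer pipeline layout are illustrative, not the patent's exact criterion.

```python
# Hypothetical compiler-pass sketch: drop an inter-stage buffer when the
# producer and consumer latencies are close enough that the buffer adds no slack.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    latency: int  # cycles, assumed known at compile time

def remove_balanced_buffers(pipeline, tolerance=2):
    """pipeline alternates Node, 'BUF', Node, 'BUF', ...; returns pruned list."""
    out = [pipeline[0]]
    for i in range(1, len(pipeline) - 1, 2):
        buf, consumer = pipeline[i], pipeline[i + 1]
        producer = out[-1]
        if abs(producer.latency - consumer.latency) > tolerance:
            out.append(buf)  # keep buffer to absorb the latency mismatch
        out.append(consumer)
    return out

stages = [Node("conv", 10), "BUF", Node("bias", 9), "BUF", Node("pool", 30)]
print([s.name if isinstance(s, Node) else s for s in remove_balanced_buffers(stages)])
# ['conv', 'bias', 'BUF', 'pool']
```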

7. Inter-Die Communication System with Protocol-Adaptive External Network for Reconfigurable Processing Units

SAMBANOVA SYSTEMS INC, 2024

A system and method for inter-die communication between reconfigurable processing units (RPUs) with different internal networks and protocols. The system enables peer-to-peer transactions between RPUs through a common external network, using a protocol that adapts to the internal network protocols of each RPU. The method involves routing requests through internal networks to reach the external network, where they are processed and forwarded to the target RPU. The system includes configurable units, internal networks, and external interfaces that enable efficient and flexible communication between RPUs.

Patent drawing: US12143298B2

8. Data Transfer Mechanism Utilizing Dual Memory-to-Memory Operations in Heterogeneous Processor Systems

SAMBANOVA SYSTEMS INC, 2024

Method and apparatus for transferring data between accessible memories of multiple processors in a heterogeneous processing system using two memory-to-memory transfer operations. The system includes a host processor, a first processor, a second processor, and multiple data transfer resources. The method involves allocating buffer space in the host memory, programming two data transfer resources to transfer data between the first memory and the host memory, and between the host memory and the second memory, and executing the data transfer operations.
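
The two-hop pattern is easy to model in a few lines; here bytearrays stand in for the device and host memories, and each mem_to_mem call plays the role of one programmed data transfer resource. This is a sketch of the data path, not the patented apparatus.

```python
# Hypothetical sketch of the two-hop path: device-to-host, then host-to-device.
# Arrays stand in for device memories; the staging buffer lives in host memory.

def mem_to_mem(dst, dst_off, src, src_off, n):
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]

first_mem = bytearray(b"payload-from-first-processor")
second_mem = bytearray(64)
host_buffer = bytearray(32)  # buffer space allocated in host memory

n = len(first_mem)
mem_to_mem(host_buffer, 0, first_mem, 0, n)   # transfer resource 1
mem_to_mem(second_mem, 0, host_buffer, 0, n)  # transfer resource 2
print(bytes(second_mem[:n]))  # b'payload-from-first-processor'
```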

9. Memory Extension Operation for Direct Access in Heterogeneous Multi-Processor Systems

SAMBANOVA SYSTEMS INC, 2024

Method and apparatus for accessing data in a heterogeneous processing system with multiple processors using a memory extension operation. The system includes a host processor, a first processor coupled to a first memory, a second processor coupled to a second memory, and switch and bus circuitry that communicatively couples the host processor, the first processor, and the second processor. The host processor maps virtual addresses of the second memory to physical addresses of the switch and bus circuitry and configures the first processor to directly access the second memory using the mapped physical addresses according to the memory extension operation.

Patent drawing: US2024248853A1

10. Instruction Processing Apparatus with Trigger-Based Execution and Shared Buffer Management for Multiple Input Channels

ARM LTD, 2024

Apparatus for processing instructions in a triggered instruction architecture, comprising multiple processing elements and input channels, where each processing element executes instructions based on trigger conditions and can access multiple input channels through a shared buffer management system. The system allocates input data to buffers based on tag values, enabling efficient triggering and computation operations.
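
A rough software model of tag-based triggering, with invented names: values arrive tagged into shared buffers, and an operation fires only when every tag its trigger condition requires has data queued.

```python
# Hypothetical sketch of tag-matched triggering: input data lands in shared
# buffers keyed by tag, and a processing element fires once its tags are ready.

from collections import defaultdict

buffers = defaultdict(list)  # tag -> queued values (shared buffer pool)

def deliver(tag, value):
    buffers[tag].append(value)

def try_fire(required_tags, op):
    """Execute op when every required tag has buffered data (the trigger)."""
    if all(buffers[t] for t in required_tags):
        args = [buffers[t].pop(0) for t in required_tags]
        return op(*args)
    return None  # trigger condition not yet satisfied

deliver("lhs", 6)
print(try_fire(("lhs", "rhs"), lambda a, b: a * b))  # None: 'rhs' missing
deliver("rhs", 7)
print(try_fire(("lhs", "rhs"), lambda a, b: a * b))  # 42
```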

11. Decentralized Chip-to-Chip Interface with Network-on-Chip Bridge for Packetized Memory-Mapped Traffic

XILINX INC, 2024

A decentralized chip-to-chip interface architecture for transporting memory-mapped traffic between heterogeneous integrated circuit devices in a packetized, scalable, and configurable manner. The architecture enables virtualized communication between devices through a network-on-chip (NoC) interface and a NoC inter-chip bridge (NICB) that packetizes and routes memory-mapped traffic between devices over chip-to-chip interconnections.

Patent drawing: US12019576B2

12. Processor System for Asynchronous Parallel Execution of Grouped Matrix Multiply-Accumulate Operations Across Streaming Multiprocessors

NVIDIA CORP, 2024

A processor and system that enables efficient parallel processing of tensor operations by grouping multiple matrix multiply-accumulate (MMA) operations together and executing them asynchronously across multiple streaming multiprocessors (SMs). The processor receives a first instruction to group multiple MMA operations and a second instruction to execute the grouped operations, which are then performed by multiple accelerators in the SMs. This approach enables larger tensor operations to be processed in parallel, overcoming the limitations of traditional single-threaded MMA execution.

13. Research and Development of Algorithms for Dispatching Tasks in Distributed Computing Systems

Si Thu Thant Sin, Evgeni M. Portnov, A. M. Bain - IEEE, 2024

This article examines heterogeneous computing systems that employ multi-core processors (CPUs) and graphics processing units (GPUs) concurrently to handle resource-intensive tasks demanding substantial computing power. Such systems primarily serve to allocate resources judiciously among users and computational processes, and the reliability of the work's results rests on careful use of mathematical tools, findings from experimental studies on heterogeneous SoCs, and practical validation. By pairing the parallel processing capabilities of GPUs with the general-purpose strengths of CPUs, heterogeneous systems optimize performance and throughput to meet the demands of modern computing applications.

14. Execution System Utilizing Auto-Discovery Module for Application Allocation on Heterogeneous Reconfigurable Processor Pool

SAMBANOVA SYSTEMS INC, 2024

Executing an application on a pool of reconfigurable processors that includes first and second pluralities of reconfigurable processors with different architectures, where an auto-discovery module determines whether the application is executed on the first or second processors and a runtime processor allocates and configures the processors accordingly.

15. Processor Architecture with Dataflow Execution for Predictable Control Flow and Regular Data Access Patterns

ADVANCED MICRO DEVICES INC, 2024

A processor architecture that enables dataflow execution for workloads with predictable control flow and regular data access patterns, such as tensor algebra and dense neural networks. The architecture includes a decoder to interpret dataflow instructions, a setup circuit to configure dataflow circuitry, and execution circuitry to execute the dataflow operations. This approach eliminates the need for dynamic scheduling and dependency tracking, reducing energy consumption and improving performance for workloads that can be statically mapped to the hardware.

16. High-performance computing: Transitioning from Instruction-Level Parallelism to heterogeneous hybrid architectures

Mingtao Zhang - EWA Publishing, 2024

This paper examines the shift from Instruction-Level Parallelism (ILP) to heterogeneous hybrid parallel computing in the quest for optimized processing performance. It highlights the constraints of ILP and how those shortcomings have catalyzed a move toward the more adaptable and proficient framework of heterogeneous hybrid computing, exploring the advantages across deep learning, cloud computing, data centers, and mobile SoCs. The study also surveys emerging architectures and innovations of this era, including many-core processors, FPGA-driven accelerators, and an assortment of software tools and libraries. While heterogeneous hybrid computing offers a promising horizon, it is not without challenges; the paper raises issues such as restricted adaptability, steep development costs, software compatibility hurdles, the absence of a standardized programming model, and vendor reliance, aiming to present a holistic snapshot of the present state and potential future of the field.

17. Hybrid Processing Architecture Integrating Coarse-Grained Reconfigurable Arrays with Intelligent Network Interface Components for Dataflow Graph Execution

SAMBANOVA SYSTEMS INC, 2024

A system for accelerating deep learning applications through a hybrid processing architecture that combines Coarse-Grained Reconfigurable Arrays (CGRAs) with intelligent network interface components. The system enables efficient execution of dataflow-based processing graphs by partitioning the graph between CGRAs and network interface components, with the CGRAs performing compute-intensive tasks and the network interface components handling inter-node communication and synchronization. The system achieves high scalability and throughput by eliminating the need for centralized control and synchronization, enabling efficient parallelization of deep learning workloads across multiple nodes.

18. Array Interface with Integrated Interface Tiles and Direct Memory Access Circuits

XILINX INC, 2023

An array interface for a data processing array, comprising interface tiles with multiple direct memory access (DMA) circuits, enabling efficient data transfer between the array and external memory. The interface tiles are integrated into the array, which comprises a configurable mix of compute and memory tiles, allowing for flexible configuration and operation of the array.

Patent drawing: US2023376437A1

19. Runtime Virtualization System for Coarse-Grained Reconfigurable Array Processors with Unified Interface and Dynamic Resource Allocation

SAMBANOVA SYSTEMS INC, 2023

Runtime virtualization of reconfigurable architectures enables efficient sharing and isolation of coarse-grained reconfigurable array (CGRA) processors in cloud environments. The technology provides a unified interface to manage multiple CGRA devices, transfer resources, and storage resources, allowing for dynamic allocation and execution of application graphs across the reconfigurable devices. A common device driver coordinates execution across the devices, presenting a single virtual integrated circuit to user applications. The system supports multi-client and dynamic-workload scenarios, enabling efficient utilization of reconfigurable resources in cloud computing environments.

20. Neural Network Partitioning and Merging Method for Heterogeneous Computing Platforms

SAMSUNG ELECTRONICS CO LTD, 2023

A method for implementing neural networks on heterogeneous computing platforms, comprising partitioning a neural network model into sub-models based on a standard, merging sub-models based on characteristics, and deploying the merged sub-models. The method enables efficient execution of neural networks on diverse hardware architectures by dynamically adapting the network structure to the available processing resources.

Patent drawing: US11803733B2
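
A minimal sketch of the partition-then-merge flow, assuming the partitioning "standard" is each layer's preferred device; that choice, and the layer names, are illustrative, and the patent's partitioning and merging criteria are more general.

```python
# Hypothetical sketch: split a layer sequence into sub-models by preferred
# device, then merge adjacent sub-models that share a device before deploying.

layers = [("conv1", "npu"), ("conv2", "npu"), ("nms", "cpu"),
          ("fc1", "gpu"), ("fc2", "gpu")]

def partition_and_merge(layers):
    sub_models = []
    for name, device in layers:             # partition: one sub-model per layer
        if sub_models and sub_models[-1][0] == device:
            sub_models[-1][1].append(name)  # merge: same target device
        else:
            sub_models.append((device, [name]))
    return sub_models

for device, names in partition_and_merge(layers):
    print(f"deploy {names} -> {device}")
# deploy ['conv1', 'conv2'] -> npu
# deploy ['nms'] -> cpu
# deploy ['fc1', 'fc2'] -> gpu
```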

21. Simultaneous and Heterogenous Multithreading

Kuan-Chieh Hsu, Hung‐Wei Tseng - ACM, 2023

The landscape of modern computers is undoubtedly heterogeneous, as all computing platforms integrate multiple types of processing units and hardware accelerators. However, the entrenched programming models focus on using only the most efficient processing units for each code region, underutilizing the processing power within heterogeneous computers.

22. On-Chip Heterogeneous AI Processor with Multi-Architecture Computation Units and Dynamic Task Distribution

SHANGHAI DENGLIN TECHNOLOGIES CO LTD, 2023

An on-chip heterogeneous AI processor that integrates multiple computation units with different architectures, each with its own task queue, to efficiently process neural network computations. The processor includes a controller that partitions the computation graph into subtasks and distributes them across the units, a shared storage unit, and an access interface. The units can operate in independent, cooperative, or interactive modes, and the processor supports operator fusion and inter-layer fusion to optimize performance.

Patent drawing: US11789895B2

23. Programmable Spatial Array Processor with Two-Dimensional Upper Triangular Processing Element Array for Matrix Decomposition

INTEL CORP, 2023

A programmable spatial array processor that can rapidly perform different types of matrix decomposition, including LU, QR, and Cholesky decompositions, by utilizing a two-dimensional upper triangular processing element array that executes under programmable instructions. The processor's architecture enables sequential instruction propagation through the array, eliminating the need for separate instruction memories in each processing element. This design enables the processor to switch between different matrix decomposition algorithms by simply changing the instructions provided to the array.

Patent drawing: US2023297538A1

24. Integrated Device Platform with Heterogeneous Subsystems and User-Defined Data Path Configuration

XILINX INC, 2023

An integrated programmable device platform that integrates heterogeneous subsystems, including programmable logic, processor, network-on-chip, and data processing engines, to provide a flexible and scalable architecture for implementing user applications. The platform enables user-defined data paths between subsystems and provides a unified management interface for configuring and controlling the entire system.

25. Special Issue 19th international workshop on algorithms, models and tools for parallel computing on heterogeneous platforms (HeteroPar'21)

Rosa M. Badía - Wiley, 2023

Heterogeneity has emerged as one of the most profound and challenging characteristics of today's parallel environments. From the macro level, where networks of distributed computers composed of diverse node architectures are interconnected with potentially heterogeneous networks, to the micro level, where deeper memory hierarchies and various accelerator architectures are increasingly common, the impact of heterogeneity on all computing tasks is increasing rapidly. Traditional parallel algorithms, programming environments, and tools, designed for legacy homogeneous multiprocessors, achieve a small fraction of the efficiency and potential performance obtainable on current and future heterogeneous computing platforms. New ideas, innovative algorithms, and specialized programming environments and tools are needed to use these modern parallel and heterogeneous architectures efficiently. The International workshop on algorithms, models and tools for parallel computing on heterogeneous platforms (HeteroPar) is a forum for researchers working on algorithms, models, programming languages, and tools for such platforms.

26. Dataflow Graph Transformation for Coarse-Grained Reconfigurable Processor Optimization

SAMBANOVA SYSTEMS INC, 2023

Optimizing computing tasks for coarse-grained reconfigurable (CGR) processors by transforming dataflow graphs to reduce latency and increase throughput. The method involves analyzing intermediate representations of tensor-based algebraic expressions, detecting memory mapping operations, and relocating them to adjacent stages to enable efficient dataflow through the CGR grid.

Patent drawing: US2023273879A1

27. Reconfigurable Coarse-Grained Grid Computing Architecture with 2D Compute Unit Array for Matrix Multiplication

SAMBANOVA SYSTEMS INC, 2023

A system and method for matrix multiplication in a reconfigurable coarse-grained grid computing architecture. The system comprises a 2D grid of compute units, each assigned to a unique submatrix of the result matrix, and source memory units providing matrix data via packets. The method configures the compute units to produce their assigned submatrices and sends them to destination memory units, initiating data flow to produce the result matrix.
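
The submatrix assignment can be illustrated with a tiled matrix multiply in which each loop iteration plays the role of one compute unit producing its assigned tile of the result; NumPy arrays stand in for the packet-based data flow from the source memory units, and the tile size is an illustrative choice.

```python
# Hypothetical sketch of the tiling scheme: each (i, j) compute unit owns one
# submatrix of C and combines the matching row block of A with column block of B.

import numpy as np

def grid_matmul(A, B, tile=2):
    m, n = A.shape[0], B.shape[1]
    C = np.zeros((m, n))
    for i in range(0, m, tile):          # one compute unit per (i, j) tile
        for j in range(0, n, tile):
            C[i:i+tile, j:j+tile] = A[i:i+tile, :] @ B[:, j:j+tile]
    return C

A, B = np.arange(16.).reshape(4, 4), np.eye(4)
assert np.allclose(grid_matmul(A, B), A @ B)
print("all submatrix results assembled correctly")
```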

28. 3D Interconnected Multi-Core Processor Architecture with RISC-V Main Control, Micro Core Array, and Accelerator Layers

SHANDONG LINGNENG ELECTRONIC TECHNOLOGY CO LTD, 2023

RISC-V-based 3D interconnected multi-core processor architecture that integrates a main control layer, micro core array layer, and accelerator layer to achieve high-performance processing. The main control layer comprises multiple main cores that cooperate to control the micro core array layer and interact with the accelerator layer. The micro core array layer features a 3D router that enables fast data exchange between micro cores, while the accelerator layer is connected to the micro core array layer for efficient data transmission. The architecture enables efficient processing of complex instructions by converting them into simple instructions that can be executed by the micro cores.

29. Heterogeneous Computing Framework with Dynamic Task Allocation to Specialized Processing Units

BEIJING ZITIAO NETWORK TECHNOLOGY CO LTD, 2023

A processing method, device, equipment, and medium based on a heterogeneous computing framework, which enables efficient processing of data-intensive tasks by dynamically allocating tasks to specialized processing units such as GPUs, CPUs, and DSPs. The method involves creating a heterogeneous computing engine framework, segmenting input data into blocks, processing each block in parallel using multiple threads, and merging results from each processing unit to produce the final output.
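
A compact sketch of the segment/process/merge pipeline, using Python threads as stand-ins for GPU, CPU, and DSP units; the round-robin dispatch and the placeholder kernel are invented for illustration and are not the patent's allocation policy.

```python
# Hypothetical sketch of the split/process/merge flow, with thread workers
# standing in for GPU, CPU, and DSP units (the dispatch policy is invented).

from concurrent.futures import ThreadPoolExecutor

def process_block(block, unit):
    return [x * 2 for x in block]  # placeholder kernel, same on every unit

def run(data, block_size=4, units=("gpu", "cpu", "dsp")):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=len(units)) as pool:
        futures = [pool.submit(process_block, blk, units[i % len(units)])
                   for i, blk in enumerate(blocks)]
        merged = []
        for f in futures:          # merge results in block order
            merged.extend(f.result())
    return merged

print(run(list(range(10))))  # [0, 2, 4, ..., 18]
```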

30. System for Automatic Program Execution Placement on Heterogeneous Hardware Using Graph Neural Network and Reinforcement Learning

INTEL CORP, 2023

A system for automatic placement of computer program execution on heterogeneous hardware, using machine learning to predict optimal device placement based on program graph representations and dynamic execution environment information. The system employs a graph neural network (GNN) to generate a vector embedding of the program graph, and a reinforcement learning model to determine the optimal device placement based on the graph embedding and current execution environment. The GNN incorporates features such as control flow, data flow, call flow, and memory dependencies, while the reinforcement learning model adapts to changes in the execution environment.

31. Processor System with Dynamically Assignable Memories and Domain-Switching Capability

XILINX INC, 2023

A processor system with dynamically assignable memories that can be configured to operate in either a cache-coherent domain or an I/O domain, allowing users to dynamically allocate memory resources based on application requirements. The system includes a switch that routes data between the assignable memories and the cache-coherent and I/O paths based on the assigned domain, enabling flexible memory management and reassignment as needed.

32. Parallel Processor with Layered Hardware Modules for Mixed Arithmetic and Tensor Shape Manipulations

AVAGO TECHNOLOGIES INTERNATIONAL SALES PTE LTD, 2023

A parallel processor optimized for machine learning applications, comprising multiple layers of hardware-dedicated modules that address the challenges of mixed arithmetic operations and tensor shape manipulations. The processor is designed to efficiently handle a wide range of operations, including simple additions, matrix multiplications, convolutions, and complex functions, as well as tensor shape manipulation tasks such as extraction, compaction, and reshaping.

Patent drawing: EP4156111A1

33. Coarse-Grained Reconfigurable Array Processor with Daisy-Chain Configuration Distribution and Load Coordination

SAMBANOVA SYSTEMS INC, 2023

Coarse-grained reconfigurable array processor configuration enables efficient loading and unloading of configuration state and program state. The processor comprises programmable elements arranged in a grid or tile, with each element containing a configuration store. The configuration file is distributed across the array through a bus system, with each element receiving and processing sub-files in a daisy-chain fashion. The configuration load controller coordinates the daisy chain, ensuring that each element completes processing of its sub-file before the next element's sub-file is delivered.

34. Integrated Circuit Architecture with Grid-Arranged Data Processing and Memory Tiles Featuring Stream Switch Connectivity and Direct Memory Access

XILINX INC, 2022

A data processing engine (DPE) array integrated circuit (IC) architecture that integrates multiple DPE tiles and memory tiles. The DPE tiles are connected to a system-on-chip (SoC) interface and each includes a stream switch, core, and memory module. The memory tiles include a stream switch, direct memory access (DMA) engine, and random-access memory (RAM), with the DMA engine capable of accessing RAM within the same tile and other tiles. The tiles are arranged in a grid with stream switches connecting adjacent tiles, enabling data transfer between DPE tiles and memory tiles. The memory tiles also include event broadcast circuitry and memory-mapped switches for configuration and control.

Patent drawing: US11520717B1

35. An Agile Tile-based Platform for Adaptive Heterogeneous Many-Core Systems

Ahmed Kamaleldin, Diana Göhringer - IEEE, 2022

Computing heterogeneity is a crucial requirement for today's systems-on-chip. Current many-core computing architectures feature a scalable number of heterogeneous compute units supporting a wide range of application domains. However, supporting both heterogeneity and computing scalability brings significant design challenges related to on-chip communication between heterogeneous components and run-time management, leading to growing design time, development cost, and a lack of hardware modularity and reusability. This PhD work aims to design and develop a modular, adaptive hardware platform for realizing different types and taxonomies of heterogeneous many-core systems targeting FPGAs while reusing the same hardware components. The proposed platform is based on a modular, scalable tile-based architecture supporting heterogeneous instruction set architectures (ISAs), seamless integration of custom hardware accelerators, and several memory hierarchies. The paper presents the proposed tile-based platform, preliminary results, and an evaluation targeting FPGAs.

36. System for Dynamic Hot-Plug Management of Reconfigurable Data Flow Resources with File System Abstraction

SAMBANOVA SYSTEMS INC, 2022

A system enables dynamic hot-plug removal and insertion of reconfigurable data flow resources, such as coarse-grained reconfigurable array (CGRA) processors, from a pool of resources without system downtime. The system includes a controller that detects resource removal or addition and generates hot-plug events, and a runtime processor that manages resource allocation and execution while maintaining system operation. The system provides unified access to the reconfigurable resources through a file system abstraction, decoupling the resource pool from changes to the system configuration.

Patent drawing: US11487694B1

37. Processor Architecture for Speculative Parallel Execution in Neural Network Workloads

NVIDIA CORP, 2022

A processor architecture that enables speculative parallel execution of instructions in neural network inference and training workloads, leveraging compiler analysis to identify safe speculative execution opportunities based on copy operations, conditional branches, and other program characteristics. The architecture includes a host processor and a parallel processing unit that execute instructions speculatively, with the host processor launching instructions on the parallel processing unit and monitoring for speculative execution termination conditions.

Patent drawing: US2022342673A1

38. Integrated Circuit Architecture with Cascade-Connected Data Processing Engines and Programmable Routing

XILINX INC, 2022

Integrated circuit (IC) architecture featuring a grid of data processing engines (DPEs) with cascade-connected cores. Each DPE comprises a core and memory module, with the core capable of executing instructions. The cores are connected through input and output cascade connections that enable direct data transfer between DPEs. The connections are programmable, allowing selective routing of data between cores. This architecture enables concurrent processing and data exchange between multiple DPEs, facilitating high-performance computing applications.

Patent drawing: US11443091B1

39. Graph Processing Assignment Method Utilizing Heterogeneous CPU-FPGA Data Streams with Power-Law Distribution Analysis

HUAZHONG UNIVERSITY OF SCIENCE AND TECHNOLOGY, 2022

Optimization method for graph processing based on heterogeneous FPGA data streams, particularly for CPU+FPGA heterogeneous structures. The method dynamically assigns graph data to CPU and FPGA processing modules based on power-law distribution properties of the graph, enabling the FPGA to process irregular data streams while balancing processing loads between the CPU and FPGA. The FPGA performs traversal on the graph data to acquire irregularity parameters, which are used to match data with preset access rules and assign tasks accordingly.
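
A toy version of degree-based assignment: in a power-law graph a few vertices have very high degree, and a threshold on a measured irregularity parameter splits the workload between the two processing modules. The threshold and which side handles which partition are illustrative assumptions here, not the paper's exact access rules.

```python
# Hypothetical sketch of degree-based assignment for a power-law graph.

from collections import Counter

edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4), (5, 6)]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

threshold = 3  # invented irregularity parameter
cpu_vertices  = {v for v, d in degree.items() if d >= threshold}
fpga_vertices = {v for v, d in degree.items() if d < threshold}
print("cpu:", sorted(cpu_vertices), "fpga:", sorted(fpga_vertices))
# cpu: [0] fpga: [1, 2, 3, 4, 5, 6]
```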

40. Coprocessor Microarchitecture with Execution Circuitry Bypass, Partial Processing Element Grid, and Vector Operation Fusion

APPLE INC, 2022

Microarchitectural optimizations in coprocessors to improve performance and efficiency. The optimizations include: 1) bypassing unused execution circuitry for coprocessor instructions to prevent unnecessary evaluation, 2) implementing a partial grid of processing elements for coprocessors that can operate with fewer elements than a full grid, and 3) fusing vector mode operations to improve efficiency when vector operations are performed.

Patent drawing: US11429555B2

41. Parallel Computing Architecture with Integrated Coprocessor/Reducer Cores in 2D Core Layout

GOLDMAN SACHS & CO LLC, 2022

A parallel computing architecture that accelerates machine learning and AI applications by integrating multiple processing cores with specialized coprocessor/reducer cores. The architecture enables efficient distributed operations and data reduction through a 2D layout of cores, where each core is connected to a column of coprocessor/reducer cores. This design enables high-bandwidth data exchange and supports complex computations with large fan-in and fan-out, making it suitable for simulating brain-like neural networks.

Patent drawing: US2022269637A1

42. Reconfigurable Architecture with Configurable Unit Array and Pattern Memory Unit for Time-Multiplexed Program Execution

SAMBANOVA SYSTEMS INC, 2022

Reconfigurable architecture for time-multiplexed execution of programs, comprising an array of configurable units with configuration stores and a pattern memory unit (PMU) for on-chip memory distribution. The PMU contains scratchpad memory and a reconfigurable scalar datapath for address calculation, while the core computation is performed in the configurable units. The architecture enables efficient execution of programs through time-multiplexing, where general and reconfigurable hardware operations are interleaved.

Patent drawing: US2022269534A1

43. RISC-V Processor with Integrated FPGA for User-Defined Instruction Set Execution

ZARAM TECHNOLOGY CO LTD, 2022

A RISC-V processor-based computing device that integrates a field-programmable gate array (FPGA) to support user-defined instruction sets. The device combines a RISC-V processor core with an FPGA unit, enabling the execution of user-defined instructions alongside standard RISC-V instructions. The FPGA unit performs computations and external I/O control for user-defined instructions, while the RISC-V processor core handles standard instructions. The device provides a flexible architecture that allows users to define and implement custom instruction sets without requiring hardware modifications or separate ASIC development.

44. Vector Index Registers for Storing Multiple Addresses in Parallel Conditional and Vector Operations

MICRON TECHNOLOGY INC, 2022

Vector index registers for vector processors that store multiple addresses for accessing multiple positions in vectors, enabling efficient conditional operations and vector operations on operand vectors. The registers, VIR_TRUE and VIR_FALSE, store positions of TRUE and FALSE results of comparisons, allowing for parallel processing of conditional operations and subsequent vector operations on the corresponding elements of the operand vectors.

Patent drawing: US11403256B2
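
A software model of the two index registers: a single compare pass fills VIR_TRUE and VIR_FALSE with element positions, and later vector operations touch only the indexed elements. Python lists stand in for the hardware registers; the example values are invented.

```python
# Hypothetical software model of the VIR_TRUE / VIR_FALSE registers: a compare
# fills both index registers, and later vector ops use them to touch only the
# matching elements.

def vector_compare(vec, predicate):
    vir_true = [i for i, x in enumerate(vec) if predicate(x)]
    vir_false = [i for i, x in enumerate(vec) if not predicate(x)]
    return vir_true, vir_false

def vector_op_indexed(dst, src, vir, op):
    for i in vir:                 # operate only on indexed positions
        dst[i] = op(src[i])

values = [3, -1, 7, -5, 2]
vir_true, vir_false = vector_compare(values, lambda x: x > 0)

result = values[:]
vector_op_indexed(result, values, vir_true, lambda x: x * 10)  # positives
vector_op_indexed(result, values, vir_false, lambda x: 0)      # negatives
print(result)  # [30, 0, 70, 0, 20]
```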

45. General-Purpose Graphics Processing Unit with SIMT Architecture and Sub-Group Thread Scheduling Mechanism

INTEL CORP, 2022

A general-purpose graphics processing unit (GPGPU) architecture that enables efficient execution of multiple thread groups with divergent thread paths. The GPGPU comprises multiple processing elements with single instruction, multiple thread (SIMT) architecture, each capable of hardware multithreading. A pipeline manager distributes thread groups as multiple sub-groups, and a scheduler schedules sub-warps of threads to the processing elements. The GPGPU also includes a logic unit that manages thread execution, retires completed sub-groups, and launches new sub-groups. This architecture enables concurrent execution of multiple thread groups with divergent thread paths, improving overall system performance.

46. Control Barrier Network with Configurable Bus for Dynamic Signal Routing in Distributed Execution Systems

SAMBANOVA SYSTEMS INC, 2022

A control barrier network for distributed execution systems enables efficient synchronization of processing units by dynamically routing control signals through a configurable bus. Each processing unit generates status signals indicating completion of execution fragments, which are consumed by control barrier logic units that produce barrier tokens and enable signals based on the status signals. The control barrier network can be configured to form static or dynamic signal routes, allowing for flexible synchronization of processing units in distributed execution systems.

Patent drawing: US11386038B2
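
A behavioral sketch of one barrier logic unit, with invented unit IDs: it accumulates status signals and emits a barrier token only when every expected execution fragment has reported, then re-arms for the next round. The configurable signal routing of the bus itself is not modeled here.

```python
# Hypothetical model of a barrier logic unit: it consumes per-unit status
# signals and emits a barrier token only after every expected fragment reports.

class BarrierLogic:
    def __init__(self, expected_units):
        self.expected = set(expected_units)
        self.reported = set()

    def on_status(self, unit_id):
        """A processing unit signals completion of its execution fragment."""
        self.reported.add(unit_id)
        if self.reported == self.expected:
            self.reported.clear()        # re-arm for the next round
            return "BARRIER_TOKEN"       # downstream enable signal
        return None

barrier = BarrierLogic({"pu0", "pu1", "pu2"})
print(barrier.on_status("pu0"))  # None
print(barrier.on_status("pu1"))  # None
print(barrier.on_status("pu2"))  # BARRIER_TOKEN
```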

47. Integrated Circuit Architecture with Data Processing Engine Array and Configurable Memory Tiles

XILINX INC, 2022

An integrated circuit (IC) architecture featuring a data processing engine (DPE) array with integrated memory tiles. The DPE array comprises multiple tiles, each containing a processing core, memory module, and stream switch. Memory tiles are equipped with direct memory access (DMA) engines and random-access memory (RAM), enabling inter-tile data transfer. The architecture allows for flexible memory configuration through composite memory formation and supports various packet processing modes, including in-order and out-of-order processing.

48. Vector Processor Architecture with Multi-Lane Vector Index Registers for Simultaneous Operand Access

MICRON TECHNOLOGY INC, 2022

Vector processor architecture that employs multi-lane solutions for enhanced performance in vector operations. The architecture includes vector index registers (VIRs) that store multiple addresses for accessing operand vectors, enabling simultaneous access to multiple vector elements. The VIRs can be used for both true and false results of conditional operations, allowing for efficient execution of conditional operations in vector processors.

49. Extending SYCL's Programming Paradigm with Tensor-based SIMD Abstractions

Wilson Feng, Shucai Yao, Kai Ting Wang - ACM, 2022

Heterogeneous computing has emerged as an important method for supporting more than one kind of processor or accelerator in a program. There is generally a tradeoff between source-code portability and device performance in heterogeneous programming, so new programming abstractions that help programmers reduce development effort while minimizing performance penalties are extremely valuable.

50. System for Compile-Time Tensor Memory Layout Analysis and Conflict Resolution in Heterogeneous Processor Environments

SAMBANOVA SYSTEMS INC, 2022

Compile-time determination of tensor memory layouts and conflict resolution for heterogeneous processors. The system analyzes a program's dataflow graph to determine required memory layouts for tensors, considering both producer and consumer operations. It detects conflicts between expected layouts and resolves them by modifying the graph to ensure compatibility, enabling efficient execution on CPUs, GPUs, FPGAs, and other accelerators.

51. Method for Partitioning Heterogeneous Integrated Circuits via Model-Based Data Flow Graph Conversion

52. Semiconductor Device with Parallel Arithmetic Unit for Concurrent Processed and Predetermined Data Calculations

53. Heterogeneous Programming for the Homogeneous Majority

54. Heterogeneous Computing Systems

55. Distributed Heterogeneous Parallel Computing Framework Based on Component Flow
