A zero-copy, lock-free IPC bridge. Reduce synchronization overhead to the theoretical hardware minimum and scale state-heavy computations across commodity multi-core CPUs.
New to the terminology? Words like RTL, ISS, tile, and all-reduce are field-specific; the Glossary explains every one in plain English.
Every design decision follows from one idea: synchronize only when the computation requires it — not because the clock says so.
Processors advance and exchange state only at a synchronization event — when the computation actually has something to share. An 8-core 512×512 GEMM needs 43 events; a clock-driven equivalent would take 672,000. Data itself moves at memory speed over zero-copy shared memory.
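To put the two event counts above in proportion, here is a quick calculation that uses only the figures quoted in the text (the variable names are ours):

```python
# Figures quoted above for the 8-core 512x512 GEMM.
event_driven_syncs = 43
clock_driven_syncs = 672_000

# How many clock-driven synchronizations correspond to one event-driven one.
reduction_factor = clock_driven_syncs / event_driven_syncs
print(f"about {reduction_factor:,.0f}x fewer synchronization points")
```

The ratio is roughly four orders of magnitude. It counts synchronization points only and says nothing about wall-clock time.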
You will never wake up to a hung 14-hour regression test. The DistProc algorithm structurally prohibits data overwrites and runaway processes. Because it is entirely lock-free, there is no lock for a process to wait on, so it cannot deadlock, no matter how many partitions or cores are involved.
Two interchangeable backends, selected in the shared-memory header with no recompile: spin for the lowest latency, or futex to hand idle cores back to the OS (CPU usage drops from 28% to 2%). See the performance deep-dive.
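As a rough illustration of what "chosen in the shared-memory header" can look like, the sketch below stores a backend id in a small header that each process reads when it attaches. The layout, the magic value, and the names `HEADER`, `write_header`, and `read_backend` are all hypothetical; DistProc's actual header format is not described here.

```python
import struct

# Hypothetical layout, for illustration only: 4-byte magic, 16-bit version,
# 8-bit backend id. DistProc's real header format is not given in the text.
HEADER = struct.Struct("<4sHB")
MAGIC = b"DPRC"
BACKEND_IDS = {"spin": 0, "futex": 1}
BACKEND_NAMES = {v: k for k, v in BACKEND_IDS.items()}

def write_header(buf, backend):
    """Record the chosen wait strategy at the start of the shared region."""
    HEADER.pack_into(buf, 0, MAGIC, 1, BACKEND_IDS[backend])

def read_backend(buf):
    """Return the wait strategy a process should use when it attaches."""
    magic, _version, backend_id = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a recognised shared-memory header")
    return BACKEND_NAMES[backend_id]

region = bytearray(64)            # stand-in for a mapped shared-memory segment
write_header(region, "futex")
print(read_backend(region))       # futex
```

Because the choice lives in data rather than in a compile-time flag, switching backends in a scheme like this only requires rewriting one header field before the processes attach.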
To prove the stability, latency, and scaling of the DistProc middleware, we subjected it to three distinct hardware-in-the-loop stress tests.
A single 128×128 Gemmini core cannot compute a 512×512 matrix multiplication on its own because of its strict 256 KB scratchpad limit. DistProc orchestrates a swarm of independent SoCs that compute the product cooperatively. We extended this to a full 16-core BERT transformer workload, using real Hugging Face weights quantized with an industry-standard ML compiler (Apache TVM) and verified bit-exact.
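The decomposition behind the GEMM test can be sketched as a blocked matrix multiply, in which every output tile is an independent unit of work that reads only tile-sized slices of the inputs. This is the textbook technique, not DistProc's scheduler; the function names are ours, and the demo uses a small 8×8 case so that it runs instantly in pure Python.

```python
def matmul(a, b):
    """Reference (untiled) integer matrix product."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def tiled_matmul(a, b, tile):
    """Compute a @ b for square matrices, one tile x tile output block at a time."""
    n = len(a)
    assert n % tile == 0, "sketch assumes the size is a multiple of the tile"
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, tile):          # block row of the output
        for bj in range(0, n, tile):      # block column of the output
            for bk in range(0, n, tile):  # accumulate partial products along k
                for i in range(bi, bi + tile):
                    for j in range(bj, bj + tile):
                        acc = 0
                        for p in range(bk, bk + tile):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c

# Small integer demo: the tiled result must match the reference exactly.
n, tile = 8, 2
a = [[(3 * i + j) % 7 - 3 for j in range(n)] for i in range(n)]
b = [[(i - 2 * j) % 5 - 2 for j in range(n)] for i in range(n)]
assert tiled_matmul(a, b, tile) == matmul(a, b)

# For the sizes quoted in the text: a 512x512 output split into 128x128 tiles.
output_tiles_512 = (512 // 128) ** 2
print(output_tiles_512)  # 16
```

To see why one core is not enough, assume 8-bit elements (an assumption; the element width is not stated): a single 512×512 operand then occupies 512 · 512 = 262,144 bytes, which already fills a 256 KB scratchpad before the second operand or the result is stored. Integer arithmetic is also what makes a bit-exact comparison against a reference meaningful, as in the assertion above.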
A 64-node distributed-AI collective. Node 0 is a cycle-accurate CVA6 RTL model, verified at every instruction retirement against the golden reference simulator (Spike ISS) running in a separate process. Nodes 1-63 are fast C++ functional models. DistProc synchronizes the heterogeneous cluster over a shared-memory butterfly all-reduce.
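For readers new to the term, a butterfly all-reduce can be simulated in a few lines. The sketch below is the textbook recursive-doubling pattern run inside a single process, with a sum as the reduction; the reduction operator and the exact exchange schedule DistProc uses are not stated in the text, so treat both as assumptions.

```python
def butterfly_all_reduce(values):
    """Simulate a butterfly (recursive-doubling) sum all-reduce.

    values[r] is the local contribution of node r. Returns the per-node
    results and the number of exchange rounds.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two node count"
    current = list(values)
    rounds = 0
    stride = 1
    while stride < n:
        # Every node pairs with the node whose rank differs in exactly one bit.
        # Building a new list models all pairs exchanging simultaneously.
        current = [current[r] + current[r ^ stride] for r in range(n)]
        stride <<= 1
        rounds += 1
    return current, rounds

contributions = list(range(64))          # node r contributes the value r
results, rounds = butterfly_all_reduce(contributions)
assert rounds == 6                       # log2(64) exchange rounds
assert all(v == sum(contributions) for v in results)
```

With 64 nodes, every node holds the full sum after six pairwise exchanges, instead of funnelling 63 contributions through a single root.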
DistProc is wired directly into the official NVIDIA Deep Learning Accelerator (NVDLA) RTL. Python acts as the orchestrator, sending real configuration traces to the NVDLA's CSB and AXI memory pins over the DistProc bridge. This demonstrates that the middleware is robust enough to drive millions of logic gates in Tier-1 enterprise silicon without deadlocking.
The gentpcsm microbenchmark under a high compute load distributed across E-cores (10,000 cycles, 7.2M µs total workload). Both speedup and absolute task completion time are shown on the same axis.
A specialist engineering consultancy based in Saltsjöbaden, Sweden. DistProc is available as a software IP licence, as a consulting engagement, or as a full development project.
DistProc middleware source code and integration support for your scientific computing or simulation pipeline.
EDA infrastructure, AI accelerator co-simulation, NPU bare-metal drivers, RTL verification.
Custom integration of DistProc into your specific hardware or simulation environment.
DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects: