EADDP · TRL-7 · PROVEN PROTOTYPE

DistProc: General-Purpose Middleware for Event-Accurate Distributed Processing

A zero-copy, lock-free IPC bridge. Reduce synchronization overhead to the theoretical hardware minimum and scale state-heavy computations across commodity multi-core CPUs.


New to the terminology? Terms like RTL, ISS, tile, and all-reduce are field-specific; the Glossary explains each one in plain English.

14.8× · Near-linear speedup across 15 cores
0.41 µs · Max sync latency at 8 P-cores
524K · int32 elements verified in the 8-core cooperative GEMM
0.009% · IPC overhead as a fraction of total walltime
15K× · Sync call reduction vs a clock-driven approach

Architecture

Three Properties. One Principle.

Every design decision follows from one idea: synchronize only when the computation requires it — not because the clock says so.

01 / SYNCHRONIZATION

Synchronize at the Event

Processors advance and exchange state only at a synchronization event — when the computation actually has something to share. An 8-core 512×512 GEMM needs 43 events; a clock-driven equivalent would take 672,000. Data itself moves at memory speed over zero-copy shared memory.
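The reduction in synchronization calls follows directly from the two counts quoted above. The snippet below is only a worked check of that arithmetic; the function name `sync_reduction` is a hypothetical helper, not part of DistProc.

```python
def sync_reduction(clock_driven_syncs: int, event_driven_syncs: int) -> float:
    """Return how many times fewer sync calls the event-driven run makes."""
    if event_driven_syncs <= 0:
        raise ValueError("event_driven_syncs must be positive")
    return clock_driven_syncs / event_driven_syncs


# Counts quoted for the 8-core 512x512 GEMM.
ratio = sync_reduction(672_000, 43)
print(f"{ratio:,.0f}x fewer sync calls")  # prints "15,628x fewer sync calls"
```

The result, roughly 15.6 thousand, matches the "15K×" figure in the headline numbers.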

02 / CORRECTNESS

100% Deadlock Immune

You will never wake up to a hung 14-hour regression test. The DistProc algorithm structurally prohibits data overwrites and process runaways. Because it is entirely lock-free, no process ever blocks on a lock held by another, so lock-induced deadlock cannot occur, no matter how many partitions or cores are involved.

03 / IMPLEMENTATION

Tuned for Speed or for CPU

Two interchangeable backends, selected in the shared-memory header with no recompile: spin for the lowest latency, or futex to hand idle cores back to the OS (CPU usage drops from 28% to 2%).


Production Validation

Validated on Tier-1 Silicon Architectures

To prove the stability, latency, and scaling of the DistProc middleware, we subjected it to three distinct hardware-in-the-loop stress tests.

Swarm Computing & ML Compilers

From Synthetic Arrays to TVM BERT

A single 128×128 Gemmini core cannot compute a 512×512 matrix multiplication on its own because of its strict 256 KB scratchpad limit. DistProc orchestrates a swarm of independent SoCs that cooperatively compute the product. We extended this to execute a full 16-core BERT transformer workload, using real HuggingFace weights quantized by an industry-standard ML compiler (Apache TVM), verified bit-exact.
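As a rough illustration of cooperative tiling, the sketch below splits a square matrix product into output tiles and hands them to workers round-robin. It is not DistProc's partitioning scheme: the round-robin policy, the function names, and the sequential loop standing in for separate processes are all assumptions made for the example. Under those assumptions a 512×512 product with 128×128 tiles has 16 output tiles, so 8 workers own two each.

```python
from itertools import product


def tiles_per_worker(n: int, tile: int, workers: int) -> list[int]:
    """Count output tiles per worker under an assumed round-robin policy."""
    counts = [0] * workers
    for index in range((n // tile) ** 2):
        counts[index % workers] += 1
    return counts


def cooperative_gemm(a, b, tile: int, workers: int):
    """Multiply square matrices tile by tile, tracking which worker owns each tile.

    Workers are simulated sequentially here; in a real deployment each one
    would be a separate process exchanging tiles over shared memory.
    """
    n = len(a)
    if n % tile != 0:
        raise ValueError("matrix size must be a multiple of the tile size")
    blocks = n // tile
    c = [[0] * n for _ in range(n)]
    tiles_done = [0] * workers
    for index, (ti, tj) in enumerate(product(range(blocks), repeat=2)):
        worker = index % workers  # assumed round-robin ownership
        for tk in range(blocks):  # walk the shared K dimension one tile at a time
            for i in range(ti * tile, (ti + 1) * tile):
                for j in range(tj * tile, (tj + 1) * tile):
                    c[i][j] += sum(
                        a[i][k] * b[k][j]
                        for k in range(tk * tile, (tk + 1) * tile)
                    )
        tiles_done[worker] += 1
    return c, tiles_done


print(tiles_per_worker(512, 128, 8))  # two output tiles per worker
```

Each inner step touches only one tile of each operand at a time, which is what lets a core with a small scratchpad take part in a product that is too large for it alone.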

Heterogeneous Co-Simulation (CVA6)

Proof of System-Level Synchronization

A 64-node distributed-AI collective. Node 0 is a cycle-accurate CVA6 RTL model, checked at every instruction retirement against the golden reference simulator (Spike ISS) running in a separate process. Nodes 1-63 are fast C++ functional models. DistProc synchronizes the heterogeneous cluster through a butterfly all-reduce over shared memory.
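For readers unfamiliar with the pattern, a butterfly all-reduce can be sketched as follows. This is a single-process simulation of the generic algorithm (also called recursive doubling), not DistProc code; the function name and the use of a plain list in place of shared memory are assumptions for illustration. In each round, node i combines its value with that of node i XOR stride, and the stride doubles until every node holds the full sum.

```python
def butterfly_all_reduce(values):
    """Simulate a sum all-reduce over a power-of-two number of nodes.

    Returns the per-node results and the number of exchange rounds.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    current = list(values)
    rounds = 0
    stride = 1
    while stride < n:
        # Every node reads its partner's value from the previous round,
        # so the new list is built from a snapshot of the old one.
        current = [current[i] + current[i ^ stride] for i in range(n)]
        stride <<= 1
        rounds += 1
    return current, rounds


result, rounds = butterfly_all_reduce(list(range(64)))
print(rounds, result[0])  # prints "6 2016"
```

With 64 nodes the exchange completes in log2(64) = 6 rounds, and each round is a natural synchronization point between partners.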

Production IP Integration (NVDLA)

Proof of Enterprise Stability

DistProc is wired directly into the official NVIDIA Deep Learning Accelerator (NVDLA) RTL. Python acts as the orchestrator, sending real configuration traces directly to the NVDLA's CSB and AXI memory pins over the DistProc bridge. This demonstrates that the middleware is robust enough to drive millions of logic gates in Tier-1 enterprise silicon without deadlocking.


Performance

Scaling Efficiency — Measured on i9-14900K

The gentpcsm microbenchmark runs a high compute load distributed across E-cores (10,000 cycles, 7.2M µs total workload, i.e. 7.2 s). The chart shows both speedup and absolute task completion time on the same axis.

[Chart: speedup and task completion time vs. number of cores]
At 15 cores: 14.8× speedup · 0.485s vs 7.20s single-core · IPC overhead < 0.44µs/cycle
The overhead fraction is load-dependent: 0.009% was measured on the gemm512 workload (808 s run, 43 sync events). With lighter compute loads or higher sync rates, the fraction rises accordingly.
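The 15-core figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper names are hypothetical, and it assumes that the 10,000 benchmark cycles are the same "cycles" that the per-cycle overhead bound refers to.

```python
def speedup(single_core_s: float, parallel_s: float) -> float:
    """Ratio of single-core runtime to parallel runtime."""
    return single_core_s / parallel_s


def efficiency(speedup_value: float, cores: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup_value / cores


s = speedup(7.20, 0.485)
e = efficiency(s, 15)

# Upper bound on total IPC overhead, assuming one 0.44 us overhead per cycle.
overhead_bound_s = 10_000 * 0.44e-6
overhead_fraction = overhead_bound_s / 0.485

print(f"speedup {s:.2f}x, efficiency {e:.1%}")
print(f"overhead bound {overhead_bound_s * 1e3:.1f} ms ({overhead_fraction:.2%} of runtime)")
```

On these inputs the parallel efficiency comes out at about 99%, and the overhead bound stays under 1% of the 0.485 s runtime for this benchmark.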

Shevatech AB

A specialist engineering consultancy based in Saltsjöbaden, Sweden. DistProc is available as a software IP licence, as a consulting engagement, or as a full development project.

📧 info@shevatech.com
🌐 www.shevatech.com
🔗 Yehoshua Shoshan on LinkedIn
📍 Saltsjöbaden, Sweden

What We Offer

Software IP Licence

DistProc middleware source code and integration support for your scientific computing or simulation pipeline.

Consulting

EDA infrastructure, AI accelerator co-simulation, NPU bare-metal drivers, RTL verification.

Development Project

Custom integration of DistProc into your specific hardware or simulation environment.


Built on Open Standards & Open Source IP

DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects: