Shevatech AB

The Core

A Synchronization Event, Priced

gentpcsm v2 is the software-only DistProc IP: autonomous processes share one POSIX memory region under a single-writer / multiple-reader discipline, and meet at an event-driven barrier. No locks, no serialization, and with the default spin backend no kernel on the hot path. The cost of one synchronization is the cost of one atomic barrier operation, independent of data volume.

PUBLISH

One Atomic Store

Each core writes its own slot (single writer) and raises a flag with one release-store. There is exactly one writer per location, so coherence reduces to publication — no mutex, no compare-and-swap contention.
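The publish step can be sketched in C11 as below. This is a minimal illustration, not the gentpcsm source: the `slot_t` layout, the `PAYLOAD_BYTES` size, and the `slot_publish` name are all assumptions made for the example. It shows the one property the text relies on: plain writes to the payload, followed by a single release-store that makes them visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_BYTES 64 /* hypothetical payload size, for illustration only */

/* One slot per core; only the owning core ever writes to it. */
typedef struct {
    _Alignas(64) unsigned char payload[PAYLOAD_BYTES]; /* plain, non-atomic data */
    _Atomic uint64_t generation;                        /* the published flag */
} slot_t;

/* Write the payload, then raise the flag with one release-store.
 * The release ordering guarantees that a reader which observes `gen`
 * with an acquire-load also observes the payload written before it. */
static void slot_publish(slot_t *own, const void *data, size_t len, uint64_t gen)
{
    assert(len <= PAYLOAD_BYTES);
    memcpy(own->payload, data, len);
    atomic_store_explicit(&own->generation, gen, memory_order_release);
}
```

Because no other core writes to `own`, the store needs no compare-and-swap and cannot fail or retry.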

GATHER

Acquire-Load Each Peer

A core advances only when every peer's flag has reached the current generation — an acquire-load per peer. The barrier fires only when the computation has something to share, not on a clock edge.
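A sketch of the gather side, under the same caveat: `flag_t`, `all_peers_reached`, and `barrier_gather` are hypothetical names, and the bare spin loop stands in for whatever wait strategy the real backend uses. It shows one acquire-load per peer and a generation comparison rather than a boolean, so a peer that has already moved ahead never blocks a slower one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One flag per peer, each on its own cache line to avoid false sharing. */
typedef struct {
    _Alignas(64) _Atomic uint64_t generation;
} flag_t;

/* True once every peer has published generation `gen` or later.
 * Costs exactly one acquire-load per peer that has arrived. */
static bool all_peers_reached(flag_t *flags, size_t n, uint64_t gen)
{
    assert(flags != NULL);
    for (size_t i = 0; i < n; i++) {
        if (atomic_load_explicit(&flags[i].generation, memory_order_acquire) < gen)
            return false;
    }
    return true;
}

/* Block (by polling) until the whole group has reached `gen`. */
static void barrier_gather(flag_t *flags, size_t n, uint64_t gen)
{
    while (!all_peers_reached(flags, n, gen)) {
        /* a real backend would pause, yield, or sleep here */
    }
}
```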

LAYOUT

1-D Slot Array

v2 replaced v1's 2×32×32 matrix (8.5 MB mapped, only the diagonal used) with a compact 1-D per-core array — ~32× smaller footprint, far fewer cache and TLB misses per barrier. That alone is the ~2× speedup below.
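The ~32× figure follows from cell counts alone, as the arithmetic below shows. Two assumptions are made here that the text does not state: that "8.5 MB" means 8.5 MiB covering only the matrix, and that the 1-D array holds exactly what the diagonal held (one cell per core, with the same leading factor of 2). The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_CORES = 32, BANKS = 2 };

/* v1: a full 2 x 32 x 32 matrix is mapped. */
static size_t v1_cells(void) { return (size_t)BANKS * MAX_CORES * MAX_CORES; }

/* v2 (assumed): only the diagonal survives, i.e. one cell per core per bank. */
static size_t v2_cells(void) { return (size_t)BANKS * MAX_CORES; }

/* Implied v1 cell size if the 8.5 MB mapping is 8.5 MiB of matrix cells. */
static size_t v1_cell_bytes(void)
{
    const size_t mapped = (size_t)17 * 512 * 1024; /* 8.5 MiB in bytes */
    return mapped / v1_cells();
}
```

Under these assumptions v1 maps 2048 cells and v2 maps 64, a ratio of 32 regardless of cell size, and the implied v1 cell is 4352 bytes.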

Measured — zero load, per-core pinned, 100K cycles

The Barrier Floor & the Layout Win

Raw synchronization cost per cycle, swept 1–15 cores. The lock-free spin barrier stays sub-microsecond, rising only from 0.05 µs at 1 core to 0.24 µs at 15; the new 1-D layout roughly halves it versus v1.

Sync cost per cycle — v2 vs v1 (spin)

1-D slot array (v2) vs 32×32 diagonal matrix (v1) — same algorithm, ~2× lower

Cores	v2 spin (µs)	v1 (µs)	Speedup
1	0.05	0.10	2.0×
4	0.12	0.24	2.0×
8	0.14	0.33	2.4×
12	0.18	0.43	2.4×
15	0.24	0.44	1.8×

At a realistic per-cycle compute load of tens of microseconds, a 0.05–0.24 µs barrier is about 1% of wall time or less: even the 15-core worst case of 0.24 µs falls below 1% once each cycle carries about 25 µs of compute. The synchronization is effectively free; the speedup is bounded by the work, not the transport.
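The overhead claim is simple arithmetic on the table above: the barrier's share of a cycle is sync time divided by sync plus compute time. The helper below is an illustrative sketch (the name `sync_overhead_fraction` is not from the library).

```c
#include <assert.h>

/* Fraction of one cycle's wall time spent in the barrier.
 * Both arguments are in microseconds. */
static double sync_overhead_fraction(double sync_us, double compute_us)
{
    assert(sync_us >= 0.0 && compute_us > 0.0);
    return sync_us / (sync_us + compute_us);
}
```

With the measured 15-core cost of 0.24 µs, a 50 µs compute load gives a fraction just under 0.5%, while a 10 µs load gives about 2.3%. The sub-1% regime starts at roughly 25 µs of compute per cycle.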

The Backend Choice

Spin vs Futex — Latency or Occupancy

v2 ships two backends, selected in the shared-memory (SHM) header. Spin keeps latency lowest by polling; futex blocks in the kernel so a waiting core sleeps instead of burning a CPU. Same protocol, same result; the two differ only in their answer to "what should a core do while it waits?"

[Chart: Sync cost per cycle, spin vs futex. Spin polls (sub-µs); futex pays a syscall per blocked wait.]

[Chart: Idle-core CPU while waiting on a slow peer. 1 loaded core gates 7 idle cores; the chart shows what the idle cores cost.]

SPIN · DEFAULT

Lowest Latency

An escalating poll (pause → yield → nano-sleep) on the peer's flag. 0.05–0.24 µs per sync, but a waiting core keeps consuming CPU (28% busy in the idle-core test). The right choice when every microsecond of latency matters and cores are not oversubscribed.
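An escalating poll of this shape could look like the sketch below. The thresholds (`SPIN_POLLS`, `YIELD_POLLS`), the 1 µs sleep, and the function name are assumptions chosen for illustration; the actual values in gentpcsm are not given in the text. The pause instruction is used on x86 only and compiles to nothing elsewhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __builtin_ia32_pause() /* x86 PAUSE hint */
#else
#  define cpu_relax() ((void)0)              /* no hint on other targets */
#endif

enum { SPIN_POLLS = 1000, YIELD_POLLS = 100 }; /* hypothetical thresholds */

/* Wait until *flag >= gen, escalating pause -> yield -> nano-sleep.
 * Returns the number of failed polls, which is 0 if the peer had
 * already published when the call was made. */
static uint64_t wait_escalating(_Atomic uint64_t *flag, uint64_t gen)
{
    uint64_t polls = 0;
    assert(flag != NULL);
    while (atomic_load_explicit(flag, memory_order_acquire) < gen) {
        if (polls < SPIN_POLLS) {
            cpu_relax();                      /* cheapest: stay on the core */
        } else if (polls < SPIN_POLLS + YIELD_POLLS) {
            sched_yield();                    /* let another runnable thread in */
        } else {
            struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };
            nanosleep(&ts, NULL);             /* back off for about 1 us */
        }
        polls++;
    }
    return polls;
}
```

The first stage is what keeps the common case sub-microsecond; the later stages only bound the cost of waiting on a peer that is genuinely slow.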

FUTEX · OPT-IN

Frees the Core

The flag is a futex word; a waiting core blocks in the kernel and is woken on publish. Higher per-sync cost (0.2–2.1 µs of syscalls), but a core waiting on a slow peer drops from 28% to 2% CPU.
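A Linux-only sketch of a flag used as a futex word follows. It calls the raw `futex` system call through `syscall(2)`, since glibc provides no wrapper. The function names are illustrative, the 32-bit generation counter is an assumption (futex words must be 32 bits), and the cast assumes `_Atomic uint32_t` has the same representation as `uint32_t`, which holds on mainstream Linux ABIs. The non-private `FUTEX_WAIT` / `FUTEX_WAKE` operations are used because the word is meant to live in memory shared between processes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper over the raw futex system call. */
static long futex_call(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, (uint32_t *)addr, op, val, NULL, NULL, 0);
}

/* Sleep in the kernel until *flag >= gen. */
static void flag_wait(_Atomic uint32_t *flag, uint32_t gen)
{
    assert(flag != NULL);
    for (;;) {
        uint32_t seen = atomic_load_explicit(flag, memory_order_acquire);
        if (seen >= gen)
            return;
        /* The kernel sleeps only if *flag still equals `seen`; if the
         * writer published in between, the call returns immediately
         * and the loop re-checks, so no wake-up can be lost. */
        futex_call(flag, FUTEX_WAIT, seen);
    }
}

/* Publish `gen` and wake every sleeper; returns the number woken. */
static long flag_publish(_Atomic uint32_t *flag, uint32_t gen)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, gen, memory_order_release);
    return futex_call(flag, FUTEX_WAKE, INT_MAX);
}
```

The system call on each blocked wait and each publish is where the higher per-sync cost comes from; in exchange the waiting core is descheduled instead of polling.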

THE RULE

Spin for Latency, Block for Occupancy

Spin when you want the lowest sync latency and have cores to spare; futex when you are oversubscribed, when one participant lags, or when idle participants should give the CPU back. Chosen in the SHM header — no recompile.
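Selecting the backend without a recompile implies that both wait strategies are compiled in and one is picked from a field read at attach time. The sketch below illustrates that idea only: the `shm_header` layout, the `backend` encoding, and the stub wait functions are hypothetical and do not describe the real header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum backend { BACKEND_SPIN = 0, BACKEND_FUTEX = 1 }; /* hypothetical encoding */

struct shm_header {          /* hypothetical layout of the region's header */
    uint32_t magic;
    uint32_t backend;        /* one of enum backend */
    uint32_t n_participants;
};

typedef int (*wait_fn)(void);

/* Stand-ins for the two real wait routines; each returns its own id. */
static int wait_spin(void)  { return BACKEND_SPIN; }
static int wait_futex(void) { return BACKEND_FUTEX; }

/* Pick the wait routine from the header at attach time. */
static wait_fn select_wait(const struct shm_header *h)
{
    assert(h != NULL);
    switch (h->backend) {
    case BACKEND_FUTEX:
        return wait_futex;
    case BACKEND_SPIN:
    default:
        return wait_spin; /* spin is the documented default */
    }
}
```

Every participant reads the same header, so the whole group agrees on one backend for the lifetime of the region.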

Beyond Benchmarking

The Same Rig Calibrates Your Deployment

Because every variable is controlled, the measurement bench is not only how we report these numbers — it is a planning tool. The same setup sizes the partition, maps the CPUs, and dry-runs a customer's workload with cpu_burn standing in for their compute, before any real code exists.

CALIBRATE

Partition Sizing

For a given workload, find how many participants, which core type (P or E), and which backend (spin or futex) it actually needs — measured on the controlled rig, not guessed from a datasheet.

DISTRIBUTE

CPU Mapping

Decide the pinning before the real system exists: the critical path on fast P-cores, the rest on E-cores, headroom left for OS jitter. The rig shows exactly where a given layout stops scaling linearly.
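On Linux, pinning a participant to a chosen CPU is typically done with `sched_setaffinity`. The sketch below is a generic illustration, not the rig's own tooling; the helper names are assumptions, and which CPU numbers correspond to P-cores or E-cores is machine-specific and must be looked up on the target.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set); /* pid 0 = calling thread */
}

/* First CPU the caller is currently allowed to run on, or -1 on error. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &set))
            return c;
    }
    return -1;
}
```

Querying the allowed set first matters in containers and cgroup-restricted environments, where CPU 0 may not be available to the process.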

DRY-RUN

Workload Stand-In

cpu_burn injects a controlled per-cycle load that mimics the customer's compute. Dry-run the whole partition and synchronization topology — and read back the overhead fraction — before a single line of their code is written.
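A stand-in for this kind of controlled load can be as small as a timed busy loop. The sketch below is not the actual `cpu_burn` tool, whose interface is not described here; `burn_for_us` and `now_ns` are hypothetical names for a minimal equivalent built on the POSIX monotonic clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Keep the core busy for at least `us` microseconds.
 * Returns the time actually spent, in nanoseconds. */
static uint64_t burn_for_us(uint64_t us)
{
    const uint64_t start = now_ns();
    uint64_t elapsed;
    do {
        elapsed = now_ns() - start;
    } while (elapsed < us * 1000u);
    return elapsed;
}
```

Calling `burn_for_us` once per cycle between publish and gather gives each participant a known compute time, so the measured cycle time minus the injected load is the synchronization overhead.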

Engineering the Co-Sim — Gemmini accelerator node

The Same Discipline, on a Real Accelerator

The synchronization floor above is one half of performance work; the other is the co-simulation itself. When we put a real Gemmini systolic array on a cluster node, profiling, not guessing, found the cost and cut the run time by ~2.4×, with the distributed result staying bit-exact throughout.

PROFILE

Find the Real Cost

Per-phase markers showed the accelerator's tile-publish dominating each butterfly step. The cause: in the default SoC the Gemmini DMA is L2-coherent, so every cache line of a 64 KB tile is its own coherent transaction.

FIX

Bypass the Cache

Moving the node to a non-coherent (TLBroadcast) memory system lets the DMA write straight to shared memory, with no per-line coherence and no flush loops. A single 128³ tile incurs no cache-reuse penalty, so the change is a pure win.

VERIFY

Faster, Still Exact

Every result was re-checked: bit-identical to our model, to the golden, and to the official Gemmini ISA simulator. ~2.4× faster wall time with not one output value changed.
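"Bit-identical" is a stricter test than numerical equality, and a byte-wise comparison is the usual way to check it. The helper below is a generic sketch with a hypothetical name; it says nothing about how the actual verification harness is built or what element type the outputs use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True only if the two result buffers match byte for byte. */
static bool bit_exact(const void *a, const void *b, size_t bytes)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, bytes) == 0;
}
```

For integer outputs this is the same as element-wise equality. For floating-point outputs it is stricter: `0.0f` and `-0.0f` compare equal with `==` but differ in their sign bit, so `bit_exact` reports a mismatch.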

Wall time — heterogeneous Gemmini cluster (1 RTL array + N−1 models)

Nodes	Coherent DMA	Non-coherent DMA	Speedup	Result
4	218 s	94 s	2.3×	bit-exact
16	377 s	159 s	2.4×	bit-exact
64	527 s	224 s	2.4×	bit-exact

Wall time is RTL co-simulation time on one workstation, not silicon. The win is in the co-sim, where it makes verification practical. See the accelerator integration page.

Reproduce

From git, in seconds

$ cd sw/gentpcsm2 && make bench_spin   # zero load, per-core pinned

Cores WallTime(s) AvgTime(us)

1 0.0055 0.05

8 0.0141 0.14 (v1: 0.33 — ~2.4× slower)

15 0.0242 0.24

$ make bench_futex   # 1 loaded core, 7 idle

idle-core CPU — spin: 28% busy · futex: 2% busy

[OK] lock-free SWMR barrier · sub-µs floor · two backends · pure C11, no deps

About

A specialist engineering consultancy in Saltsjöbaden, Sweden. The gentpcsm v2 core is available as a software IP licence — a small, dependency-free C11 library you embed in your own distributed pipeline, co-simulation harness, or runtime.

📧 info@shevatech.com

🌐 www.shevatech.com

🔗 Yehoshua Shoshan on LinkedIn

📍 Saltsjöbaden, Sweden

What the numbers mean

A Floor, Not a Bottleneck

The headline is not a single speed — it is that synchronization has a fixed, tiny, sub-microsecond cost that does not grow meaningfully with core count, and that you can choose whether a waiting core spends a CPU or yields it. Under any real workload the barrier is a rounding error; the system scales with the computation, exactly as it should.

Built on Open Standards & Open Source IP

DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects:

Verilator · Apache TVM · UC Berkeley Chipyard (Gemmini) · OpenHW Group (CVA6) · NVIDIA (NVDLA) · RISC-V International (Spike ISS)