GENTPCSM v2 · PURE C11 · MEASURED · i9-14900K

DistProc Performance

A deep dive into the numbers behind the synchronization layer. What a distributed synchronization event actually costs, and how to trade latency for occupancy.

Test Environment & Executive Summary
0.05 µs sync floor (1 core, spin backend)
0.24 µs barrier cost (15 cores, spin)
~2× faster than v1 (1-D layout vs 32×32 matrix)
14× idle-core CPU saved (futex parks, 28% → 2%)
~300 lines of C11 (no deps, embeddable)

The Core

A Synchronization Event, Priced

gentpcsm v2 is the software-only DistProc IP: autonomous processes share one POSIX memory region under a single-writer / multiple-reader discipline, and meet at an event-driven barrier. No locks, no serialization, and (with the default spin backend) no kernel on the hot path. The cost of one synchronization is the cost of one atomic barrier operation, independent of data volume.

PUBLISH

One Atomic Store

Each core writes its own slot (single writer) and raises a flag with one release-store. There is exactly one writer per location, so coherence reduces to publication — no mutex, no compare-and-swap contention.
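A minimal C11 sketch of the publish step, under stated assumptions: the names `slot_t`, `publish`, `try_read` and `MAX_CORES` are hypothetical and not taken from the gentpcsm sources, and the slot carries a single 64-bit payload for brevity. It shows why one release-store is enough when each location has exactly one writer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CORES 32               /* hypothetical upper bound */

/* One slot per core. Only the owning core ever writes its slot. */
typedef struct {
    uint64_t payload;              /* data to share, written before the flag */
    _Atomic uint32_t flag;         /* generation number, raised last */
} slot_t;

static slot_t slots[MAX_CORES];

/* Publish: plain store of the data, then one release-store of the flag.
 * The release ordering makes the payload visible to any reader that
 * observes the new flag value with an acquire-load. */
void publish(unsigned self, uint64_t value, uint32_t generation)
{
    assert(self < MAX_CORES);
    slots[self].payload = value;
    atomic_store_explicit(&slots[self].flag, generation, memory_order_release);
}

/* Reader side: returns 1 and copies the payload once the peer's flag has
 * reached `generation`, otherwise returns 0 without touching `out`. */
int try_read(unsigned peer, uint32_t generation, uint64_t *out)
{
    assert(peer < MAX_CORES);
    if (atomic_load_explicit(&slots[peer].flag, memory_order_acquire) < generation)
        return 0;
    *out = slots[peer].payload;
    return 1;
}
```

No compare-and-swap appears anywhere: with a single writer per slot there is nothing to arbitrate, only something to make visible.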

GATHER

Acquire-Load Each Peer

A core advances only when every peer's flag has reached the current generation — an acquire-load per peer. The barrier fires only when the computation has something to share, not on a clock edge.
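The gather step can be sketched as follows. This is an illustration, not the shipped implementation: `flags`, `arrive`, `all_arrived` and `barrier` are hypothetical names, the flag array is process-local rather than in shared memory, and generation wrap-around is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CORES 32                      /* hypothetical upper bound */

static _Atomic uint32_t flags[MAX_CORES]; /* one generation flag per core */

/* Announce that `self` has finished the work of `generation`. */
void arrive(unsigned self, uint32_t generation)
{
    assert(self < MAX_CORES);
    atomic_store_explicit(&flags[self], generation, memory_order_release);
}

/* One acquire-load per peer. A peer may already be one generation ahead,
 * so the test is "has reached", not "equals". */
int all_arrived(unsigned ncores, uint32_t generation)
{
    assert(ncores <= MAX_CORES);
    for (unsigned peer = 0; peer < ncores; peer++) {
        uint32_t seen = atomic_load_explicit(&flags[peer], memory_order_acquire);
        if (seen < generation)
            return 0;
    }
    return 1;
}

/* Event-driven barrier: raise our own flag, then wait for everyone else. */
void barrier(unsigned self, unsigned ncores, uint32_t generation)
{
    arrive(self, generation);
    while (!all_arrived(ncores, generation))
        ;                                 /* plain spin; no back-off here */
}
```

Each participant calls `barrier(self, ncores, g)` with the same increasing `g` whenever it has a result to share, which is what makes the barrier event-driven rather than clocked.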

LAYOUT

1-D Slot Array

v2 replaced v1's 2×32×32 matrix (8.5 MB mapped, only the diagonal used) with a compact 1-D per-core array: a ~32× smaller footprint and far fewer cache and TLB misses per barrier. That change alone accounts for the ~2× speedup below.
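The layout change can be illustrated with two struct definitions. The 64-byte slot and cache-line size are assumptions made for this sketch (the real slots are evidently larger, given the 8.5 MB figure), and all type and function names are hypothetical. The ratio between the two footprints does not depend on the slot size.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NCORES     32
#define CACHE_LINE 64   /* assumed line size; slot size is illustrative */

/* One slot padded to a full cache line to avoid false sharing. */
typedef struct {
    alignas(CACHE_LINE) _Atomic uint32_t flag;
} slot_t;

static_assert(sizeof(slot_t) == CACHE_LINE, "one slot per cache line");

/* v1-style: N x N matrix where only cell[i][i] is ever used. */
typedef struct { slot_t cell[NCORES][NCORES]; } matrix_layout_t;

/* v2-style: one slot per core, laid out contiguously. */
typedef struct { slot_t slot[NCORES]; } array_layout_t;

/* Distance in bytes between two consecutive slots that are actually used. */
size_t diagonal_stride(void) { return (NCORES + 1) * sizeof(slot_t); }
size_t array_stride(void)    { return sizeof(slot_t); }

/* How many times larger the matrix mapping is than the 1-D array. */
size_t footprint_ratio(void)
{
    return sizeof(matrix_layout_t) / sizeof(array_layout_t);
}
```

With 32 cores the ratio is 32, and a gather that walks the diagonal jumps 33 slots between reads, whereas the 1-D walk touches adjacent lines that the hardware prefetcher and a single TLB entry can cover.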


Measured — zero load, per-core pinned, 100K cycles

The Barrier Floor & the Layout Win

Raw synchronization cost per cycle, swept 1–15 cores. The lock-free spin barrier stays sub-microsecond and nearly flat; the new 1-D layout roughly halves it versus v1.

Sync cost per cycle — v2 vs v1 (spin)
1-D slot array (v2) vs 32×32 diagonal matrix (v1) — same algorithm, ~2× lower
Cores   v2 spin (µs)   v1 (µs)   Speedup
1       0.05           0.10      2.0×
4       0.12           0.24      2.0×
8       0.14           0.33      2.4×
12      0.18           0.43      2.4×
15      0.24           0.44      1.8×

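At a realistic per-cycle compute load of tens of microseconds, a 0.05–0.24 µs barrier is around 1% of wall time or less: even the 15-core worst case of 0.24 µs drops under 1% once a cycle carries more than about 24 µs of compute. The synchronization is effectively free; the speedup is bounded by the work, not the transport.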
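The overhead claim reduces to one ratio, sketched here so it can be evaluated for any load. The function name `sync_overhead` is hypothetical; the sync costs fed into it would come from the table above.

```c
#include <assert.h>

/* Fraction of each cycle's wall time spent in the barrier, given the
 * measured sync cost and the per-cycle compute load (both in µs). */
double sync_overhead(double sync_us, double compute_us)
{
    assert(sync_us >= 0.0 && compute_us > 0.0);
    return sync_us / (sync_us + compute_us);
}
```

For example, `sync_overhead(0.24, 24.0)` is just under 0.01, while the same barrier against a 10 µs load is a little over 0.02, which is why the per-cycle load matters more than the core count when budgeting.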


The Backend Choice

Spin vs Futex — Latency or Occupancy

v2 ships two header-selectable backends. Spin keeps latency lowest by polling; futex blocks in the kernel so a waiting core sleeps instead of burning a CPU. Same protocol, same result — a different answer to "what should a core do while it waits?"

[Chart: Sync cost per cycle, spin vs futex. Spin polls (sub-µs); futex pays a syscall per blocked wait.]
[Chart: Idle-core CPU while waiting on a slow peer. 1 loaded core gates 7 idle cores; what the idle cores cost.]
SPIN · DEFAULT

Lowest Latency

An escalating poll (pause → yield → nano-sleep) on the peer's flag. 0.05–0.24 µs per sync, but a waiting core keeps burning CPU (28% busy in the idle-core test). The right choice when every microsecond of latency matters and cores are not oversubscribed.
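A sketch of what an escalating poll can look like. The thresholds `SPIN_LIMIT` and `YIELD_LIMIT`, the 1 µs sleep, and the function name `wait_for_flag` are assumptions chosen for illustration, not values from gentpcsm.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>       /* sched_yield */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>        /* nanosleep */

#if defined(__x86_64__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* hint to the core that we are spinning */
#else
#define cpu_relax() ((void)0)
#endif

#define SPIN_LIMIT  1000UL        /* assumed: polls before yielding */
#define YIELD_LIMIT 100UL         /* assumed: yields before sleeping */

/* Wait until *flag reaches `generation`. Returns the number of polls that
 * found the flag not yet ready (0 means it was ready on the first look). */
unsigned long wait_for_flag(_Atomic uint32_t *flag, uint32_t generation)
{
    unsigned long polls = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) < generation) {
        if (polls < SPIN_LIMIT) {
            cpu_relax();                       /* stage 1: pause */
        } else if (polls < SPIN_LIMIT + YIELD_LIMIT) {
            sched_yield();                     /* stage 2: yield the CPU */
        } else {
            struct timespec ts = { 0, 1000 };  /* stage 3: sleep ~1 µs */
            nanosleep(&ts, NULL);
        }
        polls++;
    }
    return polls;
}
```

The early stages keep wake-up latency in the tens of nanoseconds when the peer is nearly done; the later stages are what stop a long wait from pinning the core at 100%.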

FUTEX · OPT-IN

Frees the Core

The flag is a futex word; a waiting core blocks in the kernel and is woken on publish. Higher per-sync cost (0.2–2.1 µs of syscalls), but a core waiting on a slow peer drops from 28% to 2% CPU.
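A Linux-only sketch of the blocking variant, using the raw `futex` system call. The wrapper and function names are hypothetical. The non-private `FUTEX_WAIT` / `FUTEX_WAKE` operations are used on the assumption that the flag lives in memory shared between processes, and the cast assumes `_Atomic uint32_t` has the same representation as `uint32_t`, which holds for GCC and Clang on Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper: glibc does not export a futex() function. */
static long futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, (uint32_t *)addr, op, val, NULL, NULL, 0);
}

/* Publish a new generation and wake every core blocked on this flag.
 * Returns the number of waiters woken (0 if nobody was sleeping). */
long publish_and_wake(_Atomic uint32_t *flag, uint32_t generation)
{
    atomic_store_explicit(flag, generation, memory_order_release);
    return futex(flag, FUTEX_WAKE, INT_MAX);
}

/* Block in the kernel until *flag reaches `generation`. */
void wait_blocking(_Atomic uint32_t *flag, uint32_t generation)
{
    for (;;) {
        uint32_t seen = atomic_load_explicit(flag, memory_order_acquire);
        if (seen >= generation)
            return;
        /* Sleeps only if *flag still equals `seen`; if the writer has
         * published in the meantime the call returns at once and we
         * re-check, so a wake-up cannot be lost. */
        futex(flag, FUTEX_WAIT, seen);
    }
}
```

The syscall on both sides is where the extra 0.2–2.1 µs goes; in exchange the waiting core is descheduled instead of polling.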

THE RULE

Spin for Latency, Block for Occupancy

Spin when you want the lowest sync latency and have cores to spare; futex when you are oversubscribed, when one participant lags, or when idle participants should give the CPU back. The backend is chosen in the shared-memory (SHM) header, so switching needs no recompile.


Beyond Benchmarking

The Same Rig Calibrates Your Deployment

Because every variable is controlled, the measurement bench is not only how we report these numbers — it is a planning tool. The same setup sizes the partition, maps the CPUs, and dry-runs a customer's workload with cpu_burn standing in for their compute, before any real code exists.

CALIBRATE

Partition Sizing

For a given workload, find how many participants, which core type (P or E), and which backend (spin or futex) it actually needs — measured on the controlled rig, not guessed from a datasheet.

DISTRIBUTE

CPU Mapping

Decide the pinning before the real system exists: the critical path on fast P-cores, the rest on E-cores, headroom left for OS jitter. The rig shows exactly where a given layout stops scaling linearly.

DRY-RUN

Workload Stand-In

cpu_burn injects a controlled per-cycle load that mimics the customer's compute. Dry-run the whole partition and synchronization topology — and read back the overhead fraction — before a single line of their code is written.
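A stand-in load of this kind can be sketched in a few lines. This is not the `cpu_burn` tool itself: `now_ns`, `burn_ns` and `dry_run` are hypothetical names, and the synchronization call is left as a comment so the block stays self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-loop for at least `load_ns` nanoseconds, standing in for the
 * customer's per-cycle compute. Returns the time actually spent. */
uint64_t burn_ns(uint64_t load_ns)
{
    uint64_t start = now_ns();
    uint64_t elapsed;
    volatile uint64_t sink = 0;    /* keeps the loop from being optimized out */
    do {
        sink += 1;
        elapsed = now_ns() - start;
    } while (elapsed < load_ns);
    return elapsed;
}

/* Run `cycles` iterations of load followed by a synchronization point and
 * return the total wall time in nanoseconds. */
uint64_t dry_run(unsigned cycles, uint64_t load_ns)
{
    uint64_t start = now_ns();
    for (unsigned i = 0; i < cycles; i++) {
        burn_ns(load_ns);
        /* a barrier call would go here in a real partition */
    }
    return now_ns() - start;
}
```

Subtracting `cycles * load_ns` from the value `dry_run` returns, and dividing by that value, gives the overhead fraction for the chosen partition and backend.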


Engineering the Co-Sim — Gemmini accelerator node

The Same Discipline, on a Real Accelerator

The synchronization floor above is one half of performance work; the other is the co-simulation itself. When we put a real Gemmini systolic array on a cluster node, profiling rather than guessing found the cost and cut the run time ~2.4×, with the distributed result staying bit-exact throughout.

PROFILE

Find the Real Cost

Per-phase markers showed the accelerator's tile-publish dominating each butterfly step. The cause: in the default SoC the Gemmini DMA is L2-coherent, so every cache line of a 64 KB tile is its own coherent transaction.

FIX

Bypass the Cache

Moving the node to a non-coherent (TLBroadcast) memory system lets the DMA write straight to shared memory: no per-line coherence, no flush loops. A single 128³ tile takes no cache-reuse penalty, so it is a pure win.

VERIFY

Faster, Still Exact

Every result re-checked: bit-identical to our model, to the golden, and to the official Gemmini ISA simulator. ~2.4× faster wall time with not one output value changed.

Nodes   Coherent DMA   Non-coherent DMA   Speedup   Result
4       218 s          94 s               2.3×      bit-exact
16      377 s          159 s              2.4×      bit-exact
64      527 s          224 s              2.4×      bit-exact

Wall time is RTL-co-simulation time on one workstation, not silicon — the win is in the co-sim, where it makes verification practical. See the accelerator integration →.

Reproduce

From git, in seconds

$ cd sw/gentpcsm2 && make bench_spin   # zero load, per-core pinned
Cores WallTime(s) AvgTime(us)
1 0.0055 0.05
8 0.0141 0.14 (v1: 0.33 — ~2.4× slower)
15 0.0242 0.24
 
$ make bench_futex   # 1 loaded core, 7 idle
idle-core CPU — spin: 28% busy · futex: 2% busy
 
[OK] lock-free SWMR barrier · sub-µs floor · two backends · pure C11, no deps

Shevatech AB

A specialist engineering consultancy in Saltsjöbaden, Sweden. The gentpcsm v2 core is available as a software IP licence — a small, dependency-free C11 library you embed in your own distributed pipeline, co-simulation harness, or runtime.

📧 info@shevatech.com
🌐 www.shevatech.com
🔗 Yehoshua Shoshan on LinkedIn
📍 Saltsjöbaden, Sweden

A Floor, Not a Bottleneck

The headline is not a single speed — it is that synchronization has a fixed, tiny, sub-microsecond cost that does not grow meaningfully with core count, and that you can choose whether a waiting core spends a CPU or yields it. Under any real workload the barrier is a rounding error; the system scales with the computation, exactly as it should.


Built on Open Standards & Open Source IP

DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects: