A deep dive into the numbers behind the synchronization layer. What a distributed synchronization event actually costs, and how to trade latency for occupancy.
Benchmark processes are pinned with taskset to prevent OS scheduler jitter.
gentpcsm v2 is the software-only DistProc IP: autonomous processes share one POSIX memory region under a single-writer / multiple-reader discipline, and meet at an event-driven barrier. No locks, no kernel on the hot path, no serialization. The cost of one synchronization is the cost of one atomic barrier operation — not of data volume.
Each core writes its own slot (single writer) and raises a flag with one release-store. There is exactly one writer per location, so coherence reduces to publication — no mutex, no compare-and-swap contention.
A core advances only when every peer's flag has reached the current generation — an acquire-load per peer. The barrier fires only when the computation has something to share, not on a clock edge.
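The publish and wait steps described above can be sketched with C11 atomics. This is a minimal illustration, not the gentpcsm source: the `core_slot` layout, the names `slot_publish` and `barrier_ready`, the 64-byte cache-line size, and the `MAX_CORES` bound are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CORES  16   /* hypothetical upper bound for this sketch */
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* One slot per core. Only the owning core ever writes its slot. */
typedef struct {
    _Alignas(CACHE_LINE) _Atomic uint32_t generation; /* flag: last published generation */
    uint32_t payload;                                 /* data written before the flag */
} core_slot;

/* Publish: plain write of the payload, then one release-store of the flag. */
static void slot_publish(core_slot *slots, int self, uint32_t gen, uint32_t value)
{
    assert(self >= 0 && self < MAX_CORES);
    slots[self].payload = value;
    atomic_store_explicit(&slots[self].generation, gen, memory_order_release);
}

/* Returns 1 once every slot has reached `gen`, else 0.
 * One acquire-load per peer; the acquire pairs with the writer's release,
 * so the payload is visible once the flag is. */
static int barrier_ready(core_slot *slots, int ncores, uint32_t gen)
{
    for (int i = 0; i < ncores; i++) {
        uint32_t seen = atomic_load_explicit(&slots[i].generation,
                                             memory_order_acquire);
        if ((int32_t)(seen - gen) < 0)   /* wrap-tolerant "seen < gen" */
            return 0;
    }
    return 1;
}
```

Because each flag has exactly one writer, a plain release-store is enough; no read-modify-write instruction is needed anywhere in the sketch.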
v2 replaced v1's 2×32×32 matrix (8.5 MB mapped, only the diagonal used) with a compact 1-D per-core array — ~32× smaller footprint, far fewer cache and TLB misses per barrier. That alone is the ~2× speedup below.
Raw synchronization cost per cycle, swept 1–15 cores. The lock-free spin barrier stays sub-microsecond and nearly flat; the new 1-D layout roughly halves it versus v1.
At a realistic per-cycle compute load of tens of microseconds, a 0.05–0.24 µs barrier is well under 1% of wall-time. The synchronization is effectively free — the speedup is bounded by the work, not the transport.
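The overhead claim is simple arithmetic: the barrier's share of a cycle is sync time divided by sync time plus compute time. A small sketch, where the function name is hypothetical and the 50 µs compute load used afterwards is an assumed example rather than a measured figure:

```c
#include <assert.h>

/* Fraction of each cycle's wall time spent in the barrier.
 * Both arguments are in microseconds. */
static double sync_overhead_fraction(double sync_us, double compute_us)
{
    assert(sync_us >= 0.0 && compute_us > 0.0);
    return sync_us / (sync_us + compute_us);
}
```

With an assumed 50 µs of compute per cycle, the worst-case 0.24 µs barrier comes to roughly 0.5% of the cycle. By the same formula, the 0.24 µs figure stays under 1% once per-cycle compute exceeds about 24 µs.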
v2 ships two backends, selected in the shared-memory (SHM) header. Spin keeps latency lowest by polling; futex blocks in the kernel so a waiting core sleeps instead of burning a CPU. Same protocol, same result; only the answer to "what should a core do while it waits?" differs.
Spin backend: an escalating poll (pause → yield → nano-sleep) on the peer's flag. 0.05–0.24 µs per sync, but a waiting core keeps its CPU busy. The right choice when every microsecond of latency matters and cores are not oversubscribed.
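An escalating poll of this kind might look like the sketch below. The thresholds, the 1 µs sleep, and the names `cpu_relax` and `spin_wait` are illustrative assumptions, not gentpcsm's tuned values or API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Illustrative escalation thresholds (assumed, not measured). */
#define SPIN_PAUSE_ITERS 1000u
#define SPIN_YIELD_ITERS 2000u

/* Hint to the CPU that we are in a spin loop; a no-op on other targets. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#elif defined(__aarch64__)
    __asm__ volatile("yield");
#endif
}

/* Poll until *flag reaches `gen`.
 * Returns how many polls found the flag not yet ready. */
static unsigned spin_wait(_Atomic uint32_t *flag, uint32_t gen)
{
    assert(flag != NULL);
    unsigned polls = 0;
    while ((int32_t)(atomic_load_explicit(flag, memory_order_acquire) - gen) < 0) {
        if (polls < SPIN_PAUSE_ITERS) {
            cpu_relax();                       /* stage 1: pause */
        } else if (polls < SPIN_YIELD_ITERS) {
            sched_yield();                     /* stage 2: yield the time slice */
        } else {
            struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };
            nanosleep(&ts, NULL);              /* stage 3: short sleep */
        }
        polls++;
    }
    return polls;
}
```

When the peer has already published, the loop body never runs and the wait costs a single acquire-load.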
Futex backend: the flag is a futex word; a waiting core blocks in the kernel and is woken on publish. Higher per-sync cost (0.2–2.1 µs of syscalls), but a core waiting on a slow peer drops from 28% to 2% CPU.
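On Linux, a blocking wait on a flag can be sketched with the raw futex system call. The names `futex_op`, `futex_publish`, and `futex_wait_for` are hypothetical. The sketch uses the non-private futex operations on the assumption that the word lives in memory shared between processes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no futex() wrapper, so call through syscall(2). */
static long futex_op(_Atomic uint32_t *word, int op, uint32_t val)
{
    return syscall(SYS_futex, (uint32_t *)word, op, val, NULL, NULL, 0);
}

/* Writer: release-store the new generation, then wake every sleeper. */
static void futex_publish(_Atomic uint32_t *flag, uint32_t gen)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, gen, memory_order_release);
    futex_op(flag, FUTEX_WAKE, INT_MAX);
}

/* Reader: sleep in the kernel until *flag reaches `gen`. */
static void futex_wait_for(_Atomic uint32_t *flag, uint32_t gen)
{
    assert(flag != NULL);
    for (;;) {
        uint32_t seen = atomic_load_explicit(flag, memory_order_acquire);
        if ((int32_t)(seen - gen) >= 0)
            return;
        /* Blocks only if *flag still equals `seen`; otherwise returns at
         * once, so a publish between the load and the call is not lost. */
        futex_op(flag, FUTEX_WAIT, seen);
    }
}
```

The syscalls in both paths are where the higher per-sync cost comes from; the benefit is that a reader stuck behind a slow peer is descheduled rather than polling.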
Spin when you want the lowest sync latency and have cores to spare; futex when you are oversubscribed, when one participant lags, or when idle participants should give the CPU back. Chosen in the SHM header — no recompile.
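Selecting the backend at run time from a header field could look like the following sketch. The `shm_header` layout, its field names, and the two stand-in wait routines are assumptions for illustration; the real header format is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend identifiers stored in the shared-memory header. */
enum sync_backend { BACKEND_SPIN = 0, BACKEND_FUTEX = 1 };

/* Hypothetical header at the start of the shared region. */
typedef struct {
    uint32_t backend;   /* one of enum sync_backend, set by the creator */
    uint32_t ncores;    /* number of participants */
} shm_header;

typedef void (*wait_fn)(void);

static void wait_spin(void)  { /* stand-in for the polling wait */ }
static void wait_futex(void) { /* stand-in for the blocking wait */ }

/* Resolve the wait routine once, at attach time, from the header field. */
static wait_fn select_wait(const shm_header *hdr)
{
    assert(hdr != NULL);
    return hdr->backend == BACKEND_FUTEX ? wait_futex : wait_spin;
}
```

Both routines are compiled in and the choice is data, so switching backends means rewriting one field rather than rebuilding the binary.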
Because every variable is controlled, the measurement bench is not only how we report these numbers — it is a planning tool. The same setup sizes the partition, maps the CPUs, and dry-runs a customer's workload with cpu_burn standing in for their compute, before any real code exists.
For a given workload, find how many participants, which core type (P or E), and which backend (spin or futex) it actually needs — measured on the controlled rig, not guessed from a datasheet.
Decide the pinning before the real system exists: the critical path on fast P-cores, the rest on E-cores, headroom left for OS jitter. The rig shows exactly where a given layout stops scaling linearly.
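Pinning can also be done from inside a process with the Linux affinity API, the programmatic counterpart of `taskset`. A minimal sketch, where `pin_to_cpu` is a hypothetical helper; which logical CPU numbers correspond to P-cores or E-cores is machine-specific and has to be looked up on the target.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single logical CPU.
 * Returns 0 on success, -1 on error (errno set by the kernel). */
static int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* pid 0 = caller */
}
```

Each participant would call this once at start-up with its assigned CPU, before entering the barrier loop.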
cpu_burn injects a controlled per-cycle load that mimics the customer's compute. Dry-run the whole partition and synchronization topology — and read back the overhead fraction — before a single line of their code is written.
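A controlled per-cycle load can be approximated by a busy loop that runs until a monotonic clock says the budget is spent. This is a stand-in written for illustration, not the actual cpu_burn tool; the names `now_ns` and `cpu_burn_ns` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-loop for at least `burn_ns` nanoseconds.
 * Returns the time actually spent, which may overshoot slightly. */
static uint64_t cpu_burn_ns(uint64_t burn_ns)
{
    uint64_t start = now_ns();
    volatile uint64_t sink = 0;  /* volatile keeps the loop from being optimised away */
    uint64_t elapsed;
    do {
        sink += 1;
        elapsed = now_ns() - start;
    } while (elapsed < burn_ns);
    return elapsed;
}
```

Calling this between barrier operations with the customer's estimated per-cycle compute time, then comparing total wall time against the sum of the burns, gives the overhead fraction for that topology.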
The synchronization floor above is one half of performance work; the other is the co-simulation itself. When we put a real Gemmini systolic array on a cluster node, profiling, not guessing, found the cost and cut the run time by ~2.4×, with the distributed result staying bit-exact throughout.
Per-phase markers showed the accelerator's tile-publish dominating each butterfly step. The cause: in the default SoC the Gemmini DMA is L2-coherent, so every cache line of a 64 KB tile is its own coherent transaction.
Moving the node to a non-coherent (TLBroadcast) memory system lets the DMA write straight to shared memory, with no per-line coherence and no flush loops. A single 128³ tile takes no cache-reuse penalty, so the change is a pure win.
Every result re-checked: bit-identical to our model, to the golden, and to the official Gemmini ISA simulator. ~2.4× faster wall time with not one output value changed.
| Nodes | Coherent DMA | Non-coherent DMA | Speedup | Result |
|---|---|---|---|---|
| 4 | 218 s | 94 s | 2.3× | bit-exact |
| 16 | 377 s | 159 s | 2.4× | bit-exact |
| 64 | 527 s | 224 s | 2.4× | bit-exact |
Wall time is RTL co-simulation time on one workstation, not silicon; the win is in the co-sim, where it makes verification practical.
A specialist engineering consultancy in Saltsjöbaden, Sweden. The gentpcsm v2 core is available as a software IP licence — a small, dependency-free C11 library you embed in your own distributed pipeline, co-simulation harness, or runtime.
The headline is not a single speed — it is that synchronization has a fixed, tiny, sub-microsecond cost that does not grow meaningfully with core count, and that you can choose whether a waiting core spends a CPU or yields it. Under any real workload the barrier is a rounding error; the system scales with the computation, exactly as it should.
DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects: