Shevatech AB

Architecture

One Real Core. A Cluster of Models. A Relocatable Reference.

The cluster is heterogeneous by construction: nodes are interchangeably cycle-accurate RTL or fast C++ functional models, cooperating over the same zero-copy shared memory.

01 / HETEROGENEITY

Verify a Tile in a Big Context

Node 0 is a real CVA6 SoC in Verilator; the other N-1 nodes are millisecond C++ models. Adding 60 model nodes costs essentially nothing — wall-time tracks one RTL core regardless of array size. N=64 runs in about the same time as N=4.

02 / REFERENCE DECOUPLING

Per-Retire ISS Lockstep

Node 0 publishes every RVFI retire over the DistProc barrier to a private Spike ISS in a separate process. Reference-guided injection lets Spike follow the core's reads of data other nodes wrote — verifying an interacting core, not an isolated one. Zero mismatches.

03 / REAL COLLECTIVE

Butterfly All-Reduce + Fabric Model

A recursive-doubling all-reduce of distributed-GEMM shards — log₂N steps, node-to-node, no central gather. A configurable interconnect model prices the collective on real fabrics: 10 GbE through NVLink. The same workload, any network.

System Overview

Process Structure

One RTL node bound to its ISS over the control-plane barrier; every node meets on the SWMR data plane and exchanges through an explicit butterfly. Heterogeneous, bit-identical.

Measured — CVA6 RV64GC + Spike, i9-14900K

Distributed GEMM, All-Reduced & Verified

C[D×D] = A[D×N] @ B[N×D] sharded across N nodes; each node computes a rank-1 outer product, a butterfly all-reduce sums them. Node 0's shard is cycle-accurate and Spike-verified; the reduced result is bit-exact against an independent golden at every scale.

Latency regime — small shard (128 B / step)

Nodes	Wall (s)	Spike	C==A@B
4	11.7	PASS	exact
16	11.7	PASS	exact
64	11.7	PASS	exact

Bandwidth regime — large shard (32 KB / step)

Nodes	Wall (s)	Spike	C==A@B
4	18.0	PASS	exact
16	19.2	PASS	exact
64	22.5	PASS	exact

The invariance

Wall-time tracks one RTL node regardless of array size — adding 60 model nodes (4 → 64) is essentially free because they are millisecond C++ models. The slight growth in the bandwidth regime is just Node 0 copying the larger exchange vector over more butterfly steps. Node 0 stays cycle-accurate and Spike-verified the entire time.

Accelerator Integration — Gemmini, verified vs the ISA simulator

The Cycle-Accurate Node Can Be a Real Accelerator

Node 0 doesn't have to be a CPU. We dropped a real Gemmini systolic array — 128×128 PEs, 16,384 INT8 MACs — onto the cluster's cycle-accurate node. It computes its GEMM shard on the actual array and joins the same butterfly all-reduce, every byte of the collective riding the accelerator's own DMA — and it's verified the same way the CPU node is: against an independent reference simulator.

01 / REAL HARDWARE

A Systolic Array, Not a Loop

Node 0 is a Rocket+Gemmini SoC in Verilator. Its 128×128×128 tile is computed on the real mesh; the entire butterfly all-reduce moves through Gemmini DMA — mvout to publish a shard, mvin-accumulate to reduce a partner's. The host core only ever touches one-word flags.

02 / INDEPENDENT REFERENCE

Verified vs the Gemmini ISA Simulator

Every output tile is checked bit-for-bit against the official Gemmini ISA functional simulator (Spike + libgemmini) running the identical shard. It is the accelerator analog of the CPU node's per-instruction-retire Spike lockstep — the real array, held to the reference model.

03 / HETEROGENEOUS

One Real Array, N−1 Models

The other nodes are fast C++ Gemmini models. A faithful-twin check confirms the RTL array and the model emit bit-identical tiles for the same shard — so substituting models for cycle-accurate hardware loses no numeric fidelity. 1 RTL Gemmini + 63 models, exact.

Four independent checks — bit-exact at every scale

Nodes	RTL = model (faithful twin)	all-reduce = A@B	RTL tile = Gemmini ISS	all-reduce = Σ ISS tiles
4	PASS	PASS	PASS	PASS
16	PASS	PASS	PASS	PASS
64	PASS	PASS	PASS	PASS

0 / 16,384 element mismatches on every tile, at every scale. ISS = spike --extension=gemmini, the official Gemmini ISA functional model.

Milestone Achieved: Unified CV-X-IF Integration

We have successfully bridged the Gemmini accelerator directly to the CVA6 core via a custom shim bridging the standard RISC-V coprocessor interface (CV-X-IF) to the accelerator (RoCC). DistProc's EADDP architecture enables this Dual-Altitude Verification to run simultaneously: the scalar CVA6 core is verified per-instruction-retire against the Spike ISS, while the Gemmini accelerator is verified per-tile against the Spike-gemmini model. Both verifications run simultaneously while the node executes its shard of the 64-node distributed butterfly all-reduce.

Interconnect Model

Same Collective, Any Network

The all-reduce wire time is modeled as log₂N × (latency + message/bandwidth). Small tensors are latency-bound and every fabric looks similar; large tensors are bandwidth-bound and the fabrics diverge by 60×. The choice of network goes from "doesn't matter" to "decides everything".

64-node all-reduce time by fabric

large shard, 32 KB/step, 6 butterfly steps — bandwidth-bound

Fabric	Latency	Bandwidth	All-reduce	Regime
10 GbE	10 µs	1.25 GB/s	217 µs	BW-bound
100 GbE	5 µs	12.5 GB/s	45.7 µs	lat-bound
InfiniBand HDR	1 µs	25 GB/s	13.9 µs	BW-bound
NVLink	0.5 µs	300 GB/s	3.7 µs	lat-bound

Small shard (128 B/step), same 64 nodes: all four fabrics fall within 3–6 µs. The message size flips the regime — and the sim wall stays ~one RTL node either way.

Verified Output

Reproducible from git

// GEMM_DIM=64 ./run_cluster_ring.sh 4 16 64 — mongoose i9-14900K

Shevatech cluster demo — distributed-GEMM butterfly ALL-REDUCE

shard C[64×64], message S=32768 B/step (bandwidth regime)

Node 0 = CVA6 RTL, Spike-verified · Nodes 1..N-1 = C++ models

N=4 (1 RTL + 3 mdl) | 18.0s | Spike PASS | C[0,0]=50 == 50

N=16 (1 RTL + 15 mdl) | 19.2s | Spike PASS | C[0,0]=1768 == 1768

N=64 (1 RTL + 63 mdl) | 22.5s | Spike PASS | C[0,0]=93600 == 93600

[SUCCESS] one cycle-accurate CVA6 node, per-retire ISS-verified,

reducing a 64-node distributed GEMM, bit-exact, in ~22s.

About

A specialist engineering consultancy based in Saltsjöbaden, Sweden. DistProc and the CVA6 cluster-verification stack are available as a software IP licence, a consulting engagement, or a full development project.

📧 info@shevatech.com

🌐 www.shevatech.com

🔗 Yehoshua Shoshan on LinkedIn

📍 Saltsjöbaden, Sweden

Why it matters

Accessibility, Not Just Speed

For most teams, the alternative to this is not a faster tool — it is a $2M+ hardware emulator or commercial EDA licenses (VCS, Xcelium, Questa). DistProc combined with Verilator keeps one node cycle-accurate and ISS-verified inside a full-scale array, on commodity hardware, with a relocatable golden reference — the system-context verification you can actually run, at a fraction of the cost.

Cycle-Accurate RISC-V & NPU Verification
at Cluster Scale