One CVA6 core, verified per-instruction-retire against a golden reference simulator, running its shard of a distributed-AI collective inside a 64-node cluster — the other 63 nodes are fast functional models. On a workstation, in seconds.
The cluster is heterogeneous by construction: nodes are interchangeably cycle-accurate RTL or fast C++ functional models, cooperating over the same zero-copy shared memory.
Node 0 is a real CVA6 SoC in Verilator; the other N-1 nodes are millisecond C++ models. Adding 60 model nodes costs essentially nothing — wall-time tracks one RTL core regardless of array size. N=64 runs in about the same time as N=4.
Node 0 publishes every RVFI retire over the DistProc barrier to a private Spike ISS in a separate process. Reference-guided injection lets Spike follow the core's reads of data other nodes wrote — verifying an interacting core, not an isolated one. Zero mismatches.
A recursive-doubling all-reduce of distributed-GEMM shards — log₂N steps, node-to-node, no central gather. A configurable interconnect model prices the collective on real fabrics: 10 GbE through NVLink. The same workload, any network.
One RTL node bound to its ISS over the control-plane barrier; every node meets on the SWMR data plane and exchanges through an explicit butterfly. Heterogeneous, bit-identical.
C[D×D] = A[D×N] @ B[N×D] sharded across N nodes; each node computes a rank-1 outer product, a butterfly all-reduce sums them. Node 0's shard is cycle-accurate and Spike-verified; the reduced result is bit-exact against an independent golden at every scale.
| Nodes | Wall (s) | Spike | C==A@B |
|---|---|---|---|
| 4 | 11.7 | PASS | exact |
| 16 | 11.7 | PASS | exact |
| 64 | 11.7 | PASS | exact |
| Nodes | Wall (s) | Spike | C==A@B |
|---|---|---|---|
| 4 | 18.0 | PASS | exact |
| 16 | 19.2 | PASS | exact |
| 64 | 22.5 | PASS | exact |
Wall-time tracks one RTL node regardless of array size — adding 60 model nodes (4 → 64) is essentially free because they are millisecond C++ models. The slight growth in the bandwidth regime is just Node 0 copying the larger exchange vector over more butterfly steps. Node 0 stays cycle-accurate and Spike-verified the entire time.
Node 0 doesn't have to be a CPU. We dropped a real Gemmini systolic array — 128×128 PEs, 16,384 INT8 MACs — onto the cluster's cycle-accurate node. It computes its GEMM shard on the actual array and joins the same butterfly all-reduce, every byte of the collective riding the accelerator's own DMA — and it's verified the same way the CPU node is: against an independent reference simulator.
Node 0 is a Rocket+Gemmini SoC in Verilator. Its 128×128×128 tile is computed on the real mesh; the entire butterfly all-reduce moves through Gemmini DMA — mvout to publish a shard, mvin-accumulate to reduce a partner's. The host core only ever touches one-word flags.
Every output tile is checked bit-for-bit against the official Gemmini ISA functional simulator (Spike + libgemmini) running the identical shard. It is the accelerator analog of the CPU node's per-instruction-retire Spike lockstep — the real array, held to the reference model.
The other nodes are fast C++ Gemmini models. A faithful-twin check confirms the RTL array and the model emit bit-identical tiles for the same shard — so substituting models for cycle-accurate hardware loses no numeric fidelity. 1 RTL Gemmini + 63 models, exact.
| Nodes | RTL = model (faithful twin) | all-reduce = A@B | RTL tile = Gemmini ISS | all-reduce = Σ ISS tiles |
|---|---|---|---|---|
| 4 | PASS | PASS | PASS | PASS |
| 16 | PASS | PASS | PASS | PASS |
| 64 | PASS | PASS | PASS | PASS |
0 / 16,384 element mismatches on every tile, at every scale. ISS = spike --extension=gemmini, the official Gemmini ISA functional model.
We have successfully bridged the Gemmini accelerator directly to the CVA6 core via a custom shim bridging the standard RISC-V coprocessor interface (CV-X-IF) to the accelerator (RoCC). DistProc's EADDP architecture enables this Dual-Altitude Verification to run simultaneously: the scalar CVA6 core is verified per-instruction-retire against the Spike ISS, while the Gemmini accelerator is verified per-tile against the Spike-gemmini model. Both verifications run simultaneously while the node executes its shard of the 64-node distributed butterfly all-reduce.
The all-reduce wire time is modeled as log₂N × (latency + message/bandwidth). Small tensors are latency-bound and every fabric looks similar; large tensors are bandwidth-bound and the fabrics diverge by 60×. The choice of network goes from "doesn't matter" to "decides everything".
Small shard (128 B/step), same 64 nodes: all four fabrics fall within 3–6 µs. The message size flips the regime — and the sim wall stays ~one RTL node either way.
A specialist engineering consultancy based in Saltsjöbaden, Sweden. DistProc and the CVA6 cluster-verification stack are available as a software IP licence, a consulting engagement, or a full development project.
For most teams, the alternative to this is not a faster tool — it is a $2M+ hardware emulator or commercial EDA licenses (VCS, Xcelium, Questa). DistProc combined with Verilator keeps one node cycle-accurate and ISS-verified inside a full-scale array, on commodity hardware, with a relocatable golden reference — the system-context verification you can actually run, at a fraction of the cost.
DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects: