CVA6 · RVFI LOCKSTEP · TRL-7 · VERIFIED

Cycle-Accurate RISC-V & NPU Verification
at Cluster Scale

One CVA6 core, verified per-instruction-retire against a golden reference simulator, running its shard of a distributed-AI collective inside a 64-node cluster — the other 63 nodes are fast functional models. On a workstation, in seconds.

The Executive Summary
See the Numbers → Request a Demo
1
cycle-accurate CVA6 node
verified per-retire vs Spike
64
nodes in the array
(1 RTL + 63 fast models)
~12s
wall time — flat
from 4 to 64 nodes
0
per-retire Spike
mismatches
60×
fabric spread, bandwidth-
bound all-reduce

Architecture

One Real Core. A Cluster of Models. A Relocatable Reference.

The cluster is heterogeneous by construction: nodes are interchangeably cycle-accurate RTL or fast C++ functional models, cooperating over the same zero-copy shared memory.

01 / HETEROGENEITY

Verify a Tile in a Big Context

Node 0 is a real CVA6 SoC in Verilator; the other N-1 nodes are millisecond C++ models. Adding 60 model nodes costs essentially nothing — wall-time tracks one RTL core regardless of array size. N=64 runs in about the same time as N=4.

02 / REFERENCE DECOUPLING

Per-Retire ISS Lockstep

Node 0 publishes every RVFI retire over the DistProc barrier to a private Spike ISS in a separate process. Reference-guided injection lets Spike follow the core's reads of data other nodes wrote — verifying an interacting core, not an isolated one. Zero mismatches.

03 / REAL COLLECTIVE

Butterfly All-Reduce + Fabric Model

A recursive-doubling all-reduce of distributed-GEMM shards — log₂N steps, node-to-node, no central gather. A configurable interconnect model prices the collective on real fabrics: 10 GbE through NVLink. The same workload, any network.


System Overview

Process Structure

One RTL node bound to its ISS over the control-plane barrier; every node meets on the SWMR data plane and exchanges through an explicit butterfly. Heterogeneous, bit-identical.

CVA6 cluster — process structure Private Spike ISS spike_lockstep per-retire reference + shared-load inject gentpcsm barrier RVFI Node 0 — CVA6 RTL Verilator · HARTID 0 the cycle-accurate DUT, a butterfly participant Node 1 · cmodel Node 2 · cmodel · · · Node N-1 · cmodel fast functional · ~ms each Shared DRAM · /cva6_shared_dram (256 MB, MAP_SHARED) SWMR data plane — rank-1 partials · butterfly exchange (EX/ST) · RESULT per-retire butterfly all-reduce · node i ⇄ i⊕2ˢ · log₂N steps ✓ LOCKSTEP PASS Node 0, per-retire ✓ C == A@B golden · N = 4 / 16 / 64 wall ≈ one RTL node — flat across N · ~12 s on a $1,500 workstation

Measured — CVA6 RV64GC + Spike, i9-14900K

Distributed GEMM, All-Reduced & Verified

C[D×D] = A[D×N] @ B[N×D] sharded across N nodes; each node computes a rank-1 outer product, a butterfly all-reduce sums them. Node 0's shard is cycle-accurate and Spike-verified; the reduced result is bit-exact against an independent golden at every scale.

NodesWall (s)SpikeC==A@B
411.7PASSexact
1611.7PASSexact
6411.7PASSexact
NodesWall (s)SpikeC==A@B
418.0PASSexact
1619.2PASSexact
6422.5PASSexact

Wall-time tracks one RTL node regardless of array size — adding 60 model nodes (4 → 64) is essentially free because they are millisecond C++ models. The slight growth in the bandwidth regime is just Node 0 copying the larger exchange vector over more butterfly steps. Node 0 stays cycle-accurate and Spike-verified the entire time.


Accelerator Integration — Gemmini, verified vs the ISA simulator

The Cycle-Accurate Node Can Be a Real Accelerator

Node 0 doesn't have to be a CPU. We dropped a real Gemmini systolic array — 128×128 PEs, 16,384 INT8 MACs — onto the cluster's cycle-accurate node. It computes its GEMM shard on the actual array and joins the same butterfly all-reduce, every byte of the collective riding the accelerator's own DMA — and it's verified the same way the CPU node is: against an independent reference simulator.

01 / REAL HARDWARE

A Systolic Array, Not a Loop

Node 0 is a Rocket+Gemmini SoC in Verilator. Its 128×128×128 tile is computed on the real mesh; the entire butterfly all-reduce moves through Gemmini DMA — mvout to publish a shard, mvin-accumulate to reduce a partner's. The host core only ever touches one-word flags.

02 / INDEPENDENT REFERENCE

Verified vs the Gemmini ISA Simulator

Every output tile is checked bit-for-bit against the official Gemmini ISA functional simulator (Spike + libgemmini) running the identical shard. It is the accelerator analog of the CPU node's per-instruction-retire Spike lockstep — the real array, held to the reference model.

03 / HETEROGENEOUS

One Real Array, N−1 Models

The other nodes are fast C++ Gemmini models. A faithful-twin check confirms the RTL array and the model emit bit-identical tiles for the same shard — so substituting models for cycle-accurate hardware loses no numeric fidelity. 1 RTL Gemmini + 63 models, exact.

NodesRTL = model
(faithful twin)
all-reduce
= A@B
RTL tile
= Gemmini ISS
all-reduce
= Σ ISS tiles
4PASSPASSPASSPASS
16PASSPASSPASSPASS
64PASSPASSPASSPASS

0 / 16,384 element mismatches on every tile, at every scale.  ISS = spike --extension=gemmini, the official Gemmini ISA functional model.

We have successfully bridged the Gemmini accelerator directly to the CVA6 core via a custom shim bridging the standard RISC-V coprocessor interface (CV-X-IF) to the accelerator (RoCC). DistProc's EADDP architecture enables this Dual-Altitude Verification to run simultaneously: the scalar CVA6 core is verified per-instruction-retire against the Spike ISS, while the Gemmini accelerator is verified per-tile against the Spike-gemmini model. Both verifications run simultaneously while the node executes its shard of the 64-node distributed butterfly all-reduce.


Interconnect Model

Same Collective, Any Network

The all-reduce wire time is modeled as log₂N × (latency + message/bandwidth). Small tensors are latency-bound and every fabric looks similar; large tensors are bandwidth-bound and the fabrics diverge by 60×. The choice of network goes from "doesn't matter" to "decides everything".

64-node all-reduce time by fabric
large shard, 32 KB/step, 6 butterfly steps — bandwidth-bound
FabricLatencyBandwidthAll-reduceRegime
10 GbE10 µs1.25 GB/s217 µsBW-bound
100 GbE5 µs12.5 GB/s45.7 µslat-bound
InfiniBand HDR1 µs25 GB/s13.9 µsBW-bound
NVLink0.5 µs300 GB/s3.7 µslat-bound

Small shard (128 B/step), same 64 nodes: all four fabrics fall within 3–6 µs. The message size flips the regime — and the sim wall stays ~one RTL node either way.


Verified Output

Reproducible from git

// GEMM_DIM=64 ./run_cluster_ring.sh 4 16 64 — mongoose i9-14900K
Shevatech cluster demo — distributed-GEMM butterfly ALL-REDUCE
shard C[64×64], message S=32768 B/step (bandwidth regime)
Node 0 = CVA6 RTL, Spike-verified · Nodes 1..N-1 = C++ models
 
N=4  (1 RTL + 3 mdl)  | 18.0s | Spike PASS | C[0,0]=50    == 50
N=16 (1 RTL + 15 mdl) | 19.2s | Spike PASS | C[0,0]=1768  == 1768
N=64 (1 RTL + 63 mdl) | 22.5s | Spike PASS | C[0,0]=93600 == 93600
 
[SUCCESS] one cycle-accurate CVA6 node, per-retire ISS-verified,
reducing a 64-node distributed GEMM, bit-exact, in ~22s.

Shevatech AB

A specialist engineering consultancy based in Saltsjöbaden, Sweden. DistProc and the CVA6 cluster-verification stack are available as a software IP licence, a consulting engagement, or a full development project.

📧 info@shevatech.com
🌐 www.shevatech.com
🔗 Yehoshua Shoshan on LinkedIn
📍 Saltsjöbaden, Sweden

Accessibility, Not Just Speed

For most teams, the alternative to this is not a faster tool — it is a $2M+ hardware emulator or commercial EDA licenses (VCS, Xcelium, Questa). DistProc combined with Verilator keeps one node cycle-accurate and ISS-verified inside a full-scale array, on commodity hardware, with a relocatable golden reference — the system-context verification you can actually run, at a fraction of the cost.

Email copied to clipboard: info@shevatech.com

Built on Open Standards & Open Source IP

DistProc is an independent middleware platform. Our validation benchmarks are built upon and stress-test the following world-class open-source projects: