Shevatech AB — Glossary & Cheat-Sheet

Group 1

The big picture

What Shevatech makes, and the headline labels you'll see in the page banners.

DistProc (product)

Shevatech's core product: middleware (see below) that lets many separate programs cooperate on one big computation by sharing memory directly and synchronizing only at the moments the work requires it, instead of on every clock tick. Why it matters: almost all the wasted time in distributed simulation is processes waiting on each other. DistProc cuts that waiting to the hardware minimum.

EADDP (Event-Accurate Distributed Processing)

The principle DistProc is built on: processes exchange state at computational events (when there is genuinely something to share) rather than at fixed time steps. "Event-accurate" = correct at every event, without paying to synchronize in between. Analogy: a team that meets when a decision is actually needed, not every 5 minutes by the clock.

gentpcsm (v2), software IP

The actual ~300-line C11 library that implements the DistProc synchronization — the thing you license and embed. "v2" is the current generation (a leaner memory layout than v1). The name is internal; think of it as "the DistProc engine."

Middleware

Software that sits between your application and the operating system / hardware and provides a reusable service — here, the service is "let these processes cooperate efficiently." You build on top of it; you don't rewrite it.

IPC (Inter-Process Communication)

How separate programs (processes) on a machine pass information to each other. DistProc's IPC is zero-copy (see below) and lock-free — the two properties that make it fast.

Co-simulation (co-sim)

Running two or more different simulators together so they behave as one system — e.g. a detailed model of a chip plus a model of the network around it. DistProc is the glue that keeps the pieces in step.

TRL-7 (Technology Readiness Level 7)

A standard 1–9 maturity scale (originally NASA, now common in engineering). 7 = a working prototype demonstrated in a realistic operating environment: past the lab-toy stage (low TRL), but not yet a hardened commercial product (TRL 9). It signals that the technology is proven and real, and ready for engagement.

Zero-copy

Data is shared by letting every process read the same region of memory, rather than copying it from one process to another. Moving a megabyte costs nothing extra because nothing is moved — only a flag is flipped to say "it's ready."

Lock-free / Deadlock

A lock is when one process makes others wait while it holds a shared resource; a deadlock is when two processes each wait on the other forever and everything freezes. Lock-free means the design never uses such locks, so it is structurally impossible to deadlock — a 14-hour test can't hang at hour 13.

Group 2

Verification — "is the chip correct?"

Chip design is mostly checking. These are the terms for how you prove a design computes the right thing.

RTL (Register-Transfer Level)

The actual hardware design of a chip, written in a hardware description language (Verilog/Chisel) — it describes every register and logic gate. Running RTL in simulation is exact but slow. Analogy: the full architectural blueprint of a building, down to every screw.

Cycle-accurate

A model that reproduces the chip's behavior on every single clock cycle — the gold standard for fidelity, and the slowest to run. The opposite of a "functional" model that only gets the final answer right.

ISS (Instruction Set Simulator)

A fast software model of a processor that executes its instructions and gets the architectural result right (registers, memory) without modeling the cycle-by-cycle hardware. It is the golden reference a real design is checked against. On this site the ISS is Spike (for the CPU) and Spike+libgemmini (for the accelerator). Analogy: a chess engine that knows the rules and the right move, but doesn't simulate the physical clock on the table.

Golden reference / golden model

The trusted, known-correct answer you compare against. If the RTL's output matches the golden reference on a test, the RTL is correct for that test. Here the golden reference is the ISS.

Bit-exact

Two results are identical down to every single bit — not "close," not "within rounding," but exactly equal. The strongest possible match. "0 / 16,384 element mismatches" means every one of 16,384 numbers matched exactly.

Retire / "verified per-instruction-retire"

An instruction retires the moment the processor finishes it and commits its result. Verified per-instruction-retire means: every time the real RTL core completes an instruction, we immediately check that the program counter, the instruction, and the result it wrote all match the golden ISS. This happens for every instruction, not just the final answer, so a single wrong step is caught the instant it happens. Why it matters: it localizes a bug to the exact instruction that caused it, instead of "the final number is wrong somewhere."

Lockstep

Running the design and the golden reference side by side, in step, comparing them continuously. "Per-retire lockstep" = they march one instruction at a time and are compared at each retire.
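
A minimal sketch of a per-retire lockstep comparison, assuming each retire is reported as a (PC, instruction bits, write-back value) record. The `Retire` type and the `first_mismatch` helper are hypothetical names chosen for illustration; they are not the checker's real interface.

```python
from typing import NamedTuple, Optional, Sequence


class Retire(NamedTuple):
    """One retired instruction as seen by the DUT or the ISS (hypothetical layout)."""
    pc: int      # program counter of the instruction
    insn: int    # raw instruction bits
    rd_val: int  # value written back to the destination register


def first_mismatch(dut: Sequence[Retire], iss: Sequence[Retire]) -> Optional[int]:
    """Return the index of the first retire where DUT and ISS disagree, else None."""
    for i, (d, g) in enumerate(zip(dut, iss)):
        if d != g:
            return i  # the bug is localized to this exact instruction
    if len(dut) != len(iss):
        return min(len(dut), len(iss))  # one side retired more instructions
    return None


golden = [
    Retire(0x80000000, 0x00100093, 1),  # addi x1, x0, 1
    Retire(0x80000004, 0x00208113, 3),  # addi x2, x1, 2
]
faulty = [golden[0], golden[1]._replace(rd_val=2)]  # inject a wrong write-back value

print(first_mismatch(golden, golden))  # None
print(first_mismatch(faulty, golden))  # 1
```

The second call reports index 1, so the failure points at the second instruction rather than at a wrong final result.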

RVFI (RISC-V Formal Interface)

A standard set of signals a RISC-V core exposes that report, for each retired instruction, exactly what it did (PC, instruction bits, register write, memory access). It's the "tap" the lockstep checker listens to. The "Formal" in the name comes from the riscv-formal verification framework that defined the interface; the same signals serve simulation-based checkers equally well.

DUT (Device Under Test)

The thing being verified — here, the one cycle-accurate RTL node.

Functional model / C++ model

A fast program that produces the correct numerical result of a component without modeling its hardware timing. Milliseconds instead of minutes. In the cluster, the non-RTL nodes are functional models.

Verilator (open source)

The free, open-source tool that turns RTL into a fast C++ simulator you can run on a normal CPU. It's how we run the cycle-accurate node without a $2M hardware emulator.

Dual-altitude verification (our term)

Our phrase for checking a chip at two levels at once: the CPU is verified at the fine grain (every instruction retire vs the ISS), and the attached AI accelerator is verified at the coarse grain (every output tile vs its own reference model) — both simultaneously, while the node does real work in the cluster. "Altitude" = zoom level: one view from 10 metres (every instruction), one from 10 km (every result tile).

Group 3

RISC-V & CPUs

The processor pieces. RISC-V is the open instruction set everything here is built on.

RISC-V (open standard)

A free, open instruction set architecture — the published "vocabulary" of commands a processor understands (like x86 or ARM, but open and royalty-free). Anyone can build a chip that runs it.

ISA (Instruction Set Architecture)

The contract between software and hardware: the list of instructions, registers, and rules a processor implements. RISC-V is an ISA; CVA6 is one core design that implements it.

CVA6 (open core)

A real, open-source RISC-V processor design (from the OpenHW Group) — a complete, capable CPU you can simulate or build into a chip. It's the cycle-accurate CPU node in our cluster.

Spike (open ISS)

The official RISC-V ISS (golden reference simulator). If Spike and the RTL core ever disagree on an instruction, the RTL has a bug. With --extension=gemmini it also models the Gemmini accelerator.

Hart / HARTID

RISC-V's word for a hardware thread — essentially "one CPU core." HARTID is its number. "HARTID 0" = node 0's core.

SoC (System on Chip)

A complete computer on a single chip — CPU, memory interfaces, and accelerators together. Each cluster node simulates a full SoC, not just a bare core.

RoCC (Rocket Custom Coprocessor interface)

The original way the Gemmini accelerator attaches to a RISC-V core and receives commands. It's accelerator-specific plumbing.

CV-X-IF (CORE-V eXtension Interface)

A standard open interface for bolting custom instructions/accelerators onto a RISC-V core (CVA6 speaks it). We built a small "shim" that translates between CV-X-IF and Gemmini's RoCC, so a standards-based CPU can drive the accelerator. Why it matters: using the standard interface instead of glue makes the approach reusable, not a one-off hack.

Shim

A thin adapter layer that makes two interfaces that weren't designed for each other talk — here, CV-X-IF ⇄ RoCC.

Group 4

AI accelerators (Gemmini, NVDLA)

The specialized hardware that does the heavy AI math, and the words for how it moves data.

NPU / AI accelerator (Neural Processing Unit)

A chip block built to do the one operation neural networks need most — multiplying big grids of numbers — far faster and more efficiently than a general CPU.

Gemmini (open accelerator)

An open-source AI accelerator (from UC Berkeley's Chipyard project) — a systolic array of 128×128 = 16,384 multiply units. It's the real accelerator we put on a cluster node.

Systolic array

A grid of tiny identical compute cells where data flows rhythmically from cell to cell, each doing one multiply-add as it passes. Extremely efficient for matrix math because the data, not control logic, does the moving. Analogy: a bucket brigade for arithmetic.

PE (Processing Element)

One cell of the systolic array — a single multiply-accumulate unit. Gemmini at "DIM=128" has 128×128 of them.

MAC (Multiply-ACcumulate)

The fundamental neural-net operation: multiply two numbers and add to a running total (acc += a × b). "16,384 INT8 MACs" = 16,384 of these happening in parallel on 8-bit integers.

Tile

A fixed-size block of a big matrix (here 128×128) that the accelerator processes in one go. Big matrices are chopped into tiles because the hardware (and its small on-chip memory) handles one tile at a time. "Verified per-tile" = each 128×128 output block is checked against the reference. Analogy: painting a mural one square panel at a time.

GEMM (GEneral Matrix Multiply)

Multiplying two matrices — the workhorse operation of nearly all AI. "Distributed GEMM" = one big matrix multiply spread across many nodes.

Scratchpad & Accumulator

The accelerator's small, fast on-chip memories. The scratchpad holds the input tiles; the accumulator holds the running sums of results. Both are tiny (256 KB), which is exactly why big matrices must be tiled and distributed.

mvin / mvout

Gemmini's commands to move a tile in from memory to the scratchpad, and to move a result out to memory. The cluster's data exchange rides entirely on these (publish a result with mvout, pull in a partner's with mvin).

DMA (Direct Memory Access)

Hardware that moves blocks of data to/from memory without the CPU doing it byte by byte. Gemmini's DMA does all the heavy data movement; the CPU only flips small "ready" flags.

L2 cache · coherent vs non-coherent

A cache is fast memory that keeps recent data close to the processor. Coherent means the hardware guarantees every unit sees the same up-to-date copy: safe, but each transfer pays a coordination tax. Non-coherent (TLBroadcast) skips that tax and writes straight to shared memory. That is much faster when you don't need the sharing guarantee, which is why it cut our accelerator co-sim time by about 2.4×.

NVDLA (NVIDIA Deep Learning Accelerator)

NVIDIA's open-source production AI accelerator design. We wired DistProc directly into its real RTL to prove the middleware is robust enough to drive Tier-1 enterprise silicon (millions of gates) without hanging.

Group 5

Clusters & networks

How many nodes split a job and recombine the answer — and how the network choice shapes the cost.

Node

One participant in the cluster — here, one simulated SoC (either the cycle-accurate RTL one, or a fast model). "64-node cluster" = 64 such participants.

Shard

One node's slice of a big computation. Each node computes its shard of a matrix multiply, then the shards are combined.

All-reduce

A collective operation where every node contributes a value and they're all combined (e.g. summed), with every node ending up holding the combined result. The backbone of distributed AI training.

Butterfly all-reduce / "shared-memory butterfly"

An efficient pattern for doing an all-reduce: instead of everyone sending to one central node, nodes pair up and exchange-and-combine in rounds (round 1: neighbors; round 2: distance-2 partners; …). After log₂N rounds (N a power of two) every node holds the full sum; for our 64 nodes that is 6 rounds. "Shared-memory" = those exchanges happen by reading/writing shared memory (DistProc), not over a network. The crossing pattern of who-talks-to-whom looks like butterfly wings. Why it matters: log₂N rounds instead of N−1 (6 rounds for 64 nodes, not 63), and no central bottleneck.

Recursive doubling

The exact scheme behind the butterfly: in step s, node i exchanges with the node 2ˢ away (its partner is i XOR 2ˢ). Step 0 pairs distance-1, step 1 distance-2, step 2 distance-4. The reach doubles each round, covering all N nodes in log₂N steps when N is a power of two.
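
The pairing rule can be simulated in a few lines. This is a sketch of recursive doubling in general, not of DistProc's code: it assumes a power-of-two node count, models each node's value as one list entry, and uses `butterfly_allreduce` as a hypothetical name.

```python
def butterfly_allreduce(values):
    """Simulate a sum all-reduce over len(values) nodes by recursive doubling.

    Assumes the node count is a power of two.
    Returns (final per-node values, number of rounds).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    vals = list(values)
    rounds = 0
    step = 1
    while step < n:
        # Node i pairs with node i XOR step; both end up holding the pair's sum.
        vals = [vals[i] + vals[i ^ step] for i in range(n)]
        step *= 2  # the reach doubles each round
        rounds += 1
    return vals, rounds


result, rounds = butterfly_allreduce(range(64))
print(rounds)       # 6
print(set(result))  # {2016}
```

Every one of the 64 nodes ends with 0 + 1 + … + 63 = 2016 after 6 rounds, and no node ever acts as a central collector.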

Rank-1 outer product

A specific, cheap way to build a piece of a matrix product: multiply one column by one row. Each node computes a rank-1 partial of the result, and the all-reduce sums them into the full matrix. (Detail-level term; the takeaway is "each node makes a partial answer, they get summed.")
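
A small worked check that summing rank-1 partials gives the full product. The matrices and the helper names `sum_of_rank1` and `matmul` are illustrative assumptions; in the cluster, the loop over k would be spread across nodes and the running sum would be the all-reduce.

```python
def matmul(a, b):
    """Reference matrix product using the textbook row-by-column definition."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sum_of_rank1(a, b):
    """Build A @ B as a sum of outer products: column k of A times row k of B."""
    rows, cols = len(a), len(b[0])
    total = [[0] * cols for _ in range(rows)]
    for k in range(len(b)):  # one k per node in the distributed version
        col = [a[i][k] for i in range(rows)]
        row = b[k]
        for i in range(rows):
            for j in range(cols):
                total[i][j] += col[i] * row[j]  # rank-1 partial, summed in place
    return total


A = [[1, 2, 3], [4, 5, 6]]       # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]  # 3 x 2
print(sum_of_rank1(A, B))  # [[58, 64], [139, 154]]
```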

Fabric / interconnect

The network that connects nodes — e.g. 10 GbE (ordinary Ethernet), InfiniBand, or NVLink (NVIDIA's GPU link). Our model can price the same workload on any of them.

Latency-bound vs bandwidth-bound

Latency = the fixed delay to send any message (how long until the first byte arrives). Bandwidth = how fast bytes flow once started. Small messages are latency-bound (the delay dominates — every network looks similar); large messages are bandwidth-bound (the pipe width dominates — fast networks pull 60× ahead). Message size flips which one decides your speed.
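
The flip can be seen with the usual first-order cost model, time = latency + size / bandwidth. In the sketch below, 1.25 GB/s is the 10 Gb/s line rate of 10 GbE; every other number is a placeholder chosen for illustration, not a measurement, and the function names are hypothetical.

```python
def transfer_time(size_bytes, latency_s, bytes_per_s):
    """First-order cost of one message: fixed latency plus size over bandwidth."""
    return latency_s + size_bytes / bytes_per_s


def crossover_bytes(latency_s, bytes_per_s):
    """Message size at which the latency term equals the bandwidth term."""
    return latency_s * bytes_per_s


# Illustrative fabrics only: similar latency, very different bandwidth.
SLOW = {"latency_s": 3e-6, "bytes_per_s": 1.25e9}
FAST = {"latency_s": 2e-6, "bytes_per_s": 50e9}

small_ratio = transfer_time(64, **SLOW) / transfer_time(64, **FAST)
large_ratio = transfer_time(64 * 1024 * 1024, **SLOW) / transfer_time(64 * 1024 * 1024, **FAST)

print(f"64 B message:   slow/fast time ratio = {small_ratio:.1f}")
print(f"64 MiB message: slow/fast time ratio = {large_ratio:.1f}")
```

With these placeholder values the two fabrics differ by less than 2× for the small message, while the large message approaches the 40× bandwidth ratio.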

Group 6

DistProc internals

The mechanics of the synchronization layer — what's on the Performance page.

Shared memory (POSIX / MAP_SHARED)

A region of RAM that several processes map into their own address space and all see at once. The standard Linux mechanism for it is POSIX shared memory with the MAP_SHARED flag. It's how DistProc achieves zero-copy.
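
As an illustration of the idea (not of DistProc's C implementation), Python's standard `multiprocessing.shared_memory` module, which on Linux is backed by POSIX shared memory, can show two handles seeing the same bytes. Both handles live in one process here for brevity; normally the second attach would happen in another process. The payload and the flag offset are arbitrary choices.

```python
from multiprocessing import shared_memory

# Writer side: create a named shared-memory region.
region = shared_memory.SharedMemory(create=True, size=16)
# Reader side: attach to the same region by name.
view = shared_memory.SharedMemory(name=region.name)

region.buf[0:4] = b"tile"  # the writer fills the payload in place
region.buf[4] = 1          # then flips a one-byte "ready" flag

ready = view.buf[4]              # the reader sees the flag without any transfer
payload = bytes(view.buf[0:4])   # bytes() copies here only so the result can be printed
print(ready, payload)            # 1 b'tile'

view.close()
region.close()
region.unlink()  # remove the region once every user is done with it
```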

SWMR (Single-Writer / Multiple-Reader)

A discipline where each piece of shared data has exactly one writer but many readers. Because only one process ever writes a given slot, there's no write-collision to lock against — this is what makes the design lock-free and simple to reason about.
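
A toy model of the ownership rule, assuming one slot per node. `SwmrBoard` is a hypothetical class for illustration only; a real shared-memory implementation would also need atomic operations and memory ordering, which this sketch does not attempt to show.

```python
class SwmrBoard:
    """One slot per node: only the owner may write its slot; anyone may read any slot."""

    def __init__(self, n_nodes):
        self._values = [None] * n_nodes
        self._seq = [0] * n_nodes  # bumped by the owner after each write

    def publish(self, writer, slot, value):
        if writer != slot:
            raise PermissionError(f"node {writer} does not own slot {slot}")
        self._values[slot] = value  # write the payload first
        self._seq[slot] += 1        # then announce that a new version is ready

    def read(self, slot):
        """Return (sequence number, value); readers never modify the slot."""
        return self._seq[slot], self._values[slot]


board = SwmrBoard(4)
board.publish(writer=2, slot=2, value="partial sum")
print(board.read(2))  # (1, 'partial sum')
```

Because two writers can never target the same slot, there is no write collision that a lock would have to arbitrate.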

Barrier / synchronization event

A meeting point where participants wait until everyone has reached it, then all proceed. DistProc's barrier fires only at computational events (not every clock tick) — that's the whole efficiency idea. "43 sync events" means the 8-core job only had to meet 43 times.
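
The "wait until everyone arrives" behaviour can be sketched with Python's standard `threading.Barrier`. The worker count and the number of sync events are arbitrary illustrative values, and this stands in for the concept only, not for DistProc's barrier.

```python
import threading

N_WORKERS = 4
SYNC_EVENTS = 3  # illustrative: workers meet only when there is state to share
barrier = threading.Barrier(N_WORKERS)
log = []         # list.append is atomic in CPython, so the threads can share it


def worker(rank):
    for event in range(SYNC_EVENTS):
        # Independent compute would run here, with no synchronization at all.
        log.append((event, rank))
        barrier.wait()  # block until all workers have reached this event


threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No worker started event k+1 before every worker had finished event k.
events_in_order = [event for event, _ in log]
print(events_in_order == sorted(events_in_order))  # True
```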

Spin vs Futex (the two backends)

Two answers to "what should a core do while waiting at the barrier?" Spin = keep actively checking the flag (lowest latency, but the core stays busy/hot). Futex = go to sleep in the OS and get woken on arrival (slightly higher per-wait cost, but a waiting core drops from ~28% to ~2% CPU). Chosen by a setting — no recompile. Spin = standing at the door watching; futex = sitting down and being tapped when it's time.
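
A rough analogy in Python, which has no direct futex call: a busy-poll loop plays the role of the spin backend, and a blocking `threading.Event.wait()` plays the role of the sleeping backend. The `wait_ready` function and its `backend` values are hypothetical and only illustrate choosing the strategy at run time.

```python
import threading


def wait_ready(ready, backend):
    """Wait until `ready` is set; return how many times the flag was polled."""
    if backend == "spin":
        polls = 0
        while not ready.is_set():  # busy-poll: reacts quickly, keeps the core busy
            polls += 1
        return polls
    if backend == "sleep":
        ready.wait()  # block in the OS until another thread signals
        return 0
    raise ValueError(f"unknown backend: {backend}")


polls = {}
for backend in ("spin", "sleep"):
    ready = threading.Event()
    timer = threading.Timer(0.05, ready.set)  # another thread signals arrival 50 ms later
    timer.start()
    polls[backend] = wait_ready(ready, backend)
    timer.join()

print(polls["sleep"])     # 0
print(polls["spin"] > 0)  # True
```

The spinning waiter burns CPU on repeated checks while the sleeping waiter does no polling at all, which is the trade-off behind the two backends.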

Microsecond (µs) · sub-µs

A microsecond is one millionth of a second. "Sub-microsecond synchronization tax" means each meeting costs less than a millionth of a second — negligible next to the actual compute.

Speedup & linear scaling

Speedup = how many times faster N cores finish vs one core. Linear/predictable scaling = adding a core adds a small, constant amount of overhead — no sudden "cliff" where performance collapses. "14.8× across 15 cores" is near-perfect.
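
One way to read a speedup figure is as parallel efficiency, speedup divided by core count. The helper name below is an assumption; the numbers are the ones quoted above.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved: 1.0 means perfect."""
    return speedup / cores


print(f"{parallel_efficiency(14.8, 15):.1%}")  # 98.7%
```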

cpu_burn

A test tool that injects a controlled, adjustable amount of fake "work" per cycle, to stand in for a customer's real compute when sizing and dry-running a deployment before their code exists.

Group 7

Numbers & matrices

The data types and notation that show up in the results.

INT8 / int32

Integer number formats of a given width. INT8 = an 8-bit integer (−128…127) — AI models use these because they're small and fast. int32 = 32-bit, used to hold the sums of many INT8 products without overflowing. "int8 input, int32 accumulator" is the standard accelerator pattern.
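
The headroom argument is simple arithmetic, sketched here with Python integers. The constant names are illustrative.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MAX = 2**31 - 1

# The largest possible INT8 x INT8 product already exceeds the INT8 range.
largest_product = INT8_MIN * INT8_MIN  # (-128) * (-128)

# Number of worst-case products a signed 32-bit accumulator can sum without overflow.
safe_terms = INT32_MAX // largest_product

print(largest_product)         # 16384
print(safe_terms)              # 131071
print(128 * largest_product)   # 2097152, the worst case for a 128-long dot product
```

A single product does not fit in 8 bits, while a 128-term sum uses only a tiny fraction of the int32 range.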

Quantization

Converting a model's high-precision numbers (32-bit floats) into small integers (INT8) so it runs far faster on the accelerator, with carefully managed accuracy loss. The TVM BERT demo uses industry-standard quantization.
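
A minimal sketch of one common scheme, symmetric per-tensor quantization. This is an assumption for illustration: the glossary does not say which scheme the demo uses, and the function names and weights below are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.25, 0.0]
q, scale = quantize_int8(weights)
print(q)  # [50, -127, 25, 0]

# Rounding error per element is at most half a quantization step.
errors = [abs(w - d) for w, d in zip(weights, dequantize(q, scale))]
print(max(errors) <= scale / 2 + 1e-12)  # True
```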

C = A @ B notation

The @ is matrix multiplication (Python's symbol for it). "C[D×D] = A[D×N] @ B[N×D]" just says matrix C (D rows × D cols) is the product of A and B with those shapes.

512×512, 128×128, etc.

Matrix dimensions (rows × columns). The point of several demos: a single 128×128 accelerator cannot hold a 512×512 matrix, so the job must be split across cooperating nodes — which is what DistProc enables.
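
Tiling can be sketched on small matrices. The code below assumes square matrices whose size is a multiple of the tile size, and `tiled_matmul` is a hypothetical helper; the last two lines apply the same counting to the 512×512 on 128×128 case.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices one tile-sized block at a time.

    Returns (product, number of tile-by-tile products performed).
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    tile_products = 0
    for i0 in range(0, n, tile):          # output tile row
        for j0 in range(0, n, tile):      # output tile column
            for k0 in range(0, n, tile):  # accumulate partial products into the tile
                tile_products += 1
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        c[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k0 + tile))
    return c, tile_products


A = [[4 * i + j for j in range(4)] for i in range(4)]
I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
C, count = tiled_matmul(A, I, tile=2)
print(C == A, count)  # True 8

tiles_per_side = 512 // 128
print(tiles_per_side ** 2)  # 16 output tiles
print(tiles_per_side ** 3)  # 64 tile-sized multiplies in total
```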

Group 8

Machine-learning & tools

The AI workloads and open-source projects the demos build on.

BERT / transformer

BERT is a well-known language AI model; "transformer" is the neural-network architecture it (and GPT-style models) use. Running a real BERT layer is a meaningful, recognizable AI workload — not a toy.

Apache TVM (open source)

An open-source ML compiler: it takes a trained model and turns it into optimized, quantized code for a given piece of hardware. We use it, as the industry-standard tool, to prepare the BERT weights for the Gemmini accelerator.

HuggingFace weights

HuggingFace is the main public repository of trained AI models. "Real HuggingFace weights" = we ran an actual published model, not random numbers — evidence the pipeline handles real workloads.

Chipyard (open source)

UC Berkeley's framework for assembling RISC-V chips (it bundles Gemmini, the Rocket core, and more). Our accelerator SoCs are generated with it.