149.3 tokens/sec.
Ring-0. CPU-only.

When AI runs at the kernel level — no syscall tax, no OS overhead — the hardware gets to keep its full potential. This is what that looks like on bare metal.

149.3
tok/s
Silicate Zero Server — AMD EPYC 9354P
CPU-only · Ring-0
4–15×
faster
10–35
tok/s
Typical above-kernel
CPU inference

The same model. Same hardware class.
Very different results.

All results use Qwen3-1.7B Q4_K_M unless noted. CPU-only entries run on server-class x86_64 silicon — no GPU acceleration.

System / Runtime
Model
Mode
Tokens/sec
Silicate Zero Server — AMD EPYC 9354P Ring-0 kernel runtime
Qwen3-1.7B Q4_K_M
CPU · Ring-0
149.3
Intel i7-13700K, llama.cpp Q4 Above-kernel (Linux + glibc)
Qwen3-1.7B
CPU · userland
~35
Fast desktop CPU, Ollama / llama.cpp class Above-kernel (macOS / Linux)
Qwen3-1.7B Q4_K_M
CPU · userland
10–15
Consumer GPU, 6 GB VRAM Above-kernel (CUDA + OS driver stack)
Qwen3-1.7B
GPU · userland
~126
Qwen official SGLang, 1 GPU Above-kernel (CUDA stack)
Qwen3-1.7B BF16
GPU · serving
227.8

Assumptions: CPU rows use matching hardware class (server x86_64 EPYC / Xeon) where possible. GPU rows included for context only — GPU acceleration is a future Silicate Zero roadmap item (Stage 16+). BF16 vs Q4 quantisation means the SGLang figure is not apples-to-apples with Q4 CPU results. The 149.3 result is CPU-only, no GPU, no CUDA, no driver stack.

Why Ring-0 wins on raw throughput

Every above-kernel runtime — including llama.cpp, Ollama, vLLM, SGLang — pays a constant tax to the OS. Ring-0 eliminates the toll booth.

Zero syscall overhead

Above-kernel runtimes cross the kernel boundary thousands of times per inference call — memory allocation, I/O, threading. At Ring-0, the AI is the kernel. No boundary to cross.

Single address space

Model weights, KV cache, tokeniser buffers, and device memory live in one flat address space with no privilege-level switches. Cache lines stay hot. NUMA-aware placement is trivial.

Direct hardware telemetry

Ring-0 reads CPU performance counters, thermal sensors, and memory-controller stats natively — no abstraction layer. The scheduler can make real-time decisions based on actual silicon state, not OS-mediated approximations.

Read the full architecture →

How we measured it

Credibility is repeatable. Here's exactly what produced the 149.3 tok/s figure.

Hardware
AMD EPYC 9354P
64 logical CPUs · 128 GB ECC DDR5 · Native NIC · SMP enabled · No GPU · No accelerator
Model
Qwen3-1.7B Q4_K_M
4-bit quantised GGUF format. Same weights used in all CPU comparison entries.
Prompt length
32 tokens output
Short generation window to isolate token-generation throughput from prefill latency.
Measurement
Wall-clock, 10 runs
Median over 10 consecutive runs after a 2-run warm-up. Reported figure is the median, not peak.
Runtime
Silicate Zero kernel v0.1-alpha
Ring-0 inference path. No Linux kernel. No libc. Bare-metal boot → inference loop.
Repeatability
Self-hostable
The Silicate Benchmark tool (see below) lets anyone reproduce this on compatible hardware. Results within ±3% across runs in our lab.
We publish these numbers knowing they invite scrutiny. If you reproduce this benchmark and get a different result, tell us — we'll update this page with your data and methodology.

Benchmark your own hardware

Don't take our word for it. Run it yourself on any AMD EPYC or compatible server.

Free · Self-hosted

Silicate Benchmark

Open-source benchmark harness. Boot from USB, run the standard suite, get a signed results file. Compare against our reference numbers or publish your own.

  • Qwen3, Llama, Mistral model support
  • CPU + memory bandwidth profiling
  • Signed, verifiable output format
  • AGPLv3 — fully auditable
Get on GitHub — Free
Paid · Managed

Silicate Benchmark — Hardware Rental

Dedicated test servers for inference validation. We run the suite on identical Silicate Zero Server hardware, return a verified results report.

  • Verified reference environment on identical Silicate Zero Server hardware
  • Side-by-side comparison with 149.3 tok/s reference baseline
  • PDF report with signed hash chain
  • Dedicated test server access for enterprise customers
Contact for pricing →

The fastest CPU inference is
already open source.

Silicate Zero runs on any AMD EPYC server. No GPU required. No cloud dependency. You own the hardware, the model, and the runtime.