Benchmark

149.3 tokens/sec.
Ring-0. CPU-only.

When AI runs at the kernel level — no syscall tax, no OS overhead — the hardware gets to keep its full potential. This is what that looks like on bare metal.

149.3 tok/s: Silicate Zero Server, AMD EPYC 9354P (CPU-only · Ring-0)

4–15× faster

10–35 tok/s: typical above-kernel CPU inference

Comparison

The same model. Same hardware class.
Very different results.

All results use Qwen3-1.7B Q4_K_M unless noted. CPU-only entries use no GPU acceleration. The Silicate Zero entry runs on server-class x86_64 silicon; the hardware behind each comparison entry is named in its row.

| System / Runtime | Model | Mode | Tokens/sec |
| Silicate Zero Server, AMD EPYC 9354P (Ring-0 kernel runtime) | Qwen3-1.7B Q4_K_M | CPU · Ring-0 | 149.3 |
| Intel i7-13700K, llama.cpp Q4 (above-kernel, Linux + glibc) | Qwen3-1.7B | CPU · userland | ~35 |
| Fast desktop CPU, Ollama / llama.cpp class (above-kernel, macOS / Linux) | Qwen3-1.7B Q4_K_M | CPU · userland | 10–15 |
| Consumer GPU, 6 GB VRAM (above-kernel, CUDA + OS driver stack) | Qwen3-1.7B | GPU · userland | ~126 |
| Qwen official SGLang, 1 GPU (above-kernel, CUDA stack) | Qwen3-1.7B BF16 | GPU · serving | 227.8 |

Assumptions: CPU rows use matching hardware class (server x86_64 EPYC / Xeon) where possible. GPU rows included for context only — GPU acceleration is a future Silicate Zero roadmap item (Stage 16+). BF16 vs Q4 quantisation means the SGLang figure is not apples-to-apples with Q4 CPU results. The 149.3 result is CPU-only, no GPU, no CUDA, no driver stack.
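The "4–15× faster" headline follows directly from the figures above. As a quick check, the sketch below divides the 149.3 tok/s result by the two ends of the 10–35 tok/s userland CPU range and converts the headline throughput into average time per token. The function names are illustrative and not part of any Silicate tool.

```python
# Throughput figures copied from this page (tokens/sec).
SILICATE_RING0 = 149.3
USERLAND_CPU_RANGE = (10.0, 35.0)  # "typical above-kernel CPU inference"


def speedup(reference: float, baseline: float) -> float:
    """Return how many times faster `reference` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return reference / baseline


def ms_per_token(tokens_per_sec: float) -> float:
    """Convert throughput into average milliseconds per generated token."""
    return 1000.0 / tokens_per_sec


low = speedup(SILICATE_RING0, USERLAND_CPU_RANGE[1])   # against the fastest userland CPU figure
high = speedup(SILICATE_RING0, USERLAND_CPU_RANGE[0])  # against the slowest userland CPU figure

print(f"speedup range: {low:.1f}x to {high:.1f}x")  # speedup range: 4.3x to 14.9x
print(f"time per token: {ms_per_token(SILICATE_RING0):.2f} ms")  # time per token: 6.70 ms
```

The lower bound of roughly 4.3× compares against the ~35 tok/s i7-13700K row, and the upper bound of roughly 14.9× against the 10 tok/s end of the desktop range.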

Architecture advantage

Why Ring-0 wins on raw throughput

Every above-kernel runtime — including llama.cpp, Ollama, vLLM, SGLang — pays a constant tax to the OS. Ring-0 eliminates the toll booth.

Zero syscall overhead

Above-kernel runtimes cross the kernel boundary thousands of times per inference call for memory allocation, I/O, and threading. At Ring-0, the AI is the kernel, so there is no boundary to cross.
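To get a feel for the size of this boundary cost on a conventional OS, a rough userland estimate can be sketched as follows. This is not how Silicate measures anything: it times a cheap call that enters the kernel on Linux (`os.getpid`), subtracts the cost of an empty Python call, and scales by an assumed crossing count. The figure of 5,000 crossings is a placeholder, since the page only says "thousands", and the result still includes some interpreter overhead.

```python
import os
import time


def mean_call_cost_ns(fn, iterations: int = 200_000) -> float:
    """Average wall-clock cost of calling `fn` once, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def noop() -> None:
    """Pure userland call, used to estimate interpreter call overhead."""


# Assumption: os.getpid() enters the kernel on Linux; other platforms may differ.
syscall_ns = mean_call_cost_ns(os.getpid)
baseline_ns = mean_call_cost_ns(noop)
crossing_ns = max(syscall_ns - baseline_ns, 0.0)

# Hypothetical count: the page gives no exact number of crossings.
ASSUMED_CROSSINGS = 5_000
total_ms = crossing_ns * ASSUMED_CROSSINGS / 1e6

print(f"approx. cost per crossing: {crossing_ns:.0f} ns")
print(f"approx. cost of {ASSUMED_CROSSINGS} crossings: {total_ms:.3f} ms")
```

Real inference syscalls (page faults, futex waits, file reads) are heavier than `getpid`, so this gives a floor for the per-crossing cost rather than a realistic total.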

Single address space

Model weights, KV cache, tokeniser buffers, and device memory live in one flat address space with no privilege-level switches. Cache lines stay hot. NUMA-aware placement is trivial.

Direct hardware telemetry

Ring-0 reads CPU performance counters, thermal sensors, and memory-controller stats natively — no abstraction layer. The scheduler can make real-time decisions based on actual silicon state, not OS-mediated approximations.

Read the full architecture →

Methodology

How we measured it

Credibility is repeatable. Here's exactly what produced the 149.3 tok/s figure.

Hardware

AMD EPYC 9354P

64 logical CPUs · 128 GB ECC DDR5 · Native NIC · SMP enabled · No GPU · No accelerator

Model

Qwen3-1.7B Q4_K_M

4-bit quantised GGUF format. Same weights used in all CPU comparison entries.

Output length

32 output tokens

Short generation window to isolate token-generation throughput from prefill latency.

Measurement

Wall-clock, 10 runs

Median over 10 consecutive runs after a 2-run warm-up. Reported figure is the median, not peak.
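The measurement procedure described here (2 warm-up runs, 10 timed runs, median of wall-clock throughput, 32 output tokens) can be sketched as a small harness. This is a minimal illustration, not the Silicate Benchmark tool: `measure_tokens_per_sec` and `fake_generate` are hypothetical names, and the stand-in generator only sleeps instead of running a model.

```python
import statistics
import time
from typing import Callable


def measure_tokens_per_sec(
    generate: Callable[[int], int],
    output_tokens: int = 32,
    warmup_runs: int = 2,
    timed_runs: int = 10,
) -> float:
    """Median wall-clock throughput over `timed_runs`, after discarded warm-ups."""
    # Warm-up runs populate caches and are not recorded.
    for _ in range(warmup_runs):
        generate(output_tokens)

    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        produced = generate(output_tokens)  # returns the number of tokens actually produced
        elapsed = time.perf_counter() - start
        samples.append(produced / elapsed)

    # Report the median, not the peak, to damp outliers.
    return statistics.median(samples)


def fake_generate(n_tokens: int) -> int:
    """Hypothetical stand-in for a model: sleeps about 1 ms per token."""
    time.sleep(0.001 * n_tokens)
    return n_tokens


if __name__ == "__main__":
    print(f"{measure_tokens_per_sec(fake_generate):.1f} tok/s")
```

Replacing `fake_generate` with a call into a real runtime keeps the same warm-up and median logic, which makes results from different runtimes easier to compare.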

Runtime

Silicate Zero kernel v0.1-alpha

Ring-0 inference path. No Linux kernel. No libc. Bare-metal boot → inference loop.

Repeatability

Self-hostable

The Silicate Benchmark tool (see below) lets anyone reproduce this on compatible hardware. In our lab, results stay within ±3% across runs.
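When reproducing the benchmark, a ±3% spread can be checked with a few lines of code. The sketch below assumes the tolerance is measured relative to the median of the runs, which the page does not specify, and the sample values are invented purely for illustration.

```python
import statistics


def within_tolerance(samples: list[float], tolerance: float = 0.03) -> bool:
    """True if every sample lies within `tolerance` (as a fraction) of the median."""
    if not samples:
        raise ValueError("need at least one sample")
    center = statistics.median(samples)
    return all(abs(s - center) <= tolerance * center for s in samples)


# Hypothetical per-run throughputs in tok/s, invented for illustration only.
example_runs = [97.5, 99.0, 100.0, 101.0, 102.5]
print(within_tolerance(example_runs))  # True: the largest deviation is 2.5% of the median
```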

We publish these numbers knowing they invite scrutiny. If you reproduce this benchmark and get a different result, tell us — we'll update this page with your data and methodology.

Silicate Benchmark

Benchmark your own hardware

Don't take our word for it. Run it yourself on any AMD EPYC or compatible server.

Free · Self-hosted

Silicate Benchmark

Open-source benchmark harness. Boot from USB, run the standard suite, get a signed results file. Compare against our reference numbers or publish your own.

Qwen3, Llama, Mistral model support
CPU + memory bandwidth profiling
Signed, verifiable output format
AGPLv3 — fully auditable

Get on GitHub — Free

Paid · Managed

Silicate Benchmark — Hardware Rental

Dedicated test servers for inference validation. We run the suite on identical Silicate Zero Server hardware and return a verified results report.

Verified reference environment on identical Silicate Zero Server hardware
Side-by-side comparison with 149.3 tok/s reference baseline
PDF report with signed hash chain
Dedicated test server access for enterprise customers
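For readers unfamiliar with the term, a hash chain links each result record to the one before it, so editing or reordering any record invalidates every later hash. The sketch below shows the general idea with SHA-256. It is an assumption about how such a chain could work, not the actual Silicate report format, and the field names and `GENESIS` value are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain


def _link(prev_hash: str, record: dict) -> str:
    """Hash the previous link together with a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_records(records: list[dict]) -> list[dict]:
    """Wrap each record with the previous hash and its own hash."""
    chained, prev = [], GENESIS
    for record in records:
        digest = _link(prev, record)
        chained.append({"record": record, "prev_hash": prev, "hash": digest})
        prev = digest
    return chained


def verify_chain(chained: list[dict]) -> bool:
    """Recompute every link; an edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in chained:
        if entry["prev_hash"] != prev or entry["hash"] != _link(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


# Hypothetical run records, invented for illustration only.
runs = [{"run": 1, "tok_per_sec": 100.0}, {"run": 2, "tok_per_sec": 101.0}]
report = chain_records(runs)
print(verify_chain(report))  # True

report[0]["record"]["tok_per_sec"] = 999.0  # tamper with the first record
print(verify_chain(report))  # False
```

A hash chain alone only detects tampering. Proving who produced the report would additionally need a signature over the final hash, which this sketch does not cover.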

Contact for pricing →

The fastest CPU inference is
already open source.

Silicate Zero runs on any AMD EPYC server. No GPU required. No cloud dependency. You own the hardware, the model, and the runtime.

Get Silicate Zero · Free
Star on GitHub
