TECHNICAL BRIEFING

Disaggregated All-Flash Storage
meets NVIDIA’s inference paradigm

ZK-Storage WS5000 × NVIDIA Dynamo / KVBM / NIXL / GPUDirect Storage — same paradigm, different layers: an objective, verifiable, non-disparaging comparison and a complementary positioning.

300 GB/s aggregate bandwidth (S9)

85× peak inference-load speedup (S38)

90.9% median reduction, 7 metrics

WS5000 in mass production · sovereign

AGENDA

How this briefing flows

Shared bottleneck first, then each side’s tech, then an objective comparison and complementary positioning

No.	Module	Takeaway
01	Shared bottleneck	GPUs are starved by slow IO — NVIDIA’s judgment and ours
02	ZK-Storage stack	Disaggregation + four core technologies
03	Mapping to NVIDIA	Disaggregation / KV-cache offload / GPUDirect / data path
04	Comparison table	Row by row, with sources and conventions
05	Complementary & validation	Third-party benchmark + sovereign positioning

THE BOTTLENECK

Shared view: faster GPUs are starved by slow IO

NVIDIA (GPUDirect): “As AI, HPC, and data analytics datasets continue to increase in size, the time spent loading data begins to impact application performance. Fast GPUs are increasingly starved by slow IO.”

NVIDIA Developer · GPUDirect

This matches our view: in the LLM era the real bottleneck is on the data-supply side — model load, checkpoint I/O and KV-cache scheduling — not raw compute alone.

<60%

avg. compute-center utilization

large headroom (S11)

30–50%

effective GPU util. when IO-bound

research (S4)

2–3×

uplift via storage acceleration

~74%

peak KV-cache offload savings

online (S5)

ZK-STORAGE STACK

The ZK-Storage stack

Disaggregation at the core — turning storage from a bit player into a compute amplifier.

ARCHITECTURE

Disaggregation: compute pool ⟷ lossless fabric ⟷ all-flash pool

Compute pool

GPU / NPU nodes

Huawei Ascend Atlas 910B

Training / inference (transparent)

Lossless fabric

NVMe-oF over RDMA / RoCE

GPUDirect path

2×200GbE line-rate

All-flash pool

EBOF flash array

CPFS parallel file system

KV-cache tiered scheduling

Core idea

Decouple storage media from compute into an independent all-flash pool, linked to GPUs over a lossless fabric; compute and capacity scale independently, with no change to upper-layer frameworks.
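To put the fabric figure from the architecture above into bytes, here is a minimal arithmetic sketch. It assumes the 2×200GbE line rate applies per endpoint and ignores protocol overhead; both are assumptions for illustration, not statements from the vendor spec. The function name is hypothetical.

```python
def line_rate_gbytes_per_s(ports: int, gbit_per_port: float) -> float:
    """Raw line rate in GB/s (decimal units), ignoring protocol overhead."""
    return ports * gbit_per_port / 8  # 8 bits per byte


# Assumption: the 2x200GbE figure is per endpoint.
per_endpoint = line_rate_gbytes_per_s(2, 200)
# The NFS baseline in the benchmark section runs over 10GbE.
nfs_baseline = line_rate_gbytes_per_s(1, 10)
# Vendor-spec aggregate bandwidth (S9), in GB/s.
aggregate_spec = 300.0

print(per_endpoint)                   # 50.0
print(nfs_baseline)                   # 1.25
print(aggregate_spec / per_endpoint)  # 6.0
```

Under these assumptions, the 300 GB/s aggregate figure corresponds to at least six such endpoints running at full line rate; real deployments would see less per link once protocol overhead is counted.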

FOUR PILLARS

Four core technologies

01 NVMe-oF over RDMA / RoCE

Carry NVMe over RDMA, bypassing redundant copies to approach local-disk performance.

02 GPUDirect path

Data moves directly between storage and GPU memory, shortening the path and cutting CPU and latency overhead.

03 All-flash EBOF

Controller-less, high-density flash pool; bandwidth and IOPS scale near-linearly with capacity, at lower power.

04 KV-cache tiered scheduling

Offload and reuse KV cache for long-context / high-switch inference, lifting effective GPU utilization.

VS. NVIDIA

Mapping to NVIDIA: same paradigm, different layers

NVIDIA’s software / IO frameworks define disaggregated inference + tiered KV-cache offload + a direct storage path; ZK-Storage brings the same engineering ideas to sovereign compute at the storage-base layer.

NVIDIA PARADIGM

NVIDIA’s inference paradigm (official)

NVIDIA Dynamo composes three core techniques: Disaggregated Serving, KV Cache-Aware Routing and KV Cache Offloading, underpinned by the low-latency transfer layer NIXL.

NVIDIA, verbatim (excerpt)

“KV cache offloading moves KV cache from HBM to cheaper storage tiers such as host memory, local disk, or remote storage. Reusing precomputed state improves TTFT, reduces TCO, and allows for longer context.”

NVIDIA Dynamo · Introduction · ai-dynamo/dynamo (GitHub)

MAPPING ①

Disaggregation ↔ Disaggregated Serving

ZK-Storage

Hardware-disaggregated EBOF

Decouple storage and compute into an independent all-flash pool, linked to the GPU pool over NVMe-oF/RoCE; compute and capacity scale independently.

↔

NVIDIA

NVIDIA Dynamo · Disaggregated Serving

“Disaggregated serving runs prefill and decode on different devices so each can be scaled and parallelized independently. It required three capabilities: scheduling, memory management for KV cache offloading and onboarding, and low-latency data transfer to move KV cache between nodes and across the memory hierarchy.”

NVIDIA Dynamo · Introduction · ai-dynamo/dynamo (GitHub)

MAPPING ②

KV-cache offload ↔ KVBM tiers

ZK-Storage

KV-cache tiered scheduling

For long-context / multi-model switching, offload and reuse KV cache between GPU memory and all-flash — extend context and concurrency without buying more GPUs.

↔

NVIDIA

NVIDIA Dynamo KVBM · Tiered KV-cache offload

“The KV Block Manager (KVBM) offers a unified memory API spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage. Offloading KV cache from HBM to cheaper tiers (G1 GPU → G2 CPU → G3 SSD → G4 remote) improves TTFT, reduces TCO and enables longer context.”

NVIDIA Dynamo · KVBM
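The tiering idea on this slide can be illustrated with a toy model. The sketch below is not the KVBM API and not ZK-Storage's scheduler; the class, method names and capacities are hypothetical. It only shows the general pattern: least-recently-used KV blocks are demoted one tier down when a tier fills, and a hit in a slower tier promotes the block back to the fastest tier.

```python
from collections import OrderedDict

# Tier labels follow the G1..G4 naming quoted above; everything else is made up.
TIERS = ["G1_gpu", "G2_cpu", "G3_ssd", "G4_remote"]


class TieredKVCache:
    """Toy LRU cache that demotes KV blocks one tier down on overflow."""

    def __init__(self, capacities):
        # capacities: max number of blocks per tier, fastest tier first
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def put(self, block_id, data, tier=0):
        if tier >= len(self.tiers):
            return  # overflowed the last tier: the block is dropped
        store = self.tiers[tier]
        store[block_id] = data
        store.move_to_end(block_id)  # mark as most recently used
        if len(store) > self.capacities[tier]:
            victim_id, victim = store.popitem(last=False)  # evict the LRU block
            self.put(victim_id, victim, tier + 1)  # offload one tier down

    def get(self, block_id):
        for idx, store in enumerate(self.tiers):
            if block_id in store:
                data = store.pop(block_id)
                self.put(block_id, data, 0)  # onboard back into the fastest tier
                return data, TIERS[idx]
        return None, None  # miss: the caller would have to recompute via prefill


cache = TieredKVCache([2, 2, 4, 8])
for key in "abc":
    cache.put(key, key.upper())

print(cache.get("a"))  # ('A', 'G2_cpu'): "a" was demoted when "c" arrived
print(cache.get("a"))  # ('A', 'G1_gpu'): the first hit promoted it
```

A production block manager would also track block sizes, transfer costs per tier and asynchronous movement; the toy keeps only the demote-and-promote logic.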

MAPPING ③

GPUDirect path ↔ GPUDirect Storage

ZK-Storage

GPUDirect path + NVMe-oF

Move data on a direct DMA path between all-flash storage and GPU memory, bypassing the CPU to shorten the path and cut latency.

↔

NVIDIA

NVIDIA GPUDirect Storage (Magnum IO GDS)

“GPUDirect Storage enables a direct data path between local or remote storage, such as NVMe or NVMe over Fabric (NVMe-oF), and GPU memory. It avoids extra copies through a bounce buffer in the CPU’s memory, enabling a DMA engine near the NIC or storage to move data on a direct path into or out of GPU memory — all without burdening the CPU.”

NVIDIA Developer · GPUDirect · GPUDirect Storage Overview Guide

MAPPING ④

Data path ↔ NIXL

ZK-Storage

NVMe-oF over RDMA/RoCE

Carry NVMe over lossless RDMA, providing a low-latency, high-bandwidth data path across GPU memory / host memory / all-flash.

↔

NVIDIA

NVIDIA NIXL · Inference data-transfer library

“NIXL (NVIDIA Inference Xfer Library) provides a non-blocking API for high-performance, vendor-agnostic data movement, transferring KV caches across GPU memory, CPU memory and storage tiers (SSD / remote) for use cases such as disaggregated KV cache movement, long-context storage and model-weight transfer.”

NVIDIA Technical Blog · NIXL · ai-dynamo/nixl (GitHub)

COMPARISON

Objective comparison (fair, non-disparaging)

Dimension	ZK-Storage WS5000	NVIDIA equivalent (official)
Layer	All-flash storage appliance (hardware base)	Inference / IO software framework (Dynamo·NIXL·GDS)
Disaggregation	Hardware EBOF + NVMe-oF/RoCE	Dynamo Disaggregated Serving (prefill/decode split)
KV-cache offload	KV-cache tiered scheduling (mem↔flash)	KVBM tiers G1→G4 (GPU→CPU→SSD→remote)
GPU direct path	GPUDirect path + NVMe-oF	GPUDirect Storage (GPU↔NVMe/NVMe-oF DMA)
Primary compute fit	Domestic GPU / Ascend, 90%+ coverage (S9)	Mainly the NVIDIA GPU ecosystem
Data sovereignty	Strong (self-controlled)	Assess per deployment / compliance
Third-party benchmark	Yes (Beijing Information Science and Technology University, Ascend 910B, S38)	Per official / partner materials
Relationship	Complementary: a sovereign storage base for the paradigm	Open to third-party storage (WEKA / Dell, etc.)

How to read this

ZK figures are labeled vendor spec (S9) / third-party benchmark (S38); NVIDIA capabilities are quoted from official docs (see Sources). This table is an objective dimension-by-dimension reference, not a disparagement of any third party; refer to each party’s latest official information.

COMPLEMENTARY

Complementary, not a replacement

NVIDIA’s KVBM / NIXL are open to third-party storage. Per NVIDIA’s own updates: “Dell integrates PowerScale with Dynamo’s NIXL for 19x faster TTFT” and “WEKA partners with NVIDIA on KV cache storage for Dynamo.”

Where ZK-Storage fits

This confirms that a disaggregated all-flash storage base is a key part of the disaggregated-inference / KV-cache-offload paradigm. ZK-Storage provides that base for sovereign compute (Ascend / domestic GPUs) — mass-producible, independently benchmarked, with data residency.

Interoperable, not adversarial

Sovereign compute base

Data residency

Mass-production · validated

ai-dynamo/dynamo (GitHub) updates

VALIDATION

Validation & positioning

Let a reproducible third-party benchmark speak, with an honest positioning.

INDEPENDENT TEST

Third-party benchmark: Beijing Information Science and Technology University · Ascend 910B

Model	ZK-Storage load	NFS load	Load speedup	Service speedup
DeepSeek-32B	6.62 s	563.85 s	85.2×	6.17×
DeepSeek-70B	35.38 s	1284.66 s	36.3×	9.33×
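The load-speedup column can be checked directly from the two load-time columns. The short sketch below recomputes it as NFS load time divided by ZK-Storage load time; the function name is illustrative, and the service-speedup column cannot be rederived this way because its inputs are not in the table.

```python
def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


# (ZK-Storage load, NFS load) in seconds, copied from the table above.
results = {
    "DeepSeek-32B": (6.62, 563.85),
    "DeepSeek-70B": (35.38, 1284.66),
}

for model, (zk_s, nfs_s) in results.items():
    print(f"{model}: {speedup(nfs_s, zk_s):.2f}x")
# DeepSeek-32B: 85.17x
# DeepSeek-70B: 36.31x
```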

Key result (reproducible)

Against an NFS over TCP/10GbE baseline, ZK-Storage over NVMe-oF (RDMA/RoCE) reaches a peak inference-load speedup of 85.17×; at 40 switches/day, effective token throughput rises by 356.9%; the median reduction across 7 metrics is 90.9%. All figures come from a single source and are reproducible and verifiable (S38).
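To see why load time matters at 40 switches/day, the sketch below estimates the share of a day spent loading models. It assumes every switch triggers a full DeepSeek-32B load and that loading never overlaps with serving; these are simplifying assumptions, and the function name is hypothetical. It covers load time only, so it does not reproduce the 356.9% throughput figure, whose methodology is in S38.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def load_fraction(load_s: float, switches_per_day: int) -> float:
    """Fraction of a day spent loading, assuming one full load per switch."""
    return load_s * switches_per_day / SECONDS_PER_DAY


# DeepSeek-32B load times in seconds, from the benchmark table.
for label, load_s in [("NFS", 563.85), ("ZK-Storage", 6.62)]:
    print(f"{label}: {load_fraction(load_s, 40):.1%} of the day spent loading")
# NFS: 26.1% of the day spent loading
# ZK-Storage: 0.3% of the day spent loading
```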

POSITIONING

An honest positioning

Same paradigm: with NVIDIA, we agree slow IO is the hidden bottleneck of LLM compute.
Different layers: NVIDIA provides software / IO frameworks; ZK-Storage provides a mass-producible all-flash storage base.
Complementary: a disaggregated all-flash base is part of the disaggregated-inference / KV-offload paradigm.
Sovereign: deeply tuned for Ascend / domestic GPUs, with data residency, third-party validation and mass production.

In one line

Make every GPU earn its keep — whichever compute lineage it comes from.

SOURCES

Sources & conventions (verifiable)

ZK-Storage perf / spec: Vendor spec (S9): 300 GB/s aggregate bandwidth, 50M random IOPS, 20 μs latency, 90%+ domestic-GPU coverage, 48–72 h deployment, about 40% lower total cost.

ZK-Storage validation: Beijing Information Science and Technology University on Huawei Ascend Atlas 910B, NFS baseline (S38): DeepSeek-32B load 563.85 s → 6.62 s (85.17×); 90.9% median reduction across 7 metrics. From business_plan/outputs/results.json, reproducible.

KV-cache offload savings: Industry research: up to ~73.7% online-workload cost reduction (S5).

NVIDIA GPUDirect Storage (Magnum IO GDS): NVIDIA Developer · GPUDirect · GPUDirect Storage Overview Guide

NVIDIA Dynamo · Disaggregated Serving: NVIDIA Dynamo · Introduction · ai-dynamo/dynamo (GitHub)

NVIDIA Dynamo KVBM · Tiered KV-cache offload: NVIDIA Dynamo · KVBM

NVIDIA NIXL · Inference data-transfer library: NVIDIA Technical Blog · NIXL · ai-dynamo/nixl (GitHub)

Last updated: 2026-06-28 · ZK figures from business_plan/outputs/results.json (S-codes on the site’s “Data Sources” page); NVIDIA descriptions and links are its official public materials.

THANK YOU

Make every GPU earn its keep

ZK-Storage WS5000 · disaggregated all-flash accelerated storage appliance · Shenzhen Zhongke Hangxing Technology Co., Ltd.

Q & A: technical discussion welcome

PoC: demo units in stock

Interop: building on sovereign compute

ZK-Storage vs NVIDIA · Technical Briefing Deck
