Hello! We are excited to announce that Furiosa SDK 2026.1.0 has officially been published.
The blog article about the 2026.1 release is here:
Here are the release notes and documentation:
Highlights of this release:
LLM Serving
- Hybrid Batching: Boosts throughput while maintaining low tail latency, achieving significantly higher requests per second compared to previous versions.
- Prefix Caching: Automatically detects and reuses common prompt prefixes across requests using a branch-compressed radix tree, significantly reducing TTFT.
- Enhanced Serving Scheduler: The serving scheduler has been completely redesigned to maximize throughput while keeping latency low, leveraging hybrid batching and prefix caching.
- Smart Memory Management: Advanced buffer pools with intelligent KV cache allocation for sliding-window attention, plus handling of KV cache memory pressure.
- Enhanced Observability: Production-grade serving metrics including KV cache utilization, per-device metrics, and request pool statistics — fully integrated with llm-d.
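To make the prefix-caching bullet above concrete, here is a minimal sketch of a branch-compressed radix tree over token sequences: edges hold runs of tokens and are split only where requests diverge, so matching a new prompt against the cache is a single walk from the root. The class and method names are hypothetical, not the SDK's actual API.

```python
class _Node:
    """Radix-tree node; each edge stores a compressed run of tokens."""
    __slots__ = ("children",)

    def __init__(self):
        # first token of edge -> (edge token tuple, child node)
        self.children = {}


class PrefixCache:
    """Illustrative branch-compressed radix tree for prompt prefixes."""

    def __init__(self):
        self.root = _Node()

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            t = tokens[i]
            if t not in node.children:
                node.children[t] = (tuple(tokens[i:]), _Node())
                return
            edge, child = node.children[t]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            if j == len(edge):
                node, i = child, i + j
                continue
            # Requests diverge mid-edge: split the edge at the divergence point.
            mid = _Node()
            mid.children[edge[j]] = (edge[j:], child)
            node.children[t] = (edge[:j], mid)
            node, i = mid, i + j

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens` (reusable KV)."""
        node, i = self.root, 0
        while i < len(tokens):
            t = tokens[i]
            if t not in node.children:
                break
            edge, child = node.children[t]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            i += j
            if j < len(edge):
                break
            node = child
        return i
```

In a real serving stack the matched prefix length tells the scheduler how many KV cache blocks can be reused, so only the remaining suffix needs a prefill pass, which is what drives the TTFT reduction.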
Model & Quantization
- Pooling Model Support: Comprehensive support for pooling models, enabling critical NLP tasks (e.g., embeddings, scoring, reranking) that are essential for RAG applications.
- New Model Families: Qwen3, Exaone4
- New Quantization: Fine-grained FP8 dynamic quantization (i.e., DeepSeek-style 2D block weight quantization and dynamic activation quantization)
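The quantization scheme named above pairs static per-tile weight scales with per-group activation scales computed at run time. The sketch below shows the scale bookkeeping only; function names and the tile sizes are illustrative, and real kernels would cast the scaled values to an FP8 dtype rather than keep them as Python floats.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def quantize_2d_blocks(w, block=128):
    """One scale per (block x block) weight tile (DeepSeek-style).

    Returns scaled values (now within the FP8 range) and the per-tile
    scales needed to dequantize: w[i][j] ~= q[i][j] * scales[bi][bj].
    """
    rows, cols = len(w), len(w[0])
    q = [[0.0] * cols for _ in range(rows)]
    scales = []
    for bi in range(0, rows, block):
        srow = []
        for bj in range(0, cols, block):
            tile = [abs(w[i][j])
                    for i in range(bi, min(bi + block, rows))
                    for j in range(bj, min(bj + block, cols))]
            s = (max(tile) / FP8_E4M3_MAX) or 1.0  # avoid a zero scale
            srow.append(s)
            for i in range(bi, min(bi + block, rows)):
                for j in range(bj, min(bj + block, cols)):
                    q[i][j] = w[i][j] / s
        scales.append(srow)
    return q, scales


def quantize_activations(x, group=128):
    """Dynamic per-(1 x group) activation scales, computed per request."""
    q, scales = [], []
    for row in x:
        qrow, srow = [], []
        for g in range(0, len(row), group):
            chunk = row[g:g + group]
            s = (max(abs(v) for v in chunk) / FP8_E4M3_MAX) or 1.0
            srow.append(s)
            qrow.extend(v / s for v in chunk)
        q.append(qrow)
        scales.append(srow)
    return q, scales
```

The fine granularity is the point: a single outlier only inflates the scale of its own tile or group, so the rest of the tensor keeps its precision.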
Distributed & Scalable Inference and Deployment
- llm-d: Seamless deployment of large language models across multiple nodes with intelligent request routing based on KV cache usage, prefix cache state, and per-replica running-request and queue depths.
- Dynamic Resource Allocation (DRA): Next-gen Kubernetes plugin for hardware accelerators, enabling intelligent NPU resource management with PCIe topology-aware strategies.
- NPU Operator Support: Essential cloud-native component for large-scale cluster management — handling device discovery, driver/firmware rolling upgrades, and lifecycle management.
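The llm-d routing signals listed above (KV cache usage, prefix cache state, queue depth) can be combined into a per-replica score, with the request sent to the highest-scoring replica. The sketch below is a toy linear scorer under assumed weights; llm-d's actual policy, field names, and weights differ.

```python
def route_score(replica, prefix_hit_tokens, prompt_len,
                w_prefix=2.0, w_kv=1.0, w_queue=0.1):
    """Score one replica for one request (weights are illustrative).

    More reusable prefix tokens is better; high KV cache utilization
    and a deep request queue are worse.
    """
    hit_ratio = prefix_hit_tokens / max(prompt_len, 1)
    return (w_prefix * hit_ratio
            - w_kv * replica["kv_cache_util"]
            - w_queue * replica["queue_depth"])


def pick_replica(replicas, prefix_hits, prompt_len):
    """Return the name of the best-scoring replica for this request."""
    return max(replicas,
               key=lambda name: route_score(replicas[name],
                                            prefix_hits[name], prompt_len))
```

With no cached prefix anywhere, the lightly loaded replica wins; once one replica holds most of the prompt in its prefix cache, routing flips toward it even though it is busier, which is the cache-aware behavior the bullet describes.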
Performance & Compatibility
- AoT Wiring: Improves hybrid batching performance by 10–20% through ahead-of-time compilation across kernels.
- Improved APIs: Tool calling fully compatible with vLLM, additional sampling parameters, prompt logprobs, API key authentication, and content format support.
- Transformers 4.57.x and PyTorch 2.7.x support