Hello! We are excited to announce that Furiosa SDK 2026.1.0 has officially been published.
The blog article about the 2026.1 release is here:
Here are the release notes and documentation:
Highlights of this release:
LLM Serving
- Hybrid Batching: Boosts throughput while maintaining low tail latency, achieving significantly higher requests per second compared to previous versions.
- Prefix Caching: Automatically detects and reuses common prompt prefixes across requests using a branch-compressed radix tree, significantly reducing TTFT.
- Enhanced Serving Scheduler: The serving scheduler has been completely redesigned to maximize throughput while keeping latency low, leveraging hybrid batching and prefix caching.
- Smart Memory Management: Advanced buffer pools with intelligent KV cache allocation for sliding-window attention, plus handling of KV cache memory pressure.
- Enhanced Observability: Production-grade serving metrics including KV cache utilization, per-device metrics, and request pool statistics — fully integrated with llm-d.
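To make the prefix-caching bullet above concrete, here is a minimal sketch of a branch-compressed radix tree over token sequences: edges hold runs of tokens and are split only where requests diverge, so matching a new prompt against the cache is a single walk from the root. The class and method names are hypothetical, not the SDK's actual API.

```python
class _Node:
    """Radix-tree node; each edge stores a compressed run of tokens."""
    __slots__ = ("children",)

    def __init__(self):
        # first token of edge -> (edge token tuple, child node)
        self.children = {}


class PrefixCache:
    """Illustrative branch-compressed radix tree for prompt prefixes."""

    def __init__(self):
        self.root = _Node()

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            t = tokens[i]
            if t not in node.children:
                node.children[t] = (tuple(tokens[i:]), _Node())
                return
            edge, child = node.children[t]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            if j == len(edge):
                node, i = child, i + j
                continue
            # Requests diverge mid-edge: split the edge at the divergence point.
            mid = _Node()
            mid.children[edge[j]] = (edge[j:], child)
            node.children[t] = (edge[:j], mid)
            node, i = mid, i + j

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens` (reusable KV)."""
        node, i = self.root, 0
        while i < len(tokens):
            t = tokens[i]
            if t not in node.children:
                break
            edge, child = node.children[t]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            i += j
            if j < len(edge):
                break
            node = child
        return i
```

In a real serving stack the matched prefix length tells the scheduler how many KV cache blocks can be reused, so only the remaining suffix needs a prefill pass, which is what drives the TTFT reduction.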
Model & Quantization
- Pooling Model Support: Comprehensive support for pooling models, enabling critical NLP tasks (e.g., embeddings, scoring, reranking) that are essential for RAG applications.
- New Model Families: Qwen3, Exaone4
- New Quantization: Fine-grained FP8 dynamic quantization (i.e., DeepSeek-style 2D block weight quantization and dynamic activation quantization)
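The quantization scheme named above pairs static per-tile weight scales with per-group activation scales computed at run time. The sketch below shows the scale bookkeeping only; function names and the tile sizes are illustrative, and real kernels would cast the scaled values to an FP8 dtype rather than keep them as Python floats.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def quantize_2d_blocks(w, block=128):
    """One scale per (block x block) weight tile (DeepSeek-style).

    Returns scaled values (now within the FP8 range) and the per-tile
    scales needed to dequantize: w[i][j] ~= q[i][j] * scales[bi][bj].
    """
    rows, cols = len(w), len(w[0])
    q = [[0.0] * cols for _ in range(rows)]
    scales = []
    for bi in range(0, rows, block):
        srow = []
        for bj in range(0, cols, block):
            tile = [abs(w[i][j])
                    for i in range(bi, min(bi + block, rows))
                    for j in range(bj, min(bj + block, cols))]
            s = (max(tile) / FP8_E4M3_MAX) or 1.0  # avoid a zero scale
            srow.append(s)
            for i in range(bi, min(bi + block, rows)):
                for j in range(bj, min(bj + block, cols)):
                    q[i][j] = w[i][j] / s
        scales.append(srow)
    return q, scales


def quantize_activations(x, group=128):
    """Dynamic per-(1 x group) activation scales, computed per request."""
    q, scales = [], []
    for row in x:
        qrow, srow = [], []
        for g in range(0, len(row), group):
            chunk = row[g:g + group]
            s = (max(abs(v) for v in chunk) / FP8_E4M3_MAX) or 1.0
            srow.append(s)
            qrow.extend(v / s for v in chunk)
        q.append(qrow)
        scales.append(srow)
    return q, scales
```

The fine granularity is the point: a single outlier only inflates the scale of its own tile or group, so the rest of the tensor keeps its precision.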
Distributed & Scalable Inference and Deployment
- llm-d: Seamless deployment of large language models across multiple nodes with intelligent request routing based on KV cache usage, prefix cache state, and per-replica running-request and queue depths.
- Dynamic Resource Allocation (DRA): Next-gen Kubernetes plugin for hardware accelerators, enabling intelligent NPU resource management with PCIe topology-aware strategies.
- NPU Operator Support: Essential cloud-native component for large-scale cluster management — handling device discovery, driver/firmware rolling upgrades, and lifecycle management.
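The llm-d routing signals listed above (KV cache usage, prefix cache state, queue depth) can be combined into a per-replica score, with the request sent to the highest-scoring replica. The sketch below is a toy linear scorer under assumed weights; llm-d's actual policy, field names, and weights differ.

```python
def route_score(replica, prefix_hit_tokens, prompt_len,
                w_prefix=2.0, w_kv=1.0, w_queue=0.1):
    """Score one replica for one request (weights are illustrative).

    More reusable prefix tokens is better; high KV cache utilization
    and a deep request queue are worse.
    """
    hit_ratio = prefix_hit_tokens / max(prompt_len, 1)
    return (w_prefix * hit_ratio
            - w_kv * replica["kv_cache_util"]
            - w_queue * replica["queue_depth"])


def pick_replica(replicas, prefix_hits, prompt_len):
    """Return the name of the best-scoring replica for this request."""
    return max(replicas,
               key=lambda name: route_score(replicas[name],
                                            prefix_hits[name], prompt_len))
```

With no cached prefix anywhere, the lightly loaded replica wins; once one replica holds most of the prompt in its prefix cache, routing flips toward it even though it is busier, which is the cache-aware behavior the bullet describes.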
Performance & Compatibility
- AoT Wiring: Improves hybrid batching performance by 10–20% through ahead-of-time compilation across kernels.
- Improved APIs: Tool calling fully compatible with vLLM, additional sampling parameters, prompt logprobs, API key authentication, and content format support.
- Transformers 4.57.x and PyTorch 2.7.x support