Furiosa SDK 2026.1.0 release

Hello! We are excited to announce that Furiosa SDK 2026.1.0 has officially been published.

The blog article about the 2026.1 release is here:

Here are the release notes and documentation:

Highlights of this release:

LLM Serving

  • Hybrid Batching: Boosts throughput while maintaining low tail latency, achieving significantly higher requests per second compared to previous versions.
  • Prefix Caching: Automatically detects and reuses common prompt prefixes across requests using a branch-compressed radix tree, significantly reducing TTFT.
  • Enhanced Serving Scheduler: Completely redesigned to maximize throughput while keeping latency low, leveraging hybrid batching and prefix caching.
  • Smart Memory Management: Advanced buffer pools with intelligent KV cache allocation for sliding-window attention, plus handling of KV cache memory pressure.
  • Enhanced Observability: Production-grade serving metrics including KV cache utilization, per-device metrics, and request pool statistics — fully integrated with llm-d.
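To make the prefix-caching idea above concrete, here is a minimal, illustrative Python sketch of a branch-compressed radix tree over token sequences. This is not the SDK's implementation; the `RadixCache` and `RadixNode` names and the token-ID interface are assumptions for illustration only.

```python
class RadixNode:
    """A node in a branch-compressed radix tree over token sequences."""
    def __init__(self, tokens=()):
        self.tokens = tuple(tokens)   # compressed edge label
        self.children = {}            # first token of edge -> RadixNode

class RadixCache:
    """Toy prefix cache: records seen prompts, reports the longest cached prefix."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            key = tokens[i]
            if key not in node.children:
                node.children[key] = RadixNode(tokens[i:])
                return
            child = node.children[key]
            # length of the common prefix with this compressed edge
            j = 0
            while (j < len(child.tokens) and i + j < len(tokens)
                   and child.tokens[j] == tokens[i + j]):
                j += 1
            if j < len(child.tokens):
                # split the edge at the divergence point
                split = RadixNode(child.tokens[:j])
                child.tokens = child.tokens[j:]
                split.children[child.tokens[0]] = child
                node.children[key] = split
                child = split
            i += j
            node = child

    def longest_prefix(self, tokens):
        """Number of leading tokens already covered by the cache."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                break
            j = 0
            while (j < len(child.tokens) and i + j < len(tokens)
                   and child.tokens[j] == tokens[i + j]):
                j += 1
            i += j
            if j < len(child.tokens):
                break
            node = child
        return i
```

In a real serving stack, each cached node would also reference KV cache blocks on the device, so the matched prefix skips recomputation entirely, which is what drives the TTFT reduction.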

Model & Quantization

  • Pooling Model Support: Comprehensive support for pooling models to enable critical NLP tasks (e.g., embeddings, scoring, reranking), especially essential for RAG applications.
  • New Model Families: Qwen3, Exaone4
  • New Quantization: Fine-grained FP8 dynamic quantization (i.e., DeepSeek-style 2D block weight quantization and dynamic activation quantization)
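The fine-grained FP8 scheme can be illustrated with a small NumPy sketch: one scale per 2D weight tile, and per-token activation scales computed at runtime. This simulates the scaling arithmetic only (values stay in float32 rather than being cast to a real FP8 type); the function names and the 128x128 block size are assumptions for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_weight_blocks(w, block=128):
    """Per-(block x block) weight scaling: one scale per 2D tile."""
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.float32)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = max(float(np.abs(tile).max()) / FP8_E4M3_MAX, 1e-12)
            scales[i // block, j // block] = s
            # simulate the FP8 value range by scaling and clipping
            q[i:i + block, j:j + block] = np.clip(tile / s,
                                                  -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales

def quantize_activations_dynamic(x):
    """Dynamic per-row (per-token) activation scaling, computed at runtime."""
    s = np.maximum(np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX, 1e-12)
    return np.clip(x / s, -FP8_E4M3_MAX, FP8_E4M3_MAX), s
```

Per-tile weight scales bound the quantization error of outlier-heavy blocks, while dynamic activation scales avoid committing to a calibration-time range.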

Distributed & Scalable Inference and Deployment

  • llm-d: Seamless deployment of large language models across multiple nodes with intelligent request routing based on KV cache usage, prefix cache state, and per-instance running/queue depths.
  • Dynamic Resource Allocation (DRA): Next-gen Kubernetes plugin for hardware accelerators, enabling intelligent NPU resource management with PCIe topology-aware strategies.
  • NPU Operator Support: Essential cloud-native component for large-scale cluster management — handling device discovery, driver/firmware rolling upgrades, and lifecycle management.
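The cache-aware routing described above can be sketched as a simple weighted score over replicas: reward prefix-cache hits, penalize KV cache pressure and queue depth. The `Replica` fields, weights, and scoring formula here are hypothetical, for illustration only, and not llm-d's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    kv_cache_utilization: float   # 0.0 - 1.0
    queue_depth: int              # requests currently queued
    cached_prefix_tokens: int     # tokens of this prompt already in its prefix cache

def route(replicas, prompt_len, w_prefix=1.0, w_kv=0.5, w_queue=0.1):
    """Pick the replica with the best weighted score (higher is better)."""
    def score(r):
        hit_ratio = min(r.cached_prefix_tokens, prompt_len) / max(prompt_len, 1)
        return (w_prefix * hit_ratio
                - w_kv * r.kv_cache_utilization
                - w_queue * r.queue_depth)
    return max(replicas, key=score)
```

A replica holding a long cached prefix can win even with moderate load, since reusing the prefix saves more compute than a shorter queue does.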

Performance & Compatibility

  • AoT Wiring: Improves hybrid batching performance by 10–20% through ahead-of-time pre-compilation across kernels.
  • Improved APIs: Tool calling fully compatible with vLLM, additional sampling parameters, prompt logprobs, API key authentication, and content format support.
  • Transformers 4.57.x and PyTorch 2.7.x support
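Since tool calling follows the vLLM/OpenAI-compatible request format, a request body looks like the sketch below. The model id and tool schema are placeholders, not names from this release.

```python
import json

# An OpenAI-compatible chat-completions request body with a tool definition.
# "placeholder-model" and "get_weather" are illustrative, not real identifiers.
payload = {
    "model": "placeholder-model",
    "messages": [{"role": "user", "content": "What's the weather in Seoul?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
body = json.dumps(payload)
```

A client POSTs this body to the server's `/v1/chat/completions` endpoint; when the model decides to call the tool, the response carries a `tool_calls` entry instead of plain text.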