Furiosa SDK 2026.3.0 release

Hello! We are excited to announce the Furiosa SDK 2026.3.0 release. SDK 2026.3.0 has officially been published today. Here are the release note and documents:

This release is centered on model coverage. At its core is the new TCL kernel framework and the FXB packaging format, which together make enabling and distributing a new model architecture dramatically faster — and on that foundation 2026.3 brings up first-class multimodal serving and a broad set of new MoE models. Highlights of this release:

  • TCL Kernel Framework & furiosa-kernels
    • The defining change in 2026.3 is the TCL (Tensor Contraction Language) framework — a declarative Python eDSL where a kernel author says what to compute and leaves how (tiling, scheduling, fusion, hardware mapping) to the compiler. It treats the tensor contraction operations DNN models rely on as first-class primitives, matching RNGD’s underlying TCP (Tensor Contraction Processor) architecture.
    • The furiosa-kernels package is the concrete result — a collection of reusable TCL kernels (attention, MoE, vision encoder, …). Enablement now scales with the number of reusable blocks, not the number of models, which is what makes the breadth of new models below possible.
  • New Model Families (built on TCL)
    • Qwen3-VL (e.g. Qwen3-VL-32B) — the first vision-language family on RNGD.
    • gpt-oss (e.g. gpt-oss-120b) — MoE family with MXFP4-quantized expert weights.
    • Solar-Open (e.g. Solar-Open-100B) — MoE family, NVFP4 weights with 16-bit activations and KV cache (NVFP4A16).
    • Qwen3 MoE (e.g. Qwen3-30B-A3B) — MoE family with dynamic FP8 activation quantization; Instruct, Thinking, and Coder variants.
    • K-EXAONE (e.g. K-EXAONE-236B-A23B) — multilingual MoE family using a hybrid sliding-window + global attention scheme; NVFP4A16.
    • Each ships an FXB so it can be served directly from its Hugging Face repository.
  • FXB: Furiosa Executable Bundle
    • 2026.3 introduces the Furiosa Executable Bundle (FXB), Furiosa-LLM’s shareable compiled-artifact format. An .fxb is a single archive you can serve without recompiling, copy to another machine, or publish to the Hugging Face Hub.
    • The defining property of an FXB is its architecture fingerprint: a single bundle is reusable across any Hugging Face model that shares the same fingerprint — including fine-tuned or weight-updated variants — so a model whose own repo ships no .fxb can be served from a compatible cached bundle. (Fingerprint matching is experimental; verify with fxb check.)
    • A dedicated fxb command manages the full lifecycle — building, downloading, caching, compatibility checking, and inspection.
  • Multimodal Serving and Qwen3-VL (experimental)
    • 2026.3 introduces vision-language (multimodal) serving on RNGD, with Qwen3-VL-32B as the first supported model. Image-and-text requests are served through the standard OpenAI-compatible Chat Completions API using image_url content parts.
    • To avoid re-preprocessing the same image, multimodal inputs can be tagged with a stable UUID and reused from a server-side processor cache (--mm-processor-cache-gb). Scheduling/batching of multimodal requests is still being optimized.
  • Overlap Scheduling: Toward Zero-Overhead Batching (experimental)
    • A new overlap scheduler runs one batch ahead: while the NPU executes the current batch, the scheduler concurrently prepares the next one’s metadata, so host-side scheduling overlaps NPU compute instead of stalling it. For throughput-oriented workloads this improves overall throughput and TPOT, at the cost of a small, bounded increase in TTFT. Off by default — enable with --enable-overlap-scheduling.
  • Scoring-Based Data Parallel Routing
    • The prefix-aware DP router from 2026.2 evolves into a scoring-based policy that balances two signals when picking a replica — prefix locality and token-footprint load — with the relative weight chosen through a scoring profile (balanced (default), locality, or load).
  • Broader Platform Support
    • Python 3.14: supported versions are now 3.10–3.14.
    • Broader arm64 (aarch64) support across the native Python wheels and cloud-native images.
    • Rocky Linux 10 / RHEL: driver and firmware now ship as .el10 RPMs, and the cloud-native images are Red Hat OpenShift–certified.

A couple of things to note when upgrading from 2026.2:

  • Firmware update is now an explicit step — installing the firmware image package no longer flashes the device automatically; run the firmware updater (furiosa_rngd_updater_all) afterward. See the Upgrade Guide.
  • ArtifactBuilder / furiosa-llm build is legacy and scheduled for deprecation in favor of FXB and the fxb command.

You can find the supported models at: