Furiosa SDK 2025.3.0 release

Hello! We are excited to announce that Furiosa SDK 2025.3.0 has officially been published today. This release is the fifth major release for RNGD, and provides a streamlined stack for running LLMs on RNGD, including the driver, SoC firmware, PERT, HAL, Model Compressor, the Furiosa Compiler, and all Furiosa SDK components including Furiosa-LLM. We also appreciate the efforts of the qualification team, who conducted intensive testing on RNGD. Many thanks to all the teams and individuals who made this release possible. Here are the release notes and documents:

Highlights of this release:

  • Inter-chip Tensor Parallelism
    • Tensor parallelism across multiple NPU cards is now officially supported, enabling efficient scaling of large models and leading to significantly improved throughput.
    • To maximize performance, this feature is backed by key optimizations, including optimized PCIe paths for peer-to-peer (P2P) communication, advanced communication scheduling, and compiler tactics that overlap inter-chip DMA with computation.
  • Compiler and Runtime Optimizations
    • Enhanced Global Compiler Optimization: Furiosa compiler’s global optimization capabilities were enhanced to maximize SRAM reuse between transformer blocks. This reduces memory access latency and boosts overall throughput.
    • Runtime Optimization: Further optimized the runtime by reducing interference between the host and NPU, improving synchronization across devices, and minimizing overhead between consecutive decoding steps.
  • The above optimizations yield the following performance improvements compared to the previous release 2025.2:
    • Llama 3.1 8B: Up to 4.5% average throughput improvement and up to a 55% average reduction in Time-to-First-Token (TTFT).
    • Llama 3.1 70B: Up to 3x average throughput improvement and up to a 35% average reduction in TTFT.
    • The experiment configuration is as follows:
      • input lengths: 1k–12k
      • output lengths: 128–10k
      • batch size: 1–128
  • Expanded Model Support
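
To make the inter-chip tensor parallelism idea concrete, here is a minimal NumPy sketch (purely conceptual, not the Furiosa-LLM API): a weight matrix is split column-wise across devices, each device computes a partial matmul, and the partial results are gathered afterwards. In the real stack, the per-shard matmuls run on separate RNGD cards and the gather is an inter-chip DMA transfer that the compiler overlaps with computation.

```python
import numpy as np

def column_parallel_matmul(x, w, num_devices):
    """Conceptual column-wise tensor parallelism.

    Splits w into column shards (one per device), computes each partial
    product, then concatenates the partials (an all-gather in practice).
    """
    shards = np.array_split(w, num_devices, axis=1)  # one shard per device
    partials = [x @ shard for shard in shards]       # computed in parallel on real HW
    return np.concatenate(partials, axis=1)          # all-gather across chips

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations
w = rng.standard_normal((8, 16))   # weight matrix to shard

y_parallel = column_parallel_matmul(x, w, num_devices=2)
y_reference = x @ w
assert np.allclose(y_parallel, y_reference)  # sharded result matches single-device matmul
```

The sketch shows why this scheme scales: each device holds and multiplies only its own weight shard, so memory and compute per chip shrink as devices are added, at the cost of the communication step that the optimizations above target.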

Breaking Changes

  • SDK 2025.3.0 cannot load artifacts built with SDK 2025.2.x. Please use artifacts built with 2025.3.x, or rebuild the model with the new SDK.

  • furiosa-mlperf has been deprecated and is removed in this release. Please use other benchmarking tools, such as the vLLM benchmark or LLMPerf.
