Furiosa SDK 2025.3.0 release

Hello! We are excited to announce that Furiosa SDK 2025.3.0 has officially been published today. This release is the fifth major release for RNGD, and provides a streamlined stack for running LLMs on RNGD, including the driver, SoC firmware, PERT, HAL, Model Compressor, the Furiosa Compiler, and all Furiosa SDK components including Furiosa-LLM. We also appreciate the efforts of the qualification team, who conducted intensive testing on RNGD. Many thanks to all the teams and individuals who made this release possible. Here are the release notes and documents:

Highlights of this release:

  • Inter-chip Tensor Parallelism
    • Tensor parallelism across multiple NPU cards is now officially supported, enabling efficient scaling of large models and leading to significantly improved throughput.
    • To maximize performance, this feature is backed by key optimizations, including optimized PCIe paths for peer-to-peer (P2P) communication, advanced communication scheduling, and compiler tactics that overlap inter-chip DMA with computation.
  • Compiler and Runtime Optimizations
    • Enhanced Global Compiler Optimization: Furiosa compiler’s global optimization capabilities were enhanced to maximize SRAM reuse between transformer blocks. This reduces memory access latency and boosts overall throughput.
    • Runtime Optimization: Further optimized the runtime by reducing interference between the host and NPU, improving synchronization across devices, and minimizing overhead between consecutive decoding steps.
  • The above optimizations yield the following performance improvements compared to the previous release 2025.2:
    • Llama 3.1 8B: Up to 4.5% average throughput improvement and up to a 55% average reduction in Time-to-First-Token (TTFT).
    • Llama 3.1 70B: Up to 3x average throughput improvement and up to a 35% average reduction in TTFT.
    • The experiment configuration is as follows:
      • input lengths: 1k–12k
      • output lengths: 128–10k
      • batch size: 1–128
  • Expanded Model Support
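
To make the inter-chip tensor parallelism idea concrete, here is a minimal NumPy sketch (purely conceptual, not the Furiosa-LLM API): a weight matrix is split column-wise across devices, each device computes a partial matmul, and the partial results are gathered afterwards. In the real stack, the per-shard matmuls run on separate RNGD cards and the gather is an inter-chip DMA transfer that the compiler overlaps with computation.

```python
import numpy as np

def column_parallel_matmul(x, w, num_devices):
    """Conceptual column-wise tensor parallelism.

    Splits w into column shards (one per device), computes each partial
    product, then concatenates the partials (an all-gather in practice).
    """
    shards = np.array_split(w, num_devices, axis=1)  # one shard per device
    partials = [x @ shard for shard in shards]       # computed in parallel on real HW
    return np.concatenate(partials, axis=1)          # all-gather across chips

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations
w = rng.standard_normal((8, 16))   # weight matrix to shard

y_parallel = column_parallel_matmul(x, w, num_devices=2)
y_reference = x @ w
assert np.allclose(y_parallel, y_reference)  # sharded result matches single-device matmul
```

The sketch shows why this scheme scales: each device holds and multiplies only its own weight shard, so memory and compute per chip shrink as devices are added, at the cost of the communication step that the optimizations above target.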

Breaking Changes

  • SDK 2025.3.0 cannot load artifacts built with SDK 2025.2.x. Please use artifacts built with 2025.3.x, or rebuild the model with the new SDK.

  • furiosa-mlperf has been deprecated and is removed in this release. Please use other benchmarking tools, such as the vLLM benchmark or LLMPerf.
