Furiosa SDK 2025.1.0 release

Hello! We are excited to announce that Furiosa SDK 2025.1.0 has officially been published today. This release is the third major release for RNGD, and it provides a streamlined stack for enabling LLMs on RNGD, including the driver, SoC firmware, PERT, HAL, cloud-native components, Model Compressor, Furiosa Compiler, and Furiosa LLM.

Here are the release notes and documents:

Key features and improvements in the 2025.1.0 release

  • LLM latency optimization (up to 11.66% TTFT and 11.45% TPOT improvement for inputs <= 30k and 1k outputs)
  • Support Tool-calling in Furiosa LLM (Tool Calling)
  • Support device remapping (e.g., /dev/rngd/npu2pe0-3 → /dev/rngd/npu0pe0-3) for containers
  • Add the new command line tool furiosa-llm build to easily build an artifact from a Hugging Face model (Building a Model Artifact)
  • Fix continuous batching scheduling bugs that occur at certain sequence lengths and batch sizes
  • Automatic configuration of the maximum KV-cache memory allocation
  • Reduce fragmentation in runtime memory allocation
  • Allow the furiosa-mlperf command to specify pipeline_parallel_size and data_parallel_size
  • Add --allowed-origins argument to furiosa-llm serve (OpenAI-Compatible Server)
  • Fix trust_remote_code support bug in furiosa-llm
  • Support min-p sampling in SamplingParams (SamplingParams class)
  • Allow npu:X in addition to npu:X:* in devices option
    • e.g., furiosa-llm serve ./model --devices "npu:0"
  • The furiosa-mlperf command supports npu_q_limit and spare_ratio, allowing performance tuning
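
Min-p sampling keeps only tokens whose probability is at least min_p times the probability of the most likely token, then renormalizes before sampling. A minimal sketch of that filtering step in plain Python (this illustrates the technique only, not Furiosa's implementation; the function name and probabilities are made up):

```python
def min_p_filter(probs, min_p):
    """Zero out tokens with prob < min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With min_p=0.1, tokens below 10% of the top token's probability are dropped.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
filtered = min_p_filter(probs, 0.1)  # last two tokens are masked out
```

Unlike a fixed top-p cutoff, the threshold scales with the model's confidence: when the top token is very likely, more of the tail is pruned; when the distribution is flat, more candidates survive.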
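
Since furiosa-llm serve exposes an OpenAI-compatible server, tool calling uses the standard OpenAI chat-completions request shape with a tools array. A hedged sketch of constructing such a request body (the model path, tool name, and JSON schema below are illustrative placeholders, not part of the SDK):

```python
import json

# Illustrative tool-calling payload for an OpenAI-compatible endpoint.
payload = {
    "model": "./model",  # placeholder: path or name the server was launched with
    "messages": [
        {"role": "user", "content": "What is the weather in Seoul?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
body = json.dumps(payload)  # POST this to the server's /v1/chat/completions
```

When the model decides to call a tool, the response carries a tool_calls entry with the function name and JSON-encoded arguments instead of plain assistant text, and the client executes the tool and sends the result back in a follow-up message.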