Hello! We are excited to announce the Furiosa SDK 2025.1.0 release, officially published today. This is the third major release for RNGD, and it provides a streamlined stack for enabling LLMs on RNGD, including the driver, SoC firmware, PERT, HAL, cloud-native components, Model Compressor, Furiosa Compiler, and Furiosa LLM.
Here are the release notes and documents:
Key features and improvements in the 2025.1.0 release
- LLM latency optimization (up to 11.66% TTFT and 11.45% TPOT improvement for <=30k inputs, 1k outputs)
- Support tool calling in Furiosa LLM (Tool Calling); see the tool-calling example after this list
- Support device remapping for containers (e.g., `/dev/rngd/npu2pe0-3` → `/dev/rngd/npu0pe0-3`)
- Add the new command line tool `furiosa-llm build` to easily build an artifact from a Hugging Face model (Building a Model Artifact)
- Fix continuous batch scheduling bugs that occurred in certain ranges of sequence lengths and batch sizes
- Automatic configuration of the maximum KV-cache memory allocation
- Reduce fragmentation in runtime memory allocation
- Allow the `furiosa-mlperf` command to specify `pipeline_parallel_size` and `data_parallel_size`
- Add the `--allowed-origins` argument to `furiosa-llm serve` (OpenAI-Compatible Server)
- Fix a `trust_remote_code` support bug in furiosa-llm
- Support min-p sampling in `SamplingParams` (SamplingParams class); see the sampling sketch after this list
- Allow `npu:X` in addition to `npu:X:*` in the `devices` option
  - e.g., `furiosa-llm serve ./model --devices "npu:0"`
- The `furiosa-mlperf` command supports `npu_q_limit` and `spare_ratio`, allowing performance to be tuned
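
To make the tool-calling change concrete, here is a minimal sketch against the OpenAI-Compatible Server. Because the server speaks the OpenAI API, the standard OpenAI Python client works unchanged. The artifact path, port, model id, `--allowed-origins` value, and the `get_weather` tool below are illustrative placeholders, not part of the release notes.

```python
# Minimal tool-calling sketch against furiosa-llm's OpenAI-compatible server.
# Assumes the server was started in another shell, e.g. (values illustrative):
#   furiosa-llm serve ./artifact --allowed-origins "*"
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A hypothetical function the model may decide to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="./artifact",  # placeholder; use the model id the server reports
    messages=[{"role": "user", "content": "What is the weather in Seoul?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured call arrives here.
print(response.choices[0].message.tool_calls)
```

Note that `--allowed-origins` matters mainly for browser-based clients on a different origin (CORS); server-to-server calls like the one above do not depend on it.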
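
On min-p sampling: it keeps only tokens whose probability is at least `min_p` times that of the most likely token, so the candidate set tightens when the model is confident and widens when it is not. Below is a minimal sketch assuming the vLLM-style Python API that Furiosa LLM follows; the artifact path is illustrative, and the exact fields and return shape should be checked against the SamplingParams class documentation.

```python
# Min-p sampling sketch with the Furiosa LLM Python API (vLLM-style; assumed).
from furiosa_llm import LLM, SamplingParams

# Load a prebuilt artifact; an artifact can be produced with the new
# `furiosa-llm build` CLI (see Building a Model Artifact). Path is illustrative.
llm = LLM.load_artifact("./artifact")

# min_p=0.1 discards tokens less than 10% as likely as the top token,
# adapting the cutoff at every decoding step.
params = SamplingParams(temperature=0.8, min_p=0.1, max_tokens=128)

outputs = llm.generate(["Explain min-p sampling in one sentence."], params)
print(outputs[0].outputs[0].text)
```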