Hello! We are excited to announce that Furiosa SDK 2025.1.0 has officially been published today. This release is the third major release for RNGD, and provides a streamlined stack for running LLMs on RNGD, including the driver, SoC firmware, PERT, HAL, cloud-native components, Model Compressor, Furiosa Compiler, and Furiosa LLM.
Here are the release notes and documents:
Key features and improvements in the 2025.1.0 release
- LLM latency optimization (up to 11.66% TTFT and 11.45% TPOT improvement for <=30k inputs, 1k outputs)
- Support tool calling in Furiosa LLM (Tool Calling)
- Support device remapping (e.g., `/dev/rngd/npu2pe0-3` → `/dev/rngd/npu0pe0-3`) for containers
- Add the new command-line tool `furiosa-llm build` to easily build an artifact from a Hugging Face model (Building a Model Artifact)
- Fix continuous-batching scheduling bugs that occur in certain ranges of sequence lengths and batch sizes
- Automatic configuration of the maximum KV-cache memory allocation
- Reduce fragmentation in runtime memory allocation
- Allow the `furiosa-mlperf` command to specify `pipeline_parallel_size` and `data_parallel_size`
- Add the `--allowed-origins` argument to `furiosa-llm serve` (OpenAI-Compatible Server)
- Fix a `trust_remote_code` support bug in furiosa-llm
- Support min-p sampling in `SamplingParams` (SamplingParams class)
- Allow `npu:X` in addition to `npu:X:*` in the `devices` option
  - e.g., `furiosa-llm serve ./model --devices "npu:0"`
- The `furiosa-mlperf` command supports `npu_q_limit` and `spare_ratio`, allowing performance tuning
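Since `furiosa-llm serve` exposes an OpenAI-compatible server, tool calling follows the standard OpenAI chat-completions `tools` schema. Below is a minimal sketch of a request body with one tool; the model path and the `get_weather` tool are hypothetical placeholders, not part of the SDK:

```python
import json

# Hypothetical request body for an OpenAI-compatible /v1/chat/completions
# endpoint. The "tools" entry follows the OpenAI function-calling schema.
request_body = {
    "model": "./model",  # placeholder artifact path
    "messages": [
        {"role": "user", "content": "What is the weather in Seoul?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

payload = json.dumps(request_body)
```

If the model decides to call the tool, the response will contain a `tool_calls` entry instead of plain text, as in the OpenAI API.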
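Min-p sampling keeps only the tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalizes before sampling. The following is an illustrative sketch of that filtering step, not Furiosa LLM's internal implementation:

```python
def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize.

    probs: list of token probabilities summing to 1.
    min_p: relative threshold in (0, 1], e.g. 0.2.
    """
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With min_p=0.2 and a top probability of 0.5, the cutoff is 0.1,
# so the 0.05 token is dropped and the rest are renormalized.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, min_p=0.2)
```

Unlike a fixed top-p cutoff, the min-p threshold scales with the model's confidence: a sharply peaked distribution prunes aggressively, while a flat one keeps more candidates.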