After updating to the latest SDK 2025.2.0, we can successfully run the OpenAI-Compatible Server using the following command: furiosa-llm serve furiosa-ai/Llama-3.1-8B-Instruct-FP8
However, when attempting to run the furiosa-mlperf command: RUST_BACKTRACE=full furiosa-mlperf llama-3.1-server furiosa-ai/Llama-3.1-8B-Instruct-FP8 ./fp8 --test-mode performance-only, we encountered an error:
I’m sorry for the unfriendly error message. Could you please refer to the following page for more information? Please note that the MLPerf benchmark is designed to be used with MLPerf benchmark models.
The cause of the error is as follows. The Llama 3.1 8B model you tried to use is built with the chunked prefill feature to support a 32k context window. However, the MLPerf benchmark tool does not recognize models with this feature, which leads to the error you encountered. Strictly speaking, this is a bug, but since the MLPerf benchmark tool is intended for use with MLPerf models only, it has not been fixed in 2025.2.
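For readers unfamiliar with chunked prefill: the idea is to split the prefill of a long prompt into fixed-size chunks, so the compute per step stays bounded while still supporting a large (e.g. 32k) context window. A minimal conceptual sketch follows; the function and parameter names are hypothetical and this is not the FuriosaAI implementation, just an illustration of the chunking idea.

```python
# Conceptual sketch of chunked prefill (hypothetical names, not the
# FuriosaAI implementation). A long prompt is processed in fixed-size
# chunks so each prefill step fits a bounded compute budget, while the
# model can still attend over the full (e.g. 32k) context.

def chunked_prefill(prompt_tokens, chunk_size=2048):
    """Yield successive fixed-size chunks of the prompt for incremental prefill."""
    for start in range(0, len(prompt_tokens), chunk_size):
        yield prompt_tokens[start:start + chunk_size]

# Example: a 32k-token prompt is processed as 16 chunks of 2048 tokens.
prompt = list(range(32 * 1024))
chunks = list(chunked_prefill(prompt))
print(len(chunks))  # 16
```

A benchmark harness that assumes the whole prompt is prefilled in a single step would not account for this chunked execution path, which is consistent with the mismatch described above.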