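Hello.
When I run LLM inference in an RNGD environment, the run repeatedly stalls partway through and then fails with an error.
The point of failure appears to vary with the model size and with the token length of the dataset. The failure is also reproducible: if a run stops at a specific point, for example the 117th request, rerunning with the same options and model fails at the 117th request again.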
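Because the failure recurs at the same request index, one way to narrow it down is to record each request's index and token count on disk before it is sent, so the offending sample can be identified even after the process aborts. The following is only a minimal sketch: `generate_fn` and `count_tokens` are hypothetical callables standing in for the actual inference call and tokenizer, and `trace.jsonl` is an arbitrary file name.

```python
import json
import os
from typing import Callable, List, Optional, Sequence


def _write(trace, record: dict) -> None:
    # Flush and fsync so the record survives an abort of the process.
    trace.write(json.dumps(record) + "\n")
    trace.flush()
    os.fsync(trace.fileno())


def run_with_trace(
    prompts: Sequence[str],
    generate_fn: Callable[[str], str],   # hypothetical wrapper around the inference call
    count_tokens: Callable[[str], int],  # hypothetical, e.g. a tokenizer-based length
    trace_path: str = "trace.jsonl",
) -> List[str]:
    """Run prompts one by one, logging a 'started' and a 'done' record for each."""
    outputs = []
    with open(trace_path, "a", encoding="utf-8") as trace:
        for index, prompt in enumerate(prompts):
            _write(trace, {"index": index,
                           "num_tokens": count_tokens(prompt),
                           "status": "started"})
            outputs.append(generate_fn(prompt))
            _write(trace, {"index": index, "status": "done"})
    return outputs


def find_unfinished(trace_path: str) -> Optional[dict]:
    """Return the 'started' record with the highest index that has no 'done' record."""
    started = {}
    with open(trace_path, encoding="utf-8") as trace:
        for line in trace:
            record = json.loads(line)
            if record["status"] == "started":
                started[record["index"]] = record
            else:
                started.pop(record["index"], None)
    if not started:
        return None
    return max(started.values(), key=lambda r: r["index"])
```

After a crash, `find_unfinished("trace.jsonl")` gives the index and token count of the request that was in flight. Sending that one sample on its own, and comparing its token count with the configured limits, would show whether the failure depends on that input or on accumulated state from earlier requests.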
I am currently running LLM inference on an Elice Cloud server with a single RNGD card installed.
The model being loaded is furiosa-ai/Qwen2.5-Coder-7B-Instruct.
I have updated the environment according to the docs.
The environment is as follows:
pip list | grep furi
furiosa-llm 2025.3.4
furiosa-llm-models 2025.3.0
furiosa-model-compressor 2025.3.0
furiosa-model-compressor-impl 2025.3.0
furiosa-models-lang 2025.3.0
furiosa-native-compiler 2025.3.1
furiosa-native-llm-common 2025.3.1
furiosa-native-runtime 2025.3.2
furiosa-smi-py 2025.3.0
furiosa-torch-ext 2025.3.1
furiosa-smi info
+-------+------+--------+------------------+------------------+---------+---------+--------------+
| Index | Arch | Device | Firmware | PERT | Temp. | Power | PCI-BDF |
+-------+------+--------+------------------+------------------+---------+---------+--------------+
| 0 | rngd | npu4 | 2025.3.1+bbbbe52 | 2025.3.1+52e5705 | 33.34°C | 35.52 W | 0000:ae:00.0 |
+-------+------+--------+------------------+------------------+---------+---------+--------------+
apt list | grep furiosa
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
furiosa-bench/now 0.10.3-3 amd64 [installed,local]
furiosa-compiler/now 2025.3.0-3 amd64 [installed,local]
furiosa-driver-rngd/now 2025.3.1-3 all [installed,local]
furiosa-firmware-image-rngd/now 2025.3.1 all [installed,local]
furiosa-firmware-tools-rngd/now 2025.3.1-3 amd64 [installed,local]
furiosa-libcompiler/now 0.10.1-3 amd64 [installed,local]
furiosa-libhal-warboy/now 0.12.0-3 amd64 [installed,local]
furiosa-libnux/now 0.10.1-3 amd64 [installed,local]
furiosa-libsmi/now 2025.3.0-3 amd64 [installed,local]
furiosa-pert-rngd/now 2025.3.1-3 amd64 [installed,local]
furiosa-smi/now 2025.3.0-3 amd64 [installed,local]
furiosa-toolkit/now 0.11.0-3 amd64 [installed,local]
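While running inference over the dataset I prepared for model evaluation and inference testing, the error below occurs.
With a 14B model, the same error sometimes occurred earlier in the run.
The error occurs consistently, but I have not been able to identify the cause, so I am asking for help.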
Processing: 49%|███████████████████████████████████████████▊ | 7308/15000 [41:47<41:26, 3.09it/s]2025-11-12T10:42:19.640940505Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▊ | 7309/15000 [41:47<41:17, 3.10it/s]2025-11-12T10:42:19.961309572Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▊ | 7310/15000 [41:47<43:15, 2.96it/s]2025-11-12T10:42:20.333402769Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▊ | 7311/15000 [41:48<41:29, 3.09it/s]2025-11-12T10:42:20.625068775Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▊ | 7312/15000 [41:48<42:04, 3.04it/s]2025-11-12T10:42:20.964313925Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7313/15000 [41:48<41:24, 3.09it/s]2025-11-12T10:42:21.275457007Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7314/15000 [41:49<40:47, 3.14it/s]2025-11-12T10:42:21.58286433Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7315/15000 [41:49<42:22, 3.02it/s]2025-11-12T10:42:21.942428037Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7316/15000 [41:49<41:30, 3.08it/s]2025-11-12T10:42:22.251342872Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7317/15000 [41:50<41:00, 3.12it/s]2025-11-12T10:42:22.562960136Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7318/15000 [41:50<41:18, 3.10it/s]2025-11-12T10:42:22.890509389Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7319/15000 [41:50<40:36, 3.15it/s]2025-11-12T10:42:23.196059709Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7320/15000 [41:51<42:37, 3.00it/s]2025-11-12T10:42:23.56522525Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7321/15000 [41:51<42:28, 3.01it/s]2025-11-12T10:42:23.893866566Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7322/15000 [41:51<41:43, 3.07it/s]2025-11-12T10:42:24.20620153Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7323/15000 [41:52<40:54, 3.13it/s]2025-11-12T10:42:24.512590109Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7324/15000 [41:52<42:21, 3.02it/s]2025-11-12T10:42:24.869532084Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7325/15000 [41:52<41:49, 3.06it/s]2025-11-12T10:42:25.187321011Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7326/15000 [41:53<40:38, 3.15it/s]2025-11-12T10:42:25.483135561Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7327/15000 [41:53<39:11, 3.26it/s]2025-11-12T10:42:25.763111134Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7328/15000 [41:53<39:41, 3.22it/s]2025-11-12T10:42:26.083347148Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7329/15000 [41:53<40:44, 3.14it/s]2025-11-12T10:42:26.421144592Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7330/15000 [41:54<39:16, 3.26it/s]2025-11-12T10:42:26.700366974Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing: 49%|███████████████████████████████████████████▉ | 7331/15000 [41:54<39:39, 3.22it/s]2025-11-12T10:42:27.017917811Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7332/15000 [41:54<38:08, 3.35it/s]2025-11-12T10:42:27.289145159Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing: 49%|███████████████████████████████████████████▉ | 7333/15000 [41:55<37:46, 3.38it/s]2025-11-12T10:42:27.57947609Z INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
2025-11-12T10:43:29.941093673Z ERROR furiosa_generator::scheduler::generator: Some internal error occurred: sprinter error: Connection timed out (os error 110)
thread 'tokio-runtime-worker' panicked at /home/furiosa/furiosa-ai/furiosa-runtime/furiosa-generator/src/scheduler/generator.rs:408:17:
417 blocks are not reclaimed
[577536, 577725, 577646, 577835, 577567, 577756, 577488, 577677, 577866, 577598, 577787, 577708, 577440, 577629, 577818, 577550, 577739, 577660, 577849, 577581, 577770, 577691, 577880, 577612, 577801, 577533, 577722, 577643, 577832, 577564, 577753, 577485, 577674, 577863, 577595, 577784, 577516, 577705, 577626, 577815, 577547, 577736, 577657, 577846, 577578, 577767, 577499, 577688, 577877, 577609, 577798, 577530, 577719, 577640, 577829, 577561, 577750, 577482, 577671, 577860, 577403, 577592, 577781, 577702, 577891, 577623, 577812, 577544, 577733, 577654, 577843, 577386, 577575, 577764, 577685, 577874, 577606, 577795, 577527, 577716, 577637, 577826, 577369, 577558, 577747, 577668, 577857, 577589, 577778, 577699, 577888, 577620, 577809, 577541, 577730, 577651, 577840, 577572, 577761, 577493, 577682, 577871, 577603, 577792, 577524, 577713, 577445, 577634, 577823, 577555, 577744, 577665, 577854, 577397, 577586, 577775, 577129, 577318, 577696, 577885, 577617, 577806, 577538, 577727, 577648, 577837, 577569, 577758, 577490, 577679, 577868, 577600, 577789, 577521, 577710, 577631, 577820, 577552, 577741, 577473, 577662, 577851, 577583, 577772, 577504, 577693, 577882, 577614, 577803, 577535, 577724, 577645, 577834, 577566, 577755, 577487, 577676, 577865, 577597, 577786, 577707, 577439, 577628, 577817, 577549, 577738, 577659, 577848, 577580, 577769, 577690, 577879, 577611, 577800, 577532, 577721, 577453, 577642, 577831, 577563, 577752, 577484, 577673, 577862, 577594, 577783, 577704, 577625, 577814, 577546, 577735, 577656, 577845, 577577, 577766, 577498, 577687, 577876, 577419, 577608, 577797, 577529, 577718, 577639, 577828, 577560, 577749, 577670, 577859, 577402, 577591, 577780, 577701, 577890, 577622, 577811, 577543, 577732, 577653, 577842, 577574, 577763, 577684, 577873, 577605, 577794, 577526, 577715, 577636, 577825, 577557, 577746, 577667, 577856, 577588, 577777, 577698, 577887, 577619, 577808, 577540, 577729, 577461, 577650, 577839, 577571, 577760, 577492, 577681, 
577870, 577602, 577791, 577523, 577712, 577444, 577633, 577822, 577554, 577743, 577475, 577664, 577853, 577396, 577585, 577774, 577695, 577884, 577616, 577805, 577348, 577537, 577726, 577647, 577836, 577568, 577757, 577489, 577678, 577867, 577599, 577788, 577520, 577709, 577630, 577819, 577551, 577740, 577661, 577850, 577582, 577771, 577692, 577881, 577613, 577802, 577534, 577723, 577644, 577833, 577565, 577754, 577675, 577864, 577596, 577785, 577517, 577706, 577627, 577816, 577548, 577737, 577658, 577847, 577579, 577768, 577500, 577689, 577878, 577610, 577799, 577531, 577720, 577452, 577641, 577830, 577562, 577751, 577672, 577861, 577593, 577782, 577703, 577892, 577624, 577813, 577545, 577734, 577655, 577844, 577387, 577576, 577765, 577497, 577686, 577875, 577418, 577607, 577796, 577528, 577717, 577638, 577827, 577370, 577559, 577748, 577669, 577858, 577590, 577779, 577700, 577889, 577621, 577810, 577542, 577731, 577652, 577841, 577573, 577762, 577683, 577872, 577604, 577793, 577525, 577714, 577635, 577824, 577556, 577745, 577666, 577855, 577587, 577776, 577130, 577319, 577508, 577697, 577886, 577618, 577807, 577539, 577728, 577649, 577838, 577570, 577759, 577680, 577869, 577601, 577790, 577522, 577711, 577632, 577821, 577553, 577742, 577474, 577663, 577852, 577584, 577773, 577505, 577694, 577883, 577615, 577804, 577347]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2025-11-12T10:43:29.941643504Z ERROR furiosa_generator::scheduler::generator: Furiosa LLM Engine terminated - pending/subsequent requests will fail
2025-11-12T10:43:30.530987028Z INFO device_runtime::device::threaded::npu: InstructionStream channel has been disconnected
thread '<unnamed>' panicked at /home/furiosa/.cargo/git/checkouts/device-runtime-6898824721bec299/7a920e3/device-runtime/src/device/threaded/npu.rs:82:60:
called `Result::unwrap()` on an `Err` value: Npu(Buf(Os { code: 110, kind: TimedOut, message: "Connection timed out" }))
thread '<unnamed>' panicked at /home/furiosa/.cargo/git/checkouts/device-runtime-6898824721bec299/7a920e3/device-runtime/src/device/native/npu.rs:43:13:
Close not called before drop
stack backtrace:
0: 0x73ee5c1cedf2 - <unknown>
1: 0x73ee5c1f7963 - <unknown>
2: 0x73ee5c1cbd73 - <unknown>
3: 0x73ee5c1cec42 - <unknown>
4: 0x73ee5c1d007c - <unknown>
5: 0x73ee5c1cfe7f - <unknown>
6: 0x73ee5c1d0aa2 - <unknown>
7: 0x73ee5c1d07f6 - <unknown>
8: 0x73ee5c1cf2f9 - <unknown>
9: 0x73ee5c1d04bd - <unknown>
10: 0x73ee5c1f4d00 - <unknown>
11: 0x73ee5c0f2ddf - <unknown>
12: 0x73ee5c0f987b - <unknown>
13: 0x73ee5c0e373f - <unknown>
14: 0x73ee5c0feb61 - <unknown>
15: 0x73ee5c10f085 - <unknown>
16: 0x73ee5c0fe4f2 - <unknown>
17: 0x73ee5c0c96da - <unknown>
18: 0x73ee5c0d4ed4 - <unknown>
19: 0x73ee5c0ca0a3 - <unknown>
20: 0x73ee5c1d271b - <unknown>
21: 0x73ef8cdbeac3 - <unknown>
22: 0x73ef8ce508c0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:226:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
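In the log above, the last INFO line (10:42:27) and the first ERROR line (10:43:29) are about 62 seconds apart, which suggests the device stopped responding and a timeout expired, not that the request failed immediately. The sketch below measures that gap from a saved log so runs can be compared. It assumes only the timestamp and level format visible in the log above; the function names and the log file name are illustrative.

```python
import re
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Matches e.g. "2025-11-12T10:42:27.57947609Z INFO". Uses search-style matching
# because the progress bar is fused onto the same line as the timestamp.
LOG_EVENT = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z\s+(TRACE|DEBUG|INFO|WARN|ERROR)\b"
)


def parse_events(text: str) -> List[Tuple[datetime, str]]:
    """Extract (timestamp, level) pairs in the order they appear."""
    events = []
    for match in LOG_EVENT.finditer(text):
        seconds, fraction, level = match.groups()
        # strptime's %f accepts at most 6 digits, so truncate nanoseconds.
        stamp = datetime.strptime(
            f"{seconds}.{fraction[:6]}", "%Y-%m-%dT%H:%M:%S.%f"
        ).replace(tzinfo=timezone.utc)
        events.append((stamp, level))
    return events


def stall_before_first_error(text: str) -> Optional[float]:
    """Seconds between the first ERROR and the last event logged before it."""
    previous = None
    for stamp, level in parse_events(text):
        if level == "ERROR":
            if previous is None:
                return None
            return (stamp - previous).total_seconds()
        previous = stamp
    return None
```

For example, `stall_before_first_error(open("run.log").read())`, where `run.log` is a hypothetical file holding the captured output. If the gap is the same in every failing run regardless of model size, that would point to a fixed timeout being hit.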