Error during inference on RNGD

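Hello.

When I run LLM inference on RNGD, the run repeatedly stalls partway through and then fails with an error.

The point of failure appears to vary with the model size and with the token length of the dataset. However, once a run stops at a particular point, for example the 117th request, retrying with the same options and model fails at the 117th request again.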
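Because the failure recurs at the same request index, it may help to record which input is in flight when the process dies. Below is a minimal sketch of that idea, not the script actually used here: `generate_fn` is a hypothetical placeholder for whatever call performs inference, and the character count is only a rough stand-in for token length (a tokenizer-based count could be substituted). Each record is flushed to disk before the call, so it survives an abort.

```python
import json
import os
from typing import Callable, Iterable


def run_with_progress_log(
    prompts: Iterable[str],
    generate_fn: Callable[[str], str],  # hypothetical: wraps the real inference call
    log_path: str = "progress.jsonl",
) -> list[str]:
    """Run generate_fn on each prompt, logging the index before each call."""
    outputs = []
    with open(log_path, "a", encoding="utf-8") as log:
        for index, prompt in enumerate(prompts):
            # Write and sync the record first, so it is on disk even if the
            # process aborts inside generate_fn.
            record = {"index": index, "prompt_chars": len(prompt)}
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())
            outputs.append(generate_fn(prompt))
    return outputs


def last_attempted(log_path: str = "progress.jsonl") -> dict:
    """Return the last record written, i.e. the request that was in flight."""
    with open(log_path, encoding="utf-8") as log:
        lines = [line for line in log if line.strip()]
    return json.loads(lines[-1])
```

After a crash, `last_attempted()` gives the index and size of the input being processed, which can then be submitted on its own to check whether that single input triggers the failure.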

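I am currently running LLM inference on an Elice Cloud (앨리스 클라우드) server equipped with a single RNGD card.

The model being loaded is furiosa-ai/Qwen2.5-Coder-7B-Instruct.

I have updated the environment as described in the docs.

The environment is as follows: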

pip list | grep furi
furiosa-llm                   2025.3.4
furiosa-llm-models            2025.3.0
furiosa-model-compressor      2025.3.0
furiosa-model-compressor-impl 2025.3.0
furiosa-models-lang           2025.3.0
furiosa-native-compiler       2025.3.1
furiosa-native-llm-common     2025.3.1
furiosa-native-runtime        2025.3.2
furiosa-smi-py                2025.3.0
furiosa-torch-ext             2025.3.1
furiosa-smi info
+-------+------+--------+------------------+------------------+---------+---------+--------------+
| Index | Arch | Device | Firmware         | PERT             | Temp.   | Power   | PCI-BDF      |
+-------+------+--------+------------------+------------------+---------+---------+--------------+
|   0   | rngd |  npu4  | 2025.3.1+bbbbe52 | 2025.3.1+52e5705 | 33.34°C | 35.52 W | 0000:ae:00.0 |
+-------+------+--------+------------------+------------------+---------+---------+--------------+
apt list | grep furiosa

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

furiosa-bench/now 0.10.3-3 amd64 [installed,local]
furiosa-compiler/now 2025.3.0-3 amd64 [installed,local]
furiosa-driver-rngd/now 2025.3.1-3 all [installed,local]
furiosa-firmware-image-rngd/now 2025.3.1 all [installed,local]
furiosa-firmware-tools-rngd/now 2025.3.1-3 amd64 [installed,local]
furiosa-libcompiler/now 0.10.1-3 amd64 [installed,local]
furiosa-libhal-warboy/now 0.12.0-3 amd64 [installed,local]
furiosa-libnux/now 0.10.1-3 amd64 [installed,local]
furiosa-libsmi/now 2025.3.0-3 amd64 [installed,local]
furiosa-pert-rngd/now 2025.3.1-3 amd64 [installed,local]
furiosa-smi/now 2025.3.0-3 amd64 [installed,local]
furiosa-toolkit/now 0.11.0-3 amd64 [installed,local]

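The error below occurs while running inference over a dataset prepared for model evaluation and inference testing.

With a 14B model, the same error sometimes occurred earlier in the run.

The error occurs consistently, but I have not been able to identify the cause, so I am asking for help.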

Processing:  49%|███████████████████████████████████████████▊                                              | 7308/15000 [41:47<41:26,  3.09it/s]2025-11-12T10:42:19.640940505Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▊                                              | 7309/15000 [41:47<41:17,  3.10it/s]2025-11-12T10:42:19.961309572Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▊                                              | 7310/15000 [41:47<43:15,  2.96it/s]2025-11-12T10:42:20.333402769Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▊                                              | 7311/15000 [41:48<41:29,  3.09it/s]2025-11-12T10:42:20.625068775Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▊                                              | 7312/15000 [41:48<42:04,  3.04it/s]2025-11-12T10:42:20.964313925Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7313/15000 [41:48<41:24,  3.09it/s]2025-11-12T10:42:21.275457007Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7314/15000 [41:49<40:47,  3.14it/s]2025-11-12T10:42:21.58286433Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7315/15000 [41:49<42:22,  3.02it/s]2025-11-12T10:42:21.942428037Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7316/15000 [41:49<41:30,  3.08it/s]2025-11-12T10:42:22.251342872Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7317/15000 [41:50<41:00,  3.12it/s]2025-11-12T10:42:22.562960136Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7318/15000 [41:50<41:18,  3.10it/s]2025-11-12T10:42:22.890509389Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7319/15000 [41:50<40:36,  3.15it/s]2025-11-12T10:42:23.196059709Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7320/15000 [41:51<42:37,  3.00it/s]2025-11-12T10:42:23.56522525Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7321/15000 [41:51<42:28,  3.01it/s]2025-11-12T10:42:23.893866566Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7322/15000 [41:51<41:43,  3.07it/s]2025-11-12T10:42:24.20620153Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7323/15000 [41:52<40:54,  3.13it/s]2025-11-12T10:42:24.512590109Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7324/15000 [41:52<42:21,  3.02it/s]2025-11-12T10:42:24.869532084Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7325/15000 [41:52<41:49,  3.06it/s]2025-11-12T10:42:25.187321011Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7326/15000 [41:53<40:38,  3.15it/s]2025-11-12T10:42:25.483135561Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7327/15000 [41:53<39:11,  3.26it/s]2025-11-12T10:42:25.763111134Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7328/15000 [41:53<39:41,  3.22it/s]2025-11-12T10:42:26.083347148Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7329/15000 [41:53<40:44,  3.14it/s]2025-11-12T10:42:26.421144592Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7330/15000 [41:54<39:16,  3.26it/s]2025-11-12T10:42:26.700366974Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: vulnerable
Processing:  49%|███████████████████████████████████████████▉                                              | 7331/15000 [41:54<39:39,  3.22it/s]2025-11-12T10:42:27.017917811Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7332/15000 [41:54<38:08,  3.35it/s]2025-11-12T10:42:27.289145159Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
outputs: safe
Processing:  49%|███████████████████████████████████████████▉                                              | 7333/15000 [41:55<37:46,  3.38it/s]2025-11-12T10:42:27.57947609Z  INFO furiosa_generator::scheduler::hf_compat: num samples received: 1
2025-11-12T10:43:29.941093673Z ERROR furiosa_generator::scheduler::generator: Some internal error occurred: sprinter error: Connection timed out (os error 110)

thread 'tokio-runtime-worker' panicked at /home/furiosa/furiosa-ai/furiosa-runtime/furiosa-generator/src/scheduler/generator.rs:408:17:
417 blocks are not reclaimed
[577536, 577725, 577646, 577835, 577567, 577756, 577488, 577677, 577866, 577598, 577787, 577708, 577440, 577629, 577818, 577550, 577739, 577660, 577849, 577581, 577770, 577691, 577880, 577612, 577801, 577533, 577722, 577643, 577832, 577564, 577753, 577485, 577674, 577863, 577595, 577784, 577516, 577705, 577626, 577815, 577547, 577736, 577657, 577846, 577578, 577767, 577499, 577688, 577877, 577609, 577798, 577530, 577719, 577640, 577829, 577561, 577750, 577482, 577671, 577860, 577403, 577592, 577781, 577702, 577891, 577623, 577812, 577544, 577733, 577654, 577843, 577386, 577575, 577764, 577685, 577874, 577606, 577795, 577527, 577716, 577637, 577826, 577369, 577558, 577747, 577668, 577857, 577589, 577778, 577699, 577888, 577620, 577809, 577541, 577730, 577651, 577840, 577572, 577761, 577493, 577682, 577871, 577603, 577792, 577524, 577713, 577445, 577634, 577823, 577555, 577744, 577665, 577854, 577397, 577586, 577775, 577129, 577318, 577696, 577885, 577617, 577806, 577538, 577727, 577648, 577837, 577569, 577758, 577490, 577679, 577868, 577600, 577789, 577521, 577710, 577631, 577820, 577552, 577741, 577473, 577662, 577851, 577583, 577772, 577504, 577693, 577882, 577614, 577803, 577535, 577724, 577645, 577834, 577566, 577755, 577487, 577676, 577865, 577597, 577786, 577707, 577439, 577628, 577817, 577549, 577738, 577659, 577848, 577580, 577769, 577690, 577879, 577611, 577800, 577532, 577721, 577453, 577642, 577831, 577563, 577752, 577484, 577673, 577862, 577594, 577783, 577704, 577625, 577814, 577546, 577735, 577656, 577845, 577577, 577766, 577498, 577687, 577876, 577419, 577608, 577797, 577529, 577718, 577639, 577828, 577560, 577749, 577670, 577859, 577402, 577591, 577780, 577701, 577890, 577622, 577811, 577543, 577732, 577653, 577842, 577574, 577763, 577684, 577873, 577605, 577794, 577526, 577715, 577636, 577825, 577557, 577746, 577667, 577856, 577588, 577777, 577698, 577887, 577619, 577808, 577540, 577729, 577461, 577650, 577839, 577571, 577760, 577492, 577681, 
577870, 577602, 577791, 577523, 577712, 577444, 577633, 577822, 577554, 577743, 577475, 577664, 577853, 577396, 577585, 577774, 577695, 577884, 577616, 577805, 577348, 577537, 577726, 577647, 577836, 577568, 577757, 577489, 577678, 577867, 577599, 577788, 577520, 577709, 577630, 577819, 577551, 577740, 577661, 577850, 577582, 577771, 577692, 577881, 577613, 577802, 577534, 577723, 577644, 577833, 577565, 577754, 577675, 577864, 577596, 577785, 577517, 577706, 577627, 577816, 577548, 577737, 577658, 577847, 577579, 577768, 577500, 577689, 577878, 577610, 577799, 577531, 577720, 577452, 577641, 577830, 577562, 577751, 577672, 577861, 577593, 577782, 577703, 577892, 577624, 577813, 577545, 577734, 577655, 577844, 577387, 577576, 577765, 577497, 577686, 577875, 577418, 577607, 577796, 577528, 577717, 577638, 577827, 577370, 577559, 577748, 577669, 577858, 577590, 577779, 577700, 577889, 577621, 577810, 577542, 577731, 577652, 577841, 577573, 577762, 577683, 577872, 577604, 577793, 577525, 577714, 577635, 577824, 577556, 577745, 577666, 577855, 577587, 577776, 577130, 577319, 577508, 577697, 577886, 577618, 577807, 577539, 577728, 577649, 577838, 577570, 577759, 577680, 577869, 577601, 577790, 577522, 577711, 577632, 577821, 577553, 577742, 577474, 577663, 577852, 577584, 577773, 577505, 577694, 577883, 577615, 577804, 577347]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2025-11-12T10:43:29.941643504Z ERROR furiosa_generator::scheduler::generator: Furiosa LLM Engine terminated - pending/subsequent requests will fail
2025-11-12T10:43:30.530987028Z  INFO device_runtime::device::threaded::npu: InstructionStream channel has been disconnected

thread '<unnamed>' panicked at /home/furiosa/.cargo/git/checkouts/device-runtime-6898824721bec299/7a920e3/device-runtime/src/device/threaded/npu.rs:82:60:
called `Result::unwrap()` on an `Err` value: Npu(Buf(Os { code: 110, kind: TimedOut, message: "Connection timed out" }))

thread '<unnamed>' panicked at /home/furiosa/.cargo/git/checkouts/device-runtime-6898824721bec299/7a920e3/device-runtime/src/device/native/npu.rs:43:13:
Close not called before drop
stack backtrace:
   0:     0x73ee5c1cedf2 - <unknown>
   1:     0x73ee5c1f7963 - <unknown>
   2:     0x73ee5c1cbd73 - <unknown>
   3:     0x73ee5c1cec42 - <unknown>
   4:     0x73ee5c1d007c - <unknown>
   5:     0x73ee5c1cfe7f - <unknown>
   6:     0x73ee5c1d0aa2 - <unknown>
   7:     0x73ee5c1d07f6 - <unknown>
   8:     0x73ee5c1cf2f9 - <unknown>
   9:     0x73ee5c1d04bd - <unknown>
  10:     0x73ee5c1f4d00 - <unknown>
  11:     0x73ee5c0f2ddf - <unknown>
  12:     0x73ee5c0f987b - <unknown>
  13:     0x73ee5c0e373f - <unknown>
  14:     0x73ee5c0feb61 - <unknown>
  15:     0x73ee5c10f085 - <unknown>
  16:     0x73ee5c0fe4f2 - <unknown>
  17:     0x73ee5c0c96da - <unknown>
  18:     0x73ee5c0d4ed4 - <unknown>
  19:     0x73ee5c0ca0a3 - <unknown>
  20:     0x73ee5c1d271b - <unknown>
  21:     0x73ef8cdbeac3 - <unknown>
  22:     0x73ef8ce508c0 - <unknown>

thread '<unnamed>' panicked at library/core/src/panicking.rs:226:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
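The log above suggests re-running with `RUST_BACKTRACE=1`. A minimal sketch of one way to do that from Python while saving stderr to a file for sharing; `run_inference.py` in the usage note is a hypothetical script name, and frames may still show as `<unknown>` if the installed binaries lack debug symbols.

```python
import os
import subprocess
import sys


def run_with_backtrace(command: list[str], stderr_path: str = "stderr.log") -> int:
    """Run command with RUST_BACKTRACE=1 set and stderr saved to stderr_path."""
    # Copy the current environment and add the variable named in the panic note.
    env = dict(os.environ, RUST_BACKTRACE="1")
    with open(stderr_path, "wb") as err:
        completed = subprocess.run(command, env=env, stderr=err)
    return completed.returncode
```

For example, `run_with_backtrace([sys.executable, "run_inference.py"])` would launch the inference script in a child process and leave its panic output in `stderr.log`.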

Hello, and apologies for the inconvenience. Could you describe the error in more detail, including the error message and the symptoms you are seeing?