Hello. I have a question about an error I ran into while building on Furiosa RNGD.
Referring to the earlier post on throughput performance optimization, I attempted a model artifact build as follows:
from furiosa_llm.artifact.builder import ArtifactBuilder

# Bucket shape: (batch_size, sequence_length)
RELEASE_PREFILL_BUCKETS = [
    (1, 256), (1, 320), (1, 384), (1, 512), (1, 640), (1, 768), (1, 1024), (2, 1024), (4, 1024),
]
RELEASE_DECODE_BUCKETS = [
    *[(1, 1024), (4, 1024), (8, 1024), (16, 1024), (32, 1024), (64, 1024)],
    *[(1, 2048), (4, 2048), (8, 2048), (16, 2048), (32, 2048)],
    *[(1, 4096), (4, 4096), (8, 4096), (16, 4096), (32, 4096)],
    *[(1, 8192), (4, 8192), (8, 8192), (16, 8192)],
    *[(1, 16384), (4, 16384), (8, 16384)],
    *[(1, 32768), (4, 32768)],
]

builder = ArtifactBuilder(
    "meta-llama/Llama-3.1-8B-Instruct",
    "<any name>",
    tensor_parallel_size=1,
    prefill_buckets=RELEASE_PREFILL_BUCKETS,
    decode_buckets=RELEASE_DECODE_BUCKETS,
    max_seq_len_to_capture=32 * 1024,
    prefill_chunk_size=8 * 1024,
)
builder.build(
    "./Output-Llama-3.1-8B-Instruct",
    num_pipeline_builder_workers=1,
    num_compile_workers=1,
)
Compilation failed for Quantized_furiosa_llm_models.llama3.symbolic.aramco_specdec.LlamaForCausalLM-kv16384-b1-attn24576 with the following error:
ERROR: failed to lower the operator2944(no tactic):
AttentionKernel(mask_type: BoolCondition, mask_tagged_shape: [2_1=8192, 5_1=24576], key: 5)
sparsity: FullSquare(2=24576[16384..24576], 5=24576(R))/batches:[]/-
name: attention_matmul_1_softmax__matmul#Kernelized:2944#PreLower:2944
input tensors: 4
input tensor 1299: [8, 4, 8192, 2, 64], 67108864 B, bf16
source: unknown
input tensor 1300: [24576, 8, 2, 64], 50331648 B, bf16
source: unknown
input tensor 1301: [8192, 24576], 201326592 B, bool
source: unknown
input tensor 1302: [24576, 8, 2, 64], 50331648 B, bf16
source: unknown
total bytes: 369098752
output tensors: 1
output tensor 1303: [8, 4, 8192, 2, 64], 67108864 B, bf16
source: unknown
total bytes: 67108864
Encountered exception!
non_shared_configs(w/o past_kv): xxx (omitted; too long to paste)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/parallelize/pipeline/builder/converter.py", line 819, in compile_gm_and_get_preprocessed_gm_hash
    compiled = compile(
RuntimeError: fail to compile: Other error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/etri_workspace/furiosa-npu-experiments/compile_llama_manual_bucket.py", line 26, in <module>
    builder.build("./Output-Llama-3.1-8B-Instruct",
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/artifact/builder.py", line 687, in build
    target_model_artifact, target_model_pipelines = self._build_model_artifact(
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/artifact/builder.py", line 590, in _build_model_artifact
    pipelines_with_metadata = build_pipelines(
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/artifact/helper.py", line 285, in build_pipelines
    pipelines = pipeline_builder.build_pipelines(
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/parallelize/pipeline/builder/api.py", line 844, in build_pipelines
    return PipelineBuilder.compile_supertasks_in_parallel(
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/parallelize/pipeline/builder/api.py", line 1036, in compile_supertasks_in_parallel
    _compile_supertasks_in_pipeline(
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/parallelize/pipeline/builder/api.py", line 250, in _compile_supertasks_in_pipeline
    compile_result, hash_val = GraphModuleConverter.compile_gm_and_get_preprocessed_gm_hash(
  File "/usr/local/lib/python3.10/dist-packages/furiosa_llm/parallelize/pipeline/builder/converter.py", line 837, in compile_gm_and_get_preprocessed_gm_hash
    raise RuntimeError(f"Compilation failed with error {e}")
RuntimeError: Compilation failed with error fail to compile: Other error
- Cause of the error: Is it correct that this happens because the KV cache (16,384) and sequence length (24,576) are large, yet the build uses only TP=1? (A quick size check on the failing kernel is sketched after this list.)
- How to compile a TP=1 model with these model artifacts: I would like to obtain a model compiled at TP=1. Would it be enough to exclude just (1, 16384) from RELEASE_DECODE_BUCKETS (see the sketch after this list)? If there is a model artifact configuration suitable for TP=1, please share it!
- Relationship between the model artifacts and BucketWithOutputLogitsSize: the model artifacts you shared add up to 34 (prefill + decode), but 38 were actually generated, with the 4 configurations below added. I would appreciate it if you could explain what determines which configurations are generated (my guess at the pattern is sketched after this list).
  Added configurations: (kv0-b1-attn8192), (kv8192-b1-attn16384), (kv16384-b1-attn24576), (kv24576-b1-attn32768)
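For reference on the first question, here is a quick size check I did on the failing kernel, recomputing the byte counts from the tensor shapes in the log above (which tensor is the query/key/value is only my guess from the kernel name and dtypes):

import math

def nbytes(shape, itemsize):
    return math.prod(shape) * itemsize

attention_inputs = [
    ([8, 4, 8192, 2, 64], 2),  # tensor 1299, bf16 (query?)
    ([24576, 8, 2, 64], 2),    # tensor 1300, bf16 (key?)
    ([8192, 24576], 1),        # tensor 1301, bool (the BoolCondition mask)
    ([24576, 8, 2, 64], 2),    # tensor 1302, bf16 (value?)
]
total = sum(nbytes(shape, itemsize) for shape, itemsize in attention_inputs)
print(total, total / 2**20)  # 369098752 B == 352 MiB, matching "total bytes" in the log

For the TP=1 question, this is the change I am considering; I am not sure removing that one tuple actually eliminates the failing kv16384-b1-attn24576 configuration, hence the question:

# Hypothetical TP=1 bucket list: identical to RELEASE_DECODE_BUCKETS minus (1, 16384).
TP1_DECODE_BUCKETS = [b for b in RELEASE_DECODE_BUCKETS if b != (1, 16384)]

For the last question, my guess (purely from the names) is that the 4 extra configurations are chunked-prefill buckets derived from prefill_chunk_size and max_seq_len_to_capture rather than from the bucket lists:

prefill_chunk_size = 8 * 1024       # 8192, as passed to ArtifactBuilder
max_seq_len_to_capture = 32 * 1024  # 32768, as passed to ArtifactBuilder

# Guess: one batch-1 bucket per chunk boundary, with
# attention_size = kv_cache_size + prefill_chunk_size.
extra = [(kv, kv + prefill_chunk_size)
         for kv in range(0, max_seq_len_to_capture, prefill_chunk_size)]
print(extra)
# [(0, 8192), (8192, 16384), (16384, 24576), (24576, 32768)]
# i.e. kv0-attn8192, kv8192-attn16384, kv16384-attn24576, kv24576-attn32768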