[EN]
I want to benchmark LLMs that use sliding window attention (SWA) via furiosa-llm.
Unfortunately, my model's architecture is not on your supported models list, and it seems the only way to meet my goal is to port the model definition from Hugging Face to furiosa_llm_models or furiosa_models_lang using the basic building blocks there.
However, regarding attention, I found some checks that explicitly say SWA is not supported for now:
# furiosa_models_lang/furiosa_models/attention/attention.py
class AttentionLayer(nn.Module):
    """General interface for attention mechanisms in Transformer-based models.

    This class implements a flexible attention mechanism that supports various
    Transformer-based architectures, including:
    - Large Language Models (LLMs) (e.g., GPT, LLaMA)
    - Encoder-Only models (e.g., BERT, RoBERTa)
    - Encoder-Decoder models (e.g., T5, BART)
    - Multi-modal architectures with cross-modal attention mechanisms
      (planned, not yet fully implemented).

    This interface utilizes a customizable `AttentionBackend`, allowing the application of
    attention mechanisms across different tasks and architectures.

    Args:
        num_heads (int): Number of attention heads.
        head_size (int): Dimensionality of each attention head.
        scale (float): Scaling factor for dot-product attention.
        num_kv_heads (Optional[int]): Number of key-value heads. Defaults to `num_heads`.
        cache_config (Optional[CacheConfig]): Configuration for key-value (KV) cache.
        quant_config (Optional[QuantizationConfig]): Configuration for quantization.
        alibi_slopes (Optional[List[float]]): ALiBi positional bias slopes (not implemented).
        blocksparse_params (Optional[Dict[str, Any]]): Block-sparse attention parameters
            (not implemented).
        logits_soft_cap (Optional[float]): Logit value cap (not implemented).
        per_layer_sliding_window (Optional[int]): Sliding window size for specific layers
            (not implemented).
        prefix (str): Optional prefix for identifying the layer.
        attn_type (AttentionType): Type of attention (e.g., "decoder").

    Raises:
        NotImplementedError: If an unsupported feature (e.g., ALiBi, block-sparse attention) is
            provided.
        ValueError: If the block size is not explicitly defined in `CacheConfig`, or mismatched
            with the backend's default block size.
    """

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: Optional[int] = None,
        cache_config: Optional[CacheConfig] = None,
        quant_config: Optional[QuantizationConfig] = None,
        alibi_slopes: Optional[List[float]] = None,
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        per_layer_sliding_window: Optional[int] = None,
        prefix: str = "",
        attn_type: AttentionType = AttentionType.DECODER,
    ) -> None:
        super().__init__()
        if alibi_slopes is not None:
            raise NotImplementedError("ALiBi slopes are not implemented yet.")
        if blocksparse_params is not None:
            raise NotImplementedError("Block-sparse attention is not implemented yet.")
        if logits_soft_cap is not None:
            raise NotImplementedError("Logits soft cap is not implemented yet.")
        if per_layer_sliding_window is not None:
            raise NotImplementedError("Per-layer sliding window is not implemented yet.")
# furiosa_models_lang/furiosa_models/attention/backends/llm.py
class LLMAttentionImpl(AttentionImplBase):
    """Implementation of LLM-specific attention using `PagedAttention`.

    This class manages KV cache operations and computes scaled dot-product attention.

    Args:
        num_heads (int): Number of attention heads.
        head_size (int): Size of each attention head.
        scale (float): Scaling factor for dot-product attention.
        num_kv_heads (int): Number of KV heads.
        alibi_slopes (Optional[List[float]]): Slopes for ALiBi attention.
        sliding_window (Optional[int]): Window size for local attention.
        blocksparse_params (Optional[Dict[str, Any]]): Parameters for block-sparse attention.
        logits_soft_cap (Optional[float]): Soft cap for attention logits.
        kv_cache_dtype (torch.dtype): KV cache data type (`"auto"` by default).
        attn_type (AttentionType): Type of attention (default: `"decoder"`).

    Raises:
        ValueError: If `num_heads` is not divisible by `num_kv_heads`.
        NotImplementedError: If ALiBi, sliding window, or block-sparse attention is used.
    """

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: int,
        alibi_slopes: Optional[List[float]] = None,
        sliding_window: Optional[int] = None,
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        kv_cache_dtype: torch.dtype = torch.float32,
        attn_type: AttentionType = AttentionType.DECODER,
    ) -> None:
        num_kv_heads = num_kv_heads or num_heads
        if num_heads % num_kv_heads != 0:
            raise ValueError(
                f"Number of heads ({num_heads}) must be divisible by number of "
                f"KV heads ({num_kv_heads})."
            )
        if kv_cache_dtype not in [torch.float32, torch.float16, torch.bfloat16]:
            raise ValueError(
                f"Invalid KV cache data type: {kv_cache_dtype}. "
                f"Supported types are torch.float32, torch.float16, torch.bfloat16"
            )
        if attn_type != AttentionType.DECODER:
            raise NotImplementedError(f"Attention type '{attn_type}' is not yet supported.")
        if alibi_slopes is not None:
            raise NotImplementedError("ALiBi attention is not yet supported.")
        if sliding_window is not None:
            raise NotImplementedError("Sliding window attention is not yet supported.")
        if blocksparse_params is not None:
            raise NotImplementedError("Block-sparse attention is not yet supported.")
        if logits_soft_cap is not None:
            raise NotImplementedError("Soft cap for attention logits is not yet supported.")
Given these circumstances, is there any way I can implement the model with SWA on RNGD myself, other than disabling sliding window attention and setting max-model-len equal to the window size so that the sliding window has no effect at all? If I have to use full attention for all layers, is there any way to provide per-layer attention masks so that I can simulate sliding window attention with attention masks?
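To make the second part of my question concrete, this is the kind of per-layer mask I have in mind. It is only a minimal sketch in plain PyTorch, independent of any furiosa code, and every name in it is my own:

# Minimal sketch (plain PyTorch, no furiosa code) of a per-layer additive mask.
# For SWA layers, keys further than `window_size` behind the query are masked out
# in addition to the causal constraint; full-attention layers keep only the causal part.
import torch

def sliding_window_causal_mask(seq_len: int, window_size: int,
                               dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Additive mask of shape (seq_len, seq_len): 0 where attention is allowed,
    -inf where it is not (causal, and at most `window_size` tokens back)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    allowed = (j <= i) & (j > i - window_size)  # causal and within the window
    mask = torch.zeros(seq_len, seq_len, dtype=dtype)
    mask.masked_fill_(~allowed, float("-inf"))
    return mask

def full_causal_mask(seq_len: int, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # Full causal attention is the same mask with a window at least as long as the sequence.
    return sliding_window_causal_mask(seq_len, window_size=seq_len, dtype=dtype)

# Such a float mask can be passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention (it is added to the attention scores).
# When the sequence can never exceed the window, the two masks coincide:
assert torch.equal(sliding_window_causal_mask(1024, 4096), full_causal_mask(1024))

The assert at the end is exactly why the max-model-len workaround makes SWA a no-op; what I am asking is whether I can feed a mask like sliding_window_causal_mask only to the SWA layers instead.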
[KR]
I would like to run a model that includes SWA on an RNGD card via furiosa-llm. Unfortunately, the model's architecture is not currently in furiosa-llm, furiosa-llm-models, or furiosa-models-lang, so it looks like I have to port the model definition from Hugging Face myself.
However, regarding SWA, while looking through the existing implementations I found code in the Attention module that explicitly branches off and states that SWA is not supported.
Given this situation, is there any way for me to run a model with SWA other than setting max-model-len to the sliding window size so that sliding window attention effectively never takes effect? If there is not, and every layer has to use full attention, is there a way to feed a different attention mask per layer so that the computation is semantically equivalent to sliding window attention?