Models with sliding window attention

I want to benchmark LLMs that use sliding window attention (SWA) via furiosa-llm.
Unfortunately, my model's architecture is not on your supported models list, and it seems the only way to meet my goal is to port the model definition from Hugging Face into furiosa_llm_models / furiosa_models_lang using the basic building blocks there.

However, regarding attention, I found some checks that explicitly state that SWA is not supported for now:

# furiosa_models_lang/furiosa_models/attention/attention.py

class AttentionLayer(nn.Module):
    """General interface for attention mechanisms in Transformer-based models.

    This class implements a flexible attention mechanism that supports various
    Transformer-based architectures, including:
      - Large Language Models (LLMs) (e.g., GPT, LLaMA)
      - Encoder-Only models (e.g., BERT, RoBERTa)
      - Encoder-Decoder models (e.g., T5, BART)
      - Multi-modal architectures with cross-modal attention mechanisms
        (planned, not yet fully implemented).

    This interface utilizes a customizable `AttentionBackend`, allowing the application of
    attention mechanisms across different tasks and architectures.

    Args:
        num_heads (int): Number of attention heads.
        head_size (int): Dimensionality of each attention head.
        scale (float): Scaling factor for dot-product attention.
        num_kv_heads (Optional[int]): Number of key-value heads. Defaults to `num_heads`.
        cache_config (Optional[CacheConfig]): Configuration for key-value (KV) cache.
        quant_config (Optional[QuantizationConfig]): Configuration for quantization.
        alibi_slopes (Optional[List[float]]): ALiBi positional bias slopes (not implemented).
        blocksparse_params (Optional[Dict[str, Any]]): Block-sparse attention parameters
            (not implemented).
        logits_soft_cap (Optional[float]): Logit value cap (not implemented).
        per_layer_sliding_window (Optional[int]): Sliding window size for specific layers
            (not implemented).
        prefix (str): Optional prefix for identifying the layer.
        attn_type (AttentionType): Type of attention (e.g., "decoder").

    Raises:
        NotImplementedError: If an unsupported feature (e.g., ALiBi, block-sparse attention) is
            provided.
        ValueError: If the block size is not explicitly defined in `CacheConfig`, or mismatched
            with the backend's default block size.
    """

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: Optional[int] = None,
        cache_config: Optional[CacheConfig] = None,
        quant_config: Optional[QuantizationConfig] = None,
        alibi_slopes: Optional[List[float]] = None,
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        per_layer_sliding_window: Optional[int] = None,
        prefix: str = "",
        attn_type: AttentionType = AttentionType.DECODER,
    ) -> None:
        super().__init__()

        if alibi_slopes is not None:
            raise NotImplementedError("ALiBi slopes are not implemented yet.")
        if blocksparse_params is not None:
            raise NotImplementedError("Block-sparse attention is not implemented yet.")
        if logits_soft_cap is not None:
            raise NotImplementedError("Logits soft cap is not implemented yet.")
        if per_layer_sliding_window is not None:
            raise NotImplementedError("Per-layer sliding window is not implemented yet.")


# furiosa_models_lang/furiosa_models/attention/backends/llm.py

class LLMAttentionImpl(AttentionImplBase):
    """Implementation of LLM-specific attention using `PagedAttention`.

    This class manages KV cache operations and computes scaled dot-product attention.

    Args:
        num_heads (int): Number of attention heads.
        head_size (int): Size of each attention head.
        scale (float): Scaling factor for dot-product attention.
        num_kv_heads (int): Number of KV heads.
        alibi_slopes (Optional[List[float]]): Slopes for ALiBi attention.
        sliding_window (Optional[int]): Window size for local attention.
        blocksparse_params (Optional[Dict[str, Any]]): Parameters for block-sparse attention.
        logits_soft_cap (Optional[float]): Soft cap for attention logits.
        kv_cache_dtype (torch.dtype): KV cache data type (`"auto"` by default).
        attn_type (AttentionType): Type of attention (default: `"decoder"`).

    Raises:
        ValueError: If `num_heads` is not divisible by `num_kv_heads`.
        NotImplementedError: If ALiBi, sliding window, or block-sparse attention is used.
    """

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: int,
        alibi_slopes: Optional[List[float]] = None,
        sliding_window: Optional[int] = None,
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        kv_cache_dtype: torch.dtype = torch.float32,
        attn_type: AttentionType = AttentionType.DECODER,
    ) -> None:
        num_kv_heads = num_kv_heads or num_heads
        if num_heads % num_kv_heads != 0:
            raise ValueError(
                f"Number of heads ({num_heads}) must be divisible by number of "
                f"KV heads ({num_kv_heads})."
            )

        if kv_cache_dtype not in [torch.float32, torch.float16, torch.bfloat16]:
            raise ValueError(
                f"Invalid KV cache data type: {kv_cache_dtype}. "
                f"Supported types are torch.float32, torch.float16, torch.bfloat16"
            )

        if attn_type != AttentionType.DECODER:
            raise NotImplementedError(f"Attention type '{attn_type}' is not yet supported.")

        if alibi_slopes is not None:
            raise NotImplementedError("ALiBi attention is not yet supported.")
        if sliding_window is not None:
            raise NotImplementedError("Sliding window attention is not yet supported.")
        if blocksparse_params is not None:
            raise NotImplementedError("Block-sparse attention is not yet supported.")
        if logits_soft_cap is not None:
            raise NotImplementedError("Soft cap for attention logits is not yet supported.")


In these circumstances, is there any way I can implement the model on RNGD with SWA myself, other than disabling sliding window attention and setting max-model-len equal to the window size so that the sliding window has no effect at all? If I have to use full attention for all layers, is there any way to provide per-layer attention masks so that I can simulate sliding window attention with attention masks?
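
For reference, by simulating SWA with masks I mean applying a banded causal mask on top of full attention, roughly like the sketch below (plain PyTorch for illustration only, not tied to any furiosa-llm API; the window size is made up):

# Rough sketch: a sliding-window causal mask applied on top of full attention.
# Plain PyTorch, illustrative only; not a furiosa-llm interface.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # position i may attend to position j only if 0 <= i - j < window
    idx = torch.arange(seq_len)
    keep = (idx[:, None] >= idx[None, :]) & (idx[:, None] - idx[None, :] < window)
    mask = torch.zeros(seq_len, seq_len)
    return mask.masked_fill(~keep, float("-inf"))  # added to the attention scores before softmax

# e.g. SWA layers would use window=4096 while global-attention layers use window=seq_len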

Hi,

Could you please let me know which model architecture you are trying to use (or deploy)? Our model support approach is very similar to other LLM serving frameworks, such as vLLM and TensorRT-LLM.

Because LLMs require special optimizations (such as paged attention and FlashInfer) and specific input formats for their specialized kernels, our serving framework requires models to follow a specific architectural structure. The furiosa-models-lang and furiosa-llm-models repositories are our collection of supported model packages.
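
To give a concrete, framework-agnostic picture of what "specific input formats" means: with paged attention, the KV cache is addressed through a block table rather than as one contiguous tensor, roughly like the sketch below (names and shapes here are illustrative, not our actual kernel interface):

# Illustrative only: how paged attention locates a token's K/V entries through a block table.
# These names and shapes are generic examples, not the furiosa-llm kernel interface.
block_size = 16                  # tokens stored per physical KV-cache block
block_table = [7, 2, 9]          # logical block index -> physical block id for one sequence

def kv_location(token_pos: int):
    physical_block = block_table[token_pos // block_size]
    offset = token_pos % block_size
    return physical_block, offset  # where this token's K/V entries live in the cache

# Kernels written against this layout expect block tables and slot mappings as inputs,
# which is why the model definition has to be structured around them.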

If your model’s architecture isn’t included in those packages, it is difficult to build your model in practice. However, we are planning to open source furiosa-models-lang with well-documented guides, allowing users to port their models.

Also, if you let us know the architecture you need, we can consider adding support for it to our future roadmap.

FYI, we are currently preparing our next release, which is scheduled for the end of November. This release will include support for Exaone4 and Qwen3.

I was thinking about gemma3.

I understand that running models with furiosa-llm requires properly porting the model definition to match a specific architectural structure. I also understand that once the ported model is traced into an FX graph via the FX tracer, it is passed to the Furiosa compiler stack, which then generates the device binary (graph). That's why I tried to add my own model by modifying the package, referring to the models that are already included.
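
(For context, my mental model of the tracing step is the standard torch.fx flow, roughly as in the toy sketch below; the furiosa tracer itself may of course differ:)

# Toy illustration of FX tracing with plain torch.fx (not the furiosa tracer itself).
import torch
import torch.fx

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.proj(x))

gm = torch.fx.symbolic_trace(TinyBlock())  # captures the forward pass as an FX graph
print(gm.graph)  # in my understanding, a graph like this is what the compiler stack consumes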

So, to confirm my understanding: even if I port my model with SWA at the Python level (say, by adding the model architecture to models-lang), simulate SWA with full attention followed by a dynamically generated attention mask, and make sure that computation pattern is captured in the traced FX graph handed to the compiler, the compiler backend will not handle it properly for now, so it will end up in a compilation failure or unintended output. Is my understanding correct?

That’s correct. furiosa-llm works as you mentioned.

In practice, you will face many problems if you try to do what we don't officially support. I'm sorry about that; that's why I don't recommend it. Next year, we will be able to open source furiosa-models, furiosa-llm, and some low-level APIs, allowing users to port their models in a better way. I strongly recommend waiting until we publish them publicly.