I would like to ask about some details regarding model preparation for RNGD.
Vision models
According to the introduction of RNGD, I noticed that “RNGD is a chip designed for deep learning inference, supporting high-performance Large Language Models (LLM), Multi-Modal LLM, Vision models, and other deep learning models.”
I would like to confirm whether the vision models listed in Furiosa Models are also supported by RNGD.
Model zoo
I would like to ask whether only the models listed in the model zoo are supported by the compiler.
Partial model compilation
Is it possible to compile only one or several transformer blocks separately instead of the entire model?
Pipeline Parallelism Across Devices
Since pipeline parallelism among RNGD devices is supported, is it possible to implement pipeline parallelism between GPUs and RNGDs?
Thank you for the interesting questions. Please find my inline responses below.
Vision models
According to the introduction of RNGD, I noticed that “RNGD is a chip designed for deep learning inference, supporting high-performance Large Language Models (LLM), Multi-Modal LLM, Vision models, and other deep learning models.”
I would like to confirm whether the vision models listed in Furiosa Models are also supported by RNGD.
RNGD and its software stack do support vision models. However, they are not currently listed because additional time is needed to bring them to production-level quality. According to our current milestones, we plan to announce official support around Q4.
Model zoo
I would like to ask whether only the models listed in the model zoo are supported by the compiler.
Currently, we only support the models listed at Supported Models.
In fact, the Furiosa Compiler is capable of compiling a wider range of models, but we currently list only those that meet product-level quality standards.
We plan to lift these restrictions gradually. Starting in Q3, we aim to allow compilation of models that share the same Hugging Face Transformers implementation, and to expose eager mode and a torch.compile() backend as well, so that users can experiment with various compilation paths on their own.
Partial model compilation
Is it possible to compile only one or several transformer blocks separately instead of the entire model?
Yes, it is possible. By setting the num_hidden_layers parameter of ArtifactBuilder, only the specified number of transformer blocks will be compiled.
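As a rough sketch (the import path, model id, and method names in the comments below are my assumptions, not taken from the official documentation), the usage might look like this; conceptually, num_hidden_layers truncates the stack of transformer blocks before compilation:

```python
# Hypothetical ArtifactBuilder usage (import path, model id, and build()
# signature are assumptions, not verified against the SDK):
#
#     from furiosa_llm.artifact import ArtifactBuilder
#
#     builder = ArtifactBuilder(
#         "meta-llama/Meta-Llama-3-8B",  # example model id
#         num_hidden_layers=2,           # compile only the first 2 blocks
#     )
#     builder.build("./llama3-2-layers")
#
# The effect of num_hidden_layers can be mimicked with a plain list of
# transformer blocks:

def truncate_blocks(blocks, num_hidden_layers):
    """Keep only the first num_hidden_layers transformer blocks."""
    return blocks[:num_hidden_layers]

full_model = [f"block_{i}" for i in range(32)]  # e.g. a 32-layer model
partial = truncate_blocks(full_model, 2)
print(partial)  # ['block_0', 'block_1']
```

This is handy for quickly estimating per-block compile time and on-chip memory use before committing to a full-model compilation.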
Pipeline Parallelism Across Devices
Since pipeline parallelism among RNGD devices is supported, is it possible to implement pipeline parallelism between GPUs and RNGDs?
That’s a very interesting question. In theory it is possible, but there are several practical challenges to address. For example, RNGD cards support peer-to-peer communication, transferring intermediate data between cards without involving the host. However, it is still unclear whether such host-bypassing memory transfers are possible between GPUs and RNGDs.
While we do not plan to officially support pipeline parallelism between GPU and RNGD, once eager mode and the torch.compile() backend are available, advanced users or hackers might be able to experiment with and implement such setups themselves.
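To make the idea concrete, here is a toy, host-mediated sketch in plain Python (all names are illustrative; a real implementation would move tensors over PCIe, not Python objects over a queue). One stage stands in for a GPU running the first half of the model, the other for an RNGD running the rest, and activations are staged through host memory because no peer-to-peer path between the two device types is assumed:

```python
import queue
import threading

# Toy host-mediated pipeline between two heterogeneous "devices".
# The bounded queue stands in for a host-memory staging buffer and
# also provides backpressure between the two stages.

def gpu_stage(inputs, handoff):
    for x in inputs:
        handoff.put(x * 2)   # pretend: first half of the model on the GPU
    handoff.put(None)        # sentinel: no more micro-batches

def rngd_stage(handoff, outputs):
    while (x := handoff.get()) is not None:
        outputs.append(x + 1)  # pretend: second half of the model on the RNGD

handoff = queue.Queue(maxsize=4)
outputs = []
t1 = threading.Thread(target=gpu_stage, args=([1, 2, 3], handoff))
t2 = threading.Thread(target=rngd_stage, args=(handoff, outputs))
t1.start(); t2.start()
t1.join(); t2.join()
print(outputs)  # [3, 5, 7]
```

The extra host round-trip is exactly the overhead that peer-to-peer transfers avoid between RNGD cards, which is why heterogeneous GPU-to-RNGD pipelines remain an open, experimental setup.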