Features

This section covers the advanced features of llmaz.

1 - Broad Inference Backends Support

If you want to integrate more backends into llmaz, please refer to this PR. Contributions are always welcome.

llama.cpp

llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, both locally and in the cloud.

ollama

ollama runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.

SGLang

SGLang is a fast serving framework for large language models and vision language models.

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Text-Generation-Inference

text-generation-inference is a Rust, Python, and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API, and Inference Endpoints.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
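
The backend is selected per workload. As a minimal sketch (assuming the inference.llmaz.io/v1alpha1 Playground API; the model name and backend name here are illustrative), switching a Playground to SGLang looks roughly like this:

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b     # references a model declared elsewhere
  backendRuntimeConfig:
    backendName: sglang        # illustrative; pick one of the backends listed above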

2 - Heterogeneous Cluster Support

A llama2-7B model can run on a 1xA100 GPU, but also on a 1xA10, or even a 1x4090 and a variety of other GPU types; this is what we call resource fungibility. In practice we may have a heterogeneous cluster with different GPU types, and high-end GPUs are often out of stock. To meet the service SLOs as well as the cost targets, we need to be able to schedule workloads onto different GPU types. With the ResourceFungibility feature of the InftyAI scheduler, we can achieve this with at most 8 alternative GPU types.

How to use

Enable InftyAI scheduler

Edit the values.global.yaml file to modify the following values:

kube-scheduler:
  enabled: true

globalConfig:
  configData: |-
    scheduler-name: inftyai-scheduler    

Run make helm-upgrade to install or upgrade llmaz.
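
With the InftyAI scheduler enabled, the alternative GPU types for a model are expressed as flavors. The following is a hedged sketch, assuming the llmaz.io/v1alpha1 OpenModel API; flavor names, resource keys, and exact field names are illustrative and may differ between llmaz versions:

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama2-7b
spec:
  familyName: llama2
  source:
    modelHub:
      modelID: meta-llama/Llama-2-7b-hf
  inferenceConfig:
    flavors:               # up to 8 alternative GPU types
      - name: a100
        limits:
          nvidia.com/gpu: 1
      - name: a10
        limits:
          nvidia.com/gpu: 1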

3 - Distributed Inference

llmaz supports multi-host & homogeneous xPyD distributed serving with LWS from day 0. Heterogeneous xPyD will be implemented in the future.
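
Multi-host serving maps each serving replica onto a leader pod plus its worker pods. As a hedged illustration of the underlying LWS shape (this shows the LeaderWorkerSet API itself, not the llmaz API; the name, image, and sizes are placeholders):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multi-host-serving
spec:
  replicas: 2                # two independent serving groups
  leaderWorkerTemplate:
    size: 4                  # 1 leader + 3 workers per group
    leaderTemplate:
      spec:
        containers:
          - name: leader
            image: vllm/vllm-openai:latest   # placeholder image
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: vllm/vllm-openai:latest   # placeholder image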