1 - Installation

This section provides installation guidance for llmaz.

Prerequisites

Requirements:

  • Kubernetes version >= 1.26, since LWS requires Kubernetes v1.26 or higher. If you are on a lower Kubernetes version and most of your workloads rely on single-node inference, LWS may be replaced with a Deployment-based approach, that is, using Kubernetes Deployments to manage single-node inference workloads. See #32 for more details and updates.
  • Helm 3, see installation.
  • Prometheus, see installation.

Note: the llmaz Helm chart installs the following by default:

  • Envoy Gateway and Envoy AI Gateway as the gateway frontend in the llmaz-system namespace. If you have already installed these two components or want to deploy them in other namespaces, append --set envoy-gateway.enabled=false --set envoy-ai-gateway.enabled=false to the command below.
  • Open WebUI as the default chatbot. If you want to disable it, append --set open-webui.enabled=false to the command below.

Install a released version

Install

helm install llmaz oci://registry-1.docker.io/inftyai/llmaz --namespace llmaz-system --create-namespace --version 0.0.9
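
For example, to install while disabling the bundled gateway and chatbot components, append the toggles from the note above; afterwards, you can check that the pods are running:

helm install llmaz oci://registry-1.docker.io/inftyai/llmaz --namespace llmaz-system --create-namespace --version 0.0.9 \
    --set envoy-gateway.enabled=false \
    --set envoy-ai-gateway.enabled=false \
    --set open-webui.enabled=false

kubectl get pods -n llmaz-system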

Uninstall

helm uninstall llmaz --namespace llmaz-system
kubectl delete ns llmaz-system

If you want to delete the CRDs as well, run

kubectl delete crd \
    openmodels.llmaz.io \
    backendruntimes.inference.llmaz.io \
    playgrounds.inference.llmaz.io \
    services.inference.llmaz.io

Install from source

Change configurations

If you want to change the default configurations, please change the values in values.global.yaml.

Do not change the values in values.yaml because it’s auto-generated and will be overwritten.
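
For reference, the component toggles mentioned throughout this guide live in values.global.yaml; a minimal sketch (the exact layout may differ between chart versions) looks like:

envoy-gateway:
  enabled: true
envoy-ai-gateway:
  enabled: true
open-webui:
  enabled: true
prometheus:
  # -- Whether to enable Prometheus metrics exporting.
  enable: true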

Install

git clone https://github.com/inftyai/llmaz.git && cd llmaz
kubectl create ns llmaz-system && kubens llmaz-system
make helm-install

Uninstall

helm uninstall llmaz --namespace llmaz-system
kubectl delete ns llmaz-system

If you want to delete the CRDs as well, run

kubectl delete crd \
    openmodels.llmaz.io \
    backendruntimes.inference.llmaz.io \
    playgrounds.inference.llmaz.io \
    services.inference.llmaz.io

Upgrade

Once you have changed the code, run the following command to upgrade the controller:

IMG=<image-registry>:<tag> make helm-upgrade
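
For example, with a hypothetical image pushed to your own registry:

IMG=docker.io/<your-org>/llmaz:dev make helm-upgrade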

2 - Integrations

This section describes llmaz integrations.

2.1 - Envoy AI Gateway

Envoy AI Gateway is an open source project for using Envoy Gateway to handle request traffic from application clients to Generative AI services.

How to use

1. Enable Envoy Gateway and Envoy AI Gateway

Both of them are enabled by default in values.global.yaml and will be deployed in llmaz-system.

envoy-gateway:
    enabled: true
envoy-ai-gateway:
    enabled: true

However, Envoy Gateway and Envoy AI Gateway can also be deployed standalone if you want to run them in other namespaces.

2. Basic AI Gateway Example

To expose your models via Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.

We’ll deploy two models, Qwen/Qwen2-0.5B-Instruct-GGUF and Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.

The full example is available here; apply it to your cluster.
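
The resources involved look roughly like the sketch below. The GatewayClass and Gateway follow the standard Gateway API; the resource names are illustrative, and the AIGatewayRoute (which maps model names to backends) is only indicated as a comment, so copy its exact fields from the full example.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: default-envoy-ai-gateway            # illustrative name
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: default-envoy-ai-gateway            # illustrative name
  namespace: llmaz-system
spec:
  gatewayClassName: default-envoy-ai-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
# An AIGatewayRoute (aigateway.envoyproxy.io/v1alpha1) then routes OpenAI-style
# requests to the two models above; take its exact fields from the full example.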

3. Check Envoy AI Gateway APIs

If Open WebUI is enabled, you can chat via the web UI (recommended); see the documentation. Otherwise, follow the steps below to test the Envoy AI Gateway APIs.

I. Port-forward the LoadBalancer service in llmaz-system, for example:

kubectl port-forward svc/envoy-default-default-envoy-ai-gateway-dbec795a 8080:80

II. Query the models endpoint, e.g. curl http://localhost:8080/v1/models | jq .; the available models will be listed. The expected response will look like this:

{
  "data": [
    {
      "id": "qwen2-0.5b",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    },
    {
      "id": "qwen2.5-coder",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    }
  ],
  "object": "list"
}

III. Query http://localhost:8080/v1/chat/completions to chat with the model. Here, we ask the qwen2-0.5b model; the query looks like:

curl -H "Content-Type: application/json"     -d '{
        "model": "qwen2-0.5b",
        "messages": [
            {
                "role": "system",
                "content": "Hi."
            }
        ]
    }'     http://localhost:8080/v1/chat/completions | jq .

Expected response will look like this:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
      }
    }
  ],
  "created": 1745327371,
  "model": "qwen2-0.5b",
  "system_fingerprint": "b5124-bc091a4d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 10,
    "total_tokens": 20
  },
  "id": "chatcmpl-AODlT8xnf4OjJwpQH31XD4yehHLnurr0",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 319.876,
    "prompt_per_token_ms": 319.876,
    "prompt_per_second": 3.1262114069201816,
    "predicted_n": 10,
    "predicted_ms": 1309.393,
    "predicted_per_token_ms": 130.9393,
    "predicted_per_second": 7.63712651587415
  }
}

2.2 - Open WebUI

Open WebUI is a user-friendly AI interface with OpenAI-compatible APIs, serving as the default chatbot for llmaz.

Prerequisites

  • Make sure you’re working in the llmaz-system namespace; other namespaces have not been tested.
  • Make sure Envoy Gateway and Envoy AI Gateway are installed; both are installed by default with llmaz. See AI Gateway for more details.

How to use

If Open WebUI is already installed, all you need to do is update the OpenAI API endpoint in the admin settings; you can get the value from steps 2 and 3 below. Otherwise, follow the steps here to install Open WebUI.

  1. Enable Open WebUI in the values.global.yaml file; open-webui is enabled by default.

    open-webui:
      enabled: true
    

    Optionally set persistence=true to persist the data; this is recommended for production.

  2. Run kubectl get svc -n llmaz-system to list the services; the output looks like:

    envoy-default-default-envoy-ai-gateway-dbec795a   LoadBalancer   10.96.145.150   <pending>     80:30548/TCP                              132m
    envoy-gateway                                     ClusterIP      10.96.52.76     <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP   172m
    
  3. Set openaiBaseApiUrl in the values.global.yaml like:

    open-webui:
      enabled: true
      openaiBaseApiUrl: http://envoy-default-default-envoy-ai-gateway-dbec795a.llmaz-system.svc.cluster.local/v1
    
  4. Run make install-chatbot to install the chatbot.

  5. Port-forward the Open WebUI service:

    kubectl port-forward svc/open-webui 8080:80
    
  6. Visit http://localhost:8080 to access the Open WebUI.

  7. Configure the administrator for the first time.

That’s it! You can now chat with llmaz models with Open WebUI.
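
If the UI does not come up, a quick sanity check (assuming the chatbot pod name contains open-webui) is:

kubectl get pods -n llmaz-system | grep open-webui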

2.3 - Prometheus Operator

llmaz already exposes metrics. This document explains how to install and configure the Prometheus Operator in a Kubernetes cluster.

Install the Prometheus Operator

Please follow the documentation to install it.

# Installing the prometheus operator
root@VM-0-5-ubuntu:/home/ubuntu# kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-55b5c96cf8-jl2nx   1/1     Running   0          12s

Ensure that the Prometheus Operator Pod is running successfully.

Install the ServiceMonitor CR for llmaz

To enable monitoring for the llmaz system, you need to install the ServiceMonitor custom resource (CR). You can either modify the prometheus settings in the Helm chart according to the documentation or run make install-prometheus from the Makefile.

  • Using the Helm chart: modify values.global.yaml

    prometheus:
      # -- Whether to enable Prometheus metrics exporting.
      enable: true

  • Using the Makefile: run make install-prometheus

    root@VM-0-5-ubuntu:/home/ubuntu/llmaz# make install-prometheus
    kubectl apply --server-side -k config/prometheus
    serviceaccount/llmaz-prometheus serverside-applied
    clusterrole.rbac.authorization.k8s.io/llmaz-prometheus serverside-applied
    clusterrolebinding.rbac.authorization.k8s.io/llmaz-prometheus serverside-applied
    prometheus.monitoring.coreos.com/llmaz-prometheus serverside-applied
    servicemonitor.monitoring.coreos.com/llmaz-controller-manager-metrics-monitor serverside-applied

Verify that the necessary resources have been created:

  • ServiceMonitor

    root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get ServiceMonitor -n llmaz-system
    NAME                                       AGE
    llmaz-controller-manager-metrics-monitor   59s

  • Prometheus Pods

    root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get pods -n llmaz-system
    NAME                                        READY   STATUS    RESTARTS   AGE
    llmaz-controller-manager-7ff8f7d9bd-vztls   2/2     Running   0          28s
    prometheus-llmaz-prometheus-0               2/2     Running   0          27s

  • Services

    root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get svc -n llmaz-system
    NAME                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    llmaz-controller-manager-metrics-service   ClusterIP   10.96.79.226    <none>        8443/TCP   46s
    llmaz-webhook-service                      ClusterIP   10.96.249.226   <none>        443/TCP    46s
    prometheus-operated                        ClusterIP   None            <none>        9090/TCP   45s

View metrics using the Prometheus UI

Use port forwarding to access the Prometheus UI from your local machine:

root@VM-0-5-ubuntu:/home/ubuntu# kubectl port-forward services/prometheus-operated 9090:9090 --address 0.0.0.0 -n llmaz-system
Forwarding from 0.0.0.0:9090 -> 9090

If you are using kind, you can forward to a different local port, e.g. kubectl port-forward services/prometheus-operated 39090:9090 --address 0.0.0.0 -n llmaz-system, and then access Prometheus in a browser at http://localhost:39090/query.
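
Once in the UI, you can query the metrics scraped from the llmaz controller manager, for example the standard controller-runtime reconcile counter (shown purely as an illustration):

sum by (controller) (controller_runtime_reconcile_total)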

[Screenshot: Prometheus UI]

2.4 - Supported Inference Backends

If you want to integrate more backends into llmaz, please refer to this PR. Contributions are always welcome.

llama.cpp

llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, both locally and in the cloud.

SGLang

SGLang is a fast serving framework for large language models and vision language models.

Text-Generation-Inference

text-generation-inference is a Rust, Python and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.

ollama

ollama runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.

3 - Develop Guidance

This section contains development guidance for people who want to learn more about this project.

Project Structure

llmaz # root
├── bin # where the binaries live, like kustomize, ginkgo, etc.
├── chart # where the helm chart lives
├── cmd # where the main entry lives
├── docs # where all the documents live, like examples, installation guidance, etc.
├── llmaz # where the model loader logic lives
├── pkg # where the main logic for the Kubernetes controllers lives

API design

Core APIs

See the API Reference for more details.

Inference APIs

See the API Reference for more details.

4 - Reference

This section contains the llmaz reference information.

4.1 - llmaz core API

Generated API reference documentation for llmaz.io/v1alpha1.

Resource Types

OpenModel

Appears in:

OpenModel is the Schema for the open models API

Field | Description
apiVersion
string
llmaz.io/v1alpha1
kind
string
OpenModel
spec [Required]
ModelSpec
No description provided.
status [Required]
ModelStatus
No description provided.

Flavor

Appears in:

Flavor defines the accelerator requirements for a model and the necessary parameters in autoscaling. Right now, it will be used in two places:

  • Pod scheduling with node selectors specified.
  • Cluster autoscaling with essential parameters provided.
Field | Description
name [Required]
FlavorName

Name represents the flavor name, which will be used in model claim.

limits
k8s.io/api/core/v1.ResourceList

Limits defines the required accelerators to serve the model for each replica, like <nvidia.com/gpu: 8>. For multi-host cases, the limits here indicate the resource requirements for each replica, usually equal to the TP size. It is not recommended to set the cpu and memory usage here:

  • if using playground, you can define the cpu/mem usage at backendConfig.
  • if using inference service, you can define the cpu/mem at the container resources. However, if you define the same accelerator resources at playground/service as well, the resources will be overwritten by the flavor limit here.
nodeSelector
map[string]string

NodeSelector represents the node candidates for Pod placements, if a node doesn't meet the nodeSelector, it will be filtered out in the resourceFungibility scheduler plugin. If nodeSelector is empty, it means every node is a candidate.

params
map[string]string

Params stores other useful parameters and will be consumed by cluster-autoscaler / Karpenter for autoscaling or be defined as model parallelism parameters like TP or PP size. E.g. with autoscaling, when scaling up nodes with 8x Nvidia A100, the parameter can be injected with <INSTANCE-TYPE: p4d.24xlarge> for AWS. Preset parameters: TP, PP, INSTANCE-TYPE.

FlavorName

(Alias of string)

Appears in:

InferenceConfig

Appears in:

InferenceConfig represents the inference configurations for the model.

Field | Description
flavors
[]Flavor

Flavors represents the accelerator requirements to serve the model. Flavors are fungible following the priority represented by the slice order.

ModelHub

Appears in:

ModelHub represents the model registry for model downloads.

Field | Description
name
string

Name refers to the model registry, such as huggingface.

modelID [Required]
string

ModelID refers to the model identifier on model hub, such as meta-llama/Meta-Llama-3-8B.

filename [Required]
string

Filename refers to a specified model file rather than the whole repo. This is helpful to download a specified GGUF model rather than downloading the whole repo which includes all kinds of quantized models. TODO: this is only supported with Huggingface, add support for ModelScope in the near future. Note: once filename is set, allowPatterns and ignorePatterns should be left unset.

revision
string

Revision refers to a Git revision id which can be a branch name, a tag, or a commit hash.

allowPatterns
[]string

AllowPatterns means that only files matching at least one of the patterns will be downloaded.

ignorePatterns
[]string

IgnorePatterns means that files matching any of the patterns will not be downloaded.

ModelName

(Alias of string)

Appears in:

ModelRef

Appears in:

ModelRef refers to a created Model with its role.

Field | Description
name [Required]
ModelName

Name represents the model name.

role
ModelRole

Role represents the model role when more than one model is required, such as the draft role, which means running with speculative decoding; the default backend arguments will then be looked up in the backendRuntime named speculative-decoding.

ModelRole

(Alias of string)

Appears in:

ModelSource

Appears in:

ModelSource represents the source of the model. Only one model source will be used.

Field | Description
modelHub
ModelHub

ModelHub represents the model registry for model downloads.

uri
URIProtocol

URI represents various kinds of model sources following the URI protocol, protocol://, e.g.

  • oss://./
  • ollama://llama3.3
  • host://

ModelSpec

Appears in:

ModelSpec defines the desired state of Model

Field | Description
familyName [Required]
ModelName

FamilyName represents the model type, like llama2, which will be auto-injected into the labels with the key llmaz.io/model-family-name.

source [Required]
ModelSource

Source represents the source of the model; there are several ways to load the model, such as loading from huggingface, OCI registry, s3, host path and so on.

inferenceConfig [Required]
InferenceConfig

InferenceConfig represents the inference configurations for the model.
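
Putting the spec fields above together, a minimal OpenModel might look like the sketch below; the metadata name, family name and flavor are illustrative, the hub spelling follows the "such as huggingface" wording above, and the model ID reuses the ModelHub example:

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b                   # illustrative name
spec:
  familyName: llama3                # auto-injected as the llmaz.io/model-family-name label
  source:
    modelHub:
      name: huggingface
      modelID: meta-llama/Meta-Llama-3-8B
  inferenceConfig:
    flavors:
      - name: a100                  # FlavorName referenced by model claims
        limits:
          nvidia.com/gpu: 1         # accelerators required per replica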

ModelStatus

Appears in:

ModelStatus defines the observed state of Model

Field | Description
conditions [Required]
[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition

Conditions represents the Inference condition.

URIProtocol

(Alias of string)

Appears in:

URIProtocol represents the protocol of the URI.

4.2 - llmaz inference API

Generated API reference documentation for inference.llmaz.io/v1alpha1.

Resource Types

Playground

Appears in:

Playground is the Schema for the playgrounds API

Field | Description
apiVersion
string
inference.llmaz.io/v1alpha1
kind
string
Playground
spec [Required]
PlaygroundSpec
No description provided.
status [Required]
PlaygroundStatus
No description provided.

Service

Appears in:

Service is the Schema for the services API

Field | Description
apiVersion
string
inference.llmaz.io/v1alpha1
kind
string
Service
spec [Required]
ServiceSpec
No description provided.
status [Required]
ServiceStatus
No description provided.

BackendName

(Alias of string)

Appears in:

BackendRuntime

Appears in:

BackendRuntime is the Schema for the backendRuntime API

Field | Description
spec [Required]
BackendRuntimeSpec
No description provided.
status [Required]
BackendRuntimeStatus
No description provided.

BackendRuntimeConfig

Appears in:

Field | Description
backendName
BackendName

BackendName represents the inference backend under the hood, e.g. vLLM.

version
string

Version represents the backend version if you want a different one from the default version.

envs
[]k8s.io/api/core/v1.EnvVar

Envs represents the environments set to the container.

configName [Required]
string

ConfigName represents the recommended configuration name for the backend. It will be inferred from the models in the runtime if not specified, e.g. default, speculative-decoding.

args
[]string

Args defined here will be "appended" to the args defined in the recommendedConfig, whether explicitly configured via configName or inferred at runtime.

resources
ResourceRequirements

Resources represents the resource requirements for the backend, like cpu/mem. Accelerators like GPUs should not be defined here but at the model flavors, or the values here will be overwritten. Resources defined here will "overwrite" the resources in the recommendedConfig.

sharedMemorySize
k8s.io/apimachinery/pkg/api/resource.Quantity

SharedMemorySize represents the size of /dev/shm required in the runtime of inference workload. SharedMemorySize defined here will "overwrite" the sharedMemorySize in the recommendedConfig.

BackendRuntimeSpec

Appears in:

BackendRuntimeSpec defines the desired state of BackendRuntime

Field | Description
command
[]string

Command represents the default command for the backendRuntime.

image [Required]
string

Image represents the default image registry of the backendRuntime. It will work together with version to make up a real image.

version [Required]
string

Version represents the default version of the backendRuntime. It will be appended to the image as a tag.

envs
[]k8s.io/api/core/v1.EnvVar

Envs represents the environments set to the container.

lifecycle
k8s.io/api/core/v1.Lifecycle

Lifecycle represents hooks executed during the lifecycle of the container.

livenessProbe
k8s.io/api/core/v1.Probe

Periodic probe of backend liveness. Backend will be restarted if the probe fails. Cannot be updated.

readinessProbe
k8s.io/api/core/v1.Probe

Periodic probe of backend readiness. Backend will be removed from service endpoints if the probe fails.

startupProbe
k8s.io/api/core/v1.Probe

StartupProbe indicates that the Backend has successfully initialized. If specified, no other probes are executed until this completes successfully. If this probe fails, the backend will be restarted, just as if the livenessProbe failed. This can be used to provide different probe parameters at the beginning of a backend's lifecycle, when it might take a long time to load data or warm a cache, than during steady-state operation.

recommendedConfigs
[]RecommendedConfig

RecommendedConfigs represents the recommended configurations for the backendRuntime.
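
As a rough illustration of how these fields compose (the image, version, command and args below are placeholders rather than the shipped defaults), a BackendRuntime could look like:

apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  name: vllm                        # BackendName referenced from backendRuntimeConfig
spec:
  command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]   # placeholder command
  image: vllm/vllm-openai           # image registry, combined with the version below
  version: v0.6.0                   # appended to the image as a tag
  recommendedConfigs:
    - name: default
      args:
        - --port
        - "8080"                    # placeholder args; real configs may use {{ .CONFIG }} templates
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi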

BackendRuntimeStatus

Appears in:

BackendRuntimeStatus defines the observed state of BackendRuntime

Field | Description
conditions [Required]
[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition

Conditions represents the Inference condition.

ElasticConfig

Appears in:

Field | Description
minReplicas
int32

MinReplicas indicates the minimum number of inference workloads based on the traffic. Defaults to 1. MinReplicas cannot be 0 for now; serverless will be supported in the future.

maxReplicas
int32

MaxReplicas indicates the maximum number of inference workloads based on the traffic. Defaults to nil, which means there is no limit on the instance number.

scaleTrigger
ScaleTrigger

ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time, mostly used in Playground. ScaleTrigger defined here will "overwrite" the scaleTrigger in the recommendedConfig.

HPATrigger

Appears in:

HPATrigger represents the configuration of the HorizontalPodAutoscaler. Inspired by kubernetes.io/pkg/apis/autoscaling/types.go#HorizontalPodAutoscalerSpec. Note: the HPA component should be installed beforehand.

Field | Description
metrics
[]k8s.io/api/autoscaling/v2.MetricSpec

metrics contains the specifications for which to use to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond.

behavior
k8s.io/api/autoscaling/v2.HorizontalPodAutoscalerBehavior

behavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively). If not set, the default HPAScalingRules for scale up and scale down are used.
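
For example, an elasticConfig using an HPA trigger might look like the following sketch; the metric choice and threshold are illustrative, and the MetricSpec follows the standard autoscaling/v2 schema:

elasticConfig:
  minReplicas: 1
  maxReplicas: 3
  scaleTrigger:
    hpa:
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80    # illustrative threshold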

PlaygroundSpec

Appears in:

PlaygroundSpec defines the desired state of Playground

Field | Description
replicas
int32

Replicas represents the replica number of inference workloads.

modelClaim
ModelClaim

ModelClaim represents a claim for one model; it's a simplified use case of modelClaims. Most of the time, modelClaim is enough. ModelClaim and modelClaims are mutually exclusive.

modelClaims
ModelClaims

ModelClaims represents claims for multiple models for more complicated use cases like speculative decoding. ModelClaims and modelClaim are mutually exclusive.

backendRuntimeConfig
BackendRuntimeConfig

BackendRuntimeConfig represents the inference backendRuntime configuration under the hood, e.g. vLLM, which is the default backendRuntime.

elasticConfig [Required]
ElasticConfig

ElasticConfig defines the configuration for elastic usage, e.g. the max/min replicas.
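
Putting these fields together, a minimal Playground might look like the sketch below; the model and backend names are illustrative, and the field name inside modelClaim is an assumption based on the core API naming, so check the repository examples for the exact shape:

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: llama3-8b                   # illustrative name
spec:
  replicas: 1
  modelClaim:
    modelName: llama3-8b            # assumed field referencing an OpenModel
  backendRuntimeConfig:
    backendName: vllm               # illustrative backend name
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi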

PlaygroundStatus

Appears in:

PlaygroundStatus defines the observed state of Playground

Field | Description
conditions [Required]
[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition

Conditions represents the Inference condition.

replicas [Required]
int32

Replicas track the replicas that have been created, whether ready or not.

selector [Required]
string

Selector points to the string form of a label selector which will be used by HPA.

RecommendedConfig

Appears in:

RecommendedConfig represents the recommended configurations for the backendRuntime; users can choose one of them to apply.

Field | Description
name [Required]
string

Name represents the identifier of the config.

args
[]string

Args represents all the arguments for the command. Arguments wrapped with {{ .CONFIG }} are configuration placeholders waiting to be rendered.

resources
ResourceRequirements

Resources represents the resource requirements for the backend, like cpu/mem. Accelerators like GPUs should not be defined here but at the model flavors, or the values here will be overwritten.

sharedMemorySize
k8s.io/apimachinery/pkg/api/resource.Quantity

SharedMemorySize represents the size of /dev/shm required in the runtime of inference workload.

scaleTrigger
ScaleTrigger

ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time.

ResourceRequirements

Appears in:

TODO: DRA is not supported yet; it can be supported once needed.

Field | Description
limits
k8s.io/api/core/v1.ResourceList

Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

requests
k8s.io/api/core/v1.ResourceList

Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

ScaleTrigger

Appears in:

ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time, mostly used in Playground.

Field | Description
hpa [Required]
HPATrigger

HPA represents the trigger configuration of the HorizontalPodAutoscaler.

ServiceSpec

Appears in:

ServiceSpec defines the desired state of Service. The Service controller will maintain multiple flavors of workloads with different accelerators for cost or performance considerations.

Field | Description
modelClaims [Required]
ModelClaims

ModelClaims represents multiple claims for different models.

replicas
int32

Replicas represents the replica number of inference workloads.

workloadTemplate [Required]
sigs.k8s.io/lws/api/leaderworkerset/v1.LeaderWorkerTemplate

WorkloadTemplate defines the template for leader/worker pods.

rolloutStrategy
sigs.k8s.io/lws/api/leaderworkerset/v1.RolloutStrategy

RolloutStrategy defines the strategy that will be applied to update replicas when a revision is made to the leaderWorkerTemplate.
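
As a rough sketch, a Service combining these fields might look like the following; the names and image are illustrative, the modelClaims shape is an assumption based on ModelRef above, and the workloadTemplate follows the LWS LeaderWorkerTemplate (leader/worker pod templates), so consult the repository examples for a complete manifest:

apiVersion: inference.llmaz.io/v1alpha1
kind: Service
metadata:
  name: llama3-8b                   # illustrative name
spec:
  replicas: 1
  modelClaims:
    models:
      - name: llama3-8b             # assumed shape, referencing an OpenModel by ModelRef
  workloadTemplate:
    workerTemplate:
      spec:
        containers:
          - name: model-runner      # illustrative container
            image: vllm/vllm-openai:v0.6.0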

ServiceStatus

Appears in:

ServiceStatus defines the observed state of Service

Field | Description
conditions [Required]
[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition

Conditions represents the Inference condition.

replicas [Required]
int32

Replicas track the replicas that have been created, whether ready or not.

selector [Required]
string

Selector points to the string form of a label selector; the HPA will use it to autoscale your resource.