Integrations

This section contains information about llmaz integrations.

1 - Envoy AI Gateway

Envoy AI Gateway is an open source project for using Envoy Gateway to handle request traffic from application clients to Generative AI services.

How to use

1. Enable Envoy Gateway and Envoy AI Gateway

Both are enabled by default in values.global.yaml and will be deployed in the llmaz-system namespace.

envoy-gateway:
  enabled: true
envoy-ai-gateway:
  enabled: true

However, Envoy Gateway and Envoy AI Gateway can also be deployed standalone if you want to run them in other namespaces.
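For example, to manage those installations yourself in another namespace, you might first disable the bundled charts in values.global.yaml (a minimal sketch mirroring the snippet above):

envoy-gateway:
  enabled: false
envoy-ai-gateway:
  enabled: false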

2. Basic AI Gateway Example

To expose your models via Envoy AI Gateway, you need to create a GatewayClass, a Gateway, and an AIGatewayRoute. The following example shows how to do this.

We’ll deploy two models, Qwen/Qwen2-0.5B-Instruct-GGUF and Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.

The full example is here; apply it to your cluster.
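For orientation, the resources in that example look roughly like the sketch below. The GatewayClass and Gateway are standard Gateway API objects; the AIGatewayRoute portion (resource names, the attachment field, the x-ai-eg-model header match, and the backend reference) is an assumption based on upstream Envoy AI Gateway examples, so treat the linked full example as authoritative.

# Sketch only -- names and AIGatewayRoute fields are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: default-envoy-ai-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: default-envoy-ai-gateway   # assumed; inferred from the service name used in step 3
  namespace: default
spec:
  gatewayClassName: default-envoy-ai-gateway-class
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llmaz-models               # assumed name
  namespace: default
spec:
  schema:
    name: OpenAI                   # expose an OpenAI-compatible API surface
  parentRefs:                      # may be targetRefs in some Envoy AI Gateway versions
    - name: default-envoy-ai-gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model  # header derived from the "model" field of the request body
              value: qwen2-0.5b
      backendRefs:
        - name: qwen2-0.5b         # assumed reference to an AIServiceBackend for the model (not shown)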

3. Check Envoy AI Gateway APIs

If Open WebUI is enabled, you can chat via the web UI (recommended); see the documentation. Otherwise, follow the steps below to test the Envoy AI Gateway APIs.

I. Port-forward the LoadBalancer service in llmaz-system, for example:

kubectl port-forward svc/envoy-default-default-envoy-ai-gateway-dbec795a 8080:80

II. Query http://localhost:8080/v1/models (for example, curl http://localhost:8080/v1/models | jq .) to list the available models. The expected response will look like this:

{
  "data": [
    {
      "id": "qwen2-0.5b",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    },
    {
      "id": "qwen2.5-coder",
      "created": 1745327294,
      "object": "model",
      "owned_by": "Envoy AI Gateway"
    }
  ],
  "object": "list"
}

III. Query http://localhost:8080/v1/chat/completions to chat with a model. Here we ask the qwen2-0.5b model; the request looks like:

curl -H "Content-Type: application/json"     -d '{
        "model": "qwen2-0.5b",
        "messages": [
            {
                "role": "system",
                "content": "Hi."
            }
        ]
    }'     http://localhost:8080/v1/chat/completions | jq .

The expected response will look like this:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
      }
    }
  ],
  "created": 1745327371,
  "model": "qwen2-0.5b",
  "system_fingerprint": "b5124-bc091a4d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 10,
    "total_tokens": 20
  },
  "id": "chatcmpl-AODlT8xnf4OjJwpQH31XD4yehHLnurr0",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 319.876,
    "prompt_per_token_ms": 319.876,
    "prompt_per_second": 3.1262114069201816,
    "predicted_n": 10,
    "predicted_ms": 1309.393,
    "predicted_per_token_ms": 130.9393,
    "predicted_per_second": 7.63712651587415
  }
}

2 - Open WebUI

Open WebUI is a user-friendly AI interface with OpenAI-compatible APIs, serving as the default chatbot for llmaz.

Prerequisites

  • Make sure you’re working in the llmaz-system namespace; other namespaces haven’t been tested.
  • Make sure Envoy Gateway and Envoy AI Gateway are installed; both are installed by default with llmaz. See AI Gateway for more details.

How to use

If open-webui is already installed, all you need to do is update the OpenAI API endpoint in the admin settings; you can get the value from steps 2 and 3 below. Otherwise, follow the steps here to install open-webui.

  1. Enable Open WebUI in the values.global.yaml file; open-webui is enabled by default.

    open-webui:
      enabled: true
    

    Optionally set persistence to true to persist the data; this is recommended for production.

  2. Run kubectl get svc -n llmaz-system to list the services; the output looks like:

    envoy-default-default-envoy-ai-gateway-dbec795a   LoadBalancer   10.96.145.150   <pending>     80:30548/TCP                              132m
    envoy-gateway                                     ClusterIP      10.96.52.76     <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP   172m
    
  3. Set openaiBaseApiUrl in values.global.yaml like this:

    open-webui:
      enabled: true
      openaiBaseApiUrl: http://envoy-default-default-envoy-ai-gateway-dbec795a.llmaz-system.svc.cluster.local/v1
    
  4. Run make install-chatbot to install the chatbot.

  5. Port-forward the Open WebUI service:

    kubectl port-forward svc/open-webui 8080:80
    
  6. Visit http://localhost:8080 to access the Open WebUI.

  7. Configure the administrator account the first time you log in.

That’s it! You can now chat with llmaz models via Open WebUI.

3 - Prometheus Operator

llmaz already exposes metrics. This document explains how to install and configure the Prometheus Operator in a Kubernetes cluster to collect them.

Install the Prometheus Operator

Please follow the documentation to install it.

# Installing the prometheus operator
root@VM-0-5-ubuntu:/home/ubuntu# kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-55b5c96cf8-jl2nx   1/1     Running   0          12s

Ensure that the Prometheus Operator Pod is running successfully.

Install the ServiceMonitor CR for llmaz

To enable monitoring for the llmaz system, you need to install the ServiceMonitor custom resource (CR). You can either enable it through the Helm chart's prometheus values according to the documentation or run make install-prometheus from the Makefile.

  • Using the Helm chart: modify values.global.yaml:
prometheus:
  # -- Whether to enable Prometheus metrics exporting.
  enable: true
  • Using the Makefile command: make install-prometheus
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# make install-prometheus
kubectl apply --server-side -k config/prometheus
serviceaccount/llmaz-prometheus serverside-applied
clusterrole.rbac.authorization.k8s.io/llmaz-prometheus serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/llmaz-prometheus serverside-applied
prometheus.monitoring.coreos.com/llmaz-prometheus serverside-applied
servicemonitor.monitoring.coreos.com/llmaz-controller-manager-metrics-monitor serverside-applied
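Either path results in a ServiceMonitor roughly like the sketch below; the label selector, port name, and TLS settings shown here are assumptions, so check config/prometheus in the llmaz repository for the authoritative manifest.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llmaz-controller-manager-metrics-monitor
  namespace: llmaz-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager   # assumed label on the metrics Service
  endpoints:
    - port: https                         # the 8443/TCP port of llmaz-controller-manager-metrics-service
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true          # assumed; verify the serving certificate in production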

Verify that the necessary resources have been created:

  • ServiceMonitor
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get ServiceMonitor -n llmaz-system
NAME                                       AGE
llmaz-controller-manager-metrics-monitor   59s
  • Prometheus Pods
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get pods -n llmaz-system
NAME                                        READY   STATUS    RESTARTS   AGE
llmaz-controller-manager-7ff8f7d9bd-vztls   2/2     Running   0          28s
prometheus-llmaz-prometheus-0               2/2     Running   0          27s
  • Services
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get svc -n llmaz-system
NAME                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
llmaz-controller-manager-metrics-service   ClusterIP   10.96.79.226    <none>        8443/TCP   46s
llmaz-webhook-service                      ClusterIP   10.96.249.226   <none>        443/TCP    46s
prometheus-operated                        ClusterIP   None            <none>        9090/TCP   45s

View metrics using the Prometheus UI

Use port forwarding to access the Prometheus UI from your local machine:

root@VM-0-5-ubuntu:/home/ubuntu# kubectl port-forward services/prometheus-operated 9090:9090 --address 0.0.0.0 -n llmaz-system
Forwarding from 0.0.0.0:9090 -> 9090

If you're using kind, you can still use port-forwarding: kubectl port-forward services/prometheus-operated 39090:9090 --address 0.0.0.0 -n llmaz-system. This lets you access Prometheus from a browser at http://localhost:39090/query.

(Screenshot: Prometheus UI)

4 - Supported Inference Backends

If you want to integrate more backends into llmaz, please refer to this PR. Contributions are always welcome.

llama.cpp

llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.

SGLang

SGLang is a fast serving framework for large language models and vision language models.

Text-Generation-Inference

text-generation-inference is a Rust, Python and gRPC server for text generation inference. It is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.

ollama

ollama runs Llama 3.2, Mistral, Gemma 2, and other large language models; it is based on llama.cpp and aims at local deployment.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
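For context, llmaz selects one of these backends per workload, typically by name in a Playground. The following is a minimal, illustrative sketch: the API groups and field names are assumptions based on llmaz examples (in particular backendRuntimeConfig), so consult the project's sample manifests for the authoritative schema.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b                # assumed naming convention
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    backendName: vllm              # assumed field; choose any supported backend listed above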