Integrations
1 - Envoy AI Gateway
Envoy AI Gateway is an open source project for using Envoy Gateway to handle request traffic from application clients to Generative AI services.
How to use
1. Enable Envoy Gateway and Envoy AI Gateway
Both of them are enabled by default in values.global.yaml
and will be deployed in llmaz-system.
envoy-gateway:
enabled: true
envoy-ai-gateway:
enabled: true
However, Envoy Gateway and Envoy AI Gateway can also be deployed standalone in case you want to run them in other namespaces.
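If you do run them standalone, the upstream Helm charts are the usual route. A minimal sketch follows; the OCI chart locations and the unpinned versions are assumptions here, so confirm them against the Envoy Gateway and Envoy AI Gateway install docs:
# Envoy Gateway in its own namespace
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  -n envoy-gateway-system --create-namespace
# Envoy AI Gateway in its own namespace
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  -n envoy-ai-gateway-system --create-namespace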
2. Basic AI Gateway Example
To expose your models via Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.
We’ll deploy two models, Qwen/Qwen2-0.5B-Instruct-GGUF and Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.
The full example is here; apply it to your cluster.
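For orientation, the three resources look roughly like this. This is a condensed sketch following the shape of the Envoy AI Gateway basic example; the names here, and the AIServiceBackend objects that the route’s backendRefs point at, are assumptions — the full example linked above has the working manifests:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-ai-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: default-envoy-ai-gateway
  namespace: default
spec:
  gatewayClassName: envoy-ai-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: default-envoy-ai-gateway
  namespace: default
spec:
  schema:
    name: OpenAI
  targetRefs:
    - name: default-envoy-ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    # Envoy AI Gateway routes on the model name, which it copies from the
    # request body's "model" field into the x-ai-eg-model header.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen2-0.5b
      backendRefs:
        - name: qwen2-0.5b   # an AIServiceBackend pointing at the llama.cpp service
The Gateway’s name and namespace also determine the generated Service name you’ll port-forward below (envoy-&lt;namespace&gt;-&lt;gateway-name&gt;-&lt;hash&gt;).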
3. Check Envoy AI Gateway APIs
If Open-WebUI is enabled, you can chat via the web UI (recommended), see the documentation. Otherwise, follow the steps below to test the Envoy AI Gateway APIs.
I. Port-forward the LoadBalancer service in llmaz-system (the exact name contains a generated hash, so look it up with kubectl get svc -n llmaz-system), like:
kubectl port-forward svc/envoy-default-default-envoy-ai-gateway-dbec795a 8080:80
II. Query http://localhost:8080/v1/models (e.g. curl http://localhost:8080/v1/models | jq .) and the available models will be listed. The expected response looks like this:
{
"data": [
{
"id": "qwen2-0.5b",
"created": 1745327294,
"object": "model",
"owned_by": "Envoy AI Gateway"
},
{
"id": "qwen2.5-coder",
"created": 1745327294,
"object": "model",
"owned_by": "Envoy AI Gateway"
}
],
"object": "list"
}
III. Query http://localhost:8080/v1/chat/completions to chat with the model. Here we ask the qwen2-0.5b model; the query looks like:
curl -H "Content-Type: application/json" -d '{
"model": "qwen2-0.5b",
"messages": [
{
"role": "system",
"content": "Hi."
}
]
}' http://localhost:8080/v1/chat/completions | jq .
The expected response looks like this:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
"created": 1745327371,
"model": "qwen2-0.5b",
"system_fingerprint": "b5124-bc091a4d",
"object": "chat.completion",
"usage": {
"completion_tokens": 10,
"prompt_tokens": 10,
"total_tokens": 20
},
"id": "chatcmpl-AODlT8xnf4OjJwpQH31XD4yehHLnurr0",
"timings": {
"prompt_n": 1,
"prompt_ms": 319.876,
"prompt_per_token_ms": 319.876,
"prompt_per_second": 3.1262114069201816,
"predicted_n": 10,
"predicted_ms": 1309.393,
"predicted_per_token_ms": 130.9393,
"predicted_per_second": 7.63712651587415
}
}
2 - Open WebUI
Open WebUI is a user-friendly AI interface with OpenAI-compatible APIs, serving as the default chatbot for llmaz.
Prerequisites
- Make sure you’re working in the llmaz-system namespace; other namespaces haven’t been tested.
- Make sure Envoy Gateway and Envoy AI Gateway are installed; both are installed by default in llmaz. See AI Gateway for more details.
How to use
If open-webui is already installed, all you need to do is update the OpenAI API endpoint in the admin settings; you can get the value from steps 2 and 3 below. Otherwise, follow the steps here to install open-webui.
1. Enable Open WebUI in the values.global.yaml file; open-webui is enabled by default:
open-webui:
  enabled: true
Optionally set persistence to true to persist the data, which is recommended for production.
2. Run kubectl get svc -n llmaz-system to list the services; the output looks like:
envoy-default-default-envoy-ai-gateway-dbec795a   LoadBalancer   10.96.145.150   <pending>   80:30548/TCP                              132m
envoy-gateway                                     ClusterIP      10.96.52.76     <none>      18000/TCP,18001/TCP,18002/TCP,19001/TCP   172m
3. Set openaiBaseApiUrl in the values.global.yaml like below (a quick way to verify this endpoint is shown after this list):
open-webui:
  enabled: true
  openaiBaseApiUrl: http://envoy-default-default-envoy-ai-gateway-dbec795a.llmaz-system.svc.cluster.local/v1
4. Run make install-chatbot to install the chatbot.
5. Port-forward the service:
kubectl port-forward svc/open-webui 8080:80
6. Visit http://localhost:8080 to access the Open WebUI.
7. Configure the administrator account on first access.
That’s it! You can now chat with llmaz models through Open WebUI.
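Before installing the chatbot, you can sanity-check that the openaiBaseApiUrl endpoint is reachable from inside the cluster with a throwaway curl pod. This is a sketch; substitute the service name from your own kubectl get svc output:
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://envoy-default-default-envoy-ai-gateway-dbec795a.llmaz-system.svc.cluster.local/v1/models
If the gateway is wired up correctly, this returns the same model list as the /v1/models query in the AI Gateway section.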
3 - Prometheus Operator
llmaz already exposes metrics. This document explains how to install and configure the Prometheus Operator in a Kubernetes cluster.
Install the Prometheus Operator
Please follow the documentation to install it.
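For example, one common path is applying the operator bundle from the project repository (shown here as a sketch; the documentation lists the supported install options, and this URL tracks the main branch):
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml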
# Installing the prometheus operator
root@VM-0-5-ubuntu:/home/ubuntu# kubectl get pods
NAME READY STATUS RESTARTS AGE
prometheus-operator-55b5c96cf8-jl2nx 1/1 Running 0 12s
Ensure that the Prometheus Operator Pod is running successfully.
Install the ServiceMonitor CR for llmaz
To enable monitoring for the llmaz system, you need to install the ServiceMonitor custom resource (CR).
You can either modify the prometheus values in the Helm chart according to the documentation or use the make install-prometheus target in the Makefile.
- Using the Helm chart: modify values.global.yaml:
prometheus:
# -- Whether to enable Prometheus metrics exporting.
enable: true
- Using Makefile Command:
make install-prometheus
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# make install-prometheus
kubectl apply --server-side -k config/prometheus
serviceaccount/llmaz-prometheus serverside-applied
clusterrole.rbac.authorization.k8s.io/llmaz-prometheus serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/llmaz-prometheus serverside-applied
prometheus.monitoring.coreos.com/llmaz-prometheus serverside-applied
servicemonitor.monitoring.coreos.com/llmaz-controller-manager-metrics-monitor serverside-applied
Check Related Resources
Verify that the necessary resources have been created:
- ServiceMonitor
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get ServiceMonitor -n llmaz-system
NAME AGE
llmaz-controller-manager-metrics-monitor 59s
- Prometheus Pods
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get pods -n llmaz-system
NAME READY STATUS RESTARTS AGE
llmaz-controller-manager-7ff8f7d9bd-vztls 2/2 Running 0 28s
prometheus-llmaz-prometheus-0 2/2 Running 0 27s
- Services
root@VM-0-5-ubuntu:/home/ubuntu/llmaz# kubectl get svc -n llmaz-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
llmaz-controller-manager-metrics-service ClusterIP 10.96.79.226 <none> 8443/TCP 46s
llmaz-webhook-service ClusterIP 10.96.249.226 <none> 443/TCP 46s
prometheus-operated ClusterIP None <none> 9090/TCP 45s
View metrics using the Prometheus UI
Use port forwarding to access the Prometheus UI from your local machine:
root@VM-0-5-ubuntu:/home/ubuntu# kubectl port-forward services/prometheus-operated 9090:9090 --address 0.0.0.0 -n llmaz-system
Forwarding from 0.0.0.0:9090 -> 9090
If you’re using kind, you can also port-forward with kubectl port-forward services/prometheus-operated 39090:9090 --address 0.0.0.0 -n llmaz-system.
This allows you to access Prometheus in a browser at http://localhost:9090/query (or http://localhost:39090/query if you forwarded to port 39090 as above).
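As a quick smoke test, the built-in up metric (standard Prometheus, not llmaz-specific) should show the llmaz scrape targets; you can query it through the same port-forward via Prometheus’s HTTP API:
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .
Targets scraped via the llmaz ServiceMonitor appear with value "1" when healthy.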
4 - Supported Inference Backends
If you want to integrate more backends into llmaz, please refer to this PR. Contributions are always welcome.
llama.cpp
llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.
SGLang
SGLang is a fast serving framework for large language models and vision language models.
Text-Generation-Inference
text-generation-inference is a Rust, Python and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.
ollama
ollama runs Llama 3.2, Mistral, Gemma 2, and other large language models; it is based on llama.cpp and aims at local deployment.
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
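To show where these backends plug in, here is a rough sketch of a llmaz Playground that pins one of them. The field names follow the inference.llmaz.io/v1alpha1 API as of this writing and may differ across versions, and qwen2-0--5b is a placeholder model name, so treat this as an assumption and check the llmaz examples:
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b     # references an OpenModel defined elsewhere
  backendRuntimeConfig:
    backendName: vllm          # select one of the backends listed on this page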