
Easy, advanced inference platform for large language models on Kubernetes.

Key Features

Ease of Use

Users can quickly deploy an LLM service with minimal configuration.
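
As a concrete sketch of that minimal path, the two manifests below follow llmaz's OpenModel and Playground CRDs; the model name and model ID are placeholders, and field details may vary by release:

```yaml
# Register a model pulled from a model hub (name and modelID are placeholders).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0-5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
# Serve the registered model; llmaz generates the underlying workloads.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0-5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0-5b
```

Applying both manifests with kubectl apply -f should be enough to bring up an inference service.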

Broad Backend Support

llmaz supports a wide range of advanced inference backends for different scenarios, such as vLLM, Text-Generation-Inference, SGLang, and llama.cpp. Find the full list of supported backends here.
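
As a hedged sketch, switching backends is intended to be a one-field change on the Playground; the backendRuntimeConfig and backendName fields below are assumptions based on the project's CRDs and may differ between versions:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0-5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0-5b
  # Field names assumed for illustration; consult the CRD reference.
  backendRuntimeConfig:
    backendName: vllm  # e.g. vllm, sglang, llamacpp, tgi
```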

Accelerator Fungibility

llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
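
One plausible way to express this, sketched below, is an ordered list of accelerator "flavors" on the model, preferred first with cheaper fallbacks after; the inferenceConfig and flavors fields are illustrative assumptions, and the nodeSelector labels come from NVIDIA GPU feature discovery:

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b
spec:
  familyName: llama3
  source:
    modelHub:
      modelID: meta-llama/Meta-Llama-3-8B-Instruct
  # Illustrative: try A100 nodes first, fall back to L4 nodes.
  inferenceConfig:
    flavors:
      - name: a100
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      - name: l4
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-L4
```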

Various Model Providers

llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores, and automatically handles model loading with no effort required from users.
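
For illustration, the provider is chosen in the model's source stanza; the modelHub form follows the project's examples, while the object-store URI scheme below is an assumption:

```yaml
# From a model hub (HuggingFace shown; ModelScope works the same way):
source:
  modelHub:
    name: Huggingface
    modelID: Qwen/Qwen2-0.5B-Instruct
---
# From an object store, addressed by URI (scheme and path are placeholders):
source:
  uri: oss://llmaz-models/qwen2-0-5b
```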

Multi-Host Support

llmaz supports both single-host and multi-host scenarios with LWS (LeaderWorkerSet) from day one.

AI Gateway Support

Offers capabilities such as token-based rate limiting and model routing through integration with Envoy AI Gateway.

Built-in ChatUI

Out-of-the-box chatbot support through integration with Open WebUI, offering capabilities like function calling, RAG, web search, and more; see the configuration here.

Scaling Efficiency

llmaz supports horizontal scaling with HPA by default and will integrate with autoscaling components like Cluster Autoscaler or Karpenter for smart scaling across different clouds.
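
Because the serving workload is a standard Kubernetes object, a plain HorizontalPodAutoscaler can target it; in this sketch the target kind and name are placeholders, since the generated workload depends on single-host versus multi-host mode:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0-5b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment   # placeholder: the workload llmaz generates
    name: qwen2-0-5b
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```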

Efficient Model Distribution (WIP)

Out-of-the-box model cache support via Manta, still under development while its architecture is being reworked.