
Easy, advanced inference platform for large language models on Kubernetes.

Key Features

Ease of Use

Users can quickly deploy an LLM service with minimal configuration.
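
As a concrete sketch of that minimal path, the two manifests below follow llmaz's OpenModel and Playground CRDs; the model name and model ID are placeholders, and field details may vary by release:

```yaml
# Register a model pulled from a model hub (name and modelID are placeholders).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0-5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
# Serve the registered model; llmaz generates the underlying workloads.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0-5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0-5b
```

Applying both manifests with kubectl apply -f should be enough to bring up an inference service.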

Broad Backend Support

llmaz supports a wide range of advanced inference backends for different scenarios, such as vLLM, Text-Generation-Inference, SGLang, and llama.cpp. Find the full list of supported backends here.
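
As a hedged sketch, switching backends is intended to be a one-field change on the Playground; the backendRuntimeConfig and backendName fields below are assumptions based on the project's CRDs and may differ between versions:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0-5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0-5b
  # Field names assumed for illustration; consult the CRD reference.
  backendRuntimeConfig:
    backendName: vllm  # e.g. vllm, sglang, llamacpp, tgi
```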

Accelerator Fungibility

llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
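
One plausible way to express this, sketched below, is an ordered list of accelerator "flavors" on the model, preferred first with cheaper fallbacks after; the inferenceConfig and flavors fields are illustrative assumptions, and the nodeSelector labels come from NVIDIA GPU feature discovery:

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b
spec:
  familyName: llama3
  source:
    modelHub:
      modelID: meta-llama/Meta-Llama-3-8B-Instruct
  # Illustrative: try A100 nodes first, fall back to L4 nodes.
  inferenceConfig:
    flavors:
      - name: a100
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      - name: l4
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-L4
```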

Various Model Providers

llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores, and automatically handles model loading with no effort required from users.
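
For illustration, the provider is chosen in the model's source stanza; the modelHub form follows the project's examples, while the object-store URI scheme below is an assumption:

```yaml
# From a model hub (HuggingFace shown; ModelScope works the same way):
source:
  modelHub:
    name: Huggingface
    modelID: Qwen/Qwen2-0.5B-Instruct
---
# From an object store, addressed by URI (scheme and path are placeholders):
source:
  uri: oss://llmaz-models/qwen2-0-5b
```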

Multi-Host Support

llmaz supports both single-host and multi-host scenarios with LWS (LeaderWorkerSet) from day one.

AI Gateway Support

Offers capabilities such as token-based rate limiting and model routing through integration with Envoy AI Gateway.

Built-in ChatUI

Out-of-the-box chatbot support through integration with Open WebUI, offering capabilities like function calling, RAG, web search, and more; see the configuration here.

Scaling Efficiency

llmaz supports horizontal scaling with HPA by default and will integrate with autoscaling components like Cluster Autoscaler or Karpenter for smart scaling across different clouds.
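
Because the serving workload is a standard Kubernetes object, a plain HorizontalPodAutoscaler can target it; in this sketch the target kind and name are placeholders, since the generated workload depends on single-host versus multi-host mode:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0-5b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment   # placeholder: the workload llmaz generates
    name: qwen2-0-5b
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```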

Efficient Model Distribution (WIP)

Out-of-the-box model cache support via Manta, still under development while its architecture is being reworked.