Key Features
Ease of Use
Users can quickly deploy an LLM service with minimal configuration.
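As a rough sketch of what minimal configuration can look like, the Playground manifest below deploys a single-replica service. Field names follow the project's published examples, and the model name `qwen2-0--5b` is illustrative; verify both against the CRD version you have installed.

```yaml
# Minimal sketch: deploy an LLM service by claiming a previously
# registered model (field names may vary across llmaz versions).
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b   # references an OpenModel defined elsewhere
```

Applying it with `kubectl apply -f playground.yaml` is all that is needed to bring the service up.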
Broad Backend Support
llmaz supports a wide range of advanced inference backends for different scenarios, such as vLLM, Text Generation Inference, SGLang, and llama.cpp. Find the full list of supported backends here.
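As a sketch of how a backend might be chosen per deployment, the snippet below picks SGLang instead of the default. The `backendRuntimeConfig` field and its sub-keys are taken from the project's examples and may differ between llmaz versions, so treat the exact names as assumptions.

```yaml
# Sketch: selecting a specific inference backend for a Playground
# (the backendRuntimeConfig schema may vary by llmaz version).
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b-sglang
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    backendName: sglang   # e.g. vllm, sglang, llamacpp, tgi
```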
Accelerator Fungibility
llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
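One way this is expressed is through per-model "flavors", each pinning a different accelerator, so scheduling can fall back from a preferred GPU to a cheaper one. The sketch below assumes the `inferenceConfig.flavors` fields seen in the project's examples; the GPU labels and flavor names are illustrative.

```yaml
# Sketch: two accelerator flavors for one model; earlier flavors are
# preferred, later ones act as fallbacks (field names are assumptions
# based on the project's examples).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceConfig:
    flavors:
      - name: a100            # preferred accelerator
        limits:
          nvidia.com/gpu: 1
        nodeSelector:
          gpu-type: a100
      - name: l4              # cheaper fallback
        limits:
          nvidia.com/gpu: 1
        nodeSelector:
          gpu-type: l4
```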
Various Model Providers
llmaz supports a wide range of model providers, such as Hugging Face, ModelScope, and object stores. llmaz automatically handles model loading, requiring no effort from users.
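For illustration, the sketch below registers a model from either a model hub or an object store. The `source` sub-fields mirror the project's examples, and the model ID and bucket path are placeholders.

```yaml
# Sketch: declaring where a model is loaded from. Use exactly one
# source form; values here are illustrative.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b-instruct
spec:
  familyName: llama3
  source:
    # From a model hub (Hugging Face shown; ModelScope also supported):
    modelHub:
      modelID: meta-llama/Meta-Llama-3-8B-Instruct
    # Or from an object store instead, e.g.:
    # uri: s3://my-bucket/models/llama3-8b-instruct
```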
AI Gateway Support
Integrates with Envoy AI Gateway to offer capabilities such as token-based rate limiting and model routing.
Built-in ChatUI
Out-of-the-box chatbot support through integration with Open WebUI, offering capabilities such as function calling, RAG, web search, and more; see the configuration here.
Scaling Efficiency
llmaz supports horizontal scaling with HPA by default and will integrate with autoscaling components such as Cluster Autoscaler and Karpenter for smart scaling across different clouds.
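As a sketch of the default HPA path, the manifest below is a standard autoscaling/v2 HorizontalPodAutoscaler; the scaleTargetRef is hypothetical and should point at whichever resource llmaz creates for your deployment (it must expose the scale subresource).

```yaml
# Sketch: standard Kubernetes HPA scaling an inference workload on CPU
# utilization; the target reference below is a placeholder.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0--5b-hpa
spec:
  scaleTargetRef:
    apiVersion: inference.llmaz.io/v1alpha1
    kind: Playground        # hypothetical target; adjust to your setup
    name: qwen2-0--5b
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```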
Efficient Model Distribution (WIP)
Out-of-the-box model cache system support with Manta, still under active development while its architecture is being reworked.