Reference

This section contains the llmaz reference information.

1: llmaz core API
2: llmaz inference API

1 - llmaz core API

Generated API reference documentation for llmaz.io/v1alpha1.

Resource Types

OpenModel

`OpenModel`

Appears in:

OpenModel is the Schema for the open models API

Field	Description
`apiVersion` string	`llmaz.io/v1alpha1`
`kind` string	`OpenModel`
`spec` [Required] `ModelSpec`	No description provided.
`status` [Required] `ModelStatus`	No description provided.

`Flavor`

Appears in:

InferenceConfig

Flavor defines the accelerator requirements for a model and the necessary parameters in autoscaling. Right now, it will be used in two places:

Pod scheduling with node selectors specified.
Cluster autoscaling with essential parameters provided.

Field	Description
`name` [Required] `FlavorName`	Name represents the flavor name, which will be used in model claim.
`limits` `k8s.io/api/core/v1.ResourceList`	Limits defines the accelerators required to serve the model for each replica, like <nvidia.com/gpu: 8>. In multi-host cases, the limits here indicate the resource requirements for each replica, which usually equal the TP size. Setting cpu and memory here is not recommended: if using a Playground, define cpu/mem in the backendRuntimeConfig; if using an inference Service, define cpu/mem in the container resources. However, if you also define the same accelerator resources in the Playground or Service, they will be overwritten by the flavor limits here.
`nodeSelector` `map[string]string`	NodeSelector represents the node candidates for Pod placement. A node that doesn't meet the nodeSelector is filtered out by the resourceFungibility scheduler plugin. If nodeSelector is empty, every node is a candidate.
`params` `map[string]string`	Params stores other useful parameters, which are either consumed by cluster-autoscaler / Karpenter for autoscaling or used as model parallelism parameters like TP or PP size. E.g. with autoscaling, when scaling up nodes with 8x Nvidia A100, the parameter <INSTANCE-TYPE: p4d.24xlarge> can be injected for AWS. Preset parameters: TP, PP, INSTANCE-TYPE.

`FlavorName`

(Alias of string)

Appears in:

Flavor

`InferenceConfig`

Appears in:

ModelSpec

InferenceConfig represents the inference configurations for the model.

Field	Description
`flavors` `[]Flavor`	Flavors represents the accelerator requirements to serve the model. Flavors are fungible following the priority represented by the slice order.
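
As a rough illustration of how the Flavor fields fit together, the fragment below sketches an `inferenceConfig` with two flavors, listed in priority order. The flavor names, the node label, and all values are hypothetical and chosen only for the example; only the field names come from the tables above.

```yaml
# Fragment of an OpenModel spec (not a complete manifest).
inferenceConfig:
  flavors:
    # First in the slice, so it has the highest priority.
    - name: a100-8x                      # hypothetical flavor name
      limits:
        nvidia.com/gpu: 8                # accelerators per replica
      nodeSelector:
        example.com/gpu-type: a100       # hypothetical node label
      params:
        TP: "8"                          # preset parameter: tensor parallel size
        INSTANCE-TYPE: p4d.24xlarge      # preset parameter used for autoscaling on AWS
    # Fallback flavor; the empty nodeSelector makes every node a candidate.
    - name: single-gpu                   # hypothetical flavor name
      limits:
        nvidia.com/gpu: 1
```

cpu and memory are deliberately left out of `limits`, following the recommendation in the `limits` description.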

`ModelHub`

Appears in:

ModelSource

ModelHub represents the model registry for model downloads.

Field	Description
`name` `string`	Name refers to the model registry, such as huggingface.
`modelID` [Required] `string`	ModelID refers to the model identifier on model hub, such as meta-llama/Meta-Llama-3-8B.
`filename` [Required] `string`	Filename refers to a specific model file rather than the whole repo. This is helpful for downloading a single GGUF model instead of the whole repo, which includes all kinds of quantized models. TODO: this is only supported with Huggingface; add support for ModelScope in the near future. Note: once filename is set, allowPatterns and ignorePatterns should be left unset.
`revision` `string`	Revision refers to a Git revision id which can be a branch name, a tag, or a commit hash.
`allowPatterns` `[]string`	AllowPatterns specifies that only files matching at least one of the patterns will be downloaded.
`ignorePatterns` `[]string`	IgnorePatterns specifies that files matching any of the patterns will not be downloaded.
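
The two download styles described above can be sketched as follows. This is a hedged example: the patterns, the placeholder repository, and the placeholder file name are assumptions, and `name` is omitted because the exact spelling the API expects is not stated here. The note on `filename` suggests that it and the pattern fields are alternatives, so each document below uses only one of them.

```yaml
# Variant 1: download a repo, filtered by patterns.
source:
  modelHub:
    modelID: meta-llama/Meta-Llama-3-8B
    revision: main                 # a branch name, tag, or commit hash
    allowPatterns:
      - "*.safetensors"            # hypothetical pattern
      - "*.json"                   # hypothetical pattern
---
# Variant 2: download a single file, e.g. one GGUF quantization.
source:
  modelHub:
    modelID: <org>/<gguf-repo>     # placeholder, not a real repository
    filename: <model-file>.gguf    # placeholder file name
```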

`ModelName`

(Alias of string)

Appears in:

`ModelRef`

Appears in:

ModelRef refers to a created Model with its role.

Field	Description
`name` [Required] `ModelName`	Name represents the model name.
`role` `ModelRole`	Role represents the model role when more than one model is required. For example, a draft role means running with SpeculativeDecoding, and the default arguments for the backend will be looked up in the backendRuntime under the name speculative-decoding.

`ModelRole`

(Alias of string)

Appears in:

ModelRef

`ModelSource`

Appears in:

ModelSpec

ModelSource represents the source of the model. Only one model source will be used.

Field	Description
`modelHub` `ModelHub`	ModelHub represents the model registry for model downloads.
`uri` `URIProtocol`	URI represents various kinds of model sources following the URI protocol, protocol://, e.g. oss://./, ollama://llama3.3, host://
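
Since only one model source is used, a spec would typically set either `modelHub` or `uri`. A minimal sketch of the `uri` form, reusing the ollama example from the description:

```yaml
# Fragment of an OpenModel spec: load the model via a URI instead of a model hub.
source:
  uri: ollama://llama3.3
```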

`ModelSpec`

Appears in:

OpenModel

ModelSpec defines the desired state of Model

Field	Description
`familyName` [Required] `ModelName`	FamilyName represents the model type, like llama2, which will be automatically injected into the labels with the key `llmaz.io/model-family-name`.
`source` [Required] `ModelSource`	Source represents the source of the model. There are several ways to load the model, such as from huggingface, an OCI registry, s3, a host path and so on.
`inferenceConfig` [Required] `InferenceConfig`	InferenceConfig represents the inference configurations for the model.
`ownedBy` `string`	OwnedBy represents the owner of the running models served by the backends, which will be exported as the "OwnedBy" field in the openai-compatible API "/models". Defaults to "llmaz" if not set.
`createdAt` `k8s.io/apimachinery/pkg/apis/meta/v1.Time`	CreatedAt represents the creation timestamp of the running models served by the backends, which will be exported as the "Created" field in the openai-compatible API "/models". It follows the RFC 3339 format, for example "2024-05-21T10:00:00Z".
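
Putting the ModelSpec fields together, a complete OpenModel manifest might look like the sketch below. The object name, family name, flavor, and owner are hypothetical values; the field names, `apiVersion`, and `kind` are taken from this reference.

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b                        # hypothetical object name
spec:
  familyName: llama3                     # injected as the llmaz.io/model-family-name label
  source:
    modelHub:
      modelID: meta-llama/Meta-Llama-3-8B
  inferenceConfig:
    flavors:
      - name: single-gpu                 # hypothetical flavor name
        limits:
          nvidia.com/gpu: 1
  ownedBy: example-team                  # hypothetical; defaults to "llmaz" if unset
  createdAt: "2024-05-21T10:00:00Z"      # RFC 3339 timestamp
```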

`ModelStatus`

Appears in:

OpenModel

ModelStatus defines the observed state of Model

Field	Description
`conditions` [Required] `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition`	Conditions represents the Inference condition.

`URIProtocol`

(Alias of string)

Appears in:

ModelSource

URIProtocol represents the protocol of the URI.

2 - llmaz inference API

Generated API reference documentation for inference.llmaz.io/v1alpha1.

Resource Types

`Playground`

Appears in:

Playground is the Schema for the playgrounds API

Field	Description
`apiVersion` string	`inference.llmaz.io/v1alpha1`
`kind` string	`Playground`
`spec` [Required] `PlaygroundSpec`	No description provided.
`status` [Required] `PlaygroundStatus`	No description provided.

`Service`

Appears in:

Service is the Schema for the services API

Field	Description
`apiVersion` string	`inference.llmaz.io/v1alpha1`
`kind` string	`Service`
`spec` [Required] `ServiceSpec`	No description provided.
`status` [Required] `ServiceStatus`	No description provided.

`BackendName`

(Alias of string)

Appears in:

BackendRuntimeConfig

`BackendRuntime`

Appears in:

BackendRuntime is the Schema for the backendRuntime API

Field	Description
`spec` [Required] `BackendRuntimeSpec`	No description provided.
`status` [Required] `BackendRuntimeStatus`	No description provided.

`BackendRuntimeConfig`

Appears in:

PlaygroundSpec

Field	Description
`backendName` `BackendName`	BackendName represents the inference backend under the hood, e.g. vLLM.
`version` `string`	Version represents the backend version if you want a different one from the default version.
`envs` `[]k8s.io/api/core/v1.EnvVar`	Envs represents the environment variables set on the container.
`configName` [Required] `string`	ConfigName represents the recommended configuration name for the backend. It will be inferred from the models at runtime if not specified, e.g. default, speculative-decoding.
`args` `[]string`	Args defined here will be appended to the args defined in the recommendedConfig, whether that config is explicitly selected with configName or inferred at runtime.
`resources` `ResourceRequirements`	Resources represents the resource requirements for the backend, like cpu/mem. Accelerators like GPU should not be defined here but in the model flavors, or the values here will be overwritten. Resources defined here will "overwrite" the resources in the recommendedConfig.
`sharedMemorySize` `k8s.io/apimachinery/pkg/api/resource.Quantity`	SharedMemorySize represents the size of /dev/shm required in the runtime of inference workload. SharedMemorySize defined here will "overwrite" the sharedMemorySize in the recommendedConfig.
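
The append and overwrite semantics above can be illustrated with a Playground fragment. This is a sketch under assumptions: the lowercase backend name, the placeholder version, the environment variable, and the resource values are all hypothetical, and `--max-model-len` is a vLLM server flag used only as an example of an extra argument.

```yaml
# Fragment of a Playground spec (not a complete manifest).
backendRuntimeConfig:
  backendName: vllm                # assumed spelling of the backend name
  version: <backend-version>       # placeholder; omit to use the default version
  configName: default              # which recommendedConfig to start from
  args:                            # appended to the args of the chosen recommendedConfig
    - --max-model-len
    - "4096"
  envs:
    - name: EXAMPLE_ENV            # hypothetical environment variable
      value: "1"
  resources:                       # overwrites the resources in the recommendedConfig
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi
  sharedMemorySize: 2Gi            # overwrites the /dev/shm size in the recommendedConfig
```

No GPU appears under `resources`; accelerators belong in the model flavors.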

`BackendRuntimeSpec`

Appears in:

BackendRuntime

BackendRuntimeSpec defines the desired state of BackendRuntime

Field	Description
`command` `[]string`	Command represents the default command for the backendRuntime.
`image` [Required] `string`	Image represents the default image registry of the backendRuntime. It will work together with version to make up a real image.
`version` [Required] `string`	Version represents the default version of the backendRuntime. It will be appended to the image as a tag.
`envs` `[]k8s.io/api/core/v1.EnvVar`	Envs represents the environment variables set on the container.
`lifecycle` `k8s.io/api/core/v1.Lifecycle`	Lifecycle represents hooks executed during the lifecycle of the container.
`livenessProbe` `k8s.io/api/core/v1.Probe`	Periodic probe of backend liveness. Backend will be restarted if the probe fails. Cannot be updated.
`readinessProbe` `k8s.io/api/core/v1.Probe`	Periodic probe of backend readiness. Backend will be removed from service endpoints if the probe fails.
`startupProbe` `k8s.io/api/core/v1.Probe`	StartupProbe indicates that the Backend has successfully initialized. If specified, no other probes are executed until this completes successfully. If this probe fails, the backend will be restarted, just as if the livenessProbe failed. This can be used to provide different probe parameters at the beginning of a backend's lifecycle, when it might take a long time to load data or warm a cache, than during steady-state operation.
`recommendedConfigs` `[]RecommendedConfig`	RecommendedConfigs represents the recommended configurations for the backendRuntime.
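
A BackendRuntime built from these fields could be sketched as follows. Everything concrete here is an assumption for illustration: the object name, image, version, command, probe endpoint, and port are hypothetical, and `apiVersion` is inferred from the API group of this section. The `{{ .CONFIG }}` placeholder is copied literally from the `args` description of RecommendedConfig; the concrete placeholder keys available for rendering are not listed in this reference.

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  name: example-backend                          # hypothetical
spec:
  image: registry.example.com/example/backend    # hypothetical image registry
  version: v0.1.0                                # appended to the image as a tag
  command:
    - /usr/bin/example-server                    # hypothetical command
  startupProbe:                                  # allows a slow model load before other probes run
    httpGet:
      path: /health                              # hypothetical endpoint
      port: 8080                                 # hypothetical port
    periodSeconds: 10
    failureThreshold: 30
  recommendedConfigs:
    - name: default
      args:
        - --model
        - "{{ .CONFIG }}"                        # rendered at runtime; CONFIG stands in for a real key
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
      sharedMemorySize: 2Gi
```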

`BackendRuntimeStatus`

Appears in:

BackendRuntime

BackendRuntimeStatus defines the observed state of BackendRuntime

Field	Description
`conditions` [Required] `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition`	Conditions represents the Inference condition.

`ElasticConfig`

Appears in:

PlaygroundSpec

Field	Description
`minReplicas` `int32`	MinReplicas indicates the minimum number of inference workloads based on the traffic. Defaults to 1. MinReplicas cannot be 0 for now; serverless will be supported in the future.
`maxReplicas` [Required] `int32`	MaxReplicas indicates the maximum number of inference workloads based on the traffic. Defaults to nil, which means there's no limit on the number of instances.
`scaleTrigger` `ScaleTrigger`	ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time, mostly used in Playground. ScaleTrigger defined here will "overwrite" the scaleTrigger in the recommendedConfig.

`HPATrigger`

Appears in:

ScaleTrigger

HPATrigger represents the configuration of the HorizontalPodAutoscaler. Inspired by kubernetes.io/pkg/apis/autoscaling/types.go#HorizontalPodAutoscalerSpec. Note: the HPA component should be installed beforehand.

Field	Description
`metrics` `[]k8s.io/api/autoscaling/v2.MetricSpec`	metrics contains the specifications used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated by multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond.
`behavior` `k8s.io/api/autoscaling/v2.HorizontalPodAutoscalerBehavior`	behavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively). If not set, the default HPAScalingRules for scale up and scale down are used.
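
To show how ElasticConfig, ScaleTrigger, and HPATrigger nest, here is a hedged sketch of an `elasticConfig` that scales on CPU utilization. The replica bounds, the 80% target, and the stabilization window are arbitrary example values; the `metrics` and `behavior` entries use the standard `autoscaling/v2` field layout.

```yaml
# Fragment of a Playground spec (not a complete manifest).
elasticConfig:
  minReplicas: 1                         # cannot be 0 for now
  maxReplicas: 3
  scaleTrigger:
    hpa:
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80     # example target, in percent
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # example: wait 5 minutes before scaling down
```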

`PlaygroundSpec`

Appears in:

Playground

PlaygroundSpec defines the desired state of Playground

Field	Description
`replicas` `int32`	Replicas represents the replica number of inference workloads.
`modelClaim` `ModelClaim`	ModelClaim represents a claim for one model; it's a simplified use case of modelClaims. Most of the time, modelClaim is enough. ModelClaim and modelClaims are mutually exclusive.
`modelClaims` `ModelClaims`	ModelClaims represents claims for multiple models, for more complicated use cases like speculative-decoding. ModelClaims and modelClaim are mutually exclusive.
`backendRuntimeConfig` `BackendRuntimeConfig`	BackendRuntimeConfig represents the inference backendRuntime configuration under the hood, e.g. vLLM, which is the default backendRuntime.
`elasticConfig` `ElasticConfig`	ElasticConfig defines the configuration for elastic usage, e.g. the max/min replicas.
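
A minimal Playground might be sketched as below. Note the assumptions: the fields of `ModelClaim` are not documented in this reference, so `modelName` is an assumed field name, and the object and model names are hypothetical. Only `modelClaim` is set, since it and `modelClaims` are mutually exclusive.

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: llama3-8b                  # hypothetical object name
spec:
  replicas: 1
  modelClaim:
    modelName: llama3-8b           # assumed field; refers to an existing OpenModel
  backendRuntimeConfig:
    backendName: vllm              # assumed spelling; vLLM is the default backendRuntime
```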

`PlaygroundStatus`

Appears in:

Playground

PlaygroundStatus defines the observed state of Playground

Field	Description
`conditions` [Required] `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition`	Conditions represents the Inference condition.
`replicas` [Required] `int32`	Replicas track the replicas that have been created, whether ready or not.
`selector` [Required] `string`	Selector points to the string form of a label selector which will be used by HPA.

`RecommendedConfig`

Appears in:

BackendRuntimeSpec

RecommendedConfig represents the recommended configurations for the backendRuntime; users can choose one of them to apply.

Field	Description
`name` [Required] `string`	Name represents the identifier of the config.
`args` `[]string`	Args represents all the arguments for the command. An argument wrapped in {{ .CONFIG }} is a placeholder waiting to be rendered.
`resources` `ResourceRequirements`	Resources represents the resource requirements for the backend, like cpu/mem. Accelerators like GPU should not be defined here but in the model flavors, or the values here will be overwritten.
`sharedMemorySize` `k8s.io/apimachinery/pkg/api/resource.Quantity`	SharedMemorySize represents the size of /dev/shm required in the runtime of inference workload.
`scaleTrigger` `ScaleTrigger`	ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time.

`ResourceRequirements`

Appears in:

TODO: DRA is not supported yet; support can be added once needed.

Field	Description
`limits` `k8s.io/api/core/v1.ResourceList`	Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
`requests` `k8s.io/api/core/v1.ResourceList`	Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

`ScaleTrigger`

Appears in:

ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time, mostly used in Playground.

Field	Description
`hpa` [Required] `HPATrigger`	HPA represents the trigger configuration of the HorizontalPodAutoscaler.

`ServiceSpec`

Appears in:

Service

ServiceSpec defines the desired state of Service. The Service controller will maintain multi-flavor workloads with different accelerators for cost or performance considerations.

Field	Description
`modelClaims` [Required] `ModelClaims`	ModelClaims represents multiple claims for different models.
`replicas` `int32`	Replicas represents the replica number of inference workloads.
`workloadTemplate` [Required] `sigs.k8s.io/lws/api/leaderworkerset/v1.LeaderWorkerTemplate`	WorkloadTemplate defines the template for leader/worker pods
`rolloutStrategy` `sigs.k8s.io/lws/api/leaderworkerset/v1.RolloutStrategy`	RolloutStrategy defines the strategy that will be applied to update replicas when a revision is made to the leaderWorkerTemplate.
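
For comparison with Playground, a Service exposes the workload template directly. The sketch below rests on several assumptions: the fields of `ModelClaims` are not documented in this reference, so `models` (a list of ModelRef entries) is an assumed field name; the object, model, and container names and the image are hypothetical. `size` and `workerTemplate` are fields of the LeaderWorkerSet `LeaderWorkerTemplate`.

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Service
metadata:
  name: example-service            # hypothetical object name
spec:
  replicas: 1
  modelClaims:
    models:                        # assumed field holding ModelRef entries
      - name: llama3-8b            # hypothetical model name
  workloadTemplate:
    size: 1                        # pods per replica group
    workerTemplate:
      spec:
        containers:
          - name: model-runner     # hypothetical container name
            image: registry.example.com/example/backend:v0.1.0   # hypothetical image
            resources:             # cpu/mem are set here; accelerators come from the flavor
              requests:
                cpu: "4"
                memory: 16Gi
```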

`ServiceStatus`

Appears in:

Service

ServiceStatus defines the observed state of Service

Field	Description
`conditions` [Required] `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition`	Conditions represents the Inference condition.
`replicas` [Required] `int32`	Replicas track the replicas that have been created, whether ready or not.
`selector` [Required] `string`	Selector points to the string form of a label selector, which the HPA uses to autoscale your resource.