KServe⚓︎

Links to external documentation

→ KServe Documentation

KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases:

  • Provide a performant, standardized inference protocol across ML frameworks.
  • Support modern serverless inference workloads with Autoscaling including Scale to Zero on GPU.
  • Simple, pluggable production serving for ML workloads, including prediction, pre/post-processing, monitoring, and explainability.

Enabling node selector for inference services⚓︎

KServe supports placing model workloads on specific nodes via node selectors.

To enable this feature, add the following values to the knative-serving/config-features ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-affinity: "enabled"
  kubernetes.podspec-nodeselector: "enabled"
  kubernetes.podspec-tolerations: "enabled"
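
If you prefer not to edit the ConfigMap by hand, the same flags can be set with a single kubectl patch (a sketch; adjust if your Knative installation manages this ConfigMap via GitOps):

```shell
# enable pod-spec affinity, node selectors and tolerations in Knative Serving
kubectl patch configmap config-features -n knative-serving --type merge \
  -p '{"data":{"kubernetes.podspec-affinity":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-tolerations":"enabled"}}'
```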

You can now specify nodes in your InferenceService spec:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-5-72b-instruct
spec:
  predictor:
    nodeSelector:  # also works with affinity and tolerations
      nvidia.com/mig.config: all-disabled  # these are node labels
      nvidia.com/gpu.product: NVIDIA-H100-NVL
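
To confirm that matching nodes exist before deploying, you can query the same labels (the label values here are the example ones from the spec above):

```shell
# list nodes that satisfy the example nodeSelector
kubectl get nodes -l nvidia.com/gpu.product=NVIDIA-H100-NVL,nvidia.com/mig.config=all-disabled
```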

Enabling local model cache for an S3 bucket⚓︎

KServe's local model cache can significantly speed up Pod startup time for LLM inference. The KServe documentation linked above gives an overview of how to cache a model from a persistent volume; the section below shows how to enable caching for a model stored in an S3 bucket.

To put a model into an S3 bucket, follow the instructions from here, but skip the last step (creating an InferenceService). Instead, execute the following steps:

  1. First, enable local model cache, as specified in the documentation.
  2. Create a custom ClusterStorageContainer. This custom resource is responsible for loading the model from an S3 bucket:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ClusterStorageContainer
    metadata:
      name: s3
    spec:
      container:
        name: storage-initializer
        image: kserve/storage-initializer:v0.15.2
        env:
          # the secret should already be created during loading model into an S3 bucket
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              { secretKeyRef: { name: s3creds-kserve, key: AWS_ACCESS_KEY_ID } }
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              { secretKeyRef: { name: s3creds-kserve, key: AWS_SECRET_ACCESS_KEY } }
          - name: AWS_ENDPOINT_URL
            value: http://minio.minio:80
          - name: S3_USE_HTTPS
            value: "0"
        resources:
          requests: { cpu: "1", memory: 2Gi }
          limits: { cpu: "1", memory: 4Gi }
      supportedUriFormats:
        - prefix: s3://
      workloadType: localModelDownloadJob
    
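    After applying the manifest, a quick sanity check can look like this (the resource name s3 is from the example above; the secret's namespace is an assumption, since the secret is resolved where the workload runs):

```shell
# verify the ClusterStorageContainer was created
kubectl get clusterstoragecontainer s3 -o yaml
# verify the referenced S3 credentials secret exists
# (namespace placeholder is hypothetical)
kubectl get secret s3creds-kserve -n <your-profile-namespace>
```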

  3. Create a LocalModelNodeGroup custom resource. It specifies the nodes on which the model will be cached.

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer
    ---
    apiVersion: serving.kserve.io/v1alpha1
    kind: LocalModelNodeGroup
    metadata:
      name: workers
    spec:
      storageLimit: 170Gi
      persistentVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 170Gi } }
        storageClassName: local-storage
        volumeMode: Filesystem
        volumeName: models
      persistentVolumeSpec:
        accessModes: [ReadWriteOnce]
        volumeMode: Filesystem
        capacity: { storage: 170Gi }
        local: { path: /models }
        storageClassName: local-storage
        nodeAffinity:
          required:
            # example node selector - nodes with H100 GPUs
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values: ["NVIDIA-H100-NVL"]
    

  4. Workaround step. As of this writing, the local model cache is effectively disabled for Kubeflow by default (see this issue). This is reflected in this manifest as a delete patch.

    • You can manually deploy the DaemonSet from here.
    • Alternatively, uncomment the mentioned delete patch in the gitops branch's corresponding manifest (paas/manifests/applications/kserve/kserve/kustomization.yaml), push to origin and sync kubeflow Argo app.
    • You might need to manually label the nodes where you want to cache the model with the kserve/localmodel: worker label.

    If this step was executed correctly, you should see the DaemonSet's pods running on each labelled node.
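
    Labeling and verification can look like the following (the node name is hypothetical; the DaemonSet name is taken from the cleanup note at the end of this page, and the kserve namespace is an assumption):

```shell
# mark a node as a cache worker (hypothetical node name)
kubectl label node gpu-worker-1 kserve/localmodel=worker
# confirm the agent DaemonSet has a pod on every labelled node
kubectl get daemonset kserve-localmodelnode-agent -n kserve
kubectl get pods -n kserve -o wide | grep localmodelnode
```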

  5. Create a CR responsible for model caching. This should trigger caching jobs on each labelled node.

    apiVersion: serving.kserve.io/v1alpha1
    kind: LocalModelCache
    metadata:
      name: qwen-72b-weights
    spec:
      sourceModelUri: s3://prokube-demo-profile-data/qwen  # path to where config.json can be found
      modelSize: 140Gi
      nodeGroups:
        - workers
    

  6. Ensure that caching has finished: the download Jobs should be complete, or run

    kubectl get localmodelnode -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.modelStatus}{"\n"}{end}'
    
    The status on all listed nodes should be "ModelDownloaded". After that, deploy an InferenceService:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: qwen2-5-72b-instruct
    spec:
      predictor:
        nodeSelector:
          nvidia.com/mig.config: all-disabled
          nvidia.com/gpu.product: NVIDIA-H100-NVL
        model:
          modelFormat:
            name: huggingface
          # this should match LocalModelCache
          storageUri: "s3://prokube-demo-profile-data/qwen"
          args:
            - --served-model-name=qwen2-5-72b-instruct
            - --dtype=bfloat16
            - --max-model-len=10240
            - --quantization=fp8
            - --tensor-parallel-size=2
          resources:
            limits:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: "2"
            requests:
              cpu: "3"
              memory: 64Gi
              nvidia.com/gpu: "2"
    
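
    Once the InferenceService is ready, a quick smoke test could look like this (the host is hypothetical; take the real URL from kubectl get inferenceservice qwen2-5-72b-instruct, and the OpenAI-compatible path assumes the HuggingFace serving runtime):

```shell
# send one chat completion request to the deployed model
curl -s http://qwen2-5-72b-instruct.<namespace>.<your-domain>/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2-5-72b-instruct",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 32}'
```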

The model startup time should be dramatically reduced! To ensure cluster safety, delete the created kserve-localmodelnode-agent DaemonSet after executing all the steps.