KServe⚓︎

Links to external documentation

→ KServe Documentation

KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases:

  • Provide a performant, standardized inference protocol across ML frameworks.
  • Support modern serverless inference workloads with Autoscaling including Scale to Zero on GPU.
  • Simple, pluggable production serving for ML workloads, including prediction, pre/post-processing, monitoring, and explainability.

Enabling node selector for inference services⚓︎

KServe supports placing model workloads on specific nodes via node selectors.

To enable this feature, add the following values to the knative-serving/config-features ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-affinity: "enabled"
  kubernetes.podspec-nodeselector: "enabled"
  kubernetes.podspec-tolerations: "enabled"
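
If you prefer not to edit the ConfigMap by hand, the same flags can be set with a single kubectl patch (a sketch; adjust if your Knative installation manages this ConfigMap via GitOps):

```shell
# enable pod-spec affinity, node selectors and tolerations in Knative Serving
kubectl patch configmap config-features -n knative-serving --type merge \
  -p '{"data":{"kubernetes.podspec-affinity":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-tolerations":"enabled"}}'
```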

You can now specify nodes in your InferenceService spec:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-5-72b-instruct
spec:
  predictor:
    nodeSelector:  # also works with affinity and tolerations
      nvidia.com/mig.config: all-disabled  # these are node labels
      nvidia.com/gpu.product: NVIDIA-H100-NVL
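
To confirm that matching nodes exist before deploying, you can query the same labels (the label values here are the example ones from the spec above):

```shell
# list nodes that satisfy the example nodeSelector
kubectl get nodes -l nvidia.com/gpu.product=NVIDIA-H100-NVL,nvidia.com/mig.config=all-disabled
```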

Enabling local model cache for an S3 bucket⚓︎

KServe's local model cache can significantly speed up Pod startup time for LLM inference. The KServe documentation linked above gives an overview of how to cache a model from a persistent volume; the section below shows how to enable caching for a model stored in an S3 bucket.

To put a model into an S3 bucket, follow the instructions from here, but skip the last step (creating an InferenceService). Instead, execute the following steps:

  1. First, enable local model cache, as specified in the documentation.
  2. Create a custom ClusterStorageContainer. This custom resource is responsible for loading the model from an S3 bucket:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ClusterStorageContainer
    metadata:
      name: s3
    spec:
      container:
        name: storage-initializer
        image: kserve/storage-initializer:v0.15.2
        env:
          # the secret should already be created during loading model into an S3 bucket
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              { secretKeyRef: { name: s3creds-kserve, key: AWS_ACCESS_KEY_ID } }
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              { secretKeyRef: { name: s3creds-kserve, key: AWS_SECRET_ACCESS_KEY } }
          - name: AWS_ENDPOINT_URL
            value: http://minio.minio:80
          - name: S3_USE_HTTPS
            value: "0"
        resources:
          requests: { cpu: "1", memory: 2Gi }
          limits: { cpu: "1", memory: 4Gi }
      supportedUriFormats:
        - prefix: s3://
      workloadType: localModelDownloadJob
    
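    After applying the manifest, a quick sanity check can look like this (the resource name s3 is from the example above; the secret's namespace is an assumption, since the secret is resolved where the workload runs):

```shell
# verify the ClusterStorageContainer was created
kubectl get clusterstoragecontainer s3 -o yaml
# verify the referenced S3 credentials secret exists
# (namespace placeholder is hypothetical)
kubectl get secret s3creds-kserve -n <your-profile-namespace>
```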

  3. Create a LocalModelNodeGroup custom resource. It specifies the nodes on which the model will be cached.

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer
    ---
    apiVersion: serving.kserve.io/v1alpha1
    kind: LocalModelNodeGroup
    metadata:
      name: workers
    spec:
      storageLimit: 170Gi
      persistentVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 170Gi } }
        storageClassName: local-storage
        volumeMode: Filesystem
        volumeName: models
      persistentVolumeSpec:
        accessModes: [ReadWriteOnce]
        volumeMode: Filesystem
        capacity: { storage: 170Gi }
        local: { path: /models }
        storageClassName: local-storage
        nodeAffinity:
          required:
            # example node selector - nodes with H100 GPUs
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values: ["NVIDIA-H100-NVL"]
    

  4. Workaround step. As of this writing, the local model cache is effectively disabled for Kubeflow by default (see this issue). This is reflected in this manifest as a delete patch.

    • You can manually deploy the DaemonSet from here.
    • Alternatively, uncomment the mentioned delete patch in the gitops branch's corresponding manifest (paas/manifests/applications/kserve/kserve/kustomization.yaml), push to origin and sync kubeflow Argo app.
    • You might need to manually label the nodes where you want to cache the model with the kserve/localmodel: worker label.

    If this step was executed correctly, you should see the DaemonSet's pods running on each labelled node.
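
    Labeling and verification can look like the following (the node name is hypothetical; the DaemonSet name is taken from the cleanup note at the end of this page, and the kserve namespace is an assumption):

```shell
# mark a node as a cache worker (hypothetical node name)
kubectl label node gpu-worker-1 kserve/localmodel=worker
# confirm the agent DaemonSet has a pod on every labelled node
kubectl get daemonset kserve-localmodelnode-agent -n kserve
kubectl get pods -n kserve -o wide | grep localmodelnode
```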

  5. Create a CR responsible for model caching. This should trigger caching jobs on each labelled node.

    apiVersion: serving.kserve.io/v1alpha1
    kind: LocalModelCache
    metadata:
      name: qwen-72b-weights
    spec:
      sourceModelUri: s3://prokube-demo-profile-data/qwen  # path to where config.json can be found
      modelSize: 140Gi
      nodeGroups:
        - workers
    

  6. Ensure that caching has finished: the download Jobs should be complete, or run

    kubectl get localmodelnode -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.modelStatus}{"\n"}{end}'
    
    The status on all listed nodes should be "ModelDownloaded". After that, deploy an InferenceService:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: qwen2-5-72b-instruct
    spec:
      predictor:
        nodeSelector:
          nvidia.com/mig.config: all-disabled
          nvidia.com/gpu.product: NVIDIA-H100-NVL
        model:
          modelFormat:
            name: huggingface
          # this should match LocalModelCache
          storageUri: "s3://prokube-demo-profile-data/qwen"
          args:
            - --served-model-name=qwen2-5-72b-instruct
            - --dtype=bfloat16
            - --max-model-len=10240
            - --quantization=fp8
            - --tensor-parallel-size=2
          resources:
            limits:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: "2"
            requests:
              cpu: "3"
              memory: 64Gi
              nvidia.com/gpu: "2"
    
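
    Once the InferenceService is ready, a quick smoke test could look like this (the host is hypothetical; take the real URL from kubectl get inferenceservice qwen2-5-72b-instruct, and the OpenAI-compatible path assumes the HuggingFace serving runtime):

```shell
# send one chat completion request to the deployed model
curl -s http://qwen2-5-72b-instruct.<namespace>.<your-domain>/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2-5-72b-instruct",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 32}'
```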

The model startup time should be dramatically reduced! To ensure cluster safety, delete the created kserve-localmodelnode-agent DaemonSet after executing all the steps.