# KServe
KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases:

- Provides a performant, standardized inference protocol across ML frameworks.
- Supports modern serverless inference workloads with autoscaling, including scale-to-zero on GPU.
- Offers simple, pluggable production serving for ML, including prediction, pre/post-processing, monitoring, and explainability.
## Enabling node selector for inference services
KServe supports placing model workloads on specific nodes via node selectors.
To enable this feature, add the following values to the `config-features` ConfigMap in the `knative-serving` namespace:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-affinity: "enabled"
  kubernetes.podspec-nodeselector: "enabled"
  kubernetes.podspec-tolerations: "enabled"
```
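Alternatively, the same flags can be set without replacing the whole ConfigMap, using a merge patch (a sketch, assuming `kubectl` access to the cluster):

```shell
# Enable the three pod-spec feature flags in one merge patch
kubectl patch configmap config-features -n knative-serving --type merge \
  -p '{"data":{"kubernetes.podspec-affinity":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-tolerations":"enabled"}}'
```

This preserves any other keys already present in `data`, which a plain `kubectl apply` of the manifest above would overwrite.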
You can now specify nodes in your InferenceService spec:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-5-72b-instruct
spec:
  predictor:
    nodeSelector: # also works with affinity and tolerations
      nvidia.com/mig.config: all-disabled # these are node labels
      nvidia.com/gpu.product: NVIDIA-H100-NVL
```
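The selector only matches if those labels actually exist on your nodes (with the NVIDIA GPU Operator they are typically populated by GPU feature discovery). A quick way to inspect them:

```shell
# List nodes together with the GPU-related labels used in the selector above
kubectl get nodes -L nvidia.com/gpu.product -L nvidia.com/mig.config
```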
## Enabling local model cache for an S3 bucket
KServe's local model cache can significantly speed up Pod startup time for LLM inference. The KServe documentation example linked above gives an overview of how to cache a model from a persistent volume; the documentation below provides an example of how to enable caching for a model stored in an S3 bucket.
To put a model into an S3 bucket, follow the instructions from here, but skip the last step (creating an InferenceService). Instead, execute the following steps:
- First, enable the local model cache, as specified in the documentation.
- Create a custom `ClusterStorageContainer`. It is the custom resource that will be responsible for loading the model from an S3 bucket:

  ```yaml
  apiVersion: serving.kserve.io/v1alpha1
  kind: ClusterStorageContainer
  metadata:
    name: s3
  spec:
    container:
      name: storage-initializer
      image: kserve/storage-initializer:v0.15.2
      env:
        # the secret should already be created during loading the model into an S3 bucket
        - name: AWS_ACCESS_KEY_ID
          valueFrom: { secretKeyRef: { name: s3creds-kserve, key: AWS_ACCESS_KEY_ID } }
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom: { secretKeyRef: { name: s3creds-kserve, key: AWS_SECRET_ACCESS_KEY } }
        - name: AWS_ENDPOINT_URL
          value: http://minio.minio:80
        - name: S3_USE_HTTPS
          value: "0"
      resources:
        requests: { cpu: "1", memory: 2Gi }
        limits: { cpu: "1", memory: 4Gi }
    supportedUriFormats:
      - prefix: s3://
    workloadType: localModelDownloadJob
  ```
- Create a `LocalModelNodeGroup` custom resource. It specifies the nodes where the model will be cached:

  ```yaml
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: local-storage
  provisioner: kubernetes.io/no-provisioner
  volumeBindingMode: WaitForFirstConsumer
  ---
  apiVersion: serving.kserve.io/v1alpha1
  kind: LocalModelNodeGroup
  metadata:
    name: workers
  spec:
    storageLimit: 170Gi
    persistentVolumeClaimSpec:
      accessModes: [ReadWriteOnce]
      resources: { requests: { storage: 170Gi } }
      storageClassName: local-storage
      volumeMode: Filesystem
      volumeName: models
    persistentVolumeSpec:
      accessModes: [ReadWriteOnce]
      volumeMode: Filesystem
      capacity: { storage: 170Gi }
      local: { path: /models }
      storageClassName: local-storage
      nodeAffinity:
        required: # example node selector - nodes with H100 GPUs
          nodeSelectorTerms:
            - matchExpressions:
                - key: nvidia.com/gpu.product
                  operator: In
                  values: ["NVIDIA-H100-NVL"]
  ```
- Workaround step. As of this writing, the local model cache is effectively disabled for Kubeflow by default (see this issue). This is reflected in this manifest as a delete patch.
  - You can manually deploy the DaemonSet from here.
  - Alternatively, uncomment the mentioned delete patch in the gitops branch's corresponding manifest (`paas/manifests/applications/kserve/kserve/kustomization.yaml`), push to origin, and sync the kubeflow Argo app.
  - You might need to manually label the nodes where you want to cache the model with the `kserve/localmodel: worker` label.

  If this step was executed correctly, you should see the DaemonSet's pods running on each labelled node.
- Create the CR responsible for model caching. This should trigger caching jobs on each labelled node:

  ```yaml
  apiVersion: serving.kserve.io/v1alpha1
  kind: LocalModelCache
  metadata:
    name: qwen-72b-weights
  spec:
    sourceModelUri: s3://prokube-demo-profile-data/qwen # path to where config.json can be found
    modelSize: 140Gi
    nodeGroups:
      - workers
  ```
- Ensure that the caching is finished: the download Jobs are complete, or run

  ```shell
  kubectl get localmodelnode -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.modelStatus}{"\n"}{end}'
  ```

  The status on all listed nodes should be `ModelDownloaded`. After that, deploy an InferenceService:

  ```yaml
  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: qwen2-5-72b-instruct
  spec:
    predictor:
      nodeSelector:
        nvidia.com/mig.config: all-disabled
        nvidia.com/gpu.product: NVIDIA-H100-NVL
      model:
        modelFormat:
          name: huggingface
        # this should match LocalModelCache
        storageUri: "s3://prokube-demo-profile-data/qwen"
        args:
          - --served-model-name=qwen2-5-72b-instruct
          - --dtype=bfloat16
          - --max-model-len=10240
          - --quantization=fp8
          - --tensor-parallel-size=2
        resources:
          limits:
            cpu: "8"
            memory: 64Gi
            nvidia.com/gpu: "2"
          requests:
            cpu: "3"
            memory: 64Gi
            nvidia.com/gpu: "2"
  ```
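Once the InferenceService is ready, you can run a quick smoke test against it. This is a sketch: take the real URL from `kubectl get inferenceservice` (`SERVICE_URL` below is a placeholder), and the `/openai/...` path assumes the KServe huggingface runtime's OpenAI-compatible API:

```shell
# Show the external URL and readiness of the service
kubectl get inferenceservice qwen2-5-72b-instruct

# Send a test chat completion; SERVICE_URL is a placeholder for the URL above
curl -s "${SERVICE_URL}/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2-5-72b-instruct", "messages": [{"role": "user", "content": "Say hello"}]}'
```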
The model startup time should be dramatically reduced! To ensure cluster safety, delete the created `kserve-localmodelnode-agent` DaemonSet after executing all the steps.
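The cleanup mentioned above can be done with a single command. The `kserve` namespace is an assumption here — locate the DaemonSet first and adjust accordingly:

```shell
# Locate the agent DaemonSet, then remove it
kubectl get daemonset -A | grep localmodelnode
kubectl delete daemonset kserve-localmodelnode-agent -n kserve
```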