Example Usages⚓︎
KServe supports multiple ML frameworks. Below are examples for different model types:
scikit-learn Example⚓︎
To deploy a model, you need to create an InferenceService object; see the
KServe documentation for a first example.
Storing and accessing the model
KServe on a prokube cluster is configured somewhat differently than in the default setup. The instructions below are adjusted accordingly.
1. Store the model and create InferenceService⚓︎
The KServe scikit-learn (sklearn) model defined in the example InferenceService above is stored in Google Cloud Storage, which requires additional configuration to be accessible from a prokube cluster.
The simplest workaround is to download the model and upload it to a MinIO bucket in your prokube cluster. This requires a configured Google Cloud SDK. The workflow is as follows:
1. Download the model to your current local directory:

    ```shell
    gsutil cp -r gs://kfserving-examples/models/sklearn/1.0/ .
    ```

2. Upload the downloaded model to a MinIO bucket BUCKET_NAME through the web interface.

3. Create a ServiceAccount for KServe which gives it access to the buckets in your namespace, together with a matching secret with the right annotations:

    ```shell
    export NAMESPACE="<YOUR_NAMESPACE>"
    kubectl get secret s3creds -n ${NAMESPACE} -o yaml \
      | yq e '.metadata.name = "s3creds-kserve" |
        .metadata.annotations."serving.kserve.io/s3-endpoint" = "minio.minio" |
        .metadata.annotations."serving.kserve.io/s3-usehttps" = "0" |
        del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.annotations."kubectl.kubernetes.io/last-applied-configuration")' - \
      | kubectl apply -n ${NAMESPACE} -f -
    kubectl apply -n ${NAMESPACE} -f - <<'EOF'
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: s3creds
    secrets:
      - name: s3creds-kserve
    EOF
    ```

4. Specify the model path and namespace in the InferenceService and deploy it:

    ```shell
    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
      namespace: {NS}
    spec:
      predictor:
        serviceAccountName: s3creds
        model:
          modelFormat:
            name: sklearn
          storageUri: "s3://{BUCKET_NAME}/{PATH_TO_MODEL}"
    EOF
    ```
2. Access the model⚓︎
To access the model from outside the cluster, you need to send POST requests to a specific URL. The URL is constructed as follows:
https://${INGRESS_HOST}/serving/${NS}/${INFERENCE_SERVICE_NAME}/v1/models/${MODEL_NAME}:predict
The INGRESS_HOST is the domain name of the cluster. The NS is the
namespace where you deployed the model, e.g., your own namespace. The
INFERENCE_SERVICE_NAME is the name of the InferenceService object you created.
If you followed along with the KServe documentation and created an InferenceService
named sklearn-iris, your namespace is called john-doe, and your prokube
installation is available under prokube.example.com, you would use the
following URL: https://prokube.example.com/serving/john-doe/sklearn-iris/v1/models/sklearn-iris:predict.
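The URL construction described above can be sketched as a small helper (the parameter names mirror the placeholders in the template):

```python
def predict_url(ingress_host: str, ns: str, service_name: str, model_name: str) -> str:
    """Build the external KServe v1 predict URL used on a prokube cluster."""
    return (
        f"https://{ingress_host}/serving/{ns}/"
        f"{service_name}/v1/models/{model_name}:predict"
    )

# reproduces the example URL from the text
print(predict_url("prokube.example.com", "john-doe", "sklearn-iris", "sklearn-iris"))
# https://prokube.example.com/serving/john-doe/sklearn-iris/v1/models/sklearn-iris:predict
```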
You will also need to set the x-api-key header to a valid API key (you
can get one from your administrator; see the FAQ).
curl example⚓︎
# save test data to a json file
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
# run prediction
curl \
-H "Content-Type: application/json" \
-H "x-api-key: ${X_API_KEY}" \
"https://${INGRESS_HOST}/serving/${NS}/sklearn-iris/v1/models/sklearn-iris:predict" \
-d @./iris-input.json
You can add the -k flag to the curl command to ignore SSL errors (not recommended).
Tip
If everything is configured correctly, you will get {"predictions": [1, 1]} as the last line in the response.
Python example⚓︎
import requests

INGRESS_HOST = "prokube.example.com"  # your cluster's domain
NS = "john-doe"                       # your namespace
X_API_KEY = "..."                     # valid API key (see FAQ)

url = f"https://{INGRESS_HOST}/serving/{NS}/sklearn-iris/v1/models/sklearn-iris:predict"
headers = {
    "x-api-key": X_API_KEY
}
response = requests.post(url, headers=headers, json={"instances": [[5.1, 3.5, 1.4, 0.2]]})
print(response.json())
PyTorch with Triton Example⚓︎
For deep learning models created with PyTorch, we can use NVIDIA's Triton Inference Server for high-performance serving. This is especially useful for models that benefit from GPU acceleration.
1. Export a PyTorch model for Triton⚓︎
To properly serve with Triton, you need to create a specific directory structure:
model_repository/
└── your_model_name/
├── config.pbtxt # Triton configuration file
└── 1/ # Model version
└── model.pt # TorchScript model file
The config.pbtxt file should look like this:
name: "your_model_name"
platform: "pytorch_libtorch"
max_batch_size: 64
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ feature_dim ]  # your model's input dimension
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ feature_dim ]  # your model's output dimension
  }
]
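The repository layout and config.pbtxt above can be generated with a short script. This is a sketch assuming a single FP32 input and output; the TorchScript model.pt itself still has to be produced separately (e.g. with torch.jit.trace and torch.jit.save) and placed in the version directory:

```python
from pathlib import Path

# template mirroring the config.pbtxt shown above (dims are placeholders)
CONFIG_TEMPLATE = """name: "{name}"
platform: "pytorch_libtorch"
max_batch_size: {max_batch}
input [
  {{ name: "input__0", data_type: TYPE_FP32, dims: [ {in_dim} ] }}
]
output [
  {{ name: "output__0", data_type: TYPE_FP32, dims: [ {out_dim} ] }}
]
"""

def make_model_repository(root: str, name: str, in_dim: int, out_dim: int,
                          max_batch: int = 64) -> Path:
    """Create model_repository/<name>/{config.pbtxt, 1/} as Triton expects."""
    model_dir = Path(root) / name
    (model_dir / "1").mkdir(parents=True, exist_ok=True)  # version directory
    (model_dir / "config.pbtxt").write_text(CONFIG_TEMPLATE.format(
        name=name, max_batch=max_batch, in_dim=in_dim, out_dim=out_dim))
    return model_dir
```

After running this, copy your TorchScript file to `<returned dir>/1/model.pt` and upload the whole model_repository directory to your bucket.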
2. Create the InferenceService⚓︎
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "pytorch-model"
  namespace: {NS}
spec:
  predictor:
    triton:
      storageUri: "s3://{BUCKET_NAME}/{PATH_TO_MODEL_REPOSITORY}"
      runtimeVersion: "22.12-py3"  # choose an appropriate version
      env:
        - name: OMP_NUM_THREADS
          value: "1"
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
EOF
3. Access the model⚓︎
Similar to the sklearn example, you can access your PyTorch model through the Triton server using the KServe API. For Triton, you can use either the external URL (similar to the sklearn example) or the cluster-internal URL:
import requests
import json
import numpy as np
# Configuration
INFERENCE_SERVICE_NAME = "pytorch-model" # Name of your InferenceService
MODEL_NAME = "your_model_name" # Name of your model as defined in config.pbtxt
NAMESPACE = "your-namespace" # Your namespace
API_KEY = "your-api-key" # Get this from your administrator
# Prepare your input data
# Adjust batch_size and feature_dim according to your model's requirements
batch_size = 1
feature_dim = 10 # Example: 10 features
input_data = np.random.rand(batch_size, feature_dim).astype(np.float32).tolist()
# Cluster-internal URL (for accessing from within the cluster)
url = f"http://{INFERENCE_SERVICE_NAME}.{NAMESPACE}.svc.cluster.local/v2/models/{MODEL_NAME}/infer"
headers = {
    "Content-Type": "application/json",
    "X-API-Key": API_KEY
}
payload = {
    "inputs": [{
        "name": "input__0",
        "shape": [batch_size, feature_dim],
        "datatype": "FP32",
        "data": input_data
    }]
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
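The response mirrors the request's v2 format: each output carries a flat data list plus its shape. A small helper can restore the batch structure; this is a sketch over a mocked response dictionary (no server needed):

```python
def reshape_output(response: dict, index: int = 0):
    """Re-nest the flat v2 'data' list according to its 'shape' [batch, dim]."""
    out = response["outputs"][index]
    batch, dim = out["shape"]
    flat = out["data"]
    return [flat[i * dim:(i + 1) * dim] for i in range(batch)]

# mocked response in the v2 format described above
mock = {"outputs": [{"name": "output__0", "shape": [2, 3], "datatype": "FP32",
                     "data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}]}
print(reshape_output(mock))  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```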
Serving models from an S3 bucket⚓︎
To serve models from an S3 bucket, which is especially useful for deploying LLMs in production, we might want to cache models on the node. One way to do this is KServe's local model cache; however, this requires admin access to the cluster (see here for details).
This section describes a simpler alternative: loading the model into an S3 bucket using prokube's custom Kubeflow notebooks. This approach, however, does not reduce model load time, as the model still needs to be loaded into the pod's ephemeral storage.
First, create a Jupyter notebook in the Kubeflow UI; don't forget to adjust the PVC size according to your model's storage needs. Then open a terminal in the notebook UI and run the following commands:
export HUGGING_FACE_HUB_TOKEN="YOUR_TOKEN"
# 0) (optional) speed + resumable cache on disk
export HF_HOME="${HOME}/.cache/huggingface"    # keeps chunks so re-runs resume
# 1) define variables and directories
export MINIO_ALIAS="minio"                     # default in prokube notebooks
export BUCKET_NAME="prokube-demo-profile-data" # default bucket for default profile
export HF_MODEL="Qwen/Qwen2.5-72B-Instruct"    # example
export MODEL_SUBDIR="qwen2-5-72b-instruct"
# 2) download from HF to a local dir (no symlinks; resumes automatically)
pip install -q --upgrade "huggingface_hub[cli]>=0.25,<0.28" hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download ${HF_MODEL} \
  --local-dir "${HOME}/models/${MODEL_SUBDIR}" \
  --local-dir-use-symlinks False \
  --max-workers 4
# 3) copy to MinIO
mc cp --recursive \
  "${HOME}/models/${MODEL_SUBDIR}/" \
  "${MINIO_ALIAS}/${BUCKET_NAME}/${MODEL_SUBDIR}/"
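Before copying to MinIO, it can help to confirm the download is complete. The manifest below expects config.json at the storageUri root, so a hypothetical helper checking for it might look like this (a sketch; extend REQUIRED with whatever files your model needs, e.g. tokenizer and weight files):

```python
from pathlib import Path

# config.json is the file the serving runtime resolves the model from;
# this list is an illustrative minimum, not an exhaustive requirement
REQUIRED = ["config.json"]

def check_model_dir(path: str) -> list:
    """Return the names of required files missing from a downloaded model dir."""
    root = Path(path)
    return [name for name in REQUIRED if not (root / name).is_file()]

# e.g. check_model_dir(f"{HOME}/models/{MODEL_SUBDIR}") returns [] when complete
```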
To be able to pull the model from a bucket, we need a Secret with read access to MinIO and the correct annotations:
# we need s3creds secret with custom annotations
export NAMESPACE="<YOUR_NAMESPACE>"
kubectl get secret s3creds -n ${NAMESPACE} -o yaml \
| yq e '.metadata.name = "s3creds-kserve" |
.metadata.annotations."serving.kserve.io/s3-endpoint" = "minio.minio" |
.metadata.annotations."serving.kserve.io/s3-usehttps" = "0" |
del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.annotations."kubectl.kubernetes.io/last-applied-configuration")' - \
| kubectl apply -n ${NAMESPACE} -f -
Then, we need to deploy a service account that uses those credentials:
kubectl apply -n ${NAMESPACE} -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3creds
secrets:
  - name: s3creds-kserve
EOF
After those are configured, deploy the InferenceService with the following
manifest:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-5-72b-instruct
spec:
  predictor:
    serviceAccountName: s3creds  # our SA
    model:
      modelFormat:
        name: huggingface
      storageUri: "s3://prokube-demo-profile-data"  # dir where "config.json" for the model lives
      args:
        - --served-model-name=qwen2-5-72b-instruct
      resources:
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "3"
          memory: 64Gi
          nvidia.com/gpu: "2"
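KServe's huggingface serving runtime exposes OpenAI-compatible endpoints. Assuming it is routed through the same /serving/{NS}/{name} external path and x-api-key scheme as the sklearn example (verify both against your cluster), a chat request could be sketched as follows; the snippet only builds the URL and payload, without sending anything:

```python
import json

INGRESS_HOST = "prokube.example.com"  # replace with your cluster's domain
NS = "your-namespace"                 # namespace of the InferenceService
SERVICE = "qwen2-5-72b-instruct"      # InferenceService name from the manifest above

# assumed endpoint path for the OpenAI-compatible chat API
url = f"https://{INGRESS_HOST}/serving/{NS}/{SERVICE}/openai/v1/chat/completions"
payload = {
    "model": "qwen2-5-72b-instruct",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
print(url)
print(json.dumps(payload, indent=2))
```

POST this payload with the x-api-key header set, e.g. requests.post(url, headers={"x-api-key": API_KEY}, json=payload).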
Serving models from MLflow⚓︎
Check out the prokube MLflow documentation.