Autoscaling⚓︎
KServe supports three types of autoscaling: the Knative Pod Autoscaler (KPA), the Kubernetes Horizontal Pod Autoscaler (HPA), and KEDA.
Knative Pod Autoscaler (KPA)⚓︎
This section covers Knative-based autoscaling, which is the default for KServe.
The huggingface examples below are largely based on the upstream documentation.
A prerequisite for running the test examples below is the hey load-testing tool (used to generate traffic). Install it on the machine from which you will make the API calls to the models. For example, to install it on a MacBook, run:
brew install hey
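If you are not on macOS, hey can usually also be installed with a recent Go toolchain (an alternative sketch; prebuilt binaries are available from the hey releases page as well):
go install github.com/rakyll/hey@latest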
Example⚓︎
Autoscaling works for models running both on CPU and on GPU. The example below is CPU-only, but it can easily be adapted to a GPU-based model.
To create an example workload, deploy a KServe InferenceService custom resource. KServe will take care of creating everything else for you, including the Pod and the route.
One way to enable autoscaling is to add the scaleTarget and scaleMetric keys to the InferenceService predictor spec. The example below uses scaleMetric: concurrency to determine the need for autoscaling, and scaleTarget: 1, which (in combination with this metric) sets a soft limit on the number of concurrent requests per replica. If there is a sudden burst of requests, this value can be exceeded. A strict per-replica concurrency cap, if desired, is set separately via containerConcurrency (see later examples).
The following code snippet deploys an example for testing autoscaling. It can be run from your local terminal or from a Kubeflow notebook terminal. You need a kubeconfig for connecting to your cluster in order to run the kubectl command. Also change NAMESPACE below if you want the workload deployed to a different namespace.
export METRIC="concurrency" # alternative: qps
export NAMESPACE="default" # change to your desired namespace
export ISVC_NAME="distilbert-cpu"
export MODEL_NAME="distilbert"
kubectl apply -n ${NAMESPACE} -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ${ISVC_NAME}
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: ${METRIC}
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=${MODEL_NAME}
        - --model_id=distilbert-base-uncased-finetuned-sst-2-english
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
        requests:
          cpu: "2"
          memory: 4Gi
EOF
Get the InferenceService's status and wait for READY to become True (you can also track this in the Kubeflow UI -> Endpoints):
kubectl get isvc $ISVC_NAME -n $NAMESPACE
Sample output:
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
distilbert-cpu https://<DOMAIN>/serving/default/distilbert-cpu True 100 distilbert-cpu-predictor-00002 2m
When the model is ready, get the endpoint URL (this requires jq; otherwise just copy-paste it from the previous command's output):
export MODEL_URL=$(kubectl get isvc $ISVC_NAME -n $NAMESPACE -o json | jq -r ".status.url")
Run concurrent requests with hey (you can obtain the API key from your admin, if you don't have one):
export X_API_KEY="<your key>"
# 5 concurrent runners
hey -z 30s -c 5 \
-m POST \
-H "Content-Type: application/json" \
-H "x-api-key: ${X_API_KEY}" \
-d '{"instances":["MLOps is wonderful!"]}' \
${MODEL_URL}/v1/models/${MODEL_NAME}:predict
Check the number of running pods now:
kubectl get po -n $NAMESPACE | grep $ISVC_NAME
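While hey is running, you can also follow the autoscaler's decisions in real time by watching the Knative PodAutoscaler object that Knative creates for the revision (a minimal sketch; the exact columns vary slightly between Knative versions):
# DESIREDSCALE vs ACTUALSCALE shows what the autoscaler wants vs what is currently running
kubectl get podautoscaler -n ${NAMESPACE} -w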
KServe uses the Knative autoscaler which is based on the average number of in-flight requests per pod (concurrency). As the scaling target is set to 1, and we load the service with 5 concurrent requests, the autoscaler tries scaling up to 5 pods. In practice, more than 5 pods can be created: if the cold-start time is long, incoming requests accumulate while no pods are ready, which triggers Knative's aggressive (panic) scale-up behavior. Take a look at hey output:
Response time histogram:
0.422 [1] |■■
2.163 [23] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
3.905 [5] |■■■■■■■■■
5.647 [2] |■■■
7.389 [2] |■■■
9.131 [2] |■■■
10.872 [1] |■■
12.614 [1] |■■
14.356 [1] |■■
16.098 [0] |
17.839 [4] |■■■■■■■
Notice that 4 requests on this histogram take around 18s: that is the cold-start cost of initially spawning the pods and downloading the model before they are ready to serve. The cold start may take longer (to pull the serving image) if the image is not cached on the node the pod is scheduled on.
You can also check out Knative Serving - Scaling Debugging dashboard in Grafana for additional visual details.

Alternatively, you can set export METRIC=qps in the example at the beginning of this section to enable scaling based on the number of queries per second (QPS). In this mode, the autoscaler averages the metric over a 60-second stable window, so it takes about a minute to stabilize at the desired level. However, it also maintains a 6-second panic window and enters panic mode if the metric in that window reaches 2x the target.
An example hey command for this case:
export X_API_KEY="<your key>"
# run 5 queries per second: build a queue
hey -z 30s -q 5 \
-m POST \
-H "Content-Type: application/json" \
-H "x-api-key: ${X_API_KEY}" \
-d '{"instances":["MLOps is wonderful!"]}' \
${MODEL_URL}/v1/models/${MODEL_NAME}:predict
Further metrics are available in the Knative docs.
Customization and Knobs⚓︎
There are several ways to influence autoscaling, available through the InferenceService spec:
spec:
  predictor: # or transformer
    scaleTarget: 1 # target per-replica value of scaleMetric; exceeding it triggers scale-up
    scaleMetric: qps # alternatives: concurrency, memory, cpu
    containerConcurrency: 10 # hard limit on concurrency per replica
    minReplicas: 0 # minimum available replicas (0 enables scaling to zero)
    maxReplicas: 10 # max available replicas
You can also influence scaling behaviour by adding Knative-related annotations (advanced) to the InferenceService spec. See details here. Example spec:
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
      autoscaling.knative.dev/metric: "concurrency"
      autoscaling.knative.dev/target-utilization-percentage: "70"
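For example, the 60-second stable window and the panic behaviour described earlier can themselves be tuned this way. A sketch that simply spells out the Knative defaults (adjust the values to your workload):
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/window: "60s" # stable window used to average the metric
      autoscaling.knative.dev/panic-window-percentage: "10.0" # 10% of the stable window = 6s
      autoscaling.knative.dev/panic-threshold-percentage: "200.0" # enter panic mode at 2x the target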
KEDA (Kubernetes Event-driven Autoscaling)⚓︎
This section covers manual KEDA setup with custom Prometheus metrics. For upstream reference, see the KServe KEDA autoscaler documentation.
For LLM inference workloads, Knative's request-based autoscaling (concurrency, QPS) is often not optimal. LLM requests vary wildly in duration depending on prompt and output token count, and what actually matters for user experience is throughput saturation and GPU memory pressure, not just request count.
KEDA enables autoscaling based on custom Prometheus metrics from vLLM, such as:
- Token throughput: Scale when per-replica tokens/second approaches capacity
- GPU KV-cache utilization: Scale when GPU memory pressure increases (long-context requests can fill the cache without generating high token throughput)
- Request queue depth: Scale based on waiting requests
Why not TTFT (Time To First Token)?
It may seem tempting to scale based on TTFT — after all, it directly measures user-perceived latency. However, TTFT can be a reactive metric: when a new replica joins and starts absorbing requests, the average TTFT across all replicas may drop significantly. With AverageValue, the desired replica count is ceil(currentReplicas × currentMetric / targetMetric), so a large metric drop after scale-up can immediately push the desired count back down, creating a feedback loop: TTFT rises -> scale-up -> TTFT drops sharply -> scale-down -> repeat (flapping). Whether this actually happens depends on how much the metric changes when a replica is added, so TTFT requires careful per-model evaluation before using it as a scaling trigger.
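A quick numeric illustration of that loop, using hypothetical numbers (a TTFT target of 2s with AverageValue):
# 1 replica, average TTFT = 4s    ->  desired = ceil(1 * 4 / 2)   = 2  (scale up)
# 2 replicas, TTFT drops to 0.8s  ->  desired = ceil(2 * 0.8 / 2) = 1  (scale back down)
# 1 replica again, TTFT climbs back to ~4s, and the cycle repeats (flapping)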
Token throughput is a safer default: it reflects sustained demand and stays elevated as long as the workload is under pressure, regardless of replica count.
KEDA requires RawDeployment mode
KEDA and Knative Pod Autoscaler (KPA) cannot coexist on the same InferenceService. Knative manages replicas through its own autoscaler, and KEDA manages them through its own HPA — if both are active, they fight over the replica count.
When using KEDA, you must set the InferenceService to RawDeployment mode via the annotation serving.kserve.io/deploymentMode: "RawDeployment". This tells KServe to create a plain Kubernetes Deployment instead of a Knative Revision, leaving autoscaling entirely to KEDA.
KServe native KEDA integration (alternative)
Since KServe 0.16, you can also use the annotation serving.kserve.io/autoscalerClass: "keda" and specify metrics directly in the InferenceService spec.predictor.autoScaling field. KServe then creates the ScaledObject for you. See the upstream docs for details. The manual approach shown below gives you full control over the ScaledObject configuration.
Prerequisites⚓︎
KEDA must be installed in the cluster. Ask your administrator to install it if it is not available:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
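You can check that the KEDA operator is running with:
kubectl get pods -n keda
# expect the operator pods (e.g. keda-operator, keda-operator-metrics-apiserver) to be Running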
Example with vLLM⚓︎
This example deploys a small vLLM model and configures KEDA to scale based on token throughput. It can be run on CPU, but the same approach applies to GPU deployments.
1. Deploy the InferenceService⚓︎
export NAMESPACE="default"
kubectl apply -n ${NAMESPACE} -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: opt-125m
  annotations:
    # RawDeployment mode — required for KEDA.
    # Creates a plain Deployment instead of a Knative Revision.
    serving.kserve.io/deploymentMode: "RawDeployment"
    # Tell KServe not to create its own HPA (KEDA will manage scaling).
    serving.kserve.io/autoscalerClass: "external"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=opt-125m
        - --model_id=facebook/opt-125m
        - --backend=vllm
        - --dtype=float32
        - --max-model-len=512
      # Explicit port declaration is required in RawDeployment mode
      # for PodMonitor to discover the metrics endpoint.
      ports:
        - name: user-port
          containerPort: 8080
          protocol: TCP
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
EOF
Wait for the InferenceService to become ready:
kubectl get isvc opt-125m -n ${NAMESPACE} -w
RawDeployment naming
In RawDeployment mode, KServe creates a Deployment named {isvc-name}-predictor. For example, opt-125m produces a Deployment called opt-125m-predictor. This is the name you reference in the KEDA ScaledObject. (In contrast, Knative Serverless mode creates Deployments with revision suffixes like opt-125m-predictor-00001-deployment.)
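You can confirm the generated Deployment name before pointing KEDA at it:
kubectl get deployments -n ${NAMESPACE}
# expect an entry named opt-125m-predictor; this is the scaleTargetRef used in the ScaledObject below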
2. Verify Prometheus is scraping vLLM metrics⚓︎
PodMonitor already configured
prokube clusters include a cluster-wide PodMonitor (monitoring/vllm) that automatically scrapes vLLM metrics from any KServe InferenceService pod. It selects pods with the serving.kserve.io/inferenceservice label (set by KServe) and scrapes the user-port endpoint. You do not need to create your own PodMonitor.
The cluster-wide PodMonitor also rewrites the model_name label to match the InferenceService name (e.g., opt-125m), so your KEDA queries should use the ISVC name, not the HuggingFace model ID.
Wait ~30 seconds after the pod starts, then verify metrics are available:
# From within the cluster, or via port-forward to Prometheus
curl -s --get \
--data-urlencode "query=vllm:prompt_tokens_total{namespace=\"$NAMESPACE\"}" \
http://kube-prometheus-stack-prometheus.monitoring.svc:9090/prometheus/api/v1/query
# should return {"status":"success"...
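If you are working from outside the cluster, a port-forward to the Prometheus service works too (a sketch; it assumes the same /prometheus route prefix as the in-cluster URL above):
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
curl -s --get \
  --data-urlencode "query=vllm:prompt_tokens_total{namespace=\"$NAMESPACE\"}" \
  http://localhost:9090/prometheus/api/v1/query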
3. Create the KEDA ScaledObject⚓︎
kubectl apply -n ${NAMESPACE} -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opt-125m-scaledobject
spec:
  scaleTargetRef:
    name: opt-125m-predictor # must match the Deployment name
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 15 # how often KEDA checks the metric (seconds)
  cooldownPeriod: 120 # seconds after last trigger activation before scaling to minReplicaCount
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          # Scale up immediately when threshold is breached
          stabilizationWindowSeconds: 0
        scaleDown:
          # Hold extra replicas for 2 minutes after load subsides
          stabilizationWindowSeconds: 120
          policies:
            - type: Pods
              value: 1 # remove at most 1 replica per minute
              periodSeconds: 60
  triggers:
    - type: prometheus
      metricType: AverageValue # per-replica average (see below)
      metadata:
        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
        query: >-
          sum(rate(vllm:prompt_tokens_total{namespace="${NAMESPACE}",model_name="opt-125m"}[2m]))
          + sum(rate(vllm:generation_tokens_total{namespace="${NAMESPACE}",model_name="opt-125m"}[2m]))
        threshold: "5"
EOF
How AverageValue works
KEDA's AverageValue metric type takes the query result — here, a global sum of tokens/second across all pods — and divides it by the current replica count to compute a per-replica average. A threshold of 5 means: "scale up when each replica is handling more than 5 tokens/second on average." As load increases, the global sum grows, pushing the per-replica average above the threshold and triggering scale-up.
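You can watch this computation directly: KEDA creates an HPA for the ScaledObject (typically named keda-hpa-<scaledobject-name>), and its TARGETS column shows the current per-replica average against the threshold:
kubectl get hpa -n ${NAMESPACE} -w
kubectl describe scaledobject opt-125m-scaledobject -n ${NAMESPACE}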
vLLM metric names
vLLM uses colons in metric names (e.g., vllm:prompt_tokens_total). In some PromQL contexts you may need the {__name__="..."} selector syntax, but in KEDA's Prometheus trigger the colon syntax works directly.
Adjust namespace and model name
Replace default and opt-125m in the Prometheus queries with the actual namespace and --model_name (or --served-model-name) of your deployment.
4. Load test and observe scaling⚓︎
Use the load-generator.py script from the examples repository. Follow the Quick Start in that README to download and run it. It has two calibrated presets for the opt-125m model on CPU:
| Mode | Workers | Sleep between requests | Expected throughput | Scaling result |
|---|---|---|---|---|
| stable-2 | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds |
| stable-3 | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds |
Scaling math
With AverageValue and threshold: 5, the HPA computes desiredReplicas = ceil(currentReplicas × currentMetric / threshold), where currentMetric is the per-replica average. Equivalently for a global total: ceil(total_tok_per_sec / 5). So 8 tok/s → 2 replicas, 22 tok/s → 5 (capped at maxReplicas=3).
Importantly, scale-down only triggers when the desired count drops below the current count. For example, with 2 replicas and a target of 5 tok/s/replica, desiredReplicas = ceil(2 × currentMetric / 5) only falls to 1 when currentMetric ≤ 2.5 tok/s/replica (total ≤ 5 tok/s) — roughly half the target. A metric anywhere above that keeps the desired count at 2. Token throughput remains elevated under sustained load, so the replica count stabilizes — this is why a steady, moderate load produces stable scaling rather than oscillation.
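While the load generator is running, you can watch the replica count ramp up (and, once the cooldown and scale-down stabilization window have passed, come back down):
kubectl get deployment opt-125m-predictor -n ${NAMESPACE} -w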
Choosing the Right Metric⚓︎
| Metric | Good for | Watch out |
|---|---|---|
| Token throughput (prompt + generation rate) | General-purpose LLM scaling. Stays elevated under sustained load. | Requires tuning the threshold per model/hardware — see the examples repository for a calibration script. |
| GPU KV-cache utilization | Long-context workloads that fill cache without high throughput. | Not available on CPU deployments. |
| Waiting requests (vllm:num_requests_waiting) | Very responsive, good for latency-sensitive APIs. | Can be noisy with bursty traffic. |
| TTFT (Time To First Token) | Monitoring/alerting. | Requires careful evaluation as a scaling trigger — depending on how much TTFT changes when a replica joins, it can cause flapping (see warning above). |
You can combine multiple triggers in one ScaledObject (OR logic — any trigger can cause scale-up):
triggers:
  # Primary: token throughput
  - type: prometheus
    metricType: AverageValue
    metadata:
      serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
      query: >-
        sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="my-model"}[5m]))
        + sum(rate(vllm:generation_tokens_total{namespace="my-namespace",model_name="my-model"}[5m]))
      threshold: "2500"
  # Secondary: GPU KV-cache pressure
  - type: prometheus
    metricType: AverageValue
    metadata:
      serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
      query: >-
        avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
      threshold: "0.75"
Production Considerations⚓︎
For GPU-based production deployments, consider these additional settings in the ScaledObject spec:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  # ... scaleTargetRef, triggers, etc. ...
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0 # scale up immediately
        scaleDown:
          # Each GPU replica may have a large local model cache.
          # Tearing a node down means a full re-download on next scale-up.
          # Hold extra replicas longer to avoid expensive cold starts.
          stabilizationWindowSeconds: 900 # 15 minutes
          policies:
            - type: Pods
              value: 1
              periodSeconds: 300 # at most 1 replica removed per 5 minutes
Tune the token-throughput threshold using load testing. Identify the per-replica throughput at which latency begins to degrade and use that value as the scale-up trigger.
The optimal value varies depending on the model, hardware, batching efficiency, and request patterns. See the examples repository for a step-by-step calibration workflow you can run against your own deployment.
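As a starting point for that calibration, you can watch the per-pod token throughput during a load test. A sketch of the query (the pod label is added by the PodMonitor scrape; replace the namespace and model name as above):
curl -s --get \
  --data-urlencode 'query=sum by (pod) (rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="my-model"}[5m]) + rate(vllm:generation_tokens_total{namespace="my-namespace",model_name="my-model"}[5m]))' \
  http://kube-prometheus-stack-prometheus.monitoring.svc:9090/prometheus/api/v1/query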
For a complete working example with ready-to-apply manifests, see the examples repository.