KServe

Links to external documentation

→ KServe Documentation

KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases:

  • Provides a performant, standardized inference protocol across ML frameworks.
  • Supports modern serverless inference workloads with autoscaling, including scale to zero on GPU.
  • Provides simple and pluggable production serving for ML workloads, including prediction, pre/post-processing, monitoring, and explainability.
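The standardized inference protocol mentioned above is KServe's V2 (Open Inference) protocol, where clients POST a JSON body to `/v2/models/<name>/infer`. As a rough sketch, the request body looks like this; the input name and tensor values are illustrative placeholders, not values any particular model expects:

```python
import json

# Minimal V2 (Open Inference) protocol request body. The input name,
# shape, and data are placeholders for illustration.
def build_v2_infer_request(input_name, data):
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(data), len(data[0])],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }

payload = build_v2_infer_request("input-0", [[6.8, 2.8, 4.8, 1.4]])
print(json.dumps(payload))
```

The response mirrors this structure with an `outputs` list, so the same client code can talk to any framework served through KServe.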

Example Usages

Learn how to deploy and access models with KServe, including scikit-learn, PyTorch with Triton, serving from S3 buckets, and MLflow integration. See Example Usages for step-by-step guides.
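A deployment boils down to applying an InferenceService manifest. As a sketch, here is the equivalent of that YAML built as a Python dict for a scikit-learn model; the name and `storageUri` are hypothetical placeholders:

```python
# Sketch of an InferenceService manifest (the dict mirrors the YAML you
# would `kubectl apply`). The name and storageUri are placeholders.
def sklearn_inference_service(name, storage_uri):
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "sklearn": {"storageUri": storage_uri},
            }
        },
    }

isvc = sklearn_inference_service(
    "sklearn-iris", "gs://example-bucket/models/sklearn/iris"
)
```

Swapping the predictor key (e.g. `sklearn` for another supported runtime) and the `storageUri` (S3, GCS, etc.) is what the step-by-step guides walk through in detail.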

Autoscaling

KServe supports multiple autoscaling strategies including the Knative Pod Autoscaler (KPA) and KEDA for LLM workloads. Learn how to configure concurrency-based, QPS-based, and custom metric-based scaling. Visit Autoscaling for details.
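Concurrency-based scaling with the Knative Pod Autoscaler is configured through annotations on the InferenceService. A minimal sketch, with an illustrative target value (not a recommended default):

```python
# Sketch: concurrency-based autoscaling via Knative annotations.
# The target value (5 in-flight requests per pod) is illustrative.
def with_concurrency_target(manifest, target):
    annotations = manifest.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["autoscaling.knative.dev/metric"] = "concurrency"
    annotations["autoscaling.knative.dev/target"] = str(target)
    return manifest

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {"predictor": {"sklearn": {"storageUri": "gs://example-bucket/model"}}},
}
isvc = with_concurrency_target(isvc, 5)
```

QPS-based scaling swaps the metric annotation (`rps` instead of `concurrency`); custom-metric scaling for LLM workloads goes through KEDA instead of KPA.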

Version Matching

Understand how to match library versions between your training environment and KServe's serving runtime to avoid compatibility issues. See Version Matching.
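One way to catch train/serve skew early is a simple version check before deploying. The sketch below compares the training-time library version against the version an assumed serving runtime ships; matching on major.minor is a heuristic policy here, not a KServe guarantee:

```python
# Sketch: guard against train/serve version skew. Assumes plain
# "major.minor.patch" version strings; the minor-version policy is a
# heuristic, not a KServe guarantee.
def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def matches_minor(training_version, runtime_version):
    # Serialized models (e.g. pickled scikit-learn estimators) often
    # break across minor releases, so require major.minor to agree.
    return version_tuple(training_version)[:2] == version_tuple(runtime_version)[:2]

print(matches_minor("1.3.2", "1.3.0"))  # True
print(matches_minor("1.3.2", "1.4.0"))  # False
```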


FAQ

If you encounter issues with InferenceServices, authentication, or model scheduling, refer to the Model Serving FAQ for solutions and troubleshooting steps.