KServe

Links to external documentation

→ KServe Documentation

KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases:

  • Provides a performant, standardized inference protocol across ML frameworks.
  • Supports modern serverless inference workloads with autoscaling, including scale to zero on GPU.
  • Provides simple and pluggable production serving for ML workloads, including prediction, pre/post-processing, monitoring, and explainability.
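The standardized inference protocol mentioned above is KServe's V2 (Open Inference) protocol, where clients POST a JSON body to `/v2/models/<name>/infer`. As a rough sketch, the request body looks like this; the input name and tensor values are illustrative placeholders, not values any particular model expects:

```python
import json

# Minimal V2 (Open Inference) protocol request body. The input name,
# shape, and data are placeholders for illustration.
def build_v2_infer_request(input_name, data):
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(data), len(data[0])],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }

payload = build_v2_infer_request("input-0", [[6.8, 2.8, 4.8, 1.4]])
print(json.dumps(payload))
```

The response mirrors this structure with an `outputs` list, so the same client code can talk to any framework served through KServe.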

Example Usages

Learn how to deploy and access models with KServe, including scikit-learn, PyTorch with Triton, serving from S3 buckets, and MLflow integration. See Example Usages for step-by-step guides.
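A deployment boils down to applying an InferenceService manifest. As a sketch, here is the equivalent of that YAML built as a Python dict for a scikit-learn model; the name and `storageUri` are hypothetical placeholders:

```python
# Sketch of an InferenceService manifest (the dict mirrors the YAML you
# would `kubectl apply`). The name and storageUri are placeholders.
def sklearn_inference_service(name, storage_uri):
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "sklearn": {"storageUri": storage_uri},
            }
        },
    }

isvc = sklearn_inference_service(
    "sklearn-iris", "gs://example-bucket/models/sklearn/iris"
)
```

Swapping the predictor key (e.g. `sklearn` for another supported runtime) and the `storageUri` (S3, GCS, etc.) is what the step-by-step guides walk through in detail.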

Autoscaling

KServe supports multiple autoscaling strategies including the Knative Pod Autoscaler (KPA) and KEDA for LLM workloads. Learn how to configure concurrency-based, QPS-based, and custom metric-based scaling. Visit Autoscaling for details.
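Concurrency-based scaling with the Knative Pod Autoscaler is configured through annotations on the InferenceService. A minimal sketch, with an illustrative target value (not a recommended default):

```python
# Sketch: concurrency-based autoscaling via Knative annotations.
# The target value (5 in-flight requests per pod) is illustrative.
def with_concurrency_target(manifest, target):
    annotations = manifest.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["autoscaling.knative.dev/metric"] = "concurrency"
    annotations["autoscaling.knative.dev/target"] = str(target)
    return manifest

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {"predictor": {"sklearn": {"storageUri": "gs://example-bucket/model"}}},
}
isvc = with_concurrency_target(isvc, 5)
```

QPS-based scaling swaps the metric annotation (`rps` instead of `concurrency`); custom-metric scaling for LLM workloads goes through KEDA instead of KPA.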

Version Matching

Understand how to match library versions between your training environment and KServe's serving runtime to avoid compatibility issues. See Version Matching.
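One way to catch train/serve skew early is a simple version check before deploying. The sketch below compares the training-time library version against the version an assumed serving runtime ships; matching on major.minor is a heuristic policy here, not a KServe guarantee:

```python
# Sketch: guard against train/serve version skew. Assumes plain
# "major.minor.patch" version strings; the minor-version policy is a
# heuristic, not a KServe guarantee.
def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def matches_minor(training_version, runtime_version):
    # Serialized models (e.g. pickled scikit-learn estimators) often
    # break across minor releases, so require major.minor to agree.
    return version_tuple(training_version)[:2] == version_tuple(runtime_version)[:2]

print(matches_minor("1.3.2", "1.3.0"))  # True
print(matches_minor("1.3.2", "1.4.0"))  # False
```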


FAQ

If you encounter issues with InferenceServices, authentication, or model scheduling, refer to the Model Serving FAQ for solutions and troubleshooting steps.