KServe⚓︎
KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases:
- Provides a performant, standardized inference protocol across ML frameworks.
- Supports modern serverless inference workloads with autoscaling, including scale-to-zero on GPUs.
- Provides simple, pluggable production serving for ML workloads, including prediction, pre/post-processing, monitoring, and explainability.
Example Usages⚓︎
Learn how to deploy and access models with KServe, including scikit-learn, PyTorch with Triton, serving from S3 buckets, and MLflow integration. See Example Usages for step-by-step guides.
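As a quick orientation before following those guides, a deployment usually boils down to applying a single `InferenceService` resource. The sketch below is a minimal scikit-learn example; the `storageUri` points at KServe's public sample model and would be replaced with your own model location (e.g. an S3 bucket):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                 # serving runtime is selected from the model format
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Applying this with `kubectl apply -f` creates the service; the guides linked above cover the framework-specific variations.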
Autoscaling⚓︎
KServe supports multiple autoscaling strategies including the Knative Pod Autoscaler (KPA) and KEDA for LLM workloads. Learn how to configure concurrency-based, QPS-based, and custom metric-based scaling. Visit Autoscaling for details.
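As a sketch of what concurrency-based scaling looks like on an `InferenceService` (the model name and `storageUri` here are illustrative placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0            # allow scale to zero when idle
    scaleMetric: concurrency  # scale on in-flight requests per pod
    scaleTarget: 2            # target concurrent requests per pod
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://your-bucket/models/sklearn/model  # placeholder
```

With this configuration the autoscaler adds pods when average in-flight requests per pod exceed the target, and removes all pods after a period of no traffic.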
Version Matching⚓︎
Understand how to match library versions between your training environment and KServe's serving runtime to avoid compatibility issues. See Version Matching.
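One practical way to catch mismatches early is to snapshot library versions at training time and compare them inside the serving container. The helpers below are hypothetical (not part of the KServe SDK), using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version

def snapshot_versions(packages):
    """Return {package: installed_version} for each named package,
    with None for packages not installed in this environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions

def check_versions(expected):
    """Compare a recorded snapshot against the current environment.
    Returns {package: (expected, actual)} for every mismatch."""
    current = snapshot_versions(expected)
    return {name: (expected[name], current[name])
            for name in expected
            if expected[name] != current[name]}

# At training time:  snapshot = snapshot_versions(["scikit-learn", "joblib"])
# Ship the snapshot with the model, then at serving startup:
#                    mismatches = check_versions(snapshot)
# An empty dict means the runtime matches the training environment.
```

Running the check at container startup turns a subtle deserialization failure into an explicit, actionable error message.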
FAQ⚓︎
If you encounter issues with InferenceServices, authentication, or model scheduling, refer to the Model Serving FAQ for solutions and troubleshooting steps.