# Katib - Hyperparameter Tuning
Katib is a Kubernetes-native tool for hyperparameter tuning and neural architecture search (NAS). It automates the search for optimal model configurations, running multiple training trials in parallel and tracking results.
## What Katib Does
- Hyperparameter tuning: Find optimal values for learning rate, batch size, regularization, number of layers, and other training parameters
- Neural architecture search: Automatically explore network architectures to find the best structure for your problem
- Early stopping: Terminate underperforming trials early to save compute resources
## What Katib Does NOT Do
Katib focuses specifically on hyperparameter optimization. It does not provide:
- Automatic feature engineering or feature selection
- Automatic model selection (choosing between random forests, neural networks, etc.)
- Data preprocessing or cleaning
- End-to-end "push button" AutoML
You bring your own training code and define which parameters to tune - Katib handles the search.
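To illustrate, a minimal Experiment manifest could look like the following sketch. The metric name, parameter names, and ranges are placeholders, and the `trialTemplate` (which wraps your training container) is elided:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: example-tuning        # hypothetical name
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy   # must match a metric your training code reports
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "256"
  trialTemplate:
    ...   # your training job template goes here
```

Katib substitutes concrete parameter values into each trial and collects the reported objective metric to decide which configurations to try next.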
## Supported Algorithms
Katib supports many search algorithms including:
- Bayesian optimization - Uses probabilistic models to guide the search
- Tree of Parzen Estimators (TPE) - Sequential model-based optimization
- Random Search - Simple but effective baseline
- Hyperband - Efficient early stopping strategy
- CMA-ES - Covariance Matrix Adaptation Evolution Strategy for continuous parameters
- ENAS / DARTS - Neural architecture search algorithms
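The search algorithm is selected by name in the Experiment spec. As a sketch (the available `algorithmSettings` keys depend on the chosen algorithm; `random_state` is just an illustrative setting here):

```yaml
spec:
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: random_state
        value: "10"
```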
## Optuna Integration
Katib has its own suggestion services for each algorithm, but if you're already familiar with Optuna, Katib can use it as an optional backend. This lets you leverage Optuna's algorithms and pruning strategies while benefiting from Kubernetes-native execution - your experiments run in parallel across the cluster with automatic resource management and tracking.
## Framework Support
Katib is framework-agnostic and works with any ML library: PyTorch, TensorFlow, XGBoost, scikit-learn, or custom training code in any language.
## FAQ
### My Katib Trials are very slow
Have a look at the resource settings of the pods created by Katib. These are set in the Experiment manifest, or via the Python API when creating the Experiment. See the example Experiment and check the `resources` section of the `trialSpec` template; the example CPU limit of 0.5 is probably not what you want. When using the Katib Python SDK, set the keyword argument `resources_per_trial` (see the Katib documentation for details).
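For example, the training container in the `trialSpec` could be given more generous resources than the example's 0.5 CPU limit. The values below are illustrative; choose what your training job actually needs:

```yaml
trialSpec:
  ...
  containers:
    - name: training-container
      ...
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
```

With the Python SDK, the rough equivalent is passing `resources_per_trial={"cpu": "2", "memory": "4Gi"}` to the tuning call; check the Katib SDK documentation for the exact signature.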
### The Katib Experiment controller keeps dying and no new trials are started
The Katib Experiment suggestion controller gets killed by the OOM killer (you will see a message like `OOMKilled: Out of memory`) because it claims more memory than its limit. The controller is then restarted by Kubernetes, but the new controller will also get killed by the OOM killer. Admins can increase the default limits of new controllers (the memory limit is 100Mi by default) by editing the ConfigMap `katib-config` in the `kubeflow` namespace. See the following example:
```json
"bayesianoptimization": {
  "image": "docker.io/kubeflowkatib/suggestion-skopt:latest",
  "resources": {
    "limits": {
      "memory": "300Mi",
      "cpu": "10"
    },
    "requests": {
      "memory": "300Mi",
      "cpu": "500m"
    }
  }
},
```
Note that especially the `bayesianoptimization` suggestion controller can also use quite a lot of CPU to calculate new suggestions. The very low default limit might lead to long wait times after a trial finishes before the parameters for the next trial are ready.
As a workaround, you can also edit the Deployment of the suggestion controller (e.g. with `kubectl edit`) and increase the resources in its pod template. The Deployment then restarts the suggestion controller with the new resources.
### I'm getting errors related to shared memory
Modern applications such as PyTorch utilize shared memory to allow multiple processes to access the same physical memory area. This is particularly useful in PyTorch when training neural networks with multiple workers in the DataLoader. However, in Kubernetes pods, including those running Katib trials, the default shared memory allocation is only 64MB, which can be quickly exhausted. This might lead to errors like:
```
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory [shm]
```
To address this issue, increase the shared memory allocation for your Katib jobs by mounting a memory-backed `emptyDir` volume at `/dev/shm` in your Experiment manifest, as follows:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
...
spec:
  ...
  trialSpec:
    ...
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "512Mi" # optional
```