# Katib - Hyperparameter Tuning
Katib is a Kubernetes-native tool for hyperparameter tuning and neural architecture search (NAS). It automates the search for optimal model configurations, running multiple training trials in parallel and tracking results.
## What Katib Does
- Hyperparameter tuning: Find optimal values for learning rate, batch size, regularization, number of layers, and other training parameters
- Neural architecture search: Automatically explore network architectures to find the best structure for your problem
- Early stopping: Terminate underperforming trials early to save compute resources
## What Katib Does NOT Do
Katib focuses specifically on hyperparameter optimization. It does not provide:
- Automatic feature engineering or feature selection
- Automatic model selection (choosing between random forests, neural networks, etc.)
- Data preprocessing or cleaning
- End-to-end "push button" AutoML
You bring your own training code and define which parameters to tune - Katib handles the search.
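To illustrate, a minimal Experiment manifest could look like the following sketch. The metric name, parameter names, and ranges are placeholders, and the `trialTemplate` (which wraps your training container) is elided:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: example-tuning        # hypothetical name
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy   # must match a metric your training code reports
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "256"
  trialTemplate:
    ...   # your training job template goes here
```

Katib substitutes concrete parameter values into each trial and collects the reported objective metric to decide which configurations to try next.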
## Supported Algorithms
Katib supports many search algorithms including:
- Bayesian optimization - Uses probabilistic models to guide the search
- Tree of Parzen Estimators (TPE) - Sequential model-based optimization
- Random Search - Simple but effective baseline
- Hyperband - Efficient early stopping strategy
- CMA-ES - Covariance Matrix Adaptation Evolution Strategy for continuous parameters
- ENAS / DARTS - Neural architecture search algorithms
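The search algorithm is selected by name in the Experiment spec. As a sketch (the available `algorithmSettings` keys depend on the chosen algorithm; `random_state` is just an illustrative setting here):

```yaml
spec:
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: random_state
        value: "10"
```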
## Optuna Integration
Katib has its own suggestion services for each algorithm, but if you're already familiar with Optuna, Katib can use it as an optional backend. This lets you leverage Optuna's algorithms and pruning strategies while benefiting from Kubernetes-native execution - your experiments run in parallel across the cluster with automatic resource management and tracking.
## Framework Support
Katib is framework-agnostic and works with any ML library: PyTorch, TensorFlow, XGBoost, scikit-learn, or custom training code in any language.
## FAQ
### My Katib Trials are very slow
Have a look at the resource settings of the pods created by Katib. These are set in the Experiment manifest, or via the Python API when creating the Experiment. See the example Experiment and check the `resources` section of the `trialSpec` template; the example CPU limit of 0.5 is probably not what you want. When using the Katib Python SDK, set the keyword argument `resources_per_trial` (see the Katib documentation for details).
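For example, the training container in the `trialSpec` could be given more generous resources than the example's 0.5 CPU limit. The values below are illustrative; choose what your training job actually needs:

```yaml
trialSpec:
  ...
  containers:
    - name: training-container
      ...
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
```

With the Python SDK, the rough equivalent is passing `resources_per_trial={"cpu": "2", "memory": "4Gi"}` to the tuning call; check the Katib SDK documentation for the exact signature.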
### The Katib Experiment controller keeps dying and no new trials are started
The Katib Experiment suggestion controller gets killed by the OOM killer (you will see a message like `OOMKilled: Out of memory`) because it claims more memory than its limit. The controller is then restarted by Kubernetes, but the new controller will also get killed by the OOM killer. Admins can increase the default limits of new controllers (the memory limit is 100Mi by default) by editing the ConfigMap `katib-config` in the `kubeflow` namespace. See the following example:
```json
"bayesianoptimization": {
  "image": "docker.io/kubeflowkatib/suggestion-skopt:latest",
  "resources": {
    "limits": {
      "memory": "300Mi",
      "cpu": "10"
    },
    "requests": {
      "memory": "300Mi",
      "cpu": "500m"
    }
  }
},
```
Note that especially the `bayesianoptimization` suggestion controller can also use quite a lot of CPU to calculate new suggestions. The very low default limit might lead to long wait times after a trial finishes before the parameters for the next trial are ready.
As a workaround, you can also edit the Deployment of the suggestion controller (e.g. with `kubectl edit`) and increase the resources in its pod template. The Deployment then restarts the suggestion controller with the new resources.
### I'm getting errors related to shared memory
Modern applications such as PyTorch utilize shared memory to allow multiple processes to access the same physical memory area. This is particularly useful in PyTorch when training neural networks with multiple workers in the DataLoader. However, in Kubernetes pods, including those running Katib trials, the default shared memory allocation is only 64MB, which can be quickly exhausted. This might lead to errors like:
```
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory [shm]
```
To address this issue, increase the shared memory allocation for your Katib jobs by mounting a memory-backed `emptyDir` volume at `/dev/shm` in your Experiment manifest, as follows:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
...
spec:
  ...
  trialSpec:
    ...
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "512Mi" # optional
```