Pipelines⚓︎
Links to external documentation
→ Kubeflow Pipelines Documentation
Motivation⚓︎
Kubeflow Pipelines (KFP) is a platform for building and deploying portable, scalable, and reproducible (machine learning) workflows based on Docker containers. KFP allows the realization of complex workflows including parallel execution of workloads. Using pipelines ensures reproducibility (by keeping records of the pipeline run's details) and efficiency (by only blocking resources when they are actually needed).
Kubeflow pipelines can be authored and started from Jupyter notebooks, making it easy to get started, but they can also be defined and launched from any environment that can interact with the Kubernetes API.
Overview⚓︎
Hint
Version 1 of Kubeflow Pipelines is deprecated and will be removed in the future. Version 2 is the current version and is the only one supported by prokube.
With Kubeflow Pipelines you can build components from your own Python functions or from existing containerized applications. They are the single steps of execution in a pipeline. Pipelines are directed acyclic graphs (DAG) of components. Each component of the pipeline is run in a separate container on the Kubernetes cluster.
Managing pipelines in the Kubeflow central dashboard⚓︎
- Experiments is a workspace where you can try different configurations of your pipelines.
- Pipelines is where you can create, edit, and run pipelines.
- Runs are the single executions of a pipeline.
- Recurring runs are scheduled runs of a pipeline (realized as Jobs in Kubernetes).
- Executions are the individual component executions of a pipeline run.
- Artifacts are the outputs of pipeline runs.
Components⚓︎
Docker Images⚓︎
Components are always executed as Kubernetes pods, which in turn run Docker containers. If the component is based on a custom-built Docker image, you can build it using the pk-podman-shell from within your notebook.
Lightweight Components⚓︎
Lightweight components are components created directly from Python functions and are the easiest to create.
But they require the Python function to be self-contained, meaning that all imports and symbols must be declared/defined within the function body.
External dependencies can be added as Python packages in the component definition (via the `packages_to_install` argument). However, these packages are reinstalled into the base image with each execution of the component, which can lead to longer execution times.
Containerized Python Components⚓︎
Containerized Python components offer deeper control over the container but also introduce additional complexity.
Containerized Python components are created from Python functions and can depend on symbols defined outside of the function, imports outside of the function, code in adjacent Python modules, etc. They are run in a container with a custom-built base image. These components require building custom images (see pk-podman-shell) and lead to less reproducibility in the pipeline run compared to the other types.
Lightweight components also use containerized Python scripts under the hood, so don't get confused by the naming.
Container Components⚓︎
Container components let you use any container image. This makes it possible to author components that execute shell scripts, use other languages or binaries, etc., all from within a KFP pipeline. Custom images can be built using the pk-podman-shell.
Passing Parameters between Components⚓︎
You can use component inputs and outputs to pass small amounts of data between components. For lightweight components, this is done by defining the function signature. For containerized components, this is done by defining the component interface in the component definition.
Only a few data types are supported for passing parameters between components, such as string, integer, float, bool, lists, and dictionaries. For more complex data types, you can use artifacts, see the next section.
Input and Output of External Data⚓︎
For input and output of arbitrary data, you can use artifacts. There is a special kind of component, the Importer Component, that helps you read existing data into your pipeline. Artifacts can be used as inputs and outputs of components.
Pipelines⚓︎
Composing a Pipeline⚓︎
To build a pipeline, you start by composing your components into a pipeline function.
Component Execution Order⚓︎
The order of component execution in a pipeline is mostly defined automatically.
By default, all components in a pipeline are executed in parallel. If a pipeline component receives the output of another component as input, the component is executed after the other component has finished. If you want to further control the execution order, you can use the after parameter in the component definition.
Store Pipeline Output Data⚓︎
Artifacts are the mechanism used for ML artifact outputs, such as datasets, models, metrics, etc.
Running Pipelines⚓︎
To run your pipeline, continue by compiling it into a YAML file. You can then run the pipeline in different ways:
- Run on your local machine
- In the cluster, from the Kubeflow central dashboard
- In the cluster, from a notebook
- In the cluster, from the command line
- In the cluster, as recurring runs
Hint
Each execution of a component creates one or more pods. Finished pods are
only cleared after 24 hours. Kubeflow profiles are limited to 100 pods by
default and finished pipeline pods also count against this limit. If you run
into this limit, you can delete the pods of finished pipeline components
with the pkadmin command line tool. See the
FAQ for more information.
Examples⚓︎
Check out prokube's example repository and the Pipelines Cookbook for many different use cases and components.
FAQ⚓︎
My components do not run anymore - Delete old pods⚓︎
Running Kubeflow pipelines can create a large number of pods in the cluster. By default, the number of pods in each namespace is limited to 100 for performance reasons; completed pods count against that limit and are only cleaned up after 24 hours.
The pipelines subcommand provides a way to clean up completed and failed pods that are no longer needed before they are cleaned up automatically.
pkadmin pipelines COMMAND [args]
| COMMAND | Syntax | Description |
|---|---|---|
| clean-pods | `pkadmin pipelines clean-pods -n/--namespace <namespace> [-t/--timelimit] [-y/--yes]` | Delete all old inactive Kubeflow Pipelines pods in a namespace |
I want to start pipelines from my local machine⚓︎
Moved to KFP Cookbook.
My Pipeline (or component) doesn't run and there are no logs for the failing component⚓︎
Check the status of the pod with kubectl. Possible error causes are: not enough resources to schedule the pod, the image cannot be pulled, referenced environment variables do not exist, etc.
You can for example get a list of all pods in the namespace of your pipeline
with kubectl get pods -n <namespace> and should then see a pod with a status
that is neither Running nor Completed, but for example CreateContainerConfigError.
See the Kubernetes section for more information.
I want my pipeline components to run on specific nodes⚓︎
By default, pipeline pods are scheduled on any node in the cluster. You can restrict this by adding a node selector constraint in your pipeline code. Effectively, this runs the component's pod only on nodes that fulfill the specified criteria. This might be useful if, for example, your cluster nodes are equipped with different GPUs.
This is a short example of how to set it for a specific task in a pipeline. In this case using the hostname label to schedule the pod to a specific node:
```python
from kfp import dsl
from kfp.compiler import Compiler

@dsl.component
def hello_op(name: str) -> str:
    return f'Hello, {name}!'

@dsl.pipeline
def hello_pipeline(name: str = 'Kubeflow'):
    task = hello_op(name=name)  # KFP v2 components must be called with keyword arguments
    task.add_node_selector_constraint('kubernetes.io/hostname', 'node4')  # change as needed

if __name__ == '__main__':
    Compiler().compile(hello_pipeline, 'hello_pipeline.yaml')
```
For more information, see the Kubeflow Pipelines SDK Documentation.