Dask⚓︎
Links to external documentation
Dask is a flexible parallel computing library for analytics which strives to be easy to use and efficient. Dask can scale on kubernetes with the help of the Dask Operator.
Using Dask from a notebook⚓︎
Dask can be used from a Jupyter notebook by creating a Dask cluster and connecting to it. See our Dask example notebook for a minimal example.
Note that Dask is sensitive to the versions of the libraries you use, as it pickles the functions you send to the workers. This means that you should make sure that the versions of the libraries you use are the same on the workers as on the client.
If you see a warning like this when creating a cluster:
+---------+----------------+-----------------+---------+
| Package | Client | Scheduler | Workers |
+---------+----------------+-----------------+---------+
| lz4 | 4.3.2 | 4.3.3 | None |
| msgpack | 1.0.6 | 1.1.0 | None |
| python | 3.11.6.final.0 | 3.11.10.final.0 | None |
| tornado | 6.3.3 | 6.4.1 | None |
+---------+----------------+-----------------+---------+
do make sure that the versions of the libraries are the same on the client and
the workers. You can do this by installing the libraries on the workers with
pip install <library>==<version>.
A known working combination is:
- using the following image for a notebook:
kubeflownotebookswg/jupyter-scipy:v1.9.2 - using the following image for the workers:
daskdev/dask:2024.9.1-py3.11 - and the following versions of the python libraries:
dask-kubernetes==2024.5.0pandas==2.2.3numpy==2.1.1dask==2024.9.1scipy==1.15.1
FAQ⚓︎
How can I use Dask behind a proxy⚓︎
To use Dask behind a proxy, especially within a KubeCluster, proxy configuration via the environment variables is essential when workers need external internet access. Configure http_proxy, https_proxy, and no_proxy environment variables using the env parameter for KubeCluster, explicitly including the Kubernetes API server IP to prevent routing internal cluster communication through the proxy. Be aware that CIDR notation in no_proxy may not be reliably supported; use explicit IPs or domain names instead. See our Dask proxy example notebook for a practical guide, and remember to also configure proxy settings for the notebook instance itself if it requires external access, such as for pip. Also, be mindful of potential Dask client and cluster version mismatches, which can be resolved through careful package management and image selection.