
MinIO⚓︎

Links to external documentation

→ MinIO Documentation

To store your data on the cluster, we recommend using prokube's MinIO instance. MinIO is an S3-compatible object storage server that can be used to store any kind of data.

Within the cluster, prokube has preconfigured most technical details, so that you can immediately start using MinIO in your Kubeflow Notebooks and Pipelines. Some examples are listed here.

You can view, upload, and download data with the MinIO web UI, which can be accessed by clicking the "MinIO" tab in the Kubeflow web UI.

The S3-compatible MinIO API allows you to programmatically access data from many different libraries, languages, and applications. Examples are shown below.

Access Examples - Python⚓︎

Access from a Kubeflow Notebook or Kubeflow Pipelines⚓︎

Set up a connection and access a file with Python:

import s3fs

s3 = s3fs.S3FileSystem()

with s3.open('<bucketname>/<objectname>', 'rb') as f:
    print(f.read())

Read pandas DataFrames from MinIO⚓︎

Example to read a parquet file from MinIO with pandas:

import pandas as pd

FILE_PATH = 's3://<bucketname>/path/to/file.parquet'

df = pd.read_parquet(FILE_PATH)

Example to read a csv file from MinIO with pandas:

import pandas as pd

FILE_PATH = 's3://<bucketname>/path/to/file.csv'

df = pd.read_csv(FILE_PATH)
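Under the hood, pandas hands `s3://` URLs to s3fs (via fsspec); the bucket name and object key are simply embedded in the URL. As a quick illustration (standard library only, with a placeholder path), this is how such a URL decomposes:

```python
from urllib.parse import urlparse

# Placeholder URL for illustration; not a real object on the cluster.
url = urlparse("s3://my-bucket/path/to/file.parquet")

bucket = url.netloc         # everything between "s3://" and the first "/"
key = url.path.lstrip("/")  # the object key inside the bucket

print(bucket)  # my-bucket
print(key)     # path/to/file.parquet
```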

Access Examples - R⚓︎

Access from an RStudio Notebook or Kubeflow Pipelines⚓︎

Set up a connection and read a csv file with aws.s3 and readr:

library(aws.s3)
library(readr)

# ⚠️ WARNING: Only use `use_https = FALSE` for the preconfigured internal MinIO endpoint, as cluster traffic is encrypted with mTLS.
# For external S3/MinIO endpoints, always use HTTPS (omit this argument or set `use_https = TRUE`, which is the default).
# Never use `use_https = FALSE` for traffic to remote/external endpoints.
bucketlist(use_https = FALSE, region = "")

bucket <- "<bucket-name>"
key    <- "/path/to/file.csv"

# Fetch object as raw vector
obj <- get_object(object = key, bucket = bucket, region = "", use_https = FALSE)
df <- read_csv(rawToChar(obj))

Access MinIO from outside the cluster (e.g. your local machine)⚓︎

The MinIO API is exposed to the outside via the S3 protocol. For this, we use a separate domain, usually something like https://minio.<your-domain>.

Before you can access data within MinIO, you need to configure some technical details. Every library implementing the S3 protocol needs at least three parameters: AWS_ENDPOINT_URL, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. S3 was originally introduced by AWS, hence the naming here. The storage provided by prokube has nothing to do with AWS; we provide an open-source implementation of the same protocol.

To create the access and secret key, you can log in to the MinIO Web UI (https://<your-domain>/minio/). Navigate to Access Keys and create a new access key. For most users, the default values are sufficient, so you can leave them as is and hit Create to obtain your personal token. The new access key and secret key will be shown in a pop-up. Don't forget to copy the secret key and store it in your password manager, as you won't be able to see it again later.

To populate the secret access key without hard-coding it in your notebook, we recommend using the Python module getpass:

import getpass

AWS_SECRET_ACCESS_KEY = getpass.getpass("Enter your AWS Secret Access Key: ")
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID_HERE"
AWS_ENDPOINT_URL = "https://minio.<your-domain>.tld"
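If you prefer not to be prompted interactively (e.g. in scripts), the same three values can also be read from environment variables. A small sketch; the helper name is ours, not part of any library:

```python
import getpass
import os

def credentials_from_env(prompt_for_secret=True):
    """Read the three S3 parameters from environment variables.

    Helper defined here for illustration; not part of any library.
    Falls back to an interactive prompt for the secret key if it is
    not set and `prompt_for_secret` is True.
    """
    creds = {
        "AWS_ACCESS_KEY_ID": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "AWS_SECRET_ACCESS_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
        "AWS_ENDPOINT_URL": os.environ.get("AWS_ENDPOINT_URL", ""),
    }
    if prompt_for_secret and not creds["AWS_SECRET_ACCESS_KEY"]:
        creds["AWS_SECRET_ACCESS_KEY"] = getpass.getpass("Enter your AWS Secret Access Key: ")
    return creds
```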

Configure s3fs⚓︎

import s3fs

s3 = s3fs.S3FileSystem(
    key=AWS_ACCESS_KEY_ID, secret=AWS_SECRET_ACCESS_KEY, endpoint_url=AWS_ENDPOINT_URL
)

with s3.open('<bucketname>/<objectname>', 'rb') as f:
    print(f.read())

Configure pandas⚓︎

import pandas as pd

FILE_PATH = "s3://<bucketname>/path/to/file.parquet"

storage_options = {
    "client_kwargs": {"endpoint_url": AWS_ENDPOINT_URL},
    "key": AWS_ACCESS_KEY_ID,
    "secret": AWS_SECRET_ACCESS_KEY,
}

df = pd.read_parquet(FILE_PATH, storage_options=storage_options)
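Note that s3fs and pandas consume the same three values in slightly different shapes: s3fs takes `endpoint_url` directly, while pandas nests it under `client_kwargs`. If you use both, a small helper of our own (not part of either library) can derive both configurations from one place:

```python
def s3_config(access_key_id, secret_access_key, endpoint_url):
    """Return (s3fs kwargs, pandas storage_options) built from the same values.

    Helper defined here for illustration; not part of s3fs or pandas.
    """
    s3fs_kwargs = {
        "key": access_key_id,
        "secret": secret_access_key,
        "endpoint_url": endpoint_url,
    }
    storage_options = {
        "key": access_key_id,
        "secret": secret_access_key,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }
    return s3fs_kwargs, storage_options

# Placeholder values for illustration only:
kwargs, opts = s3_config("MYKEY", "MYSECRET", "https://minio.example.com")
```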

FAQ⚓︎

My file is too large to upload to MinIO using the web UI⚓︎

The maximum file size that can be uploaded to MinIO via the web UI is limited by your organization or cluster administrator. If you need to upload larger files, use the MinIO API instead, either with Python or the command-line tool mc.

Using the MinIO CLI (mc), assuming you have already configured an alias for the cluster's MinIO endpoint:

mc cp <src-file-path> <your-minio-alias>/<bucket-name>/<path>

If this approach does not work, please contact your administrator to check the configured file size limits.

I want to use DuckDB to access my files on MinIO⚓︎

You can read files directly from MinIO using DuckDB. See this Python example:

import duckdb

conn = duckdb.connect()

conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")

conn.execute("SET s3_url_style='path';")
# Only for the preconfigured internal endpoint (cluster traffic is encrypted
# with mTLS); for external endpoints, keep SSL enabled.
conn.execute("SET s3_use_ssl = false;")

conn.execute("SET s3_endpoint='minio.minio'")

path = 's3://bucket/folder/subfolder/file.parquet'
conn.execute(f"CREATE TABLE data AS SELECT * FROM read_parquet('{path}')")
conn.sql("SELECT * FROM data LIMIT 10")

You should not need to set the access key and secret access key manually. You can check whether DuckDB finds the right credentials by executing conn.execute('CALL load_aws_credentials();').fetchall() (see the DuckDB docs).

Otherwise, you can also set them by hand like this:

import os

[...]

conn.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'")
conn.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'")
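If you do this in several notebooks, a tiny helper of our own (not part of the duckdb package) can render those SET statements from whatever credentials are present in the environment:

```python
import os

def duckdb_s3_credential_statements(env=None):
    """Render DuckDB SET statements for any S3 credentials found in `env`.

    Helper defined here for illustration; not part of the duckdb package.
    Pass the returned statements to conn.execute() one by one.
    """
    env = os.environ if env is None else env
    mapping = {
        "s3_access_key_id": "AWS_ACCESS_KEY_ID",
        "s3_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    }
    return [
        f"SET {setting}='{env[var]}'"
        for setting, var in mapping.items()
        if var in env
    ]
```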

My pipeline/Katib/PyTorchJob... does not recognize MinIO credentials and throws errors⚓︎

The MinIO credentials are automatically injected into KServe inference services and notebooks in the user's namespace. However, they are not automatically available in other Pods, such as those used by Katib experiments, pipeline tasks, or training operators. In such cases, the necessary environment variables can be injected into the Pods from the s3creds secret provided in the user's namespace.

For simple Pods, as well as for Katib experiments and PyTorch training jobs, this section can be added to the manifests at the container specification level:

env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: s3creds
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: s3creds
        key: AWS_SECRET_ACCESS_KEY
  - name: S3_ENDPOINT
    value: "minio.minio"
  - name: S3_USE_HTTPS
    value: "0"
  - name: S3_VERIFY_SSL
    value: "0"

How exactly this YAML section is integrated into the Katib job manifest depends on the type of Katib worker trial and can be found here. For example, in PyTorchJobs, the section can be added under the env field within the containers level of the template in both the Master and Worker specifications.
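Inside the Pod, the injected variables can then be turned into the endpoint URL that Python S3 clients expect. A minimal sketch, assuming the manifest above; the helper name is ours, not from any library:

```python
import os

def endpoint_url_from_env(env=None):
    """Build an endpoint URL from the injected S3_ENDPOINT / S3_USE_HTTPS
    variables. Helper defined here for illustration only."""
    env = os.environ if env is None else env
    scheme = "https" if env.get("S3_USE_HTTPS", "1") == "1" else "http"
    return f"{scheme}://{env['S3_ENDPOINT']}"

# With the manifest above, this yields "http://minio.minio":
print(endpoint_url_from_env({"S3_ENDPOINT": "minio.minio", "S3_USE_HTTPS": "0"}))
```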

For pipeline tasks, the Kubeflow Pipelines Kubernetes SDK can be used as shown in the following example:

from kfp import dsl
from kfp import kubernetes
from typing import List

def add_minio_env_vars_to_tasks(task_list: List[dsl.PipelineTask]) -> None:
    """Adds environment variables for MinIO to the tasks"""
    for task in task_list:
        kubernetes.use_secret_as_env(
            task,
            secret_name="s3creds",
            secret_key_to_env={
                "AWS_ACCESS_KEY_ID": "AWS_ACCESS_KEY_ID",
                "AWS_SECRET_ACCESS_KEY": "AWS_SECRET_ACCESS_KEY",
                "S3_ENDPOINT": "S3_ENDPOINT",
            }
        )