Run Kubeflow Jobs in Multi-Cluster

Run MultiKueue-scheduled Kubeflow Jobs.

Before you begin

Check the MultiKueue installation guide to learn how to properly set up MultiKueue clusters.

For ease of setup and use, we recommend at least Kueue v0.8.1 and at least Kubeflow Training Operator v1.8.1.

Manager Cluster

To install the Training Operator CRDs, run:

kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/base/crds?ref=v1.8.1"
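After applying the CRDs, it can help to confirm that the API server actually knows about them. The following is a minimal sketch, assuming your current kubectl context points at the manager cluster; the helper name `list_kubeflow_crds` is hypothetical and only used here for illustration.

```shell
# Sketch: list the CRDs in the kubeflow.org API group.
# Assumption: the current kubectl context is the manager cluster.
list_kubeflow_crds() {
  # "-o name" prints one "<resource-type>/<crd-name>" entry per line;
  # keep only the entries whose CRD name ends in kubeflow.org.
  kubectl get crds -o name | grep 'kubeflow\.org$'
}
```

Calling `list_kubeflow_crds` should print one line per installed Kubeflow job kind, such as an entry ending in `pytorchjobs.kubeflow.org`. An empty result suggests the CRDs were not applied to the cluster you are connected to.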

Worker Cluster

See Training Operator Installation for details on installing and configuring the Training Operator.

MultiKueue integration

Once the setup is complete, you can test it by running one of the Kubeflow Jobs, for example the PyTorchJob in sample-pytorchjob.yaml.
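One way to run that test is sketched below. It assumes that sample-pytorchjob.yaml has been downloaded to the current directory, that the manifest references a LocalQueue that exists in your setup, and that kubectl points at the manager cluster. The helper name `submit_sample_job` is hypothetical.

```shell
# Sketch: submit the sample job and list the objects created for it.
# $1: path to the job manifest (defaults to sample-pytorchjob.yaml in the
#     current directory, which is an assumption about where you saved it).
submit_sample_job() {
  manifest="${1:-sample-pytorchjob.yaml}"
  # Create the job on the manager cluster; stop if the API server rejects it.
  kubectl create -f "$manifest" || return 1
  # List the PyTorchJobs and the Kueue Workloads in the current namespace.
  kubectl get pytorchjobs
  kubectl get workloads
}
```

Run `submit_sample_job` with no arguments to use the default file name. If the job is admitted, a Workload for it should appear in the second listing, and a copy of the job should show up on one of the worker clusters.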

Working alongside MPI Operator

For MPI Operator and Training Operator to work on the same cluster, both of the following are required:

  1. The kubeflow.org_mpijobs.yaml entry is removed from base/crds/kustomization.yaml, so that Training Operator does not install the MPIJob CRD (see https://github.com/kubeflow/training-operator/issues/1930).
  2. The Training Operator deployment is modified to enable all Kubeflow jobs except MPI (see https://github.com/kubeflow/training-operator/issues/1777).
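The two requirements above can be sketched as a pair of shell helpers. This is an illustration under stated assumptions, not a verified procedure. It assumes a local checkout of the training-operator manifests and that the operator binary accepts repeated `--enable-scheme` flags; check both against the operator version you run. The helper names, the `kubeflow` namespace, the `training-operator` deployment name, and the list of schemes are all assumptions.

```shell
# Step 1 (sketch): drop the MPIJob CRD from a local CRD kustomization.
# $1: path to manifests/base/crds/kustomization.yaml in your checkout (assumed layout).
remove_mpijob_crd_entry() {
  file="$1"
  # Keep every line except the one that lists kubeflow.org_mpijobs.yaml.
  grep -v 'kubeflow.org_mpijobs.yaml' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Step 2 (sketch): print a JSON patch that sets one --enable-scheme flag
# per job kind passed as an argument (pass every kind except mpijob).
# Note: the patch sets the first container's args and would overwrite
# any args already present there.
enable_scheme_patch() {
  args=""
  for scheme in "$@"; do
    # Append a quoted flag, comma-separated from the previous ones.
    args="${args:+$args,}\"--enable-scheme=$scheme\""
  done
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/args","value":[%s]}]\n' "$args"
}

# Hypothetical usage (namespace, deployment name and scheme list are assumptions):
#   remove_mpijob_crd_entry manifests/base/crds/kustomization.yaml
#   kubectl apply -k manifests/base/crds
#   kubectl -n kubeflow patch deployment training-operator --type=json \
#     -p "$(enable_scheme_patch tfjob pytorchjob xgboostjob paddlejob)"
```

Building the patch from an explicit list keeps the enabled job kinds visible in one place, which makes it easier to adjust when a newer operator release adds or removes a job kind.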