Run Kubeflow Jobs in Multi-Cluster

Run MultiKueue-scheduled Kubeflow Jobs.

Before you begin

Check the MultiKueue installation guide for how to properly set up MultiKueue clusters.

For ease of setup and use, we recommend using at least Kueue v0.11.0 and at least Kubeflow Trainer v1.9.0.

See Trainer Installation for details on installing and configuring Trainer.

MultiKueue integration

Once the setup is complete, you can test it by running one of the Kubeflow Jobs, for example the PyTorchJob in sample-pytorchjob.yaml.
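
For orientation, a PyTorchJob submitted through a MultiKueue-enabled queue might look roughly like the sketch below. This is not the content of sample-pytorchjob.yaml. The job name, the queue name `user-queue`, the image, and the resource requests are placeholders chosen for illustration, and the explicit `managedBy` value is an assumption about your setup (depending on the Kueue version it may not need to be set by hand).

```yaml
# Minimal sketch of a PyTorchJob for a MultiKueue-enabled LocalQueue.
# All names, the image, and the resource values are hypothetical placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-sample            # hypothetical job name
  namespace: default
  labels:
    # Assumes a LocalQueue called "user-queue" exists in this namespace
    # and points at a ClusterQueue configured for MultiKueue.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runPolicy:
    # Assumption: delegate the job to MultiKueue explicitly.
    managedBy: kueue.x-k8s.io/multikueue
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch       # PyTorchJob expects the container to be named "pytorch"
              image: registry.example.com/pytorch-training:latest  # placeholder image
              resources:
                requests:         # requests are what Kueue counts against quota
                  cpu: "1"
                  memory: 1Gi
```

After creating such a job on the manager cluster, it is expected to run on one of the worker clusters while its status is reflected back on the manager.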

Working alongside MPI Operator

For the MPI Operator and Trainer to work on the same cluster, both of the following changes are required:

  1. The kubeflow.org_mpijobs.yaml entry is removed from base/crds/kustomization.yaml (see https://github.com/kubeflow/trainer/issues/1930).
  2. The Trainer deployment is modified to enable all Kubeflow Jobs except MPI (see https://github.com/kubeflow/trainer/issues/1777).
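
The two changes above could be sketched as follows. This is an illustration under assumptions, not the project's official manifests: the Deployment and container name `training-operator`, the `kubeflow` namespace, and the exact list of job kinds passed to `--enable-scheme` are assumed from a typical Trainer v1.9 installation, so check them against the manifests you actually deploy.

```yaml
# (1) base/crds/kustomization.yaml, excerpt only: drop the MPIJob CRD entry
#     so the CRD shipped with the MPI Operator is the only one installed.
resources:
  # ... other CRD entries stay unchanged ...
  # - kubeflow.org_mpijobs.yaml   # removed
---
# (2) Patch for the Trainer Deployment: enable every job kind except MPIJob.
#     Names, namespace, and the scheme list below are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: training-operator
          args:
            - --enable-scheme=tfjob
            - --enable-scheme=pytorchjob
            - --enable-scheme=xgboostjob
            - --enable-scheme=paddlejob
            - --enable-scheme=jaxjob
```

Listing the schemes explicitly, rather than relying on the default of enabling everything, keeps Trainer from reconciling MPIJob objects that the MPI Operator should own.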