Run Kubeflow Jobs in Multi-Cluster

Run MultiKueue-scheduled Kubeflow Jobs.

Before you begin

Check the MultiKueue installation guide to learn how to properly set up MultiKueue clusters.

For ease of setup and use, we recommend using at least Kueue v0.11.0 and at least Kubeflow Trainer v1.9.0.

See Trainer Installation for installation and configuration details of Trainer.

Note

In Kueue versions below v0.11.0, which do not support the ManagedBy feature, the installation of Kubeflow Trainer in the manager cluster must be limited to CRDs only.

To install the CRDs, run:

kubectl apply -k "github.com/kubeflow/trainer.git/manifests/base/crds?ref=v1.9.0"

MultiKueue integration

Once the setup is complete, you can test it by running one of the Kubeflow Jobs, for example the PyTorchJob defined in sample-pytorchjob.yaml.
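As a rough illustration of what such a Job can look like, the sketch below writes a minimal PyTorchJob manifest to a scratch file. It is not the contents of sample-pytorchjob.yaml: the file name, Job name, LocalQueue name (user-queue), image, and resource requests are all assumptions to replace with your own values.

```shell
# Write a minimal PyTorchJob manifest (illustrative; all names below are assumptions).
cat > pytorchjob-sketch.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-sketch            # hypothetical Job name
  namespace: default
  labels:
    # Assumed LocalQueue; it should point to a ClusterQueue configured for MultiKueue.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/pytorch-train:latest   # placeholder image
              resources:
                requests:
                  cpu: "1"
                  memory: 200Mi
EOF

# Submit it to the manager cluster (requires kubectl access to that cluster):
#   kubectl create -f pytorchjob-sketch.yaml
```

The queue-name label is what hands the Job to Kueue; the resource requests are included because Kueue needs them to account for quota.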

Note

Kueue defaults the spec.runPolicy.managedBy field to kueue.x-k8s.io/multikueue on the management cluster for all Kubeflow Jobs.

This allows the Trainer to ignore the Jobs managed by MultiKueue on the management cluster, and in particular skip Pod creation.

The Pods are created, and the actual computation happens, on the mirror copy of the Job on the selected worker cluster. The mirror copy of the Job does not have the field set.
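One way to observe this behavior is to read the field from both clusters. The helper below is a minimal sketch: the function name managed_by, the kubeconfig context names, and the Job name are hypothetical, and it assumes the Job is a PyTorchJob.

```shell
# Hypothetical helper: print spec.runPolicy.managedBy of a PyTorchJob
# as seen through a given kubeconfig context.
managed_by() {
  context="$1"
  job="$2"
  namespace="${3:-default}"   # namespace defaults to "default" if omitted
  kubectl --context "$context" --namespace "$namespace" \
    get pytorchjob "$job" -o jsonpath='{.spec.runPolicy.managedBy}'
}

# Usage (context and Job names are assumptions):
#   managed_by manager-cluster pytorch-sketch   # per the note above: kueue.x-k8s.io/multikueue
#   managed_by worker1-cluster pytorch-sketch   # per the note above: empty, the field is not set
```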

Working alongside MPI Operator

For MPI Operator and Trainer to work on the same cluster, both of the following are required:

The kubeflow.org_mpijobs.yaml entry is removed from base/crds/kustomization.yaml - https://github.com/kubeflow/trainer/issues/1930
The Trainer deployment is modified to enable all Kubeflow Jobs except MPIJob - https://github.com/kubeflow/trainer/issues/1777
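The first requirement can be sketched as a small shell helper that drops the MPIJob CRD entry from a kustomization file. This is only an illustration: the function name remove_mpijob_crd is hypothetical, and the demo runs against a scratch file whose resource list is made up rather than copied from the Trainer repository. It does not cover the second requirement.

```shell
# Hypothetical helper: remove the MPIJob CRD entry from a kustomization file.
remove_mpijob_crd() {
  file="$1"
  # -F matches the file name literally; -v keeps every other line.
  grep -vF 'kubeflow.org_mpijobs.yaml' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demo on a scratch file; the resource list here is illustrative, not the real file.
scratch_dir="$(mktemp -d)"
cat > "$scratch_dir/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - kubeflow.org_mpijobs.yaml
  - kubeflow.org_pytorchjobs.yaml
EOF
remove_mpijob_crd "$scratch_dir/kustomization.yaml"
cat "$scratch_dir/kustomization.yaml"
```

In a real checkout you would point the helper at your local copy of base/crds/kustomization.yaml and then apply the CRDs from that directory with kubectl apply -k.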

Last modified June 27, 2025: doc: Remove duplicated 'Note' prefix (#5794) (2015adbc)