Run Kubeflow Jobs in Multi-Cluster
Before you begin
Check the MultiKueue installation guide to learn how to properly set up MultiKueue clusters.
For ease of setup and use, we recommend Kueue v0.11.0 or later and Kubeflow Trainer v1.9.0 or later.
See Trainer Installation for installation and configuration details of Trainer.
Note
On Kueue versions below v0.11.0, which do not support the ManagedBy feature, the installation of Kubeflow Trainer on the manager cluster must be limited to CRDs only.
To install the CRDs run:
kubectl apply -k "github.com/kubeflow/trainer.git/manifests/base/crds?ref=v1.9.0"
MultiKueue integration
Once the setup is complete, you can test it by running one of the Kubeflow Jobs, for example the PyTorchJob sample-pytorchjob.yaml.
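As a rough illustration of what such a test Job can look like, the following sketch writes a minimal PyTorchJob manifest to a local file. The file name, the Job name, the user-queue LocalQueue, the container image, and the command are all placeholder assumptions rather than the contents of the actual sample-pytorchjob.yaml; substitute values that exist in your clusters.

```shell
#!/bin/sh
# Write a minimal PyTorchJob manifest to a local file.
# Assumptions (not taken from the official sample): the Job name, the
# "user-queue" LocalQueue, and the image/command are placeholders.
cat > my-pytorchjob.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: default
  labels:
    # Points the Job at a LocalQueue so that Kueue manages its admission.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/pytorch-train:latest  # placeholder
              command: ["python3", "/workspace/train.py"]       # placeholder
              resources:
                requests:
                  cpu: "1"
                  memory: 200Mi
EOF
```

The resulting file can then be submitted to the manager cluster, for instance with kubectl create -f my-pytorchjob.yaml.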
Note
Kueue defaults the spec.runPolicy.managedBy field to kueue.x-k8s.io/multikueue on the management cluster for all Kubeflow Jobs.
This allows the Trainer to ignore the Jobs managed by MultiKueue on the management cluster, and in particular skip Pod creation.
The Pods are created, and the actual computation happens, for the mirror copy of the Job on the selected worker cluster. The mirror copy of the Job does not have the field set.
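One way to observe this behavior is to read the field back from each cluster. The helper below is a sketch only: the function name managed_by, the kubeconfig context names, and the Job name used in the usage note are hypothetical, and the namespace defaults to default when omitted.

```shell
#!/bin/sh
# Print spec.runPolicy.managedBy for a PyTorchJob in a given cluster.
# Arguments: <kube-context> <job-name> [namespace]
# The function name and argument layout are illustrative assumptions.
managed_by() {
  kubectl --context "$1" get pytorchjobs.kubeflow.org "$2" \
    --namespace "${3:-default}" \
    -o jsonpath='{.spec.runPolicy.managedBy}'
}
```

Calling managed_by manager-cluster pytorch-simple against the management cluster is expected to print kueue.x-k8s.io/multikueue, while the same query for the mirror copy on the worker cluster should print nothing, because the field is not set there.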
Working alongside MPI Operator
For the MPI Operator and Trainer to work on the same cluster, it is required that:
- the kubeflow.org_mpijobs.yaml entry is removed from base/crds/kustomization.yaml (https://github.com/kubeflow/trainer/issues/1930)
- the Trainer deployment is modified to enable all Kubeflow Jobs except for MPI (https://github.com/kubeflow/trainer/issues/1777)
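The first requirement can be sketched as a small helper that edits a local copy of the Trainer manifests. This is an assumption-laden illustration: the function name drop_mpijob_crd is hypothetical, it expects the path to a kustomization.yaml from your own checkout, and it does not cover the second requirement (changing the Trainer deployment).

```shell
#!/bin/sh
# Remove the MPIJob CRD entry from a local copy of a kustomization file.
# Argument: path to kustomization.yaml (a backup is kept as <path>.bak).
# The function name is an illustrative assumption.
drop_mpijob_crd() {
  sed -i.bak '/kubeflow\.org_mpijobs\.yaml/d' "$1"
}
```

After running the helper on the kustomization.yaml in the CRD directory of your checkout, the remaining CRDs can be applied from that directory with kubectl apply -k.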