Run Kubeflow Jobs in Multi-Cluster
Run a MultiKueue-scheduled Kubeflow Job.
Before you begin
Check the MultiKueue installation guide on how to properly set up MultiKueue clusters.
For ease of setup and use, we recommend using at least Kueue v0.8.1 and at least Kubeflow Training Operator v1.8.1.
Manager Cluster
Note
Until the ManagedBy feature is part of a Kubeflow Training Operator release, the installation of Kubeflow Training Operator in the manager cluster must be limited to CRDs only.
To install the CRDs, run:
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/base/crds?ref=v1.8.0"
Worker Cluster
See Training Operator Installation for installation and configuration details of Training Operator.
MultiKueue integration
Once the setup is complete, you can test it by running one of the Kubeflow Jobs, e.g. the PyTorchJob sample-pytorchjob.yaml.
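For orientation, a test job of this kind could look roughly like the sketch below. It is not the contents of sample-pytorchjob.yaml: the job name, the LocalQueue name `user-queue`, the container image, and the resource requests are all placeholders chosen for illustration, and the `MANIFEST` and `SUBMIT` variables are hypothetical helpers of this script. The script only writes the manifest unless `SUBMIT=1` is set.

```shell
#!/bin/sh
# Sketch: write a minimal PyTorchJob manifest and optionally submit it.
# Assumptions (not from this page): the job name, the LocalQueue name
# "user-queue", the image, and the resource requests are placeholders.
set -eu

MANIFEST="${MANIFEST:-pytorchjob-sketch.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: default
  labels:
    # Label that assigns the job to a Kueue LocalQueue (placeholder queue name).
    kueue.x-k8s.io/queue-name: user-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/pytorch-mnist:latest  # placeholder image
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
EOF

# Submit only on explicit request, against the manager cluster's kubeconfig.
if [ "${SUBMIT:-0}" = "1" ]; then
  kubectl create -f "$MANIFEST"
fi
```

Running it with `SUBMIT=1` while the current kubectl context points at the manager cluster creates the job there. Replace the placeholder queue name with a LocalQueue that exists in your setup.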
Working alongside MPI Operator
For MPI Operator and Training Operator to work on the same cluster, both of the following are required:
- The kubeflow.org_mpijobs.yaml entry is removed from base/crds/kustomization.yaml - https://github.com/kubeflow/training-operator/issues/1930
- The Training Operator deployment is modified to enable all Kubeflow jobs except for MPI - https://github.com/kubeflow/training-operator/issues/1777
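The first requirement can be sketched as a small edit to a local copy of the CRD kustomization. This assumes a checkout of the training-operator repository in the current directory. The helper name `drop_mpijob_crd` and the `KUSTOMIZATION` variable are hypothetical, introduced only for this example. The second requirement depends on the Training Operator's own deployment options and is not covered here; see the linked issue.

```shell
#!/bin/sh
# Sketch: remove the MPIJob CRD entry from a local copy of the Training
# Operator CRD kustomization. The default path assumes a checkout of the
# training-operator repository in the current directory (an assumption).
set -eu

drop_mpijob_crd() {
  kustomization="$1"
  tmp="$(mktemp)"
  # Keep every line except the one listing kubeflow.org_mpijobs.yaml.
  # grep exits non-zero when nothing is left to print, hence "|| true".
  grep -vF 'kubeflow.org_mpijobs.yaml' "$kustomization" > "$tmp" || true
  # Overwrite in place so the original file permissions are preserved.
  cat "$tmp" > "$kustomization"
  rm -f "$tmp"
}

KUSTOMIZATION="${KUSTOMIZATION:-manifests/base/crds/kustomization.yaml}"
if [ -f "$KUSTOMIZATION" ]; then
  drop_mpijob_crd "$KUSTOMIZATION"
fi
```

After the edit, the remaining CRDs can be applied from the local checkout with `kubectl apply -k manifests/base/crds`.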