Run an XGBoostJob

Run a Kueue-scheduled XGBoostJob

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running Training Operator XGBoostJobs.

This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

See Administer cluster quotas for details on the initial cluster setup.

Check the Training Operator installation guide.

The minimum required Training Operator version is v1.7.0.

You can modify the Kueue configuration of an installed release to include XGBoostJobs as an allowed workload.
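
As a rough sketch of what that change can look like, the fragment below enables the XGBoostJob integration in Kueue's Configuration object. It assumes the v1beta1 configuration API and a default install where this object is embedded in the controller manager's ConfigMap; the batch/job entry is only an illustrative neighbor, and your release may already list XGBoostJobs.

```yaml
# Sketch only: excerpt of the Kueue manager configuration.
# Assumes config.kueue.x-k8s.io/v1beta1; check the version your release uses.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
    - "batch/job"                # illustrative: keep whatever entries you already have
    - "kubeflow.org/xgboostjob"  # lets Kueue manage XGBoostJobs
```

A configuration change like this only takes effect once the Kueue controller manager restarts.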

Note

To use the Training Operator, you need to restart Kueue after the installation. You can do so by running: kubectl delete pods -l control-plane=controller-manager -n kueue-system.

XGBoostJob definition

a. Queue selection

The target local queue should be specified in the metadata.labels section of the XGBoostJob configuration.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
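
The label refers to a LocalQueue that must exist in the same namespace as the XGBoostJob. A minimal sketch of such a queue follows, assuming the kueue.x-k8s.io/v1beta1 API; the ClusterQueue name cluster-queue is a placeholder for whatever your administrator created.

```yaml
# Sketch only: a LocalQueue matching the queue-name label above.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default           # must match the XGBoostJob's namespace
  name: user-queue             # value used in kueue.x-k8s.io/queue-name
spec:
  clusterQueue: cluster-queue  # placeholder: the ClusterQueue backing this queue
```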

b. Optionally set the suspend field in XGBoostJobs

spec:
  runPolicy:
    suspend: true

By default, Kueue will set suspend to true via webhook and unsuspend it when the XGBoostJob is admitted.

Sample XGBoostJob

This example is based on https://github.com/kubeflow/training-operator/blob/afba76bc5a168cbcbc8685c7661f36e9b787afd1/examples/xgboost/xgboostjob.yaml.

apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  name: xgboost-dist-iris-test-train
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  xgbReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: xgboost
              image: docker.io/kubeflow/xgboost-dist-iris:latest
              resources:
                requests:
                  cpu: 0.5
                  memory: 256Mi
              ports:
                - containerPort: 9991
                  name: xgboostjob-port
              imagePullPolicy: Always
              args:
                - --job_type=Train
                - --xgboost_parameter=objective:multi:softprob,num_class:3
                - --n_estimators=10
                - --learning_rate=0.1
                - --model_path=/tmp/xgboost-model
                - --model_storage_type=local
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        spec:
          containers:
            - name: xgboost
              image: docker.io/kubeflow/xgboost-dist-iris:latest
              resources:
                requests:
                  cpu: 0.5
                  memory: 256Mi
              ports:
                - containerPort: 9991
                  name: xgboostjob-port
              imagePullPolicy: Always
              args:
                - --job_type=Train
                - --xgboost_parameter="objective:multi:softprob,num_class:3"
                - --n_estimators=10
                - --learning_rate=0.1
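
For the sample to be admitted, the queue it targets needs enough quota for all of its pods together: one Master and two Workers at 0.5 CPU and 256Mi each, so 1.5 CPU and 768Mi in total. The sketch below shows a ClusterQueue that would cover this, assuming the kueue.x-k8s.io/v1beta1 API; the names cluster-queue and default-flavor and the quota values are illustrative, not part of the sample.

```yaml
# Sketch only: quota large enough for the sample job (1.5 CPU, 768Mi requested).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor         # placeholder flavor name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue          # placeholder queue name
spec:
  namespaceSelector: {}        # accept workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 2      # >= 3 pods x 0.5 CPU
            - name: memory
              nominalQuota: 1Gi    # >= 3 pods x 256Mi
```

If the quota is smaller than the job's total request, the XGBoostJob is expected to stay suspended until capacity becomes available.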


Last modified July 24, 2024: Add a note about the need to restart the kueue pod (#2687) (102a4e74)