Run an MXJob

Run a Kueue-scheduled MXJob

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running Training Operator MXJobs.

This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

See Administer cluster quotas for details on the initial cluster setup.

Check the Training Operator installation guide.

Note that the minimum required Training Operator version is v1.7.0.

You can modify the Kueue configuration of an installed release to include MXJobs as an allowed workload.
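As a rough sketch of what enabling MXJobs can look like, the fragment below shows the integrations section of a Kueue manager configuration. It assumes the v1beta1 configuration API and that the framework identifier for MXJobs is kubeflow.org/mxjob; verify both against the Kueue release you have installed, and keep any other frameworks your cluster already enables.

```yaml
# Hypothetical fragment of the Kueue manager configuration.
# Assumption: your release uses the config.kueue.x-k8s.io/v1beta1 API.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
    - "batch/job"           # keep the integrations you already rely on
    - "kubeflow.org/mxjob"  # assumed identifier that lets Kueue manage MXJobs
```

After changing the configuration, the Kueue controller manager typically needs to be restarted for the new list of frameworks to take effect.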

MXJob definition

a. Queue selection

Specify the target local queue in the metadata.labels section of the MXJob configuration.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
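The label value must match the name of a LocalQueue in the MXJob's namespace. For orientation, a minimal sketch of such a queue is shown here. The default namespace and the ClusterQueue name cluster-queue are assumptions for illustration; use the values from your own cluster setup.

```yaml
# Illustrative LocalQueue that the queue-name label above would refer to.
# Assumptions: the "default" namespace and a ClusterQueue named "cluster-queue".
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: user-queue
spec:
  clusterQueue: cluster-queue
```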

b. Optionally set the suspend field in MXJobs

spec:
  runPolicy:
    suspend: true

Setting this field yourself is optional: by default, Kueue's webhook sets suspend to true when the MXJob is created, and Kueue unsuspends the MXJob once it is admitted.

Sample MXJob

This example is based on https://github.com/kubeflow/training-operator/blob/a4c0cec561a4bfe478720f1a102f305ed656071b/examples/mxnet/mxjob_dist_v1.yaml.

apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: mxnet-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: kubeflow/mxnet-gpu:latest
              resources:
                limits:
                  cpu: 100m
                  memory: 0.2Gi
              ports:
                - containerPort: 9991
                  name: mxjob-port
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: kubeflow/mxnet-gpu:latest
              resources:
                limits:
                  cpu: 100m
                  memory: 0.2Gi
              ports:
                - containerPort: 9991
                  name: mxjob-port
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: kubeflow/mxnet-gpu:latest
              command:
              - python3
              args:
              - /mxnet/mxnet/example/image-classification/train_mnist.py
              - --num-epochs=1
              - --num-layers=2
              - --kv-store=dist_device_sync
              resources:
                limits:
                  cpu: 2
                  memory: 1Gi
              ports:
                - containerPort: 9991
                  name: mxjob-port