Run a TFJob

Run a Kueue-scheduled TFJob

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running Training Operator TFJobs.

This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

See Administer cluster quotas for details on the initial cluster setup.

Check the Training Operator installation guide.

Note that the minimum required training-operator version is v1.7.0.

You can modify the Kueue configuration of an installed release to include TFJobs as an allowed workload.
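As a rough sketch of what that change might look like, the fragment below lists TFJob among the enabled integrations of the Kueue controller configuration. It assumes the configuration lives in the manager's ConfigMap, as in a default installation, and that batch/job is the only other integration you want to keep; adjust the list to match your release.

```yaml
# Sketch of the integrations section of the Kueue Configuration.
# Assumption: other fields of your existing configuration stay unchanged.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
    - "batch/job"           # assumed to be already enabled
    - "kubeflow.org/tfjob"  # enables Kueue management of TFJobs
```

The Kueue controller typically needs to be restarted for a configuration change like this to take effect.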

TFJob definition

a. Queue selection

The target local queue should be specified in the metadata.labels section of the TFJob configuration.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
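The label above only takes effect if a LocalQueue with that name exists in the TFJob's namespace. A minimal sketch of such a queue follows; the ClusterQueue name cluster-queue is an assumption taken from a typical single-queue setup and should be replaced with the one your administrator created.

```yaml
# Sketch of the LocalQueue referenced by the queue-name label.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default      # must match the TFJob's namespace
  name: user-queue        # the value used in kueue.x-k8s.io/queue-name
spec:
  clusterQueue: cluster-queue  # assumed ClusterQueue name
```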

b. Optionally set the suspend field in TFJobs

spec:
  runPolicy:
    suspend: true

By default, Kueue sets suspend to true via a webhook and unsuspends the TFJob once it is admitted.

Sample TFJob

This example is based on https://github.com/kubeflow/training-operator/blob/48dbbf0a8e90e52c55ec05d0f689fcbf83c6b441/examples/tensorflow/dist-mnist/tf_job_mnist.yaml.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-dist-mnist
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:latest
              resources:
                requests:
                  cpu: 1
                  memory: "200Mi"
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:latest
              resources:
                requests:
                  cpu: 1
                  memory: "200Mi"