Run a MPIJob

Run a Kueue scheduled MPIJob

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running MPI Operator MPIJobs.

This guide is for batch users that have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

Check administer cluster quotas for details on the initial cluster setup.

Check the MPI Operator installation guide.

You can modify kueue configurations from installed releases to include MPIJobs as an allowed workload.

MPI Operator definition

a. Queue selection

The target local queue should be specified in the metadata.labels section of the MPIJob configuration.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue

b. Optionally set Suspend field in MPIJobs

spec:
  runPolicy:
    suspend: true

By default, Kueue will set suspend to true via webhook and unsuspend it when the MPIJob is admitted.

Sample MPIJob

This example is based on https://github.com/kubeflow/mpi-operator/blob/ccf2756f749336d652fa6b10a732e241a40c7aa6/examples/v2beta1/pi/pi.yaml.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                cpu: 1
                memory: 1Gi

For equivalent instructions for doing this in Python, see Run Python Jobs.


Last modified October 29, 2024: Change MPIJob naming from to (#3360) (853f1229)