Run An AppWrapper

Run an AppWrapper on Kueue.

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running AppWrappers.

AppWrappers provide a flexible and workload-agnostic mechanism for enabling Kueue to manage a group of Kubernetes resources as a single logical unit without requiring any Kueue-specific support by the controllers of those resources.

AppWrappers are designed to harden workloads by providing an additional level of automatic fault detection and recovery. The AppWrapper controller monitors the health of the workload. If the primary resource controllers do not take corrective action within the specified deadlines, the AppWrapper controller orchestrates workload-level retries and resource deletion, so that the workload either returns to a healthy state or is cleanly removed from the cluster and its quota is freed for use by other workloads.
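
The deadlines and retry behavior described above can be tuned per AppWrapper. The fragment below is a minimal sketch, assuming the annotation keys listed in the AppWrapper project's fault-tolerance documentation. These keys are not defined on this page, so verify them against your installed AppWrapper version. The values are illustrative only.

```yaml
metadata:
  name: sample-appwrapper-pytorch-job
  annotations:
    # Assumed annotation keys; confirm them in the AppWrapper documentation.
    # How long a failing workload may try to recover before the AppWrapper controller steps in.
    workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 1m
    # Pause between removing a failed attempt and starting the next retry.
    workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 30s
    # Maximum number of workload-level retries (annotation values must be strings).
    workload.codeflare.dev.appwrapper/retryLimit: "3"
```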

This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

Check Administer cluster quotas for details on the initial Kueue setup.

To simplify setup, make sure you are using Kueue v0.11.0 or newer and AppWrapper v1.1.1 or newer.

See AppWrapper Quick-Start Guide for installation and configuration details of the AppWrapper Operator.

AppWrapper definition

When running AppWrappers on Kueue, take into consideration the following aspects:

a. Queue selection

The target local queue should be specified in the metadata.labels section of the AppWrapper.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
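
The label value must match the name of a LocalQueue in the same namespace as the AppWrapper. For reference, a minimal sketch of such a queue is shown here, assuming the `default` namespace and a ClusterQueue named `cluster-queue`. Both names are placeholders; use whatever your cluster administrator has configured.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default        # assumed namespace; must match the AppWrapper's namespace
  name: user-queue          # the value used in the kueue.x-k8s.io/queue-name label
spec:
  clusterQueue: cluster-queue   # assumed ClusterQueue name
```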

b. Configure the resource needs

The resource needs of the workload are computed by combining the resource needs of each wrapped component.
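
To illustrate how component requests combine, here is a minimal sketch of an AppWrapper that wraps two Pods. The AppWrapper name, the Pod names, and the CPU values are hypothetical. The explicit `podSets` entries assume that the Pod template sits at the path `template` within each component; depending on your AppWrapper version they may be inferred automatically. Under these assumptions the AppWrapper as a whole needs 500m + 250m = 750m of CPU quota to be admitted.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-appwrapper-two-pods   # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  components:
  - podSets:
    - path: "template"     # assumed path: the Pod itself is the pod template
      replicas: 1
    template:
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-a         # hypothetical name
      spec:
        containers:
        - name: main
          image: registry.k8s.io/nginx-slim:0.27
          resources:
            requests:
              cpu: "500m"   # contributes 500m
  - podSets:
    - path: "template"
      replicas: 1
    template:
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-b         # hypothetical name
      spec:
        containers:
        - name: main
          image: registry.k8s.io/nginx-slim:0.27
          resources:
            requests:
              cpu: "250m"   # contributes 250m; total for the AppWrapper is 750m
```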

Example AppWrapper containing a PyTorchJob

The AppWrapper looks like the following:

apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-appwrapper-pytorch-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  components:
  - template:
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      metadata:
        name: pytorch-simple
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
                  command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
                  command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1

Example AppWrapper containing a Deployment

The AppWrapper looks like the following:

apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-appwrapper-deployment
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  components:
  - podSets:
    - path: "template.spec.template"
      replicas: 3
    template:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx-deployment
        labels:
          app: nginx
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: nginx
        template:
          metadata:
            labels:
              app: nginx
          spec:
            containers:
              - name: nginx
                image: registry.k8s.io/nginx-slim:0.27
                ports:
                  - containerPort: 80
                resources:
                  requests:
                    cpu: "100m"
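
In the Deployment example above, the `podSets` entry tells the AppWrapper controller where the pod template lives inside the component and how many Pods it produces. The annotated fragment below repeats that entry as a sketch. The comments reflect the author's reading of the AppWrapper API rather than text from this page. With 3 replicas each requesting 100m of CPU, the workload needs 300m of CPU quota in total.

```yaml
podSets:
- path: "template.spec.template"   # location of the pod template, relative to the component
  replicas: 3                      # should match spec.replicas of the wrapped Deployment
```

If the wrapped resource's replica count changes, update `replicas` here as well so that the quota Kueue reserves matches what the Deployment will actually create.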