Run An AppWrapper

Run an AppWrapper on Kueue.

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running AppWrappers.

AppWrappers provide a flexible and workload-agnostic mechanism for enabling Kueue to manage a group of Kubernetes resources as a single logical unit without requiring any Kueue-specific support by the controllers of those resources.

AppWrappers are designed to harden workloads by providing an additional level of automatic fault detection and recovery. The AppWrapper controller monitors the health of the workload. If the primary resource controllers do not take corrective action within specified deadlines, the AppWrapper controller orchestrates workload-level retries and resource deletion. This ensures that the workload either returns to a healthy state or is cleanly removed from the cluster, freeing its quota for use by other workloads.

This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

  1. Make sure you are using Kueue v0.11.0 or newer and AppWrapper v1.0.0 or newer.

  2. Check Administer cluster quotas for details on the initial Kueue setup.

  3. Because AppWrappers were initially designed as an external framework for Kueue, you need to install the Standalone configuration of the AppWrapper controller. This disables the AppWrapper controller’s instance of Kueue’s GenericJob Reconciler. One way to do this is to run:

kustomize build "https://github.com/project-codeflare/appwrapper/config/standalone?ref=v1.0.0" | kubectl apply --server-side -f -

A future release of AppWrapper will change its default configuration to disable its copy of the GenericJob Reconciler.

AppWrapper definition

When running AppWrappers on Kueue, take into consideration the following aspects:

a. Queue selection

The target local queue should be specified in the metadata.labels section of the AppWrapper.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
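
For context, the label value names a LocalQueue in the same namespace as the AppWrapper. A minimal sketch of such a queue might look like the following. The namespace `default` and the ClusterQueue name `cluster-queue` are assumptions for illustration; the real values depend on how Kueue was set up on your cluster.

```yaml
# Hypothetical LocalQueue that the kueue.x-k8s.io/queue-name label points to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default            # assumption: must match the AppWrapper's namespace
  name: user-queue              # the value used in the queue-name label
spec:
  clusterQueue: cluster-queue   # assumption: an existing ClusterQueue
```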

b. Configure the resource needs

The resource needs of the workload are computed by combining the resource needs of each wrapped component.
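
As a rough illustration of this rule, the sketch below wraps two Pods. The AppWrapper name, Pod names, image, and CPU values are made up for the example. Assuming the per-component requests are simply summed, Kueue would need quota for 3 CPUs to admit this AppWrapper.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-two-pods          # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  components:
  - template:
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-a              # hypothetical name
      spec:
        restartPolicy: Never
        containers:
        - name: main
          image: busybox:1.36    # illustrative image
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: 1             # first component requests 1 CPU
  - template:
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-b              # hypothetical name
      spec:
        restartPolicy: Never
        containers:
        - name: main
          image: busybox:1.36    # illustrative image
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: 2             # second component requests 2 CPUs
# Combined request under the summing assumption: 1 + 2 = 3 CPUs
```

By the same reasoning, a PyTorchJob with one Master replica and one Worker replica, each requesting 1 CPU, would likely need 2 CPUs of quota.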

Example AppWrapper containing a PyTorchJob

The AppWrapper looks like the following:

apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-pytorch-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  components:
  - template:
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      metadata:
        name: pytorch-simple
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
                  command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
                  command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1

Last modified January 15, 2025: AppWrapper integration (#3953) (d71331ba)