Run An AppWrapper
This page shows how to leverage Kueue’s scheduling and resource management capabilities when running AppWrappers.
AppWrappers provide a flexible and workload-agnostic mechanism for enabling Kueue to manage a group of Kubernetes resources as a single logical unit without requiring any Kueue-specific support by the controllers of those resources.
AppWrappers are designed to harden workloads by providing an additional level of automatic fault detection and recovery. The AppWrapper controller monitors the health of the workload. If the primary resource controllers do not take corrective action within specified deadlines, the AppWrapper controller orchestrates workload-level retries and resource deletion. This ensures that the workload either returns to a healthy state or is cleanly removed from the cluster, freeing its quota for use by other workloads.
This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.
Before you begin
Make sure you are using Kueue v0.11.0 or newer and AppWrapper v1.0.0 or newer.
Check Administer cluster quotas for details on the initial Kueue setup.
Because AppWrappers were initially designed as an external framework for Kueue, you need to install the Standalone configuration of the AppWrapper controller. This disables the AppWrapper controller’s instance of Kueue’s GenericJob Reconciler. One way to obtain the manifests for this configuration is to run the following command and apply its output to your cluster:
kustomize build "https://github.com/project-codeflare/appwrapper/config/standalone?ref=v1.0.0"
A future release of AppWrapper will change its default configuration to disable its copy of the GenericJob Reconciler.
AppWrapper definition
When running AppWrappers on Kueue, take into consideration the following aspects:
a. Queue selection
The target local queue should be specified in the metadata.labels section of the AppWrapper.
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
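If manifests are generated programmatically, the label can be added before submission. The following is a minimal sketch, assuming the manifest is held as a plain Python dictionary; the helper name with_queue_name is hypothetical and is not part of Kueue or AppWrapper.

```python
import copy

# Label key taken from the snippet above.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_queue_name(manifest, queue_name):
    """Return a copy of the manifest with the Kueue queue-name label set.

    Hypothetical helper for illustration; the input dict is left unchanged.
    """
    result = copy.deepcopy(manifest)
    labels = result.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = queue_name
    return result


# A trimmed AppWrapper manifest without any labels.
app_wrapper = {
    "apiVersion": "workload.codeflare.dev/v1beta2",
    "kind": "AppWrapper",
    "metadata": {"name": "sample-pytorch-job"},
}

labeled = with_queue_name(app_wrapper, "user-queue")
print(labeled["metadata"]["labels"])
```

Only the top-level metadata of the AppWrapper is labeled here; the wrapped components are left untouched.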
b. Configure the resource needs
The resource needs of the workload are computed by combining the resource needs of each wrapped component.
Example AppWrapper containing a PyTorchJob
The AppWrapper looks like the following:
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-pytorch-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  components:
  - template:
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      metadata:
        name: pytorch-simple
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
                  command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
                  command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1
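To make the resource accounting concrete, the sketch below adds up the CPU requests of the sample AppWrapper above. It is an illustration of the arithmetic only, not the controller's implementation: it assumes each replica group contributes its per-pod request multiplied by its replica count, and the pod_sets list is transcribed by hand from the manifest.

```python
# Hand-written summary of the two replica groups in the sample manifest.
# Illustrative data only; this is not output from Kueue or the
# AppWrapper controller.
pod_sets = [
    {"name": "Master", "replicas": 1, "cpu_request": 1},
    {"name": "Worker", "replicas": 1, "cpu_request": 1},
]


def total_cpu_request(pod_sets):
    """Sum replicas * per-pod CPU request over all pod sets."""
    return sum(ps["replicas"] * ps["cpu_request"] for ps in pod_sets)


total = total_cpu_request(pod_sets)
print(f"Total CPU requested: {total}")
```

Under these assumptions the sample workload requests 2 CPUs in total, so the quota behind user-queue needs at least that much available CPU for the AppWrapper to be admitted.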