Run Workloads With DRA Devices
This page shows you how to run workloads that request hardware devices (such as GPUs) managed by Dynamic Resource Allocation (DRA) in a Kubernetes cluster with Kueue enabled. The examples use a batch Job, but the same approach works with any workload type that Kueue supports.
The intended audience for this page is batch users.
For conceptual details about how Kueue handles DRA resources, see Dynamic Resource Allocation concepts.
Before you begin
Make sure the following conditions are met:
- A Kubernetes cluster is running.
- The kubectl command-line tool can communicate with your cluster.
- Kueue is installed.
- The cluster has quotas configured, with DRA resources included in the ClusterQueue.
- Your administrator has set up DRA support in Kueue.
0. Identify the queues available in your namespace
Run the following command to list the LocalQueues available in your namespace.
```shell
kubectl -n default get localqueues
```
The output is similar to the following:
```
NAME         CLUSTERQUEUE    PENDING WORKLOADS
user-queue   cluster-queue   0
```
The ClusterQueue defines the quotas for the LocalQueue.
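If you want to confirm that the backing ClusterQueue has quota for your DRA resource before submitting a workload, you can inspect it directly. The ClusterQueue name here comes from the listing above; substitute your own:

```shell
# Show the ClusterQueue, including its resource groups and quotas
kubectl describe clusterqueue cluster-queue

# Or print just the quota configuration
kubectl get clusterqueue cluster-queue -o jsonpath='{.spec.resourceGroups}'
```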
1. Define the workload
Running a workload with DRA devices is similar to
running a regular Job. You must set the
kueue.x-k8s.io/queue-name label to select the LocalQueue you want to
submit the workload to.
There are two ways to request DRA devices, depending on how your administrator has configured the cluster. Choose the approach that matches your setup.
Using a ResourceClaimTemplate
Use this approach when you need to explicitly describe the device you want.
Create a ResourceClaimTemplate and reference it from the workload:
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: default
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-dra-rct-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          claims:
          - name: gpu
          requests:
            cpu: "1"
            memory: "200Mi"
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
      restartPolicy: Never
```
Using extended resources
Use this approach when a DeviceClass with spec.extendedResourceName exists
in the cluster. You request devices using the standard resources.requests
syntax, just like CPU or memory. No ResourceClaimTemplate is needed:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-dra-extended-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          requests:
            cpu: "1"
            memory: "200Mi"
            example.com/gpu: "1"
          limits:
            example.com/gpu: "1"
      restartPolicy: Never
```
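For reference, the DeviceClass that enables this path might look like the following sketch. The class name and extended resource name are assumptions chosen to match the example above, and the CEL selector is illustrative; the DeviceClass your driver installs will differ:

```yaml
# Illustrative DeviceClass for the extended resource path.
# The names and the selector expression are assumptions.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  # Maps this DeviceClass to a resource name usable in resources.requests
  extendedResourceName: example.com/gpu
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
```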
If you are not sure which approach to use, ask your administrator.
2. Run the workload
Run the workload with one of the following commands, depending on the approach you chose.
For a ResourceClaimTemplate-based workload:
```shell
kubectl create -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-rct-job.yaml
```
For an extended resource-based workload:
```shell
kubectl create -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-extended-resource-job.yaml
```
Internally, Kueue will create a corresponding Workload for this Job.
3. (Optional) Monitor the status of the workload
You can see the Workload status with the following command:
```shell
kubectl -n default get workloads.kueue.x-k8s.io
```
To check whether the workload was admitted and see the DRA resource accounting:
```shell
kubectl -n default describe workload <workload-name>
```
Look at the Conditions section for admission status and the Events
section for details. If the workload was admitted, you can verify the
resources charged for quota in the
status.admission.podSetAssignments[].resourceUsage field:
```shell
kubectl -n default get workloads.kueue.x-k8s.io <workload-name> -o yaml
```
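The relevant part of the status is similar to the following sketch. The pod set name, quota resource names, and values are illustrative and depend on how your administrator mapped DRA devices to quota resources:

```yaml
# Illustrative Workload status fragment; names and values are assumptions.
status:
  admission:
    clusterQueue: cluster-queue
    podSetAssignments:
    - name: main
      resourceUsage:
        cpu: "1"
        memory: 200Mi
        gpus: "1"  # quota charged for the requested DRA device
```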
Troubleshooting
Workload not admitted
If the Workload stays in Pending state:
- Verify that the ClusterQueue has quota for the DRA resource and that it is not fully consumed by other workloads.
- Run `kubectl -n default describe workload <workload-name>` and look at the Events section for admission rejection reasons.
Double counting (extended resource path)
If quota usage shows double the expected value (for example, 2 instead of 1 for a single GPU), the DRAExtendedResources feature gate may not be enabled. Ask your administrator to verify the DRA setup.
Missing DeviceClass
For the extended resource path, the DeviceClass must exist before you submit
your workload. If it was created after your workload was rejected, the workload
may not be re-evaluated until another cluster event triggers requeuing.
Delete and re-create the workload to force re-evaluation.
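The delete-and-re-create step can look like the following, assuming the Job was created with generateName as in the examples above. The Job name shown is a placeholder; substitute the generated name from your cluster:

```shell
# Delete the rejected Job (this also removes its Workload)
kubectl -n default delete job <generated-job-name>

# Re-create it so Kueue evaluates it against the now-existing DeviceClass
kubectl create -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-extended-resource-job.yaml
```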
For general troubleshooting, see the troubleshooting guide.