Run Workloads With DRA Devices

Run workloads that request hardware devices managed by Kubernetes Dynamic Resource Allocation (DRA) with Kueue quota management.

This page shows you how to run workloads that request hardware devices (such as GPUs) managed by Dynamic Resource Allocation (DRA) in a Kubernetes cluster with Kueue enabled. The examples use a batch Job, but the same approach works with any workload type that Kueue supports.

The intended audience for this page is batch users.

For conceptual details about how Kueue handles DRA resources, see Dynamic Resource Allocation concepts.

Before you begin

Make sure the following conditions are met:

  • A Kubernetes cluster is running with Dynamic Resource Allocation enabled and at least one DRA driver that publishes the devices you want to use.
  • Kueue is installed, and your cluster administrator has configured it to manage DRA resources.
  • Your cluster administrator has created a ClusterQueue with quota for the DRA devices and a LocalQueue in your namespace.

0. Identify the queues available in your namespace

Run the following command to list the LocalQueues available in your namespace.

kubectl -n default get localqueues

The output is similar to the following:

NAME         CLUSTERQUEUE    PENDING WORKLOADS
user-queue   cluster-queue   0

The ClusterQueue defines the quotas available to the LocalQueue.
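To see what quota is actually available, you can inspect the ClusterQueue backing your LocalQueue. Here `cluster-queue` is the name shown in the CLUSTERQUEUE column above; substitute the name from your own output:

```shell
# Show the resource flavors and nominal quotas configured for the queue,
# including any quota covering DRA device classes or extended resources.
kubectl describe clusterqueue cluster-queue
```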

1. Define the workload

Running a workload with DRA devices is similar to running a regular Job. You must set the kueue.x-k8s.io/queue-name label to select the LocalQueue you want to submit the workload to.

There are two ways to request DRA devices, depending on how your administrator has configured the cluster. Choose the approach that matches your setup.

Using a ResourceClaimTemplate

Use this approach when you need to explicitly describe the device you want. Create a ResourceClaimTemplate and reference it from the workload:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: default
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-dra-rct-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          claims:
          - name: gpu
          requests:
            cpu: "1"
            memory: "200Mi"
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
      restartPolicy: Never
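After applying the manifest above, you can confirm that the template exists and that the DeviceClass it references (`gpu.example.com` in this example, a placeholder name) is installed in the cluster:

```shell
# The ResourceClaimTemplate is namespaced; DeviceClasses are cluster-scoped.
kubectl -n default get resourceclaimtemplates single-gpu
kubectl get deviceclasses
```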

Using extended resources

Use this approach when a DeviceClass with spec.extendedResourceName exists in the cluster. You request devices using the standard resources.requests syntax, just like CPU or memory. No ResourceClaimTemplate is needed:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-dra-extended-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          requests:
            cpu: "1"
            memory: "200Mi"
            example.com/gpu: "1"
          limits:
            example.com/gpu: "1"
      restartPolicy: Never
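For reference, the extended resource path relies on a DeviceClass that maps a DRA device to an extended resource name. The following is a hypothetical sketch; the class name, selector expression, and extended resource name are placeholders that your administrator would set:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
  # Makes devices of this class requestable as the extended
  # resource "example.com/gpu" in resources.requests.
  extendedResourceName: example.com/gpu
```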

If you are not sure which approach to use, ask your administrator.

2. Run the workload

You can run the workload with one of the following commands, depending on which approach you chose.

For a ResourceClaimTemplate-based workload:

kubectl create -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-rct-job.yaml

For an extended resource-based workload:

kubectl create -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-extended-resource-job.yaml

Internally, Kueue will create a corresponding Workload for this Job.
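Once the Job is admitted and its pods are created, you can check the pods and, for the ResourceClaimTemplate approach, the per-pod ResourceClaims that Kubernetes generates from the template:

```shell
kubectl -n default get pods
# Shows each generated claim and whether a device has been allocated to it.
kubectl -n default get resourceclaims
```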

3. (Optional) Monitor the status of the workload

You can see the Workload status with the following command:

kubectl -n default get workloads.kueue.x-k8s.io

To check whether the workload was admitted and see the DRA resource accounting:

kubectl -n default describe workload <workload-name>

Look at the Conditions section for admission status and the Events section for details. If the workload was admitted, you can verify the resources charged for quota in the status.admission.podSetAssignments[].resourceUsage field:

kubectl -n default get workloads.kueue.x-k8s.io <workload-name> -o yaml
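For example, to print just the charged usage, you can use a JSONPath query (replace `<workload-name>` with the name from the listing above):

```shell
kubectl -n default get workloads.kueue.x-k8s.io <workload-name> \
  -o jsonpath='{.status.admission.podSetAssignments[*].resourceUsage}'
```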

Troubleshooting

Workload not admitted

If the Workload stays in Pending state:

  • Verify the ClusterQueue has quota for the DRA resource and it is not fully consumed by other workloads.
  • Run kubectl -n default describe workload <workload-name> and look at the Events section for admission rejection reasons.

Double counting (extended resource path)

If quota usage shows double the expected value (e.g., 2 instead of 1 for a single GPU), the DRAExtendedResources feature gate may not be enabled. Ask your administrator to verify the DRA setup.

Missing DeviceClass

For the extended resource path, the DeviceClass must exist before you submit your workload. If it was created after your workload was rejected, the workload may not be re-evaluated until another cluster event triggers requeuing. Delete and re-create the workload to force re-evaluation.

For general troubleshooting, see the troubleshooting guide.