Troubleshooting Provisioning Request in Kueue

Troubleshooting the status of a Provisioning Request in Kueue

This document helps you troubleshoot ProvisioningRequests, an API defined by ClusterAutoscaler.

Kueue creates ProvisioningRequests via the Provisioning Admission Check Controller, and treats them like an Admission Check. In order for Kueue to admit a Workload, the ProvisioningRequest created for it needs to succeed.

Before you begin

Before you begin troubleshooting, make sure your cluster meets the following requirements:

  • Your cluster has ClusterAutoscaler enabled, and ClusterAutoscaler supports the ProvisioningRequest API. Check your cloud provider’s documentation to determine the minimum versions that support ProvisioningRequest. If you use GKE, your cluster should be running version 1.28.3-gke.1098000 or newer.
  • You use a node type that supports ProvisioningRequest. Supported node types may vary depending on your cloud provider.
  • Kueue’s version is v0.5.3 or newer.
  • You have enabled the ProvisioningACC in the feature gates configuration. This feature gate is enabled by default for Kueue v0.7.0 or newer.

Identifying the Provisioning Request for your job

See the Troubleshooting Jobs guide to learn how to identify the Workload for your job.

You can run the following command to see a brief summary of the state of a Provisioning Request (and other Admission Checks) in the admissionChecks field of the Workload’s status.

kubectl describe workload WORKLOAD_NAME

Kueue creates ProvisioningRequests using a naming pattern that helps you identify the request corresponding to your workload.

[NAME OF YOUR WORKLOAD]-[NAME OF THE ADMISSION CHECK]-[NUMBER OF RETRY]

For example:

sample-job-2zcsb-57864-sample-admissioncheck-1
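If you know the Workload name, the Admission Check name, and the retry number, you can assemble the expected ProvisioningRequest name yourself. The following is a minimal sketch of the naming pattern above; the function name and its arguments are hypothetical, and the Workload name sample-job-2zcsb-57864 is inferred from the example rather than taken from a real cluster.

```shell
# Build the expected ProvisioningRequest name from its three parts:
#   $1 = Workload name, $2 = Admission Check name, $3 = retry number.
# The helper name is illustrative and not part of Kueue.
provisioning_request_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Example:
#   provisioning_request_name sample-job-2zcsb-57864 sample-admissioncheck 1
```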

When nodes for your job are provisioned, Kueue also adds the annotation cluster-autoscaler.kubernetes.io/consume-provisioning-request to the .admissionChecks[*].podSetUpdates[*] field in the Workload’s status. The value of this annotation is the Provisioning Request’s name.

The output of the kubectl describe workload command should look similar to the following:

[...]
Status:
  Admission Checks:
    Last Transition Time:  2024-05-22T10:47:46Z
    Message:               Provisioning Request was successfully provisioned.
    Name:                  sample-admissioncheck
    Pod Set Updates:
      Annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request:  sample-job-2zcsb-57864-sample-admissioncheck-1
        cluster-autoscaler.kubernetes.io/provisioning-class-name:       queued-provisioning.gke.io
      Name:                                                             main
    State:                                                              Ready
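To pull only the Provisioning Request’s name out of that output, you could filter for the annotation key. This is a small sketch that assumes the describe output has the layout shown above (key and value on one line, separated by whitespace); the helper name is hypothetical.

```shell
# Read `kubectl describe workload` output on stdin and print the value of the
# consume-provisioning-request annotation. Prints nothing if it is absent.
provisioning_request_from_describe() {
  awk '$1 == "cluster-autoscaler.kubernetes.io/consume-provisioning-request:" { print $2 }'
}

# Usage:
#   kubectl describe workload WORKLOAD_NAME | provisioning_request_from_describe
```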

What is the current state of my Provisioning Request?

One possible reason your job is not running is that its ProvisioningRequest is still waiting to be provisioned. To find out if this is the case, view the Provisioning Request’s state by running the following command:

kubectl get provisioningrequest PROVISIONING_REQUEST_NAME

If this is the case, the output should look similar to the following:

NAME                                                 ACCEPTED   PROVISIONED   FAILED   AGE
sample-job-2zcsb-57864-sample-admissioncheck-1       True       False         False    20s
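When you list several requests at once, a one-word summary per request can be easier to scan. The sketch below assumes the column order shown above (NAME, ACCEPTED, PROVISIONED, FAILED, AGE) and that all three condition columns are populated; if a column is empty, the fields shift and the result is unreliable. The helper name and the summary words are hypothetical.

```shell
# Read `kubectl get provisioningrequest` output on stdin (header included) and
# print "<name> <summary>" for each request.
summarize_provisioning_requests() {
  awk 'NR > 1 {
    state = "not-accepted"                   # ACCEPTED is not True yet
    if ($2 == "True") state = "waiting"      # accepted, nodes not provisioned yet
    if ($3 == "True") state = "provisioned"  # PROVISIONED is True
    if ($4 == "True") state = "failed"       # FAILED takes precedence
    print $1, state
  }'
}

# Usage:
#   kubectl get provisioningrequest | summarize_provisioning_requests
```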

You can also view a more detailed status of your ProvisioningRequest by running the following command:

kubectl describe provisioningrequest PROVISIONING_REQUEST_NAME

If your ProvisioningRequest fails to provision nodes, the error output may look similar to the following:

[...]
Status:
  Conditions:
    Last Transition Time:  2024-05-22T13:04:54Z
    Message:               Provisioning Request wasn't accepted.
    Observed Generation:   1
    Reason:                NotAccepted
    Status:                False
    Type:                  Accepted
    Last Transition Time:  2024-05-22T13:04:54Z
    Message:               Provisioning Request wasn't provisioned.
    Observed Generation:   1
    Reason:                NotProvisioned
    Status:                False
    Type:                  Provisioned
    Last Transition Time:  2024-05-22T13:06:49Z
    Message:               max cluster limit reached, nodepools out of resources: default-nodepool (cpu, memory)
    Observed Generation:   1
    Reason:                OutOfResources
    Status:                True
    Type:                  Failed

Note that the Reason and Message values for the Failed condition in your output may differ from this example, depending on what prevented the provisioning.
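If you only care about one condition, you can filter the describe output down to it. The sketch below assumes the layout shown above, where each condition’s Message, Reason, and Status lines come before its Type line; the helper name and output format are hypothetical.

```shell
# Read `kubectl describe` output on stdin and print the Status, Reason and
# Message of the condition whose Type equals the first argument.
# Relies on Type being the last line printed for each condition.
condition_summary() {
  awk -v want="$1" '
    $1 == "Message:" { sub(/^[ \t]*Message:[ \t]*/, ""); message = $0; next }
    $1 == "Reason:"  { reason = $2; next }
    $1 == "Status:"  { status = $2; next }
    $1 == "Type:" && $2 == want {
      printf "%s=%s reason=%s message=%s\n", want, status, reason, message
    }
  '
}

# Usage:
#   kubectl describe provisioningrequest PROVISIONING_REQUEST_NAME | condition_summary Failed
```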

The Provisioning Request state is described in the .conditions[*].status field. An empty field means the ProvisioningRequest is still being processed by ClusterAutoscaler. Otherwise, it falls into one of the states listed below:

  • Accepted - indicates that the ProvisioningRequest was accepted by ClusterAutoscaler, so ClusterAutoscaler will attempt to provision the nodes for it.
  • Provisioned - indicates that all of the requested resources were created and are available in the cluster. ClusterAutoscaler will set this condition when the VM creation finishes successfully.
  • Failed - indicates that it is impossible to obtain resources to fulfill this ProvisioningRequest. Condition Reason and Message will contain more details about what failed.
  • BookingExpired - indicates that the ProvisioningRequest previously had the Provisioned condition and its capacity reservation time has expired.
  • CapacityRevoked - indicates that the requested resources are no longer valid.

The state transitions are as follows:

[Figure: Provisioning Request’s states]

Why is a Provisioning Request not created?

If Kueue did not create a Provisioning Request for your job, try checking the following requirements:

a. Ensure Kueue’s controller manager enables the ProvisioningACC feature gate

Run the following command to check whether your Kueue’s controller manager has enabled the ProvisioningACC feature gate:

kubectl describe pod -n kueue-system kueue-controller-manager-

The arguments for the Kueue container should be similar to the following:

    ...
    Args:
      --config=/controller_manager_config.yaml
      --zap-log-level=2
      --feature-gates=ProvisioningACC=true

Note that for Kueue v0.7.0 or newer, the feature gate is enabled by default, so you may see different output.
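To avoid reading the whole pod description, you could filter it for the feature gate setting. This is a rough sketch: the helper name is hypothetical, it only recognizes the lowercase values true and false, and a result of "not set explicitly" does not mean the gate is off, because the default depends on your Kueue version.

```shell
# Read `kubectl describe pod` output on stdin and report how the
# ProvisioningACC feature gate is set in the container arguments.
provisioning_acc_gate() {
  # Extract the value of the first --feature-gates=... argument, if any.
  gates="$(awk 'match($0, /--feature-gates=[^ ]*/) {
    print substr($0, RSTART + 16, RLENGTH - 16); exit
  }')"
  # Wrap in commas so each gate can be matched as a whole list element.
  case ",${gates}," in
    *,ProvisioningACC=true,*)  echo "enabled" ;;
    *,ProvisioningACC=false,*) echo "disabled" ;;
    *)                         echo "not set explicitly" ;;
  esac
}

# Usage: pipe the output of the describe pod command above into the function.
```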

b. Ensure your Workload has reserved quota

To check if your Workload has reserved quota in a ClusterQueue, check your Workload’s status by running the following command:

kubectl describe workload WORKLOAD_NAME

The output should be similar to the following:

[...]
Status:
  Conditions:
    Last Transition Time:  2024-05-22T10:26:40Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved

Your output may instead look similar to the following:

  Conditions:
    Last Transition Time:  2024-05-22T08:48:47Z
    Message:               couldn't assign flavors to pod set main: insufficient unused quota for memory in flavor default-flavor, 4396Mi more needed
    Observed Generation:   1
    Reason:                Pending
    Status:                False
    Type:                  QuotaReserved

This means you do not have sufficient free quota in your ClusterQueue.

Other reasons why your Workload has not reserved quota may relate to LocalQueue or ClusterQueue misconfiguration, for example:

Status:
  Conditions:
    Last Transition Time:  2024-05-22T08:57:09Z
    Message:               ClusterQueue cluster-queue doesn't exist
    Observed Generation:   1
    Reason:                Inadmissible
    Status:                False
    Type:                  QuotaReserved

You can check if ClusterQueues and LocalQueues are ready to admit your Workloads. See the Troubleshooting Queues guide for more details.
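If you prefer a one-line answer over the full describe output, kubectl’s JSONPath output can select the QuotaReserved condition directly. The sketch below wraps that in a hypothetical helper; it assumes the Workload is in your current namespace and that the condition fields are exposed as status, reason, and type, matching the describe output above.

```shell
# Print the status and reason of the QuotaReserved condition, tab-separated.
# $1 is the Workload name; add -n NAMESPACE to the kubectl call if needed.
quota_reserved_condition() {
  kubectl get workload "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="QuotaReserved")].status}{"\t"}{.status.conditions[?(@.type=="QuotaReserved")].reason}'
}

# Usage:
#   quota_reserved_condition WORKLOAD_NAME
```

A status of True means quota is reserved; for any other result, read the condition’s Message in the describe output for the cause.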

c. Ensure the Admission Check is active

To check if the Admission Check that your job uses is active, run the following command:

kubectl describe admissionchecks ADMISSIONCHECK_NAME

Here, ADMISSIONCHECK_NAME is the name of the Admission Check configured in your ClusterQueue spec. See the Admission Check documentation for more details.

The status of the Admission Check should be similar to:

...
Status:
  Conditions:
    Last Transition Time:  2024-03-08T11:44:53Z
    Message:               The admission check is active
    Reason:                Active
    Status:                True
    Type:                  Active

If none of the above steps resolves your problem, contact us in the wg-batch Slack channel.