Troubleshooting Provisioning Request in Kueue
This document helps you troubleshoot ProvisioningRequests, an API defined by ClusterAutoscaler.
Kueue creates ProvisioningRequests via the Provisioning Admission Check Controller and treats them as Admission Checks. For Kueue to admit a Workload, the ProvisioningRequest created for it must succeed.
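For reference, the wiring between an AdmissionCheck and the provisioning controller looks roughly like the sketch below. The names sample-admissioncheck and sample-prov-config are placeholders, and the provisioning class name depends on your provider (the GKE class used elsewhere in this guide appears here only as an example):
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: sample-admissioncheck        # placeholder name
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: sample-prov-config         # placeholder name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: sample-prov-config
spec:
  # Example class name; other providers use different provisioning classes.
  provisioningClassName: queued-provisioning.gke.io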
Before you begin
Before you begin troubleshooting, make sure your cluster meets the following requirements:
- Your cluster has ClusterAutoscaler enabled, and the ClusterAutoscaler supports the ProvisioningRequest API. Check your cloud provider’s documentation to determine the minimum versions that support ProvisioningRequest. If you use GKE, your cluster should be running version 1.28.3-gke.1098000 or newer (a quick check is sketched after this list).
- You use a type of nodes that supports ProvisioningRequest. This may vary depending on your cloud provider.
- Kueue’s version is v0.5.3 or newer.
- You have enabled the ProvisioningACC feature gate in the feature gates configuration. This feature gate is enabled by default for Kueue v0.7.0 or newer.
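A quick way to confirm that your cluster actually exposes the ProvisioningRequest API is to look it up in the API discovery data. The exact group and version shown depend on the ClusterAutoscaler you run, so treat this as a sketch:
kubectl api-resources | grep -i provisioningrequest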
Identifying the Provisioning Request for your job
See the Troubleshooting Jobs guide to learn how to identify the Workload for your job.
You can run the following command to see a brief state of a Provisioning Request (and other Admission Checks) in the admissionChecks field of the Workload’s status.
kubectl describe workload WORKLOAD_NAME
Kueue creates ProvisioningRequests using a naming pattern that helps you identify the request corresponding to your workload.
[NAME OF YOUR WORKLOAD]-[NAME OF THE ADMISSION CHECK]-[NUMBER OF RETRY]
For example:
sample-job-2zcsb-57864-sample-admissioncheck-1
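If you are not sure of the exact name, you can list the ProvisioningRequests in the Workload's namespace and filter by the Workload name; a sketch, assuming your current namespace:
kubectl get provisioningrequests | grep WORKLOAD_NAME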
When nodes for your job are provisioned, Kueue also adds the annotation cluster-autoscaler.kubernetes.io/consume-provisioning-request to the .admissionChecks[*].podSetUpdates[*] field in the Workload’s status. The value of this annotation is the ProvisioningRequest’s name.
The output of the kubectl describe workload command should look similar to the following:
[...]
Status:
  Admission Checks:
    Last Transition Time:  2024-05-22T10:47:46Z
    Message:               Provisioning Request was successfully provisioned.
    Name:                  sample-admissioncheck
    Pod Set Updates:
      Annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request:  sample-job-2zcsb-57864-sample-admissioncheck-1
        cluster-autoscaler.kubernetes.io/provisioning-class-name:       queued-provisioning.gke.io
      Name:                main
    State:                 Ready
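To print just the Admission Check states without the rest of the describe output, a jsonpath query along these lines should work (field names as in the status above):
kubectl get workload WORKLOAD_NAME \
  -o jsonpath='{range .status.admissionChecks[*]}{.name}{": "}{.state}{"\n"}{end}'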
What is the current state of my Provisioning Request?
One possible reason your job is not running is that its ProvisioningRequest is still waiting to be provisioned. To find out if this is the case, view the ProvisioningRequest’s state by running the following command:
kubectl get provisioningrequest PROVISIONING_REQUEST_NAME
If this is the case, the output should look similar to the following:
NAME ACCEPTED PROVISIONED FAILED AGE
sample-job-2zcsb-57864-sample-admissioncheck-1 True False False 20s
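If the request is still pending, you can watch it until one of the conditions changes:
kubectl get provisioningrequest PROVISIONING_REQUEST_NAME -w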
You can also view more detailed status of your ProvisioningRequest by running the following command:
kubectl describe provisioningrequest PROVISIONING_REQUEST_NAME
If your ProvisioningRequest fails to provision nodes, the error output may look similar to the following:
[...]
Status:
  Conditions:
    Last Transition Time:  2024-05-22T13:04:54Z
    Message:               Provisioning Request wasn't accepted.
    Observed Generation:   1
    Reason:                NotAccepted
    Status:                False
    Type:                  Accepted
    Last Transition Time:  2024-05-22T13:04:54Z
    Message:               Provisioning Request wasn't provisioned.
    Observed Generation:   1
    Reason:                NotProvisioned
    Status:                False
    Type:                  Provisioned
    Last Transition Time:  2024-05-22T13:06:49Z
    Message:               max cluster limit reached, nodepools out of resources: default-nodepool (cpu, memory)
    Observed Generation:   1
    Reason:                OutOfResources
    Status:                True
    Type:                  Failed
Note that the Reason and Message values for the Failed condition may differ from your output, depending on the reason that prevented the provisioning.
The ProvisioningRequest state is described in the .conditions[*].status field. An empty field means the ProvisioningRequest is still being processed by the ClusterAutoscaler. Otherwise, it falls into one of the states listed below:
- Accepted - indicates that the ProvisioningRequest was accepted by ClusterAutoscaler, so ClusterAutoscaler will attempt to provision the nodes for it.
- Provisioned - indicates that all of the requested resources were created and are available in the cluster. ClusterAutoscaler sets this condition when the VM creation finishes successfully.
- Failed - indicates that it is impossible to obtain resources to fulfill this ProvisioningRequest. The condition's Reason and Message contain more details about what failed.
- BookingExpired - indicates that the ProvisioningRequest previously had the Provisioned condition and the capacity reservation time has expired.
- CapacityRevoked - indicates that the requested resources are no longer valid.
The state transitions are as follows: a ProvisioningRequest is first Accepted, then either Provisioned or Failed; a Provisioned request may later become BookingExpired or CapacityRevoked.
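To print only a compact summary of these conditions instead of the full describe output, a jsonpath query like the following should work:
kubectl get provisioningrequest PROVISIONING_REQUEST_NAME \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}): {.message}{"\n"}{end}'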
Why is a Provisioning Request not created?
If Kueue did not create a Provisioning Request for your job, try checking the following requirements:
a. Ensure Kueue’s controller manager enables the ProvisioningACC feature gate
Run the following command to check whether your Kueue controller manager has enabled the ProvisioningACC feature gate:
kubectl describe pod -n kueue-system kueue-controller-manager-
The arguments for the Kueue container should be similar to the following:
...
Args:
--config=/controller_manager_config.yaml
--zap-log-level=2
--feature-gates=ProvisioningACC=true
Note that for Kueue v0.7.0 or newer the feature is enabled by default, so you may see different output.
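Alternatively, you can read the args straight from the Deployment instead of the Pod; a sketch, assuming the default installation in the kueue-system namespace:
kubectl get deployment -n kueue-system kueue-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].args}'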
b. Ensure your Workload has reserved quota
To verify that your Workload has reserved quota in a ClusterQueue, check the Workload’s status by running the following command:
kubectl describe workload WORKLOAD_NAME
The output should be similar to the following:
[...]
Status:
  Conditions:
    Last Transition Time:  2024-05-22T10:26:40Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
If the output you get is similar to the following:
Conditions:
  Last Transition Time:  2024-05-22T08:48:47Z
  Message:               couldn't assign flavors to pod set main: insufficient unused quota for memory in flavor default-flavor, 4396Mi more needed
  Observed Generation:   1
  Reason:                Pending
  Status:                False
  Type:                  QuotaReserved
This means you do not have sufficient free quota in your ClusterQueue.
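To see how much of the ClusterQueue’s quota is already in use, you can describe the ClusterQueue and look at the usage reported in its status:
kubectl describe clusterqueue CLUSTERQUEUE_NAME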
Other reasons why your Workload has not reserved quota may relate to LocalQueue/ClusterQueue misconfiguration, e.g.:
Status:
  Conditions:
    Last Transition Time:  2024-05-22T08:57:09Z
    Message:               ClusterQueue cluster-queue doesn't exist
    Observed Generation:   1
    Reason:                Inadmissible
    Status:                False
    Type:                  QuotaReserved
You can check whether your ClusterQueues and LocalQueues are ready to admit your Workloads. See the Troubleshooting Queues guide for more details.
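As a first check, you can list the queues and confirm that the ClusterQueue referenced by your LocalQueue actually exists:
kubectl get clusterqueues
kubectl get localqueues -n NAMESPACE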
c. Ensure the Admission Check is active
To check if the Admission Check that your job uses is active, run the following command:
kubectl describe admissionchecks ADMISSIONCHECK_NAME
Where ADMISSIONCHECK_NAME is the name configured in your ClusterQueue spec. See the Admission Check documentation for more details.
The status of the Admission Check should be similar to:
...
Status:
  Conditions:
    Last Transition Time:  2024-03-08T11:44:53Z
    Message:               The admission check is active
    Reason:                Active
    Status:                True
    Type:                  Active
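To check the Active condition across all Admission Checks at once, a jsonpath one-liner like this should work:
kubectl get admissionchecks \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="Active")].status}{"\n"}{end}'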
If none of the above steps resolves your problem, contact us on the wg-batch Slack channel.