Prometheus Metrics

Prometheus metrics exported by Kueue

Kueue exposes Prometheus metrics to monitor the health of the system and the status of ClusterQueues and LocalQueues.

Kueue health

Use the following metrics to monitor the health of the Kueue controllers:

kueue_admission_attempt_duration_seconds
Type: Histogram
Description: The latency of an admission attempt.
The label ‘result’ can have the following values:
- ‘success’ means that at least one workload was admitted.
- ‘inadmissible’ means that no workload was admitted.
Labels:
  result: possible values are success or inadmissible

kueue_admission_attempts_total
Type: Counter
Description: The total number of attempts to admit workloads.
Each admission attempt might try to admit more than one workload.
The label ‘result’ can have the following values:
- ‘success’ means that at least one workload was admitted.
- ‘inadmissible’ means that no workload was admitted.
Labels:
  result: possible values are success or inadmissible
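
As a rough illustration of how the ‘result’ label might be used, the sketch below reads kueue_admission_attempts_total samples in the Prometheus text exposition format and computes the fraction of attempts that admitted at least one workload. The sample values are made up, and the helper names (attempts_by_result, success_fraction) are hypothetical; a real scrape may contain additional lines that this simple pattern ignores.

```python
import re

# Made-up sample in the Prometheus text exposition format (illustrative values only).
SAMPLE = """\
kueue_admission_attempts_total{result="success"} 120
kueue_admission_attempts_total{result="inadmissible"} 30
"""

# Matches only the simple single-label form shown in SAMPLE.
LINE = re.compile(r'^kueue_admission_attempts_total\{result="(\w+)"\}\s+(\S+)$')


def attempts_by_result(text):
    """Return {result: count} for kueue_admission_attempts_total samples."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            result, value = match.group(1), float(match.group(2))
            counts[result] = counts.get(result, 0.0) + value
    return counts


def success_fraction(counts):
    """Fraction of attempts with result=success, or None if there were no attempts."""
    total = sum(counts.values())
    return counts.get("success", 0.0) / total if total else None


counts = attempts_by_result(SAMPLE)
print(success_fraction(counts))  # prints 0.8
```

Because the metric is a counter, the values are cumulative since the process started; for alerting it is usually more useful to compare the increase over a time window than the raw totals.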

ClusterQueue status

Use the following metrics to monitor the status of your ClusterQueues:

kueue_admission_checks_wait_time_seconds
Type: Histogram
Description: The time from when a workload got the quota reservation until admission, per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name

kueue_admission_cycle_preemption_skips
Type: Gauge
Description: The number of Workloads in the ClusterQueue that got preemption candidates but had to be skipped because other ClusterQueues needed the same resources in the same cycle.
Labels:
  cluster_queue: the name of the ClusterQueue

kueue_admission_wait_time_seconds
Type: Histogram
Description: The time from when a workload was created or requeued until admission, per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name

kueue_admitted_active_workloads
Type: Gauge
Description: The number of admitted Workloads that are active (unsuspended and not finished), per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue

kueue_admitted_workloads_total
Type: Counter
Description: The total number of admitted workloads per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name

kueue_build_info
Type: Gauge
Description: Kueue build information. The value is a constant 1, labeled by git version, git commit, build date, Go version, compiler, and platform.
Labels:
  git_version: git version
  git_commit: git commit
  build_date: build date
  go_version: go version
  compiler: compiler
  platform: platform

kueue_cluster_queue_status
Type: Gauge
Description: Reports ‘cluster_queue’ with its ‘status’ (with possible values ‘pending’, ‘active’ or ‘terminated’).
For a ClusterQueue, the metric only reports a value of 1 for one of the statuses.
Labels:
  cluster_queue: the name of the ClusterQueue
  status: one of pending, active, or terminated

kueue_evicted_workloads_once_total
Type: Counter
Description: The number of unique workload evictions per ‘cluster_queue’.
The label ‘reason’ can have the following values:
- “Preempted” means that the workload was evicted in order to free resources for a workload with a higher priority or reclamation of nominal quota.
- “PodsReadyTimeout” means that the eviction took place due to a PodsReady timeout.
- “AdmissionCheck” means that the workload was evicted because at least one admission check transitioned to False.
- “ClusterQueueStopped” means that the workload was evicted because the ClusterQueue is stopped.
- “LocalQueueStopped” means that the workload was evicted because the LocalQueue is stopped.
- “NodeFailures” means that the workload was evicted due to node failures when using TopologyAwareScheduling.
- “Deactivated” means that the workload was evicted because spec.active is set to false.
The label ‘detailed_reason’ can have the following values:
- "" means that the value in the ‘reason’ label is the root cause for eviction.
- “WaitForStart” means that the pods have not been ready since admission, or the workload is not admitted.
- “WaitForRecovery” means that the Pods were ready since the workload admission, but some pod has failed.
- “AdmissionCheck” means that the workload was evicted by Kueue due to a rejected admission check.
- “MaximumExecutionTimeExceeded” means that the workload was evicted by Kueue due to maximum execution time exceeded.
- “RequeuingLimitExceeded” means that the workload was evicted by Kueue due to requeuing limit exceeded.
Labels:
  cluster_queue: the name of the ClusterQueue
  reason: eviction or preemption reason
  detailed_reason: finer-grained eviction cause
  priority_class: the priority class name

kueue_evicted_workloads_total
Type: Counter
Description: The number of evicted workloads per ‘cluster_queue’.
The label ‘reason’ can have the following values:
- “Preempted” means that the workload was evicted in order to free resources for a workload with a higher priority or reclamation of nominal quota.
- “PodsReadyTimeout” means that the eviction took place due to a PodsReady timeout.
- “AdmissionCheck” means that the workload was evicted because at least one admission check transitioned to False.
- “ClusterQueueStopped” means that the workload was evicted because the ClusterQueue is stopped.
- “LocalQueueStopped” means that the workload was evicted because the LocalQueue is stopped.
- “NodeFailures” means that the workload was evicted due to node failures when using TopologyAwareScheduling.
- “Deactivated” means that the workload was evicted because spec.active is set to false.
The label ‘underlying_cause’ can have the following values:
- "" means that the value in the ‘reason’ label is the root cause for eviction.
- “AdmissionCheck” means that the workload was evicted by Kueue due to a rejected admission check.
- “MaximumExecutionTimeExceeded” means that the workload was evicted by Kueue due to maximum execution time exceeded.
- “RequeuingLimitExceeded” means that the workload was evicted by Kueue due to requeuing limit exceeded.
Labels:
  cluster_queue: the name of the ClusterQueue
  reason: eviction or preemption reason
  underlying_cause: root cause for eviction
  priority_class: the priority class name

kueue_pending_workloads
Type: Gauge
Description: The number of pending workloads, per ‘cluster_queue’ and ‘status’.
‘status’ can have the following values:
- “active” means that the workloads are in the admission queue.
- “inadmissible” means there was a failed admission attempt for these workloads and they won’t be retried until cluster conditions change in a way that could make them admissible.
Labels:
  cluster_queue: the name of the ClusterQueue
  status: one of active or inadmissible

kueue_pods_ready_to_evicted_time_seconds
Type: Histogram
Description: The number of seconds between a workload’s pods becoming ready and the workload’s eviction, per ‘cluster_queue’.
The label ‘reason’ can have the following values:
- “Preempted” means that the workload was evicted in order to free resources for a workload with a higher priority or reclamation of nominal quota.
- “PodsReadyTimeout” means that the eviction took place due to a PodsReady timeout.
- “AdmissionCheck” means that the workload was evicted because at least one admission check transitioned to False.
- “ClusterQueueStopped” means that the workload was evicted because the ClusterQueue is stopped.
- “LocalQueueStopped” means that the workload was evicted because the LocalQueue is stopped.
- “NodeFailures” means that the workload was evicted due to node failures when using TopologyAwareScheduling.
- “Deactivated” means that the workload was evicted because spec.active is set to false.
The label ‘underlying_cause’ can have the following values:
- "" means that the value in the ‘reason’ label is the root cause for eviction.
- “AdmissionCheck” means that the workload was evicted by Kueue due to a rejected admission check.
- “MaximumExecutionTimeExceeded” means that the workload was evicted by Kueue due to maximum execution time exceeded.
- “RequeuingLimitExceeded” means that the workload was evicted by Kueue due to requeuing limit exceeded.
Labels:
  cluster_queue: the name of the ClusterQueue
  reason: eviction or preemption reason
  underlying_cause: root cause for eviction

kueue_preempted_workloads_total
Type: Counter
Description: The number of preempted workloads per ‘preempting_cluster_queue’.
The label ‘reason’ can have the following values:
- “InClusterQueue” means that the workload was preempted by a workload in the same ClusterQueue.
- “InCohortReclamation” means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota.
- “InCohortFairSharing” means that the workload was preempted by a workload in the same cohort due to Fair Sharing.
- “InCohortReclaimWhileBorrowing” means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota while borrowing.
Labels:
  preempting_cluster_queue: the ClusterQueue executing preemption
  reason: eviction or preemption reason

kueue_quota_reserved_wait_time_seconds
Type: Histogram
Description: The time from when a workload was created or requeued until it got quota reservation, per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name

kueue_quota_reserved_workloads_total
Type: Counter
Description: The total number of quota reserved workloads per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name

kueue_replaced_workload_slices_total
Type: Counter
Description: The number of replaced workload slices per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue

kueue_reserving_active_workloads
Type: Gauge
Description: The number of Workloads that are reserving quota, per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
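
Several of the metrics above, such as kueue_admission_wait_time_seconds, are histograms, so a typical question is "how long do most workloads wait?". The sketch below estimates a quantile from cumulative histogram buckets using linear interpolation, similar in spirit to PromQL's histogram_quantile. The bucket boundaries and counts are made up for illustration and are not Kueue's actual bucket layout; estimate_quantile is a hypothetical helper.

```python
import math

# Made-up cumulative buckets: (upper bound in seconds, workloads at or below that bound).
SAMPLE_BUCKETS = [(1.0, 10), (5.0, 40), (10.0, 80), (math.inf, 100)]


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


print(estimate_quantile(0.5, SAMPLE_BUCKETS))  # prints 6.25
```

The estimate is only as precise as the bucket boundaries allow, so a result that lands in the last finite bucket or the +Inf bucket should be read as a lower bound.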

LocalQueue Status (alpha)

The following metrics are available only if the LocalQueueMetrics feature gate is enabled. See the “Change the feature gates configuration” section of the Installation guide for details.

kueue_local_queue_admission_checks_wait_time_seconds
Type: Histogram
Description: The time from when a workload got the quota reservation until admission, per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_local_queue_admission_wait_time_seconds
Type: Histogram
Description: The time from when a workload was created or requeued until admission, per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_local_queue_admitted_active_workloads
Type: Gauge
Description: The number of admitted Workloads that are active (unsuspended and not finished), per ‘localQueue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue

kueue_local_queue_admitted_workloads_total
Type: Counter
Description: The total number of admitted workloads per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_local_queue_evicted_workloads_total
Type: Counter
Description: The number of evicted workloads per ‘local_queue’.
The label ‘reason’ can have the following values:
- “Preempted” means that the workload was evicted in order to free resources for a workload with a higher priority or reclamation of nominal quota.
- “PodsReadyTimeout” means that the eviction took place due to a PodsReady timeout.
- “AdmissionCheck” means that the workload was evicted because at least one admission check transitioned to False.
- “ClusterQueueStopped” means that the workload was evicted because the ClusterQueue is stopped.
- “LocalQueueStopped” means that the workload was evicted because the LocalQueue is stopped.
- “NodeFailures” means that the workload was evicted due to node failures when using TopologyAwareScheduling.
- “Deactivated” means that the workload was evicted because spec.active is set to false.
The label ‘underlying_cause’ can have the following values:
- "" means that the value in the ‘reason’ label is the root cause for eviction.
- “AdmissionCheck” means that the workload was evicted by Kueue due to a rejected admission check.
- “MaximumExecutionTimeExceeded” means that the workload was evicted by Kueue due to maximum execution time exceeded.
- “RequeuingLimitExceeded” means that the workload was evicted by Kueue due to requeuing limit exceeded.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  reason: eviction or preemption reason
  underlying_cause: root cause for eviction
  priority_class: the priority class name

kueue_local_queue_pending_workloads
Type: Gauge
Description: The number of pending workloads, per ‘local_queue’ and ‘status’.
‘status’ can have the following values:
- “active” means that the workloads are in the admission queue.
- “inadmissible” means there was a failed admission attempt for these workloads and they won’t be retried until cluster conditions change in a way that could make them admissible.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  status: one of active or inadmissible

kueue_local_queue_quota_reserved_wait_time_seconds
Type: Histogram
Description: The time from when a workload was created or requeued until it got quota reservation, per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_local_queue_quota_reserved_workloads_total
Type: Counter
Description: The total number of quota reserved workloads per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_local_queue_reserving_active_workloads
Type: Gauge
Description: The number of Workloads that are reserving quota, per ‘localQueue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue

kueue_local_queue_resource_reservation
Type: Gauge
Description: Reports the localQueue’s total resource reservation within all the flavors.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_local_queue_resource_usage
Type: Gauge
Description: Reports the localQueue’s total resource usage within all the flavors.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_local_queue_status
Type: Gauge
Description: Reports ‘localQueue’ with its ‘active’ status (with possible values ‘True’, ‘False’, or ‘Unknown’).
For a LocalQueue, the metric only reports a value of 1 for one of the statuses.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  active: one of True, False, or Unknown
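
The status gauges (kueue_local_queue_status here, and kueue_cluster_queue_status earlier) report a value of 1 for exactly one status per queue. The sketch below shows one way to turn such a set of series into a single current status. It assumes the other statuses are exported with a value of 0; the queue name, the namespace, and the current_status helper are hypothetical.

```python
# Made-up series for one LocalQueue: (labels, sample value).
SAMPLE_SERIES = [
    ({"name": "team-a-queue", "namespace": "team-a", "active": "True"}, 1),
    ({"name": "team-a-queue", "namespace": "team-a", "active": "False"}, 0),
    ({"name": "team-a-queue", "namespace": "team-a", "active": "Unknown"}, 0),
]


def current_status(series, label):
    """Return the value of `label` on the single series whose sample value is 1."""
    matches = [labels[label] for labels, value in series if value == 1]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one series with value 1, got {len(matches)}")
    return matches[0]


print(current_status(SAMPLE_SERIES, "active"))  # prints True
```

For kueue_cluster_queue_status the same helper would be called with the ‘status’ label instead of ‘active’.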

Cohort Status

kueue_cohort_weighted_share
Type: Gauge
Description: Reports a value representing the maximum of the ratios of usage above nominal quota to the lendable resources in the Cohort, among all the resources provided by the Cohort, divided by the weight.
If zero, it means that the usage of the Cohort is below the nominal quota.
If the Cohort has a weight of zero and is borrowing, this will return NaN.
Labels:
  cohort: the name of the Cohort
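
To make the weighted share description more concrete, the sketch below follows the wording above literally: for each resource it takes the usage above nominal quota divided by the lendable amount, picks the maximum ratio, and divides by the weight. This is an assumption-based reading of the description, not Kueue's implementation, which may scale or round the value differently. The function name, the input shapes, and all numbers are hypothetical.

```python
import math


def weighted_share(usage, nominal, lendable, weight):
    """Literal reading of the description: max over resources of
    (usage above nominal quota / lendable), divided by the weight."""
    ratios = []
    for resource, used in usage.items():
        above_nominal = max(0.0, used - nominal.get(resource, 0.0))
        lend = lendable.get(resource, 0.0)
        if above_nominal > 0 and lend > 0:
            ratios.append(above_nominal / lend)
    max_ratio = max(ratios, default=0.0)
    if max_ratio == 0:
        return 0.0  # usage does not exceed the nominal quota for any resource
    if weight == 0:
        return math.nan  # borrowing with a weight of zero, per the description
    return max_ratio / weight


# Made-up numbers: cpu is 2 above nominal (2/20 = 0.1), memory is 16 above (16/64 = 0.25).
usage = {"cpu": 12.0, "memory": 64.0}
nominal = {"cpu": 10.0, "memory": 48.0}
lendable = {"cpu": 20.0, "memory": 64.0}

print(weighted_share(usage, nominal, lendable, weight=2.0))  # prints 0.125
```

Here memory dominates because it has the largest ratio, and doubling the weight halves the reported share.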

Optional metrics

The following metrics are available only if metrics.enableClusterQueueResources is enabled in the manager’s configuration.

kueue_cluster_queue_borrowing_limit
Type: Gauge
Description: Reports the cluster_queue’s resource borrowing limit within all the flavors.
Labels:
  cohort: the name of the Cohort
  cluster_queue: the name of the ClusterQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_cluster_queue_lending_limit
Type: Gauge
Description: Reports the cluster_queue’s resource lending limit within all the flavors.
Labels:
  cohort: the name of the Cohort
  cluster_queue: the name of the ClusterQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_cluster_queue_nominal_quota
Type: Gauge
Description: Reports the cluster_queue’s resource nominal quota within all the flavors.
Labels:
  cohort: the name of the Cohort
  cluster_queue: the name of the ClusterQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_cluster_queue_resource_reservation
Type: Gauge
Description: Reports the cluster_queue’s total resource reservation within all the flavors.
Labels:
  cohort: the name of the Cohort
  cluster_queue: the name of the ClusterQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_cluster_queue_resource_usage
Type: Gauge
Description: Reports the cluster_queue’s total resource usage within all the flavors.
Labels:
  cohort: the name of the Cohort
  cluster_queue: the name of the ClusterQueue
  flavor: the resource flavor name
  resource: the resource name

kueue_cluster_queue_weighted_share
Type: Gauge
Description: Reports a value representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue, divided by the weight.
If zero, it means that the usage of the ClusterQueue is below the nominal quota.
If the ClusterQueue has a weight of zero and is borrowing, this will return NaN.
Labels:
  cluster_queue: the name of the ClusterQueue
  cohort: the name of the Cohort
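
One possible use of these gauges is to compare kueue_cluster_queue_resource_usage against kueue_cluster_queue_nominal_quota for the same cluster_queue, flavor, and resource. The sketch below does this on made-up samples keyed by those three labels; the queue and flavor names and the utilization helper are hypothetical.

```python
# Made-up samples keyed by (cluster_queue, flavor, resource).
usage_samples = {
    ("team-a", "default-flavor", "cpu"): 15.0,
    ("team-a", "default-flavor", "memory"): 16.0,
}
nominal_samples = {
    ("team-a", "default-flavor", "cpu"): 10.0,
    ("team-a", "default-flavor", "memory"): 64.0,
}


def utilization(usage, nominal):
    """Return usage / nominal quota for every key with a positive nominal quota."""
    result = {}
    for key, quota in nominal.items():
        if quota > 0:
            result[key] = usage.get(key, 0.0) / quota
    return result


for key, ratio in sorted(utilization(usage_samples, nominal_samples).items()):
    print(key, ratio)
```

A ratio above 1 means the ClusterQueue is using more than its nominal quota for that flavor and resource, which suggests that it is borrowing from its cohort.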

The following metrics are available only if waitForPodsReady is enabled in the manager’s configuration. For more details see.

kueue_admitted_until_ready_wait_time_seconds
Type: Histogram
Description: The time from when a workload was admitted until ready, per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name

kueue_local_queue_admitted_until_ready_wait_time_seconds
Type: Histogram
Description: The time from when a workload was admitted until ready, per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_local_queue_ready_wait_time_seconds
Type: Histogram
Description: The time from when a workload was created or requeued until ready, per ‘local_queue’.
Labels:
  name: the name of the LocalQueue
  namespace: the namespace of the LocalQueue
  priority_class: the priority class name

kueue_ready_wait_time_seconds
Type: Histogram
Description: The time from when a workload was created or requeued until ready, per ‘cluster_queue’.
Labels:
  cluster_queue: the name of the ClusterQueue
  priority_class: the priority class name