Prometheus Metrics

Prometheus metrics exported by Kueue

Kueue exposes Prometheus metrics to monitor the health of the system and the status of ClusterQueues.

Kueue health

Use the following metrics to monitor the health of the Kueue controllers:

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_admission_attempts_total | Counter | The total number of attempts to admit workloads. Each admission attempt might try to admit more than one workload. | result: possible values are success or inadmissible |
| kueue_admission_attempt_duration_seconds | Histogram | The latency of an admission attempt. | result: possible values are success or inadmissible |
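As a rough illustration of how the result label can be used, the sketch below parses a scrape in the Prometheus text exposition format and computes the share of admission attempts that ended as inadmissible. The sample values and the helper names (parse_samples, inadmissible_ratio) are hypothetical and not part of Kueue. Because counters are cumulative, this ratio covers the whole lifetime of the process; a real dashboard would more likely compare rates over a time window.

```python
import re

# Hypothetical scrape output; the sample values are invented for illustration.
SAMPLE = """\
# TYPE kueue_admission_attempts_total counter
kueue_admission_attempts_total{result="success"} 120
kueue_admission_attempts_total{result="inadmissible"} 30
"""

# Minimal patterns for "name{label="value",...} value" sample lines.
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def inadmissible_ratio(text):
    """Return the fraction of admission attempts with result="inadmissible"."""
    counts = {"success": 0.0, "inadmissible": 0.0}
    for name, labels, value in parse_samples(text):
        if name == "kueue_admission_attempts_total" and labels.get("result") in counts:
            counts[labels["result"]] += value
    total = sum(counts.values())
    return counts["inadmissible"] / total if total else 0.0


# 30 inadmissible attempts out of 150 in the invented sample.
print(f"{inadmissible_ratio(SAMPLE):.2f}")
```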

ClusterQueue status

Use the following metrics to monitor the status of your ClusterQueues:

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_pending_workloads | Gauge | The number of pending workloads. | cluster_queue: the name of the ClusterQueue; status: possible values are active or inadmissible |
| kueue_quota_reserved_workloads_total | Counter | The total number of quota reserved workloads. | cluster_queue: the name of the ClusterQueue |
| kueue_quota_reserved_wait_time_seconds | Histogram | The time from when a workload was created or requeued until it got a quota reservation. | cluster_queue: the name of the ClusterQueue |
| kueue_admitted_workloads_total | Counter | The total number of admitted workloads. | cluster_queue: the name of the ClusterQueue |
| kueue_evicted_workloads_total | Counter | The total number of evicted workloads. | cluster_queue: the name of the ClusterQueue; reason: possible values are Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped or Deactivated |
| kueue_admission_wait_time_seconds | Histogram | The time from when a workload was created or requeued until admission. | cluster_queue: the name of the ClusterQueue |
| kueue_admission_checks_wait_time_seconds | Histogram | The time from when a workload got the quota reservation until admission. | cluster_queue: the name of the ClusterQueue |
| kueue_admitted_active_workloads | Gauge | The number of admitted Workloads that are active (unsuspended and not finished). | cluster_queue: the name of the ClusterQueue |
| kueue_cluster_queue_status | Gauge | Reports the status of the ClusterQueue. | cluster_queue: the name of the ClusterQueue; status: possible values are pending, active or terminated. For each ClusterQueue, the metric reports a value of 1 for exactly one of the statuses. |
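Histogram metrics such as kueue_admission_wait_time_seconds are exported as cumulative buckets, so percentiles have to be estimated rather than read directly. The following is a minimal sketch of that estimate using linear interpolation inside a bucket, similar in spirit to what Prometheus' histogram_quantile function does. The bucket boundaries and counts are invented for illustration, and estimate_quantile is a hypothetical helper, not a Kueue or Prometheus API.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with math.inf as the last upper bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # The +Inf bucket has no width, so fall back to the highest
            # finite bound. An empty bucket cannot be interpolated either.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets (seconds) for one cluster_queue; values are invented.
WAIT_TIME_BUCKETS = [(1.0, 10), (5.0, 40), (10.0, 80), (60.0, 95), (math.inf, 100)]

median = estimate_quantile(0.5, WAIT_TIME_BUCKETS)
p99 = estimate_quantile(0.99, WAIT_TIME_BUCKETS)
print(median, p99)
```

The accuracy of the estimate depends on how finely the buckets are spaced around the quantile of interest, so a percentile that lands in a wide bucket should be read as approximate.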

LocalQueue status (alpha)

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| local_queue_pending_workloads | Gauge | The number of pending workloads, per local_queue and status. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; status: possible values are active or inadmissible |
| local_queue_quota_reserved_workloads_total | Counter | The number of workloads with quota reserved in a LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| local_queue_quota_reserved_wait_time_seconds | Histogram | The time from when a workload was created or requeued until it got a quota reservation, per local_queue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| local_queue_admitted_workloads_total | Counter | The total number of admitted workloads, per local_queue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| local_queue_admission_wait_time_seconds | Histogram | The time from when a workload was created or requeued until admission, per local_queue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| local_queue_evicted_workloads_total | Counter | The number of evicted workloads, per local_queue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; reason: the reason the workload was evicted. Possible values are Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped or Deactivated |
| local_queue_reserving_active_workloads | Gauge | The number of Workloads that are reserving quota, per localQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| local_queue_admitted_active_workloads | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per localQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| local_queue_status | Gauge | Reports a LocalQueue’s active status (ability to schedule workloads). | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; active: one of True, False or Unknown. Exactly one of them is positive at any given time. |
| local_queue_resource_usage | Gauge | Reports the LocalQueue’s total resource usage within all the flavors. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; flavor: the name of the flavor from which resources are being consumed; resource: the resource being consumed |
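Since local_queue_status reports one series per value of the active label and only one of them is positive at a time, a consumer has to pick the positive series to recover the state of each queue. The sketch below shows one way to do that on already-parsed samples. The queue names, namespaces and values are invented, and local_queue_states is a hypothetical helper.

```python
def local_queue_states(samples):
    """Map (namespace, name) to the value of the active label that is positive.

    samples is an iterable of (labels, value) pairs for local_queue_status.
    """
    states = {}
    for labels, value in samples:
        if value <= 0:
            continue  # only the positive series carries the current state
        key = (labels["namespace"], labels["name"])
        if key in states:
            # The metric description says exactly one series is positive.
            raise ValueError(f"more than one positive status for {key}")
        states[key] = labels["active"]
    return states


# Hypothetical samples; names and values are invented for illustration.
SAMPLES = [
    ({"name": "lq-main", "namespace": "team-a", "active": "True"}, 1.0),
    ({"name": "lq-main", "namespace": "team-a", "active": "False"}, 0.0),
    ({"name": "lq-main", "namespace": "team-a", "active": "Unknown"}, 0.0),
    ({"name": "lq-batch", "namespace": "team-b", "active": "True"}, 0.0),
    ({"name": "lq-batch", "namespace": "team-b", "active": "False"}, 0.0),
    ({"name": "lq-batch", "namespace": "team-b", "active": "Unknown"}, 1.0),
]

for (namespace, name), state in sorted(local_queue_states(SAMPLES).items()):
    print(f"{namespace}/{name}: active={state}")
```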

Optional metrics

The following metrics are available only if metrics.enableClusterQueueResources is enabled in the manager’s configuration.

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_cluster_queue_resource_usage | Gauge | Reports the ClusterQueue’s total resource usage. | cohort: the cohort that the queue belongs to; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_nominal_quota | Gauge | Reports the ClusterQueue’s resource quota. | cohort: the cohort that the queue belongs to; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_borrowing_limit | Gauge | Reports the ClusterQueue’s resource borrowing limit. | cohort: the cohort that the queue belongs to; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_weighted_share | Gauge | Reports a value representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. | cluster_queue: the name of the ClusterQueue |
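To make the weighted share description more concrete, the sketch below follows its wording literally: for each resource, take the usage above nominal quota, divide it by the lendable amount in the cohort, and keep the largest ratio. This is only an illustration of the description. Kueue's actual computation may apply additional scaling or weights that are not described here, the input numbers are invented, and max_borrow_ratio is a hypothetical helper.

```python
def borrowed(usage, nominal):
    """Usage above nominal quota, never negative."""
    return max(0.0, usage - nominal)


def max_borrow_ratio(usage, nominal, lendable):
    """Largest ratio of borrowed quantity to lendable quantity over all resources.

    Each argument is a dict mapping a resource name to a quantity.
    Resources with nothing lendable in the cohort are skipped.
    """
    ratios = []
    for resource, used in usage.items():
        lend = lendable.get(resource, 0.0)
        if lend <= 0:
            continue
        ratios.append(borrowed(used, nominal.get(resource, 0.0)) / lend)
    return max(ratios, default=0.0)


# Hypothetical quantities for one ClusterQueue; values are invented.
usage = {"cpu": 12.0, "memory": 40.0}
nominal = {"cpu": 8.0, "memory": 64.0}
lendable = {"cpu": 16.0, "memory": 128.0}

# cpu borrows 4 of 16 lendable units; memory stays within its nominal quota.
print(max_borrow_ratio(usage, nominal, lendable))
```

The usage and nominal inputs correspond to what kueue_cluster_queue_resource_usage and kueue_cluster_queue_nominal_quota report per resource. The lendable amounts are an assumed input here, since none of the metrics listed above exposes them.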