Prometheus Metrics
Prometheus metrics exported by Kueue
Kueue exposes Prometheus metrics to monitor the health of the system and the status of ClusterQueues.
Kueue health
Use the following metrics to monitor the health of the Kueue controllers:
Metric name | Type | Description | Labels |
---|---|---|---|
kueue_admission_attempts_total | Counter | The total number of attempts to admit workloads. Each admission attempt might try to admit more than one workload. | result: possible values are success or inadmissible |
kueue_admission_attempt_duration_seconds | Histogram | The latency of an admission attempt. | result: possible values are success or inadmissible |
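
As a rough illustration of how the result label can be used, the sketch below computes the share of admission attempts that succeeded from a scrape in the Prometheus text exposition format. The parser is deliberately simplified (it ignores optional timestamps and escaped quotes), the helper names are hypothetical, and the sample values are invented for the example. Because the metric is a counter, the ratio covers the whole lifetime of the process rather than a recent window.

```python
import re

# Simplified matcher for one sample line of the Prometheus text format,
# e.g. `name{label="value"} 3`. Optional timestamps are not handled.
SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def admission_success_ratio(text):
    """Return the fraction of attempts with result="success", or None."""
    counts = {}
    for name, labels, value in parse_samples(text):
        if name == "kueue_admission_attempts_total":
            result = labels.get("result", "")
            counts[result] = counts.get(result, 0.0) + value
    total = sum(counts.values())
    return counts.get("success", 0.0) / total if total else None


# Hypothetical scrape output, for illustration only.
example = '''
# TYPE kueue_admission_attempts_total counter
kueue_admission_attempts_total{result="success"} 30
kueue_admission_attempts_total{result="inadmissible"} 10
'''

print(admission_success_ratio(example))  # 0.75
```

A persistently low ratio only says that many attempts found nothing admissible; the ClusterQueue metrics below are more useful for finding out which queue is affected.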
ClusterQueue status
Use the following metrics to monitor the status of your ClusterQueues:
Metric name | Type | Description | Labels |
---|---|---|---|
kueue_pending_workloads | Gauge | The number of pending workloads. | cluster_queue: the name of the ClusterQueue; status: possible values are active or inadmissible |
kueue_quota_reserved_workloads_total | Counter | The total number of quota reserved workloads. | cluster_queue: the name of the ClusterQueue |
kueue_quota_reserved_wait_time_seconds | Histogram | The time from when a workload was created or requeued until it got quota reservation. | cluster_queue: the name of the ClusterQueue |
kueue_admitted_workloads_total | Counter | The total number of admitted workloads. | cluster_queue: the name of the ClusterQueue |
kueue_evicted_workloads_total | Counter | The total number of evicted workloads. | cluster_queue: the name of the ClusterQueue; reason: possible values are Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped or Deactivated |
kueue_admission_wait_time_seconds | Histogram | The time from when a workload was created or requeued until admission. | cluster_queue: the name of the ClusterQueue |
kueue_admission_checks_wait_time_seconds | Histogram | The time from when a workload got the quota reservation until admission. | cluster_queue: the name of the ClusterQueue |
kueue_admitted_active_workloads | Gauge | The number of admitted Workloads that are active (unsuspended and not finished). | cluster_queue: the name of the ClusterQueue |
kueue_cluster_queue_status | Gauge | Reports the status of the ClusterQueue. | cluster_queue: the name of the ClusterQueue; status: possible values are pending, active or terminated (for each ClusterQueue, the metric reports a value of 1 for only one of the statuses) |
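
To make the label semantics concrete, here is a minimal sketch that reads a scrape in the Prometheus text exposition format and derives, per ClusterQueue, the status whose kueue_cluster_queue_status gauge is 1 and the total of kueue_pending_workloads across both status values. The parser is simplified (no timestamps, no escaped quotes), the function names are hypothetical, and the queue names and values are invented.

```python
import re

# Simplified matcher for one sample line of the Prometheus text format.
SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def cluster_queue_statuses(text):
    """Map each ClusterQueue to the status whose gauge currently reads 1."""
    statuses = {}
    for name, labels, value in parse_samples(text):
        if name == "kueue_cluster_queue_status" and value == 1:
            statuses[labels["cluster_queue"]] = labels["status"]
    return statuses


def pending_by_cluster_queue(text):
    """Sum pending workloads per ClusterQueue over the status label."""
    pending = {}
    for name, labels, value in parse_samples(text):
        if name == "kueue_pending_workloads":
            queue = labels["cluster_queue"]
            pending[queue] = pending.get(queue, 0.0) + value
    return pending


# Hypothetical scrape output, for illustration only.
example = '''
kueue_cluster_queue_status{cluster_queue="team-a",status="active"} 1
kueue_cluster_queue_status{cluster_queue="team-a",status="pending"} 0
kueue_cluster_queue_status{cluster_queue="team-a",status="terminated"} 0
kueue_cluster_queue_status{cluster_queue="team-b",status="pending"} 1
kueue_pending_workloads{cluster_queue="team-a",status="active"} 3
kueue_pending_workloads{cluster_queue="team-a",status="inadmissible"} 2
'''

print(cluster_queue_statuses(example))
print(pending_by_cluster_queue(example))
```

Summing over status hides the split between active and inadmissible workloads; keeping the label is preferable when the goal is to tell a backlog that will drain from one that cannot be admitted.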
LocalQueue status (alpha)
Metric name | Type | Description | Labels |
---|---|---|---|
local_queue_pending_workloads | Gauge | The number of pending workloads, per local_queue and status. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; status: either active (for active pending workloads) or inadmissible |
local_queue_quota_reserved_workloads_total | Counter | The number of workloads with quota reserved in a LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
local_queue_quota_reserved_wait_time_seconds | Histogram | The time from when a workload was created or requeued until it got quota reservation, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
local_queue_admitted_workloads_total | Counter | The total number of admitted workloads per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
local_queue_admission_wait_time_seconds | Histogram | The time from when a workload was created or requeued until admission, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
local_queue_evicted_workloads_total | Counter | The number of evicted workloads per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; reason: the reason the workload was evicted, one of Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped or Deactivated |
local_queue_reserving_active_workloads | Gauge | The number of Workloads that are reserving quota, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
local_queue_admitted_active_workloads | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
local_queue_status | Gauge | Reports a LocalQueue’s active status (its ability to schedule workloads). | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; active: one of True, False or Unknown, exactly one of which is positive at any given time |
local_queue_resource_usage | Gauge | Reports the LocalQueue’s total resource usage across all flavors. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; flavor: the name of the flavor from which resources are being consumed; resource: the resource being consumed |
Optional metrics
The following metrics are available only if metrics.enableClusterQueueResources
is enabled in the manager’s configuration.
Metric name | Type | Description | Labels |
---|---|---|---|
kueue_cluster_queue_resource_usage | Gauge | Reports the ClusterQueue’s total resource usage. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
kueue_cluster_queue_nominal_quota | Gauge | Reports the ClusterQueue’s resource quota. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
kueue_cluster_queue_borrowing_limit | Gauge | Reports the ClusterQueue’s resource borrowing limit. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
kueue_cluster_queue_weighted_share | Gauge | Reports a value representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. | cluster_queue: the name of the ClusterQueue |
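
One possible use of the optional metrics is to relate usage to nominal quota. The sketch below joins kueue_cluster_queue_resource_usage and kueue_cluster_queue_nominal_quota on their shared cluster_queue, flavor and resource labels and reports usage divided by quota. This ratio is an illustration, not a metric Kueue exports; the parser is simplified, the function names are hypothetical, and the sample values are invented. Under these assumptions, a ratio above 1 would suggest that the queue is using more than its nominal quota, for example by borrowing from its cohort.

```python
import re

# Simplified matcher for one sample line of the Prometheus text format.
SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def quota_utilization(text):
    """Return usage / nominal quota keyed by (cluster_queue, flavor, resource)."""
    usage, quota = {}, {}
    for name, labels, value in parse_samples(text):
        key = (labels.get("cluster_queue"), labels.get("flavor"),
               labels.get("resource"))
        if name == "kueue_cluster_queue_resource_usage":
            usage[key] = value
        elif name == "kueue_cluster_queue_nominal_quota":
            quota[key] = value
    # Skip entries with no usage sample or a zero quota to avoid dividing by 0.
    return {k: usage[k] / q for k, q in quota.items() if q > 0 and k in usage}


# Hypothetical scrape output, for illustration only.
example = '''
kueue_cluster_queue_resource_usage{cohort="c",cluster_queue="team-a",flavor="default",resource="cpu"} 6
kueue_cluster_queue_nominal_quota{cohort="c",cluster_queue="team-a",flavor="default",resource="cpu"} 8
kueue_cluster_queue_resource_usage{cohort="c",cluster_queue="team-b",flavor="default",resource="cpu"} 12
kueue_cluster_queue_nominal_quota{cohort="c",cluster_queue="team-b",flavor="default",resource="cpu"} 8
'''

for key, ratio in sorted(quota_utilization(example).items()):
    print(key, ratio)
```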