Common Grafana Queries
This page shows you how to use common PromQL queries to monitor Kueue metrics in Grafana.
The intended audience for this page is batch administrators.
Before you begin
Make sure the following conditions are met:
- A Kubernetes cluster is running.
- The kubectl command-line tool can communicate with your cluster.
- Kueue is installed.
- kube-prometheus is installed.
- Kueue Prometheus metrics are enabled (see Setup Prometheus).
Quota utilization
Note
The queries in this section require metrics.enableClusterQueueResources: true in the Kueue configuration. See Installation for details.
To monitor the percentage of CPU quota being used in a ClusterQueue:
(sum by (cluster_queue) (kueue_cluster_queue_resource_usage{resource="cpu"}))
/
(sum by (cluster_queue) (kueue_cluster_queue_nominal_quota{resource="cpu"}))
* 100
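A ClusterQueue whose nominal CPU quota is 0 (for example, one that relies on borrowing) makes the division above return +Inf, or NaN when its usage is also 0. As a minimal sketch, assuming you prefer to leave such queues out of the panel rather than plot those values, you can filter the denominator:

```promql
# Same ratio as above, but only for ClusterQueues with a non-zero nominal CPU quota.
(sum by (cluster_queue) (kueue_cluster_queue_resource_usage{resource="cpu"}))
/
((sum by (cluster_queue) (kueue_cluster_queue_nominal_quota{resource="cpu"})) > 0)
* 100
```

The > 0 comparison removes series whose quota is 0 from the right-hand side, and the division then returns results only for ClusterQueues present on both sides.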
To see utilization broken down by resource in a ClusterQueue:
(sum by (cluster_queue, resource) (kueue_cluster_queue_resource_usage))
/
(sum by (cluster_queue, resource) (kueue_cluster_queue_nominal_quota))
* 100
To see the average CPU quota utilization over the last week in a ClusterQueue, evaluated at one-hour steps:
avg_over_time(
  (
    (sum by (cluster_queue) (kueue_cluster_queue_resource_usage{resource="cpu"}))
    /
    (sum by (cluster_queue) (kueue_cluster_queue_nominal_quota{resource="cpu"}))
    * 100
  )[1w:1h]
)
To find the top 5 ClusterQueues by CPU utilization:
topk(5,
  (sum by (cluster_queue) (kueue_cluster_queue_resource_usage{resource="cpu"}))
  /
  (sum by (cluster_queue) (kueue_cluster_queue_nominal_quota{resource="cpu"}))
  * 100
)
Pending workloads
To monitor the number of active pending workloads per ClusterQueue:
sum by (cluster_queue) (kueue_pending_workloads{status="active"})
To see both active and inadmissible pending workloads per ClusterQueue:
sum by (cluster_queue, status) (kueue_pending_workloads)
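To list only the ClusterQueues that currently hold inadmissible workloads, one possible variation is shown below. It is a sketch that assumes the status label uses the value inadmissible, matching the two states named above:

```promql
# Returns a series only for ClusterQueues with at least one inadmissible pending workload.
sum by (cluster_queue) (kueue_pending_workloads{status="inadmissible"}) > 0
```

An empty result means no ClusterQueue has inadmissible workloads at the moment, which makes this form convenient for a table panel or an alert condition.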
Admission wait time
To monitor how long workloads wait before admission, use histogram percentile queries.
For the 95th percentile (P95) admission wait time:
histogram_quantile(0.95,
  sum by (le, cluster_queue) (
    rate(kueue_admission_wait_time_seconds_bucket[5m])
  )
)
For the 50th percentile (median):
histogram_quantile(0.50,
  sum by (le, cluster_queue) (
    rate(kueue_admission_wait_time_seconds_bucket[5m])
  )
)
For the 99th percentile (P99):
histogram_quantile(0.99,
  sum by (le, cluster_queue) (
    rate(kueue_admission_wait_time_seconds_bucket[5m])
  )
)
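Percentiles are estimated from bucket boundaries, so an average can be a useful companion figure. The sketch below assumes the histogram exposes the usual _sum and _count series alongside _bucket, as Prometheus histograms conventionally do:

```promql
# Mean admission wait time in seconds over the last 5 minutes, per ClusterQueue.
sum by (cluster_queue) (rate(kueue_admission_wait_time_seconds_sum[5m]))
/
sum by (cluster_queue) (rate(kueue_admission_wait_time_seconds_count[5m]))
```

A ClusterQueue with no admissions in the window produces NaN here, because both the numerator and the denominator are 0.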
Workload throughput
To monitor how many workloads are being admitted per hour:
sum by (cluster_queue) (
  increase(kueue_admitted_workloads_total[1h])
)
To monitor finished workloads per hour:
sum by (cluster_queue) (
  increase(kueue_finished_workloads_total[1h])
)
To see the admission rate over time (workloads per minute):
sum by (cluster_queue) (
  rate(kueue_admitted_workloads_total[5m])
) * 60
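In a Grafana panel, the fixed 5m window can be replaced with Grafana's built-in $__rate_interval variable, which Grafana derives from the scrape interval and the panel's time range. This is a sketch that assumes the panel uses the Prometheus data source; the variable is not valid PromQL outside Grafana:

```promql
# Admissions per minute, with the rate window chosen by Grafana.
sum by (cluster_queue) (
  rate(kueue_admitted_workloads_total[$__rate_interval])
) * 60
```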
Eviction rate
To monitor evictions per hour by reason:
sum by (cluster_queue, reason) (
  increase(kueue_evicted_workloads_total[1h])
)
See Prometheus Metrics for the full list of reason label values.
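To put eviction counts in context, you can compare them with admissions over the same window. The ratio below is a rough sketch rather than an official Kueue metric; it assumes both counters carry the same cluster_queue label, as the queries on this page suggest:

```promql
# Evictions per admitted workload over the last hour, per ClusterQueue.
sum by (cluster_queue) (increase(kueue_evicted_workloads_total[1h]))
/
sum by (cluster_queue) (increase(kueue_admitted_workloads_total[1h]))
```

ClusterQueues with no admissions in the window return +Inf or NaN, and ClusterQueues missing from either side are omitted from the result.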
ClusterQueue status
To see which ClusterQueues are active:
kueue_cluster_queue_status{status="active"} == 1
To see ClusterQueues that are not active (pending or terminating):
kueue_cluster_queue_status{status!="active"} == 1
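For a single-stat panel, the same selector can be reduced to a count. A minimal sketch:

```promql
# Number of ClusterQueues that are currently not active; 0 when all are active.
count(kueue_cluster_queue_status{status!="active"} == 1) or vector(0)
```

The or vector(0) fallback is there because count returns no series at all when nothing matches, which Grafana would otherwise display as "No data".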
What’s next
- See Pending Workloads in Grafana for visibility dashboards using the on-demand API.