Preemption
In a preemption, the following terms are relevant:
- Preemptees: The preempted Workloads.
- Target ClusterQueues: The ClusterQueues to which the preemptees belong.
- Preemptor: The Workload being accommodated.
- Preempting ClusterQueue: The ClusterQueue to which the preemptor belongs.
Reasons for preemption
A Workload can preempt one or more Workloads if it is admitted in a ClusterQueue with preemption enabled and any of the following events happen:
- The preemptee belongs to the same ClusterQueue as the preemptor and the preemptee has a lower priority.
- The preemptee belongs to the same cohort as the preemptor and the preemptee’s ClusterQueue has a usage above the nominal quota for at least one resource that the preemptee and preemptor require.
The configured settings for preemption in the Kueue Configuration and in the ClusterQueue can limit whether a Workload can preempt others, in addition to the criteria above.
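The two reasons above can be read as a single predicate. The following Python sketch is illustrative only: the `ClusterQueueState` and `Workload` types and their fields are hypothetical and do not mirror Kueue's internal data structures. It also assumes that the cohort rule is applied only to Workloads in other ClusterQueues, and it ignores the preemption settings mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterQueueState:
    """Hypothetical, simplified view of a ClusterQueue."""
    name: str
    cohort: str
    nominal: dict = field(default_factory=dict)  # resource -> nominal quota
    usage: dict = field(default_factory=dict)    # resource -> current usage


@dataclass
class Workload:
    """Hypothetical, simplified view of a Workload."""
    name: str
    priority: int
    queue: ClusterQueueState
    requests: dict = field(default_factory=dict)  # resource -> requested amount


def may_preempt(preemptor: Workload, preemptee: Workload) -> bool:
    """Return True if one of the two reasons for preemption applies."""
    pq, tq = preemptor.queue, preemptee.queue
    if tq.name == pq.name:
        # Same ClusterQueue: the preemptee must have a lower priority.
        return preemptee.priority < preemptor.priority
    if tq.cohort and tq.cohort == pq.cohort:
        # Same cohort (assumed: a different ClusterQueue). The target
        # ClusterQueue must be above its nominal quota for at least one
        # resource that both Workloads require.
        shared = set(preemptor.requests) & set(preemptee.requests)
        return any(tq.usage.get(r, 0) > tq.nominal.get(r, 0) for r in shared)
    return False
```

In this model, priority matters only inside a ClusterQueue; across the cohort, the deciding factor is whether the target ClusterQueue is borrowing.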
When preempting a Workload, Kueue adds entries to the .status.conditions field of the preempted Workload that are similar to the following:
status:
  conditions:
  - lastTransitionTime: "2024-05-31T18:42:33Z"
    message: 'Preempted to accommodate a workload (UID: 5515f7da-d2ea-4851-9e9c-6b8b3333734d)
      in the ClusterQueue'
    observedGeneration: 1
    reason: Preempted
    status: "True"
    type: Evicted
  - lastTransitionTime: "2024-05-31T18:42:33Z"
    message: 'Preempted to accommodate a workload (UID: 5515f7da-d2ea-4851-9e9c-6b8b3333734d)
      in the ClusterQueue'
    reason: InClusterQueue
    status: "True"
    type: Preempted
The Evicted condition indicates that the Workload was evicted with the reason Preempted, whereas the Preempted condition gives more details about the preemption reason.
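As a minimal sketch of how these two conditions could be read together, the following Python function inspects a Workload object that has already been parsed into a dictionary (for example, from JSON output). The function name and the shape of the returned summary are hypothetical; only the condition types, reasons, and field names come from the example above.

```python
from typing import Optional


def preemption_details(workload: dict) -> Optional[dict]:
    """Summarize why a Workload was preempted, or return None if it was not."""
    conditions = workload.get("status", {}).get("conditions", [])
    # Index the conditions that are currently true by their type.
    by_type = {c["type"]: c for c in conditions if c.get("status") == "True"}
    evicted = by_type.get("Evicted")
    if evicted is None or evicted.get("reason") != "Preempted":
        return None  # not evicted, or evicted for a reason other than preemption
    preempted = by_type.get("Preempted", {})
    return {
        "evicted_at": evicted.get("lastTransitionTime"),
        "preemption_reason": preempted.get("reason"),
        "message": preempted.get("message", evicted.get("message")),
    }
```

For the conditions shown above, the summary would carry the reason InClusterQueue from the Preempted condition and the timestamp from the Evicted condition.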
Preemption algorithms
Kueue offers two preemption algorithms. The main difference between them is the criteria to allow preemptions from a ClusterQueue to others in the Cohort, when the usage of the preempting ClusterQueue is already above the nominal quota. The algorithms are:
Classic Preemption: Preemption in the cohort can only happen when any of the following occurs:
- The usage of the ClusterQueue for the incoming workload will be under the nominal quota after the ongoing admission process
- Preemption while borrowing is enabled for the workload’s ClusterQueue
- All candidates for preemption belong to the same ClusterQueue as the preempting Workload
In the above scenarios, a Workload can be preempted in favor of a Workload from another ClusterQueue only if it belongs to a ClusterQueue that is running over its nominal quota. ClusterQueues in a cohort borrow resources on a first-come, first-served basis.
This algorithm is the more lightweight of the two.
Fair sharing: ClusterQueues with pending Workloads can preempt other Workloads in their cohort until the preempting ClusterQueue obtains an equal or weighted share of the borrowable resources. The borrowable resources are the unused nominal quota of all the ClusterQueues in the cohort.
Classic Preemption
An incoming Workload, which does not fit within the unused quota, is eligible to issue preemptions when one of the following is true:
- the requests of the Workload are below the flavor’s nominal quota, or
- borrowWithinCohort is enabled.
Candidates
The list of preemption candidates is compiled from Workloads which either:
- belong to the same ClusterQueue as the preemptor Workload and satisfy the withinClusterQueue policy of the preemptor’s ClusterQueue
- belong to other ClusterQueues in the cohort that are actively borrowing and satisfy the reclaimWithinCohort and borrowWithinCohort policies of the preemptor’s ClusterQueue.
The list of candidates is sorted based on the following preference checks for tie-breaking:
- Workloads from borrowing queues in the cohort
- Workloads with the lowest priority
- Workloads that were admitted most recently.
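The ordering above can be sketched as a sort with a composite key. The `Candidate` type and its fields are hypothetical and exist only for illustration; the three preference checks are applied in the order listed, each one breaking ties left by the previous one.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Candidate:
    """Hypothetical, simplified preemption candidate."""
    name: str
    priority: int
    admitted_at: datetime
    from_borrowing_queue: bool


def sort_candidates(candidates):
    """Order candidates so that the most preferred preemption target comes first."""
    return sorted(
        candidates,
        key=lambda c: (
            not c.from_borrowing_queue,   # Workloads from borrowing queues first
            c.priority,                   # then the lowest priority
            -c.admitted_at.timestamp(),   # then the most recently admitted
        ),
    )
```

With this key, a low-priority Workload in the preemptor's own ClusterQueue still sorts after every Workload from a borrowing queue in the cohort.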
Targets
The Classic Preemption algorithm qualifies the candidates as preemption targets using the heuristics below:
- If all candidates belong to the target queue, then Kueue greedily qualifies candidates until the preemptor Workload can fit, allowing the usage of the ClusterQueue to be above the nominal quota, up to the borrowingLimit. This is referred to as “borrowing” in the points below.
- If borrowWithinCohort is enabled, then Kueue greedily qualifies candidates (respecting the borrowWithinCohort.maxPriorityThreshold threshold) until the preemptor Workload can fit, allowing borrowing.
- If the current usage of the target queue is below the nominal quota, then Kueue greedily qualifies the candidates until the preemptor Workload can fit, without allowing borrowing.
- If the Workload didn’t fit by using the previous heuristics, Kueue greedily qualifies only the candidates that belong to the preempting ClusterQueue, until the preemptor Workload can fit, allowing borrowing.
The last step of the algorithm is to minimize the set of targets. For this purpose, Kueue greedily traverses the list of initial targets in reverse and removes a Workload from the list of targets if the preemptor Workload still can be admitted when accounting back the quota usage of the target Workload.
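The minimization step can be sketched as follows. This is a simplified illustration, not Kueue's implementation: `fits` is a hypothetical callback that reports whether the preemptor Workload can still be admitted if only the given targets are preempted.

```python
def minimize_targets(targets, fits):
    """Drop targets that are not needed for the preemptor to fit.

    targets: initial targets, in the order in which they were qualified.
    fits: hypothetical callback; takes a list of targets and returns True
          if the preemptor fits once exactly those Workloads are preempted.
    """
    remaining = list(targets)
    for target in reversed(targets):
        # Tentatively give the quota back to this target.
        trial = [t for t in remaining if t is not target]
        if fits(trial):
            remaining = trial  # the preemptor fits without preempting it
    return remaining


# Usage: the preemptor needs 4 CPUs; targets are (name, cpu) pairs.
initial = [("a", 1), ("b", 1), ("c", 4)]
needed = minimize_targets(initial, lambda ts: sum(cpu for _, cpu in ts) >= 4)
# needed == [("c", 4)]
```

Traversing in reverse means that the most recently qualified target is reconsidered first, while the final decision for each target depends only on whether the preemptor still fits.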
Fair Sharing
Fair sharing introduces the concepts of ClusterQueue share values and preemption strategies. These work together with the preemption policies set in withinClusterQueue and reclaimWithinCohort to determine if a pending Workload can preempt an admitted Workload. Fair sharing uses preemptions to achieve an equal or weighted share of the borrowable resources between the tenants of a cohort.
To enable fair sharing, use a Kueue Configuration similar to the following:
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
The attributes in this Kueue Configuration are described in the following sections.
ClusterQueue share value
When you enable fair sharing, Kueue assigns a numeric share value to each ClusterQueue to summarize the usage of borrowed resources in a ClusterQueue, in comparison to others in the same cohort. The share value is weighted by the .spec.fairSharing.weight defined in a ClusterQueue.
During admission, Kueue prefers to admit Workloads from ClusterQueues that have the lowest share value first. During preemption, Kueue prefers to preempt Workloads from ClusterQueues that have the highest share value first.
You can obtain the share value of a ClusterQueue from the .status.fairSharing.weightedShare field or by querying the kueue_cluster_queue_weighted_share metric.
Preemption strategies
The preemptionStrategies field in the Kueue Configuration indicates which constraints a preemption should satisfy, with regard to the share values of the target and preempting ClusterQueues, before and after preempting a particular Workload.
Different preemptionStrategies can lead to fewer or more preemptions under specific scenarios. These are the factors you should consider when configuring preemptionStrategies:
- Tolerance to disruptions, in particular when single Workloads use a significant amount of the borrowable resources.
- Speed of convergence, in other words, how important it is to reach a steady fair state as soon as possible.
- Overall utilization, because certain strategies might reduce the utilization of the cluster in the pursuit of fairness.
When you define multiple preemptionStrategies, the preemption algorithm uses the next strategy in the list only if no remaining preemption candidates satisfy the current strategy and the preemptor still doesn’t fit.
The values you can put in the preemptionStrategies list are:
- LessThanOrEqualToFinalShare: Only preempt a Workload if the share of the preempting ClusterQueue with the preemptor Workload is less than or equal to the share of the target ClusterQueue without the preempted Workload. This strategy might favor preemption of smaller workloads in the target ClusterQueue, regardless of priority or start time, in an effort to keep the share of the ClusterQueue as high as possible.
- LessThanInitialShare: Only preempt a Workload if the share of the preempting ClusterQueue with the preemptor Workload is strictly less than the share of the target ClusterQueue. Note that this strategy doesn’t depend on the share usage of the Workload being preempted. As a result, the strategy chooses to first preempt workloads with the lowest priority and newest start time within the target ClusterQueue.
The default strategy is [LessThanOrEqualToFinalShare, LessThanInitialShare].
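The two strategies can be expressed as predicates over three share values. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the share values in the usage lines are made-up numbers rather than values computed by Kueue.

```python
def less_than_or_equal_to_final_share(preemptor_share_with_w: int,
                                      target_share: int,
                                      target_share_without_u: int) -> bool:
    """Compare against the target's share after U is preempted."""
    return preemptor_share_with_w <= target_share_without_u


def less_than_initial_share(preemptor_share_with_w: int,
                            target_share: int,
                            target_share_without_u: int) -> bool:
    """Compare against the target's current share; U's size is irrelevant."""
    return preemptor_share_with_w < target_share


# Same signature for both, so they can be tried in the configured order.
STRATEGIES = {
    "LessThanOrEqualToFinalShare": less_than_or_equal_to_final_share,
    "LessThanInitialShare": less_than_initial_share,
}

# Preempting a large Workload U drops the target's share from 300 to 50,
# while the preemptor's ClusterQueue would reach a share of 100.
allowed_by_final = STRATEGIES["LessThanOrEqualToFinalShare"](100, 300, 50)  # False
allowed_by_initial = STRATEGIES["LessThanInitialShare"](100, 300, 50)       # True
```

The usage lines illustrate why the first strategy tends to pick smaller Workloads: preempting a large one can push the target's share below the preemptor's, which only the second strategy tolerates.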
Algorithm overview
The initial step of the algorithm is to identify the Workloads that are candidates for preemption, using the same criteria and ordering as Classic Preemption, grouped by ClusterQueue.
Next, the above candidates are qualified as preemption targets, following an algorithm that can be summarized as follows:
FindFairPreemptionTargets(X ClusterQueue, W Workload)
  For each preemption strategy:
    While W does not fit and there are workloads that are preemption candidates:
      Find the ClusterQueue Y with the highest share value.
      For each admitted Workload U in ClusterQueue Y:
        If Workload U satisfies the preemption strategy:
          Add workload U to the list of targets
  In the reverse order of the list of targets:
    Attempt to remove a Workload from the targets, while W still fits.
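The summary above can be turned into a small runnable model. The following Python sketch is one possible reading, not Kueue's implementation: it uses a single resource, a hypothetical `Queue` type, and a stand-in share value (usage above the nominal quota) that is not Kueue's formula. It considers only candidates in other ClusterQueues and takes one Workload at a time from the queue with the highest share before re-evaluating.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Hypothetical single-resource model of a ClusterQueue."""
    nominal: int
    usage: int
    # Preemption candidates as (workload name, size), most preferred first.
    candidates: list = field(default_factory=list)

    def share(self, delta: int = 0) -> int:
        # Stand-in share value (usage above nominal quota); NOT Kueue's formula.
        return max(0, self.usage + delta - self.nominal)


def less_than_or_equal_to_final_share(preemptor, target_before, target_after):
    return preemptor <= target_after


def less_than_initial_share(preemptor, target_before, target_after):
    return preemptor < target_before


def find_fair_preemption_targets(x, w_size, free, queues, strategies):
    """Return [(queue, workload, size)] to preempt so that W fits, or [].

    x: name of the preempting queue; free: unused capacity in the cohort.
    Note: this sketch mutates `queues` while it searches.
    """
    targets = []
    preemptor_share = queues[x].share(w_size)  # share of X with W admitted
    for satisfies in strategies:
        # Queues that may still hold a candidate satisfying this strategy.
        open_queues = {n for n, q in queues.items() if n != x and q.candidates}
        while free < w_size and open_queues:
            y = max(sorted(open_queues), key=lambda n: queues[n].share())
            q = queues[y]
            for u in q.candidates:
                name, size = u
                if satisfies(preemptor_share, q.share(), q.share(-size)):
                    q.candidates.remove(u)
                    q.usage -= size
                    free += size
                    targets.append((y, name, size))
                    break
            else:
                open_queues.discard(y)  # no Workload in Y satisfies the strategy
            if not q.candidates:
                open_queues.discard(y)
    if free < w_size:
        return []
    # Walk the targets in reverse and drop those W does not need in order to fit.
    for target in reversed(list(targets)):
        if free - target[2] >= w_size:
            targets.remove(target)
            free -= target[2]
    return targets
```

In this model, a queue is set aside for the current strategy as soon as none of its candidates satisfies it, which is what lets the next strategy in the list take over.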