Integrate a custom Job with Kueue
Kueue has built-in integrations for several Job types, including Kubernetes batch Job, MPIJob, RayJob and JobSet.
There are three options for using Kueue to manage Job-like CRDs that lack built-in integrations.
- Leverage the built-in AppWrapper integration by wrapping instances of the custom Job in an AppWrapper. See Running a Wrapped Custom Workload for details.
- Build a new integration as part of the Kueue repository.
- Build a new integration as an external controller.
This guide is for platform developers and describes how
to build a new integration. Integrations should be built using the APIs provided by
Kueue’s jobframework
package. This will both simplify development and ensure that
your controller will be properly structured to become a core built-in integration if your
Job type is widely used by the community.
Overview of Requirements
Kueue uses the controller-runtime. We recommend becoming familiar with it and with Kubebuilder before starting to build a Kueue integration.
Whether you are building an external or built-in integration, your main tasks are the following:
To work with Kueue, your custom Job CRD should have a suspend-like field, with semantics similar to the
suspend
field in a Kubernetes Job. The field needs to be in your CRD’sspec
, not itsstatus
, to enable its value to be set from a webhook. Your CRD’s primary controller must respond to changes in the value of this field by suspending or unsuspending its owned resources.You will need to register the GroupVersionKind of your CRD with Kueue as an integration.
You will need to instantiate various pieces of Kueue’s
jobframework
package for your CRD:- You will need to implement Kueue’s
GenericJob
interface for your CRD - You will need to instantiate a
ReconcilerFactory
and register it with the controller runtime. - You will need to add a
+kubebuilder:rbac
directive for your CRD so Kueue will be permitted to manage it. - You will need to register webhooks that set the initial value of the
suspend
field in instances of your CRD and validate Kueue invariants on creation and update operations. - You will need to instantiate a
Workload
indexer for your CRD.
- You will need to implement Kueue’s
Building a Built-in Integration
To get started, add a new folder in ./pkg/controller/jobs/
to host the implementation of the integration.
Here are completed built-in integrations you can learn from:
Registration
Add your framework name to .integrations.frameworks
in controller_manager_config.yaml
Add RBAC Authorization for your CRD using kubebuilder marker comments.
Write a go func init()
that invokes the jobframework RegisterIntegration()
function.
Job Framework
In the mycrd_controller.go
file in your folder, implement the GenericJob
interface, and other optional interfaces defined by the framework.
In the mycrd_webhook.go
file in your folder, provide the webhook that invokes helper methods in from the jobframework to
set the initial suspend status of created jobs and validate invariants.
Add testing files for both the controller and the webhook. You can check the test files in the other subfolders of ./pkg/controller/jobs/
to learn how to implement them.
Developing a mutation and validation webhook
Once you have implemented the GenericJob
interface, you can register a webhook for the type by using
the BaseWebhookFactory
in pkg/controller/jobframework
. This webhook ensures that a job is created
in a suspended state and that queue names don’t change when the job is admitted. You can read the
MPIJob integration as an example.
For certain use cases, you might need to register a dedicated webhook. In this case, you should use
Kueue’s LosslessDefaulter
, which ensures that the defaulter is compatible with future versions of the
CRD, under the assumption that the defaulter does not remove any fields from the original object.
You can read the BatchJob
integration as an example for how to use a LosslessDefaulter
.
If the suggestions above are incompatible with your use case, beware of potential problems when using controller-runtime’s CustomDefaulter.
Adjust build system
Add required dependencies to compile your code. For example, using go get github.com/kubeflow/mpi-operator@0.4.0
.
Update the Makefile for testing.
- Add commands which copy the CRD of your custom job to the Kueue project.
- Add your custom job operator CRD dependencies into
test-integration
.
Building an External Integration
Building an external integration is similar to building a built-in integration, except that
you will instantiate Kueue’s jobframework
for your CRD in a separate executable. The only
change you will make to the primary Kueue manager is a deployment-time configuration change
to enable it to recognize that your CRD is being managed by a Kueue-compatible manager.
Here are completed external integrations you can learn from:
Registration
Configure Kueue by adding your framework’s GroupVersionKind to .integrations.externalFrameworks
in controller_manager_config.yaml.
Job Framework
Make the following changes to an existing controller-runtime based manager for your CRD.
Add a dependency on Kueue to your go.mod
so you can import jobframework
as a go library.
In a new mycrd_workload_controller.go
file, implement the GenericJob
interface, and other optional interfaces defined by the framework.
Extend the existing webhook for your CRD to invoke Kueue’s webhook helper methods:
- Your defaulter should invoke
jobframework.ApplyDefaultForSuspend
in defaults.go - Your validator should invoke
jobframework.ValidateJobOnCreate
andjobframework.ValidateJobOnUpdate
in validation.go
Extend your manager’s startup procedure to do the following:
- Using the
jobframework.NewGenericReconcilerFactory
method, create an instance of Kueue’s JobReconciler for your CRD and register it with the controller-runtime manager. - Invoke
jobframework.SetupWorkloadOwnerIndex
to create an indexer for Workloads owned by your CRD.
Extend your existing RBAC Authorizations to include the privileges needed by Kueue’s JobReconciler:
// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=priorityclasses,verbs=list;get;watch
// +kubebuilder:rbac:groups="",resources=events,verbs=create;watch;update;patch
// +kubebuilder:rbac:groups=kueue.x-k8s.io,resources=workloads,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=kueue.x-k8s.io,resources=workloads/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=kueue.x-k8s.io,resources=workloads/finalizers,verbs=update
// +kubebuilder:rbac:groups=kueue.x-k8s.io,resources=resourceflavors,verbs=get;list;watch
// +kubebuilder:rbac:groups=kueue.x-k8s.io,resources=workloadpriorityclasses,verbs=get;list;watch
For a concrete example, consult these pieces of the AppWrapper controller:
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.