Run a RayCluster

Run a RayCluster on Kueue.

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running RayClusters.

This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

  1. Make sure you are using Kueue v0.6.0 or newer and KubeRay v1.1.0 or newer.

  2. Check Administer cluster quotas for details on the initial Kueue setup; a minimal queue configuration is sketched after this list.

  3. See KubeRay Installation for installation and configuration details of KubeRay.
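Before the RayCluster below can be admitted, a ClusterQueue with quota and a LocalQueue pointing at it must exist. As a minimal single-flavor sketch (the queue names match this page’s example, but the quota values are illustrative placeholders; see Administer cluster quotas for a full walkthrough):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # match all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue
  namespace: default
spec:
  clusterQueue: cluster-queue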

RayCluster definition

When running RayClusters on Kueue, consider the following aspects:

a. Queue selection

The target local queue should be specified in the metadata.labels section of the RayCluster configuration.

metadata:
  name: raycluster-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue

b. Configure the resource needs

The resource needs of the workload can be configured in the spec.

spec:
  headGroupSpec:
    template:
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
              memory: 2G
            requests:
              cpu: "1"
              memory: 2G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
  workerGroupSpecs:
  - template:
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G

Note that a RayCluster holds its resource quota for as long as it exists. For optimal resource management, delete RayClusters that are no longer in use.
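For example, assuming the sample cluster name used on this page:

kubectl delete raycluster raycluster-sample -n default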

c. Limitations

  • Limited Worker Groups: because a Kueue workload can have a maximum of 8 PodSets, the maximum number of spec.workerGroupSpecs is 7.
  • In-Tree Autoscaling Disabled: Kueue manages resource allocation for the RayCluster; therefore, the cluster’s internal autoscaling mechanism must be disabled (see the sketch after this list).
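In KubeRay, in-tree autoscaling is controlled by the enableInTreeAutoscaling field of the RayCluster spec; it defaults to false, so omitting the field has the same effect. A minimal sketch of keeping it disabled:

spec:
  enableInTreeAutoscaling: false # Kueue owns the quota; keep KubeRay's autoscaler off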

Example RayCluster

The RayCluster looks like the following:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
              memory: 2G
            requests:
              cpu: "1"
              memory: 2G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    maxReplicas: 10
    minReplicas: 1
    rayStartParams: {}
    replicas: 4
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
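Assuming the manifest above is saved as raycluster-sample.yaml (an arbitrary file name), you can create the cluster and watch Kueue admit the Workload it creates for it:

kubectl apply -f raycluster-sample.yaml

# A Workload object is created for the RayCluster and admitted once quota is available
kubectl -n default get workloads

# The Ray pods start only after the workload is admitted
kubectl -n default get raycluster raycluster-sample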

You can submit a Ray Job using the CLI, or log in to the Ray head node and execute a job following this example with a kind cluster.
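As a sketch, assuming KubeRay’s default head Service name derived from the cluster name (raycluster-sample-head-svc) and the default dashboard port 8265, you could submit a job from your machine like this:

# Forward the Ray dashboard port from the head Service
kubectl -n default port-forward svc/raycluster-sample-head-svc 8265:8265

# In another terminal, submit a trivial job with the Ray CLI
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"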