Run A RayJob

Run a RayJob in an environment with Kueue enabled.

This page shows how to leverage Kueue's scheduling and resource management capabilities when running KubeRay RayJobs.

This guide is for batch users that have a basic understanding of Kueue. For more information, see Kueue's overview.

Before you begin

  1. Make sure you are using Kueue v0.6.0 or newer and KubeRay v1.1.0 or newer.

  2. See Administer cluster quotas for details on the initial Kueue setup.

  3. See KubeRay Installation for installation and configuration details of KubeRay (a sketch of typical install commands follows this list).
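
If you have not installed these components yet, the commands below are a minimal sketch of a typical setup. The release URL and Helm chart version are assumptions based on the minimum versions above; refer to the linked guides for the authoritative instructions.

# Install Kueue (server-side apply is required for its bundled manifests)
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.6.0/manifests.yaml

# Install the KubeRay operator from its Helm chart
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0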

RayJob definition

When running RayJobs, take into account the following aspects:

a. Queue selection

The target local queue should be specified in the metadata.labels section of the RayJob configuration.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
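
The local queue named here must already exist in the RayJob's namespace and point to a ClusterQueue with enough quota. As an illustration only (the queue names, flavor name, and quota values below are assumptions; see Administer cluster quotas for a complete setup), a matching user-queue could be defined like this:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # match Workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 9
            - name: "memory"
              nominalQuota: 36Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: user-queue
spec:
  clusterQueue: cluster-queue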

b. Configure the resource needs

The resource needs of the workload can be configured in the spec.rayClusterSpec.

spec:
  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          containers:
            - resources:
                requests:
                  cpu: "1"
    workerGroupSpecs:
      - template:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "1"

c. Limitations

  • A Kueue-managed RayJob cannot use an existing RayCluster.
  • The RayCluster should be deleted at the end of the job execution; spec.ShutdownAfterJobFinishes should be true (see the snippet after this list).
  • Because Kueue will reserve resources for the RayCluster, spec.rayClusterSpec.enableInTreeAutoscaling should be false.
  • Because a Kueue workload can have a maximum of 8 PodSets, the maximum number of spec.rayClusterSpec.workerGroupSpecs is 7.
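
A minimal sketch of how these constraints map onto the RayJob spec fields (the full example in the next section shows them in context; the values here are illustrative only):

spec:
  shutdownAfterJobFinishes: true    # the RayCluster is deleted when the job finishes
  rayClusterSpec:
    enableInTreeAutoscaling: false  # Kueue reserves the resources, so in-tree autoscaling stays off
    workerGroupSpecs: []            # up to 7 worker groups (the head group uses the 8th PodSet)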

Example RayJob {#examples}

In this example, the code is provided to the Ray framework via a ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"

The RayJob looks like the following:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  shutdownAfterJobFinishes: true
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  # submissionMode: "K8sJobMode"
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10

  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; value must be positive integer.
  # activeDeadlineSeconds: 120

  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false

  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # The number of Pod replicas in this worker group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; for this example it is called small-group, but it can also be a functional name
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "1"
  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  # submitterPodTemplate:
  #   spec:
  #     restartPolicy: Never
  #     containers:
  #       - name: my-custom-rayjob-submitter-pod
  #         image: rayproject/ray:2.9.0
  #         # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
  #         # Specifying Command is not recommended.
  #         # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"]

You can run this RayJob with the following commands:

# Create the code ConfigMap (once)
kubectl apply -f ray-job-code-sample.yaml
# Create the RayJob. You can run this command multiple times, to observe the queueing and admission of the jobs.
kubectl create -f ray-job-sample.yaml
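
To observe queueing and admission, you can inspect the Workload object that Kueue creates for the RayJob as well as the RayJob itself; the commands below assume everything runs in the default namespace.

# The Workload created for the RayJob shows which queue it is in and whether it has been admitted
kubectl get workloads
# The RayJob stays suspended until Kueue admits it; after admission the RayCluster is created and the job runs
kubectl get rayjobs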
