Run A RayCluster
Run RayClusters on an environment with Kueue enabled.
This page shows how to leverage Kueue's scheduling and resource management capabilities when running RayCluster.
This guide is for batch users that have a basic understanding of Kueue. For more information, see Kueue's overview.
Before you begin
Make sure you are using Kueue v0.6.0 or newer and KubeRay v1.1.0 or newer.
See Administer cluster quotas for details on the initial Kueue setup.
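If you have not completed that setup yet, the following is a minimal sketch of the kind of single-queue configuration that guide produces. The flavor name, queue names, and quota values here are illustrative placeholders, not required values; the examples below assume a LocalQueue named user-queue.
# Minimal illustrative Kueue setup (names and quotas are placeholders).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}   # match Workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: user-queue
spec:
  clusterQueue: cluster-queue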
See KubeRay Installation for installation and configuration details of KubeRay.
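As a reference, one common installation path is the official KubeRay Helm chart; the commands below are only a sketch of that approach, with the chart version pinned here purely as an example of a v1.1.0+ release.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install the KubeRay operator (v1.1.0 or newer).
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0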
Note
Before v0.8.1, you need to restart Kueue in order to use RayCluster. You can do it by running:
kubectl delete pods -l control-plane=controller-manager -n kueue-system
RayCluster definition
When running RayClusters on Kueue, take into consideration the following aspects:
a. Queue selection
The target local queue should be specified in the metadata.labels section of the RayCluster configuration.
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
b. Configure the resource needs
The resource needs of the workload can be configured in the spec.
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - resources:
            requests:
              cpu: "1"
  workerGroupSpecs:
  - template:
      spec:
        containers:
        - resources:
            requests:
              cpu: "1"
Note that a RayCluster holds resource quota for as long as it exists. For optimal resource management, you should delete a RayCluster that is no longer in use, as sketched below.
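For example (a sketch; the namespace and cluster name match the example manifest further down and are only illustrative):
# List the Workloads Kueue created for your RayClusters and check their admission status.
kubectl -n default get workloads
# Release the quota by deleting a RayCluster you no longer need.
kubectl -n default delete raycluster raycluster-complete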
c. Limitations
- Limited Worker Groups: Because a Kueue workload can have a maximum of 8 PodSets, the maximum number of spec.workerGroupSpecs is 7.
- In-tree Autoscaling Disabled: Kueue manages resource allocation for the RayCluster; therefore, the cluster's internal autoscaling mechanisms need to be disabled (see the snippet after this list).
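For the second limitation, this simply means not turning on KubeRay's in-tree autoscaler in the RayCluster spec. A minimal sketch (enableInTreeAutoscaling already defaults to false, so omitting the field has the same effect):
spec:
  # Kueue manages quota and admission for this cluster,
  # so the KubeRay in-tree autoscaler must stay disabled.
  enableInTreeAutoscaling: false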
Example RayCluster
The RayCluster looks like the following:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    controller-tools.k8s.io: "1.0"
  # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.9.0'
  # Ray head pod configuration
  headGroupSpec:
    # Kubernetes Service Type. This is an optional field, and the default value is ClusterIP.
    # Refer to https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types.
    serviceType: ClusterIP
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      metadata:
        # Custom labels. NOTE: To avoid conflicts with the KubeRay operator, do not define custom labels starting with `raycluster`.
        # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          # The resource requests and limits in this config are too small for production!
          # For an example with more realistic resource configuration, see
          # ray-cluster.autoscaler.large.yaml.
          # It is better to use a few large Ray pods than many small ones.
          # For production, it is ideal to size each Ray pod to take up the
          # entire Kubernetes node on which it is scheduled.
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              # For production use-cases, we recommend specifying integer CPU requests and limits.
              # We also recommend setting requests equal to limits for both CPU and memory.
              # For this example, we use a small CPU request to accommodate resource-constrained local
              # Kubernetes testing environments such as KinD and minikube.
              cpu: "1"
              memory: "2G"
        volumes:
        - name: ray-logs
          emptyDir: {}
  workerGroupSpecs:
  # The pod replicas in this worker group.
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    # Logical group name; here it is called small-group, but it can also be functional.
    groupName: small-group
    # If worker pods need to be added, we can increment the replicas.
    # If worker pods need to be removed, we decrement the replicas, and populate the workersToDelete list.
    # The operator will remove pods from the list until the desired number of replicas is satisfied.
    # If the difference between the current replica count and the desired replicas is greater than the
    # number of entries in workersToDelete, random worker pods will be deleted.
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-small-group-bdtwh
    #  - raycluster-complete-worker-small-group-hv457
    #  - raycluster-complete-worker-small-group-k8tj7
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams: {}
    # pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # use volumeMounts.Optional.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          # The resource requests and limits in this config are too small for production!
          # For an example with more realistic resource configuration, see
          # ray-cluster.autoscaler.large.yaml.
          # It is better to use a few large Ray pods than many small ones.
          # For production, it is ideal to size each Ray pod to take up the
          # entire Kubernetes node on which it is scheduled.
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              # For production use-cases, we recommend specifying integer CPU requests and limits.
              # We also recommend setting requests equal to limits for both CPU and memory.
              # For this example, we use a small CPU request to accommodate resource-constrained local
              # Kubernetes testing environments such as KinD and minikube.
              cpu: "1"
              # For production use-cases, we recommend allocating at least 8Gb memory for each Ray container.
              memory: "1G"
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
        - name: ray-logs
          emptyDir: {}
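Assuming the manifest above is saved as raycluster-complete.yaml (the file name is just an example), you can create the cluster and verify that Kueue admits it roughly as follows:
kubectl apply -f raycluster-complete.yaml
# Kueue keeps the RayCluster suspended until its Workload is admitted.
kubectl -n default get workloads
kubectl -n default get raycluster raycluster-complete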
You can submit a Ray Job using the CLI, or log into the Ray Head and execute a job following this example in a kind cluster.
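For instance, one way to reach the cluster from your machine is to port-forward the Ray dashboard and use the ray job submit CLI. The service name below follows KubeRay's <cluster-name>-head-svc convention for the example cluster above, and the submitted command is just a placeholder:
# Forward the dashboard port of the head service created by KubeRay.
kubectl port-forward svc/raycluster-complete-head-svc 8265:8265
# In another terminal, submit a job through the Ray Jobs CLI.
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"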