Now that you’ve tried out Stable Diffusion on GKE Autopilot via the WebUI, you might be wondering how to add Stable Diffusion as a proper micro-service that other components of your application can call. One popular way is via Ray. Let’s run through the Ray tutorial Serve a StableDiffusion text-to-image model on Kubernetes, on GKE Autopilot. Here goes:
Create an Autopilot Cluster
You’ll want version 1.28 or later for this, to get the newer CUDA drivers. This will do the trick:
CLUSTER_NAME=stable-diffusion
VERSION="1.28"
REGION=us-central1
gcloud container clusters create-auto $CLUSTER_NAME \
--region $REGION --release-channel rapid \
--cluster-version $VERSION
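Once the cluster is up, point kubectl at it (this assumes you’re in the same shell where the variables above are set):
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION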
Deploy the KubeRay operator
To use Ray on GKE Autopilot, we first need to deploy the KubeRay operator. This lets us create Ray objects directly in GKE, adding RayService and RayCluster to the built-in Kubernetes workload types such as Deployment and StatefulSet. Follow just step 2 of the Ray getting started guide, reproduced here:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.0.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0
Wait until the KubeRay operator’s Pod status is “Running”, which indicates that it’s ready for us to create Ray objects. We can observe the Pod like so:
$ kubectl get pods -w
NAME                                READY   STATUS              RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg   0/1     ContainerCreating   0          2m35s
kuberay-operator-756cf8c9b6-wcnqg   0/1     Running             0          3m14s
Or simply wait until it’s ready with:
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kuberay-operator
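If you want an extra sanity check, you can also confirm that the Helm chart installed the KubeRay CRDs (these are what give us the RayService and RayCluster types mentioned above):
kubectl get crds | grep ray.io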
Run the Stable Diffusion RayService
With KubeRay configured, we can now run the service. Here I’m going to run the one from the tutorial, unchanged. This won’t work the first time, but I want to demonstrate how to fix it.
This is what the service looks like:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: stable_diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
          pip: ["diffusers==0.12.1"]
  rayClusterConfig:
    rayVersion: '2.7.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.7.0
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "2"
                  memory: "8G"
                requests:
                  cpu: "2"
                  memory: "8G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      # The pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 10
        groupName: gpu-group
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.7.0
                resources:
                  limits:
                    cpu: 4
                    memory: "16G"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 3
                    memory: "12G"
                    nvidia.com/gpu: 1
            # Please add the following taints to the GPU node.
            tolerations:
              - key: "ray.io/node-type"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
Download, and apply:
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray-service.stable-diffusion.yaml
kubectl apply -f ray-service.stable-diffusion.yaml
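The RayService takes a moment to reconcile. If you want to keep an eye on the RayService object itself while that happens, these are plain kubectl commands against the CRDs we installed earlier:
kubectl get rayservice stable-diffusion
kubectl describe rayservice stable-diffusion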
This will not work out of the box, but I want to show you why, and how to fix it. Let’s dig in:
$ kubectl get rayclusters
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-cc7rg                                         failed   3s
$ kubectl describe raycluster stable-diffusion-raycluster-wg5p2
Name:         stable-diffusion-raycluster-wg5p2
Namespace:    default
Labels:       app.kubernetes.io/created-by=rayservice
              ray.io/service=stable-diffusion
Annotations:  ray.io/cluster-hash: N6FJA58A50TVP42R7LM9EGISIG6RK5KL
              ray.io/enableAgentService: true
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-02-17T03:57:03Z
  Generation:          1
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  stable-diffusion
    UID:                   5fcd5349-d14e-45a0-88bf-b186cd2df207
  Resource Version:  1278352
  UID:               daf9b4da-9fb5-489e-b57e-4bde1c66bab6
Spec:
  Head Group Spec:
    Ray Start Params:
      Dashboard - Host:  0.0.0.0
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  rayproject/ray-ml:2.7.0
          Name:   ray-head
          Ports:
            Container Port:  6379
            Name:            gcs
            Protocol:        TCP
            Container Port:  8265
            Name:            dashboard
            Protocol:        TCP
            Container Port:  10001
            Name:            client
            Protocol:        TCP
            Container Port:  8000
            Name:            serve
            Protocol:        TCP
          Resources:
            Limits:
              Cpu:     2
              Memory:  8G
            Requests:
              Cpu:     2
              Memory:  8G
          Volume Mounts:
            Mount Path:  /tmp/ray
            Name:        ray-logs
        Volumes:
          Empty Dir:
          Name:  ray-logs
  Ray Version:  2.7.0
  Worker Group Specs:
    Group Name:    gpu-group
    Max Replicas:  10
    Min Replicas:  1
    Ray Start Params:
    Replicas:  1
    Scale Strategy:
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  rayproject/ray-ml:2.7.0
          Name:   ray-worker
          Resources:
            Limits:
              Cpu:             4
              Memory:          16G
              nvidia.com/gpu:  1
            Requests:
              Cpu:             3
              Memory:          12G
              nvidia.com/gpu:  1
        Tolerations:
          Effect:    NoSchedule
          Key:       ray.io/node-type
          Operator:  Equal
          Value:     worker
Status:
  Head:
  Reason:  admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-gpu-limitation]":["You must specify a GPU type with node selector 'cloud.google.com/gke-accelerator' when GPU is requested on Autopilot workloads; supported values are: [nvidia-a100-80gb, nvidia-l4, nvidia-tesla-a100, nvidia-tesla-t4]."]}
The violation details at the end are the clue. As that message notes, on Autopilot you can’t just request a generic GPU; you have to say which type you want with the cloud.google.com/gke-accelerator node selector. I’m going to go with the L4 here, which is a popular GPU for inference workloads.
Add the required node selector for Nvidia L4 to the workload. I’m also going to specify Spot since this isn’t a production workload:
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
  cloud.google.com/gke-spot: "true"
Note that the above tutorial’s config requests that you add taints to the GPU node (“Please add the following taints to the GPU node.”). In Autopilot you don’t have to manage nodes, and the platform will automatically do all this for you. Neat!
Here’s the final RayService with our two node selectors added, plus an ephemeral-storage request to make sure we get a big enough node (in Autopilot, you need to tell the system what resources you need).
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: stable_diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
          pip: ["diffusers==0.12.1"]
  rayClusterConfig:
    rayVersion: '2.7.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.7.0
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "2"
                  memory: "8G"
                requests:
                  cpu: "2"
                  memory: "8G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      # The pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 10
        groupName: gpu-group
        rayStartParams: {}
        # Pod template
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-l4
              cloud.google.com/gke-spot: "true"
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.7.0
                resources:
                  limits:
                    cpu: 4
                    memory: "16G"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 3
                    memory: "12G"
                    nvidia.com/gpu: 1
                    ephemeral-storage: 100Gi
            # Please add the following taints to the GPU node.
            tolerations:
              - key: "ray.io/node-type"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
What happened earlier is that because the workload requested the nvidia.com/gpu resource, Autopilot required us to specify which type of GPU we want, using the cloud.google.com/gke-accelerator node selector key with a value like nvidia-l4.
Update the file, and apply the changes:
kubectl apply -f ray-service.stable-diffusion.yaml
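If you want to confirm the node selector actually made it into the regenerated RayCluster spec, a quick (admittedly crude) grep against the cluster object does the trick:
kubectl get raycluster -o yaml | grep -A 2 nodeSelector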
This time the workload was admitted, which is good news. If we describe the RayCluster again, we’ll see that it’s running (not “failed” like before), and that it created the worker Pod.
$ kubectl get raycluster
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-cc7rg   1
$ kubectl describe raycluster
<truncated>
Events:
  Type    Reason   Age   From                   Message
  ----    ------   ----  ----                   -------
  Normal  Created  29s   raycluster-controller  Created service stable-diffusion-raycluster-cc7rg-head-svc
  Normal  Created  29s   raycluster-controller  Created worker pod
Now we just need to wait for the Pod to be ready. It will take a couple of minutes for GKE to provision the resources. We can watch the Pods like so:
$ watch -d kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg                         1/1     Running   0          9m24s
table-diffusion-raycluster-cc7rg-worker-gpu-group-k9zzg   0/1     Pending   0          39s
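While the worker Pod sits in Pending, Autopilot is provisioning a GPU node for it behind the scenes, and the Pod’s events show that scale-up happening. Here’s one way to peek at them, using the ray.io/node-type=worker label that also appears in the toleration above (adjust the selector if your labels differ):
kubectl describe pods -l ray.io/node-type=worker | tail -n 20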
After a while, it shows as Running.
Every 2.0s: kubectl get pods cs-803911491473-default: Fri Apr 12 00:41:29 2024
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg                         1/1     Running   0          23m
stable-diffusion-raycluster-cc7rg-head-825lw              1/1     Running   0          8m48s
table-diffusion-raycluster-cc7rg-worker-gpu-group-k9zzg   1/1     Running   0          14m
The cluster also now shows “ready”.
$ kubectl get raycluster
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-m4bmp   1                 1                   ready    10m
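Before forwarding anything, it’s worth listing the Services that KubeRay created for us; the one we want is the serve Service that fronts the Ray Serve application (a simple grep, assuming everything is in the default namespace):
kubectl get svc | grep stable-diffusion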
Looks like it’s good to go, so let’s forward the port and give it a try! In one tab, forward the port of the service declared above:
$ kubectl port-forward svc/stable-diffusion-serve-svc 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Now try a request.
curl "http://127.0.0.1:8000/imagine?prompt=englishman%20on%20a%20horse%20riding%20into%20battle.%20masterpiece%2C%20award%20winning" --output image-`date +%s`.png
Here are the results of my effort.
This is the base Stable Diffusion model. To make the output more interesting, like we did with the WebUI, you’d need to grab the code and build on it, for example by using a more interesting model checkpoint.
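One small quality-of-life tip for experimenting with prompts: rather than hand-encoding spaces as %20, you can let curl build the query string for you with --get and --data-urlencode. This sends the same request as above, it’s just easier to tweak:
curl --get --data-urlencode "prompt=englishman on a horse riding into battle. masterpiece, award winning" \
  "http://127.0.0.1:8000/imagine" --output image-`date +%s`.png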
Cleanup
When you’re done, delete the RayService. This will remove the related RayCluster, Pod and Service objects.
kubectl delete rayservice stable-diffusion
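And if you created the Autopilot cluster just for this walkthrough, delete it too so it doesn’t keep accruing charges (same variables as at the start):
gcloud container clusters delete $CLUSTER_NAME --region $REGION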