# Serving Stable Diffusion with RayServe on GKE Autopilot

Now that you’ve [tried out Stable Diffusion on GKE Autopilot via the WebUI](/k8s/stable-diffusion-webui-on-gke-autopilot/), you might be wondering how you’d go about adding Stable Diffusion as a proper micro-service that other components of your application can call. One popular way is via Ray.

Let’s try this tutorial: [Serve a StableDiffusion text-to-image model on Kubernetes](https://docs.ray.io/en/latest/cluster/kubernetes/examples/stable-diffusion-rayservice.html), on GKE Autopilot. Here goes:

## Create an Autopilot Cluster

You’ll want version 1.28 or later for this, to get the newer CUDA drivers. This will do the trick:

```shell
CLUSTER_NAME=stable-diffusion
VERSION="1.28"
REGION=us-central1
gcloud container clusters create-auto $CLUSTER_NAME \
    --region $REGION --release-channel rapid \
    --cluster-version $VERSION
```

## Deploy the KubeRay operator

To use Ray on GKE Autopilot, we first need to deploy the KubeRay operator. This allows us to create Ray objects directly in GKE (adding RayService and RayCluster to the built-in Kubernetes workload types such as Deployment and StatefulSet).

Follow just [step 2](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html#kuberay-operator-deploy) of the Ray getting started guide, reproduced here:

```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.0.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0
```

Wait until the KubeRay operator’s Pod status is “Running”, which indicates that it’s ready for us to create Ray objects.
We can observe the Pod like so:

```shell
$ kubectl get pods -w
NAME                                READY   STATUS              RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg   0/1     ContainerCreating   0          2m35s
kuberay-operator-756cf8c9b6-wcnqg   0/1     Running             0          3m14s
```

Or simply wait until it’s ready with:

```shell
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kuberay-operator
```

## Run the stable diffusion RayService

With KubeRay configured, we can now run the service. Here I’m going to try running the one from the tutorial, unchanged. This won’t work the first time, but I want to demonstrate how to fix it. This is what the service looks like:

```yaml {hl_lines=[74]}
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: stable_diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
          pip: ["diffusers==0.12.1"]
  rayClusterConfig:
    rayVersion: '2.7.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:2.7.0
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: "2"
                memory: "8G"
              requests:
                cpu: "2"
                memory: "8G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
    # The pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: gpu-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.7.0
            resources:
              limits:
                cpu: 4
                memory: "16G"
                nvidia.com/gpu: 1
              requests:
                cpu: 3
                memory: "12G"
                nvidia.com/gpu: 1
          # Please add the following taints to the GPU node.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"
```

Download and apply it:

```shell
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray-service.stable-diffusion.yaml
kubectl apply -f ray-service.stable-diffusion.yaml
```

**This will not work out of the box, but I want to show you why, and how to fix it. Let’s dig in:**

```shell
$ kubectl get rayclusters
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-cc7rg                                         failed   3s
```

```shell {hl_lines=[88, 89]}
$ kubectl describe raycluster stable-diffusion-raycluster-wg5p2
Name:         stable-diffusion-raycluster-wg5p2
Namespace:    default
Labels:       app.kubernetes.io/created-by=rayservice
              ray.io/service=stable-diffusion
Annotations:  ray.io/cluster-hash: N6FJA58A50TVP42R7LM9EGISIG6RK5KL
              ray.io/enableAgentService: true
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-02-17T03:57:03Z
  Generation:          1
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  stable-diffusion
    UID:                   5fcd5349-d14e-45a0-88bf-b186cd2df207
  Resource Version:        1278352
  UID:                     daf9b4da-9fb5-489e-b57e-4bde1c66bab6
Spec:
  Head Group Spec:
    Ray Start Params:
      Dashboard - Host:  0.0.0.0
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  rayproject/ray-ml:2.7.0
          Name:   ray-head
          Ports:
            Container Port:  6379
            Name:            gcs
            Protocol:        TCP
            Container Port:  8265
            Name:            dashboard
            Protocol:        TCP
            Container Port:  10001
            Name:            client
            Protocol:        TCP
            Container Port:  8000
            Name:            serve
            Protocol:        TCP
          Resources:
            Limits:
              Cpu:     2
              Memory:  8G
            Requests:
              Cpu:     2
              Memory:  8G
          Volume Mounts:
            Mount Path:  /tmp/ray
            Name:        ray-logs
        Volumes:
          Empty Dir:
          Name:  ray-logs
  Ray Version:   2.7.0
  Worker Group Specs:
    Group Name:    gpu-group
    Max Replicas:  10
    Min Replicas:  1
    Ray Start Params:
    Replicas:  1
    Scale Strategy:
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  rayproject/ray-ml:2.7.0
          Name:   ray-worker
          Resources:
            Limits:
              Cpu:             4
              Memory:          16G
              nvidia.com/gpu:  1
            Requests:
              Cpu:             3
              Memory:          12G
              nvidia.com/gpu:  1
        Tolerations:
          Effect:    NoSchedule
          Key:       ray.io/node-type
          Operator:  Equal
          Value:     worker
Status:
  Head:
  Reason:  admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-gpu-limitation]":["You must specify a GPU type with node selector 'cloud.google.com/gke-accelerator' when GPU is requested on Autopilot workloads; supported values are: [nvidia-a100-80gb, nvidia-l4, nvidia-tesla-a100, nvidia-tesla-t4]."]}
```

Here’s the clue behind the error:

**Violations details: {“\[denied by autogke-gpu-limitation\]”:\[“You must specify a GPU type with node selector ‘cloud.google.com/gke-accelerator’ when GPU is requested on Autopilot workloads; supported values are: \[nvidia-a100-80gb, nvidia-l4, nvidia-tesla-a100, nvidia-tesla-t4\].”\]}**

As the error notes, you can’t just request a “gpu”; you need to pick a specific one. I’m going to go with L4 here, which is a popular GPU for inference workloads.

Add the [required node selector](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus) for Nvidia L4 to the workload. I’m also going to specify Spot since this isn’t a production workload:

```yaml
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
  cloud.google.com/gke-spot: "true"
```

Note that the tutorial’s config asks you to add taints to the GPU node (“Please add the following taints to the GPU node.”). In Autopilot you don’t have to manage nodes, and the platform will automatically do all this for you. Neat!

Here’s the final RayService with our two node selectors added, plus an ephemeral-storage request to make sure we get a big enough node (in Autopilot, you need to tell the system what resources you need).

```yaml {hl_lines=[63, 64, 65, 78]}
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: stable_diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
          pip: ["diffusers==0.12.1"]
  rayClusterConfig:
    rayVersion: '2.7.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:2.7.0
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: "2"
                memory: "8G"
              requests:
                cpu: "2"
                memory: "8G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
    # The pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: gpu-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4
            cloud.google.com/gke-spot: "true"
          containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.7.0
            resources:
              limits:
                cpu: 4
                memory: "16G"
                nvidia.com/gpu: 1
              requests:
                cpu: 3
                memory: "12G"
                nvidia.com/gpu: 1
                ephemeral-storage: 100Gi
          # Please add the following taints to the GPU node.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"
```

What happened here is that, since the GPU resource was specified, Autopilot needs you to specify the type of GPU you want using the `cloud.google.com/gke-accelerator` key and a value like `nvidia-l4`.

Update the file, and apply the changes:

```shell
kubectl apply -f ray-service.stable-diffusion.yaml
```

This time the workload was admitted, which is good news. If we look at the RayCluster again, we’ll see that it is no longer “failed” like before, and that it created the Pod.

```shell
$ kubectl get raycluster
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-cc7rg   1

$ kubectl describe raycluster
Events:
  Type    Reason   Age   From                   Message
  ----    ------   ----  ----                   -------
  Normal  Created  29s   raycluster-controller  Created service stable-diffusion-raycluster-cc7rg-head-svc
  Normal  Created  29s   raycluster-controller  Created worker pod
```

Now we just need to wait for the Pod to be ready. It will take a couple of minutes for GKE to provision the resources. We can watch the Pods like so:

```shell
$ watch -d kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg                         1/1     Running   0          9m24s
stable-diffusion-raycluster-cc7rg-worker-gpu-group-k9zzg  0/1     Pending   0          39s
```

After a while, it shows as Running.

```shell
Every 2.0s: kubectl get pods    cs-803911491473-default: Fri Apr 12 00:41:29 2024

NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg                         1/1     Running   0          23m
stable-diffusion-raycluster-cc7rg-head-825lw              1/1     Running   0          8m48s
stable-diffusion-raycluster-cc7rg-worker-gpu-group-k9zzg  1/1     Running   0          14m
```

The cluster also now shows “ready”.

```shell
~$ kubectl get raycluster
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-m4bmp   1                 1                   ready    10m
```

Looks like it’s good to go. Let’s forward the port and give it a try!
In one tab, forward the port of the service created for the RayService above:

```shell
$ kubectl port-forward svc/stable-diffusion-serve-svc 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
```

In another tab, try a request:

```shell
curl "http://127.0.0.1:8000/imagine?prompt=englishman%20on%20a%20horse%20riding%20into%20battle.%20masterpiece%2C%20award%20winning" --output image-`date +%s`.png
```

Here’s the result of my effort.

![](image-1712891309.png)

This is the base Stable Diffusion model. To make it more interesting, like we did with the WebUI, you’d need to grab the [code](https://github.com/ray-project/serve_config_examples) and build on it, for example by using a more interesting model checkpoint.

## Cleanup

When you’re done, delete the RayService. This will remove the related RayCluster, Pod and Service objects.

```shell
kubectl delete rayservice stable-diffusion
```
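
## Calling the service from code

Since the point of the exercise was a micro-service that other components can call, here is a minimal Python sketch of a client that makes the same request as the curl command earlier. It only works while the RayService is still deployed and the port-forward is active. The function names (`build_imagine_url`, `fetch_image`, `save_image`) are my own, hypothetical helpers, and I’m assuming the endpoint returns PNG bytes, as the `.png` output in the curl example suggests. Inside the cluster you’d swap the base URL for the Service’s address instead of `127.0.0.1`.

```python
"""Minimal client sketch for the /imagine endpoint (helper names are illustrative)."""
import time
import urllib.parse
import urllib.request

# Assumption: the port-forward from the walkthrough is listening on localhost:8000.
BASE_URL = "http://127.0.0.1:8000"


def build_imagine_url(prompt: str, base_url: str = BASE_URL) -> str:
    """Return the request URL with the prompt percent-encoded, like the curl example."""
    encoded = urllib.parse.quote(prompt, safe="")
    return f"{base_url}/imagine?prompt={encoded}"


def fetch_image(prompt: str, base_url: str = BASE_URL, timeout: float = 300.0) -> bytes:
    """GET the endpoint and return the raw response body (assumed to be PNG bytes)."""
    url = build_imagine_url(prompt, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_image(prompt: str, base_url: str = BASE_URL) -> str:
    """Fetch an image and write it to image-<unix time>.png; return the file name."""
    path = f"image-{int(time.time())}.png"
    with open(path, "wb") as f:
        f.write(fetch_image(prompt, base_url))
    return path


# Example usage (requires the port-forward to be running):
#   save_image("englishman on a horse riding into battle. masterpiece, award winning")
```

The generous timeout is a guess on my part: image generation can take a while, particularly on the first request, so tune it to what you observe.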