Now that you’ve tried out Stable Diffusion on GKE Autopilot via the WebUI, you might be wondering how to add Stable Diffusion as a proper micro-service that other components of your application can call. One popular way is via Ray. Let’s run through the Ray tutorial Serve a StableDiffusion text-to-image model on Kubernetes, on GKE Autopilot. Here goes:
Create an Autopilot Cluster
You’ll want version 1.28 or later for this, to get the newer CUDA drivers. This will do the trick:
CLUSTER_NAME=stable-diffusion
VERSION="1.28"
REGION=us-central1
gcloud container clusters create-auto $CLUSTER_NAME \
--region $REGION --release-channel rapid \
--cluster-version $VERSION
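Once the cluster is up, point kubectl at it (this assumes you’re in the same shell where the variables above are set):
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION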
Deploy the KubeRay operator
To use Ray on GKE Autopilot, we first need to deploy the KubeRay operator. This lets us create Ray objects directly in GKE, adding RayService and RayCluster to the built-in Kubernetes workload types such as Deployment and StatefulSet. Follow just step 2 of the Ray getting started guide, reproduced here:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.0.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0
Wait until the KubeRay operator’s Pod status is “Running”, which indicates that it’s ready for us to create Ray objects. We can observe the Pod like so:
$ kubectl get pods -w
NAME                                READY   STATUS              RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg   0/1     ContainerCreating   0          2m35s
kuberay-operator-756cf8c9b6-wcnqg   0/1     Running             0          3m14s
Or simply wait until it’s ready with:
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kuberay-operator
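If you want an extra sanity check, you can also confirm that the Helm chart installed the KubeRay CRDs (these are what give us the RayService and RayCluster types mentioned above):
kubectl get crds | grep ray.io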
Run the Stable Diffusion RayService
With KubeRay configured, we can now run the service. Here I’m going to run the one from the tutorial, unchanged. This won’t work the first time, but I want to demonstrate how to fix it.
This is what the service looks like:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: stable_diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
          pip: ["diffusers==0.12.1"]
  rayClusterConfig:
    rayVersion: '2.7.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.7.0
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "2"
                  memory: "8G"
                requests:
                  cpu: "2"
                  memory: "8G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      # The pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 10
        groupName: gpu-group
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.7.0
                resources:
                  limits:
                    cpu: 4
                    memory: "16G"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 3
                    memory: "12G"
                    nvidia.com/gpu: 1
            # Please add the following taints to the GPU node.
            tolerations:
              - key: "ray.io/node-type"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
Download, and apply:
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray-service.stable-diffusion.yaml
kubectl apply -f ray-service.stable-diffusion.yaml
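The RayService takes a moment to reconcile. If you want to keep an eye on the RayService object itself while that happens, these are plain kubectl commands against the CRDs we installed earlier:
kubectl get rayservice stable-diffusion
kubectl describe rayservice stable-diffusion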
This will not work out of the box, but I want to show you why, and how to fix it. Let’s dig in:
$ kubectl get rayclusters
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-cc7rg                                         failed   3s
$ kubectl describe raycluster stable-diffusion-raycluster-wg5p2
Name:         stable-diffusion-raycluster-wg5p2
Namespace:    default
Labels:       app.kubernetes.io/created-by=rayservice
              ray.io/service=stable-diffusion
Annotations:  ray.io/cluster-hash: N6FJA58A50TVP42R7LM9EGISIG6RK5KL
              ray.io/enableAgentService: true
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-02-17T03:57:03Z
  Generation:          1
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  stable-diffusion
    UID:                   5fcd5349-d14e-45a0-88bf-b186cd2df207
  Resource Version:  1278352
  UID:               daf9b4da-9fb5-489e-b57e-4bde1c66bab6
Spec:
  Head Group Spec:
    Ray Start Params:
      Dashboard - Host:  0.0.0.0
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  rayproject/ray-ml:2.7.0
          Name:   ray-head
          Ports:
            Container Port:  6379
            Name:            gcs
            Protocol:        TCP
            Container Port:  8265
            Name:            dashboard
            Protocol:        TCP
            Container Port:  10001
            Name:            client
            Protocol:        TCP
            Container Port:  8000
            Name:            serve
            Protocol:        TCP
          Resources:
            Limits:
              Cpu:     2
              Memory:  8G
            Requests:
              Cpu:     2
              Memory:  8G
          Volume Mounts:
            Mount Path:  /tmp/ray
            Name:        ray-logs
        Volumes:
          Empty Dir:
          Name:  ray-logs
  Ray Version:  2.7.0
  Worker Group Specs:
    Group Name:    gpu-group
    Max Replicas:  10
    Min Replicas:  1
    Ray Start Params:
    Replicas:  1
    Scale Strategy:
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  rayproject/ray-ml:2.7.0
          Name:   ray-worker
          Resources:
            Limits:
              Cpu:             4
              Memory:          16G
              nvidia.com/gpu:  1
            Requests:
              Cpu:             3
              Memory:          12G
              nvidia.com/gpu:  1
        Tolerations:
          Effect:    NoSchedule
          Key:       ray.io/node-type
          Operator:  Equal
          Value:     worker
Status:
  Head:
  Reason:  admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-gpu-limitation]":["You must specify a GPU type with node selector 'cloud.google.com/gke-accelerator' when GPU is requested on Autopilot workloads; supported values are: [nvidia-a100-80gb, nvidia-l4, nvidia-tesla-a100, nvidia-tesla-t4]."]}
The violation details at the end are the clue. As that message notes, on Autopilot you can’t just request a generic GPU; you have to say which type you want with the cloud.google.com/gke-accelerator node selector. I’m going to go with the L4 here, which is a popular GPU for inference workloads.
Add the required node selector for Nvidia L4 to the workload. I’m also going to specify Spot since this isn’t a production workload:
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
  cloud.google.com/gke-spot: "true"
Note that the above tutorial’s config requests that you add taints to the GPU node (“Please add the following taints to the GPU node.”). In Autopilot you don’t have to manage nodes, and the platform will automatically do all this for you. Neat!
Here’s the final RayService with our two node selectors added, plus an ephemeral-storage request to make sure we get a big enough node (in Autopilot, you need to tell the system what resources you need).
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: stable_diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
          pip: ["diffusers==0.12.1"]
  rayClusterConfig:
    rayVersion: '2.7.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.7.0
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "2"
                  memory: "8G"
                requests:
                  cpu: "2"
                  memory: "8G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      # The pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 10
        groupName: gpu-group
        rayStartParams: {}
        # Pod template
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-l4
              cloud.google.com/gke-spot: "true"
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.7.0
                resources:
                  limits:
                    cpu: 4
                    memory: "16G"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 3
                    memory: "12G"
                    nvidia.com/gpu: 1
                    ephemeral-storage: 100Gi
            # Please add the following taints to the GPU node.
            tolerations:
              - key: "ray.io/node-type"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
What happened earlier is that because the workload requested the nvidia.com/gpu resource, Autopilot required us to specify which type of GPU we want, using the cloud.google.com/gke-accelerator node selector key with a value like nvidia-l4.
Update the file, and apply the changes:
kubectl apply -f ray-service.stable-diffusion.yaml
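If you want to confirm the node selector actually made it into the regenerated RayCluster spec, a quick (admittedly crude) grep against the cluster object does the trick:
kubectl get raycluster -o yaml | grep -A 2 nodeSelector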
This time the workload was admitted, which is good news. If we describe the RayCluster again, we’ll see that it’s running (not “failed” like before), and that it created the worker Pod.
$ kubectl get raycluster
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-cc7rg   1
$ kubectl describe raycluster
<truncated>
Events:
  Type    Reason   Age   From                   Message
  ----    ------   ----  ----                   -------
  Normal  Created  29s   raycluster-controller  Created service stable-diffusion-raycluster-cc7rg-head-svc
  Normal  Created  29s   raycluster-controller  Created worker pod
Now we just need to wait for the Pod to be ready. It will take a couple of minutes for GKE to provision the resources. We can watch the Pods like so:
$ watch -d kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg                         1/1     Running   0          9m24s
table-diffusion-raycluster-cc7rg-worker-gpu-group-k9zzg   0/1     Pending   0          39s
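While the worker Pod sits in Pending, Autopilot is provisioning a GPU node for it behind the scenes, and the Pod’s events show that scale-up happening. Here’s one way to peek at them, using the ray.io/node-type=worker label that also appears in the toleration above (adjust the selector if your labels differ):
kubectl describe pods -l ray.io/node-type=worker | tail -n 20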
After a while, it shows as Running.
Every 2.0s: kubectl get pods cs-803911491473-default: Fri Apr 12 00:41:29 2024
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-756cf8c9b6-wcnqg                         1/1     Running   0          23m
stable-diffusion-raycluster-cc7rg-head-825lw              1/1     Running   0          8m48s
table-diffusion-raycluster-cc7rg-worker-gpu-group-k9zzg   1/1     Running   0          14m
The cluster also now shows “ready”.
$ kubectl get raycluster
NAME                                DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
stable-diffusion-raycluster-m4bmp   1                 1                   ready    10m
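Before forwarding anything, it’s worth listing the Services that KubeRay created for us; the one we want is the serve Service that fronts the Ray Serve application (a simple grep, assuming everything is in the default namespace):
kubectl get svc | grep stable-diffusion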
Looks like it’s good to go, so let’s forward the port and give it a try! In one tab, forward the port of the service declared above:
$ kubectl port-forward svc/stable-diffusion-serve-svc 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Now try a request.
curl "http://127.0.0.1:8000/imagine?prompt=englishman%20on%20a%20horse%20riding%20into%20battle.%20masterpiece%2C%20award%20winning" --output image-`date +%s`.png
Here are the results of my effort.
This is the base Stable Diffusion model. To make the output more interesting, like we did with the WebUI, you’d need to grab the code and build on it, for example by using a more interesting model checkpoint.
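One small quality-of-life tip for experimenting with prompts: rather than hand-encoding spaces as %20, you can let curl build the query string for you with --get and --data-urlencode. This sends the same request as above, it’s just easier to tweak:
curl --get --data-urlencode "prompt=englishman on a horse riding into battle. masterpiece, award winning" \
  "http://127.0.0.1:8000/imagine" --output image-`date +%s`.png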
Cleanup
When you’re done, delete the RayService. This will remove the related RayCluster, Pod and Service objects.
kubectl delete rayservice stable-diffusion
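And if you created the Autopilot cluster just for this walkthrough, delete it too so it doesn’t keep accruing charges (same variables as at the start):
gcloud container clusters delete $CLUSTER_NAME --region $REGION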