# LLM Model Serving on Autopilot


When building your business using LLMs as a key component, you may wish to be a master of your own domain and run your own model. Running your own LLM protects you from changes like pricing increases or API availability with third-party services, guarantees the privacy of your data (no data needs to leave your VPC), and lets you experiment with advanced topics like fine-tuning. GKE, and in particular Autopilot mode is a great place to run your own LLM, as it takes care of much of the compute orchestration for you, allowing you to focus your energy on providing the LLM API for your business.

In this post, I take the [GKE LLM tutorial](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-multiple-gpu) and tweak it to run in Autopilot mode. Only one change is needed, which is to mount a generic ephemeral volume for the data volume path, as Autopilot is a Pod-based model.

Firstly, we’ll need a cluster with NVIDIA L4 support. According to the [docs](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus), that means using GKE version 1.28.3-gke.1203000 or later. We also need to pick a [region with L4 support](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones). Passing “1.28” in the version string is enough to get the latest patch of 1.28 which will suffice here \[2024-12 update: passing the version with `--cluster-version "1.28"` is no longer needed as current versions are newer\], and us-west1 has L4s so let’s use that. Putting it together:

```bash
CLUSTER_NAME=llm-test
REGION=us-west1
gcloud container clusters create-auto $CLUSTER_NAME \
    --region $REGION
```

Now, deploy your LLM. We can follow the tutorial but use a slightly modified deployment. Here’s the deployment for Falcon:

```yaml {hl_lines=[20, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: ghcr.io/huggingface/text-generation-inference:1.1.0
        resources:
          limits:
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-40b-instruct
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /data
            name: data
      volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: data
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: data-volume
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "premium-rwo"
                resources:
                  requests:
                    storage: 200Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
```

[text-generation-inference.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/falcon/text-generation-inference.yaml)

Note the all important `nodeSelector` to give us a spot NVIDIA L4 GPU. On Autopilot it’s that easy, just add those 2 lines and GKE will do the rest. Under the hood, Autopilot will match this particular workload with a g2-standard-24 machine with the 2 L4 GPUs requested by this workload.

Clone the [repo](https://github.com/WilliamDenniss/autopilot-examples)

```bash
git clone https://github.com/WilliamDenniss/autopilot-examples.git
```

then create the Deployment

```bash
cd autopilot-examples/llm/falcon
kubectl apply -f text-generation-inference.yaml
```

Wait for the Pod to become Ready. It will take a little time, as a few things are happening:

- The node is being provisioned
- GPU drivers are installed
- The container is downloaded (during the “ContainerCreating” phase)

Pro tip: to speed up the ContainerCreating phase, mirror the container into Artifact Registry, then it will only take 10 seconds thanks to [GKE image streaming](https://cloud.google.com/blog/products/containers-kubernetes/introducing-container-image-streaming-in-gke)!

```
watch -d kubectl get pods,nodes
```

Once the container is running, it will appear similar to:

```
$ kubectl get pods,nodes
NAME                       READY   STATUS    RESTARTS   AGE
pod/llm-68fd95c699-7mg45   1/1     Running   0          6m32s

NAME                                           STATUS   ROLES    AGE     VERSION
node/gk3-llm-test-nap-1uj8n2g9-7bc7556f-k2lp   Ready    <none>   3m56s   v1.28.3-gke.1203001
node/gk3-llm-test-pool-1-6211caf5-gmjx         Ready    <none>   5m46s   v1.28.3-gke.1203001
node/gk3-llm-test-pool-2-2bb74d90-dbj2         Ready    <none>   2m55s   v1.28.3-gke.1203001
```

With the container Running, we can watch the logs:

```bash
kubectl logs -l app=llm -f
```

It will take a while for the container to be ready to serve (more than 7 minutes) while the model is prepared. The text you’re looking for in the logs to indicate the model is ready for serving is similar to:

```
{..."message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":213}
{..."message":"Setting max batch total tokens to 120224","target":"text_generation_router","filename":"router/src/main.rs","line_number":246}
{..."message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":247
```

Once the model is ready, you can continue with the [rest of the tutorial](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-multiple-gpu#cluster_ip_address) including adding a service, and a gradio interface, like so:

```bash
kubectl apply -f llm-service.yaml
kubectl apply -f gradio.yaml
```

With the gradio service created, wait for the external IP:

```
$ kubectl get svc -w
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
gradio-service   LoadBalancer   34.118.239.125   <pending>     80:31805/TCP   0s
kubernetes       ClusterIP      34.118.224.1     <none>        443/TCP        36m
llm-service      ClusterIP      34.118.228.81    <none>        80/TCP         3s
gradio-service   LoadBalancer   34.118.239.125   <pending>     80:31805/TCP   1s
gradio-service   LoadBalancer   34.118.239.125   34.127.67.228   80:31805/TCP   37s
```

Navigating to the external IP, we can try out some prompts, like “You’re a helpful assistant. Tell me about Kubernetes.”

![](Screenshot-2023-12-21-at-16.55.38.png)

Tip: if you don’t want to create an external IP, then don’t create the LoadBalancer (remove [these lines](https://github.com/WilliamDenniss/autopilot-examples/blob/884fbf6694c0f3ede6b4560d3ddda648c1f631eb/llm/falcon/gradio.yaml#L34-L44) from the gradio YAML), and instead forward a port on your machine directly to the Deployment for local experimentation without exposing your work to the internet.

```bash
kubectl port-forward deploy/gradio 7860:7860
```

**Next steps**: to try the LLaMA model, follow the instructions in the [tutorial](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-multiple-gpu#llama-2-70b) under “prepare your model” to get the hugging face token. Once your secret is created, use my [modified Deployment](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/llama/text-generation-interface.yaml) to get the correct storage configuration.

![](Llama.jpg)