TensorFlow on GKE Autopilot with GPU acceleration

Last week, GKE announced GPU support for Autopilot. Here’s a fun way to try it out: a TensorFlow-enabled Jupyter Notebook with GPU acceleration! We can even add state, so you can save your work between sessions. Autopilot makes all this really easy, as you can configure everything as a Kubernetes object.

Watch my presentation where I demo this example!

Setup

First, create a GKE Autopilot cluster running 1.24 (1.24.2-gke.1800+ to be exact). Be sure you’re in one of the regions with GPUs (the pricing table helpfully shows which regions).

To create:

CLUSTER_NAME=test-cluster
REGION=us-west1
gcloud container clusters create-auto $CLUSTER_NAME \
    --release-channel "rapid" --region $REGION \
    --cluster-version "1.24"

Or upgrade an existing cluster:

CLUSTER_NAME=test-cluster
REGION=us-west1
gcloud container clusters upgrade $CLUSTER_NAME \
    --region $REGION \
    --master --cluster-version "1.24" 
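
If you’re not sure whether an existing cluster already meets the 1.24.2-gke.1800 minimum, a small version comparison can tell you. This is just a sketch: version_at_least is a helper name I made up, and it assumes GNU sort with the -V (version sort) flag is available.

```shell
# Succeeds if version $1 is at least version $2.
# Assumes GNU sort: -V compares the numeric parts of strings such as
# "1.24.2-gke.1800" as numbers, so gke.300 sorts before gke.1800.
version_at_least() {
  # The smaller of the two versions sorts first; if that is the
  # required version, the version we have is new enough.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage against a live cluster (needs gcloud, variables as set above):
#   current=$(gcloud container clusters describe "$CLUSTER_NAME" \
#       --region "$REGION" --format="value(currentMasterVersion)")
#   version_at_least "$current" "1.24.2-gke.1800" && echo "new enough for GPUs"
```

The commented usage reads the control plane version from gcloud and feeds it to the helper.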

Installation

Now we can deploy a TensorFlow-enabled Jupyter Notebook with GPU acceleration.

The following StatefulSet definition creates an instance of the tensorflow/tensorflow:latest-gpu-jupyter container, which gives us a Jupyter notebook in a TensorFlow environment. It provisions an NVIDIA A100 GPU and mounts a PersistentVolume to the /tf/saved path, so anything you save there persists between restarts. It also runs on Spot, so you save 60-91% (and remember, our work is saved if the Pod is preempted). This is a legit Jupyter Notebook that you can use long term!

# Tensorflow/Jupyter StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tensorflow
spec:
  selector:
    matchLabels:
      pod: tensorflow-pod
  serviceName: tensorflow
  replicas: 1
  template:
    metadata:
      labels:
        pod: tensorflow-pod
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
        cloud.google.com/gke-spot: "true"
      terminationGracePeriodSeconds: 30
      containers:
      - name: tensorflow-container
        image: tensorflow/tensorflow:latest-gpu-jupyter
        volumeMounts:
        - name: tensorflow-pvc
          mountPath: /tf/saved
        resources:
          requests:
            nvidia.com/gpu: "1"
            ephemeral-storage: 10Gi
## Optional: override and set your own token
#        env:
#          - name: JUPYTER_TOKEN
#            value: "jupyter"
  volumeClaimTemplates:
  - metadata:
      name: tensorflow-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
---
# Headless service for the above StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: tensorflow
spec:
  ports:
  - port: 8888
  clusterIP: None
  selector:
    pod: tensorflow-pod

tensorflow.yaml

We also need a load balancer, so we can connect to this notebook from our desktop:

# External service
apiVersion: "v1"
kind: "Service"
metadata:
  name: tensorflow-jupyter
spec:
  ports:
  - protocol: "TCP"
    port: 80
    targetPort: 8888
  selector:
    pod: tensorflow-pod
  type: LoadBalancer

tensorflow-jupyter.yaml

Deploy them both like so:

kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/tensorflow/tensorflow.yaml
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/tensorflow/tensorflow-jupyter.yaml

While we’re waiting, we can watch the events in the cluster to make sure it’s going to work, like so (output truncated to show relevant events):

$ kubectl get events -w
LAST SEEN   TYPE      REASON                         OBJECT                                              MESSAGE
5m25s       Warning   FailedScheduling               pod/tensorflow-0                                    0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
4m24s       Normal    TriggeredScaleUp               pod/tensorflow-0                                    pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/gke-autopilot-test/zones/us-west1-b/instanceGroups/gk3-test-cluster-nap-1ax02924-9c722205-grp 0->1 (max: 1000)}]
2m13s       Normal    Scheduled                      pod/tensorflow-0                                    Successfully assigned default/tensorflow-0 to gk3-test-cluster-nap-1ax02924-9c722205-lzgj

The way Kubernetes and Autopilot work is that you’ll initially see FailedScheduling, because at the moment you deploy there is no node in the cluster that can handle your Pod. Then you’ll see TriggeredScaleUp, which is Autopilot adding that capacity for you, and finally Scheduled once the Pod has the resources. GPU nodes take a little longer than regular CPU nodes to provision, and this container takes a little while to boot. In my case it took about 5 minutes all up from scheduling the Pod to it running.
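
If you’d rather not watch the event stream, kubectl wait can block until the Pod is ready. A minimal sketch: wait_for_notebook is my own wrapper name, and the 600s default is an assumption based on the roughly 5 minutes it took in my case.

```shell
# Blocks until the notebook Pod reports Ready, or the timeout expires.
# The Pod name tensorflow-0 is the StatefulSet name plus ordinal 0.
# Optional first argument: a timeout such as 300s or 10m (default 600s).
wait_for_notebook() {
  kubectl wait --for=condition=Ready pod/tensorflow-0 --timeout="${1:-600s}"
}

# Usage: wait_for_notebook 10m
```

Note that kubectl wait returns an error straight away if the Pod object doesn’t exist yet, so run it after the kubectl create commands above.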

Using the Notebook

Now it’s time to connect. First, get the external IP of the load balancer:

$ kubectl get svc
NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
kubernetes           ClusterIP      10.102.0.1     <none>         443/TCP        20d
tensorflow           ClusterIP      None           <none>         80/TCP         9m4s
tensorflow-jupyter   LoadBalancer   10.102.2.107   34.127.75.81   80:31790/TCP   8m35s
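
To grab just the address (handy for scripting), a jsonpath query on the tensorflow-jupyter Service works too. This is a sketch: notebook_url is a hypothetical helper name, and it assumes the load balancer reports an IP rather than a hostname.

```shell
# Prints the notebook URL once the load balancer has an external IP.
# Returns non-zero if the lookup comes back empty, for example while
# the external IP is still pending.
notebook_url() {
  local ip
  ip=$(kubectl get svc tensorflow-jupyter \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$ip" ] && echo "http://$ip"
}

# Usage: notebook_url || echo "no external IP yet, try again shortly"
```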

And browse to it.

The Jupyter login page asks for a token and suggests a command for finding it, which we can run in Kubernetes with exec:

$ kubectl exec -it sts/tensorflow -- jupyter notebook list
Currently running servers:
http://0.0.0.0:8888/?token=e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715 :: /tf

Log in by copying the token (in my case, e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715) into the input box and hitting “Log In”.
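
If copying the token by hand gets tedious, you can pull it out of that output with sed. A rough sketch: extract_token is a name I made up, and it assumes the URL format shown in the output above.

```shell
# Reads the output of `jupyter notebook list` on stdin and prints the
# token of the first server listed. Lines without a token are ignored.
extract_token() {
  sed -n 's/.*[?&]token=\([^ &]*\).*/\1/p' | head -n1
}

# Usage (drop -it when piping, so no terminal control characters sneak in):
#   kubectl exec sts/tensorflow -- jupyter notebook list | extract_token
```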

Note: if you want to skip this step, you can set your own token in the configuration. Just uncomment the env lines in the StatefulSet and define your own token.

There are 2 folders: one with some included samples, and “saved”, which is the one we mounted from a persistent disk. I recommend operating out of the “saved” folder to preserve your state between sessions, and moving the included “tensorflow-tutorials” directory into the “saved” directory before getting started. You can use the Jupyter UI to move the folder and upload your own notebooks.

Let’s try running a few of the included samples.

The classification.ipynb example
The overfit_and_underfit.ipynb example

We can upload our own projects, like the examples in the TensorFlow docs. Just download the notebook from the docs, upload it in Jupyter to the saved/ folder, and run it.

TensorFlow basics.ipynb tutorial, utilizing GPU acceleration

So there it is. We have a reusable TensorFlow Jupyter notebook running on an NVIDIA A100! This isn’t just a toy either: we hooked up a PersistentVolume so your work is saved (even if the StatefulSet is deleted, or the Pod disrupted). We’re using Spot compute to save some cash. And the entire thing was provisioned from 2 YAML files, with no need to think about the underlying compute hardware. Neat!

Monitoring & Troubleshooting

If you get a message like “The kernel appears to have died. It will restart automatically.”, then the first step is to tail your logs.

kubectl logs tensorflow-0 -f

A common issue I saw was that when trying to run two notebooks, I would exhaust my GPU’s memory (CUDA_ERROR_OUT_OF_MEMORY in the logs). The easy fix is to shut down all but the notebook you are actively using.

You can keep an eye on the GPU utilization like so:

$ kubectl exec -it sts/tensorflow -- bash
# watch -d nvidia-smi
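
For a one-shot reading instead of the live view, nvidia-smi can print memory figures as CSV, which is easy to turn into a percentage. A sketch only: gpu_mem_percent is a helper name I invented, and it expects the "used, total" layout that the query in the usage comment produces.

```shell
# Converts "used, total" memory pairs in MiB (one line per GPU) into a
# whole-number percentage of memory in use. Lines with a zero or
# missing total are skipped.
gpu_mem_percent() {
  awk -F', *' '$2 > 0 { printf "%d%%\n", ($1 / $2) * 100 }'
}

# Usage (no -it here, since the output is piped):
#   kubectl exec sts/tensorflow -- nvidia-smi \
#       --query-gpu=memory.used,memory.total \
#       --format=csv,noheader,nounits | gpu_mem_percent
```

If the number sits near 100% before you run anything, an idle notebook kernel is probably still holding GPU memory.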

If you need to restart the setup for whatever reason, just delete the pod and Kubernetes will recreate it. This is very fast on Autopilot, as the GPU-enabled node resource will hang around for a short time in the cluster.

kubectl delete pod tensorflow-0

What’s Next

To shell into the environment and run arbitrary code (i.e. without using the notebook UI), you can use the following. Just be sure to save any data you want to persist in /tf/saved/.

kubectl exec -it sts/tensorflow -- bash

If you want some more tutorials, check out the TensorFlow tutorials and Keras.

I cloned the Keras repo onto my persistent volume to have all those tutorials in my notebook as well.

$ kubectl exec -it sts/tensorflow -- bash
# cd /tf/saved
# git clone https://github.com/keras-team/keras-io.git
# pip install pandas

If you need any additional Python modules for your notebooks, like Pandas, you can set that up the same way. To create a more durable setup, though, you’ll want your own Dockerfile extending the one we used above (let me know if you want to share such a recipe in a follow-up post).

I ran a few different examples. Here’s some of the output:

The output of the Keras timeseries/ipynb/timeseries_weather_forecasting.ipynb example
A random epoch iteration in the Keras generative/ipynb/text_generation_with_miniature_gpt.ipynb example

Cleanup

When you’re done, clean up by removing the StatefulSet and services:

kubectl delete sts tensorflow
kubectl delete svc tensorflow tensorflow-jupyter

Again, the nice thing about Autopilot is that deleting the Kubernetes resources (in this case a StatefulSet and LoadBalancer) will end the associated charges.

That just leaves the persistent disk. You can either keep it around (so that if you re-create the above StatefulSet, it will be re-attached and your work will be saved), or if you no longer need it, then go ahead and delete the disk as well.

kubectl delete persistentvolumeclaim/tensorflow-pvc-tensorflow-0
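
In case that PVC name looks arbitrary: Kubernetes names a StatefulSet’s claims after the volume claim template, the StatefulSet, and the Pod ordinal, in that order. Here’s a tiny sketch that spells it out (pvc_name is just an illustrative helper, not a kubectl feature):

```shell
# Builds the PVC name a StatefulSet creates for one of its Pods:
#   <volumeClaimTemplate name>-<StatefulSet name>-<Pod ordinal>
pvc_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Usage: kubectl delete pvc "$(pvc_name tensorflow-pvc tensorflow 0)"
```

You can always run kubectl get pvc to see the real names before deleting anything.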

You can delete the cluster if you don’t need it anymore, as that does have its own charge (though the first GKE cluster is free).

Can’t wait to see what you do with this. If you want, tweet your creation at @WilliamDenniss.