CUDA 12 on GKE Autopilot

Per the NVIDIA docs, CUDA 12 applications require driver 525.60.04+. This driver is available as part of GKE 1.28. To upgrade an existing cluster to the latest version of 1.28:

VERSION="1.28"
REGION="us-central1"
CLUSTER_NAME="autopilot-cluster-1"
gcloud container clusters upgrade "$CLUSTER_NAME" \
    --region "$REGION" \
    --master --cluster-version "$VERSION"

This upgrades the control plane and schedules the nodes to follow. The node upgrade generally completes within a day or two, depending on how many nodes you have.
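To see how far the node upgrade has progressed, you can list each node's kubelet version and pick out the ones still behind. The following is a minimal sketch, not an official procedure: `nodes_behind` is a hypothetical helper name, and the `v1.28.` prefix assumes that kubelet versions are reported in the usual `v<major>.<minor>.<patch>` form.

```shell
# Hypothetical helper: reads "NAME VERSION" lines on stdin and prints the names
# of nodes whose version does not start with the given prefix (e.g. "v1.28.").
nodes_behind() {
  awk -v want="$1" 'index($2, want) != 1 { print $1 }'
}

# Only query the cluster when kubectl is installed; errors (for example, no
# cluster configured) are ignored so the snippet is safe to run anywhere.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes --no-headers --request-timeout=10s \
    -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
    | nodes_behind "v1.28." || true
fi
```

An empty result suggests that every node is already on 1.28. Any names printed are nodes still waiting for their upgrade.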

Until that process completes, new nodes may still be created on the old version. If you want to get to work right away, you can force nodes to be created on the new version by using workload separation: add a node selector and a matching toleration to your workload. Because this forces the creation of a new group of nodes, those nodes get the now-current version. The key and value can be anything (newversion: "yes" here), provided you use the same key and value in both places.

    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        newversion: "yes"
      tolerations:
      - key: newversion
        operator: Equal
        value: "yes"
        effect: NoSchedule

Once the node upgrade has completed, the extra newversion selector and toleration can be safely removed.
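To confirm that a workload landed on a node with a new enough driver, one option is to compare the version reported by nvidia-smi with the 525.60.04 minimum quoted above. This is a rough sketch under a few assumptions: `driver_ok` is a hypothetical helper name, `sort -V` (GNU coreutils version sort) is available, and nvidia-smi is on the PATH inside the container.

```shell
# Minimum driver version quoted at the top of this post.
MIN_DRIVER="525.60.04"

# Hypothetical helper: succeeds if the given driver version is >= MIN_DRIVER.
# After a version sort the smaller version comes first, so the driver is new
# enough when that first entry is MIN_DRIVER itself.
driver_ok() {
  [ "$(printf '%s\n%s\n' "$MIN_DRIVER" "$1" | sort -V | head -n 1)" = "$MIN_DRIVER" ]
}

# Only query the GPU when nvidia-smi is present (for example, inside a GPU pod).
if command -v nvidia-smi >/dev/null 2>&1; then
  current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_ok "$current"; then
    echo "driver $current is new enough for CUDA 12"
  else
    echo "driver $current is older than $MIN_DRIVER"
  fi
fi
```

If you run this in a pod scheduled with the selector and toleration above, it should report a driver at or above the minimum. An older version suggests the pod is still on a node that has not been upgraded.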