Preferring Spot in GKE Autopilot

Spot Pods are a great way to save money on Autopilot, currently 70% off the regular price. The catch is two-fold:

  1. Your workload can be disrupted
  2. There may not always be spot capacity available

For workload disruption, it's simply a judgement call: you should only run workloads that can accept disruption (abrupt termination). If you have a batch job that would lose hours of work, it's not a good fit. Generally, StatefulSets shouldn't be run on spot compute, as most stateful workloads are less tolerant of disruption.

To solve the capacity issue, a common request is to prefer spot compute but not require it. That is, to use spot when it's available and fall back to regular capacity when it's not.

(One word of caution: this technique is not advised for truly critical workloads, as there may be a correlation between spot capacity being reclaimed and a stockout of capacity in the zone. This is somewhat mitigated by the fact that Autopilot can provision nodes in any of the region's three zones, but running critical applications this way is still not recommended.)

In Autopilot, spot compute is requested via a nodeSelector or node affinity. This opens the door to using preferredDuringSchedulingIgnoredDuringExecution to prefer spot. There is a catch, however: the Kubernetes scheduler will run your Pod on an existing non-spot node with spare capacity before Autopilot adds a spot node. To prevent this, we can combine a preferred node affinity with workload separation. With workload separation, your Spot Pods won't run on existing non-spot capacity in the cluster; instead, a spot node will be provisioned (and only if there is no spot capacity will a regular node be created instead). Here's an example Deployment that puts this together:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spotexample
spec:
  replicas: 10
  selector:
    matchLabels:
      pod: spotexample-pod
  template:
    metadata:
      labels:
        pod: spotexample-pod
    spec:
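      # Tolerate the taint that Autopilot places on the workload-separated nodes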
      tolerations:
      - key: group
        operator: Equal
        value: spot-preferred
        effect: NoSchedule
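      # Workload separation: request dedicated nodes labeled group=spot-preferred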
      nodeSelector:
        group: spot-preferred
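      # Prefer Spot nodes for this workload, but don't require them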
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-spot
                operator: In
                values:
                - "true"
      containers:
      - name: timeserver-container
        image: docker.io/wdenniss/timeserver:1
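
Once the Deployment is rolled out, you can check where the Pods actually landed. One quick way (a sketch, assuming you have kubectl access to the cluster) is to list the nodes with the cloud.google.com/gke-spot label shown as a column, then see which node each Pod was scheduled on:

kubectl get nodes -L cloud.google.com/gke-spot
kubectl get pods -l pod=spotexample-pod -o wide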

In the event that there are no Spot nodes available, non-Spot nodes will be created, which is the value of preferring, but not requiring, spot. Due to the way cluster scaling works, however, these non-Spot nodes will likely stay around for the duration of the workload, even after Spot capacity becomes available again, and even following events like updates and workload scale-up/down (which may not be ideal). If you have a workload in this situation and want to get it back onto 100% Spot nodes, simply change the value of the toleration and nodeSelector to a new one, e.g. changing spot-preferred in the example above to spot-preferred2, and you'll get a fresh set of nodes (Spot ones, if available).
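
For example, the relevant fragment of the Pod template would then look like the following (only the value changes; everything else stays the same):

      tolerations:
      - key: group
        operator: Equal
        value: spot-preferred2
        effect: NoSchedule
      nodeSelector:
        group: spot-preferred2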