HA 3-zone Deployments with PodSpreadTopology on Autopilot

PodSpreadTopology (configured through the topologySpreadConstraints field of the Pod spec) is a way to get Kubernetes to spread your Pods across failure domains, typically nodes or zones. Kubernetes platforms typically have some default spread built in, although it may not be as aggressive as you want (that is, it may tolerate a more imbalanced spread).

Here’s an example Deployment with a PodSpreadTopology that asks the scheduler for an even spread over all the zones it knows about (I also have a writeup on this topic which you can preview in my book).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 3
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              pod: timeserver-pod
      containers:
      - name: timeserver-container
        image: docker.io/wdenniss/timeserver:1
        resources:
          requests:
            cpu: 200m
            memory: 250Mi
          limits:
            cpu: 300m
            memory: 400Mi

ha-deployment.yaml

Since Autopilot clusters are regional, typically spanning 3 zones, you might expect such a spread topology rule to result in Pods being scheduled to all 3 zones. However, this isn’t always the case. By default Autopilot will use 2 zones for HA, and this is a moment where the layers of the GKE/Kubernetes system interact in an uncooperative way.

PodSpreadTopology is a Kubernetes concept, and the scheduler only knows about zones that already have nodes. If your cluster only has nodes in 2 zones, the scheduler sees 2 zones and spreads over those 2, but not the third, which it doesn’t know about. To fix that, you simply need to ensure that you have at least 1 Pod running in each of the 3 zones in your region (see also Using GKE Autopilot in specific zones). There are a couple of easy ways to do that: either once, to bootstrap the zones, or with a more permanent Deployment. The former is cheaper (it runs once) and works just as well, provided you scale up your workload before the placeholder terminates and you don’t scale down below the number of zones you have (i.e. you always have at least 3 replicas, which is the stated goal anyway!).
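
If the workload is autoscaled, one way to hold that replica floor is a HorizontalPodAutoscaler whose minReplicas matches the zone count. This is only a sketch, assuming the timeserver Deployment from ha-deployment.yaml is scaled on CPU; the maxReplicas value and the utilization target are illustrative assumptions, not settings taken from this post.

```yaml
# Sketch only: keeps the timeserver Deployment at 3 or more replicas.
# maxReplicas and averageUtilization are assumed values; tune them for your workload.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: timeserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: timeserver
  minReplicas: 3        # never scale below the number of zones
  maxReplicas: 9        # assumed upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # assumed target
```

Note that a floor of 3 replicas does not by itself guarantee one replica per zone, since the spread rule above uses ScheduleAnyway; it only prevents the workload from shrinking below the zone count.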

Ensuring you are using all 3 zones in Autopilot

To bootstrap Autopilot in 3 zones so the PodSpreadTopology can do its thing, we’re going to run a “pause” Job in each of the 3 zones for a duration of 5h (18000 seconds), meaning you’ll have 5 hours to complete your deployment (change the time as needed).

This sample was built for us-west1, my own favorite region. You can run it as-is in that region. For any other region, find and replace “us-west1” with your desired region before use, or your Pods will not schedule.

apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-a
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-a"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-b
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-b"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-c
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-c"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never

zonal-placeholder-job.yaml

To deploy:

kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/all-zones/zonal-placeholder-job.yaml

Now that it’s deployed, we can check that it worked and that we have a Pod in each of our 3 zones.

wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                                                 NOMINATED NODE   READINESS GATES
placeholder-job-zone-a-bhdw7   1/1     Running   0          27m   10.67.1.2     gk3-autopilot-cluster-1-nap-7sxf27zy-adbad4ea-zv4k   <none>           <none>
placeholder-job-zone-b-5njdw   1/1     Running   0          27m   10.67.0.139   gk3-autopilot-cluster-1-nap-7sxf27zy-84b15ccd-gjcs   <none>           <none>
placeholder-job-zone-c-z96rz   1/1     Running   0          27m   10.67.0.75    gk3-autopilot-cluster-1-nap-7sxf27zy-98e28c49-r2m9   <none>           <none>
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl describe node gk3-autopilot-cluster-1-nap-7sxf27zy-adbad4ea-zv4k | grep zone
                    failure-domain.beta.kubernetes.io/zone=us-west1-a
                    topology.gke.io/zone=us-west1-a
                    topology.kubernetes.io/zone=us-west1-a
  default                     placeholder-job-zone-a-bhdw7    250m (12%)    250m (12%)  512Mi (8%)       512Mi (8%)     27m
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl describe node gk3-autopilot-cluster-1-nap-7sxf27zy-84b15ccd-gjcs | grep zone
                    failure-domain.beta.kubernetes.io/zone=us-west1-b
                    topology.gke.io/zone=us-west1-b
                    topology.kubernetes.io/zone=us-west1-b
  default                     placeholder-job-zone-b-5njdw             250m (12%)    250m (12%)  512Mi (8%)       512Mi (8%)     27m
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl describe node gk3-autopilot-cluster-1-nap-7sxf27zy-98e28c49-r2m9 | grep zone
                    failure-domain.beta.kubernetes.io/zone=us-west1-c
                    topology.gke.io/zone=us-west1-c
                    topology.kubernetes.io/zone=us-west1-c
  default                     placeholder-job-zone-c-z96rz           250m (12%)    250m (12%)  512Mi (8%)       512Mi (8%)     27m

The catch with using a 5 hour Job here is that you need to schedule your actual workload (at least 1 replica per zone) within that time, otherwise you’re back to square one. For bonus marks you could also let your main workload preempt this Job so that it gets evicted right away (by adding a priorityClassName, like we did here).
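
As a rough sketch of that priorityClassName idea: define a PriorityClass with a priority below the default of 0 and reference it from the placeholder Job. The class name placeholder-priority and the value -10 are assumptions chosen for illustration; only the zone-a Job is shown, and the other two zones would need the same one-line addition.

```yaml
# Hypothetical low-priority class; the name and value are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-priority
value: -10                # below the default Pod priority of 0
preemptionPolicy: Never   # placeholder Pods never preempt other Pods
globalDefault: false
description: "Low priority for placeholder Pods that regular workloads may preempt."
---
apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-a
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      priorityClassName: placeholder-priority   # the only change from the Job above
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-a"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never
```

Preemption only kicks in when the scheduler cannot otherwise fit a higher-priority Pod, so the placeholder is evicted when your workload actually needs its space rather than unconditionally.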

If you prefer a more permanent setup over the 5h window, you can use a Deployment instead of a Job. This method guarantees you always have 3 zones, and will set you back 750m CPU and 1536Mi of memory in resource requests. Again, find and replace us-west1 with your own region.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder-zone-a
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: placeholder-zone-a-pod
  template:
    metadata:
      labels:
        pod: placeholder-zone-a-pod
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-a"
      containers:
        - name: ubuntu-container
          image: ubuntu
          command: ["sleep"]
          args: ["infinity"]
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
              ephemeral-storage: 10Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder-zone-b
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: placeholder-zone-b-pod
  template:
    metadata:
      labels:
        pod: placeholder-zone-b-pod
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-b"
      containers:
        - name: ubuntu-container
          image: ubuntu
          command: ["sleep"]
          args: ["infinity"]
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
              ephemeral-storage: 10Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder-zone-c
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: placeholder-zone-c-pod
  template:
    metadata:
      labels:
        pod: placeholder-zone-c-pod
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-c"
      containers:
        - name: ubuntu-container
          image: ubuntu
          command: ["sleep"]
          args: ["infinity"]
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
              ephemeral-storage: 10Mi

zonal-placeholder-deployment.yaml