Pod topology spread constraints (topologySpreadConstraints) are a way to get Kubernetes to spread your pods across a failure domain, typically nodes or zones. Kubernetes platforms typically have some default spread built in, although it may not be as aggressive as you want (that is, it may tolerate more imbalance than you’d like).
Here’s an example Deployment with a topology spread constraint that will result in an even spread over all zones (I also have a writeup on this topic which you can preview in my book).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 3
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            pod: timeserver-pod
      containers:
      - name: timeserver-container
        image: docker.io/wdenniss/timeserver:1
        resources:
          requests:
            cpu: 200m
            memory: 250Mi
          limits:
            cpu: 300m
            memory: 400Mi
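Note that whenUnsatisfiable: ScheduleAnyway makes this a soft preference: the scheduler favors an even spread, but will still place pods when the spread can’t be achieved. If you’d rather have pods stay Pending than violate the spread, DoNotSchedule is the strict variant. A minimal sketch of the same constraint in strict mode:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # strict: pods stay Pending rather than skew the spread
  labelSelector:
    matchLabels:
      pod: timeserver-pod

Strict mode trades schedulability for guaranteed balance, so ScheduleAnyway is usually the safer default for serving workloads.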
Since Autopilot clusters are regional, typically spanning 3 zones, you might expect such a spread rule to result in pods being scheduled to all 3 zones. However, this isn’t always the case. By default, Autopilot provisions nodes in 2 zones for HA, and this is a case where the layers of the GKE/Kubernetes system interact in an uncooperative way.
Topology spread is a Kubernetes concept, so if your cluster only has nodes in 2 zones, the scheduler only sees 2 zones: it will spread over those 2, but not the third, which it doesn’t know about. To fix that, you simply need to ensure that you have at least 1 Pod running in each of the 3 zones in your region (see also Using GKE Autopilot in specific zones). There are a couple of easy ways to make that the case, and you can do this either once to bootstrap the zones, or with a more permanent Deployment. The former is cheaper (it runs once), and works just as well provided you scale up your workload before the placeholder terminates, and you don’t scale down below the number of zones you have (i.e. you always have at least 3 replicas, which is the stated goal anyway!).
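You can quickly verify which zones your cluster currently has nodes in using the standard zone label:

kubectl get nodes -L topology.kubernetes.io/zone

If only 2 zones appear in the ZONE column, the scheduler only has those 2 zones to spread over.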
Ensuring you are using all 3 zones in Autopilot
To bootstrap Autopilot in 3 zones so the topology spread constraint can do its thing, we’re going to run a “pause” Job in each of the 3 zones for a duration of 5h (18000 seconds), meaning you’ll have 5 hours to complete your deployment (change the time as needed).
This sample was built for us-west1, my own favorite region. You can run it as-is in that region; otherwise, make sure to find and replace “us-west1” with your desired region before use, or your Pods will not schedule.
apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-a
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-a"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-b
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-b"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  name: placeholder-job-zone-c
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-c"
      terminationGracePeriodSeconds: 0
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["18000"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
      restartPolicy: Never
To deploy:
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/all-zones/zonal-placeholder-job.yaml
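If you’re using a different region, you can do the find-and-replace inline instead. For example, to deploy in europe-west1 (an arbitrary choice; substitute your own region, and note this assumes its zones are suffixed a, b, and c, as most are):

curl -s https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/all-zones/zonal-placeholder-job.yaml | sed 's/us-west1/europe-west1/g' | kubectl create -f -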
Now that it’s deployed, we can check that it worked and that we have a pod in each of our 3 zones.
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
placeholder-job-zone-a-bhdw7 1/1 Running 0 27m 10.67.1.2 gk3-autopilot-cluster-1-nap-7sxf27zy-adbad4ea-zv4k <none> <none>
placeholder-job-zone-b-5njdw 1/1 Running 0 27m 10.67.0.139 gk3-autopilot-cluster-1-nap-7sxf27zy-84b15ccd-gjcs <none> <none>
placeholder-job-zone-c-z96rz 1/1 Running 0 27m 10.67.0.75 gk3-autopilot-cluster-1-nap-7sxf27zy-98e28c49-r2m9 <none> <none>
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl describe node gk3-autopilot-cluster-1-nap-7sxf27zy-adbad4ea-zv4k | grep zone
failure-domain.beta.kubernetes.io/zone=us-west1-a
topology.gke.io/zone=us-west1-a
topology.kubernetes.io/zone=us-west1-a
default placeholder-job-zone-a-bhdw7 250m (12%) 250m (12%) 512Mi (8%) 512Mi (8%) 27m
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl describe node gk3-autopilot-cluster-1-nap-7sxf27zy-84b15ccd-gjcs | grep zone
failure-domain.beta.kubernetes.io/zone=us-west1-b
topology.gke.io/zone=us-west1-b
topology.kubernetes.io/zone=us-west1-b
default placeholder-job-zone-b-5njdw 250m (12%) 250m (12%) 512Mi (8%) 512Mi (8%) 27m
wdenniss@cloudshell:~/autopilot-examples/all-zones (gke-autopilot-test)$ kubectl describe node gk3-autopilot-cluster-1-nap-7sxf27zy-98e28c49-r2m9 | grep zone
failure-domain.beta.kubernetes.io/zone=us-west1-c
topology.gke.io/zone=us-west1-c
topology.kubernetes.io/zone=us-west1-c
default placeholder-job-zone-c-z96rz 250m (12%) 250m (12%) 512Mi (8%) 512Mi (8%) 27m
The catch with using a 5-hour Job here is that you need to schedule your actual workload (at least 1 replica per zone) in that time, otherwise you’re back to square one. For bonus marks, you could also make this Job preemptible by your main workload so it gets evicted right away (by adding a priorityClassName, like we did here).
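A minimal sketch of that approach (the class name placeholder-priority is just an example; pick your own): create a low-priority PriorityClass, then reference it from each placeholder Job’s pod spec so your real workload can preempt the placeholders.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-priority  # example name
value: -10                    # below the default of 0, so regular pods can preempt these
preemptionPolicy: Never       # placeholder pods never preempt anything themselves
globalDefault: false
description: "Low priority class for placeholder pods."

Then add priorityClassName: placeholder-priority to each Job’s template spec (alongside nodeSelector).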
If, rather than the 5h window, you prefer a more permanent setup, then instead of a Job you can use a Deployment. This method guarantees you always have nodes in all 3 zones, and will set you back 750m CPU and 1536MiB worth of resource requests. Again, find and replace us-west1 with your own region.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder-zone-a
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: placeholder-zone-a-pod
  template:
    metadata:
      labels:
        pod: placeholder-zone-a-pod
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-a"
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder-zone-b
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: placeholder-zone-b-pod
  template:
    metadata:
      labels:
        pod: placeholder-zone-b-pod
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-b"
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder-zone-c
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: placeholder-zone-c-pod
  template:
    metadata:
      labels:
        pod: placeholder-zone-c-pod
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "us-west1-c"
      containers:
      - name: ubuntu-container
        image: ubuntu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            ephemeral-storage: 10Mi
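As before, deploy with kubectl create. Assuming you’ve saved these manifests locally as placeholder-deploy.yaml:

kubectl create -f placeholder-deploy.yaml

Since these pods sleep forever, they’ll keep a node running in each zone indefinitely; delete the Deployments when you no longer need the zone guarantee.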