GKE’s Autopilot mode is designed to be safer to operate, and it prevents many issues that can lead to sudden outages, such as misconfigured firewalls. However, if you’re deliberately trying to simulate a failure, for example a zonal outage, Autopilot’s safeguards can sometimes get in the way and block some of the ways you might run such a test.
It is still possible, though, to simulate the loss of compute capacity in a zone and see how your load balancing and workload respond. Given that this response depends on attributes that are in your control (e.g. number of replicas, zonal spread topology), I believe this is a useful test. We can achieve it with kubectl cordon, kubectl delete pods, and a little scripting.
Let’s create a test of a zonal failure, and see how the workload handles it!
The way we’ll simulate this is to cordon every node in a zone (which prevents Pods from being scheduled in that zone), and then delete every running Pod in the zone. By doing this in a loop, even if Autopilot brings back a node in the zone under test, it is cordoned immediately. The result: no running Pods and no capacity in the zone for the duration of the test.
Here’s an example script (I created this with Gemini 2.5 Pro with a bit of back and forth to fix some issues):
#!/bin/bash

# --- Configuration ---
# Check if a zone parameter is provided.
if [ -z "$1" ]; then
  echo "Error: No target zone specified."
  echo "Usage: $0 <zone>"
  echo "Example: $0 us-west1-a"
  exit 1
fi

# The zone to simulate an outage in, taken from the first command-line argument.
OUTAGE_ZONE="$1"

# Duration of the outage in seconds (30 minutes = 1800 seconds)
OUTAGE_DURATION=1800

# Interval between each disruption cycle in seconds
LOOP_INTERVAL=15
# --- End Configuration ---

# --- Disruption Loop ---
echo "--- Starting continuous zonal disruption for zone: $OUTAGE_ZONE ---"
echo "This will run for $OUTAGE_DURATION seconds. Press Ctrl+C to stop early."

# Calculate the end time for the loop
END_TIME=$(( $(date +%s) + OUTAGE_DURATION ))

while [ $(date +%s) -lt $END_TIME ]; do
  # Get current time for logging
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$TIMESTAMP] Running disruption cycle for zone '$OUTAGE_ZONE'..."

  # 1. Cordon all current and any newly provisioned nodes in the zone.
  echo " -> Cordoning all nodes with label topology.kubernetes.io/zone=$OUTAGE_ZONE"
  kubectl cordon -l topology.kubernetes.io/zone=$OUTAGE_ZONE

  # 2. Force delete all pods on all nodes in the zone.
  echo " -> Getting list of nodes in zone $OUTAGE_ZONE..."
  NODE_NAMES=$(kubectl get nodes -l topology.kubernetes.io/zone=$OUTAGE_ZONE -o jsonpath='{.items[*].metadata.name}')

  if [ -n "$NODE_NAMES" ]; then
    echo " -> Iterating through nodes to delete pods..."
    # Use a for loop to iterate over each node name.
    # The shell will split the NODE_NAMES variable by spaces into individual words.
    for NODE in $NODE_NAMES; do
      echo " - Deleting pods on node: $NODE"
      kubectl delete pods --all-namespaces --field-selector spec.nodeName=$NODE --force --grace-period=0
    done
  else
    echo " -> No nodes found in zone $OUTAGE_ZONE at this time."
  fi

  echo " -> Cycle complete. Waiting for $LOOP_INTERVAL seconds."
  sleep $LOOP_INTERVAL
done

echo "--- Disruption test finished after $OUTAGE_DURATION seconds. ---"
echo "Ceased cordoning and pod deletion in zone $OUTAGE_ZONE."
echo "GKE Autopilot will now stabilize workloads."
What’s going on here? Every 15 seconds, for the outage duration you define, the script cordons all nodes in the target zone, preventing newly scheduled Pods from landing on them, and deletes all running Pods on those nodes without honoring graceful termination. Since Autopilot will naturally try to “heal” by creating new nodes, we run this in a loop so that any new nodes are cordoned before they can be used.
A real zonal failure may have other implications, but if what you want to test is how your workload and load balancing respond to a sudden, sustained loss of capacity in a single zone, this is one way to do it.
Here are the results from my own test run.
Deploy a workload and scale it
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/refs/heads/master/Chapter03/3.2_DeployingToKubernetes/deploy.yaml
kubectl scale deploy/timeserver --replicas 200
Wait for it to be running:
$ kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
timeserver 200/200 200 200 5m
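As an aside, the zonal spread mentioned earlier as an attribute in your control can be made explicit with topologySpreadConstraints on the Deployment. Here’s a hypothetical patch as a sketch only; it wasn’t part of my actual run, applying it triggers a rollout, and it assumes the timeserver Pod template carries the label pod: timeserver-pod (adjust to match your manifest):
# Hypothetical: spread timeserver replicas evenly across zones (assumes the
# Pod template label pod=timeserver-pod; applying this triggers a rollout).
kubectl patch deployment timeserver -p '{
  "spec": {"template": {"spec": {"topologySpreadConstraints": [{
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "ScheduleAnyway",
    "labelSelector": {"matchLabels": {"pod": "timeserver-pod"}}
  }]}}}
}'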
List all the nodes and their zones
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.'topology\.kubernetes\.io/zone'
NAME ZONE
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-5vfv europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-fw2v europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-k772 europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-brld europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-dggm europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-flsc europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-r8df europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-twsn europe-west3-a
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-vn8n europe-west3-a
gk3-autopilot-cluster-3-pool-4-3311a57a-k8pj europe-west3-c
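To see how the replicas are actually distributed across those zones (and to watch the distribution shift during the test), one option is to count running Pods per zone. A rough sketch, assuming the workload runs in the default namespace and using my zone names (substitute your own):
# Count running Pods in the default namespace per zone.
for zone in europe-west3-a europe-west3-b europe-west3-c; do
  total=0
  for node in $(kubectl get nodes -l topology.kubernetes.io/zone="$zone" \
      -o jsonpath='{.items[*].metadata.name}'); do
    pods=$(kubectl get pods --field-selector \
      "spec.nodeName=$node,status.phase=Running" --no-headers 2>/dev/null | wc -l)
    total=$((total + pods))
  done
  echo "$zone: $total running Pods"
done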
Starting the test. Let’s pick a zone with lots of nodes
$ ./outage-test.sh europe-west3-a
--- Starting continuous zonal disruption for zone: europe-west3-a ---
This will run for 1800 seconds. Press Ctrl+C to stop early.
[2025-05-31 05:48:05] Running disruption cycle for zone 'europe-west3-a'...
-> Cordoning all nodes with label topology.kubernetes.io/zone=europe-west3-a
node/gk3-autopilot-cluster-3-nap-x67xdnqn-5b5704dd-rs48 cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-6lk2 cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-cjdk cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-mprh cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-xrnz cordoned
-> Getting list of nodes in zone europe-west3-a...
-> Iterating through nodes to delete pods...
- Deleting pods on node: gk3-autopilot-cluster-3-nap-x67xdnqn-5b5704dd-rs48
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "timeserver-7d54975c9f-sfzhb" force deleted
Note: you can ignore errors like “Error from server (Forbidden): pods "node-local-dns-cnzz5" is forbidden: User cannot delete resource "pods" in API group "" in the namespace "kube-system"”. This is just Autopilot preventing the deletion of system Pods. Your workload’s Pods will still be deleted, and the system Pods will be removed when the cordoned node is deleted.
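If those warnings bother you, one variation (my tweak, not what the script above does) is to limit the delete to the namespaces your workloads actually run in, for example:
# Variation on the script's delete step: target only the default namespace
# instead of --all-namespaces, which avoids the kube-system "Forbidden" warnings.
kubectl delete pods -n default --field-selector spec.nodeName=$NODE --force --grace-period=0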
Observing the nodes in the affected zone being cordoned
~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-5vfv Ready <none> 14m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-fw2v Ready <none> 14m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-k772 Ready <none> 10m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-brld Ready <none> 11m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-dggm Ready <none> 14m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-flsc Ready <none> 14m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-r8df Ready <none> 67m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-p7tt NotReady <none> 1s v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-twsn Ready,SchedulingDisabled <none> 14m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-vn8n Ready,SchedulingDisabled <none> 14m v1.32.3-gke.1927009
gk3-autopilot-cluster-3-pool-4-3311a57a-k8pj Ready <none> 84m v1.32.3-gke.1927009
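If you want to focus on just the zone under test, you can also watch its nodes directly; within a few cycles they should all show SchedulingDisabled, and then disappear as the now-empty, cordoned nodes are removed:
kubectl get nodes -l topology.kubernetes.io/zone=europe-west3-a -w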
Watching the deployment reduce in availability, then recover
$ kubectl get deploy -w
NAME READY UP-TO-DATE AVAILABLE AGE
timeserver 200/200 200 200 123m
timeserver 199/200 199 199 130m
timeserver 198/200 199 198 130m
timeserver 195/200 197 195 130m
timeserver 193/200 198 193 130m
timeserver 193/200 200 193 130m
timeserver 192/200 199 192 130m
timeserver 191/200 199 191 130m
timeserver 191/200 200 191 130m
timeserver 190/200 199 190 130m
timeserver 189/200 199 189 130m
timeserver 187/200 198 187 130m
timeserver 184/200 197 184 130m
timeserver 180/200 196 180 130m
timeserver 174/200 194 174 130m
timeserver 173/200 199 173 130m
timeserver 173/200 200 173 130m
timeserver 172/200 199 172 131m
timeserver 172/200 200 172 131m
timeserver 171/200 199 171 131m
timeserver 171/200 200 171 131m
timeserver 170/200 199 170 131m
timeserver 169/200 199 169 131m
timeserver 167/200 198 167 131m
timeserver 166/200 198 166 131m
timeserver 165/200 198 165 131m
timeserver 163/200 198 163 131m
timeserver 162/200 197 162 131m
timeserver 160/200 197 160 131m
timeserver 152/200 190 152 131m
timeserver 152/200 192 152 131m
timeserver 146/200 194 146 131m
timeserver 146/200 200 146 131m
timeserver 145/200 199 145 131m
timeserver 145/200 200 145 131m
timeserver 144/200 199 144 131m
timeserver 143/200 199 143 131m
timeserver 142/200 199 142 131m
timeserver 141/200 199 141 131m
timeserver 139/200 198 139 131m
timeserver 136/200 197 136 131m
timeserver 130/200 194 130 131m
timeserver 125/200 195 125 131m
timeserver 119/200 194 119 131m
timeserver 119/200 200 119 131m
timeserver 118/200 199 118 131m
timeserver 117/200 199 117 131m
timeserver 115/200 198 115 131m
timeserver 113/200 198 113 131m
timeserver 110/200 197 110 131m
timeserver 107/200 197 107 131m
timeserver 93/200 186 93 131m
timeserver 92/200 199 92 131m
timeserver 92/200 200 92 131m
timeserver 93/200 200 93 131m
timeserver 95/200 200 95 131m
timeserver 96/200 200 96 131m
timeserver 98/200 200 98 131m
timeserver 99/200 200 99 131m
timeserver 100/200 200 100 131m
timeserver 101/200 200 101 131m
timeserver 102/200 200 102 131m
timeserver 103/200 200 103 131m
timeserver 104/200 200 104 131m
timeserver 105/200 200 105 131m
timeserver 106/200 200 106 131m
timeserver 107/200 200 107 131m
timeserver 108/200 200 108 131m
timeserver 109/200 200 109 131m
timeserver 110/200 200 110 131m
timeserver 111/200 200 111 131m
timeserver 112/200 200 112 131m
timeserver 115/200 200 115 131m
timeserver 116/200 200 116 131m
timeserver 117/200 200 117 131m
timeserver 118/200 200 118 131m
timeserver 119/200 200 119 131m
timeserver 120/200 200 120 131m
timeserver 121/200 200 121 131m
timeserver 122/200 200 122 131m
timeserver 123/200 200 123 131m
timeserver 128/200 200 128 131m
timeserver 131/200 200 131 131m
timeserver 132/200 200 132 132m
timeserver 133/200 200 133 132m
timeserver 134/200 200 134 132m
timeserver 136/200 200 136 132m
timeserver 138/200 200 138 132m
timeserver 139/200 200 139 132m
timeserver 144/200 200 144 132m
timeserver 145/200 200 145 132m
timeserver 146/200 200 146 132m
timeserver 148/200 200 148 132m
timeserver 149/200 200 149 132m
timeserver 150/200 200 150 132m
timeserver 152/200 200 152 132m
timeserver 154/200 200 154 132m
timeserver 155/200 200 155 132m
timeserver 156/200 200 156 132m
timeserver 157/200 200 157 132m
timeserver 158/200 200 158 132m
timeserver 159/200 200 159 132m
timeserver 160/200 200 160 132m
timeserver 161/200 200 161 132m
timeserver 162/200 200 162 132m
timeserver 163/200 200 163 132m
timeserver 164/200 200 164 132m
timeserver 165/200 200 165 132m
timeserver 166/200 200 166 132m
timeserver 167/200 200 167 132m
timeserver 166/200 199 166 132m
timeserver 158/200 189 158 132m
timeserver 157/200 199 157 132m
timeserver 152/200 195 152 132m
timeserver 144/200 190 144 132m
timeserver 145/200 190 145 132m
timeserver 149/200 190 149 132m
timeserver 154/200 192 154 132m
timeserver 173/200 200 173 132m
timeserver 177/200 200 177 132m
timeserver 178/200 200 178 132m
timeserver 180/200 200 180 132m
timeserver 181/200 200 181 132m
timeserver 182/200 200 182 132m
timeserver 183/200 200 183 133m
timeserver 182/200 199 182 133m
timeserver 181/200 199 181 133m
timeserver 180/200 199 180 133m
timeserver 178/200 198 178 133m
timeserver 176/200 198 176 133m
timeserver 174/200 198 174 133m
timeserver 170/200 196 170 133m
timeserver 170/200 200 170 133m
timeserver 171/200 200 171 133m
timeserver 172/200 200 172 133m
timeserver 173/200 200 173 133m
timeserver 174/200 200 174 133m
timeserver 175/200 200 175 133m
timeserver 176/200 200 176 133m
timeserver 177/200 200 177 133m
timeserver 178/200 200 178 133m
timeserver 179/200 200 179 133m
timeserver 181/200 200 181 133m
timeserver 182/200 200 182 133m
timeserver 183/200 200 183 133m
timeserver 184/200 200 184 136m
timeserver 183/200 199 183 136m
timeserver 187/200 200 187 136m
timeserver 188/200 200 188 136m
timeserver 190/200 200 190 136m
timeserver 196/200 200 196 136m
timeserver 197/200 200 197 137m
timeserver 198/200 200 198 137m
timeserver 199/200 200 199 137m
timeserver 200/200 200 200 137m
What I observed during the test is that after a short period in which nodes were cordoned and Pods deleted, there were no nodes left in the affected zone (so the script was working). The Deployment dipped in available replicas before recovering, with capacity scaled up in the non-affected zones.
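One bit of housekeeping I’d add (not part of the script above): when the test window ends, any nodes still present in the zone remain cordoned, so you may want to re-enable scheduling on them:
kubectl uncordon -l topology.kubernetes.io/zone=europe-west3-a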