Simulate a zonal failure on Autopilot

GKE’s Autopilot mode is designed to be safer to operate, and it prevents many of the issues that can lead to sudden outages, such as misconfigured firewalls. However, if you actually want to simulate a failure, for example as part of a zonal failure drill, Autopilot’s safeguards can get in the way by blocking some of the usual ways to run such a test.

It is still possible, though, to simulate the loss of compute capacity in a zone and see how your load balancing and workload respond. Given that the response is largely determined by attributes under your control (e.g. the number of replicas, zonal spread topology, etc.), I believe this is a useful test. We can achieve it with kubectl cordon, kubectl delete pods, and a little scripting.
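
Both of those attributes are easy to inspect on the Deployment itself. As a quick check of the replica count and any explicit topology spread constraints (the deployment name is a placeholder; an empty second line simply means no explicit constraints are set and the scheduler falls back to its default spreading):

kubectl get deploy <deployment-name> -o jsonpath='{.spec.replicas}{"\n"}{.spec.template.spec.topologySpreadConstraints}{"\n"}'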

Let’s create a test of a zonal failure, and see how the workload handles it!

The way we’ll simulate this is to cordon every node in a zone (which prevents Pods from being scheduled there), and then delete every running Pod in that zone. By doing this in a loop, even if Autopilot brings up a new node in the zone under test, it is cordoned immediately. The result: no running Pods and no usable capacity in the zone for the duration of the test.
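
At its core, each disruption cycle is just two commands, which the script below wraps in a loop (shown here with placeholders; the Pod deletion is run once per node in the zone):

kubectl cordon -l topology.kubernetes.io/zone=<zone>
kubectl delete pods --all-namespaces --field-selector spec.nodeName=<node-name> --force --grace-period=0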

Here’s an example script (I created this with Gemini 2.5 Pro, after a bit of back and forth to fix some issues):

#!/bin/bash

# --- Configuration ---
# Check if a zone parameter is provided.
if [ -z "$1" ]; then
  echo "Error: No target zone specified."
  echo "Usage: $0 <zone>"
  echo "Example: $0 us-west1-a"
  exit 1
fi

# The zone to simulate an outage in, taken from the first command-line argument.
OUTAGE_ZONE="$1"

# Duration of the outage in seconds (30 minutes = 1800 seconds)
OUTAGE_DURATION=1800

# Interval between each disruption cycle in seconds
LOOP_INTERVAL=15
# --- End Configuration ---

# --- Disruption Loop ---
echo "--- Starting continuous zonal disruption for zone: $OUTAGE_ZONE ---"
echo "This will run for $OUTAGE_DURATION seconds. Press Ctrl+C to stop early."

# Calculate the end time for the loop
END_TIME=$(( $(date +%s) + OUTAGE_DURATION ))

while [ $(date +%s) -lt $END_TIME ]; do
  # Get current time for logging
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$TIMESTAMP] Running disruption cycle for zone '$OUTAGE_ZONE'..."

  # 1. Cordon all current and any newly provisioned nodes in the zone.
  echo "  -> Cordoning all nodes with label topology.kubernetes.io/zone=$OUTAGE_ZONE"
  kubectl cordon -l "topology.kubernetes.io/zone=$OUTAGE_ZONE"

  # 2. Force delete all pods on all nodes in the zone.
  echo "  -> Getting list of nodes in zone $OUTAGE_ZONE..."
  NODE_NAMES=$(kubectl get nodes -l "topology.kubernetes.io/zone=$OUTAGE_ZONE" -o jsonpath='{.items[*].metadata.name}')

  if [ -n "$NODE_NAMES" ]; then
    echo "  -> Iterating through nodes to delete pods..."
    # Use a for loop to iterate over each node name.
    # The shell will split the NODE_NAMES variable by spaces into individual words.
    for NODE in $NODE_NAMES; do
      echo "    - Deleting pods on node: $NODE"
      kubectl delete pods --all-namespaces --field-selector "spec.nodeName=$NODE" --force --grace-period=0
    done
  else
    echo "  -> No nodes found in zone $OUTAGE_ZONE at this time."
  fi

  echo "  -> Cycle complete. Waiting for $LOOP_INTERVAL seconds."
  sleep $LOOP_INTERVAL
done

echo "--- Disruption test finished after $OUTAGE_DURATION seconds. ---"
echo "Ceased cordoning and pod deletion in zone $OUTAGE_ZONE."
echo "GKE Autopilot will now stabilize workloads."

What’s going on here? Every 15 seconds, for the outage duration you define, the script cordons all nodes in a single zone, preventing newly scheduled Pods from landing on them, and then deletes all running Pods on those nodes without honoring graceful termination. Since Autopilot will naturally try to “heal” by creating new nodes, running this in a loop ensures any new nodes are cordoned before they can be used.
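
While the loop is running, you can spot-check that no workload Pods remain in the affected zone by listing the Pods on each of its nodes, using the same selectors the script relies on (replace <zone> with the zone under test):

for NODE in $(kubectl get nodes -l topology.kubernetes.io/zone=<zone> -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE
done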

A real zonal failure may have other implications, but if what you want to test is how your workload and load balancing respond to a sudden and sustained loss of capacity in a single zone, this is one way to do it.

Here are the results from my own test run.

Deploy a workload and scale it

kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/refs/heads/master/Chapter03/3.2_DeployingToKubernetes/deploy.yaml
kubectl scale deploy/timeserver --replicas 200

Wait for it to be running:

$ kubectl get deploy
NAME         READY     UP-TO-DATE   AVAILABLE   AGE
timeserver   200/200   200          200         5m
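
Alternatively, kubectl rollout status will block until the Deployment is fully rolled out, which saves polling:

kubectl rollout status deploy/timeserver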

List all the nodes and their zones

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.'topology\.kubernetes\.io/zone'    
NAME                                                 ZONE
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-5vfv   europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-fw2v   europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-k772   europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-brld   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-dggm   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-flsc   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-r8df   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-twsn   europe-west3-a
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-vn8n   europe-west3-a
gk3-autopilot-cluster-3-pool-4-3311a57a-k8pj         europe-west3-c
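
To get a rough picture of how the replicas are spread before the test, one option is to count Pods per node and cross-reference the zone list above:

kubectl get pods -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c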

Starting the test. Let’s pick a zone with lots of nodes

$ ./outage-test.sh europe-west3-a
--- Starting continuous zonal disruption for zone: europe-west3-a ---
This will run for 1800 seconds. Press Ctrl+C to stop early.
[2025-05-31 05:48:05] Running disruption cycle for zone 'europe-west3-a'...
  -> Cordoning all nodes with label topology.kubernetes.io/zone=europe-west3-a
node/gk3-autopilot-cluster-3-nap-x67xdnqn-5b5704dd-rs48 cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-6lk2 cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-cjdk cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-mprh cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-xrnz cordoned
  -> Getting list of nodes in zone europe-west3-a...
  -> Iterating through nodes to delete pods...
    - Deleting pods on node: gk3-autopilot-cluster-3-nap-x67xdnqn-5b5704dd-rs48
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "timeserver-7d54975c9f-sfzhb" force deleted

Note: you can ignore errors like “Error from server (Forbidden): pods "node-local-dns-cnzz5" is forbidden: User cannot delete resource "pods" in API group "" in the namespace "kube-system"”. This is just Autopilot preventing the deletion of system Pods. Your workload’s Pods will still be deleted, and the system Pods will be removed when the node itself is deleted after being cordoned.

Observing the nodes from the affected zone being cordoned

$ kubectl get nodes
NAME                                                 STATUS                     ROLES    AGE   VERSION
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-5vfv   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-fw2v   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-k772   Ready                      <none>   10m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-brld   Ready                      <none>   11m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-dggm   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-flsc   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-r8df   Ready                      <none>   67m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-p7tt   NotReady                   <none>   1s    v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-twsn   Ready,SchedulingDisabled   <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-vn8n   Ready,SchedulingDisabled   <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-pool-4-3311a57a-k8pj         Ready                      <none>   84m   v1.32.3-gke.1927009
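
To focus on just the zone under test, you can filter the node list with the same label the script uses:

kubectl get nodes -l topology.kubernetes.io/zone=europe-west3-a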

Watching the deployment dip in availability, then recover

$ kubectl get deploy -w
NAME         READY     UP-TO-DATE   AVAILABLE   AGE
timeserver   200/200   200          200         123m
timeserver   199/200   199          199         130m
timeserver   198/200   199          198         130m
timeserver   195/200   197          195         130m
timeserver   193/200   198          193         130m
timeserver   193/200   200          193         130m
timeserver   192/200   199          192         130m
timeserver   191/200   199          191         130m
timeserver   191/200   200          191         130m
timeserver   190/200   199          190         130m
timeserver   189/200   199          189         130m
timeserver   187/200   198          187         130m
timeserver   184/200   197          184         130m
timeserver   180/200   196          180         130m
timeserver   174/200   194          174         130m
timeserver   173/200   199          173         130m
timeserver   173/200   200          173         130m
timeserver   172/200   199          172         131m
timeserver   172/200   200          172         131m
timeserver   171/200   199          171         131m
timeserver   171/200   200          171         131m
timeserver   170/200   199          170         131m
timeserver   169/200   199          169         131m
timeserver   167/200   198          167         131m
timeserver   166/200   198          166         131m
timeserver   165/200   198          165         131m
timeserver   163/200   198          163         131m
timeserver   162/200   197          162         131m
timeserver   160/200   197          160         131m
timeserver   152/200   190          152         131m
timeserver   152/200   192          152         131m
timeserver   146/200   194          146         131m
timeserver   146/200   200          146         131m
timeserver   145/200   199          145         131m
timeserver   145/200   200          145         131m
timeserver   144/200   199          144         131m
timeserver   143/200   199          143         131m
timeserver   142/200   199          142         131m
timeserver   141/200   199          141         131m
timeserver   139/200   198          139         131m
timeserver   136/200   197          136         131m
timeserver   130/200   194          130         131m
timeserver   125/200   195          125         131m
timeserver   119/200   194          119         131m
timeserver   119/200   200          119         131m
timeserver   118/200   199          118         131m
timeserver   117/200   199          117         131m
timeserver   115/200   198          115         131m
timeserver   113/200   198          113         131m
timeserver   110/200   197          110         131m
timeserver   107/200   197          107         131m
timeserver   93/200    186          93          131m
timeserver   92/200    199          92          131m
timeserver   92/200    200          92          131m
timeserver   93/200    200          93          131m
timeserver   95/200    200          95          131m
timeserver   96/200    200          96          131m
timeserver   98/200    200          98          131m
timeserver   99/200    200          99          131m
timeserver   100/200   200          100         131m
timeserver   101/200   200          101         131m
timeserver   102/200   200          102         131m
timeserver   103/200   200          103         131m
timeserver   104/200   200          104         131m
timeserver   105/200   200          105         131m
timeserver   106/200   200          106         131m
timeserver   107/200   200          107         131m
timeserver   108/200   200          108         131m
timeserver   109/200   200          109         131m
timeserver   110/200   200          110         131m
timeserver   111/200   200          111         131m
timeserver   112/200   200          112         131m
timeserver   115/200   200          115         131m
timeserver   116/200   200          116         131m
timeserver   117/200   200          117         131m
timeserver   118/200   200          118         131m
timeserver   119/200   200          119         131m
timeserver   120/200   200          120         131m
timeserver   121/200   200          121         131m
timeserver   122/200   200          122         131m
timeserver   123/200   200          123         131m
timeserver   128/200   200          128         131m
timeserver   131/200   200          131         131m
timeserver   132/200   200          132         132m
timeserver   133/200   200          133         132m
timeserver   134/200   200          134         132m
timeserver   136/200   200          136         132m
timeserver   138/200   200          138         132m
timeserver   139/200   200          139         132m
timeserver   144/200   200          144         132m
timeserver   145/200   200          145         132m
timeserver   146/200   200          146         132m
timeserver   148/200   200          148         132m
timeserver   149/200   200          149         132m
timeserver   150/200   200          150         132m
timeserver   152/200   200          152         132m
timeserver   154/200   200          154         132m
timeserver   155/200   200          155         132m
timeserver   156/200   200          156         132m
timeserver   157/200   200          157         132m
timeserver   158/200   200          158         132m
timeserver   159/200   200          159         132m
timeserver   160/200   200          160         132m
timeserver   161/200   200          161         132m
timeserver   162/200   200          162         132m
timeserver   163/200   200          163         132m
timeserver   164/200   200          164         132m
timeserver   165/200   200          165         132m
timeserver   166/200   200          166         132m
timeserver   167/200   200          167         132m
timeserver   166/200   199          166         132m
timeserver   158/200   189          158         132m
timeserver   157/200   199          157         132m
timeserver   152/200   195          152         132m
timeserver   144/200   190          144         132m
timeserver   145/200   190          145         132m
timeserver   149/200   190          149         132m
timeserver   154/200   192          154         132m
timeserver   173/200   200          173         132m
timeserver   177/200   200          177         132m
timeserver   178/200   200          178         132m
timeserver   180/200   200          180         132m
timeserver   181/200   200          181         132m
timeserver   182/200   200          182         132m
timeserver   183/200   200          183         133m
timeserver   182/200   199          182         133m
timeserver   181/200   199          181         133m
timeserver   180/200   199          180         133m
timeserver   178/200   198          178         133m
timeserver   176/200   198          176         133m
timeserver   174/200   198          174         133m
timeserver   170/200   196          170         133m
timeserver   170/200   200          170         133m
timeserver   171/200   200          171         133m
timeserver   172/200   200          172         133m
timeserver   173/200   200          173         133m
timeserver   174/200   200          174         133m
timeserver   175/200   200          175         133m
timeserver   176/200   200          176         133m
timeserver   177/200   200          177         133m
timeserver   178/200   200          178         133m
timeserver   179/200   200          179         133m
timeserver   181/200   200          181         133m
timeserver   182/200   200          182         133m
timeserver   183/200   200          183         133m
timeserver   184/200   200          184         136m
timeserver   183/200   199          183         136m
timeserver   187/200   200          187         136m
timeserver   188/200   200          188         136m
timeserver   190/200   200          190         136m
timeserver   196/200   200          196         136m
timeserver   197/200   200          197         137m
timeserver   198/200   200          198         137m
timeserver   199/200   200          199         137m
timeserver   200/200   200          200         137m

What I observed during the test is that, after a short period of nodes being cordoned and Pods deleted, there were no nodes left in the affected zone (so the script was working as intended). The Deployment dipped in available replicas before recovering, with capacity scaled up in the unaffected zones.
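
One final note: the script stops cordoning when the timer expires, but it doesn’t uncordon anything itself. Autopilot should clean up the empty cordoned nodes on its own, but if any linger and you want to return them to service straight away, you can clear the cordon manually (replace the zone as appropriate):

kubectl uncordon -l topology.kubernetes.io/zone=<zone>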