# Simulate a zonal failure on Autopilot

GKE’s Autopilot mode is designed to be safer to operate, and prevents many issues that can lead to sudden outages, like misconfigured firewalls. However, if you’re actually trying to simulate a failure, as in a zonal failure simulation, Autopilot’s safeguards can get in the way by blocking some of the ways you might run such a test.

It is still possible, though, to simulate the loss of compute capacity in a zone and see how your load balancing and workload respond. Since your workload’s configuration (e.g. number of replicas, zonal topology spread, etc.) is one of the key things within your control, I believe this is a useful test. We can achieve it with `kubectl cordon`, `kubectl delete pods` and a little scripting.

Let’s create a test of a zonal failure, and see how the workload handles it! To simulate this, we cordon every node in a zone (this prevents Pods from being scheduled in that zone), and then delete every running Pod in the zone. By doing this in a loop, even if Autopilot brings back a node in the zone under test, it will be cordoned on the next cycle. The result is no running Pods and no schedulable capacity in the zone for the duration of the test.

Here’s an example script (I created this with Gemini 2.5 Pro with a bit of back and forth to fix some issues):

```bash
#!/bin/bash

# --- Configuration ---

# Check if a zone parameter is provided.
if [ -z "$1" ]; then
  echo "Error: No target zone specified."
  echo "Usage: $0 <zone>"
  echo "Example: $0 us-west1-a"
  exit 1
fi

# The zone to simulate an outage in, taken from the first command-line argument.
OUTAGE_ZONE="$1"

# Duration of the outage in seconds (30 minutes = 1800 seconds)
OUTAGE_DURATION=1800

# Interval between each disruption cycle in seconds
LOOP_INTERVAL=15

# --- End Configuration ---

# --- Disruption Loop ---
echo "--- Starting continuous zonal disruption for zone: $OUTAGE_ZONE ---"
echo "This will run for $OUTAGE_DURATION seconds. Press Ctrl+C to stop early."

# Calculate the end time for the loop
END_TIME=$(( $(date +%s) + OUTAGE_DURATION ))

while [ "$(date +%s)" -lt "$END_TIME" ]; do
  # Get current time for logging
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$TIMESTAMP] Running disruption cycle for zone '$OUTAGE_ZONE'..."

  # 1. Cordon all current and any newly provisioned nodes in the zone.
  echo " -> Cordoning all nodes with label topology.kubernetes.io/zone=$OUTAGE_ZONE"
  kubectl cordon -l "topology.kubernetes.io/zone=$OUTAGE_ZONE"

  # 2. Force delete all pods on all nodes in the zone.
  echo " -> Getting list of nodes in zone $OUTAGE_ZONE..."
  NODE_NAMES=$(kubectl get nodes -l "topology.kubernetes.io/zone=$OUTAGE_ZONE" -o jsonpath='{.items[*].metadata.name}')

  if [ -n "$NODE_NAMES" ]; then
    echo " -> Iterating through nodes to delete pods..."
    # Use a for loop to iterate over each node name.
    # The shell will split the NODE_NAMES variable by spaces into individual words
    # (NODE_NAMES is deliberately left unquoted here).
    for NODE in $NODE_NAMES; do
      echo " - Deleting pods on node: $NODE"
      kubectl delete pods --all-namespaces --field-selector "spec.nodeName=$NODE" --force --grace-period=0
    done
  else
    echo " -> No nodes found in zone $OUTAGE_ZONE at this time."
  fi

  echo " -> Cycle complete. Waiting for $LOOP_INTERVAL seconds."
  sleep "$LOOP_INTERVAL"
done

echo "--- Disruption test finished after $OUTAGE_DURATION seconds. ---"
echo "Ceased cordoning and pod deletion in zone $OUTAGE_ZONE."
echo "GKE Autopilot will now stabilize workloads."
```

What’s going on here? Every 15 seconds, for the outage duration that you define, this script cordons all nodes in a single zone, which prevents newly scheduled Pods from landing on them. It also deletes all running Pods on those nodes without honoring graceful termination. Since Autopilot will naturally try to “heal” by creating new nodes, the script runs in a loop so that any new nodes are cordoned, and any Pods that landed on them are deleted, on the next cycle.
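One thing the loop does not do is undo its own cordons, so if you stop the test early, nodes in the zone may stay unschedulable until Autopilot removes them. Here is a minimal sketch of a cleanup helper, assuming `kubectl uncordon` is permitted on your cluster. The function name `uncordon_zone` and the `KUBECTL` override variable are hypothetical names introduced for illustration; they are not part of the script above.

```bash
#!/bin/bash

# Hypothetical helper (not part of the original script): re-enable scheduling
# on every node in the given zone. KUBECTL is an assumed override so the
# command can be swapped out (for example with "echo") for a dry run.
uncordon_zone() {
  local zone="$1"
  if [ -z "$zone" ]; then
    echo "Usage: uncordon_zone <zone>" >&2
    return 1
  fi
  "${KUBECTL:-kubectl}" uncordon -l "topology.kubernetes.io/zone=$zone"
}

# Example wiring inside the test script (commented out so that sourcing this
# file has no side effects):
# trap 'uncordon_zone "$OUTAGE_ZONE"' EXIT
```

Running `KUBECTL=echo uncordon_zone us-west1-a` prints the command that would be run without touching the cluster, which is a cheap way to check the selector first.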
A zonal failure may have other implications, but if what you want is to test how your workload and load balancing respond to a sudden and sustained loss of capacity in a single zone, this is one way to do it.

Here are the results from my own test run.

Deploy a workload and scale it:

```shell
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/refs/heads/master/Chapter03/3.2_DeployingToKubernetes/deploy.yaml
kubectl scale deploy/timeserver --replicas 200
```

Wait for it to be running:

```shell
$ kubectl get deploy
NAME         READY     UP-TO-DATE   AVAILABLE   AGE
timeserver   200/200   200          200         5m
```

List all the nodes and their zones:

```shell
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.'topology\.kubernetes\.io/zone'
NAME                                                 ZONE
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-5vfv   europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-fw2v   europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-k772   europe-west3-c
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-brld   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-dggm   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-flsc   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-r8df   europe-west3-b
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-twsn   europe-west3-a
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-vn8n   europe-west3-a
gk3-autopilot-cluster-3-pool-4-3311a57a-k8pj         europe-west3-c
```

Starting the test. Let’s pick a zone with lots of nodes:

```shell
$ ./outage-test.sh europe-west3-a
--- Starting continuous zonal disruption for zone: europe-west3-a ---
This will run for 1800 seconds. Press Ctrl+C to stop early.
[2025-05-31 05:48:05] Running disruption cycle for zone 'europe-west3-a'...
 -> Cordoning all nodes with label topology.kubernetes.io/zone=europe-west3-a
node/gk3-autopilot-cluster-3-nap-x67xdnqn-5b5704dd-rs48 cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-6lk2 cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-cjdk cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-mprh cordoned
node/gk3-autopilot-cluster-3-pool-4-28df1d79-xrnz cordoned
 -> Getting list of nodes in zone europe-west3-a...
 -> Iterating through nodes to delete pods...
 - Deleting pods on node: gk3-autopilot-cluster-3-nap-x67xdnqn-5b5704dd-rs48
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "timeserver-7d54975c9f-sfzhb" force deleted
```

Note: you can ignore errors like “`Error from server (Forbidden): pods "node-local-dns-cnzz5" is forbidden: User cannot delete resource "pods" in API group "" in the namespace "kube-system"`”. This is just Autopilot preventing the deletion of system Pods. Your workload’s Pods will still be deleted, and the system Pods will be removed when the node is deleted after being cordoned.
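If the Forbidden errors make the log hard to read, one option is to skip system namespaces instead of attempting the delete. The sketch below is an illustration, not the method used in the test run above: the namespace list in `PROTECTED_NAMESPACES` is an assumption you should adjust for your cluster, and the function names and the `KUBECTL` override variable are hypothetical.

```bash
#!/bin/bash

# Assumed list of namespaces whose Pods should be left alone.
# Adjust this to whatever your cluster forbids deleting.
PROTECTED_NAMESPACES="kube-system gke-managed-system gmp-system"

# Reads "namespace pod-name" pairs on stdin and prints only the pairs
# whose namespace is not in PROTECTED_NAMESPACES.
filter_workload_pods() {
  local ns pod
  while read -r ns pod; do
    [ -z "$ns" ] && continue
    case " $PROTECTED_NAMESPACES " in
      *" $ns "*) ;;              # protected namespace: skip
      *) echo "$ns $pod" ;;      # workload Pod: keep
    esac
  done
}

# Force deletes the workload Pods on one node, skipping protected namespaces.
# KUBECTL is a hypothetical override, useful for dry runs.
delete_workload_pods_on_node() {
  local node="$1" ns pod
  "${KUBECTL:-kubectl}" get pods --all-namespaces \
    --field-selector "spec.nodeName=$node" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
    filter_workload_pods |
    while read -r ns pod; do
      "${KUBECTL:-kubectl}" delete pod "$pod" -n "$ns" --force --grace-period=0
    done
}
```

This issues one delete per Pod, so it is slower than the single `kubectl delete pods --all-namespaces` call in the script, which may matter on large nodes.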
Observing the nodes in the affected zone being cordoned:

```shell
$ kubectl get nodes
NAME                                                 STATUS                     ROLES    AGE   VERSION
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-5vfv   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-fw2v   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-88405675-k772   Ready                      <none>   10m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-brld   Ready                      <none>   11m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-dggm   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-flsc   Ready                      <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a1c24d31-r8df   Ready                      <none>   67m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-p7tt   NotReady                   <none>   1s    v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-twsn   Ready,SchedulingDisabled   <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-nap-1cmx89wv-a4a55b43-vn8n   Ready,SchedulingDisabled   <none>   14m   v1.32.3-gke.1927009
gk3-autopilot-cluster-3-pool-4-3311a57a-k8pj         Ready                      <none>   84m   v1.32.3-gke.1927009
```

Watching the Deployment drop in availability, then recover:

```shell
$ kubectl get deploy -w
NAME         READY     UP-TO-DATE   AVAILABLE   AGE
timeserver   200/200   200          200         123m
timeserver   199/200   199          199         130m
timeserver   198/200   199          198         130m
timeserver   195/200   197          195         130m
timeserver   193/200   198          193         130m
timeserver   193/200   200          193         130m
timeserver   192/200   199          192         130m
timeserver   191/200   199          191         130m
timeserver   191/200   200          191         130m
timeserver   190/200   199          190         130m
timeserver   189/200   199          189         130m
timeserver   187/200   198          187         130m
timeserver   184/200   197          184         130m
timeserver   180/200   196          180         130m
timeserver   174/200   194          174         130m
timeserver   173/200   199          173         130m
timeserver   173/200   200          173         130m
timeserver   172/200   199          172         131m
timeserver   172/200   200          172         131m
timeserver   171/200   199          171         131m
timeserver   171/200   200          171         131m
timeserver   170/200   199          170         131m
timeserver   169/200   199          169         131m
timeserver   167/200   198          167         131m
timeserver   166/200   198          166         131m
timeserver   165/200   198          165         131m
timeserver   163/200   198          163         131m
timeserver   162/200   197          162         131m
timeserver   160/200   197          160         131m
timeserver   152/200   190          152         131m
timeserver   152/200   192          152         131m
timeserver   146/200   194          146         131m
timeserver   146/200   200          146         131m
timeserver   145/200   199          145         131m
timeserver   145/200   200          145         131m
timeserver   144/200   199          144         131m
timeserver   143/200   199          143         131m
timeserver   142/200   199          142         131m
timeserver   141/200   199          141         131m
timeserver   139/200   198          139         131m
timeserver   136/200   197          136         131m
timeserver   130/200   194          130         131m
timeserver   125/200   195          125         131m
timeserver   119/200   194          119         131m
timeserver   119/200   200          119         131m
timeserver   118/200   199          118         131m
timeserver   117/200   199          117         131m
timeserver   115/200   198          115         131m
timeserver   113/200   198          113         131m
timeserver   110/200   197          110         131m
timeserver   107/200   197          107         131m
timeserver   93/200    186          93          131m
timeserver   92/200    199          92          131m
timeserver   92/200    200          92          131m
timeserver   93/200    200          93          131m
timeserver   95/200    200          95          131m
timeserver   96/200    200          96          131m
timeserver   98/200    200          98          131m
timeserver   99/200    200          99          131m
timeserver   100/200   200          100         131m
timeserver   101/200   200          101         131m
timeserver   102/200   200          102         131m
timeserver   103/200   200          103         131m
timeserver   104/200   200          104         131m
timeserver   105/200   200          105         131m
timeserver   106/200   200          106         131m
timeserver   107/200   200          107         131m
timeserver   108/200   200          108         131m
timeserver   109/200   200          109         131m
timeserver   110/200   200          110         131m
timeserver   111/200   200          111         131m
timeserver   112/200   200          112         131m
timeserver   115/200   200          115         131m
timeserver   116/200   200          116         131m
timeserver   117/200   200          117         131m
timeserver   118/200   200          118         131m
timeserver   119/200   200          119         131m
timeserver   120/200   200          120         131m
timeserver   121/200   200          121         131m
timeserver   122/200   200          122         131m
timeserver   123/200   200          123         131m
timeserver   128/200   200          128         131m
timeserver   131/200   200          131         131m
timeserver   132/200   200          132         132m
timeserver   133/200   200          133         132m
timeserver   134/200   200          134         132m
timeserver   136/200   200          136         132m
timeserver   138/200   200          138         132m
timeserver   139/200   200          139         132m
timeserver   144/200   200          144         132m
timeserver   145/200   200          145         132m
timeserver   146/200   200          146         132m
timeserver   148/200   200          148         132m
timeserver   149/200   200          149         132m
timeserver   150/200   200          150         132m
timeserver   152/200   200          152         132m
timeserver   154/200   200          154         132m
timeserver   155/200   200          155         132m
timeserver   156/200   200          156         132m
timeserver   157/200   200          157         132m
timeserver   158/200   200          158         132m
timeserver   159/200   200          159         132m
timeserver   160/200   200          160         132m
timeserver   161/200   200          161         132m
timeserver   162/200   200          162         132m
timeserver   163/200   200          163         132m
timeserver   164/200   200          164         132m
timeserver   165/200   200          165         132m
timeserver   166/200   200          166         132m
timeserver   167/200   200          167         132m
timeserver   166/200   199          166         132m
timeserver   158/200   189          158         132m
timeserver   157/200   199          157         132m
timeserver   152/200   195          152         132m
timeserver   144/200   190          144         132m
timeserver   145/200   190          145         132m
timeserver   149/200   190          149         132m
timeserver   154/200   192          154         132m
timeserver   173/200   200          173         132m
timeserver   177/200   200          177         132m
timeserver   178/200   200          178         132m
timeserver   180/200   200          180         132m
timeserver   181/200   200          181         132m
timeserver   182/200   200          182         132m
timeserver   183/200   200          183         133m
timeserver   182/200   199          182         133m
timeserver   181/200   199          181         133m
timeserver   180/200   199          180         133m
timeserver   178/200   198          178         133m
timeserver   176/200   198          176         133m
timeserver   174/200   198          174         133m
timeserver   170/200   196          170         133m
timeserver   170/200   200          170         133m
timeserver   171/200   200          171         133m
timeserver   172/200   200          172         133m
timeserver   173/200   200          173         133m
timeserver   174/200   200          174         133m
timeserver   175/200   200          175         133m
timeserver   176/200   200          176         133m
timeserver   177/200   200          177         133m
timeserver   178/200   200          178         133m
timeserver   179/200   200          179         133m
timeserver   181/200   200          181         133m
timeserver   182/200   200          182         133m
timeserver   183/200   200          183         133m
timeserver   184/200   200          184         136m
timeserver   183/200   199          183         136m
timeserver   187/200   200          187         136m
timeserver   188/200   200          188         136m
timeserver   190/200   200          190         136m
timeserver   196/200   200          196         136m
timeserver   197/200   200          197         137m
timeserver   198/200   200          198         137m
timeserver   199/200   200          199         137m
timeserver   200/200   200          200         137m
```

During the test, I observed that after a short period in which nodes were cordoned and Pods deleted, there were no nodes left in the affected zone (so the script was working). The Deployment dipped in available replicas before recovering as capacity scaled up in the unaffected zones.
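To put a number on the dip rather than eyeballing the watch output, you could save the output to a file and extract the lowest AVAILABLE value. This is a minimal sketch under the assumption that the log has the same five columns as the `kubectl get deploy -w` output above; the function name `min_available` and the file name `watch.log` are hypothetical.

```bash
#!/bin/bash

# Reads `kubectl get deploy -w` style output on stdin and prints the lowest
# value seen in the AVAILABLE column (4th field). Header lines and rows
# without a numeric 4th field are ignored.
min_available() {
  awk '
    $4 ~ /^[0-9]+$/ {
      if (!seen || $4 + 0 < min) { min = $4 + 0; seen = 1 }
    }
    END { if (seen) print min }
  '
}

# Example usage (commented out so that sourcing this file has no side effects):
# kubectl get deploy timeserver -w | tee watch.log   # stop with Ctrl+C
# min_available < watch.log
```

Comparing that minimum against the replica count gives a rough measure of how much capacity the workload lost at the worst point of the test.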