Minimizing Pod Disruption on Autopilot


There are three common reasons why a Pod may be terminated on Autopilot: a node upgrade, a cluster scale-down, and a node repair. Pod Disruption Budgets (PDBs) and graceful termination periods govern how much disruption Pods experience when these events happen, while maintenance windows and exclusions control when upgrade events can occur.

Upgrade

gracefulTerminationPeriod: limited to one hour
PDB: respected for up to one hour
Maintenance windows: respected
Unmanaged Pod: no impact (won’t prevent upgrade)

Notes: under the hood, Autopilot performs upgrades as a node pool upgrade with surge enabled, so the documented behavior for surge upgrades applies. Unmanaged Pods are those not managed by a higher-order workload controller such as a Deployment, Job, or StatefulSet.

Repair

gracefulTerminationPeriod: limited to one hour
PDB: respected for up to one hour
Maintenance windows: not applicable (won’t prevent node repair)
Unmanaged Pod: no impact (won’t prevent repair)

Notes: node repair follows the behavior described in the node auto-repair documentation. PDBs and the graceful termination period are handled the same way as during a node upgrade.

Scale Down / Compaction

gracefulTerminationPeriod: limited to 10 minutes
PDB: respected (no time limit)
Maintenance windows: not applicable (won’t prevent scale down)
Unmanaged Pod: prevents scale down

Notes: under the hood, Autopilot uses the Cluster Autoscaler, and most of that documentation regarding scale-down behavior applies. Unmanaged Pods prevent scale-down on their own, and PDBs can block scale-down events for managed Pods.

Overall Advice

To minimize termination of Pods, the following Kubernetes configuration can be used:

  1. Set a graceful termination period (terminationGracePeriodSeconds) of 1 hour (see the Deployment sketch after the PDB example below)
  2. Define appropriate PDBs for managed Pods. A maxUnavailable=0 rule can be used for the most critical workloads to prevent scale-down disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: timeserver-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      pod: timeserver-pod

Example PDB for the most critical workloads
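
To illustrate step 1, here is a minimal sketch of where the graceful termination period is set: it lives in the Pod template's spec as terminationGracePeriodSeconds. The Deployment name, labels, and image below are placeholders (chosen to match the PDB selector above), not taken from any particular workload:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 3
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      # 3600 seconds = 1 hour, the maximum Autopilot honors during upgrades and repairs
      terminationGracePeriodSeconds: 3600
      containers:
      - name: timeserver
        image: example/timeserver:1.0  # placeholder image

Example Deployment setting a 1-hour graceful termination period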

Combining these two settings prevents scale-down disruptions and gives Pods up to 2 hours of notice during upgrades and repairs: for the first hour the PDB is respected, after which a SIGTERM is sent and the graceful termination period of up to 1 hour kicks in.

To maximize the time between upgrade-related disruptions, you can set a 30-day “no upgrade” maintenance exclusion window. When it expires, update the cluster to the latest version and set a new 30-day exclusion window.

Sometimes a cluster may have been updated to a newer version while some of its nodes still run the older version. Unless prevented by maintenance windows or exclusions, those nodes remain susceptible to upgrade disruption. If you want your Pods to always land on current-version nodes, you can give each new deployment a unique workload separation key/value pair (e.g. using the current time as the value). Doing this ensures that a new node will be created, and new nodes are always on the current version; a sketch follows below. I wouldn’t recommend this last technique generally, as it’s more involved, and the other advice above (graceful termination periods, PDBs, and maintenance exclusions) should suffice in most cases.
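
If you do want to try the workload separation approach, here is a minimal sketch, assuming a placeholder key/value pair (group: rollout-2024-06-01, e.g. derived from the current time) that you would change on each new rollout. On Autopilot, workload separation requires both a nodeSelector and a matching toleration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 3
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      nodeSelector:
        group: rollout-2024-06-01  # illustrative unique value, e.g. the current time
      tolerations:
      - key: group
        operator: Equal
        value: rollout-2024-06-01
        effect: NoSchedule
      containers:
      - name: timeserver
        image: example/timeserver:1.0  # placeholder image

Example of a unique workload separation key/value pair, which forces Pods onto newly created (and therefore current-version) nodes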