# Scaling container resources automatically with VPA

Vertical Pod Autoscaling (VPA) is a system that measures Pod utilization and attempts to set the right resource requests. For example, if the Pod is constantly using more CPU, VPA will increase the CPU requests. In contrast to Horizontal Pod Autoscaling (HPA), which creates more replicas as your Pods use more resources, VPA changes the resources of each replica.

In the past, GKE in Autopilot mode had a 250 milliCPU resource increment (meaning valid Pod sizes were 250m, 500m, 750m, etc). Now that Autopilot [supports burstable QoS and fine-grained resource increments](https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot-burstable-workloads/), VPA should be even more useful, so let’s give it a spin.

VPA can run in update mode or advisory mode. In the former, the Pod values are directly updated (within an optional minimum and maximum bound). Currently this causes the Pod to be restarted, but work is ongoing to have the option for [in-place updates (AEP-4016)](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support) (where capacity exists on the node). VPA can also be run in a purely advisory fashion, to help you determine the right resource values to set on your Pods.

To start, I’m going to create a cluster running the latest version of Autopilot so that it has the new burstable QoS and no fixed resource increment (version 1.30.2-gke.1394000 or later is required), both of which make VPA work even better.

```shell {hl_lines=[1]}
VERSION="1.30"
CHANNEL="rapid"
CLUSTER_NAME=burst-test
REGION=us-west1
gcloud container clusters create-auto $CLUSTER_NAME \
    --release-channel $CHANNEL --region $REGION \
    --cluster-version $VERSION
```

Now we can deploy a workload. This one is going to be very greedy on the CPU.
We’ll use version 4 of the Timeserver container from Chapter 6 of my book, tweaked to add some resource requests as a starting point:

```yaml {hl_lines=[19, 20, 21]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 5
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      containers:
      - name: timeserver-container
        image: docker.io/wdenniss/timeserver:4
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
            ephemeral-storage: 5Gi
          limits:
            cpu: 70m
            memory: 70Mi
```

[deploy.yaml](https://github.com/WilliamDenniss/kubernetes-for-developers/blob/master/Bonus/vpa/deploy.yaml)

We’ll also need an internal Service for the Deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: timeserver
spec:
  selector:
    pod: timeserver-pod
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
```

[svc.yaml](https://github.com/WilliamDenniss/kubernetes-for-developers/blob/master/Bonus/vpa/svc.yaml)

Plus a brand new Job to throw a bunch of requests at this Deployment. In the book, I use Apache Bench from the command line to do that. This Job will wrap that up into a container!
With the internal Service for our Deployment in place, the load-generating Job itself is defined as follows:

```yaml {hl_lines=[13]}
apiVersion: batch/v1
kind: Job
metadata:
  name: load-generate
spec:
  backoffLimit: 1000
  completions: 1000
  template:
    spec:
      containers:
      - name: ab-container
        image: docker.io/jordi/ab
        command: ["ab", "-n", "1000000000", "-c", "20", "-s", "120", "http://timeserver/"]
      restartPolicy: OnFailure
```

[load-job.yaml](https://github.com/WilliamDenniss/kubernetes-for-developers/blob/master/Bonus/vpa/load-job.yaml)

Create all three:

```shell
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/master/Bonus/vpa/deploy.yaml
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/master/Bonus/vpa/svc.yaml
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/master/Bonus/vpa/load-job.yaml
```

Once created, observe the usage in the cluster. It should be pretty high (at, or thanks to bursting exceeding, the 50m CPU each Pod requests).

```shell {hl_lines=[4, 5, 6, 7, 8]}
$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
load-generate-8cbbm           67m          2Mi
timeserver-7dd495b684-74jt9   62m          12Mi
timeserver-7dd495b684-8s9tt   62m          11Mi
timeserver-7dd495b684-glrz6   66m          11Mi
timeserver-7dd495b684-pl97n   63m          11Mi
timeserver-7dd495b684-trhnx   64m          11Mi
```

Now let’s create the VPA to automatically size these Pods. This example sets some guardrails: a minimum of 50m CPU (which is Autopilot’s minimum) and a maximum of 2 vCPU.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: timeserver-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: timeserver
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 1
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: "50m"
        memory: "50Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
      controlledValues: RequestsAndLimits
```

[vpa.yaml](https://github.com/WilliamDenniss/kubernetes-for-developers/blob/master/Bonus/vpa/vpa.yaml)

Note that **minReplicas** is a very important field. VPA has its own inbuilt Pod Disruption Budget (PDB) of sorts, and will not evict Pods if it means there will be fewer replicas running than the minimum.

If you have multiple containers and want to configure minimums and maximums separately, you can do that with multiple policies. The above example has a single policy that applies to all containers.

Create it like so:

```shell
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/kubernetes-for-developers/master/Bonus/vpa/vpa.yaml
```

## Inspecting the results

Give the VPA a moment to work, and then query it.

```shell {hl_lines=[3]}
$ kubectl get vpa
NAME             MODE   CPU   MEM    PROVIDED   AGE
timeserver-vpa   Auto   75m   50Mi   True       3m55s
```

Here we can see that the VPA is recommending a new, higher request of 75m (from the starting 50m). We can drill into this more by describing the object.
```shell {hl_lines=[48, 49, 50]}
$ kubectl describe vpa
Name:         timeserver-vpa
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2024-03-29T04:17:02Z
  Generation:          2
  Resource Version:    55138
  UID:                 d8f3e03b-9c74-40f1-a8c7-06a448a05f8d
Spec:
  Resource Policy:
    Container Policies:
      Container Name:     *
      Controlled Values:  RequestsAndLimits
      Max Allowed:
        Cpu:     2
        Memory:  2Gi
      Min Allowed:
        Cpu:     50m
        Memory:  50Mi
  Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         timeserver
  Update Policy:
    Min Replicas:  1
    Update Mode:   Auto
Status:
  Conditions:
    Last Transition Time:  2024-03-29T04:17:19Z
    Status:                False
    Type:                  LowConfidence
    Last Transition Time:  2024-03-29T04:17:19Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  timeserver-container
      Lower Bound:
        Cpu:     50m
        Memory:  50Mi
      Target:
        Cpu:     75m
        Memory:  50Mi
      Uncapped Target:
        Cpu:     75m
        Memory:  14680064
      Upper Bound:
        Cpu:     2
        Memory:  2Gi
Events:          <none>
```

The “Uncapped Target” is particularly interesting, as this is the value the VPA would set without our minimum and maximum bounds.

Let’s check the Pods. Notice in the AGE column that one of them is newer than the others.

```shell {hl_lines=[8]}
$ kubectl get pods
NAME                          READY   STATUS    RESTARTS      AGE
load-generate-8hqgm           1/1     Running   2 (39s ago)   2m31s
timeserver-7dd495b684-4xdbr   1/1     Running   0             3m31s
timeserver-7dd495b684-7bw7l   1/1     Running   0             2m31s
timeserver-7dd495b684-87gv8   1/1     Running   0             2m31s
timeserver-7dd495b684-dwz98   1/1     Running   0             3m31s
timeserver-7dd495b684-jmhdz   1/1     Running   0             91s
```

Taking a closer look at that Pod, we can see that its resources are higher than before.
```shell {hl_lines=[22, 23, 24, 25]}
$ kubectl get pod -o yaml timeserver-7dd495b684-jmhdz
apiVersion: v1
kind: Pod
metadata:
  annotations:
    vpaObservedContainers: timeserver-container
    vpaUpdates: 'Pod resources updated by timeserver-vpa: container 0: memory request,
      cpu request, cpu limit, memory limit'
  labels:
    pod: timeserver-pod
    pod-template-hash: 7dd495b684
  name: timeserver-7dd495b684-jmhdz
spec:
  containers:
  - image: docker.io/wdenniss/timeserver:4
    name: timeserver-container
    resources:
      limits:
        cpu: 105m
        ephemeral-storage: 5Gi
        memory: 77Mi
      requests:
        cpu: 75m
        ephemeral-storage: 5Gi
        memory: 77Mi
```

Note: depending on your situation, it may take a little time before the VPA takes effect. I had an issue where it wasn’t working at first; I stepped away, and when I got back, it was working. Make sure you have more Pod replicas in the Deployment than the minReplicas value in the VPA, otherwise the VPA will never restart the Pods (I had that problem!). If you do, and it’s still not working, take a break for a minute and see if it just needs a moment to catch up.

## Some time later…

If we now inspect the Pods a short time later, we can see that they have been replaced with Pods that have higher requests and limits. The following get command conveniently displays the resource requests for all Pods inline, without the need to inspect them one by one.

*Aside:* I used an LLM to help create this command. I find them very handy for creating complex custom queries like this. Try a prompt in Gemini like “kubectl get command that uses custom columns to display the resource requests of a Pod”.
```shell {hl_lines=[4, 5, 6, 7, 8]}
$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
load-generate-8hqgm           map[cpu:500m ephemeral-storage:1Gi memory:2Gi]
timeserver-7dd495b684-4xdbr   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-7bw7l   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-87gv8   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-dwz98   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-jmhdz   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
```

We can view the current usage, now with these higher requests:

```shell {hl_lines=[4, 5, 6, 7, 8]}
$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
load-generate-8hqgm           88m          11Mi
timeserver-7dd495b684-4xdbr   97m          11Mi
timeserver-7dd495b684-7bw7l   100m         13Mi
timeserver-7dd495b684-87gv8   94m          11Mi
timeserver-7dd495b684-dwz98   94m          11Mi
timeserver-7dd495b684-jmhdz   94m          11Mi
```

After 34 minutes, we can look again and see the VPA has been updated once more.

```shell {hl_lines=[3, 8, 9, 10, 11, 12]}
$ kubectl get vpa
NAME             MODE   CPU    MEM    PROVIDED   AGE
timeserver-vpa   Auto   170m   50Mi   True       34m

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
load-generate-8hqgm           map[cpu:500m ephemeral-storage:1Gi memory:2Gi]
timeserver-7dd495b684-gbc8k   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-lxvjb   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-p6sfb   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-tmqxf   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-zv9bj   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
```

This process continues automatically. Eventually, assuming consistent load on the Pods, it will settle and the Pods will no longer be re-created. Here’s the same VPA after 106 minutes.
```shell {hl_lines=[3, 8, 9, 10, 11, 12]}
$ kubectl get vpa
NAME             MODE   CPU    MEM    PROVIDED   AGE
timeserver-vpa   Auto   265m   50Mi   True       106m

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
load-generate-j2rf6           map[cpu:500m ephemeral-storage:1Gi memory:2Gi]
timeserver-7dd495b684-bvtpp   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-mv9js   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-qzvp8   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-s75n4   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-zffds   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
```

All of this works in advisory mode too, that is, with updateMode set to “Off” (see the [API reference](https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler#podupdatepolicy_v1_autoscalingk8sio)). However, in advisory mode, VPA can only observe usage within the resources the Pod has already been given. In this example, we saw VPA make multiple recommendations, and each new recommendation was possible because of the changes made by the prior one. In advisory mode, since the underlying Pod resources are unchanged, you are just getting the initial recommendation. This may be useful information if the Pod’s requests are much higher than what it needs, and less useful if the Pod’s requests are much lower (as VPA won’t be able to know just how much it needs).
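
As a minimal sketch of the advisory setup described above, the manifest below assumes the same Deployment and bounds used in this walkthrough and changes only `updateMode` to "Off". It is an illustrative variant, not one of the files in the linked repository, and is meant to replace the "Auto" VPA rather than run alongside it.

```yaml
# Illustrative variant of vpa.yaml (assumption: not part of the linked repo).
# With updateMode "Off", VPA only publishes recommendations and never evicts Pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: timeserver-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: timeserver
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: "50m"
        memory: "50Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
```

The recommendation can then be read from the object’s status and applied to the Deployment by hand, for example:

```shell
kubectl get vpa timeserver-vpa -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
```
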
## Stable state

I checked back in on this deployment a week later to see how it was going. Here’s the final stable state:

```shell {hl_lines=[3, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20]}
$ kubectl get vpa
NAME             MODE   CPU    MEM    PROVIDED   AGE
timeserver-vpa   Auto   425m   50Mi   True       7d14h

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
timeserver-7dd495b684-95vdt   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-jsdm8   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-l8m4s   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-vhzfz   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-wsrps   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]

$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
load-generate-jhgtl           312m         1573Mi
timeserver-7dd495b684-95vdt   365m         12Mi
timeserver-7dd495b684-jsdm8   357m         11Mi
timeserver-7dd495b684-l8m4s   256m         12Mi
timeserver-7dd495b684-vhzfz   266m         11Mi
timeserver-7dd495b684-wsrps   336m         11Mi
```

## Further reading

- [GKE’s VPA API reference](https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler)
- [VPA Tutorial](https://cloud.google.com/kubernetes-engine/docs/how-to/vertical-pod-autoscaling#update-requests-automatically)