Scaling container resources automatically with VPA

Vertical Pod Autoscaling (VPA) is a system that measures Pod utilization and attempts to set the right resource requests. For example, if a Pod is constantly using more CPU than it requests, VPA will increase the CPU request. In contrast to Horizontal Pod Autoscaling (HPA), which creates more replicas as your Pods use more resources, VPA changes the resources of each replica.

In the past, GKE in Autopilot mode had a 250 milliCPU resource increment (meaning valid Pod sizes were 250m, 500m, 750m, etc). Now that Autopilot supports burstable QoS and fine-grained resource increments, VPA should be even more useful, so let’s give it a spin.

VPA can run in update mode or advisory mode. In update mode, the Pod’s resource values are updated directly (within optional minimum and maximum bounds). Currently this causes the Pod to be restarted, but work is ongoing on an option for in-place updates (AEP-4016) where capacity exists on the node. VPA can also run in a purely advisory fashion, where it only reports recommendations to help you determine the right resource values to set on your Pods yourself.

To start, I’m going to create a cluster running the latest version of Autopilot (version 1.30.2-gke.1394000 or later is required) so that it has the new burstable QoS and no longer has the 250m resource increment, both of which make VPA work even better.

gcloud container clusters create-auto $CLUSTER_NAME \
    --release-channel $CHANNEL --region $REGION \
    --cluster-version $VERSION
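
The command above expects four shell variables to be set first. Here is a minimal sketch of what that could look like. The cluster name, release channel, and region are placeholder assumptions of mine; only the version number comes from the requirement stated above.

```shell
# Hypothetical values; substitute your own before running gcloud.
CLUSTER_NAME="vpa-demo"          # assumed cluster name
CHANNEL="rapid"                  # assumed release channel
REGION="us-west1"                # assumed region
VERSION="1.30.2-gke.1394000"     # minimum version mentioned above

echo "Creating $CLUSTER_NAME ($VERSION) in $REGION on the $CHANNEL channel"
```

Any release channel that offers a version at or above the minimum should work equally well.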

Now we can deploy a workload. This one is going to be very greedy on the CPU. We’ll use version 4 of the Timeserver container from Chapter 6 of my book, tweaked to add some resource requests as a starting point:

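apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 5
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      containers:
      - name: timeserver-container
        image:  # Timeserver version 4 image
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
            ephemeral-storage: 5Gi
          limits:
            cpu: 70m
            memory: 70Mi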


apiVersion: v1
kind: Service
metadata:
  name: timeserver
spec:
  selector:
    pod: timeserver-pod
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP


Plus a brand new Job to throw a bunch of requests at this Deployment. In the book, I use Apache Bench from the command line to do that. This Job wraps that up into a container! We need an internal Service for our Deployment (defined above), plus the load-generating Job itself, defined as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: load-generate
spec:
  backoffLimit: 1000
  completions: 1000
  template:
    spec:
      containers:
      - name: ab-container
        image:  # image providing Apache Bench (ab)
        command: ["ab", "-n", "1000000000", "-c", "20", "-s", "120", "http://timeserver/"]
      restartPolicy: OnFailure


Creating all 3:

kubectl create -f
kubectl create -f
kubectl create -f

Once created, observe the usage in the cluster. It should be pretty high: at the 50m CPU each Pod requests, or above it thanks to bursting.

$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)   
load-generate-8cbbm           67m          2Mi             
timeserver-7dd495b684-74jt9   62m          12Mi            
timeserver-7dd495b684-8s9tt   62m          11Mi            
timeserver-7dd495b684-glrz6   66m          11Mi            
timeserver-7dd495b684-pl97n   63m          11Mi            
timeserver-7dd495b684-trhnx   64m          11Mi            

Now let’s create the VPA to automatically size these Pods. This example sets some guardrails: a minimum of 50m CPU (which is Autopilot’s minimum) and a maximum of 2 vCPU.

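apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: timeserver-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: timeserver
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 1
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: "50m"
          memory: "50Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
        controlledValues: RequestsAndLimits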


Note that minReplicas is a very important field. VPA has its own built-in Pod Disruption Budget (PDB) of sorts and will not evict Pods if doing so would leave fewer replicas running than the minimum. If you have multiple containers and want to configure minimums and maximums separately, you can do that with multiple policies. The above example has a single policy that applies to all containers.
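
For the multi-container case, here is a rough sketch of what separate policies could look like. The second container, log-sidecar, and its bounds are hypothetical (the timeserver Deployment has only one container), and the file name is my own choice. The snippet only writes the manifest to disk so you can review it first.

```shell
# Write a VPA manifest with one policy per container.
# "log-sidecar" is a hypothetical second container used for illustration only.
cat > timeserver-vpa-multi.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: timeserver-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: timeserver
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 1
  resourcePolicy:
    containerPolicies:
      - containerName: timeserver-container
        minAllowed:
          cpu: "50m"
          memory: "50Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
        controlledValues: RequestsAndLimits
      - containerName: log-sidecar
        minAllowed:
          cpu: "50m"
          memory: "50Mi"
        maxAllowed:
          cpu: "500m"
          memory: "512Mi"
        controlledValues: RequestsAndLimits
EOF

# Count the per-container policies that were written.
policy_count=$(grep -c 'containerName:' timeserver-vpa-multi.yaml)
echo "container policies: $policy_count"
```

Each entry under containerPolicies is matched by container name, so the sidecar can be given a much tighter ceiling than the main container.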

Create it like so:

kubectl create -f

Inspecting the results

Give the VPA a moment to work, and then query it.

$ kubectl get vpa
NAME             MODE   CPU   MEM    PROVIDED   AGE
timeserver-vpa   Auto   75m   50Mi   True       3m55s

Here we can see that the VPA is recommending a new, higher request of 75m (from the starting 50m). We can drill into this more by describing the object.

$ kubectl describe vpa
Name:         timeserver-vpa
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2024-03-29T04:17:02Z
  Generation:          2
  Resource Version:    55138
  UID:                 d8f3e03b-9c74-40f1-a8c7-06a448a05f8d
Spec:
  Resource Policy:
    Container Policies:
      Container Name:     *
      Controlled Values:  RequestsAndLimits
      Max Allowed:
        Cpu:     2
        Memory:  2Gi
      Min Allowed:
        Cpu:     50m
        Memory:  50Mi
  Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         timeserver
  Update Policy:
    Min Replicas:  1
    Update Mode:   Auto
Status:
  Conditions:
    Last Transition Time:  2024-03-29T04:17:19Z
    Status:                False
    Type:                  LowConfidence
    Last Transition Time:  2024-03-29T04:17:19Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  timeserver-container
      Lower Bound:
        Cpu:     50m
        Memory:  50Mi
      Target:
        Cpu:     75m
        Memory:  50Mi
      Uncapped Target:
        Cpu:     75m
        Memory:  14680064
      Upper Bound:
        Cpu:     2
        Memory:  2Gi
Events:          <none>

The “Uncapped Target” is particularly interesting, as this is the value the VPA would set without our minimum and maximum bounds.
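
To make the effect of the bounds concrete, here is a small sketch that reproduces the numbers from the output above using plain shell arithmetic. The clamp helper is my own illustration, not part of the VPA, and it assumes the capped value is simply the uncapped one held within the minimum and maximum.

```shell
# Convert the uncapped memory target (reported in bytes) to Mi.
uncapped_bytes=14680064
uncapped_mi=$(( uncapped_bytes / 1024 / 1024 ))

# Hypothetical helper: hold a value between a minimum and a maximum (same unit).
clamp() {
  value=$1; min=$2; max=$3
  if [ "$value" -lt "$min" ]; then echo "$min"
  elif [ "$value" -gt "$max" ]; then echo "$max"
  else echo "$value"
  fi
}

# Memory bounds: 50Mi to 2Gi (2048Mi). CPU bounds: 50m to 2 vCPU (2000m).
target_mem_mi=$(clamp "$uncapped_mi" 50 2048)
target_cpu_m=$(clamp 75 50 2000)

echo "uncapped memory: ${uncapped_mi}Mi, target memory: ${target_mem_mi}Mi, target cpu: ${target_cpu_m}m"
```

The uncapped memory value works out to 14Mi, which the 50Mi minimum raises to 50Mi, while the 75m CPU value already sits inside its bounds and passes through unchanged.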

Let’s check the Pods. Notice in the AGE column that one of them is newer than the others.

$ kubectl get pods
NAME                          READY   STATUS    RESTARTS      AGE
load-generate-8hqgm           1/1     Running   2 (39s ago)   2m31s
timeserver-7dd495b684-4xdbr   1/1     Running   0             3m31s
timeserver-7dd495b684-7bw7l   1/1     Running   0             2m31s
timeserver-7dd495b684-87gv8   1/1     Running   0             2m31s
timeserver-7dd495b684-dwz98   1/1     Running   0             3m31s
timeserver-7dd495b684-jmhdz   1/1     Running   0             91s

Taking a closer look at that Pod, we can see that its resources are higher than before.

$ kubectl get pod -o yaml timeserver-7dd495b684-jmhdz
apiVersion: v1
kind: Pod
metadata:
  annotations:
    vpaObservedContainers: timeserver-container
    vpaUpdates: 'Pod resources updated by timeserver-vpa: container 0: memory request,
      cpu request, cpu limit, memory limit'
  labels:
    pod: timeserver-pod
    pod-template-hash: 7dd495b684
  name: timeserver-7dd495b684-jmhdz
spec:
  containers:
  - image:
    name: timeserver-container
    resources:
      limits:
        cpu: 105m
        ephemeral-storage: 5Gi
        memory: 77Mi
      requests:
        cpu: 75m
        ephemeral-storage: 5Gi
        memory: 77Mi
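
The 105m CPU limit above is consistent with how controlledValues: RequestsAndLimits is documented to behave: the limit is scaled to keep the original limit-to-request ratio. The sketch below checks that arithmetic for CPU only. I have not tried to model the memory values, which evidently follow different rules on this cluster.

```shell
# Original manifest: CPU request 50m, CPU limit 70m.
orig_request_m=50
orig_limit_m=70

# New CPU request applied by the VPA.
new_request_m=75

# Scale the limit by the original limit-to-request ratio (integer math, millicores).
new_limit_m=$(( new_request_m * orig_limit_m / orig_request_m ))

echo "cpu limit: ${new_limit_m}m"
```

This is worth keeping in mind when choosing starting values: the ratio between the limit and the request in your manifest carries over to every size the VPA picks.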

NOTE: Depending on your situation, it may take a little time before the VPA acts. At first it wasn’t working for me; I stepped away, and when I got back, it was. Make sure you have more Pod replicas in the Deployment than the value of the minReplicas field in the VPA, otherwise the VPA will never restart the Pods (I had that problem!). If you do, and it’s still not working, take a break for a minute and see if it just needs a moment to catch up.

Some time later…

If we now inspect the Pods a short time later, we can see that they have been replaced with Pods that have higher requests and limits. The following get command conveniently displays the resource requests for all Pods inline without the need to inspect them one by one.

Aside: I used an LLM to help create this command. I find them very handy for creating complex custom queries like this. Try a prompt in Gemini like “kubectl get command that uses custom columns to display the resource requests of a Pod”.

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
load-generate-8hqgm           map[cpu:500m ephemeral-storage:1Gi memory:2Gi]
timeserver-7dd495b684-4xdbr   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-7bw7l   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-87gv8   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-dwz98   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-jmhdz   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
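
If you only care about the CPU figure, the map[...] column can be trimmed down with a little text processing. This is a sketch of one way to do it with sed, run here against two rows copied from the output above rather than against a live cluster. It assumes cpu is the first key in the map, as it is in that output.

```shell
# Sample rows copied from the output above.
rows='timeserver-7dd495b684-4xdbr   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]
timeserver-7dd495b684-7bw7l   map[cpu:75m ephemeral-storage:5Gi memory:77Mi]'

# Keep the Pod name and the cpu value from each map[...] entry.
cpu_requests=$(printf '%s\n' "$rows" | sed -E 's/^([^ ]+) +map\[cpu:([^ ]+) .*$/\1 \2/')

echo "$cpu_requests"
```

Piping the real kubectl output through the same sed expression gives a compact name and CPU request listing.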

We can view the current usage, now with these higher requests:

$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)   
load-generate-8hqgm           88m          11Mi            
timeserver-7dd495b684-4xdbr   97m          11Mi            
timeserver-7dd495b684-7bw7l   100m         13Mi            
timeserver-7dd495b684-87gv8   94m          11Mi            
timeserver-7dd495b684-dwz98   94m          11Mi            
timeserver-7dd495b684-jmhdz   94m          11Mi            

After 34 minutes, we can look again and see that the VPA has updated its recommendation once more.

$ kubectl get vpa
NAME             MODE   CPU    MEM    PROVIDED   AGE
timeserver-vpa   Auto   170m   50Mi   True       34m

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
load-generate-8hqgm           map[cpu:500m ephemeral-storage:1Gi memory:2Gi]
timeserver-7dd495b684-gbc8k   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-lxvjb   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-p6sfb   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-tmqxf   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]
timeserver-7dd495b684-zv9bj   map[cpu:115m ephemeral-storage:5Gi memory:118Mi]

This process continues automatically. Eventually, assuming consistent load on the Pods, it will settle and the Pods will no longer be re-created. Here’s the same VPA after 106 minutes.

$ kubectl get vpa
NAME             MODE   CPU    MEM    PROVIDED   AGE
timeserver-vpa   Auto   265m   50Mi   True       106m

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
load-generate-j2rf6           map[cpu:500m ephemeral-storage:1Gi memory:2Gi]
timeserver-7dd495b684-bvtpp   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-mv9js   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-qzvp8   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-s75n4   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]
timeserver-7dd495b684-zffds   map[cpu:260m ephemeral-storage:5Gi memory:267Mi]

All of this works in advisory mode too, that is, with updateMode set to “Off” (see the API reference). However, in advisory mode, VPA can only observe the Pod as it runs with the resources it was given. In this example, we saw VPA make multiple recommendations, each one made possible by the changes applied from the prior recommendation. In advisory mode the underlying Pod resources never change, so you only get the initial recommendation. That is useful information if the Pod’s requests are much higher than what it needs, and less useful if the Pod’s requests are much lower (as VPA won’t be able to know just how much it needs).
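
As a sketch of what an advisory-only setup might look like, the snippet below writes a second VPA manifest with updateMode set to "Off". The object name timeserver-vpa-advisory and the file name are my own placeholders, and I have left out the resource policy to keep it short.

```shell
# Write an advisory-only VPA manifest: recommendations are reported, Pods are untouched.
# The object name and file name are hypothetical.
cat > timeserver-vpa-advisory.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: timeserver-vpa-advisory
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: timeserver
  updatePolicy:
    updateMode: "Off"
EOF

# Read the mode back from the file as a quick sanity check.
update_mode=$(grep 'updateMode:' timeserver-vpa-advisory.yaml | tr -d ' "' | cut -d: -f2)
echo "update mode: $update_mode"
```

After creating it with kubectl create -f timeserver-vpa-advisory.yaml, the recommendation can be read with kubectl describe vpa timeserver-vpa-advisory and applied to the Deployment by hand.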

Stable state

I checked back in on this Deployment a week later to see how it was going. Here’s the final stable state:

$ kubectl get vpa
NAME             MODE   CPU    MEM    PROVIDED   AGE
timeserver-vpa   Auto   425m   50Mi   True       7d14h

$ kubectl get pods -o custom-columns=NAME:.metadata.name,REQUESTS:.spec.containers[*].resources.requests
NAME                          REQUESTS
timeserver-7dd495b684-95vdt   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-jsdm8   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-l8m4s   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-vhzfz   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]
timeserver-7dd495b684-wsrps   map[cpu:430m ephemeral-storage:5Gi memory:441Mi]

$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)   
load-generate-jhgtl           312m         1573Mi          
timeserver-7dd495b684-95vdt   365m         12Mi            
timeserver-7dd495b684-jsdm8   357m         11Mi            
timeserver-7dd495b684-l8m4s   256m         12Mi            
timeserver-7dd495b684-vhzfz   266m         11Mi            
timeserver-7dd495b684-wsrps   336m         11Mi 
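
To put those final numbers in perspective, here is a quick back-of-the-envelope calculation of how much of the 430m request the timeserver Pods are using on average. The figures are copied from the two outputs above, and the arithmetic is integer-only, so the percentage is rounded down.

```shell
# CPU usage (millicores) of the five timeserver Pods, from kubectl top.
total_m=$(( 365 + 357 + 256 + 266 + 336 ))
pods=5
avg_m=$(( total_m / pods ))

# CPU request per Pod (millicores), from the custom-columns output.
request_m=430
utilization_pct=$(( avg_m * 100 / request_m ))

echo "average usage: ${avg_m}m, ${utilization_pct}% of the ${request_m}m request"
```

In this snapshot, average usage sits somewhat below the request rather than right at it.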

Further reading

This article is bonus material to supplement my book Kubernetes for Developers.