Understanding Kubernetes Node Resource Allocation

The question often comes up: where do my node resources go in Kubernetes? In this post, I show how to use kubectl describe node to break down resource usage by category (system, user, and so on).

The way I think about Kubernetes node resource usage is to assign resource allocations into 4 categories:

  • OS and Kubernetes overhead
  • System pods
  • User pods
  • Empty space

The OS and Kubernetes overhead includes the Linux kernel, the kubelet, and, for memory, an allocation known as the “eviction threshold”. This threshold gives the kubelet a chance to catch Pods that are going over their memory allocation and take action according to their Kubernetes priorities before the kernel OOM handling, which is a blunter instrument, kicks in.

System pods include DaemonSet pods like kube-proxy which runs on every node, and non-DaemonSet deployments like kube-dns that are scaled separately.

If you’re using GKE Autopilot, then only one of those resource groupings is billable: the user pods. On other platforms you generally pay for the node as a whole, so you are effectively paying for every resource group, and it is worth knowing where your resources are going. The ideal case is for “User pods” to be the largest allocation, and “Empty space” to be close to zero (for Autopilot, that is achieved because User pods is the only allocation you pay for).

If you want to know what your node resources currently get spent on, here’s how. This needs to be repeated on every node to get the complete picture, so you may wish to automate it or run it on a small representative cluster.

Firstly, describe the node to see the allocation table:

$ kubectl get nodes
$ kubectl describe node NODE_NAME
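
Since this has to be repeated for every node, here is a minimal Python sketch of one way to automate the first step. It assumes kubectl is on your PATH and already pointed at the cluster, and it reads the Capacity and Allocatable values from the JSON form of kubectl get nodes rather than from the describe output. The function names are my own, not part of any tool.

```python
import json
import subprocess


def fetch_nodes():
    """Run `kubectl get nodes -o json` and return the parsed document."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def capacity_and_allocatable(node_list):
    """Map each node name to its (capacity, allocatable) pair for cpu and memory."""
    summary = {}
    for item in node_list["items"]:
        status = item["status"]
        summary[item["metadata"]["name"]] = {
            resource: (status["capacity"][resource], status["allocatable"][resource])
            for resource in ("cpu", "memory")
        }
    return summary
```

Calling capacity_and_allocatable(fetch_nodes()) against a live cluster gives one entry per node, with the quantities still in Kubernetes notation (for example "940m" or "2690756Ki").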

Then, you can categorize the allocated resources into our 4 resource groups, as follows:

User pods

User pod allocation is calculated by looking at the resource columns of the “Non-terminated Pods” table in the kubectl describe node output. The relevant columns here are the “Requests” columns, not “Limits”. Requests determine how the pod is scheduled and what resources are allocated to it, whereas Limits cap how far a pod can burst beyond that allocation. Look at all pods that are not in the “kube-system” namespace; these are the Pods you will pay for on Autopilot. In my example, this is the pod in the “default” namespace.

Non-terminated Pods:          (5 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  default                     pluscode-demo-75bc4678bd-982bk                         250m (25%)    250m (25%)  250Mi (8%)       250Mi (8%)      83s
  kube-system                 fluentbit-gke-tvgq8                                    100m (10%)    0 (0%)      200Mi (7%)       500Mi (19%)    104m
  kube-system                 gke-metrics-agent-pzbv6                                3m (0%)       0 (0%)      50Mi (1%)        50Mi (1%)      104m
  kube-system                 kube-proxy-gke-standard2-default-pool-eaa029db-bpbx    100m (10%)    0 (0%)      0 (0%)           0 (0%)         103m
  kube-system                 pdcsi-node-wn274                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         104m
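
To make the split concrete, here is a small Python sketch that adds up the Requests columns from the table above by category. The rows are copied by hand from the table. Treating only kube-system as a system namespace is an assumption, and the helper names are mine.

```python
# Rows copied from the "Non-terminated Pods" table above:
# (namespace, pod name, CPU request, memory request). Limits are ignored.
ROWS = [
    ("default", "pluscode-demo-75bc4678bd-982bk", "250m", "250Mi"),
    ("kube-system", "fluentbit-gke-tvgq8", "100m", "200Mi"),
    ("kube-system", "gke-metrics-agent-pzbv6", "3m", "50Mi"),
    ("kube-system", "kube-proxy-gke-standard2-default-pool-eaa029db-bpbx", "100m", "0"),
    ("kube-system", "pdcsi-node-wn274", "0", "0"),
]

# Assumption: only kube-system counts as "system"; add other namespaces as needed.
SYSTEM_NAMESPACES = {"kube-system"}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as "250m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_mib(quantity):
    """Convert a memory quantity such as "200Mi" or "0" to MiB (Ki/Mi/Gi only)."""
    factors = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in factors.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / (1024 * 1024)  # no suffix means plain bytes


def totals_by_category(rows):
    """Sum CPU (millicores) and memory (MiB) requests for user and system pods."""
    totals = {"user": [0, 0.0], "system": [0, 0.0]}
    for namespace, _name, cpu, memory in rows:
        category = "system" if namespace in SYSTEM_NAMESPACES else "user"
        totals[category][0] += cpu_millicores(cpu)
        totals[category][1] += memory_mib(memory)
    return totals


for category, (cpu_m, mem_mib) in totals_by_category(ROWS).items():
    print(f"{category}: {cpu_m}m CPU, {mem_mib:.0f}Mi memory")
```

For this node, that works out to 250m CPU and 250Mi memory for user pods, and 203m CPU and 250Mi memory for system pods.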

OS and Kubernetes overhead
You can see the reserved OS & Kubernetes overhead by comparing the Allocatable (what the Kubernetes Scheduler can allocate to Pods) and the Capacity.

Capacity:
  attachable-volumes-gce-pd:  127
  cpu:                        1
  ephemeral-storage:          98868448Ki
  hugepages-2Mi:              0
  memory:                     3776196Ki
  pods:                       110
Allocatable:
  attachable-volumes-gce-pd:  127
  cpu:                        940m
  ephemeral-storage:          47093746742
  hugepages-2Mi:              0
  memory:                     2690756Ki
  pods:                       110

Focusing on CPU and memory, we can see that 60m of CPU has been reserved for the system (Capacity of 1000m minus Allocatable of 940m), along with about 1GiB of memory (Capacity of 3776196Ki minus Allocatable of 2690756Ki, which is 1085440Ki).
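
The same subtraction can be sketched in a few lines of Python. The values are copied from the Capacity and Allocatable output above. The two conversion helpers are my own and only handle the quantity formats that appear in this example.

```python
def cpu_millicores(quantity):
    """Convert a CPU quantity such as "940m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_kib(quantity):
    """Convert a memory quantity with a Ki suffix to KiB.

    Only the Ki suffix is handled, since that is what the output above uses.
    """
    if not quantity.endswith("Ki"):
        raise ValueError(f"unsupported memory quantity: {quantity}")
    return int(quantity[:-2])


# Values copied from the kubectl describe node output above.
capacity = {"cpu": "1", "memory": "3776196Ki"}
allocatable = {"cpu": "940m", "memory": "2690756Ki"}

reserved_cpu_m = cpu_millicores(capacity["cpu"]) - cpu_millicores(allocatable["cpu"])
reserved_mem_kib = memory_kib(capacity["memory"]) - memory_kib(allocatable["memory"])

print(f"reserved CPU:    {reserved_cpu_m}m")
print(f"reserved memory: {reserved_mem_kib}Ki ({reserved_mem_kib / 1024:.0f}Mi)")
```

This prints 60m for CPU and 1085440Ki (1060Mi) for memory, matching the figures above.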

System pods
In addition to the OS/Kubernetes reservation there are several system components that are scheduled as regular Kubernetes Pods. These come out of Allocatable.

System pods can be seen by adding up the requests of pods in the “kube-system” namespace (and potentially other system-related namespaces):

Non-terminated Pods:          (5 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  kube-system                 fluentbit-gke-djqv5                                    100m (10%)    0 (0%)      200Mi (7%)       500Mi (19%)    79m
  kube-system                 gke-metrics-agent-pdc8v                                3m (0%)       0 (0%)      50Mi (1%)        50Mi (1%)      79m
  kube-system                 kube-dns-6465f78586-8xhq8                              260m (27%)    0 (0%)      110Mi (4%)       210Mi (7%)     77m
  kube-system                 kube-proxy-gke-standard2-default-pool-9985cf1b-749b    100m (10%)    0 (0%)      0 (0%)           0 (0%)         78m
  kube-system                 pdcsi-node-np7q2                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         79m

Again, focus on Requests, not Limits.

Empty space
Empty space can be calculated by comparing the Allocatable resources (shown earlier) to the Allocated resources (shown below). In this example with no user pods, the Allocatable CPU was 940m, of which 463m is allocated. Therefore, 477m of CPU is free space that Pods can be scheduled onto.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests     Limits
  --------                   --------     ------
  cpu                        463m (49%)   0 (0%)
  memory                     360Mi (13%)  760Mi (28%)
  ephemeral-storage          0 (0%)       0 (0%)
  hugepages-2Mi              0 (0%)       0 (0%)
  attachable-volumes-gce-pd  0            0
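
As a worked version of that calculation, here is a short Python sketch covering both CPU and memory. The inputs are the Allocatable values and the Requests column of the Allocated resources table, converted by hand to millicores and KiB. The function name is mine.

```python
def free_space(allocatable, requested):
    """Return allocatable minus requested for each resource."""
    return {name: allocatable[name] - requested[name] for name in allocatable}


# Allocatable: 940m CPU, 2690756Ki memory.
allocatable = {"cpu_m": 940, "memory_kib": 2690756}
# Allocated requests: 463m CPU, 360Mi memory (converted to KiB).
requested = {"cpu_m": 463, "memory_kib": 360 * 1024}

free = free_space(allocatable, requested)
for name, value in free.items():
    share = 100 * value / allocatable[name]
    print(f"{name}: {value} free ({share:.0f}% of allocatable)")
```

On this node, roughly half of the allocatable CPU and most of the allocatable memory are empty space.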

So that’s how you can track each of the 4 buckets that your node’s resources are allocated to.

Side note: Calculating per-node overhead

The above calculations categorize our resources for a given node. If you do this for every node, then you will have an accurate picture of your entire cluster.

If you’re more interested in a generalization of what the per-node overhead of system workloads is (i.e. the OS/Kubernetes overhead and system pods), then looking at a sample output of a single node will not give you the full story, as some system pods are not on every node.

In the above example, if you look at Allocated resources for system pods, you will see that a full 463m of CPU is used for system pods. That’s a lot! Fortunately, this doesn’t apply to every node, as not all system pods are deployed as DaemonSets.

If you want to calculate just the DaemonSet system pods, you can figure out which of the pods in the above list belong to DaemonSets by querying the DaemonSets like so:

$ kubectl get daemonset -n kube-system
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                        AGE
fluentbit-gke               9         9         9       9            9           kubernetes.io/os=linux                                               86m
gke-metrics-agent           9         9         9       9            9           kubernetes.io/os=linux                                               86m
gke-metrics-agent-windows   0         0         0       0            0           kubernetes.io/os=windows                                             86m
kube-proxy                  0         0         0       0            0           kubernetes.io/os=linux,node.kubernetes.io/kube-proxy-ds-ready=true   86m
metadata-proxy-v0.1         0         0         0       0            0           cloud.google.com/metadata-proxy-ready=true,kubernetes.io/os=linux    86m
nvidia-gpu-device-plugin    0         0         0       0            0           <none>                                                               86m
pdcsi-node                  9         9         9       9            9           kubernetes.io/os=linux                                               86m

The DaemonSets with a non-zero DESIRED count are the ones whose pods will be on every matching node. You can now add up the requests for all these Pods, which will give you the total allocation to DaemonSet system pods on every node.

To do the same thing for the non-DaemonSet pods, you can query the system Deployments like so:

$ kubectl get deploy -n kube-system
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
event-exporter-gke                         1/1     1            1           100m
kube-dns                                   2/2     2            2           100m
kube-dns-autoscaler                        1/1     1            1           100m
l7-default-backend                         1/1     1            1           100m
metrics-server-v0.3.6                      1/1     1            1           100m
stackdriver-metadata-agent-cluster-level   1/1     1            1           100m

In this case you would calculate the total resources allocated to kube-system Deployments across the cluster by multiplying the replica count by the per-pod resource requests for each Deployment, then adding up the results. Dividing this total by the number of nodes gives you the per-node average of non-DaemonSet system workloads.

Calculating the final average system overhead per node is then the sum of the OS/Kubernetes overhead, the resources requested by daemonset pods, and the per-node average of the non-daemonset system workloads.
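
Here is a rough Python sketch of that sum for CPU. It is illustrative only: the reserved figure and the per-node pod requests come from the sample node output earlier, kube-dns (2 replicas at 260m each) is the only Deployment whose requests appear in this post, and the node count of 9 is an assumption based on the DESIRED column of the DaemonSet listing. The other Deployments are left out, so the real figure would be higher.

```python
def per_node_system_overhead(reserved, per_node_pods, deployments, node_count):
    """Average system overhead per node, in the same unit as the inputs.

    reserved:       Capacity minus Allocatable for one node
    per_node_pods:  requests of the system pods that run on every node
    deployments:    (replicas, request_per_replica) for each system Deployment
    node_count:     number of nodes in the cluster
    """
    deployment_total = sum(replicas * request for replicas, request in deployments)
    return reserved + sum(per_node_pods) + deployment_total / node_count


cpu_overhead_m = per_node_system_overhead(
    reserved=60,                     # 1000m Capacity minus 940m Allocatable
    per_node_pods=[100, 3, 100, 0],  # fluentbit, metrics agent, kube-proxy, pdcsi
    deployments=[(2, 260)],          # kube-dns only; other Deployments omitted
    node_count=9,                    # assumed from the DaemonSet DESIRED column
)
print(f"{cpu_overhead_m:.0f}m CPU per node")
```

With these inputs the estimate is about 321m of CPU per node: 60m reserved, 203m for per-node pods, and about 58m as the per-node share of kube-dns.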

While it’s good to consider what the per-node system overhead is, don’t ignore unallocated (empty) space, as depending on how well your cluster is bin-packed, this can be considerable. Once again, Autopilot frees you from this concern: you just pay directly for your Pod resource requests, with no need to worry about the rest.