The question often comes up: where do my node resources go in Kubernetes? In this post, I share how to use kubectl describe node to look at resource usage by different categories (system, user, etc.).
The way I think about Kubernetes node resource usage is to assign resource allocations into 4 categories:
- OS and Kubernetes overhead
- System pods
- User pods
- Empty space
The OS & Kubernetes overhead includes the Linux kernel, the kubelet, and, for memory, an allocation known as the “eviction threshold”. This reserve gives the kubelet a chance to catch Pods that go over their memory allocation and take action according to their Kubernetes priorities before the kernel’s OOM handling, which is a much blunter instrument, kicks in.
System pods include DaemonSet pods like
kube-proxy which runs on every node, and non-DaemonSet deployments like
kube-dns that are scaled separately.
If you’re using GKE Autopilot, then only one of those resource groupings is billable: the user pods. On other platforms you’re generally paying for the node as a whole, so you essentially pay for every resource group, and it is worth knowing where your resources are going. The ideal case is for “User pods” to be the largest allocation and “Empty space” to be close to zero (on Autopilot, that is achieved automatically, as User pods is the only allocation).
If you want to know what your node resources currently get spent on, here’s how. This needs to be repeated on every node to get the complete picture, so you may wish to automate it or run it on a small representative cluster.
Firstly, list your nodes, then describe a node to see its allocation table:

$ kubectl get nodes
$ kubectl describe node NODE_NAME
Then, you can categorize the allocated resources into our 4 resource groups, as follows:
User pod allocation is calculated by looking at the “Requests” resource columns from the
kubectl describe node output. The relevant columns here are the “Requests”, not the “Limits”. Requests impact how the pod is scheduled and what resources are allocated to it, whereas limits are used to enable pods to burst beyond their allocation. Look at all pods not in the “kube-system” namespace; these are the Pods you will pay for on Autopilot. In my example, this is the pod in the “default” namespace.
Non-terminated Pods:          (5 in total)
  Namespace    Name                                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                                                 ------------  ----------  ---------------  -------------  ---
  default      pluscode-demo-75bc4678bd-982bk                       250m (25%)    250m (25%)  250Mi (8%)       250Mi (8%)     83s
  kube-system  fluentbit-gke-tvgq8                                  100m (10%)    0 (0%)      200Mi (7%)       500Mi (19%)    104m
  kube-system  gke-metrics-agent-pzbv6                              3m (0%)       0 (0%)      50Mi (1%)        50Mi (1%)      104m
  kube-system  kube-proxy-gke-standard2-default-pool-eaa029db-bpbx  100m (10%)    0 (0%)      0 (0%)           0 (0%)         103m
  kube-system  pdcsi-node-wn274                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         104m
OS and Kubernetes overhead
You can see the reserved OS & Kubernetes overhead by comparing the Allocatable (what the Kubernetes Scheduler can allocate to Pods) and the Capacity.
Capacity:
  attachable-volumes-gce-pd:  127
  cpu:                        1
  ephemeral-storage:          98868448Ki
  hugepages-2Mi:              0
  memory:                     3776196Ki
  pods:                       110
Allocatable:
  attachable-volumes-gce-pd:  127
  cpu:                        940m
  ephemeral-storage:          47093746742
  hugepages-2Mi:              0
  memory:                     2690756Ki
  pods:                       110
Focusing on CPU and memory, we can see that 60m CPU has been reserved for the system (Capacity of 1000m, minus the Allocatable of 940m), along with about 1GiB of memory (Capacity of 3776196Ki minus Allocatable of 2690756Ki).
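As a quick sanity check, the subtraction can be scripted. This is just a sketch over the sample Capacity and Allocatable values copied from the output above:

```shell
# OS/Kubernetes overhead = Capacity - Allocatable (values from the sample node)
cpu_capacity=1000; cpu_allocatable=940          # mCPU
mem_capacity=3776196; mem_allocatable=2690756   # Ki
cpu_overhead=$((cpu_capacity - cpu_allocatable))
mem_overhead=$((mem_capacity - mem_allocatable))
echo "CPU overhead: ${cpu_overhead}m"           # 60m
echo "Memory overhead: ${mem_overhead}Ki"       # 1085440Ki, roughly 1GiB
```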
In addition to the OS/Kubernetes reservation there are several system components that are scheduled as regular Kubernetes Pods. These come out of Allocatable.
System pods can be seen by adding up the requests of pods in the “kube-system” namespace (and potentially other system-related namespaces):
Non-terminated Pods:          (5 in total)
  Namespace    Name                                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                                                 ------------  ----------  ---------------  -------------  ---
  kube-system  fluentbit-gke-djqv5                                  100m (10%)    0 (0%)      200Mi (7%)       500Mi (19%)    79m
  kube-system  gke-metrics-agent-pdc8v                              3m (0%)       0 (0%)      50Mi (1%)        50Mi (1%)      79m
  kube-system  kube-dns-6465f78586-8xhq8                            260m (27%)    0 (0%)      110Mi (4%)       210Mi (7%)     77m
  kube-system  kube-proxy-gke-standard2-default-pool-9985cf1b-749b  100m (10%)    0 (0%)      0 (0%)           0 (0%)         78m
  kube-system  pdcsi-node-np7q2                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         79m
Again, focus on Requests, not Limits.
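The summing can be done with a quick pipeline. This sketch feeds the CPU Requests column of the kube-system pods from the sample output above into awk; against a live cluster you would pipe the kubectl describe node output instead:

```shell
# Sum the CPU requests (in mCPU) of the kube-system pods from the sample table.
# Field 3 is the CPU request; strip the "m" suffix before adding.
system_cpu=$(awk '{gsub(/m/,"",$3); sum += $3} END {print sum}' <<'EOF'
kube-system fluentbit-gke-djqv5 100m
kube-system gke-metrics-agent-pdc8v 3m
kube-system kube-dns-6465f78586-8xhq8 260m
kube-system kube-proxy-gke-standard2-default-pool-9985cf1b-749b 100m
kube-system pdcsi-node-np7q2 0
EOF
)
echo "${system_cpu}m"   # matches the 463m CPU shown under Allocated resources
```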
Empty space can be calculated by comparing the Allocatable resources to the Allocated resources given in the tables above and below. In this example with no user pods, for CPU we saw 940m was Allocatable, of which 463m is Allocated. Therefore, 477m CPU is the free space onto which Pods can be scheduled.
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests     Limits
  --------                   --------     ------
  cpu                        463m (49%)   0 (0%)
  memory                     360Mi (13%)  760Mi (28%)
  ephemeral-storage          0 (0%)       0 (0%)
  hugepages-2Mi              0 (0%)       0 (0%)
  attachable-volumes-gce-pd  0            0
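The empty-space arithmetic, using the sample figures above:

```shell
# Empty space = Allocatable - Allocated (sample CPU values from this node)
allocatable_mcpu=940
allocated_mcpu=463
empty_mcpu=$((allocatable_mcpu - allocated_mcpu))
echo "${empty_mcpu}m CPU free for scheduling"   # 477m
```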
So that’s how you can track each of the 4 buckets your node’s resources are allocated to.
Side note: Calculating per-node overhead
The above calculations categorize our resources for a given node. If you do this for every node, then you will have an accurate picture of your entire cluster.
If you’re more interested in a generalization of what the per-node overhead of system workloads is (i.e. the OS/Kubernetes overhead and system pods), then looking at a sample output of a single node will not give you the full story, as some system pods are not on every node.
In the above example, if you look at the Allocated resources for system pods, you’ll see that a full 463m CPU is used for system pods. That’s a lot! Fortunately, this doesn’t apply to every node, as not all system pods are deployed as DaemonSets.
If you want to calculate just the DaemonSet system pods, you can figure out which of the above list are such pods by querying the DaemonSets like so:
$ kubectl get daemonset -n kube-system
NAME                       DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                                                       AGE
fluentbit-gke              9        9        9      9           9          kubernetes.io/os=linux                                              86m
gke-metrics-agent          9        9        9      9           9          kubernetes.io/os=linux                                              86m
gke-metrics-agent-windows  0        0        0      0           0          kubernetes.io/os=windows                                            86m
kube-proxy                 0        0        0      0           0          kubernetes.io/os=linux,node.kubernetes.io/kube-proxy-ds-ready=true  86m
metadata-proxy-v0.1        0        0        0      0           0          cloud.google.com/metadata-proxy-ready=true,kubernetes.io/os=linux   86m
nvidia-gpu-device-plugin   0        0        0      0           0          <none>                                                              86m
pdcsi-node                 9        9        9      9           9          kubernetes.io/os=linux                                              86m
These are the pods that will be on every node. You can now add up the requests for all these Pods, which will give you the total allocation to system pods on every node.
To do the same thing for the non-DaemonSet pods, you can query the system Deployments like so:
$ kubectl get deploy -n kube-system
NAME                                      READY  UP-TO-DATE  AVAILABLE  AGE
event-exporter-gke                        1/1    1           1          100m
kube-dns                                  2/2    2           2          100m
kube-dns-autoscaler                       1/1    1           1          100m
l7-default-backend                        1/1    1           1          100m
metrics-server-v0.3.6                     1/1    1           1          100m
stackdriver-metadata-agent-cluster-level  1/1    1           1          100m
In this case you would calculate the total resources allocated to kube-system Deployments across the cluster by multiplying each Deployment’s replica count by the resource requests of its pods. Dividing this by the total number of nodes gives you the per-node average of non-DaemonSet system workloads.
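For one Deployment, the arithmetic looks like this. The kube-dns figures come from the samples above (2 replicas, 260m CPU requested per pod, and a 9-node cluster per the DaemonSet DESIRED counts); you would repeat this for each Deployment and sum the results:

```shell
# Per-node average for a Deployment: replicas * per-pod request / node count
replicas=2; request_mcpu=260; nodes=9
total_mcpu=$((replicas * request_mcpu))       # 520m across the cluster
per_node_mcpu=$((total_mcpu / nodes))         # integer division: ~57m per node
echo "${per_node_mcpu}m per node"
```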
Calculating the final average system overhead per node is then the sum of the OS/Kubernetes overhead, the resources requested by daemonset pods, and the per-node average of the non-daemonset system workloads.
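Putting the three components together with the sample numbers above (the 60m Deployment average is a hypothetical placeholder, since it depends on summing every kube-system Deployment):

```shell
# Final per-node system overhead = OS/Kubernetes reservation
#   + DaemonSet pod requests + per-node average of system Deployments
os_overhead_mcpu=60      # Capacity - Allocatable, from the sample node
daemonset_mcpu=203       # fluentbit 100m + metrics-agent 3m + kube-proxy 100m + pdcsi 0m
deploy_avg_mcpu=60       # hypothetical per-node Deployment average
total_mcpu=$((os_overhead_mcpu + daemonset_mcpu + deploy_avg_mcpu))
echo "${total_mcpu}m system overhead per node"  # 323m
```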
While it’s good to consider what the per-node system overhead is, don’t ignore unallocated (empty) space: depending on how well your cluster is bin-packed, this can be considerable. Once again Autopilot frees you from this concern, as you pay directly for your Pod resource requests, with no need to worry about the rest.