GKE operates on a flat VPC structure. That means that every Node and Pod has an identity within your VPC, and their IPs are not re-used. This is convenient, as Pods are addressable within the VPC, but unless you create multiple VPCs to isolate resources, you can end up using a lot of IPs very quickly. Fortunately, the system is pretty flexible and there are some steps you can take to optimize.
In this post, I share my personal analysis on IP address range planning and sharing after distilling the great docs on the subject, and running some experiments of my own.
NOTE: This post refers to VPC native GKE clusters.
First, a quick primer: there are 3 ranges allocated for a GKE cluster. Nodes, Pods and Services:
Node IPs are allocated from the subnet’s primary range, which can be shared within the region and can be expanded (provided there’s no overlap). In the UI this is chosen with “Node subnet”; in the CLI you can select an existing subnet with `--subnetwork` or create a new one with `--create-subnetwork`.
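For example, here’s a sketch of selecting an existing subnet at cluster creation (the subnet name `my-subnet` and region are assumptions for illustration):

```shell
# Sketch: attach a cluster to an existing subnet ("my-subnet" is an
# assumed name). Node IPs are drawn from that subnet's primary range.
gcloud container clusters create-auto autopilot-0 \
    --region us-west1 \
    --subnetwork my-subnet
```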
Service IPs are virtual within the cluster (you can’t access them from other clusters), and GKE provides a default /20 range for you, which is re-used on all clusters, giving you 4k Services without needing to allocate any of your own IP space. In the UI this is “Service address range”; in the CLI it is
--services-secondary-range-name. When providing your own range, it should be a secondary range within the subnet.
Pod IPs come from a secondary range which you provide. This is the largest allocation: GKE looks at the max number of Pods per node and assigns twice that many IPs to every node, regardless of how many Pods are actually running, so it’s pretty hungry (the default is 64 IPs per node for Autopilot, and 256 for node-based clusters). Fortunately, additional secondary ranges for Pods can be added later. The UI refers to this range as “Cluster default Pod address range”; the CLI uses
--cluster-ipv4-cidr (to create a new range) or
--cluster-secondary-range-name (to specify an existing one). The UI will always create a new range.
| K8s Resource | IP Resource | Default | Shared | Extendable | Non-1918 IP Candidate |
|---|---|---|---|---|---|
| Nodes | Subnet primary range | /20 | Yes, default | Yes (must be contiguous and not overlap other subnets) | No |
| Services | Subnet secondary range | /20, Google provided | Yes (can be reused in entirety) | No | Yes, highly recommended |
| Pods | Subnet secondary range | /17 (512 nodes at 32 Pods per node) | Yes (named secondary ranges only) | Yes; the initial range is immutable, but can be extended with additional ranges | Yes, recommended |
The Node range can be sized up, provided there are no overlapping ranges. The default gets you 4k nodes. If you expect to need more than 4k nodes among your clusters in the region, you should create a larger subnet. Subnets are shared by clusters in the region by default, which reduces wastage; to avoid sharing, create a new subnet per cluster. Typically these IPs will come from your VPC’s private network space in 10.0.0.0/8.
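If you left adjacent IP space free, a subnet’s primary range can be widened in place later. A sketch, assuming the `default` subnet in `us-west1` and a target prefix of /18 (both are illustrative values):

```shell
# Sketch: widen the primary range of the "default" subnet to a /18.
# The new range must not overlap any other subnet in the VPC.
gcloud compute networks subnets expand-ip-range default \
    --region=us-west1 \
    --prefix-length=18
```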
Services are capped at 4k by default, and the range cannot be changed after cluster creation. If you need 4k or fewer, just leave the default in place, as these IPs are provided by Google (so don’t consume any of your network). If 4k seems small to you, consider creating a named secondary range of size
/19 or larger in the subnet, and sharing that custom range among all the clusters in the region. Since Services are virtual IPs with no meaning outside the cluster, it’s advisable to use non-RFC 1918 IPs, and to reuse the range for each cluster (this is also why Google can provide a single
/20 that is shared among many clusters). There is basically no benefit in not sharing this range between clusters, and no need to use RFC 1918 IPs, as this traffic stays local within the cluster.
Here’s how to create a larger shared service range:
```shell
$ gcloud compute networks subnets update default \
    --add-secondary-ranges shared-services=100.64.0.0/19 \
    --region=us-west1
Updated [.../regions/us-west1/subnetworks/default]
$ gcloud container clusters create-auto autopilot-1 \
    --services-secondary-range-name shared-services --region us-west1
Created [.../zones/us-west1/clusters/autopilot-1].
$ gcloud container clusters create-auto autopilot-2 \
    --services-secondary-range-name shared-services --region us-west1
Created [.../zones/us-west1/clusters/autopilot-2].
```
Pod IP ranges can be added at any time, so there’s no need to provision excessive amounts upfront, though scaling does pause when the limit is hit until you add more. GKE uses a lot of Pod IPs; check this doc to size the initial range.
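The sizing arithmetic is worth seeing once. Using the defaults mentioned above (Autopilot’s 32 max Pods per node, doubled to 64 reserved IPs per node), a /17 Pod range supports 512 nodes:

```shell
# How many nodes fit in a /17 Pod range, given that GKE reserves
# 2x the max-Pods-per-node worth of IPs on every node?
max_pods_per_node=32
ips_per_node=$((max_pods_per_node * 2))   # 64 IPs, i.e. a /26 per node
range_ips=$((1 << (32 - 17)))             # a /17 holds 32768 IPs
echo $((range_ips / ips_per_node))        # 512 nodes
```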
Since so many Pod IPs are needed, it can be advisable to use non-RFC 1918 space, preferably one of the other private IP ranges (like 240.0.0.0/4). These IPs can be used within your VPC (they have no meaning outside of it), and traffic to on-prem locations can be masqueraded using the Node’s IP (which is normally in 10.0.0.0/8). To create a cluster using non-RFC 1918 address space:
```shell
$ gcloud container clusters create-auto autopilot-3 \
    --cluster-ipv4-cidr "240.60.0.0/17" \
    --region us-west1
Created [.../zones/us-west1/clusters/autopilot-3].
```
You can share Pod IP ranges, but only via the CLI, and only by referencing existing ranges. The UI, and the CLI params where you pass a CIDR range (like the example above), create new secondary ranges. Secondary ranges created automatically by GKE in this way cannot be shared (they are deleted when you delete the cluster). Here’s an example Pod IP range shared by 2 clusters; notice how I create the range first.
```shell
$ gcloud compute networks subnets update default \
    --add-secondary-ranges shared-pods=240.10.0.0/17 \
    --region=us-west1
Updated [.../regions/us-west1/subnetworks/default]
$ gcloud container clusters create-auto autopilot-3 \
    --cluster-secondary-range-name=shared-pods \
    --region us-west1
Created [.../zones/us-west1/clusters/autopilot-3].
$ gcloud container clusters create-auto autopilot-4 \
    --cluster-secondary-range-name=shared-pods \
    --region us-west1
Created [.../zones/us-west1/clusters/autopilot-4].
```
You can add additional Pod secondary ranges at any time. This does interrupt autoscaling, and in the case of non-Autopilot clusters you will need to create new node pools to use them (for Autopilot, the new range is used automatically). Additional secondary ranges for Pods cannot be shared.
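As a sketch of what that looks like (the range name `extra-pods`, the CIDR, and the cluster name are assumptions; the `--additional-pod-ipv4-ranges` flag is taken from current gcloud docs):

```shell
# Sketch: create a second named secondary range on the subnet, then
# attach it to an existing cluster as an additional Pod range.
gcloud compute networks subnets update default \
    --add-secondary-ranges extra-pods=240.20.0.0/17 \
    --region=us-west1
gcloud container clusters update autopilot-3 \
    --additional-pod-ipv4-ranges=extra-pods \
    --region us-west1
```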
The default settings are pretty decent for most people. The range you’re most likely to run out of is Pods, and it can easily be extended (especially in Autopilot mode, where new ranges are picked up automatically). Things to consider upfront: are 4k nodes among the clusters in the region enough, or do you need to increase that (or leave adjacent IP space so it can be enlarged later)? Are 4k Services enough for the cluster, or should you create your own larger shared range?
Remember to use non-RFC 1918 ranges for Pods from the get-go, to preserve RFC 1918 address space for other uses (like Nodes).