Pod affinity is a useful technique in Kubernetes for expressing a requirement that a pod, say one with the "reader" role, be co-located with another pod, say one with the "writer" role. You can express that requirement by adding something like the following to the reader pod:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: pod
          operator: In
          values:
          - writer-pod
      topologyKey: "kubernetes.io/hostname"
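For context, here's a minimal sketch of a complete reader Pod with that rule in place (the Pod name, labels, and pause image are illustrative; it assumes the writer pod carries the label pod: writer-pod):

apiVersion: v1
kind: Pod
metadata:
  name: reader
  labels:
    pod: reader-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: pod
            operator: In
            values:
            - writer-pod
        topologyKey: "kubernetes.io/hostname"
  containers:
  - name: reader-container
    image: "k8s.gcr.io/pause"  # placeholder workload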
The catch is that if there is no room on the writer's node, and nothing on it can be evicted, this reader pod may never schedule.
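If you hit this state, the reader pod will sit in Pending, and the scheduler's reasoning shows up in its events (the pod name below is a placeholder):

kubectl describe pod <reader-pod-name>

Look for FailedScheduling events that mention the unsatisfied affinity rule.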
In the future, serverless Kubernetes platforms like Autopilot may recognize this state and automatically move the writer pod to a larger node where the reader Pod's constraint can be satisfied, but alas that's not how it works today.
So, for this very specific case of wanting to strictly co-locate two pods, how can we do it? One solution is to use a DaemonSet for one of the pods, and a regular Deployment for the other. We'll need to layer on a few additional tricks to get this to work, though: we only want a single reader pod per node (which we can achieve with pod anti-affinity), and we may wish to run other workloads in the cluster (which we can keep apart with workload separation).
Putting it all together, we can set up our writer pod as a DaemonSet with workload separation:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: writer
spec:
  selector:
    matchLabels:
      ds: writer-pod
  template:
    metadata:
      labels:
        ds: writer-pod
    spec:
      tolerations:
      - key: group
        operator: Equal
        value: "read_writer_pair"
        effect: NoSchedule
      nodeSelector:
        group: "read_writer_pair"
      containers:
      - name: writer-container
        image: "k8s.gcr.io/pause"
        resources:
          requests:
            cpu: "1"
That's expected: the DaemonSet won't create any Pods until we add another workload to this workload separation group. Next, we'll create a Deployment that has an anti-affinity to itself (so only one replica is placed per node). The replica count of this Deployment determines the number of Pod-pairs we get.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reader
spec:
  replicas: 3
  selector:
    matchLabels:
      pod: reader-pod
  template:
    metadata:
      labels:
        pod: reader-pod
    spec:
      tolerations:
      - key: group
        operator: Equal
        value: "read_writer_pair"
        effect: NoSchedule
      nodeSelector:
        group: "read_writer_pair"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: pod
                operator: In
                values:
                - reader-pod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: reader-container
        image: "k8s.gcr.io/pause"
        resources:
          requests:
            cpu: "1"
Since Autopilot takes the DaemonSet's size into account when creating nodes, this pattern should always result in a paired deployment. Simply set the replica count of the reader Deployment to the quantity of pairs you want. There's no additional cost or overhead to this solution on Autopilot either: workload separation has no fee, just a slightly higher minimum resource request of 500 mCPU per Pod.
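For example, to go from 3 pairs to 5 (an arbitrary count for illustration), scale the reader Deployment; Autopilot provisions a node for each new replica, and the DaemonSet drops a writer Pod onto it:

kubectl scale deployment reader --replicas=5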
Note that the creation order matters: it’s important that the DaemonSet Pod (writer) is created before the reader (it’s OK if they are created together in rapid succession, but not if the Deployment is created a full minute before the DaemonSet).
Let’s run the demo:
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/strict-pod-colocation/writer.yaml
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/strict-pod-colocation/reader.yaml
watch -d kubectl get pods -o wide
In about a minute, you should see 3 pairs of pods, each pair deployed on its own node.
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
reader-5956f784f9-dfz8t 1/1 Running 0 3m20s 10.51.128.194 gk3-autopilot-cluster-3-nap-y725ql9p-986f4bbc-886z <none> <none>
reader-5956f784f9-gwqwj 1/1 Running 0 3m20s 10.51.128.131 gk3-autopilot-cluster-3-nap-y725ql9p-4cc171f7-9pr7 <none> <none>
reader-5956f784f9-tdgxv 1/1 Running 0 3m20s 10.51.129.5 gk3-autopilot-cluster-3-nap-y725ql9p-de6eb2d5-76q6 <none> <none>
writer-bf5f2 1/1 Running 0 86s 10.51.128.195 gk3-autopilot-cluster-3-nap-y725ql9p-986f4bbc-886z <none> <none>
writer-qzxpc 1/1 Running 0 81s 10.51.129.6 gk3-autopilot-cluster-3-nap-y725ql9p-de6eb2d5-76q6 <none> <none>
writer-svhmv 1/1 Running 0 83s 10.51.128.130 gk3-autopilot-cluster-3-nap-y725ql9p-4cc171f7-9pr7 <none> <none>
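When you're done, you can tear down the demo with the same manifests:

kubectl delete -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/strict-pod-colocation/reader.yaml
kubectl delete -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/strict-pod-colocation/writer.yaml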