Image streaming is a great way to speed up workload scaling on GKE. Take, for example, the deep learning image from Google: in my testing, the container is created in just 20s instead of 3m50s. There is slightly higher read latency while the image streams, but the 3m30s head start more than makes up for it.
If you use GKE in Autopilot mode and Artifact Registry for your images, then you’re already using image streaming. To verify, look for the ImageStreaming event with kubectl get events; if it’s not there, check your settings, such as whether the required API is enabled.
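For example, a quick check (assuming kubectl is already pointed at your cluster) is to filter recent events for that reason:
kubectl get events --all-namespaces | grep ImageStreaming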
But what about public images you download from DockerHub? These are not stored by Google in Artifact Registry, so they miss out on the benefits of image streaming. Fortunately, there is a way to enable image streaming for them: a remote Artifact Registry repository. Remote repositories act as a pass-through cache: if you request an image that isn’t in the cache, it is pulled from DockerHub, and on subsequent requests it is served directly from Artifact Registry. They also have the benefit of being compatible with image streaming!
To create your own DockerHub mirror, do the following:
REPOSITORY_NAME=dockerhub-mirror
LOCATION=us
gcloud artifacts repositories create $REPOSITORY_NAME \
--repository-format=docker \
--mode=remote-repository \
--remote-docker-repo=docker-hub \
--location=$LOCATION
As DockerHub rate-limits unauthenticated pulls, it’s highly encouraged to pass your DockerHub credentials by adding the following flags. Make sure you do this in production; for this simple test, it’s not needed.
--remote-username=REMOTE_USERNAME
--remote-password-secret-version=REMOTE_PASSWORD_SECRET_VERSION
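The password flag takes a Secret Manager secret version rather than a raw password. Here’s a minimal sketch of that setup, assuming a DockerHub access token and a secret named dockerhub-password (both placeholders; the Artifact Registry service agent also needs read access to the secret):
# Store a DockerHub access token as a Secret Manager secret
echo -n "YOUR_DOCKERHUB_TOKEN" | gcloud secrets create dockerhub-password --data-file=-
# Create the remote repository with credentials attached
gcloud artifacts repositories create $REPOSITORY_NAME \
  --repository-format=docker \
  --mode=remote-repository \
  --remote-docker-repo=docker-hub \
  --location=$LOCATION \
  --remote-username=YOUR_DOCKERHUB_USERNAME \
  --remote-password-secret-version=projects/YOUR_PROJECT/secrets/dockerhub-password/versions/1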
To pull from this newly created repo locally, make sure you authenticate docker:
gcloud auth configure-docker $LOCATION-docker.pkg.dev
Now all you need to do is prepend your repository path to the DockerHub container image.
For example, the images anyscale/ray-ml:2.7.0 and docker.io/anyscale/ray-ml:2.7.0 become us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0.
To use the mirror you just created instead of mine, build the image prefix from your own project and repository:
PROJECT=$(gcloud config list project --format "value(core.project)")
IMAGE_PREFIX="$LOCATION-docker.pkg.dev/${PROJECT}/$REPOSITORY_NAME/"
DOCKER_IMAGE="anyscale/ray-ml:2.7.0"
echo ${IMAGE_PREFIX}${DOCKER_IMAGE}
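If you want to sanity-check the mirror locally (optional, and as noted below a local pull won’t pre-populate the image-streaming cache), you can pull through it with Docker:
docker pull ${IMAGE_PREFIX}${DOCKER_IMAGE}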
The first time you pull this image in GKE, it won’t be any faster than pulling from DockerHub, but the second and subsequent times it will be. To try it, schedule the workload, cordon the node it lands on (the node has a local container cache, so deleting just the Pod isn’t enough), then delete the Pod and watch it schedule a second time.
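A minimal sequence for forcing that second scheduling pass, assuming the second example below is saved as test2.yaml (the file, node, and Pod names are placeholders; take the real ones from the kubectl output):
kubectl apply -f test2.yaml
kubectl get pods -o wide      # note which node the Pod landed on
kubectl cordon NODE_NAME      # keep the replacement Pod off the cached node
kubectl delete pod POD_NAME   # the Deployment recreates it on a fresh node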
Note: only pulling the image in GKE loads it into the image-streaming cache. While docker pull works against your remote repository path, it doesn’t pre-cache the image.
Watch your events like so, and you should see an ImageStreaming event. In my testing with this exact container, it went from ContainerCreating to Running in 9.82s, compared to 5m17s without image streaming.
$ k get events --watch-only=true
0s Normal Pulling pod/test2-5f897c684d-jzrdb Pulling image "us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0"
0s Normal ImageStreaming node/gk3-test-cluster-nap-z1jaqc1d-d3316a3d-x2pd (combined from similar events): Image us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0 is backed by image streaming.
0s Normal Pulled pod/test2-5f897c684d-jzrdb Successfully pulled image "us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0" in 7.923s (7.923s including waiting)
In my testing, I used the following two examples to compare. The first pulls directly from DockerHub:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: test
  template:
    metadata:
      labels:
        pod: test
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
      containers:
      - name: test-cont
        image: anyscale/ray-ml:2.7.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: 10Gi
            memory: 26Gi
          limits:
            nvidia.com/gpu: "1"
The second uses the newly created remote repo. Just remember to run it twice, cordoning the first node it lands on in between!
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test2
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: test2
  template:
    metadata:
      labels:
        pod: test2
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
      containers:
      - name: test-cont
        image: us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: 10Gi
            memory: 26Gi
          limits:
            nvidia.com/gpu: "1"