Image streaming is a great way to speed up workload scaling on GKE. Take, for example, the deep learning image from Google: in my testing, the container is created in just 20s instead of 3m50s. There is slightly higher read latency while the image streams, but the 3m30s head start more than makes up for it.
If you use GKE in Autopilot mode and Artifact Registry for your images, then you’re already using image streaming. To verify, look for the ImageStreaming event with kubectl get events; if it’s not there, check your settings, such as whether the required API is enabled.
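For example, a quick check (assuming kubectl is already pointed at your cluster) is to filter recent events for that reason:
kubectl get events --all-namespaces | grep ImageStreaming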
But what about public images you download from DockerHub? These are not stored by Google in Artifact Registry, so they miss out on the benefits of image streaming. Fortunately, there is a way to enable image streaming for them: a remote Artifact Registry repository. Remote repositories act as a pass-through cache: if you request an image that isn’t in the cache, it is pulled from DockerHub, and on subsequent requests it is served directly from Artifact Registry. They also have the benefit of being compatible with image streaming!
To create your own DockerHub mirror, do the following:
REPOSITORY_NAME=dockerhub-mirror
LOCATION=us
gcloud artifacts repositories create $REPOSITORY_NAME \
--repository-format=docker \
--mode=remote-repository \
--remote-docker-repo=docker-hub \
--location=$LOCATION
As DockerHub rate-limits unauthenticated pulls, it’s highly encouraged to pass your DockerHub credentials by adding the following flags. Make sure you do this in production; for this simple test, it’s not needed.
--remote-username=REMOTE_USERNAME
--remote-password-secret-version=REMOTE_PASSWORD_SECRET_VERSION
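The password flag takes a Secret Manager secret version rather than a raw password. Here’s a minimal sketch of that setup, assuming a DockerHub access token and a secret named dockerhub-password (both placeholders; the Artifact Registry service agent also needs read access to the secret):
# Store a DockerHub access token as a Secret Manager secret
echo -n "YOUR_DOCKERHUB_TOKEN" | gcloud secrets create dockerhub-password --data-file=-
# Create the remote repository with credentials attached
gcloud artifacts repositories create $REPOSITORY_NAME \
  --repository-format=docker \
  --mode=remote-repository \
  --remote-docker-repo=docker-hub \
  --location=$LOCATION \
  --remote-username=YOUR_DOCKERHUB_USERNAME \
  --remote-password-secret-version=projects/YOUR_PROJECT/secrets/dockerhub-password/versions/1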
To pull from this newly created repo locally, make sure you authenticate docker:
gcloud auth configure-docker $LOCATION-docker.pkg.dev
Now all you need to do is prepend your repository path to the DockerHub container image.
For example, the images anyscale/ray-ml:2.7.0 and docker.io/anyscale/ray-ml:2.7.0 become us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0.
To use the mirror you just created instead of mine, build the image prefix from your own project and repository:
PROJECT=$(gcloud config list project --format "value(core.project)")
IMAGE_PREFIX="$LOCATION-docker.pkg.dev/${PROJECT}/$REPOSITORY_NAME/"
DOCKER_IMAGE="anyscale/ray-ml:2.7.0"
echo ${IMAGE_PREFIX}${DOCKER_IMAGE}
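If you want to sanity-check the mirror locally (optional, and as noted below a local pull won’t pre-populate the image-streaming cache), you can pull through it with Docker:
docker pull ${IMAGE_PREFIX}${DOCKER_IMAGE}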
The first time you pull this image in GKE, it won’t be any faster than pulling from DockerHub, but the second and subsequent times it will be. To try it, schedule the workload, cordon the node it lands on (the node has a local container cache, so deleting just the Pod isn’t enough), then delete the Pod and watch it schedule a second time.
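A minimal sequence for forcing that second scheduling pass, assuming the second example below is saved as test2.yaml (the file, node, and Pod names are placeholders; take the real ones from the kubectl output):
kubectl apply -f test2.yaml
kubectl get pods -o wide      # note which node the Pod landed on
kubectl cordon NODE_NAME      # keep the replacement Pod off the cached node
kubectl delete pod POD_NAME   # the Deployment recreates it on a fresh node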
Note: only pulling the image in GKE loads it into the image-streaming cache. While docker pull works against your remote repository path, it doesn’t pre-cache the image.
Watch your events like so, and you should see an ImageStreaming event. In my testing with this exact container, it went from ContainerCreating to Running in 9.82s, compared to 5m17s without image streaming.
$ k get events --watch-only=true
0s Normal Pulling pod/test2-5f897c684d-jzrdb Pulling image "us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0"
0s Normal ImageStreaming node/gk3-test-cluster-nap-z1jaqc1d-d3316a3d-x2pd (combined from similar events): Image us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0 is backed by image streaming.
0s Normal Pulled pod/test2-5f897c684d-jzrdb Successfully pulled image "us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0" in 7.923s (7.923s including waiting)
In my testing, I used the following two examples to compare. The first pulls directly from DockerHub:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: test
  template:
    metadata:
      labels:
        pod: test
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
      containers:
      - name: test-cont
        image: anyscale/ray-ml:2.7.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: 10Gi
            memory: 26Gi
          limits:
            nvidia.com/gpu: "1"
The second uses the newly created remote repo. Just remember to run it twice, cordoning the first node it lands on in between!
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test2
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: test2
  template:
    metadata:
      labels:
        pod: test2
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
      containers:
      - name: test-cont
        image: us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: 10Gi
            memory: 26Gi
          limits:
            nvidia.com/gpu: "1"