# Using Image Streaming with DockerHub

Image streaming is a great way to speed up workload scaling on GKE. Take, for example, the deep learning image from Google. In my testing, the container is created in just 20s instead of 3m50s. While there is slightly higher latency on reads while the image streams, the 3m30s head start will more than make up for that.

If you use GKE in Autopilot mode, and Artifact Registry for your images, then you’re already using image streaming (to verify, look for the `ImageStreaming` event with `kubectl get events`; if it’s not there, check your settings, such as the API enablement).

But what about public images you download from DockerHub? These are not stored by Google in Artifact Registry, so they miss out on the benefits of image streaming. Fortunately, there is a way to enable image streaming for them: a [remote Artifact Registry repository](https://cloud.google.com/artifact-registry/docs/repositories/remote-repo). Remote repositories act as a pass-through cache: if you request an image that isn’t in the cache, it is pulled from DockerHub, and subsequent requests are served directly from Artifact Registry. They also have the benefit of being compatible with image streaming!

To create your own DockerHub mirror, do the following:

```shell
REPOSITORY_NAME=dockerhub-mirror
LOCATION=us
gcloud artifacts repositories create $REPOSITORY_NAME \
    --repository-format=docker \
    --mode=remote-repository \
    --remote-docker-repo=docker-hub \
    --location=$LOCATION
```

As DockerHub has rate limits on public connections, it’s highly encouraged to pass your DockerHub credentials by adding the following flags to the command. For production, make sure you’re doing this. For this simple test, it’s not needed.

```shell
    --remote-username=REMOTE_USERNAME \
    --remote-password-secret-version=REMOTE_PASSWORD_SECRET_VERSION
```

To pull from this newly created repo locally, make sure you authenticate Docker:

```shell
gcloud auth configure-docker $LOCATION-docker.pkg.dev
```

Now all you need to do is prepend your repository path to the DockerHub container image. For example, the images `anyscale/ray-ml:2.7.0` and `docker.io/anyscale/ray-ml:2.7.0` both become:

**`us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/`**`anyscale/ray-ml:2.7.0`

To build the same path for the mirror you just created, rather than mine:

```shell
PROJECT=$(gcloud config list project --format "value(core.project)")
IMAGE_PREFIX="$LOCATION-docker.pkg.dev/${PROJECT}/$REPOSITORY_NAME/"
DOCKER_IMAGE="anyscale/ray-ml:2.7.0"
echo ${IMAGE_PREFIX}${DOCKER_IMAGE}
```

The first time you pull this image, it won’t be any faster than pulling from DockerHub. But the second and subsequent times it will. To try it, schedule the workload, then cordon the node and delete the Pod so that it is rescheduled onto a fresh node (the node has a local container cache, so deleting just the Pod won’t work), then watch it schedule a second time.

Note: Only pulling the image in GKE will cause the image to be loaded into the image-streaming cache. While `docker pull` will work on your remote repository path, it doesn’t pre-cache the image.

Watch your events like so, and you should see an `ImageStreaming` event. In my testing on this exact container, the container went from ContainerCreating to Running in 9.82s, compared to 5m17s without image streaming.

```shell
$ kubectl get events --watch-only=true
0s Normal Pulling pod/test2-5f897c684d-jzrdb Pulling image "us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0"
0s Normal ImageStreaming node/gk3-test-cluster-nap-z1jaqc1d-d3316a3d-x2pd (combined from similar events): Image us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0 is backed by image streaming.
0s Normal Pulled pod/test2-5f897c684d-jzrdb Successfully pulled image "us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0" in 7.923s (7.923s including waiting)
```

In my testing, I used the following two examples to compare. The first pulls directly from DockerHub:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: test
  template:
    metadata:
      labels:
        pod: test
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
      containers:
      - name: test-cont
        image: anyscale/ray-ml:2.7.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: 10Gi
            memory: 26Gi
          limits:
            nvidia.com/gpu: "1"
```

The second uses the newly created remote repo. Just remember to run this twice: the first run only fills the cache, so cordon the first node it lands on and let the Pod be scheduled again!

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test2
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: test2
  template:
    metadata:
      labels:
        pod: test2
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
      containers:
      - name: test-cont
        image: us-docker.pkg.dev/gke-autopilot-demo/dockerhub-mirror/anyscale/ray-ml:2.7.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: 10Gi
            memory: 26Gi
          limits:
            nvidia.com/gpu: "1"
```
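
The cordon-and-reschedule step can be scripted. Here is a minimal sketch, assuming your `kubectl` context points at the test cluster and that the Pod carries the `pod=test2` label from the Deployment above. The function name `reschedule_on_fresh_node` is my own and not part of any tool.

```shell
# Hypothetical helper: push the Pod matching a label selector onto a fresh
# node, so the next image pull is not served from the node-local cache.
# $1 is a label selector, for example pod=test2.
reschedule_on_fresh_node() {
  selector="$1"

  # Look up the node that currently runs the first matching Pod.
  node=$(kubectl get pods -l "$selector" \
    -o jsonpath='{.items[0].spec.nodeName}') || return 1
  if [ -z "$node" ]; then
    echo "no scheduled Pod found for selector $selector" >&2
    return 1
  fi

  # Cordon the node so the replacement Pod cannot be scheduled back onto it.
  kubectl cordon "$node" || return 1

  # Delete the Pod; the Deployment controller creates a replacement,
  # which has to land on a different node.
  kubectl delete pod -l "$selector" || return 1
}
```

Call it with `reschedule_on_fresh_node pod=test2`, then run `kubectl get events --watch-only=true` to look for the `ImageStreaming` event on the new node.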