
DeepSeek’s R1 launch caused quite a stir as one of the first open reasoning models. Here’s how to run a demo of it on your own GKE cluster!
We can use an Nvidia L4 (or A100 40GB) to run the 8B Llama distilled model, or an A100 80GB to run the 14B and 32B Qwen distilled models. I’ll pair this with a sample Gradio application that streams responses and handles the “</think>” marker (unique to the reasoning models) by adding a horizontal line to the output, delineating where the thinking ends and the final response begins. In your own application you may of course choose to hide the reasoning block entirely, or display it in a different manner, like the current LLM chat apps do.
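If you’re curious what that handling looks like, here’s a minimal sketch of the idea — illustrative only, not the actual source of my demo image. It assumes vLLM’s OpenAI-compatible endpoint (exposed by the vllm-service created below) plus the openai and gradio Python packages, streams the completion, and swaps the closing “</think>” marker for a Markdown horizontal rule as it yields text back to Gradio:
# sketch.py — illustrative only, not the source of wdenniss/gradio-demo
import os
from openai import OpenAI  # assumes the openai client library is installed
import gradio as gr

MODEL_ID = os.environ.get("MODEL_ID", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "9000"))

# vllm-service is the in-cluster Service defined later in this post
client = OpenAI(base_url="http://vllm-service:8000/v1", api_key="unused")

def chat(message, history):
    stream = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": message}],
        max_tokens=MAX_TOKENS,
        stream=True,
    )
    text = ""
    for chunk in stream:
        text += chunk.choices[0].delta.content or ""
        # Render the end of the reasoning block as a horizontal rule
        yield text.replace("</think>", "\n\n---\n\n")

gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)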
Deploying the 8B model on an Nvidia L4
First, create a GKE Autopilot cluster. You can leave everything at the defaults, though I recommend the Rapid release channel for the latest enhancements. us-central1 is a good choice of region due to its wide selection of GPUs.
Then you need to create a Kubernetes Secret with your Hugging Face token so vLLM can download the model. Create an access token in your Hugging Face account, then create the Secret with:
HF_TOKEN=your_api_key
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=$HF_TOKEN \
--dry-run=client -o yaml | kubectl apply -f -
Then we need two Deployments. First, the vLLM server, which I customized based on this GKE tutorial:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=10000
        - --gpu-memory-utilization=0.95
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}
And an internal service to expose it:
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
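Before wiring up the front end, you can optionally sanity-check that vLLM is serving. Here’s a quick sketch, assuming you run kubectl port-forward svc/vllm-service 8000:8000 in another terminal and have the openai client installed; it simply lists the served models:
# check_vllm.py — optional sanity check (hypothetical helper, not part of the repo)
from openai import OpenAI

# Assumes: kubectl port-forward svc/vllm-service 8000:8000 is running locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Should print the MODEL_ID from the Deployment once the model has loaded,
# e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-8B
for model in client.models.list().data:
    print(model.id)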
Then comes my custom Gradio app, a single Python file that creates a demo chat interface. The MODEL_ID here must match the one specified in the vLLM deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: MAX_TOKENS
          value: "9000"
        - name: PYTHONUNBUFFERED
          value: "1"
And a LoadBalancer Service to expose it publicly:
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  selector:
    app: gradio
  ports:
  - protocol: TCP
    port: 80
    targetPort: 7860
  type: LoadBalancer
TIP: Don’t want a public IP? Delete the “type: LoadBalancer” line from the Gradio Service definition and forward the port instead with kubectl port-forward svc/gradio-service 8080:80
Don’t want to copy and paste all that YAML? Here’s everything you need to deploy:
git clone [email protected]:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek
kubectl create -f .
To connect and share, run kubectl get svc to get the external IP address of the load balancer:
$ kubectl get svc
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
gradio-service   LoadBalancer   34.118.235.174   34.60.31.218   80:32614/TCP   17h
kubernetes       ClusterIP      34.118.224.1     <none>         443/TCP        17h
vllm-service     ClusterIP      34.118.226.44    <none>         8000/TCP       17h
Now try a query that exercises reasoning. One of my favorites is to ask for a recipe while adapting the ingredients, for example: “recipe for ANZAC biscuits, adapting the ingredients to those available in the USA”.

To clean up:
kubectl delete deploy vllm-deployment gradio-deployment
kubectl delete svc gradio-service vllm-service
Deploying the 32B model on an Nvidia A100 80GB
Want to run something bigger? The 14B and 32B Qwen models can fit in the GPU memory of an A100 80GB. Let’s go with the larger of the two!
When deploying this larger model with the same configuration as the 8B model, I encountered this error:
ERROR 02-21 16:40:07 engine.py:387] The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (32016). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Following the helpful advice in this log message, we can tweak the max model length and increase the GPU memory utilization. In the vLLM deployment, set these parameters:
- --max-model-len=32016
- --gpu-memory-utilization=0.95
Here’s the full new setup:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment-32b-a10080
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: vllm-32b-a10080
  template:
    metadata:
      labels:
        pod: vllm-32b-a10080
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=32016
        - --gpu-memory-utilization=0.95
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}
And in the Gradio deployment, reduce MAX_TOKENS so it fits within the model length:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: MAX_TOKENS
          value: "30000"
        - name: PYTHONUNBUFFERED
          value: "1"
Here’s the deployment from scratch. You’ll need to delete the previous demo to run it.
git clone [email protected]:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek-32b
kubectl create -f .
Run the demo by discovering the external IP of the load balancer:
kubectl get svc
And when you’re done, to clean up:
kubectl delete deploy vllm-deployment-32b-a10080 gradio-deployment-32b
kubectl delete svc gradio-service vllm-service
Summary
That’s the DeepSeek R1 reasoning model running on GKE! As you can see, it’s a pretty capable model, and not that hard to deploy your own version.