# Running DeepSeek open reasoning models on GKE

DeepSeek’s R1 launch caused quite a stir, as it was one of the first open reasoning models. Here’s how to run a demo of it yourself on GKE!

We can use an Nvidia L4 (or A100 40GB) to run the 8B Llama distilled model, or an A100 80GB to run the 14B and 32B Qwen distilled models. I’ll be pairing this with a [sample Gradio application](https://github.com/WilliamDenniss/autopilot-examples/tree/master/llm/gradio) that can stream responses. It handles the `<think>` block (unique to the reasoning models) by adding a horizontal line to the output to delineate where the thinking ends and the final response begins. In your own application, of course, you may choose to hide the `<think>` block entirely, or display it in a different manner, as the current LLM chat apps do.

## Deploying the 8B model on an Nvidia L4

First, create a GKE Autopilot cluster. You can leave everything as default, except that I recommend the Rapid channel for the latest enhancements. `us-central1` is a good choice of region due to its selection of GPUs.

Then you need to create a Secret with your Hugging Face token in order to download the model. You can create an access token in your [account](https://huggingface.co/settings/tokens), and create the Secret with:

```bash
HF_TOKEN=your_api_key
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token="$HF_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
```

Then we need two deployments.
Firstly, the vLLM deployment, which I customized based on this GKE tutorial:

```yaml {hl_lines=[16, 17, 22, 23, 24, 25, 26, 27, 28, 32, 33]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=10000
        - --gpu-memory-utilization=0.95
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}
```

[vllm-deploy.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/vllm-deploy.yaml)

And an internal service to expose it:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
```

[vllm-service.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/vllm-service.yaml)

Next is my [custom Gradio app](https://github.com/WilliamDenniss/autopilot-examples/tree/master/llm/gradio), a single Python file that creates a demo interface. The `MODEL_ID` here must match the one specified in the vLLM deployment.
```yaml {hl_lines=[17, 21, 22]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: MAX_TOKENS
          value: "9000"
        - name: PYTHONUNBUFFERED
          value: "1"
```

[gradio-deployment.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/gradio-deployment.yaml)

And a LoadBalancer Service to expose it:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  selector:
    app: gradio
  ports:
  - protocol: TCP
    port: 80
    targetPort: 7860
  type: LoadBalancer
```

[gradio-service.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/gradio-service.yaml)

Tip: Don’t want a public IP? Delete the `type: LoadBalancer` line in the Gradio service definition and forward the port instead with `kubectl port-forward svc/gradio-service 8080:80`.

Don’t want to download the YAML files one by one? Here’s everything you need to deploy:

```bash
git clone git@github.com:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek
kubectl create -f .
```

To connect and share, run `kubectl get svc` to get the external IP address of the load balancer.

```text {hl_lines=[3]}
$ kubectl get svc
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
gradio-service   LoadBalancer   34.118.235.174   34.60.31.218   80:32614/TCP   17h
kubernetes       ClusterIP      34.118.224.1     <none>         443/TCP        17h
vllm-service     ClusterIP      34.118.226.44    <none>         8000/TCP       17h
```

Now try a query with reasoning. One of my favorites is to ask for a recipe while adapting the ingredients, for example “recipe for ANZAC biscuits, adapting the ingredients to those available in the USA”.
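If you want to handle the thinking section in your own app, the split can be sketched in a few lines of Python. This is not the code from the Gradio demo, just a minimal illustration. It assumes a complete (non-streamed) response in which the reasoning ends with a literal `</think>` tag; the function names and the sample text are made up for the example.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split a complete model response into (thinking, answer).

    Assumption: the reasoning ends with a literal </think> tag. The opening
    <think> tag is optional, since it may not be part of the returned text.
    """
    thinking, sep, answer = text.partition(THINK_CLOSE)
    if not sep:
        # No closing tag found: treat the whole response as the answer.
        return "", text.strip()
    thinking = thinking.replace(THINK_OPEN, "", 1).strip()
    return thinking, answer.strip()


def render_markdown(text: str) -> str:
    """Join thinking and answer with a Markdown horizontal rule."""
    thinking, answer = split_reasoning(text)
    if not thinking:
        return answer
    return f"{thinking}\n\n---\n\n{answer}"


# Hypothetical response text, for illustration only.
sample = (
    "<think>\nGolden syrup is hard to find in the USA.\n</think>\n"
    "Use light corn syrup instead."
)
print(render_markdown(sample))
```

To hide the thinking entirely, display only the second element returned by `split_reasoning`.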
![](Screenshot-2025-02-26-at-1.11.41-PM.png)

To clean up:

```bash
kubectl delete deploy vllm-deployment gradio-deployment
kubectl delete svc gradio-service vllm-service
```

## Deploying the 32B model on an Nvidia A100 80GB

Want to run something bigger? The 14B and 32B Qwen models can fit in the GPU memory of an A100 80GB. Let’s go with the larger of the two!

When deploying this larger model with the same config as the 8B model, I encountered this error:

```
ERROR 02-21 16:40:07 engine.py:387] The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (32016). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```

Following the helpful advice in this log, we can lower the max model length and raise the GPU memory utilization. In the vLLM deployment, set these arguments:

```
- --max-model-len=32016
- --gpu-memory-utilization=0.95
```

Here’s the full new setup:

```yaml {hl_lines=[17, 27, 28, 33]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment-32b-a10080
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: vllm-32b-a10080
  template:
    metadata:
      labels:
        pod: vllm-32b-a10080
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=32016
        - --gpu-memory-utilization=0.95
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}
```
[vllm-deploy-32b-a10080.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek-32b/vllm-deploy-32b-a10080.yaml)

And in the Gradio deployment, reduce `MAX_TOKENS` so that it stays within the model length:

```yaml {hl_lines=[22, 23, 24]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: MAX_TOKENS
          value: "30000"
        - name: PYTHONUNBUFFERED
          value: "1"
```

[gradio-deployment-32b.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek-32b/gradio-deployment-32b.yaml)

Here’s the deployment from scratch. You’ll need to delete the previous demo before running it.

```bash
git clone git@github.com:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek-32b
kubectl create -f .
```

Run the demo by looking up the external IP:

```bash
kubectl get svc
```

And when you’re done, to clean up:

```bash
kubectl delete deploy vllm-deployment-32b-a10080 gradio-deployment-32b
kubectl delete svc gradio-service vllm-service
```

## Summary

That’s the DeepSeek R1 reasoning model running on GKE! As you can see, it’s a pretty capable model, and it’s not that hard to deploy your own version.