# Running DeepSeek open reasoning models on GKE


DeepSeek’s R1 open model launch caused quite a stir with one of the first open reasoning models. Here’s how to run a demo of it locally on GKE!

We can use an Nvidia L4 (or A100 40GB) to run the 8B Llama distilled model, or a A100 80GB to run the 14 and 32B Quen distilled models. I’ll be pairing this with a [sample Gradio application](https://github.com/WilliamDenniss/autopilot-examples/tree/master/llm/gradio) that can stream responses, and handles the “\</think\>” block (unique to the reasoning models) by adding a horizontal line to the output to delineate where the thinking ends, and the final response begins. In your own application of course, you may choose to hide the “\</think\>” block entirely, or display it in a different manner, like the current LLM chat apps do.

## Deploying the 8B model on an Nvidia L4

First, create a GKE Autopilot cluster. You can leave everything as default, except I recommend the Rapid channel for the latest enhancements. `us-central1` is a good choice of region due to its selection of GPUs.

Then you need to create a Secret with your hugging face token in order to download the data. You can create an access token in your [account](https://huggingface.co/settings/tokens), and create with:

```bash
HF_TOKEN=your_api_key
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=$HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -
```

Then we need two deployments. Firstly the vLLM which I customized based on this GKE tutorial:

```yaml {hl_lines=[16, 17, 22, 23, 24, 25, 26, 27, 28, 32, 33]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"        
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=10000
        - --gpu-memory-utilization=0.95
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}
```

[vllm-deploy.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/vllm-deploy.yaml)

And an internal service to expose it:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```

[vllm-service.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/vllm-service.yaml)

Then my [custom Gradio app](https://github.com/WilliamDenniss/autopilot-examples/tree/master/llm/gradio) consisting of a single python file to create a demo interface. The MODEL_ID here must match that specified in the vLLM deployment.

```yaml {hl_lines=[17, 21, 22]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: MAX_TOKENS
          value: "9000"
        - name: PYTHONUNBUFFERED
          value: "1"
```

[gradio-deployment.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/gradio-deployment.yaml)

And a LoadBalancer

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  selector:
    app: gradio
  ports:
    - protocol: TCP
      port: 80
      targetPort: 7860
  type: LoadBalancer
```

[gradio-service.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek/gradio-service.yaml)

Tip: Don’t want a public IP? Delete the “type: LoadBalancer” in the gradio service definition and forward the port instead with `kubectl port-forward svc/gradio-service 8080:80`

Don’t want to download YAML? Here’s everything you need to deploy:

```bash
git clone git@github.com:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek
kubectl create -f .
```

To connect & share, `kubectl get svc` to get the external IP address of the load balancer.

```text {hl_lines=[3]}
$ kubectl get svc
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
gradio-service   LoadBalancer   34.118.235.174   34.60.31.218   80:32614/TCP   17h
kubernetes       ClusterIP      34.118.224.1     <none>         443/TCP        17h
vllm-service     ClusterIP      34.118.226.44    <none>         8000/TCP       17h
```

Now try a query with reasoning. One of my favorites is to ask for a recipe, while adapting the ingredients, for example “recipe for ANZAC biscuits, adapting the ingredients to those available in the USA”.

![](Screenshot-2025-02-26-at-1.11.41-PM.png)

To clean up:

```bash
kubectl delete deploy vllm-deployment gradio-deployment 
kubectl delete svc gradio-service vllm-service
```

## Deploying the 32B model on an Nvidia A100 80GB

Want to run something bigger? The 14B and 32B Qwen models can fit in the GPU memory of an A100 80GB. Let’s go with the larger of the two!

When deploying this larger model but using the same config as 8B, I encountered this error:

```
ERROR 02-21 16:40:07 engine.py:387] The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (32016). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```

Following the helpful advice from this log, we can tweak the max model length and increase the gpu memory utilization. In the vllm deployment, add these parameters:

```
        - --max-model-len=32016
        - --gpu-memory-utilization=0.95
```

Here’s the full new setup:

```yaml {hl_lines=[17, 27, 28, 33]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment-32b-a10080
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: vllm-32b-a10080
  template:
    metadata:
      labels:
        pod: vllm-32b-a10080
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=32016
        - --gpu-memory-utilization=0.95
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}
```

[vllm-deploy-32b-a10080.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek-32b/vllm-deploy-32b-a10080.yaml)

and in the gradio deployment, reducing the MAX_TOKENS to be inside the model length:

```yaml {hl_lines=[22, 23, 24]}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: MAX_TOKENS
          value: "30000"
        - name: PYTHONUNBUFFERED
          value: "1"
```

[gradio-deployment-32b.yaml](https://github.com/WilliamDenniss/autopilot-examples/blob/master/llm/deepseek-32b/gradio-deployment-32b.yaml)

Here’s the deployment from scratch. You’ll need to delete the previous demo to run it.

```php
git clone git@github.com:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek-32b
kubectl create -f .
```

Run the demo by discovering the IP

```bash
kubectl get svc
```

And when you’re done, to clean up:

```bash
kubectl delete deploy vllm-deployment-32b-a10080 gradio-deployment-32b
kubectl delete svc gradio-service vllm-service
```

## Summary

That’s the deepseek R1 reasoning model running on GKE! As you can see, it’s a pretty capable model, and not that hard to deploy your own version.