Running DeepSeek open reasoning models on GKE

DeepSeek’s R1 launch caused quite a stir as one of the first open reasoning models. Here’s how to run a demo of it yourself on GKE!

We can use an Nvidia L4 (or A100 40GB) to run the 8B Llama distilled model, or an A100 80GB to run the 14B and 32B Qwen distilled models. I’ll be pairing this with a sample Gradio application that streams responses and handles the “</think>” tag (unique to the reasoning models) by adding a horizontal line to the output, delineating where the thinking ends and the final response begins. In your own application you may of course choose to hide the thinking block entirely, or display it in a different manner, as the current LLM chat apps do.

Deploying the 8B model on an Nvidia L4

First, create a GKE Autopilot cluster. You can leave everything at the defaults, though I recommend the Rapid release channel for the latest enhancements. us-central1 is a good choice of region due to its selection of GPUs.
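
If you prefer the command line, a suitable cluster can be created with something like the following (the cluster name here is just a placeholder; pick your own):

gcloud container clusters create-auto deepseek-demo \
    --location=us-central1 \
    --release-channel=rapid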

Then you need to create a Secret with your Hugging Face token in order to download the model weights. You can create an access token in your Hugging Face account settings, then create the Secret with:

HF_TOKEN=your_api_key
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=$HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -
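
You can confirm the Secret exists before continuing:

kubectl get secret hf-secret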

Then we need two Deployments. First, the vLLM server, which I customized based on this GKE tutorial:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"        
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=10000
        - --gpu-memory-utilization=0.95
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}

vllm-deploy.yaml

And an internal service to expose it:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

vllm-service.yaml

Then deploy my custom Gradio app, a single Python file that provides the demo interface. The MODEL_ID here must match the one specified in the vLLM deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        - name: MAX_TOKENS
          value: "9000"
        - name: PYTHONUNBUFFERED
          value: "1"

gradio-deployment.yaml

And a LoadBalancer Service to expose it:

apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  selector:
    app: gradio
  ports:
    - protocol: TCP
      port: 80
      targetPort: 7860
  type: LoadBalancer

gradio-service.yaml

TIP: Don’t want a public IP? Delete the “type: LoadBalancer” line from the gradio Service definition and forward the port instead with kubectl port-forward svc/gradio-service 8080:80, then browse to http://localhost:8080.

Don’t want to copy the YAML by hand? Here’s everything you need to deploy:

git clone git@github.com:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek
kubectl create -f .
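
The vLLM pod first needs a GPU node to be provisioned and then has to download the model weights, so it can take a few minutes to become ready. You can watch progress and follow the server logs with:

kubectl get pods -w
kubectl logs -f deploy/vllm-deployment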

To connect and share the demo, run kubectl get svc to get the external IP address of the load balancer.

$ kubectl get svc
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
gradio-service   LoadBalancer   34.118.235.174   34.60.31.218   80:32614/TCP   17h
kubernetes       ClusterIP      34.118.224.1     <none>         443/TCP        17h
vllm-service     ClusterIP      34.118.226.44    <none>         8000/TCP       17h

Now try a query with reasoning. One of my favorites is to ask for a recipe while adapting the ingredients, for example: “recipe for ANZAC biscuits, adapting the ingredients to those available in the USA”.
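
You can also query the vLLM server’s OpenAI-compatible API directly, bypassing the UI. As a quick sketch (the max_tokens value here is arbitrary), port-forward the internal service and send a chat completion request:

kubectl port-forward svc/vllm-service 8000:8000

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
      "messages": [{"role": "user", "content": "recipe for ANZAC biscuits, adapting the ingredients to those available in the USA"}],
      "max_tokens": 2000
    }'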

To clean up:

kubectl delete deploy vllm-deployment gradio-deployment 
kubectl delete svc gradio-service vllm-service

Deploying the 32B model on an Nvidia A100 80GB

Want to run something bigger? The 14B and 32B Qwen models can fit in the GPU memory of an A100 80GB. Let’s go with the larger of the two!

When deploying this larger model with the same config as the 8B, I encountered this error:

ERROR 02-21 16:40:07 engine.py:387] The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (32016). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Following the helpful advice from this log, we can tweak the max model length to fit within the available KV cache while keeping GPU memory utilization high. In the vLLM deployment, set these parameters:

        - --max-model-len=32016
        - --gpu-memory-utilization=0.95

Here’s the full new setup:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment-32b-a10080
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: vllm-32b-a10080
  template:
    metadata:
      labels:
        pod: vllm-32b-a10080
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-gpu-driver-version: latest
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:v0.7.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --trust-remote-code
        - --max-model-len=32016
        - --gpu-memory-utilization=0.95
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        emptyDir: {}

vllm-deploy-32b-a10080.yaml

And in the Gradio deployment, set MAX_TOKENS to stay within the model length:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio-container
        image: wdenniss/gradio-demo:latest
        ports:
        - containerPort: 7860
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: MAX_TOKENS
          value: "30000"
        - name: PYTHONUNBUFFERED
          value: "1"

gradio-deployment-32b.yaml

Here’s the deployment from scratch. You’ll need to delete the previous demo before running it.

git clone git@github.com:WilliamDenniss/autopilot-examples.git
cd autopilot-examples/llm/deepseek-32b
kubectl create -f .

Run the demo by discovering the external IP:

kubectl get svc

And when you’re done, to clean up:

kubectl delete deploy vllm-deployment-32b-a10080 gradio-deployment-32b
kubectl delete svc gradio-service vllm-service

Summary

That’s the DeepSeek R1 reasoning model running on GKE! As you can see, it’s a pretty capable model, and it’s not that hard to deploy your own version.