RunPod Setup

Setup

Create a RunPod account

Create a Serverless endpoint

Go to Serverless → New Endpoint. Configure:

Container image: Use the vLLM OpenAI-compatible server image
GPU type: A100 80GB or H100 recommended (32B model requires ~40GB VRAM)
Model: Set the model path to the Mako-32B Conductor weights
Max workers: Set based on expected concurrent load
Idle timeout: Controls how quickly workers scale down (affects cold start frequency)
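
The VRAM figure above depends on the weight precision, which this page does not state. As a rough sanity check when choosing a GPU, the weights-only footprint can be estimated from the parameter count. The sketch below is a back-of-the-envelope calculation, not a measurement of this model: it ignores the KV cache, activations, and runtime overhead, all of which add to the total, and the helper name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes KV cache and overhead."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumption: a 32-billion-parameter model at three common weight precisions.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(32, bits):.0f} GB")
```

At 16-bit precision the weights alone come to 64 GB, so confirm which precision or quantization the deployed weights use before settling on a GPU type.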

Get your credentials

Endpoint ID: Found on the endpoint dashboard page
API Key: Generate under Settings → API Keys

Configure the gateway

Add these to your gateway’s .env file:

RUNPOD_ENDPOINT_ID=your_endpoint_id
RUNPOD_API_KEY=your_runpod_api_key
MODEL_NAME=DeepMako/Mako-32B-Conductor

When both RUNPOD_ENDPOINT_ID and RUNPOD_API_KEY are set, the gateway routes model: "conductor" requests to RunPod instead of the local Ollama instance.
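
The routing rule can be pictured as a small decision function. This is a minimal sketch of the behavior described above, not the gateway's actual code: the function name pick_backend and the return values are hypothetical, and sending every other model name to Ollama is an assumption.

```python
import os


def pick_backend(model: str, env=os.environ) -> str:
    """Return "runpod" or "ollama" for a request (hypothetical helper)."""
    # RunPod is only used when BOTH variables are set to non-empty values.
    runpod_ready = bool(env.get("RUNPOD_ENDPOINT_ID")) and bool(env.get("RUNPOD_API_KEY"))
    if model == "conductor" and runpod_ready:
        return "runpod"
    # Assumption: anything else stays on the local Ollama instance.
    return "ollama"
```

With only one of the two variables set, pick_backend("conductor") falls back to "ollama", which matches the "both are set" condition above.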

How the gateway communicates with RunPod

The gateway uses RunPod’s synchronous endpoint (/runsync):

Submit: POST the model input to https://api.runpod.ai/v2/{endpoint_id}/runsync

Poll: If the response status is IN_QUEUE or IN_PROGRESS, the gateway polls /status/{job_id} on the same endpoint every 5 seconds until the job finishes

Timeout: The gateway waits at most 6 minutes, which allows for cold starts

Retry: If the model returns an empty output error (common with tool-calling edge cases), the gateway retries the request without tool definitions
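
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the gateway's implementation: the function names are hypothetical, the payload is assumed to carry tool definitions under a "tools" key, and an "empty output error" is approximated as a missing or empty output field. The transport, sleep, and clock are injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.runpod.ai/v2"
POLL_INTERVAL_S = 5          # poll every 5 seconds
MAX_WAIT_S = 6 * 60          # give up after 6 minutes
PENDING = {"IN_QUEUE", "IN_PROGRESS"}


def http_json(method, url, api_key, body=None):
    """Send one JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=MAX_WAIT_S) as resp:
        return json.load(resp)


def run_job(endpoint_id, api_key, payload,
            send=http_json, sleep=time.sleep, clock=time.monotonic):
    """Submit to /runsync, then poll /status/{job_id} until done or timed out."""
    deadline = clock() + MAX_WAIT_S
    job = send("POST", f"{BASE_URL}/{endpoint_id}/runsync", api_key, {"input": payload})
    while job.get("status") in PENDING:
        if clock() >= deadline:
            raise TimeoutError("RunPod job did not finish within 6 minutes")
        sleep(POLL_INTERVAL_S)
        job = send("GET", f"{BASE_URL}/{endpoint_id}/status/{job['id']}", api_key)
    return job


def run_with_tool_fallback(endpoint_id, api_key, payload, **kwargs):
    """Retry once without tool definitions if the first attempt has no output."""
    job = run_job(endpoint_id, api_key, payload, **kwargs)
    # Assumption: "empty output" means the output field is missing or empty.
    if not job.get("output") and "tools" in payload:
        stripped = {k: v for k, v in payload.items() if k != "tools"}
        job = run_job(endpoint_id, api_key, stripped, **kwargs)
    return job
```

Checking the deadline before each sleep means a job that is still queued at the 6-minute mark fails with a clear error instead of polling indefinitely.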

Cold start considerations

The first request after an idle period may take 30–90 seconds while a GPU worker spins up

Subsequent requests to a warm worker typically complete in 3–10 seconds

The gateway returns a 503 status during cold starts, allowing frontends to display a loading indicator

Set a longer idle timeout on RunPod to reduce cold start frequency (at the cost of higher GPU spend)
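
On the client side, the 503 response can drive a simple "warming up" state. The sketch below is one possible way to handle it, not something the gateway prescribes: the helper name, the retry delay, and the attempt limit are all assumptions, and send_request stands in for whatever function the frontend uses to call the gateway and return a (status, body) pair.

```python
import time


def call_with_warmup(send_request, on_warming=None,
                     retry_delay_s=5.0, max_attempts=20, sleep=time.sleep):
    """Call send_request() until it stops returning 503 or attempts run out."""
    status, body = None, None
    for attempt in range(1, max_attempts + 1):
        status, body = send_request()
        if status != 503:
            return status, body
        # 503 is treated as "worker is cold": let the UI show a loading state.
        if on_warming is not None:
            on_warming(attempt)
        if attempt < max_attempts:
            sleep(retry_delay_s)
    return status, body
```

A frontend could pass an on_warming callback that switches the interface to a loading indicator and clear it once a non-503 status comes back.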
