Setup
Create a RunPod account
Sign up at runpod.io and add credits.
Create a Serverless endpoint
Go to Serverless → New Endpoint. Configure:
- Container image: Use the vLLM OpenAI-compatible server image
- GPU type: A100 80GB or H100 recommended (32B model requires ~40GB VRAM)
- Model: Set the model path to the Mako-32B Conductor weights
- Max workers: Set based on expected concurrent load
- Idle timeout: Controls how quickly workers scale down (affects cold start frequency)
Get your credentials
- Endpoint ID: Found on the endpoint dashboard page
- API Key: Generate under Settings → API Keys
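A minimal sketch of how these two credentials might be loaded and turned into request settings. The environment variable names RUNPOD_ENDPOINT_ID and RUNPOD_API_KEY and the helper load_runpod_config are assumptions for illustration, not names the gateway is known to use; the bearer-token header follows RunPod's documented API authentication.

```python
import os

RUNPOD_API_BASE = "https://api.runpod.ai/v2"


def load_runpod_config(env=os.environ):
    """Build RunPod request settings from the environment.

    The variable names below are illustrative assumptions.
    """
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    api_key = env.get("RUNPOD_API_KEY", "").strip()
    if not endpoint_id or not api_key:
        raise RuntimeError("RUNPOD_ENDPOINT_ID and RUNPOD_API_KEY must both be set")
    return {
        "runsync_url": f"{RUNPOD_API_BASE}/{endpoint_id}/runsync",
        "status_url": f"{RUNPOD_API_BASE}/{endpoint_id}/status",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    }
```

Keeping the key in the environment rather than in source keeps it out of version control, and failing fast on a missing value surfaces misconfiguration before the first request.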
How the gateway communicates with RunPod
The gateway uses RunPod’s synchronous endpoint (/runsync):
- Submit: POST to https://api.runpod.ai/v2/{endpoint_id}/runsync with the model input
- Poll: If the response is IN_QUEUE or IN_PROGRESS, the gateway polls /status/{job_id} every 5 seconds
- Timeout: Maximum wait of 6 minutes for cold starts
- Retry: If the model returns an empty output error (common with tool-calling edge cases), the gateway retries without tool definitions
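The four steps above can be sketched roughly as follows. This is not the gateway's actual code: the function names, the 120-second HTTP timeout, the "tools" key in the input payload, and the rule for detecting an empty output are all assumptions made for illustration. The URLs, the 5-second poll interval, and the 6-minute ceiling come from the description above.

```python
import json
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"
POLL_INTERVAL_S = 5       # poll cadence described above
MAX_WAIT_S = 6 * 60       # overall ceiling described above
HTTP_TIMEOUT_S = 120      # assumed per-request socket timeout
PENDING = {"IN_QUEUE", "IN_PROGRESS"}


def http_json(method, url, api_key, body=None):
    """Send one JSON request to RunPod and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT_S) as resp:
        return json.load(resp)


def run_job(endpoint_id, api_key, model_input,
            send=http_json, sleep=time.sleep, clock=time.monotonic):
    """Submit to /runsync, then poll /status/{job_id} while the job is pending."""
    deadline = clock() + MAX_WAIT_S
    job = send("POST", f"{API_BASE}/{endpoint_id}/runsync", api_key,
               {"input": model_input})
    while job.get("status") in PENDING:
        if clock() >= deadline:
            raise TimeoutError("RunPod job still pending after 6 minutes")
        sleep(POLL_INTERVAL_S)
        job = send("GET", f"{API_BASE}/{endpoint_id}/status/{job['id']}", api_key)
    return job


def is_empty_output(job):
    """Assumed heuristic: a finished job with no output counts as empty."""
    return not job.get("output")


def run_with_tool_fallback(endpoint_id, api_key, model_input, **kwargs):
    """Retry once without tool definitions if the first attempt is empty."""
    job = run_job(endpoint_id, api_key, model_input, **kwargs)
    if is_empty_output(job) and "tools" in model_input:
        # "tools" is a hypothetical key for the tool definitions.
        stripped = {k: v for k, v in model_input.items() if k != "tools"}
        job = run_job(endpoint_id, api_key, stripped, **kwargs)
    return job
```

The send, sleep, and clock parameters are injectable so the polling logic can be exercised without network access or real delays.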
Cold start considerations
- First request after idle may take 30–90 seconds while a GPU worker spins up
- Subsequent requests to a warm worker typically complete in 3–10 seconds
- The gateway returns a 503 status during cold starts, allowing frontends to display a loading indicator
- Set a longer idle timeout on RunPod to reduce cold start frequency (at the cost of higher GPU spend)
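On the client side, the 503 behavior might be handled with a small retry loop like the sketch below. The helper call_with_warmup, its send_request callable (assumed to return a status code and body), and the 5-second retry interval are hypothetical; the 6-minute default mirrors the gateway's own maximum wait.

```python
import time


def call_with_warmup(send_request, max_wait_s=360, retry_every_s=5,
                     on_waiting=None, sleep=time.sleep, clock=time.monotonic):
    """Call send_request() until it stops returning 503 or time runs out.

    send_request is a hypothetical callable returning (status_code, body).
    on_waiting, if given, runs before each retry (e.g. to show a loader).
    """
    deadline = clock() + max_wait_s
    while True:
        status, body = send_request()
        if status != 503:
            return status, body
        if clock() >= deadline:
            raise TimeoutError("model is still warming up")
        if on_waiting:
            on_waiting()
        sleep(retry_every_s)
```

Only 503 is treated as retryable here; any other status, including other errors, is returned to the caller immediately.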