The Mako-32B Conductor model needs GPU infrastructure to run. RunPod Serverless provides autoscaling GPU workers, and the gateway communicates with them through RunPod's REST API.

Setup

1. Create a RunPod account

Sign up at runpod.io and add credits.
2. Create a Serverless endpoint

Go to Serverless → New Endpoint. Configure:
  • Container image: Use the vLLM OpenAI-compatible server image
  • GPU type: A100 80GB or H100 recommended (32B model requires ~40GB VRAM)
  • Model: Set the model path to the Mako-32B Conductor weights
  • Max workers: Set based on expected concurrent load
  • Idle timeout: Controls how quickly workers scale down (affects cold start frequency)
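The ~40GB figure above is not broken down in this guide. As a rough cross-check, the sketch below estimates the memory needed for the weights alone at a few precisions. It assumes "32B" means about 32 billion parameters and ignores KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "32B" means roughly 32 billion parameters.
PARAMS = 32e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label} weights: {weight_memory_gb(PARAMS, bits):.0f} GB")
```

Under these assumptions, 16-bit weights still fit on a single 80GB card. A total near 40GB would correspond to roughly 8-bit weights plus overhead, but that reading is an assumption, not something the guide states.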
3. Get your credentials

  • Endpoint ID: Found on the endpoint dashboard page
  • API Key: Generate under Settings → API Keys
4. Configure the gateway

Add these to your gateway’s .env file:
RUNPOD_ENDPOINT_ID=your_endpoint_id
RUNPOD_API_KEY=your_runpod_api_key
MODEL_NAME=DeepMako/Mako-32B-Conductor
When both RUNPOD_ENDPOINT_ID and RUNPOD_API_KEY are set, the gateway routes model: "conductor" requests to RunPod instead of the local Ollama instance.
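The routing rule can be sketched as a small check on the environment. This is an illustration only, not the gateway's actual code: the function name `select_backend` and the backend labels are hypothetical, and treating blank values as unset is an assumption.

```python
import os


def select_backend(env):
    """Return "runpod" when both credentials are set, otherwise "ollama"."""
    # Assumption: whitespace-only values count as unset.
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    api_key = env.get("RUNPOD_API_KEY", "").strip()
    return "runpod" if endpoint_id and api_key else "ollama"


if __name__ == "__main__":
    # Reads the variables from the current process environment.
    print(select_backend(os.environ))
```

Setting only one of the two variables leaves requests on the local Ollama instance, which is an easy misconfiguration to check for first when requests are not reaching RunPod.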

How the gateway communicates with RunPod

The gateway uses RunPod’s synchronous endpoint (/runsync):
  1. Submit — POST to https://api.runpod.ai/v2/{endpoint_id}/runsync with the model input
  2. Poll — If the response is IN_QUEUE or IN_PROGRESS, the gateway polls /status/{job_id} every 5 seconds
  3. Timeout — Maximum wait of 6 minutes for cold starts
  4. Retry — If the model returns an empty output error (common with tool-calling edge cases), the gateway retries without tool definitions
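The four steps above can be sketched as a small client. This is a minimal illustration under stated assumptions, not the gateway's implementation: the function names are hypothetical, the request body is assumed to be wrapped in an `input` field with a bearer-token header, the tool definitions are assumed to live under a `tools` key, and an "empty output" is assumed to appear as a missing or empty `output` field.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.runpod.ai/v2"
POLL_INTERVAL_S = 5      # polling cadence from step 2
MAX_WAIT_S = 6 * 60      # overall ceiling from step 3
PENDING = {"IN_QUEUE", "IN_PROGRESS"}


def http_json(method, url, api_key, payload=None):
    """Send one JSON request with a bearer token and decode the JSON reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    # Assumption: 120 s per HTTP call is enough for a single /runsync reply.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode())


def run_job(endpoint_id, api_key, model_input, http=http_json,
            sleep=time.sleep, clock=time.monotonic):
    """Submit to /runsync, then poll /status/{job_id} until done or timed out."""
    deadline = clock() + MAX_WAIT_S
    job = http("POST", f"{BASE_URL}/{endpoint_id}/runsync", api_key,
               {"input": model_input})
    while job.get("status") in PENDING:
        if clock() >= deadline:
            raise TimeoutError("RunPod job did not finish within 6 minutes")
        sleep(POLL_INTERVAL_S)
        job = http("GET", f"{BASE_URL}/{endpoint_id}/status/{job['id']}",
                   api_key)
    return job


def run_with_tool_fallback(endpoint_id, api_key, model_input, **kwargs):
    """Retry once without tool definitions if the first output is empty."""
    job = run_job(endpoint_id, api_key, model_input, **kwargs)
    # Assumption: an empty-output error shows up as a missing/empty "output".
    if not job.get("output") and "tools" in model_input:
        stripped = {k: v for k, v in model_input.items() if k != "tools"}
        job = run_job(endpoint_id, api_key, stripped, **kwargs)
    return job
```

The `http`, `sleep`, and `clock` parameters are injected so the polling and timeout logic can be exercised without real network calls or real waiting.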

Cold start considerations

  • First request after idle may take 30–90 seconds while a GPU worker spins up
  • Subsequent requests to a warm worker typically complete in 3–10 seconds
  • The gateway returns a 503 status during cold starts, allowing frontends to display a loading indicator
  • Set a longer idle timeout on RunPod to reduce cold start frequency (at the cost of higher GPU spend)
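A frontend or client can treat the 503 as a "warming up" signal and retry. The sketch below shows one way to do that. It is an assumption-laden illustration: `call_with_cold_start_retry`, the `send` callable (which should return a `(status, body)` pair from the gateway), and the retry count and delay are all hypothetical and not part of the gateway.

```python
import time


def call_with_cold_start_retry(send, max_attempts=10, delay_s=10.0,
                               sleep=time.sleep, on_waiting=None):
    """Call send() until it returns something other than 503.

    send        -- callable returning (status_code, body)
    on_waiting  -- optional hook, e.g. to show a loading indicator
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503:
            return status, body
        if on_waiting is not None:
            on_waiting(attempt)          # worker is still spinning up
        if attempt < max_attempts:
            sleep(delay_s)
    # Still cold after all attempts: hand the last 503 back to the caller.
    return status, body
```

With the defaults, the client keeps retrying for roughly 90 seconds, which matches the upper end of the cold start range quoted above; tune both values to your own endpoint.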