Overview
| Property | Value |
|---|---|
| Base model | Qwen3-32B |
| Parameters | 32 billion |
| Streaming | Not supported |
| Backend | RunPod Serverless (A100/H100) |
| Request format | model: "conductor", stream: false |
Inference flow
When a request arrives with model: "conductor" and stream: false, the gateway routes it to RunPod Serverless:
- The gateway calls RunPod’s /runsync endpoint with the full message history, system prompt, and tool definitions.
- If the worker is cold, RunPod returns IN_QUEUE. The gateway polls /status/{jobId} every 5 seconds for up to 6 minutes.
- Once the model responds, the gateway checks for tool calls and enters the agentic loop (up to 8 rounds).
- If the model returns an empty output error, the gateway retries the request without tool definitions as a fallback.
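The polling step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: `wait_for_job` and `fetch_status` are hypothetical names, `fetch_status` stands in for whatever wraps the `/status/{jobId}` call, and treating `IN_PROGRESS` as a second pending state is an assumption based on RunPod's job states rather than something the text above states.

```python
import time
from typing import Any, Callable, Dict

POLL_INTERVAL_S = 5      # from the text: poll every 5 seconds
POLL_TIMEOUT_S = 6 * 60  # from the text: give up after 6 minutes


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = POLL_INTERVAL_S,
    timeout_s: float = POLL_TIMEOUT_S,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job is no longer pending or the time budget runs out.

    fetch_status is a hypothetical callable that wraps the
    /status/{jobId} request and returns the decoded JSON body.
    """
    waited = 0.0
    while True:
        body = fetch_status()
        # Assumption: anything other than these two states is terminal
        # (completed, failed, cancelled, ...), so hand it back to the caller.
        if body.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
            return body
        if waited >= timeout_s:
            raise TimeoutError(f"job still pending after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays; the caller decides what to do with a terminal status, such as entering the tool-call loop or retrying without tool definitions.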
Temperature defaults
The Conductor uses conservative sampling to ensure reliable tool calling:
- Temperature: 0.4
- Top-p: 0.9
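For illustration, a request body that spells out these defaults might be built as below. This is a sketch under assumptions: the field names `messages`, `temperature`, and `top_p` follow the common OpenAI-style chat format, and the text above does not confirm that the gateway accepts them or allows overriding the defaults. `build_conductor_request` is a hypothetical helper.

```python
from typing import Any, Dict, List

DEFAULT_TEMPERATURE = 0.4  # documented Conductor default
DEFAULT_TOP_P = 0.9        # documented Conductor default


def build_conductor_request(
    messages: List[Dict[str, str]],
    temperature: float = DEFAULT_TEMPERATURE,
    top_p: float = DEFAULT_TOP_P,
) -> Dict[str, Any]:
    """Build a non-streaming Conductor request body.

    Assumption: the gateway reads OpenAI-style "temperature" and "top_p"
    fields; check the gateway's request schema before relying on this.
    """
    return {
        "model": "conductor",
        "stream": False,  # the Conductor does not support streaming
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }
```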
Cold starts
RunPod Serverless workers may take 30–90 seconds to cold start. The gateway handles this automatically with polling. If you’re building a frontend, a 503 status code from the gateway indicates a cold start; display a loading state and retry.
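A client-side retry for the 503 case could look something like this. It is a sketch, not a prescribed client: `send_with_cold_start_retry`, `send`, and `on_cold_start` are hypothetical names, and the attempt count and delay are assumptions chosen so that the total wait comfortably covers a 30 to 90 second cold start.

```python
import time
from typing import Callable, Optional, Tuple


def send_with_cold_start_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 12,   # assumption: 11 waits of 10 s, about 110 s in total
    delay_s: float = 10.0,    # assumption: pause between retries
    sleep: Callable[[float], None] = time.sleep,
    on_cold_start: Optional[Callable[[int], None]] = None,
) -> Tuple[int, str]:
    """Call send() until it stops returning 503 or attempts run out.

    send is a hypothetical callable that performs the HTTP request and
    returns (status_code, body). on_cold_start is invoked on each 503 so
    a frontend can show or update its loading state.
    """
    status, body = 503, ""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503:
            return status, body
        if on_cold_start is not None:
            on_cold_start(attempt)
        if attempt < max_attempts:
            sleep(delay_s)
    # Still cold after every attempt: return the last 503 to the caller.
    return status, body
```

Only 503 is retried here; other error codes are returned immediately so they are not mistaken for a cold start.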
Limitations
- No streaming. If you request model: "conductor" with stream: true, the gateway silently routes to the 8B Operator model instead.
- Latency. Expect 3–15 seconds per response depending on tool chain complexity, plus cold start time if the worker is idle.
- Context window. The full system prompt, knowledge context, tool definitions, and conversation history all count against the context window.
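Because a streaming request is rerouted without any error, a client that specifically needs the Conductor may want to pin the stream flag before sending. The helper below is a hypothetical client-side guard, not part of the gateway; it only relies on the `model` and `stream` fields named in this section.

```python
from typing import Any, Dict


def force_conductor_non_streaming(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of payload that will not be rerouted to the Operator.

    Requests for other models are copied unchanged. For "conductor",
    stream is set to False explicitly, since stream: true would be
    silently served by the 8B Operator model instead.
    """
    fixed = dict(payload)  # shallow copy so the caller's dict is not mutated
    if fixed.get("model") == "conductor":
        fixed["stream"] = False
    return fixed
```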