Router service
The router is a stateless FastAPI service. It speaks the OpenAI Chat Completions wire format to clients and translates to the appropriate backend protocol per request.
Why FastAPI
- Native async + Server-Sent Events, which we need for streaming.
- Pydantic gives us strict validation of the request schema for free.
- The Anthropic and OpenAI Python SDKs are async and integrate cleanly.
API surface
We expose the OpenAI Chat Completions endpoint (so OpenAI clients drop in
by changing base_url + api_key) and the Anthropic Messages endpoint
(so Anthropic-native clients like Claude Code drop in by setting
ANTHROPIC_BASE_URL + ANTHROPIC_AUTH_TOKEN). Both ingresses share the
same classify → route → audit core; see
anthropic-ingress.md.
| Method | Path | Notes |
|---|---|---|
| POST | /v1/chat/completions |
OpenAI ingress. SSE if stream=true. |
| POST | /v1/messages |
Anthropic ingress. SSE if stream=true. Claude path is a native pass-through (keeps thinking / caching); ScalarLM path translates. Classifies over the whole payload. |
| POST | /v1/messages/count_tokens |
Local token estimate — never forwarded to Anthropic (would leak novel content). |
| GET | /v1/models |
Lists router-auto plus the underlying backend ids. |
| GET | /v1/audit/export |
NDJSON audit export (token-scoped). See orbital-traces.md. |
| GET | /healthz |
Liveness — process is up. |
| GET | /readyz |
Readiness — process is up and the token store has loaded. |
Clients pick a model by name:
router-auto— run the classifier and route automatically (default).claude— force Claude, if the classifier confirms the prompt is confidently general (p_novel <= threshold). Anything in the uncertain band or above is rejected with 403 — uncertain prompts are gated here for the same reason they default to ScalarLM inrouter-auto: the IP invariant.scalarlm— force ScalarLM regardless of classifier output.
Only claude and scalarlm are treated as explicit backend overrides.
Any other model value — an arbitrary name a downstream OpenAI client
sends (e.g. nvidia/Gemma-4-31B-IT-NVFP4), or an omitted field — is
treated as router-auto rather than rejected, so off-the-shelf clients
work without reconfiguration. The raw requested name is still recorded in
the audit log's request_model.
The forced names exist so that evaluation pipelines can collect ground-truth A/B data without the classifier in the loop.
Request lifecycle
async def chat_completions(req: ChatRequest) -> Response:
decision = await classify(req) # ~20-50ms
backend = pick_backend(req.model, decision)
if req.stream:
return StreamingResponse(
stream_from(backend, req),
media_type="text/event-stream",
)
return JSONResponse(await call(backend, req))
Classifier call
The two ingresses classify at different granularities:
- OpenAI ingress (
/v1/chat/completions) classifies the last user message (translate.extract_last_user_message). - Anthropic ingress (
/v1/messages) classifies the whole payload — every user text block and everytool_result, across all turns (translate.user_content_spans) — one classify call per span, and routes on the maxp_novel. Agentic clients (Claude Code) carry proprietary content in tool results and earlier turns, not just the last message, so a benign last turn must not be able to ship the codebase to Claude.
Either way the system prompt and tool definitions are omitted (they bias the classifier toward "novel" and are client boilerplate); each span is capped at 8000 chars before the call. See anthropic-ingress.md § Classifier coverage.
Backend translation
| Source field (OpenAI) | Claude (Anthropic Messages) | ScalarLM (OpenAI-compat) |
|---|---|---|
messages[*] |
messages[*] (system separated) |
messages[*] (pass-through) |
temperature |
temperature |
temperature |
max_tokens |
max_tokens |
max_tokens |
stop |
stop_sequences |
stop |
tools |
tools (parameters → input_schema) |
tools (pass-through) |
tool_choice |
tool_choice (required→any, named→tool, none→tools omitted) |
tool_choice (pass-through) |
assistant tool_calls |
tool_use content blocks (arguments JSON string → input object) |
pass-through |
tool results |
tool_result blocks in a user turn (parallel results merged) |
pass-through |
Tool calling is translated in both directions for Claude: on the way back,
Anthropic tool_use response blocks become OpenAI tool_calls (the
input object is re-serialized to an arguments JSON string, and the
assistant content is null when the turn is tool-calls-only), with
stop_reason: tool_use → finish_reason: tool_calls. ScalarLM is
OpenAI-compatible, so tools pass through untouched.
Streaming is converted in both directions: Anthropic's event types
(message_start, content_block_delta, …) are repackaged into OpenAI
chat.completion.chunk events. Tool calls arrive in a single tool_calls
delta once the upstream turn completes (Claude reports them on the final
message), then the terminal chunk carries finish_reason: tool_calls.
Response headers
Every router response (streaming and non-streaming) carries a small set of headers that tell the client how the request was handled. These let an application correlate latency outliers with backend choice, surface routing decisions in client logs, and cross-reference the audit log without separate UI access.
| Header | Example | Meaning |
|---|---|---|
Split-Brain-Request-Id |
01931a8a-… |
UUIDv7. Same id appears in the audit log. |
Split-Brain-Backend |
claude | scalarlm |
Which backend produced the response. |
Split-Brain-Backend-Model |
claude-sonnet-4-6 or scalarlm:nvidia/Gemma-4-31B-IT-NVFP4 |
Specific model + checkpoint that served the request. On the Anthropic ingress this is the model actually used (the client's requested claude-* is honored). |
Split-Brain-Decision |
general | novel | uncertain | forced |
Routing decision. uncertain means the classifier's p_novel fell in the uncertain band and we routed to ScalarLM as the safe default (distinct from novel, where we were confident). forced if the client used model=claude / model=scalarlm and the veto passed. |
Split-Brain-Confidence |
0.92 |
p_novel from the classifier, formatted to two decimal places. |
Split-Brain-Classifier-Version |
minilm/sentence-transformers/all-MiniLM-L6-v2+head-trained |
Classifier model version that made the decision. |
Split-Brain-Classifier-Ms |
23 |
Time spent in the classifier, integer milliseconds. |
Header names use the Split-Brain-* prefix without X-
(deprecated by RFC 6648).
Streaming
For stream=true (SSE) responses, all headers are set on the
initial HTTP response before the first data: event. Clients
that only consume the event stream still work unchanged; clients
that inspect headers learn the routing decision before the first
token arrives.
Error responses
The same headers are set on error responses where they are meaningful:
Split-Brain-Request-Idis always set.Split-Brain-DecisionandSplit-Brain-Classifier-*are set if the classifier ran successfully and the failure was downstream (backend 5xx). They are omitted on classifier failures.Split-Brain-Backendis set if a backend was selected before the failure, omitted otherwise.
This makes 5xx responses tractable to debug from the client side without UI access — the request-id alone gets the operator to the audit-log entry, and the headers tell the developer whether the failure was at the classifier or beyond.
Information disclosure
These headers reveal the classifier's decision and confidence to the client. We accept this because split-brain is internal-only (see ui.md) and every caller is an authenticated org member who could read the audit log directly anyway.
If split-brain ever serves external untrusted clients, the
Split-Brain-Decision, Split-Brain-Confidence, and
Split-Brain-Classifier-Version headers should be stripped from
outbound responses — that trio is what would let an attacker
probe the classifier's decision boundary. Split-Brain-Backend
and Split-Brain-Request-Id are safe to expose to any client.
Error handling
| Condition | Status | Behavior |
|---|---|---|
| Classifier 5xx / timeout | 503 | Fail closed. Routing can't be decided; defaulting to Claude could leak proprietary content, so the request is rejected — no Claude fallback. |
| ScalarLM 5xx / timeout | 502 | Propagate. No fallback to Claude (the request went to ScalarLM precisely because it may be proprietary). |
| Claude 5xx / timeout | 502 | Propagate. No fallback to ScalarLM. |
Forced model=claude, not confidently general |
403 | IP veto (policy.IPVetoError). |
| Over daily token quota | 429 | See token-limits.md. |
| Invalid request schema | 400 | Standard OpenAI error envelope. |
| Auth failure (missing/invalid token) | 401 | Standard OpenAI error envelope. |
The router never silently fails over between backends. A request routed
to ScalarLM (novel/uncertain) is never retried on Claude — that would be the
exact IP leak the classifier exists to prevent — and a classifier outage
fails the request closed (503) rather than guessing. The only path to
Claude for a forced request is the deliberate, classifier-confirmed
forced-claude route above (or a per-token claude override — see
token-routing.md).
Authentication
Clients authenticate with a bearer token
(Authorization: Bearer sbk_<…>). Tokens are issued by users
themselves through the UI's Tokens view (see
ui.md) — the router has no admin path for
creating or distributing tokens, and there is no static API-key
file.
The router loads the active token set from a directory of
per-token JSON files on the shared PVC at
/var/split-brain/tokens/ (one file per token, named
tok_<id>.json). It scans the directory on startup, and
refreshes every 30 seconds on a background loop. On each
incoming request:
- Extract the bearer token from the
Authorizationheader. - Compute
sha256(token)and look it up in the active set (constant-time comparison; reject if not present orrevoked_at != null). - Stamp
token_idandowner_emailonto the request context for the audit log.
The token check itself is only "is this token active." Per-user daily
token quotas are enforced separately — by owner_email, not per token —
see token-limits.md. Last-used timestamps are batched and
written back every 60 seconds — never on the request hot path.
Failure modes:
- If the token directory becomes unreadable in steady state, the router keeps serving with the last-known set (fail-static, not fail-open).
- If the directory is unreadable at startup, the router refuses to become ready. An empty allow-list would silently lock out every client, and "deny all on first miss" is the only safe behavior.
Last-used updates also write to the same files (temp-write +
atomic rename(2)). Concurrent updates to different tokens
never touch the same file; concurrent updates to the same
token serialize via flock(2) on the target path.
Observability
- Audit log — the primary signal. Every request is written as one JSON
line to a per-pod file on the shared PVC
(
{audit_dir}/{pod}/{date}/{HH}.jsonl): owner, requested model, routing decision + confidence, backend and model actually used, token counts (including Claude cache reads), latency, status, and the (bounded) prompt/response. The UI's Request explorer reads and filters these, and the router exposes them via/v1/audit/export. - Logs — structured logs on stdout for warnings/errors (classifier fail-closed, backend errors, persistence failures).
There is no Prometheus /metrics endpoint or OpenTelemetry tracing in the
current build.
Concurrency model
Single Python process, uvicorn --workers 1, but async throughout.
For more parallelism we scale pods horizontally. We do not use threaded
workers because we have no CPU-bound work in the request path.
Concurrency per pod is capped by an asyncio.Semaphore sized from the
ROUTER_MAX_INFLIGHT env (default 256). Excess requests get 429.
Configuration (env vars)
| Name | Purpose |
|---|---|
ANTHROPIC_API_KEY |
Claude credential (from Secret). |
ANTHROPIC_MODEL |
Seed default Claude model (e.g. claude-sonnet-4-6). The live default is admin-tunable at runtime — see default-model.md. |
SCALARLM_BASE_URL |
In-cluster URL, e.g. http://scalarlm:8000/v1. |
CLASSIFIER_BASE_URL |
e.g. http://classifier:8080. |
CLASSIFIER_THRESHOLD |
Routing band half-width τ (default 0.4). p_novel ≤ τ → Claude; ≥ 1−τ → ScalarLM; the band between → ScalarLM (safe default). |
ROUTER_TOKEN_DIR |
PVC path holding tok_*.json files (default /var/split-brain/tokens). |
ROUTER_TOKEN_REFRESH_SECONDS |
Token-set refresh interval (default 30). |
ROUTER_TOKEN_LASTUSED_FLUSH_SECONDS |
Last-used batch interval (default 60). |
ROUTER_AUDIT_DIR |
PVC path for audit log writes (default /var/split-brain/audit). |
ROUTER_MAX_INFLIGHT |
Per-pod concurrency cap (default 256). |
ROUTER_CACHE_CLAUDE_PROMPT |
Inject Anthropic prompt-cache breakpoints on the OpenAI→Claude path (default true). See claude-prompt-caching.md. |
Per-user token limits and the runtime default-model setting are read from
JSON files on the PVC (limits.json, settings.json), not env — see
token-limits.md and default-model.md.
Code layout
router/
pyproject.toml
src/router/
__init__.py
app.py # FastAPI app + endpoints (lifespan wires it all)
config.py # env parsing (pydantic-settings)
auth.py # bearer-token store + check
tokens.py # token record (incl. per-token routing_mode)
classify.py # classifier HTTP client
policy.py # decide(): band rule + IP veto
translate.py # OpenAI <-> Anthropic schema, span extraction, caching
headers.py # Split-Brain-* response headers
concurrency.py # in-flight cap middleware
audit.py # audit record + writer (per-pod JSONL)
audit_read.py # audit reader (export)
usage.py # per-user daily token usage + limit store
settings.py # runtime settings store (default Claude model)
backends/
base.py # Backend protocol
claude.py # anthropic SDK adapter (pass-through + OpenAI path)
scalarlm.py # openai SDK adapter (custom base_url)
tests/ # test_translate / policy / backends / usage / settings / ...