Router service

The router is a stateless FastAPI service. It speaks the OpenAI Chat Completions wire format to clients and translates to the appropriate backend protocol per request.

Why FastAPI

Native async + Server-Sent Events, which we need for streaming.
Pydantic gives us strict validation of the request schema for free.
The Anthropic and OpenAI Python SDKs are async and integrate cleanly.

API surface

We expose the OpenAI Chat Completions endpoint (so OpenAI clients drop in by changing base_url + api_key) and the Anthropic Messages endpoint (so Anthropic-native clients like Claude Code drop in by setting ANTHROPIC_BASE_URL + ANTHROPIC_AUTH_TOKEN). Both ingresses share the same classify → route → audit core; see anthropic-ingress.md.

Method	Path	Notes
POST	`/v1/chat/completions`	OpenAI ingress. SSE if `stream=true`.
POST	`/v1/messages`	Anthropic ingress. SSE if `stream=true`. Claude path is a native pass-through (keeps thinking / caching); ScalarLM path translates. Classifies over the whole payload.
POST	`/v1/messages/count_tokens`	Local token estimate — never forwarded to Anthropic (would leak novel content).
GET	`/v1/models`	Lists `router-auto` plus the underlying backend ids.
GET	`/v1/audit/export`	NDJSON audit export (token-scoped). See orbital-traces.md.
GET	`/healthz`	Liveness — process is up.
GET	`/readyz`	Readiness — process is up and the token store has loaded.

Clients pick a model by name:

router-auto — run the classifier and route automatically (default).
claude — force Claude, if the classifier confirms the prompt is confidently general (p_novel <= threshold). Anything in the uncertain band or above is rejected with 403 — uncertain prompts are gated here for the same reason they default to ScalarLM in router-auto: the IP invariant.
scalarlm — force ScalarLM regardless of classifier output.

Only claude and scalarlm are treated as explicit backend overrides. Any other model value — an arbitrary name a downstream OpenAI client sends (e.g. nvidia/Gemma-4-31B-IT-NVFP4), or an omitted field — is treated as router-auto rather than rejected, so off-the-shelf clients work without reconfiguration. The raw requested name is still recorded in the audit log's request_model.

The forced names exist so that evaluation pipelines can collect ground-truth A/B data without the classifier in the loop.

Request lifecycle

async def chat_completions(req: ChatRequest) -> Response:
    decision = await classify(req)            # ~20-50ms
    backend = pick_backend(req.model, decision)
    if req.stream:
        return StreamingResponse(
            stream_from(backend, req),
            media_type="text/event-stream",
        )
    return JSONResponse(await call(backend, req))

Classifier call

The two ingresses classify at different granularities:

OpenAI ingress (/v1/chat/completions) classifies the last user message (translate.extract_last_user_message).
Anthropic ingress (/v1/messages) classifies the whole payload — every user text block and every tool_result, across all turns (translate.user_content_spans) — one classify call per span, and routes on the max p_novel. Agentic clients (Claude Code) carry proprietary content in tool results and earlier turns, not just the last message, so a benign last turn must not be able to ship the codebase to Claude.

Either way the system prompt and tool definitions are omitted (they bias the classifier toward "novel" and are client boilerplate); each span is capped at 8000 chars before the call. See anthropic-ingress.md § Classifier coverage.

Backend translation

Source field (OpenAI)	Claude (Anthropic Messages)	ScalarLM (OpenAI-compat)
`messages[*]`	`messages[*]` (system separated)	`messages[*]` (pass-through)
`temperature`	`temperature`	`temperature`
`max_tokens`	`max_tokens`	`max_tokens`
`stop`	`stop_sequences`	`stop`
`tools`	`tools` (`parameters` → `input_schema`)	`tools` (pass-through)
`tool_choice`	`tool_choice` (`required`→`any`, named→`tool`, `none`→tools omitted)	`tool_choice` (pass-through)
assistant `tool_calls`	`tool_use` content blocks (`arguments` JSON string → `input` object)	pass-through
`tool` results	`tool_result` blocks in a user turn (parallel results merged)	pass-through

Tool calling is translated in both directions for Claude: on the way back, Anthropic tool_use response blocks become OpenAI tool_calls (the input object is re-serialized to an arguments JSON string, and the assistant content is null when the turn is tool-calls-only), with stop_reason: tool_use → finish_reason: tool_calls. ScalarLM is OpenAI-compatible, so tools pass through untouched.

Streaming is converted in both directions: Anthropic's event types (message_start, content_block_delta, …) are repackaged into OpenAI chat.completion.chunk events. Tool calls arrive in a single tool_calls delta once the upstream turn completes (Claude reports them on the final message), then the terminal chunk carries finish_reason: tool_calls.

Response headers

Every router response (streaming and non-streaming) carries a small set of headers that tell the client how the request was handled. These let an application correlate latency outliers with backend choice, surface routing decisions in client logs, and cross-reference the audit log without separate UI access.

Header	Example	Meaning
`Split-Brain-Request-Id`	`01931a8a-…`	UUIDv7. Same id appears in the audit log.
`Split-Brain-Backend`	`claude` \| `scalarlm`	Which backend produced the response.
`Split-Brain-Backend-Model`	`claude-sonnet-4-6` or `scalarlm:nvidia/Gemma-4-31B-IT-NVFP4`	Specific model + checkpoint that served the request. On the Anthropic ingress this is the model actually used (the client's requested `claude-*` is honored).
`Split-Brain-Decision`	`general` \| `novel` \| `uncertain` \| `forced`	Routing decision. `uncertain` means the classifier's `p_novel` fell in the uncertain band and we routed to ScalarLM as the safe default (distinct from `novel`, where we were confident). `forced` if the client used `model=claude` / `model=scalarlm` and the veto passed.
`Split-Brain-Confidence`	`0.92`	`p_novel` from the classifier, formatted to two decimal places.
`Split-Brain-Classifier-Version`	`minilm/sentence-transformers/all-MiniLM-L6-v2+head-trained`	Classifier model version that made the decision.
`Split-Brain-Classifier-Ms`	`23`	Time spent in the classifier, integer milliseconds.

Header names use the Split-Brain-* prefix without X- (deprecated by RFC 6648).

Streaming

For stream=true (SSE) responses, all headers are set on the initial HTTP response before the first data: event. Clients that only consume the event stream still work unchanged; clients that inspect headers learn the routing decision before the first token arrives.

Error responses

The same headers are set on error responses where they are meaningful:

Split-Brain-Request-Id is always set.
Split-Brain-Decision and Split-Brain-Classifier-* are set if the classifier ran successfully and the failure was downstream (backend 5xx). They are omitted on classifier failures.
Split-Brain-Backend is set if a backend was selected before the failure, omitted otherwise.

This makes 5xx responses tractable to debug from the client side without UI access — the request-id alone gets the operator to the audit-log entry, and the headers tell the developer whether the failure was at the classifier or beyond.

Information disclosure

These headers reveal the classifier's decision and confidence to the client. We accept this because split-brain is internal-only (see ui.md) and every caller is an authenticated org member who could read the audit log directly anyway.

If split-brain ever serves external untrusted clients, the Split-Brain-Decision, Split-Brain-Confidence, and Split-Brain-Classifier-Version headers should be stripped from outbound responses — that trio is what would let an attacker probe the classifier's decision boundary. Split-Brain-Backend and Split-Brain-Request-Id are safe to expose to any client.

Error handling

Condition	Status	Behavior
Classifier 5xx / timeout	503	Fail closed. Routing can't be decided; defaulting to Claude could leak proprietary content, so the request is rejected — no Claude fallback.
ScalarLM 5xx / timeout	502	Propagate. No fallback to Claude (the request went to ScalarLM precisely because it may be proprietary).
Claude 5xx / timeout	502	Propagate. No fallback to ScalarLM.
Forced `model=claude`, not confidently general	403	IP veto (`policy.IPVetoError`).
Over daily token quota	429	See token-limits.md.
Invalid request schema	400	Standard OpenAI error envelope.
Auth failure (missing/invalid token)	401	Standard OpenAI error envelope.

The router never silently fails over between backends. A request routed to ScalarLM (novel/uncertain) is never retried on Claude — that would be the exact IP leak the classifier exists to prevent — and a classifier outage fails the request closed (503) rather than guessing. The only path to Claude for a forced request is the deliberate, classifier-confirmed forced-claude route above (or a per-token claude override — see token-routing.md).

Authentication

Clients authenticate with a bearer token (Authorization: Bearer sbk_<…>). Tokens are issued by users themselves through the UI's Tokens view (see ui.md) — the router has no admin path for creating or distributing tokens, and there is no static API-key file.

The router loads the active token set from a directory of per-token JSON files on the shared PVC at /var/split-brain/tokens/ (one file per token, named tok_<id>.json). It scans the directory on startup, and refreshes every 30 seconds on a background loop. On each incoming request:

Extract the bearer token from the Authorization header.
Compute sha256(token) and look it up in the active set (constant-time comparison; reject if not present or revoked_at != null).
Stamp token_id and owner_email onto the request context for the audit log.

The token check itself is only "is this token active." Per-user daily token quotas are enforced separately — by owner_email, not per token — see token-limits.md. Last-used timestamps are batched and written back every 60 seconds — never on the request hot path.

Failure modes:

If the token directory becomes unreadable in steady state, the router keeps serving with the last-known set (fail-static, not fail-open).
If the directory is unreadable at startup, the router refuses to become ready. An empty allow-list would silently lock out every client, and "deny all on first miss" is the only safe behavior.

Last-used updates also write to the same files (temp-write + atomic rename(2)). Concurrent updates to different tokens never touch the same file; concurrent updates to the same token serialize via flock(2) on the target path.

Observability

Audit log — the primary signal. Every request is written as one JSON line to a per-pod file on the shared PVC ({audit_dir}/{pod}/{date}/{HH}.jsonl): owner, requested model, routing decision + confidence, backend and model actually used, token counts (including Claude cache reads), latency, status, and the (bounded) prompt/response. The UI's Request explorer reads and filters these, and the router exposes them via /v1/audit/export.
Logs — structured logs on stdout for warnings/errors (classifier fail-closed, backend errors, persistence failures).

There is no Prometheus /metrics endpoint or OpenTelemetry tracing in the current build.

Concurrency model

Single Python process, uvicorn --workers 1, but async throughout. For more parallelism we scale pods horizontally. We do not use threaded workers because we have no CPU-bound work in the request path.

Concurrency per pod is capped by an asyncio.Semaphore sized from the ROUTER_MAX_INFLIGHT env (default 256). Excess requests get 429.

Configuration (env vars)

Name	Purpose
`ANTHROPIC_API_KEY`	Claude credential (from Secret).
`ANTHROPIC_MODEL`	Seed default Claude model (e.g. `claude-sonnet-4-6`). The live default is admin-tunable at runtime — see default-model.md.
`SCALARLM_BASE_URL`	In-cluster URL, e.g. `http://scalarlm:8000/v1`.
`CLASSIFIER_BASE_URL`	e.g. `http://classifier:8080`.
`CLASSIFIER_THRESHOLD`	Routing band half-width τ (default `0.4`). `p_novel ≤ τ` → Claude; `≥ 1−τ` → ScalarLM; the band between → ScalarLM (safe default).
`ROUTER_TOKEN_DIR`	PVC path holding `tok_*.json` files (default `/var/split-brain/tokens`).
`ROUTER_TOKEN_REFRESH_SECONDS`	Token-set refresh interval (default 30).
`ROUTER_TOKEN_LASTUSED_FLUSH_SECONDS`	Last-used batch interval (default 60).
`ROUTER_AUDIT_DIR`	PVC path for audit log writes (default `/var/split-brain/audit`).
`ROUTER_MAX_INFLIGHT`	Per-pod concurrency cap (default 256).
`ROUTER_CACHE_CLAUDE_PROMPT`	Inject Anthropic prompt-cache breakpoints on the OpenAI→Claude path (default true). See claude-prompt-caching.md.

Per-user token limits and the runtime default-model setting are read from JSON files on the PVC (limits.json, settings.json), not env — see token-limits.md and default-model.md.

Code layout

router/
  pyproject.toml
  src/router/
    __init__.py
    app.py              # FastAPI app + endpoints (lifespan wires it all)
    config.py           # env parsing (pydantic-settings)
    auth.py             # bearer-token store + check
    tokens.py           # token record (incl. per-token routing_mode)
    classify.py         # classifier HTTP client
    policy.py           # decide(): band rule + IP veto
    translate.py        # OpenAI <-> Anthropic schema, span extraction, caching
    headers.py          # Split-Brain-* response headers
    concurrency.py      # in-flight cap middleware
    audit.py            # audit record + writer (per-pod JSONL)
    audit_read.py       # audit reader (export)
    usage.py            # per-user daily token usage + limit store
    settings.py         # runtime settings store (default Claude model)
    backends/
      base.py           # Backend protocol
      claude.py         # anthropic SDK adapter (pass-through + OpenAI path)
      scalarlm.py       # openai SDK adapter (custom base_url)
  tests/                # test_translate / policy / backends / usage / settings / ...