split-brain

Sign in

Router service

The router is a stateless FastAPI service. It speaks the OpenAI Chat Completions wire format to clients and translates to the appropriate backend protocol per request.

Why FastAPI

  • Native async + Server-Sent Events, which we need for streaming.
  • Pydantic gives us strict validation of the request schema for free.
  • The Anthropic and OpenAI Python SDKs are async and integrate cleanly.

API surface

We expose the OpenAI Chat Completions endpoint (so OpenAI clients drop in by changing base_url + api_key) and the Anthropic Messages endpoint (so Anthropic-native clients like Claude Code drop in by setting ANTHROPIC_BASE_URL + ANTHROPIC_AUTH_TOKEN). Both ingresses share the same classify → route → audit core; see anthropic-ingress.md.

Method Path Notes
POST /v1/chat/completions OpenAI ingress. SSE if stream=true.
POST /v1/messages Anthropic ingress. SSE if stream=true. Claude path is a native pass-through (keeps thinking / caching); ScalarLM path translates. Classifies over the whole payload.
POST /v1/messages/count_tokens Local token estimate — never forwarded to Anthropic (would leak novel content).
GET /v1/models Lists router-auto plus the underlying backend ids.
GET /v1/audit/export NDJSON audit export (token-scoped). See orbital-traces.md.
GET /healthz Liveness — process is up.
GET /readyz Readiness — process is up and the token store has loaded.

Clients pick a model by name:

  • router-auto — run the classifier and route automatically (default).
  • claude — force Claude, if the classifier confirms the prompt is confidently general (p_novel <= threshold). Anything in the uncertain band or above is rejected with 403 — uncertain prompts are gated here for the same reason they default to ScalarLM in router-auto: the IP invariant.
  • scalarlm — force ScalarLM regardless of classifier output.

Only claude and scalarlm are treated as explicit backend overrides. Any other model value — an arbitrary name a downstream OpenAI client sends (e.g. nvidia/Gemma-4-31B-IT-NVFP4), or an omitted field — is treated as router-auto rather than rejected, so off-the-shelf clients work without reconfiguration. The raw requested name is still recorded in the audit log's request_model.

The forced names exist so that evaluation pipelines can collect ground-truth A/B data without the classifier in the loop.

Request lifecycle

async def chat_completions(req: ChatRequest) -> Response:
    decision = await classify(req)            # ~20-50ms
    backend = pick_backend(req.model, decision)
    if req.stream:
        return StreamingResponse(
            stream_from(backend, req),
            media_type="text/event-stream",
        )
    return JSONResponse(await call(backend, req))

Classifier call

The two ingresses classify at different granularities:

  • OpenAI ingress (/v1/chat/completions) classifies the last user message (translate.extract_last_user_message).
  • Anthropic ingress (/v1/messages) classifies the whole payload — every user text block and every tool_result, across all turns (translate.user_content_spans) — one classify call per span, and routes on the max p_novel. Agentic clients (Claude Code) carry proprietary content in tool results and earlier turns, not just the last message, so a benign last turn must not be able to ship the codebase to Claude.

Either way the system prompt and tool definitions are omitted (they bias the classifier toward "novel" and are client boilerplate); each span is capped at 8000 chars before the call. See anthropic-ingress.md § Classifier coverage.

Backend translation

Source field (OpenAI) Claude (Anthropic Messages) ScalarLM (OpenAI-compat)
messages[*] messages[*] (system separated) messages[*] (pass-through)
temperature temperature temperature
max_tokens max_tokens max_tokens
stop stop_sequences stop
tools tools (parametersinput_schema) tools (pass-through)
tool_choice tool_choice (requiredany, named→tool, none→tools omitted) tool_choice (pass-through)
assistant tool_calls tool_use content blocks (arguments JSON string → input object) pass-through
tool results tool_result blocks in a user turn (parallel results merged) pass-through

Tool calling is translated in both directions for Claude: on the way back, Anthropic tool_use response blocks become OpenAI tool_calls (the input object is re-serialized to an arguments JSON string, and the assistant content is null when the turn is tool-calls-only), with stop_reason: tool_usefinish_reason: tool_calls. ScalarLM is OpenAI-compatible, so tools pass through untouched.

Streaming is converted in both directions: Anthropic's event types (message_start, content_block_delta, …) are repackaged into OpenAI chat.completion.chunk events. Tool calls arrive in a single tool_calls delta once the upstream turn completes (Claude reports them on the final message), then the terminal chunk carries finish_reason: tool_calls.

Response headers

Every router response (streaming and non-streaming) carries a small set of headers that tell the client how the request was handled. These let an application correlate latency outliers with backend choice, surface routing decisions in client logs, and cross-reference the audit log without separate UI access.

Header Example Meaning
Split-Brain-Request-Id 01931a8a-… UUIDv7. Same id appears in the audit log.
Split-Brain-Backend claude | scalarlm Which backend produced the response.
Split-Brain-Backend-Model claude-sonnet-4-6 or scalarlm:nvidia/Gemma-4-31B-IT-NVFP4 Specific model + checkpoint that served the request. On the Anthropic ingress this is the model actually used (the client's requested claude-* is honored).
Split-Brain-Decision general | novel | uncertain | forced Routing decision. uncertain means the classifier's p_novel fell in the uncertain band and we routed to ScalarLM as the safe default (distinct from novel, where we were confident). forced if the client used model=claude / model=scalarlm and the veto passed.
Split-Brain-Confidence 0.92 p_novel from the classifier, formatted to two decimal places.
Split-Brain-Classifier-Version minilm/sentence-transformers/all-MiniLM-L6-v2+head-trained Classifier model version that made the decision.
Split-Brain-Classifier-Ms 23 Time spent in the classifier, integer milliseconds.

Header names use the Split-Brain-* prefix without X- (deprecated by RFC 6648).

Streaming

For stream=true (SSE) responses, all headers are set on the initial HTTP response before the first data: event. Clients that only consume the event stream still work unchanged; clients that inspect headers learn the routing decision before the first token arrives.

Error responses

The same headers are set on error responses where they are meaningful:

  • Split-Brain-Request-Id is always set.
  • Split-Brain-Decision and Split-Brain-Classifier-* are set if the classifier ran successfully and the failure was downstream (backend 5xx). They are omitted on classifier failures.
  • Split-Brain-Backend is set if a backend was selected before the failure, omitted otherwise.

This makes 5xx responses tractable to debug from the client side without UI access — the request-id alone gets the operator to the audit-log entry, and the headers tell the developer whether the failure was at the classifier or beyond.

Information disclosure

These headers reveal the classifier's decision and confidence to the client. We accept this because split-brain is internal-only (see ui.md) and every caller is an authenticated org member who could read the audit log directly anyway.

If split-brain ever serves external untrusted clients, the Split-Brain-Decision, Split-Brain-Confidence, and Split-Brain-Classifier-Version headers should be stripped from outbound responses — that trio is what would let an attacker probe the classifier's decision boundary. Split-Brain-Backend and Split-Brain-Request-Id are safe to expose to any client.

Error handling

Condition Status Behavior
Classifier 5xx / timeout 503 Fail closed. Routing can't be decided; defaulting to Claude could leak proprietary content, so the request is rejected — no Claude fallback.
ScalarLM 5xx / timeout 502 Propagate. No fallback to Claude (the request went to ScalarLM precisely because it may be proprietary).
Claude 5xx / timeout 502 Propagate. No fallback to ScalarLM.
Forced model=claude, not confidently general 403 IP veto (policy.IPVetoError).
Over daily token quota 429 See token-limits.md.
Invalid request schema 400 Standard OpenAI error envelope.
Auth failure (missing/invalid token) 401 Standard OpenAI error envelope.

The router never silently fails over between backends. A request routed to ScalarLM (novel/uncertain) is never retried on Claude — that would be the exact IP leak the classifier exists to prevent — and a classifier outage fails the request closed (503) rather than guessing. The only path to Claude for a forced request is the deliberate, classifier-confirmed forced-claude route above (or a per-token claude override — see token-routing.md).

Authentication

Clients authenticate with a bearer token (Authorization: Bearer sbk_<…>). Tokens are issued by users themselves through the UI's Tokens view (see ui.md) — the router has no admin path for creating or distributing tokens, and there is no static API-key file.

The router loads the active token set from a directory of per-token JSON files on the shared PVC at /var/split-brain/tokens/ (one file per token, named tok_<id>.json). It scans the directory on startup, and refreshes every 30 seconds on a background loop. On each incoming request:

  1. Extract the bearer token from the Authorization header.
  2. Compute sha256(token) and look it up in the active set (constant-time comparison; reject if not present or revoked_at != null).
  3. Stamp token_id and owner_email onto the request context for the audit log.

The token check itself is only "is this token active." Per-user daily token quotas are enforced separately — by owner_email, not per token — see token-limits.md. Last-used timestamps are batched and written back every 60 seconds — never on the request hot path.

Failure modes:

  • If the token directory becomes unreadable in steady state, the router keeps serving with the last-known set (fail-static, not fail-open).
  • If the directory is unreadable at startup, the router refuses to become ready. An empty allow-list would silently lock out every client, and "deny all on first miss" is the only safe behavior.

Last-used updates also write to the same files (temp-write + atomic rename(2)). Concurrent updates to different tokens never touch the same file; concurrent updates to the same token serialize via flock(2) on the target path.

Observability

  • Audit log — the primary signal. Every request is written as one JSON line to a per-pod file on the shared PVC ({audit_dir}/{pod}/{date}/{HH}.jsonl): owner, requested model, routing decision + confidence, backend and model actually used, token counts (including Claude cache reads), latency, status, and the (bounded) prompt/response. The UI's Request explorer reads and filters these, and the router exposes them via /v1/audit/export.
  • Logs — structured logs on stdout for warnings/errors (classifier fail-closed, backend errors, persistence failures).

There is no Prometheus /metrics endpoint or OpenTelemetry tracing in the current build.

Concurrency model

Single Python process, uvicorn --workers 1, but async throughout. For more parallelism we scale pods horizontally. We do not use threaded workers because we have no CPU-bound work in the request path.

Concurrency per pod is capped by an asyncio.Semaphore sized from the ROUTER_MAX_INFLIGHT env (default 256). Excess requests get 429.

Configuration (env vars)

Name Purpose
ANTHROPIC_API_KEY Claude credential (from Secret).
ANTHROPIC_MODEL Seed default Claude model (e.g. claude-sonnet-4-6). The live default is admin-tunable at runtime — see default-model.md.
SCALARLM_BASE_URL In-cluster URL, e.g. http://scalarlm:8000/v1.
CLASSIFIER_BASE_URL e.g. http://classifier:8080.
CLASSIFIER_THRESHOLD Routing band half-width τ (default 0.4). p_novel ≤ τ → Claude; ≥ 1−τ → ScalarLM; the band between → ScalarLM (safe default).
ROUTER_TOKEN_DIR PVC path holding tok_*.json files (default /var/split-brain/tokens).
ROUTER_TOKEN_REFRESH_SECONDS Token-set refresh interval (default 30).
ROUTER_TOKEN_LASTUSED_FLUSH_SECONDS Last-used batch interval (default 60).
ROUTER_AUDIT_DIR PVC path for audit log writes (default /var/split-brain/audit).
ROUTER_MAX_INFLIGHT Per-pod concurrency cap (default 256).
ROUTER_CACHE_CLAUDE_PROMPT Inject Anthropic prompt-cache breakpoints on the OpenAI→Claude path (default true). See claude-prompt-caching.md.

Per-user token limits and the runtime default-model setting are read from JSON files on the PVC (limits.json, settings.json), not env — see token-limits.md and default-model.md.

Code layout

router/
  pyproject.toml
  src/router/
    __init__.py
    app.py              # FastAPI app + endpoints (lifespan wires it all)
    config.py           # env parsing (pydantic-settings)
    auth.py             # bearer-token store + check
    tokens.py           # token record (incl. per-token routing_mode)
    classify.py         # classifier HTTP client
    policy.py           # decide(): band rule + IP veto
    translate.py        # OpenAI <-> Anthropic schema, span extraction, caching
    headers.py          # Split-Brain-* response headers
    concurrency.py      # in-flight cap middleware
    audit.py            # audit record + writer (per-pod JSONL)
    audit_read.py       # audit reader (export)
    usage.py            # per-user daily token usage + limit store
    settings.py         # runtime settings store (default Claude model)
    backends/
      base.py           # Backend protocol
      claude.py         # anthropic SDK adapter (pass-through + OpenAI path)
      scalarlm.py       # openai SDK adapter (custom base_url)
  tests/                # test_translate / policy / backends / usage / settings / ...