Anthropic ingress (Claude Code integration) — design

Status: Implemented (Phase 1 + 3). The router now serves POST /v1/messages (streaming + non-streaming) and POST /v1/messages/count_tokens. The Claude path is a native pass-through; the ScalarLM path translates Anthropic↔OpenAI (request, response, and the streaming event stream). Classification is over the whole payload. count_tokens is local. Not done: the Phase 2 canonical-IR refactor — we took the pragmatic route of giving each backend both complete/stream (OpenAI) and complete_messages/stream_messages (Anthropic) methods, which leaves the two ingresses' orchestration (auth/classify/decide/audit) somewhat duplicated. Open items below remain (notably the span-sampling strategy for very large payloads and tokenizer accuracy).

Motivation

We want Anthropic-native clients — first of all the Claude Code CLI — to route through split-brain, so that an engineer's coding traffic gets the same novelty-based routing and IP protection as everything else: proprietary work stays on ScalarLM, general work can use Claude.

The blocker is a protocol mismatch. Claude Code speaks only the Anthropic Messages API (POST /v1/messages, POST /v1/messages/count_tokens); it expects Anthropic-shaped responses and errors on OpenAI-shaped ones. The router today speaks only the OpenAI Chat Completions API. So ANTHROPIC_BASE_URL=https://router/... does not work as-is — the path 404s and the payloads don't match.

This design adds an Anthropic ingress to the router and reorganizes the internal translation so the Claude path stays high-fidelity.

Goals

A Claude Code user sets two env vars and is routed by split-brain:
ANTHROPIC_BASE_URL=https://<router-host>
ANTHROPIC_AUTH_TOKEN=sbk_… (a split-brain API token)
Native fidelity on the Claude path: extended thinking, prompt caching (cache_control), tool calling, and streaming all survive.
The IP invariant holds for agentic clients — proprietary code never reaches Claude, even when it is spread across tool results and history rather than the last user message (see § Classifier coverage).
Reuse the existing OpenAI↔Anthropic translation rather than fork it.

Non-goals

Bedrock / Vertex gateway shapes (Claude Code also supports these; we do the Anthropic shape only).
Vision / images on the ScalarLM path (text + tools first).
Making ScalarLM a good agentic coding model — that is a ScalarLM concern; this design only makes the plumbing faithful.

Background: what Claude Code actually sends

Confirmed against the Claude Code docs:

Aspect	Value
Endpoints	`POST /v1/messages`, `POST /v1/messages/count_tokens`, and `/v1/models` for gateway model discovery
Auth	`ANTHROPIC_AUTH_TOKEN` → `Authorization: Bearer <token>` (matches the router's `require_token`). `ANTHROPIC_API_KEY` → `x-api-key` instead.
Headers	`anthropic-version`, `anthropic-beta`, plus `X-Claude-Code-Session-Id`
Body	Large system prompt, tool definitions (often tens of KB), `cache_control` on system/context by default, `thinking` blocks when enabled, streaming on by default
Responses	Anthropic shape required (`{"content": [...], "usage": {...}}`); no OpenAI-response translation on the client side

The core decision: Anthropic is the internal canonical format

The router's internal "lingua franca" is currently OpenAI: both backends consume and emit OpenAI bodies, and ClaudeBackend translates OpenAI→Anthropic internally (backends/claude.py).

If the Anthropic ingress translated down to OpenAI internally, the Claude path would become Anthropic → OpenAI → Anthropic — exactly the double-translation that drops thinking and cache_control. That defeats the purpose.

So we flip the canonical internal representation to Anthropic (the richer superset: it carries thinking, cache_control, and tool_use; OpenAI is a subset). Every edge becomes a thin adapter, and — crucially — the two translation functions we already have get reused in flipped roles:

Edge	Request adapter	Response adapter
OpenAI ingress (existing clients)	`openai_to_anthropic_request` (have)	`anthropic_to_openai_response` (have)
Anthropic ingress (Claude Code)	identity	identity
Claude backend	identity (near pass-through)	identity
ScalarLM backend	`anthropic_to_openai_request` (new)	`openai_to_anthropic_response` (new)

Net effect:

The Claude backend gets simpler — it forwards the canonical (Anthropic) body to Anthropic with the model overridden, and returns the result unchanged. Thinking / caching / tools / streaming are native.
The ScalarLM backend gains the mirror translation pair.
The orchestration in the middle — auth → classify → policy decision → audit — stays protocol-agnostic, operating on the canonical body.

            OpenAI client                     Claude Code
                 │ /v1/chat/completions            │ /v1/messages
                 ▼                                  ▼
        openai_to_anthropic_request           (identity)
                 └───────────────┬──────────────────┘
                                 ▼
                 classify · decide · audit   (canonical = Anthropic)
                        ┌────────┴─────────┐
                        ▼                  ▼
              Claude backend        ScalarLM backend
              (identity →           (anthropic_to_openai_request →
               Anthropic API)        ScalarLM → openai_to_anthropic_response)
                        │                  │
                        ▼                  ▼
              response rendered per ingress:
              OpenAI client → anthropic_to_openai_response
              Claude Code   → identity

New surface to build

POST /v1/messages (non-streaming + SSE) and /v1/models in Anthropic shape (gateway model discovery hits /v1/models).
POST /v1/messages/count_tokens — see § count_tokens.
anthropic_to_openai_request and openai_to_anthropic_response in translate.py, including the tool-calling cases (mirror of the functions we already have).
Streaming translation, both directions. Anthropic SSE events (message_start, content_block_start, content_block_delta / input_json_delta, message_delta, message_stop) ↔ OpenAI chat.completion.chunks. The Anthropic→OpenAI half largely exists inside backends/claude.py::stream and moves into the OpenAI-ingress adapter; the OpenAI→Anthropic half (for the ScalarLM path) is new.
Header handling: forward anthropic-version / anthropic-beta to Claude; strip Claude-only fields (cache_control, thinking) on the ScalarLM path.

Invariant risks (the parts that matter more than the plumbing)

Classifier coverage

For chat, the proprietary content is the user's question, so classifying the last user message would be enough. For an agent it is not: Claude Code spreads your codebase across the system prompt, tool definitions, prior assistant turns, and tool results (file reads). A benign last user message — "now add a test" — would classify general → route to Claude, shipping the entire proprietary context with it. A direct IP-invariant violation.

So the Anthropic ingress classifies the whole payload. _classify_whole_payload extracts every user text block and every tool_result, across all turns (translate.user_content_spans), classifies each span independently (each capped at 8000 chars, fanned out under a concurrency cap), and routes on the max p_novel — any one proprietary span biases the whole request to ScalarLM. The system prompt and tool definitions are omitted (client boilerplate that skews novelty). The OpenAI ingress (/v1/chat/completions), which serves plain chat, still classifies the last user message (extract_last_user_message).

The original design left "classify the union vs. the max over chunks vs. a sliding window" open; the default is max over spans — the most conservative choice for the IP invariant.

Span-fraction aggregation (opt-in). Max-over-spans is brittle for long agentic payloads: with N spans and a non-zero per-span false-positive rate, P(≥1 span flips) = 1 − (1−p)^N grows with N, so a single borderline span sends a 20-span request to ScalarLM. ROUTER_NOVEL_SPAN_FRACTION (default 0.0 = max) instead routes on the k-th highest span p_novel, where k = ceil(fraction × N) — i.e. the request is novel only if that fraction of its spans are novel. This normalizes for span count (more spans no longer inflate the novel odds) and ignores a lone outlier, but relaxes the IP invariant — a single proprietary span diluted in a large general request can reach Claude. A small, concentrated proprietary request keeps k=1 and stays protected. Opt-in per deployment; the dev overlay uses 0.15.

count_tokens must not leak

Claude Code calls /v1/messages/count_tokens frequently to manage its context window. The naive implementation proxies it to Anthropic — but that ships the full message content (your code) to Anthropic even for novel prompts, violating the invariant before a single completion is routed.

Decision: the router answers count_tokens locally (a tokenizer or a calibrated estimate) and never forwards it to Anthropic. An approximate count is acceptable for context management; correctness of the invariant is not negotiable.

Per-session routing stability

Once the first turn is routed, Claude Code's follow-ups are tool results with no new user message, so routing stays put for the session — provided we adopt whole-payload classification above (otherwise a later tool-result-only turn has nothing to classify and could drift). This is a reason to make the classification cover the whole payload rather than re-deriving per turn.

ScalarLM has to drive the loop

With the invariant correctly enforced, essentially every coding session on your own code routes to ScalarLM (Gemma-class). The entire UX then depends on ScalarLM's agentic tool-use quality and context window. This is a product risk, not a plumbing one — worth validating in isolation (point Claude Code at ScalarLM directly through a thin shim) before investing in the ingress.

Operational

Cloudflare Access: the router host must be reachable programmatically with the bearer token — a hostname with no interactive Access SSO (that gates the UI), or an Access service token. cloudflared is token-mode, so this is configured in the Cloudflare dashboard.
Context window: ScalarLM's window must cover Claude Code's contexts or requests truncate/fail.
Non-first-party host: Claude Code disables some MCP tool-search behavior against non-Anthropic hosts unless ENABLE_TOOL_SEARCH=true.

Phased plan

Phase 0 — de-risk (no router change): point Claude Code at ScalarLM directly via a thin Anthropic-shim (or LiteLLM) and judge whether Gemma-class can drive the agent loop. Make/break before building.
Phase 1 — ingress spike: add /v1/messages as a parallel mini-pipeline (don't refactor the OpenAI core yet) that classifies over the whole payload and only passes through to Claude. Point Claude Code at it; prove auth + streaming + tools + thinking end-to-end against the real router.
Phase 2 — canonical IR refactor: flip the internal format to Anthropic so both ingresses share orchestration; the existing translate functions become the OpenAI-ingress adapters.
Phase 3 — ScalarLM path: anthropic_to_openai_request / openai_to_anthropic_response, then streaming translation, then count_tokens (local), then strip cache_control / thinking.

Alternatives considered

OpenAI stays internal; Anthropic ingress translates down. Rejected: double-translates the Claude path, losing thinking + caching — the exact fidelity we're adding the ingress to preserve.
LiteLLM / claude-code-proxy in front, no router change. Fast to stand up and good for Phase 0, but the Claude path still double-translates (Anthropic→OpenAI at the proxy, OpenAI→Anthropic in the Claude backend). Acceptable as a spike, not as the product.
Anthropic ingress as a permanent self-contained mini-pipeline. Avoids the core refactor but duplicates orchestration (auth, classify, decide, audit, streaming plumbing) across two ingresses. Fine for Phase 1; the canonical-IR refactor (Phase 2) removes the duplication.

Open questions

Whole-payload classification: union vs. max-over-chunks vs. sliding window — and the latency/cost budget for classifying large agent payloads on every request.
count_tokens: which tokenizer do we count with, and how close does it need to be to Claude's for Claude Code's context management to behave?
Do we expose model aliases (opus, sonnet) and map them to router-auto / forced backends, or only router-auto?
Streaming: do we need true incremental tool-argument streaming on the Claude path, or is the current "tool_calls in one delta at end-of-turn" acceptable for Claude Code?

Testing

Unit: the new translate functions (request + response + streaming), with tool calling, mirroring tests/test_translate.py.
Unit: whole-payload classifier extraction (proprietary content in tool results / system → routes to ScalarLM).
Unit: count_tokens never performs network I/O.
Integration: a recorded Claude Code session replayed against the ingress (both backends), asserting Anthropic-shaped responses and that no novel content egresses to Anthropic.