Anthropic ingress (Claude Code integration) — design
Status: Implemented (Phase 1 + 3). The router now serves
POST /v1/messages (streaming + non-streaming) and
POST /v1/messages/count_tokens. The Claude path is a native
pass-through; the ScalarLM path translates Anthropic↔OpenAI (request,
response, and the streaming event stream). Classification is over the
whole payload. count_tokens is local. Not done: the Phase 2
canonical-IR refactor — we took the pragmatic route of giving each
backend both complete/stream (OpenAI) and
complete_messages/stream_messages (Anthropic) methods, which leaves
the two ingresses' orchestration (auth/classify/decide/audit) somewhat
duplicated. Open items below remain (notably the span-sampling strategy
for very large payloads and tokenizer accuracy).
Motivation
We want Anthropic-native clients — first of all the Claude Code CLI — to route through split-brain, so that an engineer's coding traffic gets the same novelty-based routing and IP protection as everything else: proprietary work stays on ScalarLM, general work can use Claude.
The blocker is a protocol mismatch. Claude Code speaks only the
Anthropic Messages API (POST /v1/messages, POST
/v1/messages/count_tokens); it expects Anthropic-shaped responses and
errors on OpenAI-shaped ones. The router today speaks only the
OpenAI Chat Completions API. So ANTHROPIC_BASE_URL=https://router/...
does not work as-is — the path 404s and the payloads don't match.
This design adds an Anthropic ingress to the router and reorganizes the internal translation so the Claude path stays high-fidelity.
Goals
- A Claude Code user sets two env vars and is routed by split-brain:
ANTHROPIC_BASE_URL=https://<router-host>ANTHROPIC_AUTH_TOKEN=sbk_…(a split-brain API token)- Native fidelity on the Claude path: extended thinking, prompt caching
(
cache_control), tool calling, and streaming all survive. - The IP invariant holds for agentic clients — proprietary code never reaches Claude, even when it is spread across tool results and history rather than the last user message (see § Classifier coverage).
- Reuse the existing OpenAI↔Anthropic translation rather than fork it.
Non-goals
- Bedrock / Vertex gateway shapes (Claude Code also supports these; we do the Anthropic shape only).
- Vision / images on the ScalarLM path (text + tools first).
- Making ScalarLM a good agentic coding model — that is a ScalarLM concern; this design only makes the plumbing faithful.
Background: what Claude Code actually sends
Confirmed against the Claude Code docs:
| Aspect | Value |
|---|---|
| Endpoints | POST /v1/messages, POST /v1/messages/count_tokens, and /v1/models for gateway model discovery |
| Auth | ANTHROPIC_AUTH_TOKEN → Authorization: Bearer <token> (matches the router's require_token). ANTHROPIC_API_KEY → x-api-key instead. |
| Headers | anthropic-version, anthropic-beta, plus X-Claude-Code-Session-Id |
| Body | Large system prompt, tool definitions (often tens of KB), cache_control on system/context by default, thinking blocks when enabled, streaming on by default |
| Responses | Anthropic shape required ({"content": [...], "usage": {...}}); no OpenAI-response translation on the client side |
The core decision: Anthropic is the internal canonical format
The router's internal "lingua franca" is currently OpenAI: both
backends consume and emit OpenAI bodies, and ClaudeBackend translates
OpenAI→Anthropic internally (backends/claude.py).
If the Anthropic ingress translated down to OpenAI internally, the
Claude path would become Anthropic → OpenAI → Anthropic — exactly the
double-translation that drops thinking and cache_control. That
defeats the purpose.
So we flip the canonical internal representation to Anthropic (the richer superset: it carries thinking, cache_control, and tool_use; OpenAI is a subset). Every edge becomes a thin adapter, and — crucially — the two translation functions we already have get reused in flipped roles:
| Edge | Request adapter | Response adapter |
|---|---|---|
| OpenAI ingress (existing clients) | openai_to_anthropic_request (have) |
anthropic_to_openai_response (have) |
| Anthropic ingress (Claude Code) | identity | identity |
| Claude backend | identity (near pass-through) | identity |
| ScalarLM backend | anthropic_to_openai_request (new) |
openai_to_anthropic_response (new) |
Net effect:
- The Claude backend gets simpler — it forwards the canonical (Anthropic) body to Anthropic with the model overridden, and returns the result unchanged. Thinking / caching / tools / streaming are native.
- The ScalarLM backend gains the mirror translation pair.
- The orchestration in the middle — auth → classify → policy decision → audit — stays protocol-agnostic, operating on the canonical body.
OpenAI client Claude Code
│ /v1/chat/completions │ /v1/messages
▼ ▼
openai_to_anthropic_request (identity)
└───────────────┬──────────────────┘
▼
classify · decide · audit (canonical = Anthropic)
┌────────┴─────────┐
▼ ▼
Claude backend ScalarLM backend
(identity → (anthropic_to_openai_request →
Anthropic API) ScalarLM → openai_to_anthropic_response)
│ │
▼ ▼
response rendered per ingress:
OpenAI client → anthropic_to_openai_response
Claude Code → identity
New surface to build
POST /v1/messages(non-streaming + SSE) and/v1/modelsin Anthropic shape (gateway model discovery hits/v1/models).POST /v1/messages/count_tokens— see § count_tokens.anthropic_to_openai_requestandopenai_to_anthropic_responseintranslate.py, including the tool-calling cases (mirror of the functions we already have).- Streaming translation, both directions. Anthropic SSE events
(
message_start,content_block_start,content_block_delta/input_json_delta,message_delta,message_stop) ↔ OpenAIchat.completion.chunks. The Anthropic→OpenAI half largely exists insidebackends/claude.py::streamand moves into the OpenAI-ingress adapter; the OpenAI→Anthropic half (for the ScalarLM path) is new. - Header handling: forward
anthropic-version/anthropic-betato Claude; strip Claude-only fields (cache_control,thinking) on the ScalarLM path.
Invariant risks (the parts that matter more than the plumbing)
Classifier coverage
For chat, the proprietary content is the user's question, so classifying the last user message would be enough. For an agent it is not: Claude Code spreads your codebase across the system prompt, tool definitions, prior assistant turns, and tool results (file reads). A benign last user message — "now add a test" — would classify general → route to Claude, shipping the entire proprietary context with it. A direct IP-invariant violation.
So the Anthropic ingress classifies the whole payload. _classify_whole_payload
extracts every user text block and every tool_result, across all turns
(translate.user_content_spans), classifies each span independently (each
capped at 8000 chars, fanned out under a concurrency cap), and routes on the
max p_novel — any one proprietary span biases the whole request to
ScalarLM. The system prompt and tool definitions are omitted (client
boilerplate that skews novelty). The OpenAI ingress (/v1/chat/completions),
which serves plain chat, still classifies the last user message
(extract_last_user_message).
The original design left "classify the union vs. the max over chunks vs. a sliding window" open; the default is max over spans — the most conservative choice for the IP invariant.
Span-fraction aggregation (opt-in). Max-over-spans is brittle for long
agentic payloads: with N spans and a non-zero per-span false-positive rate,
P(≥1 span flips) = 1 − (1−p)^N grows with N, so a single borderline span
sends a 20-span request to ScalarLM. ROUTER_NOVEL_SPAN_FRACTION (default
0.0 = max) instead routes on the k-th highest span p_novel, where
k = ceil(fraction × N) — i.e. the request is novel only if that fraction of
its spans are novel. This normalizes for span count (more spans no longer
inflate the novel odds) and ignores a lone outlier, but relaxes the IP
invariant — a single proprietary span diluted in a large general request can
reach Claude. A small, concentrated proprietary request keeps k=1 and stays
protected. Opt-in per deployment; the dev overlay uses 0.15.
count_tokens must not leak
Claude Code calls /v1/messages/count_tokens frequently to manage its
context window. The naive implementation proxies it to Anthropic — but
that ships the full message content (your code) to Anthropic even for
novel prompts, violating the invariant before a single completion is
routed.
Decision: the router answers count_tokens locally (a tokenizer
or a calibrated estimate) and never forwards it to Anthropic. An
approximate count is acceptable for context management; correctness of
the invariant is not negotiable.
Per-session routing stability
Once the first turn is routed, Claude Code's follow-ups are tool results with no new user message, so routing stays put for the session — provided we adopt whole-payload classification above (otherwise a later tool-result-only turn has nothing to classify and could drift). This is a reason to make the classification cover the whole payload rather than re-deriving per turn.
ScalarLM has to drive the loop
With the invariant correctly enforced, essentially every coding session on your own code routes to ScalarLM (Gemma-class). The entire UX then depends on ScalarLM's agentic tool-use quality and context window. This is a product risk, not a plumbing one — worth validating in isolation (point Claude Code at ScalarLM directly through a thin shim) before investing in the ingress.
Operational
- Cloudflare Access: the router host must be reachable programmatically with the bearer token — a hostname with no interactive Access SSO (that gates the UI), or an Access service token. cloudflared is token-mode, so this is configured in the Cloudflare dashboard.
- Context window: ScalarLM's window must cover Claude Code's contexts or requests truncate/fail.
- Non-first-party host: Claude Code disables some MCP tool-search
behavior against non-Anthropic hosts unless
ENABLE_TOOL_SEARCH=true.
Phased plan
- Phase 0 — de-risk (no router change): point Claude Code at ScalarLM directly via a thin Anthropic-shim (or LiteLLM) and judge whether Gemma-class can drive the agent loop. Make/break before building.
- Phase 1 — ingress spike: add
/v1/messagesas a parallel mini-pipeline (don't refactor the OpenAI core yet) that classifies over the whole payload and only passes through to Claude. Point Claude Code at it; prove auth + streaming + tools + thinking end-to-end against the real router. - Phase 2 — canonical IR refactor: flip the internal format to Anthropic so both ingresses share orchestration; the existing translate functions become the OpenAI-ingress adapters.
- Phase 3 — ScalarLM path:
anthropic_to_openai_request/openai_to_anthropic_response, then streaming translation, thencount_tokens(local), then stripcache_control/thinking.
Alternatives considered
- OpenAI stays internal; Anthropic ingress translates down. Rejected: double-translates the Claude path, losing thinking + caching — the exact fidelity we're adding the ingress to preserve.
- LiteLLM / claude-code-proxy in front, no router change. Fast to stand up and good for Phase 0, but the Claude path still double-translates (Anthropic→OpenAI at the proxy, OpenAI→Anthropic in the Claude backend). Acceptable as a spike, not as the product.
- Anthropic ingress as a permanent self-contained mini-pipeline. Avoids the core refactor but duplicates orchestration (auth, classify, decide, audit, streaming plumbing) across two ingresses. Fine for Phase 1; the canonical-IR refactor (Phase 2) removes the duplication.
Open questions
- Whole-payload classification: union vs. max-over-chunks vs. sliding window — and the latency/cost budget for classifying large agent payloads on every request.
count_tokens: which tokenizer do we count with, and how close does it need to be to Claude's for Claude Code's context management to behave?- Do we expose model aliases (
opus,sonnet) and map them torouter-auto/ forced backends, or onlyrouter-auto? - Streaming: do we need true incremental tool-argument streaming on the Claude path, or is the current "tool_calls in one delta at end-of-turn" acceptable for Claude Code?
Testing
- Unit: the new translate functions (request + response + streaming), with
tool calling, mirroring
tests/test_translate.py. - Unit: whole-payload classifier extraction (proprietary content in tool results / system → routes to ScalarLM).
- Unit:
count_tokensnever performs network I/O. - Integration: a recorded Claude Code session replayed against the ingress (both backends), asserting Anthropic-shaped responses and that no novel content egresses to Anthropic.