Auto prompt-caching for OpenAI clients on the Claude path — design

Status: Implemented (phases 1 + 2). translate.add_cache_breakpoints injects tools + system breakpoints on the OpenAI→Claude path (ClaudeBackend.complete / .stream) when ROUTER_CACHE_CLAUDE_PROMPT is on (default), and — when ROUTER_CACHE_CLAUDE_CONVERSATION is on (opt-in) — a conversation-prefix breakpoint on the last message. cache_read/cache_creation are folded into the OpenAI usage (with prompt_tokens_details.cached_tokens), and cache-read tokens are recorded in the audit log (cached_tokens) and shown in the UI request detail. The Anthropic ingress is left untouched (client-controlled).

Motivation

OpenAI's Chat Completions wire format has no cache_control field, so an OpenAI client routed to Claude through split-brain can't ask for Anthropic prompt caching. Today the /v1/chat/completions → Claude path (translate.openai_to_anthropic_request) forwards no cache breakpoints, so every request re-pays full input cost for a stable system prompt + tool definitions that don't change across an agent's turns.

This design has split-brain inject cache breakpoints on the Anthropic request it builds for the Claude backend, so OpenAI clients get caching for free. The Anthropic ingress (/v1/messages) is untouched — those clients (e.g. Claude Code) set cache_control themselves and we already pass it through.

Scope

In: OpenAI ingress → Claude backend only (ClaudeBackend.complete / .stream, which call openai_to_anthropic_request).
Out: the Anthropic ingress (client-controlled), and the ScalarLM path (caching is Anthropic-specific and is already stripped there).

Background: how Anthropic caching works (the bits that constrain us)

A cache_control: {"type": "ephemeral"} marker on a content block declares a cache breakpoint: the whole prompt prefix up to and including that block is cached. The cacheable prefix order is tools → system → messages.
Up to 4 breakpoints per request.
Minimum cacheable prefix: 1024 tokens (Opus/Sonnet), 2048 (Haiku). Below that, no cache is created — but the marker is harmless (no error).
TTL: ephemeral is 5 min by default; 1 h is available ({"type": "ephemeral", "ttl": "1h"}, beta header).
Cost: a cache write costs +25% (5 m) / +100% (1 h) over base input tokens; a cache read costs 10% of base. So caching only saves money when the prefix is reused within the TTL; a one-off large prompt pays the write premium for nothing. The 1024-token floor means small prompts are never cached (no premium, no benefit).
The cached prefix must be byte-identical across requests to hit, and the cache key includes the model. We already pin the model (ANTHROPIC_MODEL), and we inject breakpoints deterministically, so repeated identical prefixes hit.

Where we put breakpoints

For an OpenAI-ingress request, the stable, high-value prefix is the system prompt and the tool definitions. We add up to three breakpoints (well under the 4 limit), most-stable-first:

Tools — cache_control on the last tool. Caches the whole tool array. Only added when tools is present.
System — convert the (string) system into a single text block and mark it: json "system": [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}] Because system follows tools in the prefix, this breakpoint caches tools + system together; the separate tools breakpoint above only matters when the system text changes but tools don't.
Conversation prefix (optional, phase 2) — for multi-turn / agentic OpenAI clients, mark the last content block of the message before the final user turn, so the growing history is cached and only the newest turn is uncached. Requires converting that message's content to block form. Bigger win for agents, more fiddly; gated behind its own flag.

Phase 1 ships (1) + (2): the system+tools prefix, which is the dominant, always-present cost in agent and RAG workloads.

Implementation sketch

A pure helper in translate.py:

def add_cache_breakpoints(req: dict[str, Any], *, ttl: str = "5m",
                          cache_conversation: bool = False) -> dict[str, Any]:
    cc = {"type": "ephemeral"} if ttl == "5m" else {"type": "ephemeral", "ttl": ttl}
    tools = req.get("tools")
    if isinstance(tools, list) and tools:
        tools[-1] = {**tools[-1], "cache_control": cc}
    system = req.get("system")
    if isinstance(system, str) and system:
        req["system"] = [{"type": "text", "text": system, "cache_control": cc}]
    elif isinstance(system, list) and system:
        system[-1] = {**system[-1], "cache_control": cc}
    if cache_conversation:
        ...  # phase 2: breakpoint on the block before the final user turn
    return req

Applied in the Claude backend's OpenAI path only:

# ClaudeBackend.complete / .stream, after openai_to_anthropic_request:
anthropic_req = translate.openai_to_anthropic_request(request)
anthropic_req["model"] = self.model_id
if self._cache_prompt:
    translate.add_cache_breakpoints(anthropic_req, ttl=self._cache_ttl)

complete_messages / stream_messages (Anthropic ingress) are left as-is.

Config

Setting	Env	Default	Notes
enable	`ROUTER_CACHE_CLAUDE_PROMPT`	`true`	Inject system+tools breakpoints on the OpenAI→Claude path.
ttl	`ROUTER_CACHE_CLAUDE_TTL`	`5m`	`5m` or `1h`.
conversation	`ROUTER_CACHE_CLAUDE_CONVERSATION`	`false`	Phase 2: also cache the history prefix.

Plumbed through RouterConfig → ClaudeBackend(cache_prompt=…, cache_ttl=…) in app.py.

Default = on. Rationale: the floor (1024 tokens) means trivial prompts are never cached, so the +25% write premium only ever touches large prefixes — exactly the ones agent/RAG clients reuse. A single kill switch covers the rare workload of large, never-repeated prompts.

Usage reporting

Anthropic's response usage gains cache_creation_input_tokens and cache_read_input_tokens (and input_tokens then excludes the cached portion). Two touch-points:

OpenAI response (anthropic_to_openai_response): surface the read hits as prompt_tokens_details.cached_tokens (the OpenAI shape), so OpenAI clients can see caching working. Keep prompt_tokens = input_tokens + cache_read + cache_creation so totals stay honest (today we map only input_tokens).
Audit log / headers (optional): record cache read/creation tokens so the savings are visible in the UI request explorer — e.g. a Split-Brain-Cache: read=1234,write=0 header.

Edge cases

System is a string today → must become a one-element block list to carry cache_control. Anthropic accepts both forms; downstream is unaffected.
Below the min length → marker is a no-op; safe to always inject.
No system and no tools → nothing to mark; helper is a no-op.
Breakpoint budget → at most 3 here, within the 4 limit.
Determinism → we mutate out deterministically; identical system/tools across an agent's turns produce identical cached prefixes, so reads hit.
Streaming → no change; injection is in request construction, which both complete and stream share. Cache usage shows up in the final message event.

Testing

Unit (test_translate.py): string system → block + cache_control; list system → marker on last block; tools → marker on last tool; empty cases no-op; ttl=1h shape; ≤4 breakpoints.
Backend (test_backends.py): ClaudeBackend.complete with cache_prompt=True sends cache_control to the SDK; complete_messages (Anthropic ingress) never injects; flag off → no injection.
Response: cached_tokens surfaced in prompt_tokens_details.

Phasing

Phase 1: add_cache_breakpoints (tools + system), config flag, wire into ClaudeBackend, usage passthrough. Default on.
Phase 2: conversation-prefix breakpoint for multi-turn OpenAI clients; cached_tokens in the audit log + UI.