Auto prompt-caching for OpenAI clients on the Claude path — design
Status: Implemented (phases 1 + 2). translate.add_cache_breakpoints
injects tools + system breakpoints on the OpenAI→Claude path
(ClaudeBackend.complete / .stream) when ROUTER_CACHE_CLAUDE_PROMPT
is on (default), and — when ROUTER_CACHE_CLAUDE_CONVERSATION is on
(opt-in) — a conversation-prefix breakpoint on the last message.
cache_read/cache_creation are folded into the OpenAI usage (with
prompt_tokens_details.cached_tokens), and cache-read tokens are recorded
in the audit log (cached_tokens) and shown in the UI request detail.
The Anthropic ingress is left untouched (client-controlled).
Motivation
OpenAI's Chat Completions wire format has no cache_control field, so an
OpenAI client routed to Claude through split-brain can't ask for
Anthropic prompt caching.
Today the /v1/chat/completions → Claude path
(translate.openai_to_anthropic_request) forwards no cache breakpoints,
so every request re-pays full input cost for a stable system prompt +
tool definitions that don't change across an agent's turns.
This design has split-brain inject cache breakpoints on the Anthropic
request it builds for the Claude backend, so OpenAI clients get caching
for free. The Anthropic ingress (/v1/messages) is untouched — those
clients (e.g. Claude Code) set cache_control themselves and we already
pass it through.
Scope
- In: OpenAI ingress → Claude backend only
(
ClaudeBackend.complete/.stream, which callopenai_to_anthropic_request). - Out: the Anthropic ingress (client-controlled), and the ScalarLM path (caching is Anthropic-specific and is already stripped there).
Background: how Anthropic caching works (the bits that constrain us)
- A
cache_control: {"type": "ephemeral"}marker on a content block declares a cache breakpoint: the whole prompt prefix up to and including that block is cached. The cacheable prefix order is tools → system → messages. - Up to 4 breakpoints per request.
- Minimum cacheable prefix: 1024 tokens (Opus/Sonnet), 2048 (Haiku). Below that, no cache is created — but the marker is harmless (no error).
- TTL: ephemeral is 5 min by default; 1 h is available
(
{"type": "ephemeral", "ttl": "1h"}, beta header). - Cost: a cache write costs +25% (5 m) / +100% (1 h) over base input tokens; a cache read costs 10% of base. So caching only saves money when the prefix is reused within the TTL; a one-off large prompt pays the write premium for nothing. The 1024-token floor means small prompts are never cached (no premium, no benefit).
- The cached prefix must be byte-identical across requests to hit, and
the cache key includes the model. We already pin the model
(
ANTHROPIC_MODEL), and we inject breakpoints deterministically, so repeated identical prefixes hit.
Where we put breakpoints
For an OpenAI-ingress request, the stable, high-value prefix is the system prompt and the tool definitions. We add up to three breakpoints (well under the 4 limit), most-stable-first:
- Tools —
cache_controlon the last tool. Caches the whole tool array. Only added whentoolsis present. - System — convert the (string) system into a single text block and
mark it:
json "system": [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}]Because system follows tools in the prefix, this breakpoint caches tools + system together; the separate tools breakpoint above only matters when the system text changes but tools don't. - Conversation prefix (optional, phase 2) — for multi-turn / agentic OpenAI clients, mark the last content block of the message before the final user turn, so the growing history is cached and only the newest turn is uncached. Requires converting that message's content to block form. Bigger win for agents, more fiddly; gated behind its own flag.
Phase 1 ships (1) + (2): the system+tools prefix, which is the dominant, always-present cost in agent and RAG workloads.
Implementation sketch
A pure helper in translate.py:
def add_cache_breakpoints(req: dict[str, Any], *, ttl: str = "5m",
cache_conversation: bool = False) -> dict[str, Any]:
cc = {"type": "ephemeral"} if ttl == "5m" else {"type": "ephemeral", "ttl": ttl}
tools = req.get("tools")
if isinstance(tools, list) and tools:
tools[-1] = {**tools[-1], "cache_control": cc}
system = req.get("system")
if isinstance(system, str) and system:
req["system"] = [{"type": "text", "text": system, "cache_control": cc}]
elif isinstance(system, list) and system:
system[-1] = {**system[-1], "cache_control": cc}
if cache_conversation:
... # phase 2: breakpoint on the block before the final user turn
return req
Applied in the Claude backend's OpenAI path only:
# ClaudeBackend.complete / .stream, after openai_to_anthropic_request:
anthropic_req = translate.openai_to_anthropic_request(request)
anthropic_req["model"] = self.model_id
if self._cache_prompt:
translate.add_cache_breakpoints(anthropic_req, ttl=self._cache_ttl)
complete_messages / stream_messages (Anthropic ingress) are left as-is.
Config
| Setting | Env | Default | Notes |
|---|---|---|---|
| enable | ROUTER_CACHE_CLAUDE_PROMPT |
true |
Inject system+tools breakpoints on the OpenAI→Claude path. |
| ttl | ROUTER_CACHE_CLAUDE_TTL |
5m |
5m or 1h. |
| conversation | ROUTER_CACHE_CLAUDE_CONVERSATION |
false |
Phase 2: also cache the history prefix. |
Plumbed through RouterConfig → ClaudeBackend(cache_prompt=…, cache_ttl=…)
in app.py.
Default = on. Rationale: the floor (1024 tokens) means trivial prompts are never cached, so the +25% write premium only ever touches large prefixes — exactly the ones agent/RAG clients reuse. A single kill switch covers the rare workload of large, never-repeated prompts.
Usage reporting
Anthropic's response usage gains cache_creation_input_tokens and
cache_read_input_tokens (and input_tokens then excludes the cached
portion). Two touch-points:
- OpenAI response (
anthropic_to_openai_response): surface the read hits asprompt_tokens_details.cached_tokens(the OpenAI shape), so OpenAI clients can see caching working. Keepprompt_tokens = input_tokens + cache_read + cache_creationso totals stay honest (today we map onlyinput_tokens). - Audit log / headers (optional): record cache read/creation tokens so
the savings are visible in the UI request explorer — e.g. a
Split-Brain-Cache: read=1234,write=0header.
Edge cases
- System is a string today → must become a one-element block list to
carry
cache_control. Anthropic accepts both forms; downstream is unaffected. - Below the min length → marker is a no-op; safe to always inject.
- No system and no tools → nothing to mark; helper is a no-op.
- Breakpoint budget → at most 3 here, within the 4 limit.
- Determinism → we mutate
outdeterministically; identical system/tools across an agent's turns produce identical cached prefixes, so reads hit. - Streaming → no change; injection is in request construction, which
both
completeandstreamshare. Cache usage shows up in the final message event.
Testing
- Unit (
test_translate.py): string system → block +cache_control; list system → marker on last block; tools → marker on last tool; empty cases no-op; ttl=1h shape; ≤4 breakpoints. - Backend (
test_backends.py):ClaudeBackend.completewithcache_prompt=Truesendscache_controlto the SDK;complete_messages(Anthropic ingress) never injects; flag off → no injection. - Response:
cached_tokenssurfaced inprompt_tokens_details.
Phasing
- Phase 1:
add_cache_breakpoints(tools + system), config flag, wire intoClaudeBackend, usage passthrough. Default on. - Phase 2: conversation-prefix breakpoint for multi-turn OpenAI
clients;
cached_tokensin the audit log + UI.