Difficulty-based model tiering — design

Status: Approved, Phase 1 in progress. Adds a second routing stage that, for traffic already cleared to a non-proprietary frontier model, picks which model by task difficulty and whether progress is stuck: a cheap/fast model by default, the smartest model when the task needs it or the agent is looping. The fast tier runs on Cerebras (low latency); MiniMax 2.5 was the intended model, but the account currently serves gpt-oss-120b instead.

Decisions (locked):

- Opt-in: per-token routing mode; the default is tier-auto (general traffic is tiered unless a token opts out to auto/claude/scalarlm).
- Base tier: fast. Start cheap/fast, escalate up.
- Fast tier: Cerebras (OpenAI-compatible, ~3000 tok/s). MiniMax 2.5 was the intent, but this Cerebras account serves only gpt-oss-120b and zai-glm-4.7, so the fast tier is gpt-oss-120b (swap via settings if a MiniMax-capable endpoint becomes available).
- Scorer: heuristic first (router-side, no training); ML head is Phase 2.
- Labeling in the UI ships in Phase 1, to seed the Phase-3 flywheel.
- Fastlane fallback: a fast-tier backend error (bad model / Cerebras outage) falls back to Claude at request time. This is IP-safe (both tiers are general), so the fast tier can never break general traffic.

Where this fits (and the IP invariant)

split-brain already has one routing decision — novelty: general → a frontier model (Claude today); novel/uncertain → ScalarLM (see classifier.md, router.md). Tiering is a sub-decision of the general branch only:

request ─▶ novelty classifier (max p_novel over spans)
             ├─ novel / uncertain ─▶ ScalarLM            (unchanged; IP-safe)
             └─ general ───────────▶ TIER SELECTOR
                                       ├─ fast      ─▶ Cerebras / MiniMax host (external)
                                       ├─ balanced  ─▶ Claude Sonnet
                                       └─ deep      ─▶ Claude Opus

Critical: every tier on the general branch is an external model (Claude, Cerebras, etc.), so it may only ever receive content the novelty classifier labeled general. The fast tier is not an IP escape hatch — it sits behind the same novelty gate. Novel content never reaches any of these models; it still goes to ScalarLM. Tiering changes which general model, never whether proprietary content leaves.

The model ladder (tiers)

Tiers are configuration, not code — a named ladder mapping tier → (backend, model), admin-tunable at runtime the same way the default model is (see default-model.md; this generalizes settings.json):

{
  "model_tiers": {
    "fast":     { "backend": "fastlane", "model": "gpt-oss-120b" },
    "balanced": { "backend": "claude",   "model": "claude-sonnet-4-6" },
    "deep":     { "backend": "claude",   "model": "claude-opus-4-8" }
  },
  "tier_policy": { "base": "fast", "escalate": "deep",
                   "difficulty_tau": 0.6, "stuck_tau": 0.5 }
}

The ladder is ordered (fast < balanced < deep). An operator can add/rename tiers, repoint a tier at a different model, or collapse the ladder to two tiers without a code change.
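
A minimal sketch of how the router might resolve a tier against this ladder, assuming the settings have already been parsed into a dict. The helper names `resolve_tier` and `higher_tier` are hypothetical and not taken from the codebase, and the explicit ordering list is an assumption that mirrors fast < balanced < deep.

```python
import json
from typing import Dict, List, Tuple

# Assumption: the ladder order is kept as an explicit list (fast < balanced < deep).
TIER_ORDER: List[str] = ["fast", "balanced", "deep"]

SETTINGS = json.loads("""
{
  "model_tiers": {
    "fast":     { "backend": "fastlane", "model": "gpt-oss-120b" },
    "balanced": { "backend": "claude",   "model": "claude-sonnet-4-6" },
    "deep":     { "backend": "claude",   "model": "claude-opus-4-8" }
  }
}
""")


def resolve_tier(settings: Dict, tier: str) -> Tuple[str, str]:
    """Map a tier name to (backend, model). Hypothetical helper."""
    try:
        entry = settings["model_tiers"][tier]
    except KeyError:
        raise ValueError(f"tier {tier!r} is not defined in model_tiers") from None
    return entry["backend"], entry["model"]


def higher_tier(a: str, b: str, order: List[str] = TIER_ORDER) -> str:
    """Return whichever of the two tiers sits higher on the ladder."""
    return max(a, b, key=order.index)
```

Because the lookup goes through the settings dict, repointing a tier at a different model only changes data, which is the property the ladder is meant to have.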

The fast tier and "MiniMax 2.5 on Cerebras"

The fast tier runs on Cerebras's OpenAI-compatible endpoint (https://api.cerebras.ai/v1) — wafer-scale, so very low latency (~3000 tok/s). MiniMax 2.5 was the intent, but a live check of the account's /v1/models returned only gpt-oss-120b and zai-glm-4.7, so the fast tier uses gpt-oss-120b. The backend is the same shape as the ScalarLM backend (base URL + API key + model id) and stays provider-agnostic — configured via FASTLANE_BASE_URL / FASTLANE_API_KEY / FASTLANE_MODEL, it repoints at any OpenAI-compatible host (or a MiniMax-serving provider) without code changes.
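
As an illustration of the "base URL + API key + model id" shape, the sketch below loads the three FASTLANE_* variables into a small config object. The class and function names are hypothetical, the defaults are assumptions drawn from the values quoted above, and the `/chat/completions` path is the standard OpenAI-compatible route rather than something confirmed from the router code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FastlaneConfig:
    """Hypothetical container for the fast-tier connection settings."""
    base_url: str
    api_key: str
    model: str


def load_fastlane_config(env: Mapping[str, str] = os.environ) -> FastlaneConfig:
    """Read FASTLANE_* settings; the defaults here are assumptions."""
    return FastlaneConfig(
        base_url=env.get("FASTLANE_BASE_URL", "https://api.cerebras.ai/v1").rstrip("/"),
        api_key=env["FASTLANE_API_KEY"],  # required: fail fast if missing
        model=env.get("FASTLANE_MODEL", "gpt-oss-120b"),
    )


def chat_completions_url(config: FastlaneConfig) -> str:
    """OpenAI-compatible hosts expose chat at <base_url>/chat/completions."""
    return f"{config.base_url}/chat/completions"
```

Pointing the fast tier at another OpenAI-compatible host would then only require changing the environment, for example setting FASTLANE_BASE_URL and FASTLANE_MODEL to that provider's values.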

Robustness: if the fast backend errors (bad model id, Cerebras outage, rate limit), the router falls back to Claude (balanced) at request time rather than 502-ing — safe because both are general/external. So a flaky fast tier degrades latency, never availability.
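
The fallback can be sketched roughly as follows. `BackendError` and `call_with_fallback` are hypothetical names, and the backends are modeled as plain callables; the real router's error types and call signatures may differ.

```python
from typing import Callable, Dict, Tuple


class BackendError(Exception):
    """Hypothetical error for a failed backend call (bad model id, outage, rate limit)."""


Backend = Callable[[Dict], Dict]


def call_with_fallback(
    request: Dict,
    primary: Tuple[str, Backend],
    fallback: Tuple[str, Backend],
) -> Tuple[str, Dict]:
    """Try the primary tier; on a backend error, retry once on the fallback tier.

    Returns (tier_actually_used, response) so the audit log can record the truth.
    This is only safe because both tiers sit on the general branch, behind the
    novelty gate.
    """
    tier, call = primary
    try:
        return tier, call(request)
    except BackendError:
        fallback_tier, fallback_call = fallback
        return fallback_tier, fallback_call(request)


def flaky_fast(request: Dict) -> Dict:
    raise BackendError("model not found")


def claude_balanced(request: Dict) -> Dict:
    return {"text": "ok"}


used_tier, response = call_with_fallback(
    {"messages": []}, ("fast", flaky_fast), ("balanced", claude_balanced)
)
```

Returning the tier that actually served the request keeps the audit trail honest when a fallback fires. Errors from the fallback itself are left to propagate, since there is no further tier to try in this sketch.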

Verify before relying on it for agentic traffic: the fast model must handle the tool-use wire format for Claude Code's agentic turns. If it doesn't, scope the fast tier to plain-chat general turns and start agentic general turns at balanced (see Open questions).

The signals (what "difficulty" and "stuck" mean)

Two independent scores, both computable statelessly from the request payload (agentic clients resend the full conversation each turn, so the recent history — including failures — is already in front of us):

1. Task-intrinsic difficulty — "the task requires it"

How hard the work is, independent of progress:

Client hints (free, high-signal). Claude Code already sets extended thinking and a max_tokens budget when it judges a task hard — a strong, zero-cost difficulty prior. A large thinking.budget_tokens ⇒ deep.
Content features. Code vs prose, math/proofs, multi-file/large context, many tools defined, explicit "carefully / think hard / prove" language.
Length / context size of the active spans.
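
A heuristic along these lines could combine the signals above into one score. This is a sketch only: every weight and cutoff is an illustrative assumption rather than a tuned value, the function names are hypothetical, and the payload is assumed to follow the Anthropic Messages request shape (`thinking.budget_tokens`, `tools`, `messages`).

```python
import re
from typing import Dict

# All weights and cutoffs below are illustrative assumptions, not tuned values.
HARD_LANGUAGE = re.compile(r"\b(carefully|think hard|prove)\b", re.IGNORECASE)
BIG_THINKING_BUDGET = 8000    # tokens
MANY_TOOLS = 10
LONG_CONTEXT_CHARS = 20000


def message_text(message: Dict) -> str:
    """Flatten an Anthropic-style message (string or list of blocks) to text."""
    content = message.get("content", "")
    if isinstance(content, str):
        return content
    return " ".join(
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    )


def difficulty_score(payload: Dict) -> float:
    """Score task-intrinsic difficulty in [0, 1] from the request payload alone."""
    score = 0.0
    thinking = payload.get("thinking") or {}
    if thinking.get("budget_tokens", 0) >= BIG_THINKING_BUDGET:
        score += 0.6  # client hint: treated as the strongest prior
    if len(payload.get("tools") or []) >= MANY_TOOLS:
        score += 0.2
    text = " ".join(message_text(m) for m in payload.get("messages", []))
    if HARD_LANGUAGE.search(text):
        score += 0.2
    if len(text) >= LONG_CONTEXT_CHARS:
        score += 0.2
    return min(score, 1.0)
```

With these assumed weights, a large thinking budget alone reaches a `difficulty_tau` of 0.6, which matches the "large thinking.budget_tokens ⇒ deep" rule; the other features only tip the balance in combination.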

2. Stuck / escalation — "progress is stuck"

Computed over the last K turns of the resent conversation:

Repeated failures. Extract an error signature from each recent tool_result (non-zero exit, Traceback, FAILED, Error:, assertion, "command not found"). The same signature recurring across ≥ N of the last M turns ⇒ the agent is looping on one failure ⇒ escalate.
Low action diversity. Near-duplicate tool calls repeated with no new successful results ⇒ stuck.
User-retry / frustration phrases in recent user turns ("still failing", "that didn't work", "try again", "no, that's wrong").
Depth without resolution. Many turns + recent errors + no recent success.

Because the failure history stays in the resent context while the task remains stuck, the stuck score persists naturally — no server-side session state is required, and hysteresis emerges from the data. (An optional light cache keyed by a conversation hash can dampen flapping and skip recompute, but is not needed for correctness.)
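
The repeated-failure signal could be sketched as below. It assumes Anthropic-style `tool_result` blocks inside the resent messages; the regex, the normalization, and the defaults for K and N are illustrative assumptions, and the other stuck signals (action diversity, retry phrases, depth) are left out for brevity.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Assumption: these patterns approximate the error markers listed above.
ERROR_PATTERNS = re.compile(
    r"Traceback|FAILED|Error:|AssertionError|command not found"
)


def tool_result_text(block: Dict) -> str:
    """Flatten a tool_result block whose content is a string or a list of parts."""
    content = block.get("content", "")
    if isinstance(content, str):
        return content
    return "\n".join(p.get("text", "") for p in content if isinstance(p, dict))


def error_signature(text: str) -> Optional[str]:
    """Return a normalized form of the last error-looking line, or None."""
    matches = [line.strip() for line in text.splitlines() if ERROR_PATTERNS.search(line)]
    if not matches:
        return None
    # Replace hex addresses and numbers so line numbers do not split signatures.
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "N", matches[-1])[:120]


def stuck_score(messages: List[Dict], k: int = 6, n: int = 3) -> float:
    """Score in [0, 1]: how often the most common error recurs in the last k messages."""
    signatures = []
    for message in messages[-k:]:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_result":
                signature = error_signature(tool_result_text(block))
                if signature is not None:
                    signatures.append(signature)
    if not signatures:
        return 0.0
    top_count = Counter(signatures).most_common(1)[0][1]
    return min(top_count / n, 1.0)
```

Taking the last matching line rather than the first keeps a Python traceback from collapsing to its generic header, so two different exceptions produce two different signatures.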

Decision policy

tier = base                       # tier_policy.base: "fast" per the locked decision ("balanced" for a conservative posture)
if difficulty >= difficulty_tau:  tier = max(tier, escalate)   # deep
if stuck      >= stuck_tau:       tier = max(tier, escalate)   # deep
if client requested big thinking: tier = max(tier, escalate)
resolve tier -> (backend, model)  # from model_tiers

Escalate-biased, monotone within a task. It's cheap to over-spend on a genuinely-easy turn but expensive to under-serve a stuck one, so the policy only ever ratchets up on the stuck/difficulty signals; it doesn't yank a mid-task agent back down to the fast tier on one quiet turn. Mid-session model switching has a real consistency cost (different model styles), so we change models rarely and deliberately.
Base = fast (the locked decision): start on the low-latency fast tier (gpt-oss-120b on Cerebras today) and escalate only when difficulty/stuck demands it, maximizing the latency/cost win on the common easy traffic. tier_policy.base keeps this operator-tunable (e.g. to balanced for a more conservative posture).
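
The policy above can be written as a small pure function. This is a sketch under stated assumptions: `select_tier` is a hypothetical name, the ladder order is passed in as a list, and the policy dict uses the `tier_policy` keys from the settings example.

```python
from typing import Dict, List

TIER_ORDER: List[str] = ["fast", "balanced", "deep"]  # assumed ladder order

POLICY = {"base": "fast", "escalate": "deep",
          "difficulty_tau": 0.6, "stuck_tau": 0.5}


def select_tier(
    difficulty: float,
    stuck: float,
    big_thinking: bool,
    policy: Dict = POLICY,
    order: List[str] = TIER_ORDER,
) -> str:
    """Start at the base tier and ratchet up if any escalation signal fires."""
    tier = policy["base"]
    should_escalate = (
        difficulty >= policy["difficulty_tau"]
        or stuck >= policy["stuck_tau"]
        or big_thinking
    )
    if should_escalate:
        # max() over ladder position, so a base above `escalate` is never lowered.
        tier = max(tier, policy["escalate"], key=order.index)
    return tier
```

Using `max` over ladder position rather than a plain assignment is what keeps the policy escalate-only: a conservative base that already sits at or above the escalation tier is left untouched.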

How the signal is computed — three options

Heuristic, router-side (recommended v1). Regex/counters over the payload for the signals above; no model, no training, fully transparent and debuggable, ~free on the hot path. Ships the capability immediately.
ML difficulty head on the classifier service (v2). The classifier already MiniLM-encodes the prompt for novelty; add a second head (difficulty, and/or stuck) reusing that encoder — one extra cheap forward pass, retrainable via the existing bootstrap/flywheel. The /classify response gains difficulty / stuck alongside p_novel.
LLM-judge (rejected for the hot path). Asking ScalarLM "how hard is this?" is IP-safe but adds a full model call of latency per request. Keep it offline for generating training labels only.

The flywheel — learning the right tier

Mirror the novelty loop's "outcome, not document" principle (classifier.md). The label we want is "the cheapest tier that actually resolved the task":

Under-served signal. A fast/balanced turn followed (in the resent history of the next turn, or a continuation in the audit log) by the same error recurring, a user retry, or an escalation ⇒ that turn should have been deeper — a training example.
Right-sized signal. A fast turn whose result was accepted and not retried ⇒ fast was sufficient ⇒ confirm.

Operators can also label difficulty/tier in the UI (extend the Labels view), and the audit log already records enough (chosen tier, decision, continuation) to mine these post-hoc. The difficulty head trains on these like the novelty head trains on per-span labels.
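
One way to mine these labels from consecutive audit records is sketched below. The record keys (`tier`, `same_error_recurred`, `user_retry`) are hypothetical stand-ins for whatever the audit log actually stores, and labeling an under-served turn with the next tier up is an assumption; the text only says the turn "should have been deeper".

```python
from typing import Dict, List, Optional

TIER_ORDER: List[str] = ["fast", "balanced", "deep"]  # assumed ladder order


def outcome_label(turn: Dict, next_turn: Optional[Dict]) -> Optional[Dict]:
    """Derive a training label for `turn` from what followed it.

    Both arguments are hypothetical audit-log records.
    """
    if next_turn is None:
        return None  # no continuation yet, so no outcome to learn from
    rank = TIER_ORDER.index
    escalated = rank(next_turn["tier"]) > rank(turn["tier"])
    under_served = (
        next_turn["same_error_recurred"] or next_turn["user_retry"] or escalated
    )
    if under_served:
        if turn["tier"] == TIER_ORDER[-1]:
            return None  # already at the top tier; nothing deeper to point at
        target = TIER_ORDER[rank(turn["tier"]) + 1]
        return {"label": "under_served", "min_tier": target}
    # Accepted and not retried: the chosen tier was sufficient.
    return {"label": "right_sized", "min_tier": turn["tier"]}
```

Returning `None` for turns without a continuation, or for failures at the top tier, keeps ambiguous cases out of the training set instead of guessing a label for them.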

Backends

Claude (existing) already selects an arbitrary claude-* model per call (backends/claude.py _effective_model + the model_provider from settings.json), so the Claude tiers need no new backend — just different model ids.
Fastlane (new) = an OpenAI-compatible backend, essentially the existing ScalarlmBackend generalized (configurable base_url/api_key/model), reusing translate.py's Anthropic↔OpenAI conversion that the ScalarLM path already uses. On the Anthropic ingress its responses are translated back to Anthropic shape, exactly like the ScalarLM path.

Opt-in / override semantics

Claude Code pins a concrete claude-* model per request, which the router currently honors. Tiering must decide when to override that:

A request/token routing mode (extends the existing per-token routing_mode: auto/claude/scalarlm, see token-routing.md) gains a tier-auto mode (or a model=router-tier alias) that opts into router model selection.
In tier-auto, the novelty gate runs as usual; on the general branch the router picks the tier/model and ignores the client's specific id.
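Default is tier-auto (the locked decision): general traffic is tiered unless a token opts out to auto/claude/scalarlm. A token that opts out keeps today's behavior, so the client's pinned model is still honored for it.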

Headers & audit

For observability and the flywheel, add:

Split-Brain-Tier: fast | balanced | deep response header.
Audit fields: tier, difficulty_score, stuck_score, plus the existing backend_model (already records the model actually used). These let the Request explorer show why a tier was chosen and feed the outcome loop.
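
A sketch of how one tier decision might be turned into the response header and the audit fields together. The function name and the rounding are assumptions; only the header name and the field names come from the list above.

```python
from typing import Dict, Tuple


def tier_observability(
    tier: str, difficulty: float, stuck: float, backend_model: str
) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Build the response header and audit fields for one tier decision."""
    headers = {"Split-Brain-Tier": tier}
    audit = {
        "tier": tier,
        "difficulty_score": round(difficulty, 3),  # rounding is an assumption
        "stuck_score": round(stuck, 3),
        "backend_model": backend_model,  # the model actually used, after any fallback
    }
    return headers, audit


headers, audit = tier_observability("deep", 0.8125, 0.0, "claude-opus-4-8")
```

Building both from the same call keeps the header and the audit record from drifting apart, which matters when a fast-tier fallback changes the model that was actually used.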

Configuration

Setting	Where	Notes
`model_tiers`, `tier_policy`	`settings.json` (admin-tunable)	ladder + thresholds; runtime-editable like the default model
`FASTLANE_BASE_URL` / `FASTLANE_API_KEY` / `FASTLANE_MODEL`	router env / Secret	the OpenAI-compatible fast tier (Cerebras / MiniMax host)

Open questions

Tool-use parity of the fast model. If the Cerebras/MiniMax fast model doesn't handle the agentic tool-use format well, the fast tier should be restricted to non-agentic (plain-chat) general turns, with agentic general turns starting at balanced. Needs a per-model capability flag.
Mid-task switching. How aggressively to ratchet, and whether to ever de-escalate within a session. Start escalate-only + monotone.
Base tier. fast-first (max savings, more escalations) vs balanced-first (safer). Likely per-token configurable.
Heuristic thresholds vs learned. v1 hand-tuned difficulty_tau / stuck_tau; v2 calibrated from the outcome flywheel.
Cost vs latency accounting. Track per-tier token spend (the daily quota already counts tokens) so operators can see the savings/escalation rate.

Phased plan

Phase 1 — heuristic tiering + fast backend + UI labeling. Add the OpenAI-compatible fastlane backend (Cerebras serving gpt-oss-120b; MiniMax 2.5 was the intent), the model_tiers/tier_policy settings + admin UI, the per-token tier-auto mode as the default, a router-side heuristic difficulty/stuck scorer, the tier/score audit fields + Split-Brain-Tier header, and difficulty/tier labeling in the UI (seeds the Phase-3 flywheel). Ships end-to-end, no training.
Phase 2 — ML difficulty head. Add a difficulty/stuck head to the classifier service (reusing the encoder); /classify returns the scores; router consumes them instead of (or alongside) the heuristic.
Phase 3 — outcome flywheel. Mine the audit log + the Phase-1 UI labels for "cheapest tier that resolved it," retrain the head, calibrate thresholds.