Difficulty-based model tiering — design
Status: Approved, Phase 1 in progress.
Adds a second routing stage that, for traffic already cleared to a non-proprietary frontier model, picks which model to use by task difficulty and by whether progress is stuck: a cheap/fast model by default, the smartest model when the task needs it or the agent is looping. The fast tier runs on Cerebras (low latency); MiniMax 2.5 was the intended model, but the account currently serves gpt-oss-120b instead.
Decisions (locked):
- Routing mode: per-token, defaulting to tier-auto, so general traffic is
tiered unless a token opts out to auto/claude/scalarlm.
- Base tier: fast — start cheap/fast, escalate up.
- Fast tier: Cerebras (OpenAI-compatible, ~3000 tok/s). MiniMax 2.5 was
the intent, but this Cerebras account serves only gpt-oss-120b and
zai-glm-4.7 — so the fast tier is gpt-oss-120b (swap via settings
if a MiniMax-capable endpoint becomes available).
- Scorer: heuristic first (router-side, no training); ML head is Phase 2.
- Labeling in the UI ships in Phase 1, to seed the Phase-3 flywheel.
- Fastlane fallback: a fast-tier backend error (bad model / Cerebras
outage) falls back to Claude at request time — IP-safe (both general),
so the fast tier can never break general traffic.
Where this fits (and the IP invariant)
split-brain already has one routing decision — novelty: general → a frontier model (Claude today); novel/uncertain → ScalarLM (see classifier.md, router.md). Tiering is a sub-decision of the general branch only:
    request ─▶ novelty classifier (max p_novel over spans)
       ├─ novel / uncertain ─▶ ScalarLM (unchanged; IP-safe)
       └─ general ───────────▶ TIER SELECTOR
                                  ├─ fast     ─▶ Cerebras / MiniMax host (external)
                                  ├─ balanced ─▶ Claude Sonnet
                                  └─ deep     ─▶ Claude Opus
Critical: every tier on the general branch is an external model (Claude, Cerebras, etc.), so it may only ever receive content the novelty classifier labeled general. The fast tier is not an IP escape hatch — it sits behind the same novelty gate. Novel content never reaches any of these models; it still goes to ScalarLM. Tiering changes which general model, never whether proprietary content leaves.
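The ordering this invariant implies can be sketched as follows. This is a minimal illustration, not the router's actual code: the names `route`, `NOVELTY_TAU`, and `select_tier` are hypothetical, the single threshold stands in for the real novel/uncertain logic described in classifier.md, and treating an empty request as uncertain is an assumption. The only point is that the tier selector is unreachable unless the novelty gate has already returned general.

```python
from typing import Callable

NOVELTY_TAU = 0.5  # hypothetical threshold; the real gate lives in the classifier


def route(span_p_novel: list[float], select_tier: Callable[[], str]) -> str:
    """Return the destination for a request: 'scalarlm' or a general tier name."""
    # Stage 1: novelty gate over all spans (max p_novel, as in the diagram).
    # Assumption: a request with no scored spans is treated as uncertain.
    if not span_p_novel or max(span_p_novel) >= NOVELTY_TAU:
        return "scalarlm"
    # Stage 2: only content labeled general ever reaches the tier selector.
    return select_tier()
```

Because `select_tier` is only invoked on the general branch, swapping or adding tiers cannot widen what leaves the novelty gate.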
The model ladder (tiers)
Tiers are configuration, not code — a named ladder mapping tier → (backend,
model), admin-tunable at runtime the same way the default model is (see
default-model.md; this generalizes settings.json):
    {
      "model_tiers": {
        "fast":     { "backend": "fastlane", "model": "gpt-oss-120b" },
        "balanced": { "backend": "claude",   "model": "claude-sonnet-4-6" },
        "deep":     { "backend": "claude",   "model": "claude-opus-4-8" }
      },
      "tier_policy": { "base": "fast", "escalate": "deep",
                       "difficulty_tau": 0.6, "stuck_tau": 0.5 }
    }
The ladder is ordered (fast < balanced < deep). An operator can add/rename
tiers, repoint a tier at a different model, or collapse the ladder to two
tiers without a code change.
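As a rough sketch of how a router might consume this ladder, the snippet below resolves a tier name to its (backend, model) pair and compares two tiers by ladder position. The helper names (`ladder`, `higher`, `resolve`) are hypothetical, and it assumes ladder order is simply the key order of `model_tiers`; the design does not say how order is encoded.

```python
import json

SETTINGS = json.loads("""
{
  "model_tiers": {
    "fast":     {"backend": "fastlane", "model": "gpt-oss-120b"},
    "balanced": {"backend": "claude",   "model": "claude-sonnet-4-6"},
    "deep":     {"backend": "claude",   "model": "claude-opus-4-8"}
  }
}
""")


def ladder(settings: dict) -> list[str]:
    """Tier names from lowest to highest (assumed: key order of model_tiers)."""
    return list(settings["model_tiers"])


def higher(settings: dict, a: str, b: str) -> str:
    """Return whichever of two tiers sits higher on the ladder."""
    order = ladder(settings)
    return a if order.index(a) >= order.index(b) else b


def resolve(settings: dict, tier: str) -> tuple[str, str]:
    """Map a tier name to its (backend, model) pair."""
    entry = settings["model_tiers"][tier]
    return entry["backend"], entry["model"]
```

Collapsing the ladder to two tiers is then just deleting a key from the settings; none of the helpers hard-code tier names.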
The fast tier and "MiniMax 2.5 on Cerebras"
The fast tier runs on Cerebras's OpenAI-compatible endpoint
(https://api.cerebras.ai/v1) — wafer-scale, so very low latency (~3000
tok/s). MiniMax 2.5 was the intent, but a live check of the account's
/v1/models returned only gpt-oss-120b and zai-glm-4.7, so the fast tier
uses gpt-oss-120b. The backend is the same shape as the ScalarLM backend
(base URL + API key + model id) and stays provider-agnostic — configured via
FASTLANE_BASE_URL / FASTLANE_API_KEY / FASTLANE_MODEL, it repoints at any
OpenAI-compatible host (or a MiniMax-serving provider) without code changes.
Robustness: if the fast backend errors (bad model id, Cerebras outage, rate limit), the router falls back to Claude (balanced) at request time rather than 502-ing — safe because both are general/external. So a flaky fast tier degrades latency, never availability.
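A minimal sketch of that fallback, assuming a hypothetical `BackendError` exception and two backend callables; the real backends and their error types may look different.

```python
from typing import Callable


class BackendError(Exception):
    """Hypothetical error for a failed backend call (bad model id, outage, rate limit)."""


def call_with_fallback(
    call_fast: Callable[[dict], dict],
    call_balanced: Callable[[dict], dict],
    request: dict,
) -> tuple[str, dict]:
    """Try the fast tier; on a backend error, retry once on balanced.

    Returns (tier_used, response). Both callables are general/external
    backends, so the request has already passed the novelty gate.
    """
    try:
        return "fast", call_fast(request)
    except BackendError:
        # Degrade latency, not availability: serve from Claude instead of a 502.
        return "balanced", call_balanced(request)
```

Returning the tier actually used lets the caller record the fallback in the audit log rather than reporting the tier that was originally selected.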
Verify before relying on it for agentic traffic: the fast model must
handle the tool-use wire format for Claude Code's agentic turns. If it
doesn't, scope the fast tier to plain-chat general turns and start agentic
general turns at balanced (see Open questions).
The signals (what "difficulty" and "stuck" mean)
Two independent scores, both computable statelessly from the request payload (agentic clients resend the full conversation each turn, so the recent history — including failures — is already in front of us):
1. Task-intrinsic difficulty — "the task requires it"
How hard the work is, independent of progress:
- Client hints (free, high-signal). Claude Code already sets extended thinking and a max_tokens budget when it judges a task hard, which is a strong, zero-cost difficulty prior. A large thinking.budget_tokens ⇒ deep.
- Content features. Code vs prose, math/proofs, multi-file/large context, many tools defined, explicit "carefully / think hard / prove" language.
- Length / context size of the active spans.
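The signals above could be combined by a router-side heuristic along these lines. This is a sketch only: the payload is assumed to follow the Anthropic Messages shape (`thinking.budget_tokens`, `tools`, `messages`), and every weight, cutoff, and regex here is an illustrative guess rather than a tuned value.

```python
import re

HARD_WORDS = re.compile(r"\b(carefully|think hard|prove|proof)\b", re.IGNORECASE)
CODE_HINT = re.compile(r"\bdef \w+\(|\bclass \w+|#include\b")


def _text(content) -> str:
    """Flatten Anthropic-style content (a string or a list of blocks) to text."""
    if isinstance(content, str):
        return content
    return "\n".join(b.get("text", "") for b in content if isinstance(b, dict))


def difficulty_score(payload: dict) -> float:
    """Heuristic difficulty in [0, 1]. All weights are illustrative guesses."""
    score = 0.0
    thinking = payload.get("thinking") or {}
    budget = thinking.get("budget_tokens", 0)
    if budget >= 8000:        # client hint: large thinking budget
        score += 0.6
    elif budget > 0:
        score += 0.3
    user_text = "\n".join(
        _text(m.get("content", ""))
        for m in payload.get("messages", [])
        if m.get("role") == "user"
    )
    if HARD_WORDS.search(user_text):            # explicit "think hard" language
        score += 0.2
    if CODE_HINT.search(user_text):             # looks like code, not prose
        score += 0.1
    if len(payload.get("tools", [])) >= 10:     # many tools defined
        score += 0.1
    if len(user_text) >= 20000:                 # large active context
        score += 0.2
    return min(score, 1.0)
```

With these example weights a large thinking budget alone reaches the 0.6 `difficulty_tau` from the settings example, which matches the "large budget ⇒ deep" rule.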
2. Stuck / escalation — "progress is stuck"
Computed over the last K turns of the resent conversation:
- Repeated failures. Extract an error signature from each recent tool_result (non-zero exit, Traceback, FAILED, Error:, assertion, "command not found"). The same signature recurring across ≥ N of the last K turns ⇒ the agent is looping on one failure ⇒ escalate.
- Low action diversity. Near-duplicate tool calls repeated with no new successful results ⇒ stuck.
- User-retry / frustration phrases in recent user turns ("still failing", "that didn't work", "try again", "no, that's wrong").
- Depth without resolution. Many turns + recent errors + no recent success.
Because the failure history stays in the resent context while the task remains stuck, the stuck score persists naturally — no server-side session state is required, and hysteresis emerges from the data. (An optional light cache keyed by a conversation hash can dampen flapping and skip recompute, but is not needed for correctness.)
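The repeated-failure signal is the easiest one to sketch. The code below assumes Anthropic-style messages in which tool output arrives as `tool_result` content blocks; the function names, the error regex (a subset of the patterns listed above), the number masking, and the defaults `k=6`, `n=3` are all hypothetical choices for illustration.

```python
import re
from collections import Counter
from typing import Optional

ERROR_PAT = re.compile(r"Traceback|FAILED|Error:|AssertionError|command not found")


def error_signature(text: str) -> Optional[str]:
    """Signature of the last error-looking line in a tool result, or None."""
    hits = [ln.strip() for ln in text.splitlines() if ERROR_PAT.search(ln)]
    if not hits:
        return None
    # Mask hex addresses and numbers so repeats of one failure collapse together.
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "#", hits[-1])[:120]


def tool_result_texts(message: dict) -> list[str]:
    """Collect the text of tool_result blocks from one message."""
    content = message.get("content")
    if not isinstance(content, list):
        return []
    texts = []
    for block in content:
        if isinstance(block, dict) and block.get("type") == "tool_result":
            inner = block.get("content", "")
            if isinstance(inner, list):
                inner = "\n".join(b.get("text", "") for b in inner if isinstance(b, dict))
            texts.append(str(inner))
    return texts


def stuck_score(messages: list[dict], k: int = 6, n: int = 3) -> float:
    """1.0 once one signature recurs in >= n of the last k tool-result turns."""
    recent = [m for m in messages if tool_result_texts(m)][-k:]
    counts: Counter = Counter()
    for m in recent:
        sigs = {error_signature(t) for t in tool_result_texts(m)}
        sigs.discard(None)
        counts.update(sigs)  # each signature counts once per turn
    if not counts:
        return 0.0
    return min(counts.most_common(1)[0][1] / n, 1.0)
```

Since the score is recomputed from the resent history on every request, it stays high for as long as the failing turns remain inside the window, which is the stateless hysteresis described above.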
Decision policy
    tier = base                                                    # tier_policy.base: "fast" by default, "balanced" for a conservative posture
    if difficulty >= difficulty_tau:  tier = max(tier, escalate)   # deep
    if stuck >= stuck_tau:            tier = max(tier, escalate)   # deep
    if client requested big thinking: tier = max(tier, escalate)
    resolve tier -> (backend, model)                               # from model_tiers; max() compares by ladder order
- Escalate-biased, monotone within a task. It's cheap to over-spend on a genuinely-easy turn but expensive to under-serve a stuck one, so the policy only ever ratchets up on the stuck/difficulty signals; it doesn't yank a mid-task agent back down to the fast tier on one quiet turn. Mid-session model switching has a real consistency cost (different model styles), so we change models rarely and deliberately.
- Base = fast (the locked decision): start on the low-latency fast tier and escalate only when difficulty/stuck demands it, maximizing the latency/cost win on the common easy traffic. tier_policy.base keeps this operator-tunable (e.g. to balanced for a more conservative posture).
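A runnable version of the policy, as a sketch: the `LADDER` list and the `max_tier` helper are assumptions about how ladder order is represented, and the scores are taken as already-computed inputs.

```python
LADDER = ["fast", "balanced", "deep"]  # assumed order, lowest to highest


def max_tier(a: str, b: str) -> str:
    """Return the higher of two tiers by ladder position."""
    return a if LADDER.index(a) >= LADDER.index(b) else b


def select_tier(difficulty: float, stuck: float, big_thinking: bool, policy: dict) -> str:
    """Start at the base tier and only ever move up to the escalate tier."""
    tier = policy["base"]
    escalate = policy["escalate"]
    if difficulty >= policy["difficulty_tau"]:
        tier = max_tier(tier, escalate)
    if stuck >= policy["stuck_tau"]:
        tier = max_tier(tier, escalate)
    if big_thinking:  # client requested a large thinking budget
        tier = max_tier(tier, escalate)
    return tier


POLICY = {"base": "fast", "escalate": "deep", "difficulty_tau": 0.6, "stuck_tau": 0.5}
```

Using `max_tier` rather than plain assignment means a misconfigured policy whose `escalate` sits below `base` can never lower the tier.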
How the signal is computed — three options
- Heuristic, router-side (recommended v1). Regex/counters over the payload for the signals above; no model, no training, fully transparent and debuggable, ~free on the hot path. Ships the capability immediately.
- ML difficulty head on the classifier service (v2). The classifier already MiniLM-encodes the prompt for novelty; add a second head (difficulty, and/or stuck) reusing that encoder: one extra cheap forward pass, retrainable via the existing bootstrap/flywheel. The /classify response gains difficulty/stuck alongside p_novel.
- LLM-judge (rejected for the hot path). Asking ScalarLM "how hard is this?" is IP-safe but adds a full model call of latency per request. Keep it offline for generating training labels only.
The flywheel — learning the right tier
Mirror the novelty loop's "outcome, not document" principle (classifier.md). The label we want is "the cheapest tier that actually resolved the task":
- Under-served signal. A fast/balanced turn followed (in the resent history of the next turn, or a continuation in the audit log) by the same error recurring, a user retry, or an escalation ⇒ that turn should have been deeper, which makes it a training example.
- Right-sized signal. A fast turn whose result was accepted and not retried ⇒ fast was sufficient ⇒ confirm.
Operators can also label difficulty/tier in the UI (extend the Labels view), and the audit log already records enough (chosen tier, decision, continuation) to mine these post-hoc. The difficulty head trains on these like the novelty head trains on per-span labels.
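One way the two outcome signals might be mined from consecutive audited turns is sketched below. `TurnRecord` is a hypothetical, simplified view of an audit entry (the real audit schema is not specified here), and "accepted" is approximated as "no error and no user retry".

```python
from dataclasses import dataclass
from typing import Optional

ORDER = {"fast": 0, "balanced": 1, "deep": 2}  # assumed ladder order


@dataclass
class TurnRecord:
    """Hypothetical, simplified view of one audited turn."""
    tier: str                    # tier that served this turn
    error_after: Optional[str]   # error signature seen after this turn, if any


def label_turn(turn: TurnRecord, next_turn: Optional[TurnRecord],
               user_retried: bool) -> Optional[str]:
    """Return 'under_served', 'right_sized', or None when evidence is missing."""
    if next_turn is None:
        return None  # no continuation yet, so no outcome evidence
    if turn.tier in ("fast", "balanced"):
        same_error = (turn.error_after is not None
                      and turn.error_after == next_turn.error_after)
        escalated = ORDER[next_turn.tier] > ORDER[turn.tier]
        if same_error or user_retried or escalated:
            return "under_served"
    if turn.tier == "fast" and turn.error_after is None and not user_retried:
        return "right_sized"
    return None  # ambiguous: leave unlabeled rather than guess
```

Ambiguous turns stay unlabeled, so operator labels from the UI can fill those gaps instead of being overridden by a weak heuristic.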
Backends
- Claude (existing) already selects an arbitrary claude-* model per call (backends/claude.py _effective_model + the model_provider from settings.json), so the Claude tiers need no new backend, just different model ids.
- Fastlane (new) = an OpenAI-compatible backend, essentially the existing ScalarlmBackend generalized (configurable base_url/api_key/model), reusing translate.py's Anthropic↔OpenAI conversion that the ScalarLM path already uses. On the Anthropic ingress its responses are translated back to Anthropic shape, exactly like the ScalarLM path.
Opt-in / override semantics
Claude Code pins a concrete claude-* model per request, which the router
currently honors. Tiering must decide when to override that:
- A request/token routing mode (extends the existing per-token routing_mode: auto/claude/scalarlm, see token-routing.md) gains a tier-auto mode (or a model=router-tier alias) that opts into router model selection.
- In tier-auto, the novelty gate runs as usual; on the general branch the router picks the tier/model and ignores the client's specific id.
- Per the locked decision, tier-auto is the default. A token that opts out to auto/claude/scalarlm keeps today's behavior, where the router honors the client's model.
Headers & audit
For observability and the flywheel, add:
- Split-Brain-Tier: fast | balanced | deep response header.
- Audit fields: tier, difficulty_score, stuck_score, plus the existing backend_model (already records the model actually used). These let the Request explorer show why a tier was chosen and feed the outcome loop.
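For illustration, the header and audit fields for one routed request could be assembled like this. The function name and the rounding to three decimals are assumptions; only the header name and field names come from the list above.

```python
def tier_observability(tier: str, difficulty: float, stuck: float,
                       backend_model: str) -> tuple[dict, dict]:
    """Build the response header and audit fields for one routed request."""
    headers = {"Split-Brain-Tier": tier}
    audit = {
        "tier": tier,
        "difficulty_score": round(difficulty, 3),  # rounding is an arbitrary choice
        "stuck_score": round(stuck, 3),
        "backend_model": backend_model,            # the model actually used
    }
    return headers, audit
```

If a fast-tier fallback happened, passing the tier and model that actually served the request keeps the header and `backend_model` consistent with each other.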
Configuration
| Setting | Where | Notes |
|---|---|---|
| model_tiers, tier_policy | settings.json (admin-tunable) | ladder + thresholds; runtime-editable like the default model |
| FASTLANE_BASE_URL / FASTLANE_API_KEY / FASTLANE_MODEL | router env / Secret | the OpenAI-compatible fast tier (Cerebras / MiniMax host) |
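A sketch of how the three FASTLANE_* variables might be read and turned into an OpenAI-style chat completion request. `FastlaneConfig` and `chat_request` are hypothetical names, not the existing backend's API; the `/chat/completions` path and Bearer authorization follow the usual OpenAI-compatible convention.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FastlaneConfig:
    base_url: str
    api_key: str
    model: str

    @classmethod
    def from_env(cls, env=os.environ) -> "FastlaneConfig":
        """Read the fast-tier settings; raises KeyError if one is missing."""
        return cls(
            base_url=env["FASTLANE_BASE_URL"].rstrip("/"),
            api_key=env["FASTLANE_API_KEY"],
            model=env["FASTLANE_MODEL"],
        )

    def chat_request(self, messages: list[dict]) -> tuple[str, dict, dict]:
        """Return (url, headers, json_body) for an OpenAI-style chat completion."""
        url = f"{self.base_url}/chat/completions"
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        body = {"model": self.model, "messages": messages}
        return url, headers, body
```

Repointing the fast tier at another OpenAI-compatible host is then a change to the three environment values only.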
Open questions
- Tool-use parity of the fast model. If the Cerebras/MiniMax fast model doesn't handle the agentic tool-use format well, the fast tier should be restricted to non-agentic (plain-chat) general turns, with agentic general turns starting at balanced. Needs a per-model capability flag.
- Mid-task switching. How aggressively to ratchet, and whether to ever de-escalate within a session. Start escalate-only + monotone.
- Base tier. fast-first (max savings, more escalations) vs balanced-first (safer). Likely per-token configurable.
- Heuristic thresholds vs learned. v1 hand-tuned difficulty_tau/stuck_tau; v2 calibrated from the outcome flywheel.
- Cost vs latency accounting. Track per-tier token spend (the daily quota already counts tokens) so operators can see the savings/escalation rate.
Phased plan
- Phase 1: heuristic tiering + fast backend + UI labeling. Add the OpenAI-compatible fastlane backend (Cerebras; gpt-oss-120b today, MiniMax 2.5 if a serving endpoint becomes available), the model_tiers/tier_policy settings + admin UI, the per-token tier-auto mode as the default, a router-side heuristic difficulty/stuck scorer, the tier/score audit fields + Split-Brain-Tier header, and difficulty/tier labeling in the UI (seeds the Phase-3 flywheel). Ships end-to-end, no training.
- Phase 2: ML difficulty head. Add a difficulty/stuck head to the classifier service (reusing the encoder); /classify returns the scores; router consumes them instead of (or alongside) the heuristic.
- Phase 3: outcome flywheel. Mine the audit log + the Phase-1 UI labels for "cheapest tier that resolved it," retrain the head, calibrate thresholds.