Overall backend scheduling policy — design
Status: Phase 1 implemented (router + chart; `scalarlm-fast` wired to
https://gemma4e2b.scalarllm.com/v1). See Implementation notes at the end for
where the shipped behavior differs from the proposal below.

This design unifies the two routing stages split-brain already has, novelty
(which branch) and tiering (which model), into a single scheduling policy that
applies to both branches. It also adds a fourth backend: `scalarlm-fast` (a
small, low-latency private model, Gemma 4 E2B at
https://gemma4e2b.scalarllm.com) as the fast tier of the IP/novel branch.
This is the natural generalization of model-tiering.md: tiering exists today, but only on the general (Claude) branch. The novel branch has exactly one model. Adding a small fast private model gives the novel branch its own two-rung ladder, and once both branches are ladders the decision collapses into one clean picture.
The shape of the decision: two independent axes
Every request is scheduled on two orthogonal axes:
| Axis | Question | Decided by | Nature |
|---|---|---|---|
| Novelty | May this content leave to an external model? | the classifier (`p_novel` over spans) | a hard safety gate: the IP invariant |
| Difficulty / stuck | How much model does this task need? | the heuristic scorer (`score_difficulty`, `score_stuck`) | a soft cost/latency optimizer |
These are already separate functions in the code; today they're only composed on one branch. The whole proposal is: compute both axes for every request, let novelty pick the ladder and difficulty/stuck pick the rung.

    request
      │
      ├─ score_difficulty(payload), score_stuck(payload)   ← computed ONCE, branch-agnostic
      │
      └─ novelty classifier (k-th highest span p_novel)
           │
           ├─ general ───────▶ GENERAL ladder (external models: Claude/Cerebras)
           │                     fast     ─▶ fastlane (gpt-oss-120b on Cerebras)
           │                     balanced ─▶ Claude Sonnet
           │                     deep     ─▶ Claude Opus
           │
           └─ novel / uncertain ─▶ IP ladder (private models, ScalarLM only)
                                     fast     ─▶ scalarlm-fast (Gemma 4 E2B)   ← NEW
                                     standard ─▶ scalarlm (Gemma ~31B)

The same difficulty/stuck scores choose the rung on whichever ladder the novelty gate selected. An easy novel turn now gets the 2B model (fast); a hard or stuck novel turn escalates to the big Gemma — exactly the win tiering already gives the general branch, now extended to the private side.
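
A minimal Python sketch of the two-axis decision, for illustration only. `LADDERS`, `Placement`, `pick_branch`, `pick_rung`, `schedule` and the 0.5 novelty threshold are hypothetical; the real classifier works on the k-th highest span score with an uncertain band, and real escalation is graduated rather than the base-or-top rule used here.

```python
from dataclasses import dataclass

# Hypothetical ladders; rung names follow the diagram in this section.
LADDERS = {
    "general": ["fast", "balanced", "deep"],
    "ip": ["fast", "standard"],
}


@dataclass(frozen=True)
class Placement:
    branch: str  # which ladder the novelty gate selected
    rung: str    # which rung the difficulty/stuck scores selected


def pick_branch(p_novel: float, threshold: float) -> str:
    """Novelty axis: a hard gate. At or above the threshold, stay private."""
    return "ip" if p_novel >= threshold else "general"


def pick_rung(order: list[str], difficulty: float, stuck: float,
              difficulty_tau: float = 0.6, stuck_tau: float = 0.5) -> str:
    """Difficulty axis: a soft optimizer, simplified here to base-or-top."""
    escalate = difficulty >= difficulty_tau or stuck >= stuck_tau
    return order[-1] if escalate else order[0]


def schedule(p_novel: float, difficulty: float, stuck: float,
             threshold: float = 0.5) -> Placement:
    # The scores are branch-agnostic: the same values are used on either ladder.
    branch = pick_branch(p_novel, threshold)
    return Placement(branch, pick_rung(LADDERS[branch], difficulty, stuck))
```

The point of the sketch is that `pick_branch` never looks at difficulty and `pick_rung` never looks at novelty, which is what keeps the two axes independent.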
Decisions (locked):
- IP base tier = fast (Gemma 4 E2B). Novel traffic starts on the 2B model
and escalates to the 31B only on difficulty/stuck — fast-first, mirroring the
general branch.
- IP pricing is model-weighted: the small Gemma (scalarlm-fast) costs
1.0, the big Gemma (scalarlm) costs 2.0 — escalating to the big
model burns the ip bucket twice as fast, reflecting its higher serve cost.
The IP invariant is preserved — by construction
The dominant invariant (proprietary content never reaches an external model; see classifier.md) is untouched, because it lives entirely on the novelty axis:
- The general ladder contains only external models. The novelty gate is what admits a request to it. Tiering only chooses which external model.
- The IP ladder contains only private models (both Gemma sizes run on our own ScalarLM infra at `*.scalarllm.com`). Escalating from `scalarlm-fast` → `scalarlm` is a private→private move; novel content never crosses to Claude/Cerebras. `scalarlm-fast` is not an IP escape hatch: it sits behind the same novelty gate, exactly as the general fast tier does.
So adding a rung to either ladder can never weaken the invariant: the gate between the ladders is the only place IP safety is decided, and it does not move.
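
One way to make "by construction" checkable is a config-time guard that rejects any IP ladder naming a non-private backend. This is a hypothetical sketch: `PRIVATE_BACKENDS` and `check_ladder_invariant` are illustrative names, and the ladder shape is assumed to be a dict with a `tiers` mapping whose entries carry a `backend` key.

```python
# Assumed set of private backends; the names are the ones this document uses.
PRIVATE_BACKENDS = frozenset({"scalarlm", "scalarlm-fast"})


def check_ladder_invariant(ladders: dict) -> None:
    """Raise ValueError if any IP-ladder rung names a non-private backend."""
    for rung, target in ladders["ip"]["tiers"].items():
        if target["backend"] not in PRIVATE_BACKENDS:
            raise ValueError(
                f"IP rung {rung!r} targets non-private backend {target['backend']!r}"
            )
```

A guard like this would run when settings are loaded, so a misconfigured rung fails before any request is routed.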
The new backend: scalarlm-fast
- Host: `https://gemma4e2b.scalarllm.com`, no API key (same as ScalarLM today; `ScalarlmBackend` already sends `api_key or "n/a"`).
- Model: Gemma 4 E2B, a small (~2B effective-param) model, much faster and cheaper to serve than the standard ScalarLM Gemma (~31B). Configure `model` as `auto` to discover the served id from `/v1/models`, like ScalarLM.
- Implementation: no new backend class. It is the existing `ScalarlmBackend` with `name="scalarlm-fast"`, exactly how `fastlane` reuses it (`app.py:325`). The `name` becomes the wire prefix on `model_id` (`scalarlm-fast:…`) and the `Split-Brain-Backend-Model` header, so it's distinguishable in the audit log. The Anthropic↔OpenAI translation the ScalarLM path already does carries over unchanged.

    # app.py build_backends, mirroring the fastlane block:
    if cfg.scalarlm_fast_base_url:
        backends["scalarlm-fast"] = ScalarlmBackend(
            base_url=cfg.scalarlm_fast_base_url,
            api_key=cfg.scalarlm_fast_api_key,  # may be empty
            model=cfg.scalarlm_fast_model or "auto",
            name="scalarlm-fast",
        )

Per-branch ladders (generalizing model_tiers)
Today settings.json holds one ladder (model_tiers + tier_order +
tier_policy), consumed only on the general branch. We generalize it to a
ladder per branch, keeping the existing single-ladder form as the general
branch for backward compatibility:

    {
      "ladders": {
        "general": {
          "tiers": {
            "fast":     { "backend": "fastlane", "model": "gpt-oss-120b" },
            "balanced": { "backend": "claude",   "model": "claude-sonnet-4-6" },
            "deep":     { "backend": "claude",   "model": "claude-opus-4-8" }
          },
          "order": ["fast", "balanced", "deep"],
          "policy": { "base": "fast", "escalate": "deep",
                      "difficulty_tau": 0.6, "stuck_tau": 0.5 }
        },
        "ip": {
          "tiers": {
            "fast":     { "backend": "scalarlm-fast", "model": "auto", "max_context": 128000 },
            "standard": { "backend": "scalarlm",      "model": "auto" }
          },
          "order": ["fast", "standard"],
          "policy": { "base": "fast", "escalate": "standard",
                      "difficulty_tau": 0.6, "stuck_tau": 0.5 }
        }
      }
    }

- `TierTarget` / `TierPolicy` / `choose_tier` are already generic over the ladder: they take whatever `tiers`/`order`/`policy` they're handed. The only new plumbing is `SettingsStore` returning a ladder keyed by branch and the scheduler picking `ladders["ip"]` vs `ladders["general"]`.
- An operator who wants novel-always-big sets the IP ladder's `policy.base = "standard"`, with no code change. One who wants a more conservative general posture sets the general `base = "balanced"` (already supported).
- Seed defaults live in router env (below) so the system works before anyone writes `settings.json`, exactly as the current ladder does (`app.py:280`).
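
The back-compat rule can be sketched as a small resolver. This is an assumption-laden illustration, not `SettingsStore` itself: `resolve_ladders` is a hypothetical name, and the precedence shown (per-branch `ladders` over the legacy top-level keys over the env-derived seed) is one plausible reading of the text.

```python
def resolve_ladders(settings: dict, seed: dict) -> dict:
    """Return {"general": ladder, "ip": ladder} from a settings.json-shaped dict.

    Assumed precedence per branch: settings["ladders"][branch], then the
    legacy top-level keys (general only), then the env-derived seed.
    """
    # Start from shallow copies of the seed so the system works with no settings.
    resolved = {branch: dict(ladder) for branch, ladder in seed.items()}

    # The legacy single-ladder form is read as the general ladder.
    if "model_tiers" in settings:
        resolved["general"] = {
            "tiers": settings["model_tiers"],
            "order": settings.get("tier_order", list(settings["model_tiers"])),
            "policy": settings.get("tier_policy", seed["general"]["policy"]),
        }

    # The per-branch form overrides anything above.
    for branch, ladder in settings.get("ladders", {}).items():
        resolved[branch] = ladder
    return resolved
```

With an empty `settings` dict the function returns the seed unchanged, which matches the requirement that routing works before anyone writes `settings.json`.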
The scheduling algorithm
One pass, in app.py, for both the OpenAI and Anthropic ingress:

    1. difficulty = score_difficulty(payload)        # once, branch-agnostic
       stuck      = score_stuck(payload.messages)
    2. decision = decide(requested_model, classifier_result, threshold)   # novelty axis
       branch   = "ip" if decision.backend == SCALARLM else "general"
    3. if tier_auto:                                 # per-token routing mode (default)
           ladder = settings.ladder(branch)
           td = choose_tier(payload, ladder.tiers, ladder.order, ladder.policy,
                            difficulty=difficulty, stuck=stuck)
           backend, model = td.target.backend, td.target.model
       else:
           # honor the client / forced mode exactly as today

Two changes to existing code, both small:
- `choose_tier` currently recomputes difficulty/stuck internally. Let it accept them as optional args (compute-once), so the same scores drive the ladder selection regardless of branch. A backward-compatible default keeps the old behavior.
- Escalation is graduated (see `model-tiering.md` § Decision policy): difficulty climbs the ladder proportionally rather than jumping straight to the top rung, so the intermediate rungs are reachable (`balanced` on the general ladder and `standard` on the IP ladder). This matters more once both branches are ladders: a moderate novel task lands on the big Gemma only when it warrants it, otherwise it stays on the fast 2B.
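
The two changes can be sketched together. This is not the real `choose_tier`: `choose_rung` is a hypothetical stand-in, the scorers are passed in as callables so the block is self-contained, and both the linear spreading rule and the "stuck goes straight to the top rung" behavior are assumptions (the authoritative policy is in `model-tiering.md`).

```python
from typing import Callable, Optional


def choose_rung(payload: dict, order: list[str], policy: dict,
                score_difficulty: Callable[[dict], float],
                score_stuck: Callable[[dict], float],
                difficulty: Optional[float] = None,
                stuck: Optional[float] = None) -> str:
    # Compute-once: reuse caller-supplied scores; score here only when they are
    # missing, so callers that pass neither keep the old behavior.
    if difficulty is None:
        difficulty = score_difficulty(payload)
    if stuck is None:
        stuck = score_stuck(payload)

    base = order.index(policy["base"])
    top = order.index(policy["escalate"])
    if stuck >= policy["stuck_tau"]:
        return order[top]  # assumed: a stuck session goes to the top rung
    tau = policy["difficulty_tau"]
    if difficulty < tau or top <= base:
        return order[base]

    # Assumed graduated rule: spread [tau, 1.0] evenly over the rungs above base.
    steps = top - base
    span = 1.0 - tau
    frac = (difficulty - tau) / span if span > 0 else 1.0
    return order[base + min(steps, 1 + int(frac * steps))]
```

Under this rule and the default thresholds, a difficulty of 0.65 on the three-rung general ladder selects `balanced`, while 0.95 selects `deep`; on the two-rung IP ladder any difficulty at or above `difficulty_tau` selects `standard`.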
Context length is a capability gate (not a difficulty signal)
The fast Gemma 4 E2B caps at a 128k context window. A request whose estimated context exceeds that cannot be served by the fast rung at all — regardless of how easy it is — so context is a hard capability gate, kept separate from the soft difficulty/stuck optimizer:
- Each tier carries an optional `max_context` (tokens; `scalarlm-fast` → 128000, others unbounded). `choose_tier` estimates the inbound context (system + messages + tools, ~4 chars/token) plus the requested output budget, and escalates past any rung that can't hold it to the lowest rung that can. On the IP ladder a 150k-token novel request skips the 2B model straight to the big Gemma even at difficulty 0.
- Because it's a capability requirement, this gate may climb above `policy.escalate`: a model that can't fit the prompt is not an option no matter the cost policy. It's recorded as `context` in the tier `reason`.
- The estimate is deliberately conservative (over-estimating escalates early), since serving a prompt that overflows the window fails hard at the backend.
- The tier branch guard at `app.py:688` (`tier_auto and decision.backend == "claude" and decision.decision == "general"`) widens to also fire on the IP branch, selecting that branch's ladder. The novel/uncertain path stops being a bare "send to scalarlm" and becomes "send to the IP ladder's chosen rung."
Everything else — the whole-payload span classifier, the uncertain band, the veto on forced-claude — is unchanged.
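
A rough sketch of the context gate described in this section. The function names are hypothetical, the payload keys (`system`, `messages`, `tools`, `max_tokens`) are assumed to follow the Anthropic-style request shape, and serializing to JSON before counting characters is just one simple way to over-estimate.

```python
import json
from typing import Optional

CHARS_PER_TOKEN = 4  # the rough ratio quoted in this section


def estimate_context_tokens(payload: dict) -> int:
    """Over-estimate inbound tokens (system + messages + tools) plus output budget."""
    text = json.dumps(
        [payload.get("system", ""), payload.get("messages", []), payload.get("tools", [])]
    )
    # Ceiling division rounds up, keeping the estimate on the conservative side.
    inbound = -(-len(text) // CHARS_PER_TOKEN)
    return inbound + int(payload.get("max_tokens", 0))


def apply_context_gate(rung: str, order: list[str], tiers: dict,
                       needed: int) -> Optional[str]:
    """Return the lowest rung at or above `rung` whose max_context holds `needed`."""
    for name in order[order.index(rung):]:
        limit = tiers[name].get("max_context")  # a missing key means unbounded
        if limit is None or needed <= limit:
            return name
    return None  # nothing on this ladder fits; the caller decides how to fail
```

Because the gate only ever moves up the same ladder's `order`, it can climb past `policy.escalate` without ever leaving the branch the novelty gate selected.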
Token accounting & buckets
The two daily buckets (token-limits.md) map cleanly onto the
two ladders: general ladder → general bucket, IP ladder → ip bucket.
- Fix `bucket_for_backend`. It is currently `"ip" if backend == "scalarlm" else "general"` (`usage.py:98`), a literal string match that would bill `scalarlm-fast` to the wrong (general) bucket. Better: derive the bucket from the branch/ladder that served the request rather than the backend name, which also future-proofs adding more private models. As a minimal fix, match the IP ladder's backends (`scalarlm`, `scalarlm-fast`).
- Weights (model-keyed). The two private models are priced differently, so escalating to the big Gemma is visible in the bucket:
| Model | `model_id` prefix | Per-token weight |
|---|---|---|
| Gemma 4 E2B (`scalarlm-fast`) | `scalarlm-fast:` | 1.0 |
| Gemma ~31B (`scalarlm`) | `scalarlm:` | 2.0 |
| fastlane / unknown (default) | n/a | 1.0 |
This needs one change to Pricing.weight() (usage.py): today it matches
opus/sonnet/haiku substrings else default. Add a scalarlm rule —
but match scalarlm-fast before scalarlm, since "scalarlm" is a
substring of "scalarlm-fast". The big-Gemma weight is a new env knob
(ROUTER_TOKEN_WEIGHT_SCALARLM, default 2.0); the small Gemma falls through
to default (1.0). Both still bill the ip bucket; the 2× weight just makes
the big model burn it faster, the same way Sonnet (3×) burns general faster
than the fast tier.
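
A sketch of the two accounting changes, under stated assumptions: `IP_BACKENDS` and `scalarlm_weight` are hypothetical names, the weight function covers only the private-model rules (the existing opus/sonnet/haiku rules are omitted), and the big-model weight is passed as a parameter standing in for the `ROUTER_TOKEN_WEIGHT_SCALARLM` env knob.

```python
IP_BACKENDS = frozenset({"scalarlm", "scalarlm-fast"})


def bucket_for_backend(backend: str) -> str:
    """Minimal fix: match every IP-ladder backend, not just the literal 'scalarlm'."""
    return "ip" if backend in IP_BACKENDS else "general"


def scalarlm_weight(model_id: str, big_weight: float = 2.0,
                    default: float = 1.0) -> float:
    """Per-token weight for the private models only; other rules are omitted."""
    # Order matters with substring matching: "scalarlm" is a substring of
    # "scalarlm-fast", so the more specific name must be tested first.
    if "scalarlm-fast" in model_id:
        return default
    if "scalarlm" in model_id:
        return big_weight
    return default


def weighted_tokens(tokens: int, model_id: str) -> float:
    """Amount debited from the bucket for `tokens` served by `model_id`."""
    return tokens * scalarlm_weight(model_id)
```

With these weights, 10,000 tokens served by `scalarlm` debit 20,000 weighted tokens from the `ip` bucket, against 10,000 for the same traffic on `scalarlm-fast`.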
Fallbacks — symmetric on both ladders
Tiering already has two fallback mechanisms; both generalize:
- Within-ladder backend error. A fast-tier backend error falls back to the ladder's safe rung at request time:
  - general: `fastlane` error → Claude `balanced` (existing, `app.py:739`).
  - IP: `scalarlm-fast` error → `scalarlm` (standard). Same shape, and IP-safe because both are private.

  So a flaky 2B endpoint degrades novel-branch latency, never availability: the standard Gemma is always a valid destination for novel content.
- Quota fallback (`general` exhausted → `ip`). When a user burns the `general` bucket, general traffic falls back to ScalarLM (existing). With the IP ladder this fallback now enters the IP ladder at the difficulty-scored rung (reusing the scores already computed) instead of hardcoding the big model, so an easy general request that spills to IP lands on the fast 2B model. The IP invariant is untouched (the fallback only ever goes general→ip).
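
The within-ladder fallback can be sketched as a single retry that never changes branch. `BackendError`, `SAFE_RUNG`, `call_with_fallback` and the `send` callable are hypothetical stand-ins for the router's real error type and backend call.

```python
from typing import Callable

# Safe rung per branch, as listed in this section.
SAFE_RUNG = {"general": "balanced", "ip": "standard"}


class BackendError(Exception):
    """Hypothetical error type raised by a backend call."""


def call_with_fallback(branch: str, rung: str,
                       send: Callable[[str, str], str]) -> tuple[str, str]:
    """Try the chosen rung; on a fast-tier error, retry once on the safe rung.

    Returns (rung_that_served, response). The retry keeps `branch` fixed, so
    novel content stays on private backends.
    """
    try:
        return rung, send(branch, rung)
    except BackendError:
        safe = SAFE_RUNG[branch]
        if rung != "fast" or safe == rung:
            raise  # only the fast tier has a fallback in this sketch
        return safe, send(branch, safe)
```

Keeping `branch` as an argument that the fallback cannot alter is the design point: availability logic gets no opportunity to move a request across the novelty gate.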
Headers & audit
- `Split-Brain-Tier` already exists; it now applies on both branches (`fast | balanced | deep` for general, `fast | standard` for IP).
- Add `Split-Brain-Branch: general | ip` so a reader can tell which ladder a `fast` came from without decoding the backend.
- Audit fields `tier`, `difficulty_score`, `stuck_score`, `chosen_backend`, `backend_model` already capture everything; the branch is derivable from `chosen_backend` (and explicit once the header field is recorded). This keeps the Phase-3 flywheel (`model-tiering.md`) intact; it can now also learn "cheapest tier that resolved it" on the IP branch.
Configuration
| Setting | Where | Notes |
|---|---|---|
| `ladders` (per-branch `tiers`/`order`/`policy`) | `settings.json` (admin-tunable) | generalizes today's `model_tiers`/`tier_order`/`tier_policy`; old single-ladder form read as `general` |
| `SCALARLM_FAST_BASE_URL` | router env | `https://gemma4e2b.scalarllm.com` |
| `SCALARLM_FAST_API_KEY` | router env | empty (no token) |
| `SCALARLM_FAST_MODEL` | router env | `auto` (discover Gemma 4 E2B from `/v1/models`) |
| `TIER_IP_BASE` / `TIER_IP_ESCALATE` | router env | seed IP-ladder policy (`fast` / `standard`) |
| `ROUTER_TOKEN_WEIGHT_SCALARLM` | router env | per-token weight of the big Gemma (default 2.0); the small Gemma uses `default` (1.0) |
Existing FASTLANE_* and TIER_* settings are unchanged; the new ones mirror
them.
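
A sketch of reading the new env settings into one config object. `ScalarlmFastConfig` and `load_scalarlm_fast_config` are hypothetical names; the variable names and the `auto`, `fast`, `standard` and 2.0 defaults come from the table, while treating an unset `SCALARLM_FAST_BASE_URL` as "backend disabled" is an assumption based on the `if cfg.scalarlm_fast_base_url:` guard.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ScalarlmFastConfig:
    base_url: str      # empty means the scalarlm-fast backend is not built
    api_key: str       # may be empty (no token)
    model: str         # "auto" discovers the served id from /v1/models
    ip_base: str       # seed IP-ladder base rung
    ip_escalate: str   # seed IP-ladder escalation rung
    big_weight: float  # per-token weight of the big Gemma


def load_scalarlm_fast_config(env: Mapping[str, str] = os.environ) -> ScalarlmFastConfig:
    """Read the new router env settings, applying the defaults from the table."""
    return ScalarlmFastConfig(
        base_url=env.get("SCALARLM_FAST_BASE_URL", ""),
        api_key=env.get("SCALARLM_FAST_API_KEY", ""),
        model=env.get("SCALARLM_FAST_MODEL", "auto"),
        ip_base=env.get("TIER_IP_BASE", "fast"),
        ip_escalate=env.get("TIER_IP_ESCALATE", "standard"),
        big_weight=float(env.get("ROUTER_TOKEN_WEIGHT_SCALARLM", "2.0")),
    )
```

Accepting the environment as a mapping argument makes the loader easy to exercise with a plain dict instead of the process environment.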
Why this is the right generalization (not just "one more backend")
- The scorer is already branch-free. `score_difficulty`/`score_stuck` read the payload, not the destination. They were built to be reused; this proposal just stops throwing the IP branch's scores away.
- `choose_tier` is already ladder-generic. No new selection logic, only a second ladder fed to the same function.
- Symmetry makes the system legible. "Novelty picks the ladder, difficulty picks the rung" is one sentence that describes all four backends. New private or external models become new rungs, not new special cases in `app.py`.
Open questions
- IP-ladder thresholds. Should novel traffic escalate on the same `difficulty_tau`/`stuck_tau` as general, or be more eager to use the big Gemma (proprietary work may be higher-stakes)? Per-ladder `policy` already allows divergence; pick the defaults empirically.
- Tool-use parity of Gemma 4 E2B. Same caveat as the general fast tier: if the 2B model handles Claude Code's agentic tool-use format poorly, scope its `fast` rung to plain-chat novel turns and start agentic novel turns at `standard`. Needs a per-model capability flag.
- Bucket derivation. Switch `bucket_for_backend` from a name match to a branch/ladder lookup (cleaner, future-proof) vs. just extending the match.
Implementation notes (Phase 1, shipped)
Where the shipped code differs from the proposal above:
- Anthropic ingress only. Tiering (both ladders) runs on `/v1/messages`, exactly like the pre-existing general tiering. The OpenAI ingress (`/v1/chat/completions`) is untiered; novel traffic there goes to the default ScalarLM (standard). Extending tiering to the OpenAI ingress is future work.
- Quota fallback enters at `standard`, not the scored rung. A `general`→`ip` quota fallback routes to the standard Gemma (conservative, IP-safe), rather than re-running the scorer to pick the IP rung. The primary novel/uncertain path is fully tiered; only the quota-spill path is fixed to standard. The scored-rung version is a cheap follow-up.
- Bucket by name match, not branch derivation. `bucket_for_backend` matches the IP-branch backends (`scalarlm`, `scalarlm-fast`) → `ip`. Branch-derived bucketing (open question 3) was deferred as unnecessary for two private backends.
- Per-branch ladders. `settings.json` `ladders.{general,ip}` overrides the seed ladders; the legacy top-level `model_tiers`/`tier_order`/`tier_policy` still resolve as the general ladder (back-compat).
- IP-safe last resort. If the IP fast tier is unconfigured, the ladder resolves within the IP branch (→ standard ScalarLM); the tier resolver's fallback target is branch-appropriate and never Claude.
Phased plan
- Phase 1: wire the backend + IP ladder. Add `scalarlm-fast` to `build_backends`, the per-branch `ladders` shape in `SettingsStore` (old form → `general`), seed IP-ladder env, widen the tier guard to the IP branch, thread compute-once difficulty/stuck into `choose_tier`, and fix `bucket_for_backend`. Add `Split-Brain-Branch` header + audit. Reuses the existing heuristic scorer, with no training.
- Phase 2: calibrate. Tune the IP ladder's thresholds and (optionally) the `scalarlm-fast` weight from observed escalation/outcome rates.
- Phase 3: flywheel on both branches. Extend the Phase-3 outcome loop to learn the cheapest resolving tier on the IP branch too, not just general.