Overall backend scheduling policy — design

Status: Phase 1 implemented (router + chart; scalarlm-fast wired to https://gemma4e2b.scalarllm.com/v1). See Implementation notes at the end for where the shipped behavior differs from the proposal below.

This design unifies the two routing stages split-brain already has, novelty (which branch) and tiering (which model), into a single scheduling policy that applies to both branches. It also adds a fourth backend: scalarlm-fast (a small, low-latency private model, Gemma 4 E2B at https://gemma4e2b.scalarllm.com) as the fast tier of the IP/novel branch.

This is the natural generalization of model-tiering.md: tiering exists today, but only on the general (Claude) branch. The novel branch has exactly one model. Adding a small fast private model gives the novel branch its own two-rung ladder, and once both branches are ladders the decision collapses into one clean picture.

The shape of the decision: two independent axes

Every request is scheduled on two orthogonal axes:

Axis	Question	Decided by	Nature
Novelty	May this content leave to an external model?	the classifier (`p_novel` over spans)	a hard safety gate — the IP invariant
Difficulty / stuck	How much model does this task need?	the heuristic scorer (`score_difficulty`, `score_stuck`)	a soft cost/latency optimizer

These are already separate functions in the code; today they're only composed on one branch. The whole proposal is: compute both axes for every request, let novelty pick the ladder and difficulty/stuck pick the rung.

request
  │
  ├─ score_difficulty(payload), score_stuck(payload)      ← computed ONCE, branch-agnostic
  │
  └─ novelty classifier (k-th highest span p_novel)
        │
        ├─ general ───────▶ GENERAL ladder  (external models — Claude/Cerebras)
        │                     fast  ─▶ fastlane   (gpt-oss-120b on Cerebras)
        │                     balanced ─▶ Claude Sonnet
        │                     deep  ─▶ Claude Opus
        │
        └─ novel / uncertain ─▶ IP ladder  (private models — ScalarLM only)
                                  fast     ─▶ scalarlm-fast  (Gemma 4 E2B)   ← NEW
                                  standard ─▶ scalarlm        (Gemma ~31B)

The same difficulty/stuck scores choose the rung on whichever ladder the novelty gate selected. An easy novel turn now gets the 2B model (fast); a hard or stuck novel turn escalates to the big Gemma — exactly the win tiering already gives the general branch, now extended to the private side.
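The two-axis decision above can be sketched in a few lines. This is a minimal illustration, not the router's code: the `LADDERS` layout, the `Schedule` type, and the `schedule` function are hypothetical, and the rung choice uses a simplified all-or-nothing threshold rule (the `>=` comparison is also an assumption) rather than the graduated escalation described later.

```python
from dataclasses import dataclass

# Hypothetical layout mirroring the diagram: rung names and backends come
# from the text, the dict structure is an assumption for illustration.
LADDERS = {
    "general": {
        "order": ["fast", "balanced", "deep"],
        "backends": {"fast": "fastlane", "balanced": "claude", "deep": "claude"},
    },
    "ip": {
        "order": ["fast", "standard"],
        "backends": {"fast": "scalarlm-fast", "standard": "scalarlm"},
    },
}


@dataclass(frozen=True)
class Schedule:
    branch: str
    tier: str
    backend: str


def schedule(is_novel_or_uncertain: bool, difficulty: float, stuck: float,
             difficulty_tau: float = 0.6, stuck_tau: float = 0.5) -> Schedule:
    # Axis 1 (hard gate): novelty alone picks the ladder.
    branch = "ip" if is_novel_or_uncertain else "general"
    ladder = LADDERS[branch]
    # Axis 2 (soft optimizer): the same scores pick the rung on either ladder.
    # Simplified rule: bottom rung unless a threshold is crossed, then top rung.
    escalate = difficulty >= difficulty_tau or stuck >= stuck_tau
    tier = ladder["order"][-1] if escalate else ladder["order"][0]
    return Schedule(branch, tier, ladder["backends"][tier])
```

Note that the novelty flag never influences the rung and the scores never influence the ladder; that separation is the point of the design.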

Decisions (locked):

- IP base tier = fast (Gemma 4 E2B). Novel traffic starts on the 2B model and escalates to the 31B only on difficulty/stuck: fast-first, mirroring the general branch.
- IP pricing is model-weighted: the small Gemma (scalarlm-fast) costs 1.0, the big Gemma (scalarlm) costs 2.0. Escalating to the big model burns the ip bucket twice as fast, reflecting its higher serve cost.

The IP invariant is preserved — by construction

The dominant invariant (proprietary content never reaches an external model; see classifier.md) is untouched, because it lives entirely on the novelty axis:

The general ladder contains only external models. The novelty gate is what admits a request to it. Tiering only chooses which external model.
The IP ladder contains only private models (both Gemma sizes run on our own ScalarLM infra at *.scalarllm.com). Escalating from scalarlm-fast → scalarlm is a private→private move; novel content never crosses to Claude/Cerebras. scalarlm-fast is not an IP escape hatch — it sits behind the same novelty gate, identically to how the general fast tier sits behind it.

So adding a rung to either ladder can never weaken the invariant: the gate between the ladders is the only place IP safety is decided, and it does not move.
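One way to make "by construction" checkable is a configuration-time guard that rejects any IP ladder naming a non-private backend. The sketch below is an assumption about how such a guard could look; `PRIVATE_BACKENDS` and `check_ip_invariant` are hypothetical names, and the ladder shape follows the per-branch `ladders` form proposed in this document.

```python
# Assumption: the private backends are exactly the two named in this document.
PRIVATE_BACKENDS = frozenset({"scalarlm", "scalarlm-fast"})


def check_ip_invariant(ladders: dict) -> None:
    """Raise ValueError if any IP-ladder tier targets a non-private backend."""
    ip_tiers = ladders.get("ip", {}).get("tiers", {})
    for tier, target in ip_tiers.items():
        backend = target.get("backend")
        if backend not in PRIVATE_BACKENDS:
            raise ValueError(
                f"IP tier {tier!r} targets non-private backend {backend!r}"
            )
```

Running a check like this whenever settings are loaded would turn a misconfigured rung into a startup error instead of a leak.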

The new backend: `scalarlm-fast`

Host: https://gemma4e2b.scalarllm.com, no API key (same as ScalarLM today — ScalarlmBackend already sends api_key or "n/a").
Model: Gemma 4 E2B — a small (~2B effective-param) model, much faster and cheaper to serve than the standard ScalarLM Gemma (~31B). Configure model as auto to discover the served id from /v1/models, like ScalarLM.
Implementation: no new backend class. It is the existing ScalarlmBackend with name="scalarlm-fast" — exactly how fastlane reuses it (app.py:325). The name becomes the wire prefix on model_id (scalarlm-fast:…) and the Split-Brain-Backend-Model header, so it's distinguishable in the audit log. The Anthropic↔OpenAI translation the ScalarLM path already does carries over unchanged.

# app.py build_backends, mirroring the fastlane block:
if cfg.scalarlm_fast_base_url:
    backends["scalarlm-fast"] = ScalarlmBackend(
        base_url=cfg.scalarlm_fast_base_url,
        api_key=cfg.scalarlm_fast_api_key,      # may be empty
        model=cfg.scalarlm_fast_model or "auto",
        name="scalarlm-fast",
    )

Per-branch ladders (generalizing `model_tiers`)

Today settings.json holds one ladder (model_tiers + tier_order + tier_policy), consumed only on the general branch. We generalize it to a ladder per branch, keeping the existing single-ladder form as the general branch for backward compatibility:

{
  "ladders": {
    "general": {
      "tiers": {
        "fast":     { "backend": "fastlane", "model": "gpt-oss-120b" },
        "balanced": { "backend": "claude",   "model": "claude-sonnet-4-6" },
        "deep":     { "backend": "claude",   "model": "claude-opus-4-8" }
      },
      "order":  ["fast", "balanced", "deep"],
      "policy": { "base": "fast", "escalate": "deep",
                  "difficulty_tau": 0.6, "stuck_tau": 0.5 }
    },
    "ip": {
      "tiers": {
        "fast":     { "backend": "scalarlm-fast", "model": "auto", "max_context": 128000 },
        "standard": { "backend": "scalarlm",      "model": "auto" }
      },
      "order":  ["fast", "standard"],
      "policy": { "base": "fast", "escalate": "standard",
                  "difficulty_tau": 0.6, "stuck_tau": 0.5 }
    }
  }
}

TierTarget / TierPolicy / choose_tier are already generic over the ladder — they take whatever tiers/order/policy they're handed. The only new plumbing is SettingsStore returning a ladder keyed by branch and the scheduler picking ladders["ip"] vs ladders["general"].
An operator who wants novel-always-big sets the IP ladder's policy.base = "standard" — no code change. One who wants a more conservative general posture sets the general base = "balanced" (already supported).
Seed defaults live in router env (below) so the system works before anyone writes settings.json, exactly as the current ladder does (app.py:280).
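The backward-compatible read described above might look like the following sketch. `resolve_ladders` is a hypothetical helper, not the actual SettingsStore API, and the precedence it applies (seed, then legacy top-level keys as the general ladder, then the new `ladders` form) is an assumption.

```python
import copy


def resolve_ladders(settings: dict, seed: dict) -> dict:
    """Return per-branch ladders, starting from the seed defaults."""
    ladders = copy.deepcopy(seed)

    # Legacy single-ladder form is read as the general ladder.
    if "model_tiers" in settings:
        general = ladders["general"]
        general["tiers"] = settings["model_tiers"]
        general["order"] = settings.get("tier_order", list(settings["model_tiers"]))
        general["policy"] = settings.get("tier_policy", general["policy"])

    # New per-branch form wins over both the seed and the legacy keys
    # (assumed precedence).
    for branch, ladder in settings.get("ladders", {}).items():
        ladders[branch] = ladder
    return ladders
```

With an empty `settings.json` this returns the seed unchanged, which matches the requirement that the system works before anyone writes settings.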

The scheduling algorithm

One pass, in app.py, for both the OpenAI and Anthropic ingress:

1. difficulty = score_difficulty(payload)          # once, branch-agnostic
   stuck      = score_stuck(payload.messages)

2. decision = decide(requested_model, classifier_result, threshold)   # novelty axis
   branch   = "ip" if decision.backend == SCALARLM else "general"

3. if tier_auto:                                   # per-token routing mode (default)
       ladder = settings.ladder(branch)
       td = choose_tier(payload, ladder.tiers, ladder.order, ladder.policy,
                        difficulty=difficulty, stuck=stuck)
       backend, model = td.target.backend, td.target.model
   else:
       # honor the client / forced mode exactly as today

Two changes to existing code, both small:

choose_tier currently recomputes difficulty/stuck internally. Let it accept them as optional args (compute-once), so the same scores drive the ladder selection regardless of branch. Backward-compatible default keeps the old behavior.
Escalation is graduated (see model-tiering.md § Decision policy): difficulty climbs the ladder proportionally rather than jumping straight to the top rung, so the middle rungs are reachable — balanced on the general ladder and standard on the IP ladder. This matters more once both branches are ladders: a moderate novel task lands on the big Gemma only when it warrants it, otherwise it stays on the fast 2B.
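Both changes can be illustrated together. The sketch below is not the real choose_tier: `choose_rung` is a hypothetical stand-in, the scorer callables are injected placeholders, and the graduated rule (spreading difficulty in the range from `difficulty_tau` to 1 evenly over the rungs above base, with a stuck conversation going straight to the escalate rung) is an assumption, since the actual policy lives in model-tiering.md.

```python
from typing import Callable, Optional, Sequence


def choose_rung(
    payload: dict,
    order: Sequence[str],
    policy: dict,
    *,
    difficulty: Optional[float] = None,
    stuck: Optional[float] = None,
    score_difficulty: Callable[[dict], float] = lambda payload: 0.0,
    score_stuck: Callable[[dict], float] = lambda payload: 0.0,
) -> str:
    # Compute-once: only call the scorers when the caller passed no scores,
    # so the old call signature keeps its old behavior.
    if difficulty is None:
        difficulty = score_difficulty(payload)
    if stuck is None:
        stuck = score_stuck(payload)

    lo = order.index(policy["base"])
    hi = order.index(policy["escalate"])
    steps = hi - lo
    if steps <= 0:
        return order[lo]
    # Assumed rule: a stuck conversation goes straight to the escalate rung.
    if stuck >= policy["stuck_tau"]:
        return order[hi]
    tau = policy["difficulty_tau"]
    if difficulty < tau:
        return order[lo]
    # Assumed graduated rule: map [tau, 1] evenly onto the rungs above base.
    frac = 1.0 if tau >= 1.0 else min(1.0, (difficulty - tau) / (1.0 - tau))
    return order[lo + 1 + min(steps - 1, int(frac * steps))]
```

On the three-rung general ladder with `difficulty_tau = 0.6`, a difficulty of 0.7 lands on balanced and 0.95 on deep; on the two-rung IP ladder any difficulty at or above the threshold lands on standard.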

Context length is a capability gate (not a difficulty signal)

The fast Gemma 4 E2B caps at a 128k context window. A request whose estimated context exceeds that cannot be served by the fast rung at all — regardless of how easy it is — so context is a hard capability gate, kept separate from the soft difficulty/stuck optimizer:

Each tier carries an optional max_context (tokens; scalarlm-fast → 128000, others unbounded). choose_tier estimates the inbound context (system + messages + tools, ~4 chars/token) plus the requested output budget, and escalates past any rung that can't hold it to the lowest rung that can. On the IP ladder a 150k-token novel request skips the 2B model straight to the big Gemma even at difficulty 0.
Because it's a capability requirement, this gate may climb above policy.escalate — a model that can't fit the prompt is not an option no matter the cost policy. It's recorded as context in the tier reason.
The estimate is deliberately conservative (over-estimating escalates early), since serving a prompt that overflows the window fails hard at the backend.
The tier branch guard at app.py:688 (tier_auto and decision.backend == "claude" and decision.decision == "general") widens to also fire on the IP branch, selecting that branch's ladder. The novel/uncertain path stops being a bare "send to scalarlm" and becomes "send to the IP ladder's chosen rung."

Everything else — the whole-payload span classifier, the uncertain band, the veto on forced-claude — is unchanged.
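The context gate described in this section can be sketched as two small functions. Both names are hypothetical, the payload keys (`system`, `messages`, `tools`, `max_tokens`) assume an Anthropic-style request body, and raising when no rung fits is an assumed behavior (the text only covers ladders whose upper rungs are unbounded).

```python
import json
import math

CHARS_PER_TOKEN = 4  # rough heuristic quoted in the text


def estimate_context_tokens(payload: dict) -> int:
    """Estimate inbound context plus requested output budget, in tokens."""
    parts = [payload.get("system", ""), payload.get("messages", []),
             payload.get("tools", [])]
    # Serializing structured parts counts keys and punctuation too, which
    # over-estimates; that errs on the side of escalating early.
    chars = sum(len(p) if isinstance(p, str) else len(json.dumps(p)) for p in parts)
    return math.ceil(chars / CHARS_PER_TOKEN) + int(payload.get("max_tokens", 0))


def apply_context_gate(chosen: str, order: list, max_context: dict, needed: int) -> str:
    """Return `chosen`, or the lowest rung above it that can hold `needed` tokens."""
    for tier in order[order.index(chosen):]:
        limit = max_context.get(tier)  # missing entry means unbounded
        if limit is None or needed <= limit:
            return tier
    # Assumption: fail loudly if nothing on the ladder can hold the request.
    raise ValueError(f"no rung can hold {needed} tokens")
```

Because the gate walks the full ladder order rather than stopping at `policy.escalate`, it can climb above the cost policy's ceiling, as the text requires.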

Token accounting & buckets

The two daily buckets (token-limits.md) map cleanly onto the two ladders: general ladder → general bucket, IP ladder → ip bucket.

Fix bucket_for_backend. It is currently "ip" if backend == "scalarlm" else "general" (usage.py:98) — a literal string match that would bill scalarlm-fast to the wrong (general) bucket. Better: derive the bucket from the branch/ladder that served the request rather than the backend name, which also future-proofs adding more private models. As a minimal fix, match the IP ladder's backends (scalarlm, scalarlm-fast).
Weights (model-keyed). The two private models are priced differently, so escalating to the big Gemma is visible in the bucket:

Model	`model_id` prefix	Per-token weight
Gemma 4 E2B (`scalarlm-fast`)	`scalarlm-fast:`	1.0
Gemma ~31B (`scalarlm`)	`scalarlm:`	2.0
fastlane / unknown (`default`)	—	1.0

This needs one change to Pricing.weight() (usage.py): today it matches the substrings opus, sonnet, and haiku, and otherwise returns default. Add a scalarlm rule, but match scalarlm-fast before scalarlm, since "scalarlm" is a substring of "scalarlm-fast". The big-Gemma weight is a new env knob (ROUTER_TOKEN_WEIGHT_SCALARLM, default 2.0); the small Gemma falls through to default (1.0). Both still bill the ip bucket; the 2× weight just makes the big model burn it faster, the same way Sonnet (3×) burns general faster than the fast tier.
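The bucket fix and the substring-ordering pitfall can be shown with two stand-alone functions. These are sketches, not the code in usage.py: `token_weight` stands in for Pricing.weight(), the `weights` mapping is a hypothetical parameter, and only the weights stated in this document (default 1.0, scalarlm 2.0, Sonnet 3×) are used in the example.

```python
# Minimal fix from the text: match the IP ladder's backends by name.
IP_BACKENDS = frozenset({"scalarlm", "scalarlm-fast"})


def bucket_for_backend(backend: str) -> str:
    return "ip" if backend in IP_BACKENDS else "general"


def token_weight(model_id: str, weights: dict) -> float:
    """Per-token weight for a wire model_id such as 'scalarlm-fast:<model>'."""
    mid = model_id.lower()
    # Order matters: "scalarlm" is a substring of "scalarlm-fast", so the
    # small model must be recognized first and sent to the default weight.
    if "scalarlm-fast" in mid:
        return weights["default"]
    for key in ("opus", "sonnet", "haiku", "scalarlm"):
        if key in mid and key in weights:
            return weights[key]
    return weights["default"]


# Example weights; only the values stated in this document are filled in.
WEIGHTS = {"default": 1.0, "scalarlm": 2.0, "sonnet": 3.0}
```

Checking `scalarlm` first would silently bill the 2B model at the 31B rate, which is the bug the ordering note is warning about.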

Fallbacks — symmetric on both ladders

Tiering already has two fallback mechanisms; both generalize:

Within-ladder backend error. A fast-tier backend error falls back to the ladder's safe rung at request time:
  - general: fastlane error → Claude balanced (existing, app.py:739).
  - IP: scalarlm-fast error → scalarlm (standard). Same shape, and IP-safe because both are private. So a flaky 2B endpoint degrades novel-branch latency, never availability: the standard Gemma is always a valid destination for novel content.
Quota fallback (general exhausted → ip). When a user burns the general bucket, general traffic falls back to ScalarLM (existing). With the IP ladder this fallback now enters the IP ladder at the difficulty-scored rung (reusing the scores already computed) instead of hardcoding the big model — an easy general request that spills to IP lands on the fast 2B model. The IP invariant is untouched (the fallback only ever goes general→ip).
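The within-ladder fallback has the same shape on both branches, which a small sketch makes visible. Everything here is illustrative: `BackendError`, `SAFE_RUNG`, and `send_with_fallback` are hypothetical names, and `send` is a caller-supplied callable standing in for the real backend dispatch.

```python
from typing import Callable, Tuple


class BackendError(Exception):
    """Hypothetical error type for a failed backend call."""


# Safe rung per branch, as named in the text. Both stay on their own ladder,
# so the IP fallback is private to private.
SAFE_RUNG = {"general": "balanced", "ip": "standard"}


def send_with_fallback(branch: str, tier: str,
                       send: Callable[[str], str]) -> Tuple[str, str]:
    """Try `tier`; on a fast-tier error retry once on the branch's safe rung."""
    try:
        return tier, send(tier)
    except BackendError:
        if tier != "fast":
            raise  # only the fast rung has a fallback in this sketch
        safe = SAFE_RUNG[branch]
        return safe, send(safe)
```

The returned tier lets the caller record which rung actually served the request in the headers and audit log.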

Headers & audit

Split-Brain-Tier already exists; it now applies on both branches (fast | balanced | deep for general, fast | standard for IP).
Add Split-Brain-Branch: general | ip so a reader can tell which ladder a fast tier came from without decoding the backend.
Audit fields tier, difficulty_score, stuck_score, chosen_backend, backend_model already capture everything; the branch is derivable from chosen_backend (and explicit once the header field is recorded). This keeps the Phase-3 flywheel (model-tiering.md) intact — now it can also learn "cheapest tier that resolved it" on the IP branch.
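As a small illustration of the two headers, the helper below builds them and rejects a tier that does not exist on the named branch. `routing_headers` and `VALID_TIERS` are hypothetical; the header names and the allowed values come from the text.

```python
# Allowed tier values per branch, as listed above.
VALID_TIERS = {
    "general": ("fast", "balanced", "deep"),
    "ip": ("fast", "standard"),
}


def routing_headers(branch: str, tier: str) -> dict:
    """Build the branch and tier response headers for one routed request."""
    if tier not in VALID_TIERS.get(branch, ()):
        raise ValueError(f"tier {tier!r} is not on the {branch!r} ladder")
    return {"Split-Brain-Branch": branch, "Split-Brain-Tier": tier}
```

With both headers present, `fast` on the IP branch and `fast` on the general branch are distinguishable without parsing the backend model header.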

Configuration

Setting	Where	Notes
`ladders` (per-branch tiers/order/policy)	`settings.json` (admin-tunable)	generalizes today's `model_tiers`/`tier_order`/`tier_policy`; old single-ladder form read as `general`
`SCALARLM_FAST_BASE_URL`	router env	`https://gemma4e2b.scalarllm.com`
`SCALARLM_FAST_API_KEY`	router env	empty (no token)
`SCALARLM_FAST_MODEL`	router env	`auto` (discover Gemma 4 E2B from `/v1/models`)
`TIER_IP_BASE` / `TIER_IP_ESCALATE`	router env	seed IP-ladder policy (`fast` / `standard`)
`ROUTER_TOKEN_WEIGHT_SCALARLM`	router env	per-token weight of the big Gemma (default `2.0`); the small Gemma uses `default` (1.0)

Existing FASTLANE_* and TIER_* settings are unchanged; the new ones mirror them.
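Reading the new router env settings might look like the sketch below. The `FastIpConfig` dataclass and `load_fast_ip_config` function are hypothetical, and treating the table's values as defaults is an assumption. The base URL defaults to empty so that an unset variable leaves the scalarlm-fast backend unconfigured.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FastIpConfig:
    base_url: str
    api_key: str
    model: str
    ip_base: str
    ip_escalate: str
    scalarlm_weight: float


def load_fast_ip_config(env: Optional[Mapping[str, str]] = None) -> FastIpConfig:
    """Read the settings from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env
    return FastIpConfig(
        base_url=env.get("SCALARLM_FAST_BASE_URL", ""),
        api_key=env.get("SCALARLM_FAST_API_KEY", ""),       # empty means no token
        model=env.get("SCALARLM_FAST_MODEL", "") or "auto",  # discover from /v1/models
        ip_base=env.get("TIER_IP_BASE", "fast"),
        ip_escalate=env.get("TIER_IP_ESCALATE", "standard"),
        scalarlm_weight=float(env.get("ROUTER_TOKEN_WEIGHT_SCALARLM", "2.0")),
    )
```

Passing a plain dict instead of the process environment keeps the loader easy to exercise in tests.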

Why this is the right generalization (not just "one more backend")

The scorer is already branch-free. score_difficulty / score_stuck read the payload, not the destination. They were built to be reused; this proposal just stops throwing the IP branch's scores away.
choose_tier is already ladder-generic. No new selection logic — only a second ladder fed to the same function.
Symmetry makes the system legible. "Novelty picks the ladder, difficulty picks the rung" is one sentence that describes all four backends. New private or external models become new rungs, not new special cases in app.py.

Open questions

IP-ladder thresholds. Should novel traffic escalate on the same difficulty_tau/stuck_tau as general, or be more eager to use the big Gemma (proprietary work may be higher-stakes)? Per-ladder policy already allows divergence; pick the defaults empirically.
Tool-use parity of Gemma 4 E2B. Same caveat as the general fast tier: if the 2B model handles Claude Code's agentic tool-use format poorly, scope its fast rung to plain-chat novel turns and start agentic novel turns at standard. Needs a per-model capability flag.
Bucket derivation. Should bucket_for_backend switch from a name match to a branch/ladder lookup (cleaner, future-proof), or just extend the name match?

Implementation notes (Phase 1, shipped)

Where the shipped code differs from the proposal above:

Anthropic ingress only. Tiering (both ladders) runs on /v1/messages, exactly like the pre-existing general tiering. The OpenAI ingress (/v1/chat/completions) is untiered — novel traffic there goes to the default ScalarLM (standard). Extending tiering to the OpenAI ingress is future work.
Quota fallback enters at standard, not the scored rung. A general→ip quota fallback routes to the standard Gemma (conservative, IP-safe), rather than re-running the scorer to pick the IP rung. The primary novel/uncertain path is fully tiered; only the quota-spill path is fixed to standard. The scored-rung version is a cheap follow-up.
Bucket by name match, not branch derivation. bucket_for_backend matches the IP-branch backends (scalarlm, scalarlm-fast) → ip. Branch-derived bucketing (open question 3) was deferred as unnecessary for two private backends.
Per-branch ladders. settings.json ladders.{general,ip} overrides the seed ladders; the legacy top-level model_tiers/tier_order/tier_policy still resolve as the general ladder (back-compat).
IP-safe last resort. If the IP fast tier is unconfigured, the ladder resolves within the IP branch (→ standard ScalarLM); the tier resolver's fallback target is branch-appropriate and never Claude.

Phased plan

Phase 1 — wire the backend + IP ladder. Add scalarlm-fast to build_backends, the per-branch ladders shape in SettingsStore (old form → general), seed IP-ladder env, widen the tier guard to the IP branch, thread compute-once difficulty/stuck into choose_tier, and fix bucket_for_backend. Add Split-Brain-Branch header + audit. Reuses the existing heuristic scorer — no training.
Phase 2 — calibrate. Tune the IP ladder's thresholds and (optionally) the scalarlm-fast weight from observed escalation/outcome rates.
Phase 3 — flywheel on both branches. Extend the Phase-3 outcome loop to learn the cheapest resolving tier on the IP branch too, not just general.