Per-user daily token limits — design

Status: Implemented. Two per-user daily token quotas enforced in the router — general (Claude + the fast tier), default 2,000,000/day, and ip (the private ScalarLM backend), default 20,000,000/day — plus a UI admin page (restricted to @smasint.com) to view usage and adjust per-user limits per bucket.

Goals

Each authenticated user (identified by their API token's owner email) gets two daily budgets, billed by the backend a request actually uses:
general — Claude + fastlane (external frontier models). Default 2M/day.
ip — ScalarLM (the private backend). Default 20M/day.
Fallback: when a user exhausts general, general traffic falls back to ScalarLM (the ip bucket) instead of failing — IP-safe, since the private backend is always a valid destination for general content. Only when the relevant bucket has no room (and no fallback applies) is it a 429.
Both defaults are configurable (chart values), overridable per user, per bucket from the admin page.

Two buckets and why

general meters spend on the models that cost real money / are rate-limited (Claude, Cerebras); ip meters the private backend we run, which is cheaper and higher-volume — hence the 10× larger default. Splitting them lets an operator cap external spend tightly while letting private-model usage run freely, and makes the general→ip fallback natural: running out of the expensive budget downgrades you to the private model rather than cutting you off. The bucket of a request is bucket_for_backend(chosen_backend): scalarlm → ip, everything else → general.
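The mapping is small enough to sketch directly. This is an illustrative version, not the code in usage.py; the lowercase/strip normalization and the exact backend string "scalarlm" are assumptions.

```python
def bucket_for_backend(backend: str) -> str:
    """Map a resolved backend name to its quota bucket.

    Assumption: the private backend is named "scalarlm"; every other
    backend (Claude, fastlane, anything unknown) bills the general bucket.
    """
    return "ip" if backend.strip().lower() == "scalarlm" else "general"
```

Defaulting unknown backends to general keeps the tighter budget in force for anything that is not explicitly the private backend.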

Where enforcement lives

The router is the only choke point that sees every request, the token owner, and the resolved backend (which determines the bucket), so enforcement lives there, not in the UI (which is only the admin surface). Two checks run before the backend call, and an accounting step runs after it:

Pre-classify precheck (_precheck_quota): reject 429 only when every bucket is exhausted — so even the general→ip fallback can't help. Cheap, avoids classifying for a fully-blocked user.
Post-routing check (_route_within_quota): once the backend is resolved (after classify + tiering), check that backend's bucket. If the general bucket is spent, fall back to ScalarLM when ip has room; otherwise 429 with Split-Brain-Quota-* headers. The fallback is recorded (quota_fallback in the audit, Split-Brain-Quota-Fallback: general->ip header).
Post-response accounting (_record_usage): add the request's tokens to the bucket of the backend actually used (rec.chosen_backend, kept current across every fallback), so a general→ip fallback correctly bills ip.

A request that crosses a threshold is allowed to finish (we only know its exact size afterward), so a user can overshoot by at most one in-flight request per pod — acceptable for a cost/fairness control. The IP invariant is never weakened: the fallback only ever goes general→ip (toward the private backend); novel/ip traffic over quota is a hard 429 and is never retried on Claude.
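The decision logic described above can be sketched as two pure functions. This is a minimal illustration, not the router's code: the Decision type, the has_room helper, and the "used < limit" comparison are assumptions, and the real _precheck_quota and _route_within_quota operate on request state rather than plain dicts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    status: int                     # 200 = proceed, 429 = reject
    backend: Optional[str]          # backend to call when status == 200
    fallback: Optional[str] = None  # "general->ip" when a fallback happened

def has_room(used: float, limit: float) -> bool:
    # Per the design, a limit of 0 means unlimited.
    return limit == 0 or used < limit

def precheck(used: dict, limits: dict) -> bool:
    """Pre-classify check: reject only if every bucket is exhausted."""
    return any(has_room(used[b], limits[b]) for b in ("general", "ip"))

def route_within_quota(backend: str, used: dict, limits: dict) -> Decision:
    """Post-routing check on the resolved backend's bucket."""
    bucket = "ip" if backend == "scalarlm" else "general"
    if has_room(used[bucket], limits[bucket]):
        return Decision(200, backend)
    # The fallback only ever goes general -> ip, never the other way.
    if bucket == "general" and has_room(used["ip"], limits["ip"]):
        return Decision(200, "scalarlm", fallback="general->ip")
    return Decision(429, None)
```

With limits of 100 (general) and 1000 (ip), a Claude-bound request from a user at 100 general tokens is rerouted to ScalarLM, while a ScalarLM-bound request from a user at 1000 ip tokens gets a 429 with no retry elsewhere.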

Storage (shared PVC)

The router mounts the whole PVC at /var/split-brain. The per-pod ledger design (below) is what lets usage tracking scale to N router replicas on a ReadWriteMany volume. The current Civo deployment ships RWO block storage, so the router runs 1 replica (values-dev.yaml); the cross-pod sum is then a no-op and counts are exact. Two pieces of state:

Usage — per-pod daily ledgers (no write contention)

/var/split-brain/usage/{YYYY-MM-DD}/{pod}.json
    →  { "<email>": { "general": <tokens>, "ip": <tokens> }, ... }

This mirrors the audit log's per-pod pattern ({pod} from HOSTNAME/POD_NAME). Each pod owns its file (a single writer, so writes are safe). A pod's own counts live in memory and are flushed to its file after each request; on startup it reloads today's file, so a restart doesn't lose the count. Legacy single-int entries from before the split are read as the general bucket; they are at most one UTC day stale and roll over at midnight.

usage_today(email) = this pod's in-memory count + the sum of the other pods' files for today, read with a short TTL cache (~5 s) to bound IO. Staleness ≤ the cache TTL, so cross-pod overshoot is bounded by concurrent in-flight requests in that window. In dev (1 replica) it's exact.
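A compact sketch of the per-pod ledger and the cross-pod sum follows. It is an illustration under assumptions, not the UsageTracker in usage.py: the class name LedgerSketch, its constructor arguments, and the non-atomic write are all hypothetical, and legacy single-int entries are not handled here.

```python
import json
import time
from pathlib import Path

BUCKETS = ("general", "ip")

class LedgerSketch:
    def __init__(self, root, pod, day, ttl=5.0):
        self.dir = Path(root) / day          # usage/{YYYY-MM-DD}/
        self.pod = pod
        self.ttl = ttl
        self.own = {}                        # this pod's in-memory counts
        self._others, self._others_at = {}, None
        own_file = self.dir / f"{pod}.json"
        if own_file.exists():                # reload today's file on startup
            self.own = json.loads(own_file.read_text())

    def add(self, email, bucket, tokens):
        entry = self.own.setdefault(email, {})
        entry[bucket] = entry.get(bucket, 0) + tokens
        self.dir.mkdir(parents=True, exist_ok=True)
        # Only this pod ever writes this file (single writer).
        (self.dir / f"{self.pod}.json").write_text(json.dumps(self.own))

    def _read_others(self):
        now = time.monotonic()
        if self._others_at is not None and now - self._others_at < self.ttl:
            return self._others              # still fresh: skip the IO
        totals = {}
        if self.dir.is_dir():
            for path in self.dir.glob("*.json"):
                if path.stem == self.pod:
                    continue
                for email, buckets in json.loads(path.read_text()).items():
                    entry = totals.setdefault(email, {})
                    for bucket, n in buckets.items():
                        entry[bucket] = entry.get(bucket, 0) + n
        self._others, self._others_at = totals, now
        return totals

    def usage_today(self, email):
        mine = self.own.get(email, {})
        theirs = self._read_others().get(email, {})
        return {b: mine.get(b, 0) + theirs.get(b, 0) for b in BUCKETS}
```

Because the other pods' files are only re-read once the TTL expires, a pod's view of its peers can lag by up to ttl seconds, which is the staleness bound stated above.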

The day key is UTC calendar day; at 00:00 UTC a new directory starts and usage effectively resets. Old day-dirs can be pruned lazily (optional).
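For illustration, the day key can be derived as below. The function name day_key is hypothetical; the only thing taken from the design is that the key is the UTC calendar day in YYYY-MM-DD form.

```python
from datetime import datetime, timezone
from typing import Optional

def day_key(now: Optional[datetime] = None) -> str:
    """Return the UTC calendar day used as the usage directory name.

    `now` must be timezone-aware; it is converted to UTC first so that a
    pod running in another timezone still rolls over at 00:00 UTC.
    """
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d")
```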

Limits — one small file, UI writes, router reads

/var/split-brain/limits/limits.json  →  {
    "default":   { "general": 2000000, "ip": 20000000 },
    "overrides": { "<email>": { "general": <int>, "ip": <int> } }
}

The UI (replicaCount 1, single writer) writes it from the admin page; the router reads it with a short TTL cache. limit_for(email, bucket) = overrides[email][bucket] if present, else the file default[bucket], else the config seed (ROUTER_{GENERAL,IP}_DAILY_TOKEN_LIMIT). A bucket may be set independently; a missing bucket falls back per-bucket. A limit of 0 means unlimited. (Legacy single-int default/override values are read as the general limit.)
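The lookup order can be sketched as follows. This is an illustrative function, not the LimitStore implementation: it takes the parsed limits.json and the config seed as plain dicts, which is an assumption about how the pieces are passed around.

```python
def limit_for(email: str, bucket: str, limits_file: dict, seed: dict) -> int:
    """Resolve a limit: per-user override, then file default, then config seed."""
    override = limits_file.get("overrides", {}).get(email)
    if isinstance(override, int):        # legacy single-int value
        override = {"general": override}
    if override and bucket in override:
        return override[bucket]

    default = limits_file.get("default", {})
    if isinstance(default, int):         # legacy single-int value
        default = {"general": default}
    if bucket in default:
        return default[bucket]

    # seed holds the ROUTER_{GENERAL,IP}_DAILY_TOKEN_LIMIT values.
    return seed[bucket]
```

A user with only an ip override therefore still gets the file default (or the seed) for general, since each bucket falls back independently.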

Mounts

The router already mounts the whole PVC RW, so it reads/writes both paths directly. The UI mounts subPaths, so it gains two new ones: usage (RO, for the admin page's usage view) and limits (RW, to persist limit changes) — same pattern as tokens/labels.

What counts / window / scope (defaults)

Counts (cost units, not raw tokens): the quota bills weight(model) × (uncached + cache_mult × cached), where uncached = prompt_tokens + completion_tokens (minus cache reads — see below) and cached is the prompt-cache read count. A heavier model burns the bucket faster:

Model	Per-token weight
Opus	5
Sonnet	3
Haiku	1
ScalarLM / fastlane / unknown (`default`)	1

Cached (prompt-cache read) tokens cost cache_mult (default 0.1, i.e. 10%) of an uncached token at the same model weight.

Cache-read handling: Anthropic reports input_tokens excluding cache reads, so on the Claude path cached is billed in addition to prompt. OpenAI/ScalarLM fold cache reads into prompt_tokens, so they're subtracted out first (max(0, prompt - cached)) before billing the 10% rate, so cached tokens are never billed twice. This lives in Pricing.billed (usage.py); the resulting cost is stamped on the audit record as billed_tokens.
Window: UTC calendar day, reset at 00:00 UTC.
Scope: every routed request counts toward general (Claude/fastlane) or ip (ScalarLM), according to the backend used.
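A worked sketch of the billing formula may help. It is not Pricing.billed itself: the function signature, the cache_in_prompt flag, and the way a model is mapped to a weight key are assumptions made for illustration; the weights and multiplier are the documented defaults.

```python
WEIGHTS = {"opus": 5.0, "sonnet": 3.0, "haiku": 1.0, "default": 1.0}
CACHE_MULT = 0.1

def billed(model_family: str, prompt: int, completion: int,
           cached: int, cache_in_prompt: bool) -> float:
    """weight(model) * (uncached + cache_mult * cached).

    cache_in_prompt=True models the OpenAI/ScalarLM convention, where
    prompt already includes cache reads; False models the Anthropic
    convention, where cache reads are reported separately.
    """
    weight = WEIGHTS.get(model_family, WEIGHTS["default"])
    if cache_in_prompt:
        prompt = max(0, prompt - cached)   # avoid billing cached tokens twice
    uncached = prompt + completion
    return weight * (uncached + CACHE_MULT * cached)
```

For example, a Sonnet request with 1,000 prompt tokens, 200 completion tokens, and 4,000 cache reads bills 3 × (1,200 + 0.1 × 4,000) = 4,800 units. A ScalarLM request reporting 5,000 prompt tokens (4,000 of them cached) and 200 completion tokens bills 1 × (1,200 + 400) = 1,600 units.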

Admin page

Route: GET /admin (UI), gated by require_identity and an admin check: email's domain ∈ UI_ADMIN_DOMAINS (default smasint.com). Non-admins get 403; the nav shows Admin only to admins.
View: a table of users (union of token owners + anyone with usage today) with today's tokens / effective limit / % used per bucket (general and ip side by side). Plus the current general and ip global defaults.
Edit: POST /admin/limits with a bucket field: set a per-user, per-bucket override (or reset to clear the whole user back to defaults) and set each global default. Writes limits.json.
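The admin gate reduces to a domain comparison, sketched below. The function name is_admin and the case-insensitive match are assumptions; the sketch takes the allowed domains as an argument rather than guessing how UI_ADMIN_DOMAINS is parsed.

```python
def is_admin(email: str, admin_domains=("smasint.com",)) -> bool:
    """True if the email's domain is one of the admin domains."""
    # Split on the last "@" so only the real domain part is compared;
    # a substring or suffix match would let "smasint.com.evil.org" through.
    local, sep, domain = email.rpartition("@")
    if not (sep and local):
        return False
    return domain.lower() in {d.lower() for d in admin_domains}
```

A handler would return 403 when this is False, and the nav template would use the same check to decide whether to show the Admin link.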

Router surface

usage.py: UsageTracker (per-pod, per-bucket ledger + cross-pod sum + TTL cache), LimitStore (cached limits.json, per-bucket), and bucket_for_backend. Wired into app.state.
Both ingresses + streaming paths: _precheck_quota (all-buckets-exhausted early 429), _route_within_quota (post-routing per-bucket check + general→ip fallback), and _record_usage (bills the used backend's bucket).
429 carries Split-Brain-Quota-Bucket, -Limit, -Used, -Reset (seconds to next UTC midnight). A fallback adds Split-Brain-Quota-Fallback: general->ip to the (200) response.
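The 429 headers can be assembled as in this sketch. The helper names are hypothetical, and truncating the reset value to whole seconds is an assumption; the header names follow the list above.

```python
from datetime import datetime, timedelta, timezone

def seconds_to_utc_midnight(now: datetime) -> int:
    """Whole seconds from `now` (timezone-aware) to the next 00:00 UTC."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return int((next_midnight - now).total_seconds())

def quota_headers(bucket: str, limit: int, used: float, now: datetime) -> dict:
    """Headers attached to a 429 quota rejection."""
    return {
        "Split-Brain-Quota-Bucket": bucket,
        "Split-Brain-Quota-Limit": str(limit),
        "Split-Brain-Quota-Used": str(used),
        "Split-Brain-Quota-Reset": str(seconds_to_utc_midnight(now)),
    }
```

A client can treat Split-Brain-Quota-Reset like a Retry-After value for that bucket, since usage resets at the next UTC midnight.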

Config

Setting	Service	Default
`ROUTER_GENERAL_DAILY_TOKEN_LIMIT`	router	`2000000`
`ROUTER_IP_DAILY_TOKEN_LIMIT`	router	`20000000`
`ROUTER_TOKEN_WEIGHT_OPUS`	router	`5.0`
`ROUTER_TOKEN_WEIGHT_SONNET`	router	`3.0`
`ROUTER_TOKEN_WEIGHT_HAIKU`	router	`1.0`
`ROUTER_TOKEN_WEIGHT_DEFAULT`	router	`1.0`
`ROUTER_CACHED_TOKEN_MULTIPLIER`	router	`0.1`
`ROUTER_USAGE_DIR`	router	`/var/split-brain/usage`
`ROUTER_LIMITS_PATH`	router	`/var/split-brain/limits/limits.json`
`UI_USAGE_DIR`	ui	`/var/split-brain/usage`
`UI_LIMITS_PATH`	ui	`/var/split-brain/limits/limits.json`
`UI_ADMIN_DOMAINS`	ui	`smasint.com`

Both router limits are set in the chart (router.config.generalDailyTokenLimit / ipDailyTokenLimit) → the router configmap, as are the model weights (router.config.tokenWeight{Opus,Sonnet,Haiku,Default} / cachedTokenMultiplier).

Failure mode

If the usage store can't be read (transient FS error), the pre-flight check fails open (allows the request) and logs — a storage glitch shouldn't take down all routing. This is a cost/fairness control, not the IP invariant, so fail-open is the right bias (unlike the classifier, which fails closed). Writes that fail are logged and retried on the next request.
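The fail-open bias can be sketched as a thin wrapper around the usage read. The names below are hypothetical, and catching only OSError is an assumption about what a transient FS error looks like in practice.

```python
import logging

log = logging.getLogger("quota")

def quota_allows(read_usage, email: str, bucket: str, limit: int) -> bool:
    """Return True if the request may proceed.

    `read_usage(email)` is assumed to return {"general": n, "ip": n} and
    to raise OSError when the usage store cannot be read.
    """
    try:
        used = read_usage(email)[bucket]
    except OSError as exc:
        # Fail open: a storage glitch must not block all routing.
        log.warning("usage store unreadable, allowing request: %s", exc)
        return True
    return limit == 0 or used < limit
```

The classifier takes the opposite stance (fail closed) because it guards the IP invariant, whereas this check only guards cost and fairness.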

Testing

UsageTracker: add/sum within a pod; cross-pod sum from multiple files; day rollover; reload-on-restart; TTL cache.
LimitStore: default, per-user override, 0=unlimited, missing file.
Router: pre-flight 429 when over; tokens recorded after a request; both ingresses + streaming; fail-open on unreadable usage.
UI admin: non-admin → 403; admin sees the table; POST /admin/limits sets an override + default and re-renders; usage reflected from ledgers.

Resolved design notes

A user's own probe/UI traffic does not count — only router API traffic has token usage; the UI probe never hits a backend.
Old usage/{date} dirs are left in place (no retention sweep yet); add one later if they grow.