Per-user daily token limits — design
Status: Implemented. Two per-user daily token quotas enforced in the
router — general (Claude + the fast tier), default 2,000,000/day, and
ip (the private ScalarLM backend), default 20,000,000/day — plus a UI
admin page (restricted to @smasint.com) to view usage and adjust per-user
limits per bucket.
Goals
- Each authenticated user (identified by their API token's owner email) gets two daily budgets, billed by the backend a request actually uses:
  - general: Claude + fastlane (external frontier models). Default 2M/day.
  - ip: ScalarLM (the private backend). Default 20M/day.
- Fallback: when a user exhausts general, general traffic falls back to ScalarLM (the ip bucket) instead of failing — IP-safe, since the private backend is always a valid destination for general content. Only when the relevant bucket has no room (and no fallback applies) is it a 429.
- Both defaults are configurable (chart values), overridable per user, per bucket from the admin page.
Two buckets and why
general meters spend on the models that cost real money / are rate-limited
(Claude, Cerebras); ip meters the private backend we run, which is cheaper
and higher-volume — hence the 10× larger default. Splitting them lets an
operator cap external spend tightly while letting private-model usage run
freely, and makes the general→ip fallback natural: running out of the
expensive budget downgrades you to the private model rather than cutting you
off. The bucket of a request is bucket_for_backend(chosen_backend):
scalarlm → ip, everything else → general.
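The mapping above is small enough to sketch directly. This is a minimal illustration, not the code in usage.py: the function name comes from this document, but the body, the literal backend name "scalarlm", and the case/whitespace normalization are assumptions.

```python
def bucket_for_backend(chosen_backend: str) -> str:
    """Map a resolved backend name to the quota bucket it bills."""
    # Assumption: the private backend is identified by the name "scalarlm";
    # every other backend (Claude, fastlane, unknown) bills "general".
    if chosen_backend.strip().lower() == "scalarlm":
        return "ip"
    return "general"
```

Defaulting unknown backends to general keeps the cheaper, larger ip budget reserved for traffic that really did run on the private backend.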
Where enforcement lives
The router is the only choke point that sees every request, the token owner, and the resolved backend (which determines the bucket), so enforcement lives there, not in the UI (the admin surface). Two checks, followed by an accounting step:
- Pre-classify precheck (`_precheck_quota`): reject with 429 only when every bucket is exhausted, so even the general→ip fallback can't help. Cheap, and avoids classifying for a fully blocked user.
- Post-routing check (`_route_within_quota`): once the backend is resolved (after classify + tiering), check that backend's bucket. If the `general` bucket is spent, fall back to ScalarLM when `ip` has room; otherwise 429 with `Split-Brain-Quota-*` headers. The fallback is recorded (`quota_fallback` in the audit, `Split-Brain-Quota-Fallback: general->ip` header).
- Post-response accounting (`_record_usage`): add the request's tokens to the bucket of the backend actually used (`rec.chosen_backend`, kept current across every fallback), so a general→ip fallback correctly bills `ip`.
A request that crosses a threshold is allowed to finish (we only know its
exact size afterward), so a user can overshoot by at most one in-flight
request per pod — acceptable for a cost/fairness control. The IP invariant
is never weakened: the fallback only ever goes general→ip (toward the private
backend); novel/ip traffic over quota is a hard 429 and is never retried on
Claude.
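The post-routing decision and its one-way fallback can be sketched as a pure function. This is a hedged illustration under stated assumptions: `Decision`, `has_room`, and the dict-shaped `used`/`limits` arguments are hypothetical and do not describe the real `_route_within_quota` signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    backend: Optional[str]          # backend to call, or None when rejected
    status: int                     # 200 to proceed, 429 to reject
    fallback: Optional[str] = None  # "general->ip" when the fallback fired


def has_room(used: float, limit: float) -> bool:
    # A limit of 0 means unlimited; otherwise any usage below the limit
    # admits the request, which may then overshoot by its own size.
    return limit == 0 or used < limit


def route_within_quota(backend: str, used: dict, limits: dict) -> Decision:
    bucket = "ip" if backend == "scalarlm" else "general"
    if has_room(used[bucket], limits[bucket]):
        return Decision(backend, 200)
    # The fallback only ever goes general -> ip, toward the private backend.
    if bucket == "general" and has_room(used["ip"], limits["ip"]):
        return Decision("scalarlm", 200, fallback="general->ip")
    # ip over quota (or both buckets spent): hard 429, never retried on Claude.
    return Decision(None, 429)
```

Because there is no branch that moves ip traffic to another backend, the IP invariant holds by construction in this sketch.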
Storage (shared PVC)
The router mounts the whole PVC at /var/split-brain. The per-pod ledger
design (below) is what lets usage tracking scale to N router replicas on
a ReadWriteMany volume. The current Civo deployment ships RWO block
storage, so the router runs 1 replica (values-dev.yaml); the cross-pod sum
is then a no-op and counts are exact. Two pieces of state:
Usage — per-pod daily ledgers (no write contention)
/var/split-brain/usage/{YYYY-MM-DD}/{pod}.json
→ { "<email>": { "general": <tokens>, "ip": <tokens> }, ... }
(Legacy single-int entries from before the split are read as the general
bucket — at most one UTC day stale, rolled over at midnight.) Mirrors the audit
log's per-pod pattern ({pod} from HOSTNAME/POD_NAME).
Each pod owns its file (single writer → safe). A pod's own counts live in
memory and are flushed to its file after each request; on startup it reloads
today's file (so a restart doesn't lose the count).
usage_today(email) = this pod's in-memory count + the sum of the
other pods' files for today, read with a short TTL cache (~5 s) to bound
IO. Staleness ≤ the cache TTL, so cross-pod overshoot is bounded by
concurrent in-flight requests in that window. In dev (1 replica) it's exact.
The day key is UTC calendar day; at 00:00 UTC a new directory starts and usage effectively resets. Old day-dirs can be pruned lazily (optional).
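The cross-pod sum described above might look like the following. It is a sketch only: the helper names, the argument order, and the choice to treat an unreadable or malformed ledger as empty are assumptions, and the ~5 s TTL cache is omitted for brevity.

```python
import json
from pathlib import Path

BUCKETS = ("general", "ip")


def normalize(entry) -> dict:
    """Coerce a ledger entry to {"general": n, "ip": n}."""
    if isinstance(entry, dict):
        return {b: entry.get(b, 0) for b in BUCKETS}
    if isinstance(entry, (int, float)):
        # Legacy single-int entry: read as the general bucket.
        return {"general": entry, "ip": 0}
    return {b: 0 for b in BUCKETS}


def read_ledger(path: Path) -> dict:
    """Read one pod's ledger; unreadable or malformed files count as empty."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def usage_today(usage_dir, day: str, pod: str, own_counts: dict, email: str) -> dict:
    """This pod's in-memory counts plus every other pod's file for the day."""
    total = normalize(own_counts.get(email))
    day_dir = Path(usage_dir) / day
    if day_dir.is_dir():
        for ledger in day_dir.glob("*.json"):
            if ledger.stem == pod:
                continue  # own counts come from memory, not from disk
            other = normalize(read_ledger(ledger).get(email))
            for bucket in BUCKETS:
                total[bucket] += other[bucket]
    return total
```

Skipping the pod's own file avoids double counting, since that file is only a flushed copy of the in-memory counts.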
Limits — one small file, UI writes, router reads
/var/split-brain/limits/limits.json → {
"default": { "general": 2000000, "ip": 20000000 },
"overrides": { "<email>": { "general": <int>, "ip": <int> } }
}
The UI (replicaCount 1, single writer) writes it from the admin page; the
router reads it with a short TTL cache. limit_for(email, bucket) =
overrides[email][bucket] if present, else the file default[bucket], else
the config seed (ROUTER_{GENERAL,IP}_DAILY_TOKEN_LIMIT). A bucket may be set
independently; a missing bucket falls back per-bucket. A limit of 0 means
unlimited. (Legacy single-int default/override values are read as the
general limit.)
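The resolution order for `limit_for` can be sketched as below. The function takes the parsed `limits.json` and the config seed as plain dicts, which is an assumption made for illustration; the real `LimitStore` interface is not shown in this document.

```python
def limit_for(limits_file: dict, seed: dict, email: str, bucket: str) -> int:
    """Resolve a limit: per-user override, then file default, then config seed."""

    def pick(value):
        # Per-bucket dict form: a missing bucket falls through to the next level.
        if isinstance(value, dict):
            return value.get(bucket)
        # Legacy single-int form: applies to the general bucket only.
        if isinstance(value, int) and bucket == "general":
            return value
        return None

    override = pick(limits_file.get("overrides", {}).get(email))
    if override is not None:
        return override
    default = pick(limits_file.get("default"))
    if default is not None:
        return default
    # seed stands in for ROUTER_{GENERAL,IP}_DAILY_TOKEN_LIMIT.
    return seed[bucket]
```

Note that the checks use `is not None` rather than truthiness, so an explicit override of 0 (unlimited) is honored instead of falling through to the default.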
Mounts
The router already mounts the whole PVC RW, so it reads/writes both paths
directly. The UI mounts subPaths, so it gains two new ones: usage (RO,
for the admin page's usage view) and limits (RW, to persist limit
changes) — same pattern as tokens/labels.
What counts / window / scope (defaults)
- Counts (cost units, not raw tokens): the quota bills
  `weight(model) × (uncached + cache_mult × cached)`, where `uncached = prompt_tokens + completion_tokens` (minus cache reads; see below) and `cached` is the prompt-cache read count. A heavier model burns the bucket faster:

  | Model | Per-token weight |
  |---|---|
  | Opus | 5 |
  | Sonnet | 3 |
  | Haiku | 1 |
  | ScalarLM / fastlane / unknown (default) | 1 |

  Cached (prompt-cache read) tokens cost `cache_mult` (default 0.1, i.e. 10%)
  of an uncached token at the same model weight.
- Cache-read handling: Anthropic reports input_tokens excluding cache
reads, so on the Claude path cached is billed in addition to prompt.
OpenAI/ScalarLM fold cache reads into prompt_tokens, so they're subtracted
out first (max(0, prompt - cached)) before billing the 10% rate — never
billed twice. This lives in Pricing.billed (usage.py); the resulting
cost is stamped on the audit record as billed_tokens.
- Window: UTC calendar day, reset at 00:00 UTC.
- Scope: every routed request counts — toward general (Claude/fastlane)
or ip (ScalarLM), by the backend used.
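The billing formula and the two cache-read conventions can be worked through with a small function. This is a sketch of the arithmetic only, not the `Pricing.billed` implementation; the `cache_in_prompt` flag is a hypothetical way to distinguish the Anthropic convention from the OpenAI/ScalarLM one.

```python
def billed(weight: float, prompt: int, completion: int, cached: int,
           cache_mult: float = 0.1, cache_in_prompt: bool = False) -> float:
    """Cost units = weight * (uncached + cache_mult * cached)."""
    if cache_in_prompt:
        # OpenAI/ScalarLM style: prompt_tokens already includes cache reads,
        # so subtract them first to avoid billing them twice.
        prompt = max(0, prompt - cached)
    uncached = prompt + completion
    return weight * (uncached + cache_mult * cached)
```

For example, an Opus request (weight 5) with 1,000 prompt tokens, 200 completion tokens, and 4,000 cache reads reported separately bills 5 × (1,200 + 0.1 × 4,000) = 8,000 cost units. A ScalarLM request (weight 1) that reports 5,000 prompt tokens including those same 4,000 cache reads, plus 200 completion tokens, bills 1 × (1,000 + 200 + 400) = 1,600.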
Admin page
- Route: `GET /admin` (UI), gated by `require_identity` and an admin check: email's domain ∈ `UI_ADMIN_DOMAINS` (default `smasint.com`). Non-admins get 403; the nav shows Admin only to admins.
- View: a table of users (union of token owners + anyone with usage today) with today's tokens / effective limit / % used per bucket (general and ip side by side). Plus the current general and ip global defaults.
- Edit: `POST /admin/limits` with a `bucket` field: set a per-user, per-bucket override (or `reset` to clear the whole user back to defaults) and set each global default. Writes `limits.json`.
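The admin domain check might be implemented along these lines. The function name `is_admin` and the case-insensitive comparison are assumptions; only the rule itself (email domain must be in `UI_ADMIN_DOMAINS`) comes from this document.

```python
def is_admin(email: str, admin_domains=("smasint.com",)) -> bool:
    """True when the email's domain is one of the configured admin domains."""
    # Split on the last "@" so the whole trailing domain must match exactly;
    # a suffix match would wrongly admit "user@smasint.com.example.org".
    _, sep, domain = email.rpartition("@")
    if not sep:
        return False  # no "@" at all: not a usable email identity
    return domain.lower() in {d.lower() for d in admin_domains}
```

A handler would return 403 whenever this check fails, matching the behavior described for non-admins.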
Router surface
- `usage.py`: `UsageTracker` (per-pod, per-bucket ledger + cross-pod sum + TTL cache), `LimitStore` (cached `limits.json`, per-bucket), and `bucket_for_backend`. Wired into `app.state`.
- Both ingresses + streaming paths: `_precheck_quota` (all-buckets-exhausted early 429), `_route_within_quota` (post-routing per-bucket check + general→ip fallback), and `_record_usage` (bills the used backend's bucket).
- 429 carries `Split-Brain-Quota-Bucket`, `-Limit`, `-Used`, `-Reset` (seconds to next UTC midnight). A fallback adds `Split-Brain-Quota-Fallback: general->ip` to the (200) response.
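A sketch of how the 429 headers could be assembled follows. It assumes the shorthand `-Limit`, `-Used`, `-Reset` expands to the full `Split-Brain-Quota-` prefix and that values are sent as decimal strings; the helper names are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone


def seconds_to_utc_midnight(now: datetime) -> int:
    """Whole seconds until the next 00:00 UTC; `now` must be timezone-aware."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    # Round up so a client that waits this long never retries too early.
    return math.ceil((next_midnight - now).total_seconds())


def quota_headers(bucket: str, limit: int, used: int, now: datetime) -> dict:
    """Headers for a 429 quota rejection."""
    return {
        "Split-Brain-Quota-Bucket": bucket,
        "Split-Brain-Quota-Limit": str(limit),
        "Split-Brain-Quota-Used": str(used),
        "Split-Brain-Quota-Reset": str(seconds_to_utc_midnight(now)),
    }
```

Passing `now` explicitly keeps the function deterministic, which makes the day-rollover case easy to test.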
Config
| Setting | Service | Default |
|---|---|---|
| `ROUTER_GENERAL_DAILY_TOKEN_LIMIT` | router | 2000000 |
| `ROUTER_IP_DAILY_TOKEN_LIMIT` | router | 20000000 |
| `ROUTER_TOKEN_WEIGHT_OPUS` | router | 5.0 |
| `ROUTER_TOKEN_WEIGHT_SONNET` | router | 3.0 |
| `ROUTER_TOKEN_WEIGHT_HAIKU` | router | 1.0 |
| `ROUTER_TOKEN_WEIGHT_DEFAULT` | router | 1.0 |
| `ROUTER_CACHED_TOKEN_MULTIPLIER` | router | 0.1 |
| `ROUTER_USAGE_DIR` | router | `/var/split-brain/usage` |
| `ROUTER_LIMITS_PATH` | router | `/var/split-brain/limits/limits.json` |
| `UI_USAGE_DIR` | ui | `/var/split-brain/usage` |
| `UI_LIMITS_PATH` | ui | `/var/split-brain/limits/limits.json` |
| `UI_ADMIN_DOMAINS` | ui | `smasint.com` |
Both router limits are set in the chart (router.config.generalDailyTokenLimit
/ ipDailyTokenLimit) → the router configmap, as are the model weights
(router.config.tokenWeight{Opus,Sonnet,Haiku,Default} /
cachedTokenMultiplier).
Failure mode
If the usage store can't be read (transient FS error), the pre-flight check fails open (allows the request) and logs — a storage glitch shouldn't take down all routing. This is a cost/fairness control, not the IP invariant, so fail-open is the right bias (unlike the classifier, which fails closed). Writes that fail are logged and retried on the next request.
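The fail-open bias can be illustrated with a small precheck. This is a sketch under assumptions: `precheck_allows`, the `read_usage` callable, and the use of `OSError` to represent a transient FS error are all hypothetical stand-ins for the real `_precheck_quota` wiring.

```python
import logging

log = logging.getLogger("quota")


def precheck_allows(email: str, read_usage, limits: dict) -> bool:
    """Allow unless every bucket is exhausted; fail open on storage errors."""
    try:
        used = read_usage(email)  # expected shape: {"general": n, "ip": n}
    except OSError as exc:
        # Cost/fairness control, not the IP invariant: log and let it through.
        log.warning("usage store unreadable for %s, failing open: %s", email, exc)
        return True
    # Reject only when no bucket has room; a limit of 0 means unlimited.
    return any(
        limit == 0 or used.get(bucket, 0) < limit
        for bucket, limit in limits.items()
    )
```

Catching only `OSError` keeps programming errors visible while still tolerating a storage glitch.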
Testing
- `UsageTracker`: add/sum within a pod; cross-pod sum from multiple files; day rollover; reload-on-restart; TTL cache.
- `LimitStore`: default, per-user override, `0`=unlimited, missing file.
- Router: pre-flight 429 when over; tokens recorded after a request; both ingresses + streaming; fail-open on unreadable usage.
- UI admin: non-admin → 403; admin sees the table; `POST /admin/limits` sets an override + default and re-renders; usage reflected from ledgers.
Resolved design notes
- A user's own probe/UI traffic does not count — only router API traffic has token usage; the UI probe never hits a backend.
- Old `usage/{date}` dirs are left in place (no retention sweep yet); add one later if they grow.