Architecture
Goal
Route LLM inference traffic between two backends based on the novelty of the request:
- Claude (Anthropic's best current model) handles requests whose content is general-purpose — knowledge and tasks well represented on the public internet.
- ScalarLM (a privately hosted open-weights model fine-tuned on proprietary data) handles requests that are novel — terminology, patterns, or knowledge not well covered by public pre-training data.
The router presents a single OpenAI-compatible endpoint to clients, exposed to the internet through a Cloudflare tunnel so that the Kubernetes cluster does not need public ingress.
Components
                                                      +-------------------+
                                                      |   Anthropic API   |
                                                      |     (Claude)      |
                                                      +---------+---------+
                                                                ^
                                                                | HTTPS (egress)
                                                                |
    +---------+    HTTPS    +------------+    HTTP    +---------+---------+
    | client  +------------>+ cloudflared+----------->+ router (FastAPI)  |
    +---------+  (public)   +------------+ (in-mesh)  |                   |
                              tunnel pod              | 1. parse request  |
                                                      | 2. call classifier|
                                                      | 3. pick backend   |
                                                      | 4. stream reply   |
                                                      +---------+---------+
                                                                |
                                                                | HTTP
                                                                v
                                                      +-------------------+
                                                      | classifier (svc)  |
                                                      | small encoder LM  |
                                                      +---------+---------+
                                                                |
                                                                | label
                                                                v
                                                            (router)
                                                                |
                                                                | HTTPS (if novel)
                                                                v
                                                      +-------------------+
                                                      |     ScalarLM      |
                                                      |  (external svc)   |
                                                      +-------------------+
ScalarLM and the Anthropic API are both external to this chart.
ScalarLM ships its own Helm chart and is deployed independently (likely
by a different team, possibly in a different namespace or cluster); we
configure SCALARLM_BASE_URL and a credential. Anthropic is a SaaS.
The four pieces this project deploys — router, classifier, ui,
cloudflared — live in a single Kubernetes namespace (split-brain)
and never get a NodePort or a public LoadBalancer; cloudflared is
the only inbound path from the internet.
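Because both backends are external to the chart, the router only needs a base URL and credentials for them at startup. A minimal sketch of how that configuration might be loaded is shown here. SCALARLM_BASE_URL comes from this document; the variable names SCALARLM_API_KEY and ANTHROPIC_API_KEY, the BackendConfig type, and the load_backend_config helper are hypothetical and used only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendConfig:
    scalarlm_base_url: str  # from SCALARLM_BASE_URL (named in this document)
    scalarlm_api_key: str   # hypothetical variable name for the ScalarLM credential
    anthropic_api_key: str  # hypothetical variable name for the Anthropic credential


def load_backend_config(env=None) -> BackendConfig:
    """Read backend settings from the environment, failing fast if any are missing."""
    env = os.environ if env is None else env
    required = ("SCALARLM_BASE_URL", "SCALARLM_API_KEY", "ANTHROPIC_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Refuse to start rather than run with a half-configured backend.
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return BackendConfig(
        # Strip a trailing slash so later path joins are predictable.
        scalarlm_base_url=env["SCALARLM_BASE_URL"].rstrip("/"),
        scalarlm_api_key=env["SCALARLM_API_KEY"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
    )
```

Failing at startup keeps a misconfigured pod out of rotation instead of surfacing the problem on the first routed request.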
Request flow
- Client sends an OpenAI-style POST /v1/chat/completions to https://router.<domain> (resolved to the Cloudflare tunnel).
- cloudflared forwards the request to the in-cluster router Service.
- The router extracts the user message and (if present) recent conversation history.
- The router calls the classifier Service synchronously and gets back a label {general, novel} plus a confidence score.
- The router selects a backend:
  - general and confidence ≥ τ → Claude
  - novel and confidence ≥ τ → ScalarLM
  - confidence < τ → policy-defined default (see classifier.md)
- The router translates the request to the backend's native protocol (Anthropic Messages API for Claude, OpenAI-compatible for ScalarLM) and streams tokens back to the client unchanged.
- The router records the decision, latency, and token counts to the per-request audit log (per-pod JSONL on the shared PVC) and emits structured logs for warnings/errors.
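The backend-selection step in the flow above can be sketched as a small pure function. This is an illustration under stated assumptions, not the router's actual code: the Backend and Classification types and the select_backend name are hypothetical, confidence is assumed to lie in [0, 1], and the low-confidence default is assumed to be ScalarLM (the real policy is defined in classifier.md).

```python
from dataclasses import dataclass
from enum import Enum


class Backend(str, Enum):
    CLAUDE = "claude"
    SCALARLM = "scalarlm"


@dataclass(frozen=True)
class Classification:
    label: str         # "general" or "novel"
    confidence: float  # assumed to lie in [0, 1]


def select_backend(
    result: Classification,
    tau: float,
    low_confidence_default: Backend = Backend.SCALARLM,  # assumed default policy
) -> Backend:
    """Map a classifier result to a backend using the threshold tau."""
    if result.label not in ("general", "novel"):
        raise ValueError(f"unknown label: {result.label!r}")
    if result.confidence < tau:
        # Below the threshold the label is not trusted; use the policy default.
        return low_confidence_default
    return Backend.CLAUDE if result.label == "general" else Backend.SCALARLM
```

For example, with a hypothetical threshold of 0.8, select_backend(Classification("general", 0.5), tau=0.8) returns Backend.SCALARLM, because the classifier is not confident enough for the general label to be trusted.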
Design constraints
- No public ingress on the cluster. Cloudflare tunnel is the only path in from the internet. The router never gets a NodePort or a cloud LB.
- Classifier latency budget: 50 ms p99. Any slower and we erode the value of routing.
- Proprietary content (IP) never reaches Claude. Anything classified novel (or uncertain) is served by ScalarLM or by no backend at all; it is never silently downgraded to Claude. If the classifier or ScalarLM is unhealthy, the router returns 5xx; it does not fall back to Claude. This is the dominant safety invariant; everything else here is subordinate to it. See classifier.md and the audit log integration in router.md.
- All traffic logged. Every request gets a per-request audit record (full prompt, full response, classifier decision, backend, latency) on the shared PVC. See router.md.
- Stateless router. No session affinity, no in-memory caches that affect correctness. Horizontal scaling is the only knob.
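The fail-closed invariant above (novel or uncertain traffic goes to ScalarLM or to no backend at all) can be made concrete with a short sketch. The function name route_fail_closed, its callable parameters, and the choice of 503 as the specific 5xx status are assumptions for illustration; the document only specifies that the router returns 5xx and never falls back to Claude.

```python
from typing import Callable, Optional, Tuple


def route_fail_closed(
    classify: Callable[[str], Tuple[str, float]],  # returns (label, confidence)
    scalarlm_healthy: Callable[[], bool],
    prompt: str,
    tau: float,
) -> Tuple[int, Optional[str]]:
    """Return (http_status, backend); backend is None when the request is refused."""
    try:
        label, confidence = classify(prompt)
    except Exception:
        # Classifier unavailable: refuse rather than guess that the prompt is general.
        return 503, None
    if label == "general" and confidence >= tau:
        return 200, "claude"
    # Everything else (novel, low confidence, unexpected label) is treated as sensitive.
    if not scalarlm_healthy():
        # ScalarLM is down: refuse. Claude is deliberately not a fallback here.
        return 503, None
    return 200, "scalarlm"
```

The only path that returns "claude" requires a successful, confident general classification, so every failure mode ends in ScalarLM or a refusal.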
Out of scope (initial version)
- Multi-region failover.
- Token-level billing / metering per client.
- Rate limiting beyond what Cloudflare provides at the edge.
- Online learning loop back into ScalarLM (the platform supports it; we will wire it up in a later iteration).
Document map
- router.md — router service internals and API surface.
- classifier.md — classifier model, training data, decision policy.
- kubernetes.md — manifests, namespaces, GPU scheduling, secrets.
- cloudflare-tunnel.md — tunnel setup, ingress rules, DNS.
- docker.md — image inventory and build process.
- ui.md — internal operator and ML-engineer UI.
- cli.md — ./split-brain operator CLI.