Architecture

Goal

Route LLM inference traffic between two backends based on the novelty of the request:

Claude (Anthropic's best current model) handles requests whose content is general-purpose — knowledge and tasks well represented on the public internet.
ScalarLM (a privately hosted open-weights model fine-tuned on proprietary data) handles requests that are novel — terminology, patterns, or knowledge not well covered by public pre-training data.

The router presents a single OpenAI-compatible endpoint to clients, exposed to the internet through a Cloudflare tunnel so that the Kubernetes cluster does not need public ingress.

Components

                                                  +-------------------+
                                                  |  Anthropic API    |
                                                  |  (Claude)         |
                                                  +---------+---------+
                                                            ^
                                                            | HTTPS (egress)
                                                            |
+---------+    HTTPS    +------------+    HTTP    +---------+---------+
| client  +------------>+ cloudflared+----------->+ router (FastAPI)  |
+---------+   (public)  +------------+  (in-mesh) |                   |
                          tunnel pod              | 1. parse request  |
                                                  | 2. call classifier|
                                                  | 3. pick backend   |
                                                  | 4. stream reply   |
                                                  +---------+---------+
                                                            |
                                                            | HTTP
                                                            v
                                                  +-------------------+
                                                  | classifier (svc)  |
                                                  | small encoder LM  |
                                                  +---------+---------+
                                                            |
                                                            | label
                                                            v
                                                       (router)
                                                            |
                                                            | HTTPS (if novel)
                                                            v
                                                  +-------------------+
                                                  | ScalarLM          |
                                                  | (external svc)    |
                                                  +-------------------+

ScalarLM and the Anthropic API are both external to this chart. ScalarLM ships its own Helm chart and is deployed independently (likely by a different team, possibly in a different namespace or cluster); we configure SCALARLM_BASE_URL and a credential. Anthropic is a SaaS. The four pieces this project deploys — router, classifier, ui, cloudflared — live in a single Kubernetes namespace (split-brain) and never get a NodePort or a public LoadBalancer; cloudflared is the only inbound path from the internet.

Request flow

Client sends an OpenAI-style POST /v1/chat/completions to https://router.<domain> (resolved to the Cloudflare tunnel).
cloudflared forwards the request to the in-cluster router Service.
The router extracts the user message and (if present) recent conversation history.
The router calls the classifier Service synchronously and gets back a label {general, novel} plus a confidence score.
The router selects a backend: - general and confidence ≥ τ → Claude - novel and confidence ≥ τ → ScalarLM - confidence < τ → policy-defined default (see classifier.md)
The router translates the request to the backend's native protocol (Anthropic Messages API for Claude, OpenAI-compatible for ScalarLM) and streams tokens back to the client unchanged.
The router records the decision, latency, and token counts to the per-request audit log (per-pod JSONL on the shared PVC) and emits structured logs for warnings/errors.

Design constraints

No public ingress on the cluster. Cloudflare tunnel is the only path in from the internet. The router never gets a NodePort or a cloud LB.
Classifier latency budget: 50 ms p99. Any slower and we erode the value of routing.
IP never reaches Claude. Anything classified novel (or uncertain) is served by ScalarLM or by no backend at all — never silently downgraded to Claude. If the classifier or ScalarLM is unhealthy, the router returns 5xx; it does not fall back to Claude. This is the dominant safety invariant; everything else here is subordinate to it. See classifier.md and the audit log integration in router.md.
All traffic logged. Every request gets a per-request audit record (full prompt, full response, classifier decision, backend, latency) on the shared PVC. See router.md.
Stateless router. No session affinity, no in-memory caches that affect correctness. Horizontal scaling is the only knob.

Out of scope (initial version)

Multi-region failover.
Token-level billing / metering per client.
Rate limiting beyond what cloudflared provides at the edge.
Online learning loop back into ScalarLM (the platform supports it; we will wire it up in a later iteration).

Document map

router.md — router service internals and API surface.
classifier.md — classifier model, training data, decision policy.
kubernetes.md — manifests, namespaces, GPU scheduling, secrets.
cloudflare-tunnel.md — tunnel setup, ingress rules, DNS.
docker.md — image inventory and build process.
ui.md — internal operator and ML-engineer UI.
cli.md — ./split-brain operator CLI.