Classifier
The classifier answers a single question per request:
Is this prompt about something the public internet already knows well, or is it about something novel (proprietary terminology, private data, recent internal events, domain jargon that does not show up in Common Crawl)?
If general → route to Claude. If novel → route to ScalarLM.
Implementation status
What is implemented today (the rest of this doc mixes in forward-looking design — sections that describe unbuilt machinery are marked Planned):
- The MiniLM encoder + 2-layer MLP head, served at POST /classify, with a hot POST /reload to pick up a freshly trained head (see § Model, § API).
- A UI-served bootstrap (not a one-off Kubernetes Job): the operator stages docs and clicks Train in the UI; the same pipeline lives in classifier/src/classifier/bootstrap/ and ui/src/ui/bootstrap_pipeline/.
- Synthetic "novel" prompts from a deterministic heuristic generator (templated questions over each chunk's salient terms); a ScalarLM-backed generator is available but optional.
- The data flywheel: operators label real traffic general/novel from the Request explorer (per-span; see ui.md); those labels are oversampled so a handful aren't swamped by the corpus, and folded into the next bootstrap.
Planned / not yet built: ScalarLM-generated synthetic prompts as the default, hard-negative mining, the curated human anchor set, outcome-based (LLM-judge) labels, the daily retraining CronJob, shadow routing, and ScalarLM↔classifier release coupling. Those sections below describe intent.
Why a separate model instead of a heuristic
We considered keyword lists and embedding-distance to a corpus of "private" docs. Both work for narrow domains but degrade as the proprietary vocabulary grows and have no calibrated notion of confidence. A small fine-tuned encoder gives us probabilities we can threshold and a clear retraining loop.
We also considered asking Claude itself to classify. Rejected because:
- it puts Claude in the request path even for queries that should never reach Claude (defeats data-locality);
- it adds 500 ms+ to every request;
- it costs money per classification.
Model
- Base: sentence-transformers/all-MiniLM-L6-v2 (22M params, permissively licensed, ~5 ms CPU inference for 256 tokens).
- Head: a 2-layer MLP (384→128→2, ReLU) on top of the mean-pooled sentence embedding, output dim 2 (general, novel). p_novel is softmax(logits)[1]. The head is trained from the UI bootstrap and saved as head.safetensors on the PVC; the service loads it at startup and on POST /reload.
- Why this base: small enough to serve on CPU within the latency budget, large enough to capture semantic features. Larger backbones (DeBERTa-v3-base, etc.) buy 2-3 points of accuracy but push p99 over budget without a GPU.
We will revisit this choice once we have ≥10k labeled examples; at that scale a distilled domain-specific model trained from scratch is viable.
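To make the head's shape concrete, here is a minimal pure-Python sketch of the forward pass described above (384→128→2, ReLU, p_novel = softmax(logits)[1]). It is illustrative only: the function names, the parameter dictionary layout, and the random placeholder weights are assumptions, and the real service would load trained weights from head.safetensors and run the MiniLM encoder to produce the embedding.

```python
import math
import random

EMBED_DIM, HIDDEN_DIM, NUM_CLASSES = 384, 128, 2  # 384→128→2, as described above


def linear(x, weights, bias):
    """Compute weights @ x + bias for one vector x (weights is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_forward(embedding, params):
    """Return p_novel = softmax(logits)[1] for one mean-pooled embedding."""
    hidden = [max(0.0, h) for h in linear(embedding, params["w1"], params["b1"])]  # ReLU
    logits = linear(hidden, params["w2"], params["b2"])
    return softmax(logits)[1]  # index 0 = general, index 1 = novel


def random_params(seed=0):
    """Untrained placeholder weights (assumption); a real head comes from head.safetensors."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(HIDDEN_DIM, EMBED_DIM), "b1": [0.0] * HIDDEN_DIM,
        "w2": matrix(NUM_CLASSES, HIDDEN_DIM), "b2": [0.0] * NUM_CLASSES,
    }
```

With untrained weights the output is meaningless; the sketch only shows where p_novel comes from and why the two class probabilities sum to one.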
Training data
Two sources, balanced 50/50:
- General-purpose corpus: sampled prompts from publicly released conversation datasets (LMSYS-Chat-1M, ShareGPT, WildChat). Labeled general.
- Novel corpus: synthetic prompts generated from the proprietary documents that ScalarLM is fine-tuned on, plus real internal support tickets and engineering questions with PII scrubbed. Labeled novel.
In the target design, the synthetic prompts are generated by ScalarLM with the instruction "write a realistic user question that requires the attached document to answer correctly." (Today the default generator is the deterministic heuristic one; see Implementation status.) We hold out 10% of source documents and never generate training prompts from them, so the eval set measures generalization, not memorization.
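One way to keep the 10% hold-out stable across runs is to split by document rather than by prompt, using a hash of the document ID. The sketch below assumes each source document has a stable string ID; the function names and the hashing scheme are illustrative and not taken from the repo.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # 10% of source documents, per the text above


def is_held_out(doc_id: str, fraction: float = HOLDOUT_FRACTION) -> bool:
    """Assign a document to the eval hold-out by hashing its ID into [0, 1)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def split_documents(doc_ids):
    """Partition document IDs into (train_docs, eval_docs); no document lands in both."""
    train, held_out = [], []
    for doc_id in doc_ids:
        (held_out if is_held_out(doc_id) else train).append(doc_id)
    return train, held_out
```

Because the assignment depends only on the document ID, every chunk and every prompt derived from a held-out document stays out of training, which is what lets the eval set measure generalization.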
Using ScalarLM here is a hard requirement, not a stylistic preference. The source documents are proprietary; sending them to Claude to produce questions about them would itself be an IP leak — exactly the failure mode the classifier exists to prevent. By the same logic, every other LLM-touching step in the retraining pipeline (LLM-as-judge for outcome labels, paraphrase augmentation, hard-negative mining) also runs on ScalarLM. The retraining CronJob is deployed without Anthropic credentials mounted; the Helm chart fails to render if those credentials are referenced from the classifier subchart.
Target dataset size for v1: 5k labeled examples, growing to 20k by v2.
Bootstrapping
At cold start there is no production traffic and no user feedback. The first
classifier is trained from a staged set of documents plus a vendored public
corpus. As built, this runs from the UI (/bootstrap): the operator
uploads docs, clicks Train, and the UI trains the head in-process and calls
the classifier's /reload — there is no separate Kubernetes Job. The
pipeline below is the design; the implemented subset is: chunk → generate
novel prompts (heuristic) → sample general from the corpus → fold in
oversampled human labels → hold out an eval slice → train the head →
save + reload. Steps marked Planned (ScalarLM generation, hard negatives,
human anchor set) are not yet built.
Inputs
- Source documents. A directory of proprietary design docs on the shared PVC at /var/split-brain/bootstrap/ (staged by the operator before the bootstrap Job runs). Hard requirement: the source set must be a subset of ScalarLM's fine-tuning corpus. If it is not, the classifier will route prompts about these docs to ScalarLM, but ScalarLM will not have learned them, so we route into a void.
- Public corpus. A vendored slice (~10k examples) of LMSYS-Chat-1M, WildChat, or ShareGPT, committed to the classifier repo. Vendored so the bootstrap is reproducible without external network access.
Pipeline (one-off Job)
- Chunk each source document to ~500 tokens with ~50-token overlap. Chunks, not full docs, are the unit of generation.
- Generate novel prompts. For each chunk, ScalarLM produces K prompts (default K=10) varying along four axes: persona (engineer / support / executive), intent (info-seeking / troubleshooting / comparison / instructional), length (one-liner / multi-paragraph with context), and specificity (uses exact terms / uses generic language about the concept). Label novel.
- Sample general prompts. Draw N matching prompts from the vendored public corpus, balanced 50/50 with the novel set. Label general.
- Fold in human labels (oversampled). Operator-curated general/novel labels (the data flywheel) are added and repeated up to the generated-novel count so a small label set isn't drowned out by the corpus (bootstrap_pipeline/dataset.py::_oversample_factor).
- Generate hard negatives. (Planned.) For a fraction of the general prompts, have ScalarLM rewrite them to use proprietary vocabulary while keeping the answer reachable from public knowledge, so the classifier doesn't learn "mentions our jargon → novel" as a shortcut.
- Hold out the eval slice. Reserve a fraction of source documents from generation; prompts derived from them form the generalization eval set.
- Curate the human anchor set. (Planned.) A domain expert hand-labels a small fixed set reused for judge-drift detection.
- Train. Train the MLP head on frozen encoder features; report eval accuracy + novel-F1.
- Save + reload. Write head.safetensors to the PVC and POST /reload so the live classifier serves it immediately. The UI then auto-evaluates the new head against a probe set.
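Two of the steps above, chunking with overlap and oversampling the human labels, can be sketched in a few lines. This is a hedged illustration: it treats a pre-split token list as input (the real tokenizer is not specified here), and oversample is a hypothetical stand-in for the repo's _oversample_factor, whose exact rule may differ.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def oversample(labels, target_count):
    """Repeat a small label set until it has roughly `target_count` items.

    Hypothetical rule (integer repeat factor, never below 1); the real
    _oversample_factor in dataset.py may compute this differently.
    """
    if not labels:
        return []
    factor = max(1, target_count // len(labels))
    return labels * factor
```

For a quick experiment, tokens could be text.split(), which approximates tokens with whitespace-separated words; that is coarser than a model tokenizer but enough to see the overlap behaviour.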
Cold-start threshold
Planned. The intent is to ship the first deploy with a wider uncertain
band (e.g. CLASSIFIER_THRESHOLD = 0.3 → uncertain band 0.3–0.7) so more
cold-start traffic routes to the safe default (ScalarLM), then narrow τ as
the model earns trust. The current default is 0.4 in all environments;
the cold-start widening is not automated — an operator would set
CLASSIFIER_THRESHOLD by hand.
Risks the bootstrap deliberately accepts
- No outcome data. The bootstrap labels are entirely synthetic or distantly supervised. The classifier's accuracy on real production traffic is unknown until the first day of audit logs is processed.
- Surface-feature shortcut. The hard-negative step is the mitigation but not a guarantee. The human anchor set is the detection mechanism — if F1 on the anchor set is much higher than F1 on production a week in, the shortcut happened.
- Domain drift between bootstrap docs and live traffic. Documents describe what the system should be; live prompts often describe what is actually broken in production. The bootstrap classifier will under-represent error messages, log excerpts, and operational vocabulary. Expect to top this up in the first retraining cycle.
Implementation pointers
The bootstrap pipeline lives in classifier/src/classifier/bootstrap/ and is
mirrored in ui/src/ui/bootstrap_pipeline/ (inlined into the UI image so the
UI can train without depending on the classifier package):
bootstrap/
  chunk.py                 # split staged docs into ~250-word chunks
  generate.py              # HeuristicPromptGenerator + ScalarlmPromptGenerator
  dataset.py               # build train/eval split; balance + oversample labels
  train.py                 # train the MLP head, save head.safetensors
  build_general_corpus.py  # build/refresh the vendored public_general.jsonl
  eval_probes.py           # score a probe set
  main.py                  # CLI entrypoint (`./split-brain bootstrap`)
The vendored public corpus ships at
ui/src/ui/bootstrap_pipeline/corpus/public_general.jsonl. The UI drives the
same code in-process (see ui/src/ui/bootstrap.py); the CLI path
(./split-brain bootstrap) exists for offline runs.
Decision policy
Let p = P(novel | prompt) from the classifier softmax.
if p >= 1 - τ: backend = scalarlm # confidently novel
elif p <= τ: backend = claude # confidently general
else: backend = scalarlm # uncertain → safe default
Default τ = 0.4 (so the "confidently general" zone is p ≤ 0.4 and the "confidently novel" zone is p ≥ 0.6; the 0.4–0.6 uncertain band routes to ScalarLM). The uncertain default is asymmetric in service of the system's dominant invariant: proprietary traffic must never reach Claude.
- Routing a general prompt to ScalarLM costs degraded answer quality (ScalarLM is narrower).
- Routing an IP-containing prompt to Claude costs an IP leak.
These two failures are not symmetric, and we accept the cheaper one. A future revision can revisit this once classifier calibration is trustworthy enough that the uncertain band is rare — until then, the safe default is ScalarLM.
τ is hot-reloadable via the CLASSIFIER_THRESHOLD env on the router
(the classifier itself returns raw probabilities; thresholding lives
in the router). τ controls the width of the uncertain band, not
which side of the band routes where. The uncertain-band destination
is a code-level decision, not a config knob — flipping it back to
Claude would silently violate the IP invariant.
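The policy above translates directly into a small router-side function. The sketch below is an illustration rather than the router's actual code: route and load_tau are hypothetical names, and the check that τ stays within [0, 0.5] is an added assumption so the two confident zones cannot overlap.

```python
import os

DEFAULT_TAU = 0.4  # documented default


def load_tau(env=os.environ):
    """Read τ from CLASSIFIER_THRESHOLD, falling back to the documented default."""
    tau = float(env.get("CLASSIFIER_THRESHOLD", DEFAULT_TAU))
    if not 0.0 <= tau <= 0.5:  # assumption: keep the confident zones from overlapping
        raise ValueError("CLASSIFIER_THRESHOLD must lie in [0, 0.5]")
    return tau


def route(p_novel, tau=DEFAULT_TAU):
    """Map P(novel | prompt) to a backend; the uncertain band always goes to ScalarLM."""
    if p_novel >= 1 - tau:
        return "scalarlm"  # confidently novel
    if p_novel <= tau:
        return "claude"    # confidently general
    return "scalarlm"      # uncertain band: safe default, deliberately not configurable
```

Note that τ only moves the band edges. With tau=0.3, a prompt at p_novel=0.35 falls into the uncertain band and goes to ScalarLM, whereas at the default 0.4 it would be confidently general.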
Latency budget
- p50: ~15 ms (single warm classify)
- p99: ~50 ms
- Hard timeout (router-side): CLASSIFIER_TIMEOUT_MS, default 2000 ms. On timeout or any classifier error the router fails closed (503). It does not fall back to Claude, since a routing failure must never let proprietary content reach Claude (the IP invariant).
The timeout is generous because the Anthropic ingress fans out one classify call per payload span (Claude Code sends many), each CPU-bound; a tight 100 ms cap timed those out and 503'd legitimate traffic. Inference is batch-size 1 per call; we do not batch across requests.
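The fail-closed behaviour can be sketched with asyncio.wait_for. This is a minimal illustration under assumptions: classify stands for any async callable that calls the classifier, and the (status, body) return shape is invented for the example, not the router's real interface.

```python
import asyncio

CLASSIFIER_TIMEOUT_MS = 2000  # documented default


async def classify_or_fail_closed(classify, text, timeout_ms=CLASSIFIER_TIMEOUT_MS):
    """Return (status, body); any timeout or classifier error yields 503.

    There is deliberately no fallback branch that routes to Claude.
    """
    try:
        result = await asyncio.wait_for(classify(text), timeout=timeout_ms / 1000)
    except Exception:  # covers asyncio.TimeoutError and classifier errors alike
        return 503, {"error": "classifier unavailable"}
    return 200, result
```

The single except branch is the point of the design: a timeout and a crash are treated identically, so no failure mode can turn into an accidental default route.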
API
The classifier is its own HTTP service. Two endpoints:
POST /classify
Content-Type: application/json
{
"text": "<prompt to classify>" # accepts large input; the model
} # only reads the first ~8192 chars
200 OK
{
"label": "general" | "novel",
"p_novel": 0.0..1.0,
"model_version": "minilm/sentence-transformers/all-MiniLM-L6-v2+head-trained"
}
POST /reload # re-read head.safetensors from the PVC (called by the
# UI after a bootstrap finishes); 200 on success.
The router caps each span at 8000 chars before calling /classify, and the
model truncates to ~8192 chars internally, so oversized agentic spans (whole
file reads) neither leak signal nor trip input-size limits. model_version
reports whether a trained head is loaded (vs random/missing).
model_version is logged by the router so we can attribute
decisions to a specific checkpoint when reviewing eval data. The
router records label, p_novel, and model_version in the
per-request audit log; that log is the source of truth for the
retraining loop and must contain enough information to reconstruct
the decision the classifier made for any historical request.
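As a rough illustration of the router side of this contract, the sketch below caps a span at 8000 characters before building the /classify body and picks out the three classifier fields that go into the audit log. The helper names are hypothetical, and the real audit record carries more than these fields.

```python
import json

ROUTER_SPAN_CAP = 8000  # chars, applied by the router before calling /classify


def build_classify_body(span: str) -> str:
    """Serialize the JSON body for POST /classify, capping the span first."""
    return json.dumps({"text": span[:ROUTER_SPAN_CAP]})


def audit_fields(response: dict) -> dict:
    """Pick the classifier fields the router records in the per-request audit log."""
    return {key: response[key] for key in ("label", "p_novel", "model_version")}
```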
Retraining loop (Planned)
Status: design, not yet implemented. Today retraining is the manual flywheel: label real traffic in the UI → it's oversampled into the next bootstrap → Train → /reload. The automated, outcome-labeled loop below (daily CronJob, LLM-judge, shadow routing, ScalarLM release coupling) is the intended evolution; none of it ships yet. The one hard invariant that the current code already enforces is § 2: no classifier/training step ever calls Claude.
The classifier must keep pace with ScalarLM as ScalarLM is fine-tuned and self-improves. We achieve this with three properties.
1. Outcome-based labels, not document-based
We do not assume a prompt is "novel" because it derives from a proprietary document. We assume it's novel because ScalarLM answered it well — measured from production data. The retraining set is built from the router's per-request audit log (see router.md), which contains the full prompt, full response, chosen backend, classifier label/confidence/version, and any user feedback signal. The label fed into the next training run is "the better backend," derived from:
- User feedback when present (explicit thumbs, accept/reject, follow-up patterns).
- Heuristic outcome signals on the response (refusal phrases, "I don't know" templates, very short or very repetitive outputs).
- Shadow-routed comparisons (see below).
This is the most important property of the loop. A document-based classifier learns the training corpus; an outcome-based classifier learns the actual capability of whatever ScalarLM is today.
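The heuristic outcome signals listed above could look roughly like the following. Everything specific here is an assumption made for illustration: the phrase list, the word-count cutoff, and the repetition ratio are placeholders, not values from the design.

```python
REFUSAL_PHRASES = ("i don't know", "i cannot help", "i'm not able to")  # hypothetical examples
MIN_WORDS = 5           # hypothetical cutoff for "very short"
MIN_UNIQUE_RATIO = 0.3  # hypothetical cutoff for "very repetitive"


def looks_like_bad_answer(response: str) -> bool:
    """Flag a response that shows a refusal, is very short, or is very repetitive."""
    text = response.lower()
    words = text.split()
    if any(phrase in text for phrase in REFUSAL_PHRASES):
        return True
    if len(words) < MIN_WORDS:
        return True  # also covers the empty response, so the division below is safe
    return len(set(words)) / len(words) < MIN_UNIQUE_RATIO
```

A signal like this is weak on its own; in the design it is one input alongside user feedback and shadow-routed comparisons, not a label by itself.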
2. All LLM-touching steps use ScalarLM
Synthetic prompt generation, paraphrase augmentation, and any
LLM-as-judge call run on ScalarLM. Claude is never a judge or a
data source for the classifier — every step has access to prompts
that may contain IP. The CronJob's pod spec mounts no Anthropic
credentials and a NetworkPolicy blocks egress to
api.anthropic.com.
For LLM-as-judge specifically, we use a separate prompt template on ScalarLM ("which of these two responses better answers the user's question?") and accept the bias that comes from the candidate model also being the judge. We mitigate the bias with a small human-labeled subset that is re-scored on every release to detect judge drift; if the judge's agreement with humans falls below a threshold, the retrain is blocked until the judge prompt or the labeled subset is refreshed.
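The judge-drift gate reduces to an agreement rate between the judge and the human-labeled subset. In the sketch below the 0.85 cutoff is a placeholder, since the design only says "a threshold", and the function names are hypothetical.

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of the human-labeled subset on which the judge picks the same label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def retrain_allowed(judge_labels, human_labels, min_agreement=0.85):
    """Block the retrain when judge/human agreement drops below the cutoff (assumed value)."""
    return judge_agreement(judge_labels, human_labels) >= min_agreement
```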
3. Release coupling with ScalarLM
A new ScalarLM checkpoint cannot be promoted without:
- Regenerating the classifier eval set against the new checkpoint (outcome labels for some held-out prompts may flip: old ScalarLM lost, new ScalarLM wins, the label is now ScalarLM).
- Retraining the classifier head on the updated set.
- Confirming no regression on either model's eval, including the "definitely-novel" slice that must stay routed to ScalarLM.
The two release pipelines are joined; neither moves alone. Bumping ScalarLM in the Helm chart without a paired classifier image tag is a CI failure.
Shadow routing
For a small sample (default 1%) of confidently general requests — prompts the classifier labels general with p ≤ 0.1 — we additionally send the request to ScalarLM in the background. The response is judged by the ScalarLM-judge prompt described above, and the label is fed back into the training set. This is how we discover prompts that have crossed from "general" into "ScalarLM now handles this better."
We never shadow-route in the other direction. Sending novel prompts to Claude as a shadow would violate the IP invariant — the whole point of the system is that those prompts do not reach Claude under any circumstance.
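The sampling rule is one-directional by construction, which a short sketch makes explicit. The function name and the injectable rng parameter are assumptions for illustration; the 1% rate and the p ≤ 0.1 cutoff come from the text above.

```python
import random

SHADOW_RATE = 0.01        # default 1% sample
SHADOW_MAX_P_NOVEL = 0.1  # only confidently general requests qualify


def should_shadow(label, p_novel, rng=random, rate=SHADOW_RATE):
    """Decide whether to also send a request to ScalarLM in the background."""
    if label != "general" or p_novel > SHADOW_MAX_P_NOVEL:
        return False  # novel or uncertain prompts are never shadowed anywhere
    return rng.random() < rate
```

There is no code path that returns a Claude shadow target, so the IP invariant cannot be broken by changing the rate.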
Schedule and promotion
Daily Kubernetes CronJob:
- Reads the previous day's audit log from the shared PVC (/var/split-brain/audit/, mounted read-only).
- Builds the outcome-labeled training delta.
- Retrains the head only (the encoder stays frozen).
- Runs eval against the held-out + human-labeled sets.
- If F1 improves ≥ 1 pt and no regressions on the "definitely-novel" slice, publishes a new image tag.
Promotion of the new tag to production is not automatic — a human approves the image bump in the chart's values file. Combined with the release-coupling rule, this gives us human gating on both ScalarLM and classifier promotions. The classifier is the only ML system in the routing loop; we keep a human in the release path while we build trust in the eval.
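The publish condition in the last step can be written as a two-part gate. The sketch assumes scores are expressed in percentage points and that the "definitely-novel" slice is summarized by a single score; both are illustrative choices, as is the function name.

```python
def should_publish(old_f1_pts, new_f1_pts, old_slice_pts, new_slice_pts, min_gain_pts=1.0):
    """Publish a new image tag only if F1 gains enough and the novel slice does not regress.

    All scores are in percentage points (assumption), e.g. 91.5 for an F1 of 0.915.
    """
    improved = new_f1_pts - old_f1_pts >= min_gain_pts
    no_regression = new_slice_pts >= old_slice_pts
    return improved and no_regression
```

Passing this gate only publishes the tag; promotion to production still waits for the human approval described above.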
Code layout
classifier/
  pyproject.toml
  src/classifier/
    __init__.py
    app.py        # FastAPI service: /classify, /reload, /healthz
    model.py      # Prediction + heuristic (v0) model
    minilm.py     # MiniLM encoder + MLP head, load_head/predict
    schema.py     # request/response models
    config.py     # env parsing
    bootstrap/    # chunk / generate / dataset / train / eval (see above)
  tests/
The encoder ships inside the image; the trained head
(head.safetensors) lives on the PVC at /var/split-brain/heads/, written
by the UI bootstrap and loaded on startup / /reload.