Classifier

The classifier answers a single question per request:

Is this prompt about something the public internet already knows well, or is it about something novel (proprietary terminology, private data, recent internal events, domain jargon that does not show up in Common Crawl)?

If general → route to Claude. If novel → route to ScalarLM.

Implementation status

What is implemented today (the rest of this doc mixes in forward-looking design — sections that describe unbuilt machinery are marked Planned):

The MiniLM encoder + 2-layer MLP head, served at POST /classify, with a hot POST /reload to pick up a freshly trained head (see § Model, § API).
A UI-served bootstrap (not a one-off Kubernetes Job): the operator stages docs and clicks Train in the UI; the same pipeline lives in classifier/src/classifier/bootstrap/ and ui/src/ui/bootstrap_pipeline/.
Synthetic "novel" prompts from a deterministic heuristic generator (templated questions over each chunk's salient terms); a ScalarLM-backed generator is available but optional.
The data flywheel: operators label real traffic general/novel from the Request explorer (per-span — see ui.md); those labels are oversampled so a handful aren't swamped by the corpus, and folded into the next bootstrap.

Planned / not yet built: ScalarLM-generated synthetic prompts as the default, hard-negative mining, the curated human anchor set, outcome-based (LLM-judge) labels, the daily retraining CronJob, shadow routing, and ScalarLM↔classifier release coupling. Those sections below describe intent.

Why a separate model instead of a heuristic

We considered keyword lists and embedding-distance to a corpus of "private" docs. Both work for narrow domains but degrade as the proprietary vocabulary grows and have no calibrated notion of confidence. A small fine-tuned encoder gives us probabilities we can threshold and a clear retraining loop.

We also considered asking Claude itself to classify. Rejected because:

it puts Claude in the request path even for queries that should never reach Claude (defeats data-locality);
it adds 500 ms+ to every request;
it costs money per classification.

Model

Base: sentence-transformers/all-MiniLM-L6-v2 (22M params, permissively licensed, ~5 ms CPU inference for 256 tokens).
Head: A 2-layer MLP (384→128→2, ReLU) on top of the mean-pooled sentence embedding. The two outputs are the logits for (general, novel), and p_novel is softmax(logits)[1]. The head is trained from the UI bootstrap and saved as head.safetensors on the PVC; the service loads it at startup and on POST /reload.
Why this base: small enough to serve on CPU with the latency budget, large enough to capture semantic features. Larger backbones (DeBERTa-v3-base, etc.) buy 2-3 points of accuracy but push p99 over budget without a GPU.
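The head's forward pass can be sketched in plain Python for readability. This is an illustration under assumptions, not the code in minilm.py: the function names are hypothetical, and the weights passed in would be the 384→128 and 128→2 matrices loaded from head.safetensors.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output unit


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> Vector:
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Dense layer: one dot product per output row, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax (subtract the max before exponentiating)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_novel(embedding: Vector, w1: Matrix, b1: Vector,
            w2: Matrix, b2: Vector) -> float:
    """Two-layer MLP with ReLU; index 1 of the softmax is the novel class."""
    hidden = [max(0.0, h) for h in linear(embedding, w1, b1)]
    logits = linear(hidden, w2, b2)  # [general, novel]
    return softmax(logits)[1]
```

With all-zero output weights the two logits tie, so p_novel is 0.5; that is a quick sanity check when a head fails to load.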

We will revisit this choice once we have ≥10k labeled examples; at that scale a distilled domain-specific model trained from scratch is viable.

Training data

Two sources, balanced 50/50:

General-purpose corpus — sampled prompts from publicly released conversation datasets (LMSYS-Chat-1M, ShareGPT, WildChat). Labeled general.
Novel corpus — synthetic prompts generated from the proprietary documents that ScalarLM is fine-tuned on, plus real internal support tickets and engineering questions with PII scrubbed. Labeled novel.

In the design, the synthetic prompts are generated by ScalarLM with the instruction "write a realistic user question that requires the attached document to answer correctly." (Today the default generator is the deterministic heuristic one; see § Implementation status.) We hold out 10% of source documents and never generate training prompts from them, so the eval set measures generalization, not memorization.
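One way to keep the holdout at the document level is a deterministic hash split. The sketch below is an assumption about how this could be done, not the implemented split; the function names are hypothetical, and a hash split yields roughly 10% rather than exactly 10%.

```python
import hashlib
from typing import Iterable, List, Tuple


def is_held_out(doc_id: str, holdout_fraction: float = 0.10) -> bool:
    """Deterministically assign a document to the eval holdout.

    Hashing the document id (rather than sampling chunks or prompts) keeps
    every prompt derived from one document on the same side of the split.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdout_fraction


def split_documents(doc_ids: Iterable[str],
                    holdout_fraction: float = 0.10) -> Tuple[List[str], List[str]]:
    """Return (train_docs, held_out_docs); the two lists never overlap."""
    train, held_out = [], []
    for doc_id in doc_ids:
        target = held_out if is_held_out(doc_id, holdout_fraction) else train
        target.append(doc_id)
    return train, held_out
```

Because the assignment depends only on the document id, re-running the bootstrap after adding new documents does not move existing documents across the split.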

Using ScalarLM here is a hard requirement, not a stylistic preference. The source documents are proprietary; sending them to Claude to produce questions about them would itself be an IP leak — exactly the failure mode the classifier exists to prevent. By the same logic, every other LLM-touching step in the retraining pipeline (LLM-as-judge for outcome labels, paraphrase augmentation, hard-negative mining) also runs on ScalarLM. The retraining CronJob is deployed without Anthropic credentials mounted; the Helm chart fails to render if those credentials are referenced from the classifier subchart.

Target dataset size for v1: 5k labeled examples, growing to 20k by v2.

Bootstrapping

At cold start there is no production traffic and no user feedback. The first classifier is trained from a staged set of documents plus a vendored public corpus. As built, this runs from the UI (/bootstrap): the operator uploads docs, clicks Train, and the UI trains the head in-process and calls the classifier's /reload — there is no separate Kubernetes Job. The pipeline below is the design; the implemented subset is: chunk → generate novel prompts (heuristic) → sample general from the corpus → fold in oversampled human labels → hold out an eval slice → train the head → save + reload. Steps marked Planned (ScalarLM generation, hard negatives, human anchor set) are not yet built.

Inputs

Source documents. A directory of proprietary design docs on the shared PVC at /var/split-brain/bootstrap/ (staged by the operator before the bootstrap runs). Hard requirement: the source set must be a subset of ScalarLM's fine-tuning corpus. If it is not, the classifier will route prompts about these docs to ScalarLM, but ScalarLM will not have learned them, so those prompts are routed into a void.
Public corpus. A vendored slice (~10k examples) of LMSYS-Chat-1M, WildChat, or ShareGPT, committed to the classifier repo. Vendored so the bootstrap is reproducible without external network access.

Pipeline (designed as a one-off Job; as built, run from the UI)

Chunk each source document to ~500 tokens with ~50-token overlap. Chunks, not full docs, are the unit of generation.
Generate novel prompts. As built, the deterministic heuristic generator emits templated questions over each chunk's salient terms. In the design (Planned as the default), ScalarLM produces K prompts per chunk (default K=10) varying along four axes: persona (engineer / support / executive), intent (info-seeking / troubleshooting / comparison / instructional), length (one-liner / multi-paragraph with context), and specificity (uses exact terms / uses generic language about the concept). Label novel.
Sample general prompts. Draw N prompts from the vendored public corpus, where N matches the size of the novel set, so the two classes are balanced 50/50. Label general.
Fold in human labels (oversampled). Operator-curated general/novel labels (the data flywheel) are added and repeated up to the generated-novel count so a small label set isn't drowned out by the corpus (bootstrap_pipeline/dataset.py::_oversample_factor).
Generate hard negatives. (Planned.) For a fraction of the general prompts, have ScalarLM rewrite them to use proprietary vocabulary while keeping the answer reachable from public knowledge, so the classifier doesn't learn "mentions our jargon → novel" as a shortcut.
Hold out the eval slice. Reserve a fraction of source documents from generation; prompts derived from them form the generalization eval set.
Curate the human anchor set. (Planned.) A domain expert hand-labels a small fixed set reused for judge-drift detection.
Train. Train the MLP head on frozen encoder features; report eval accuracy + novel-F1.
Save + reload. Write head.safetensors to the PVC and POST /reload so the live classifier serves it immediately. The UI then auto-evaluates the new head against a probe set.
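The chunking step at the top of the pipeline can be sketched as a sliding window. This sketch counts whitespace-separated words as a stand-in for tokens, which is an assumption; chunk.py may count differently, and the function name is hypothetical.

```python
from typing import List


def chunk_words(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

The early break avoids emitting a trailing chunk that would contain only words already covered by the previous window.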

Cold-start threshold

Planned. The intent is to ship the first deploy with a wider uncertain band (e.g. CLASSIFIER_THRESHOLD = 0.3 → uncertain band 0.3–0.7) so more cold-start traffic routes to the safe default (ScalarLM), then narrow the band by raising τ as the model earns trust. The current default is 0.4 in all environments; the cold-start widening is not automated, so an operator would set CLASSIFIER_THRESHOLD by hand.

Risks the bootstrap deliberately accepts

No outcome data. The bootstrap labels are entirely synthetic or distantly supervised. The classifier's accuracy on real production traffic is unknown until the first day of audit logs is processed.
Surface-feature shortcut. The hard-negative step is the mitigation but not a guarantee. The human anchor set is the detection mechanism — if F1 on the anchor set is much higher than F1 on production a week in, the shortcut happened.
Domain drift between bootstrap docs and live traffic. Documents describe what the system should be; live prompts often describe what is actually broken in production. The bootstrap classifier will under-represent error messages, log excerpts, and operational vocabulary. Expect to top this up in the first retraining cycle.

Implementation pointers

The bootstrap pipeline lives in classifier/src/classifier/bootstrap/ and is mirrored in ui/src/ui/bootstrap_pipeline/ (inlined into the UI image so the UI can train without depending on the classifier package):

bootstrap/
  chunk.py               # split staged docs into ~250-word chunks
  generate.py            # HeuristicPromptGenerator + ScalarlmPromptGenerator
  dataset.py             # build train/eval split; balance + oversample labels
  train.py               # train the MLP head, save head.safetensors
  build_general_corpus.py # build/refresh the vendored public_general.jsonl
  eval_probes.py         # score a probe set
  main.py                # CLI entrypoint (`./split-brain bootstrap`)

The vendored public corpus ships at ui/src/ui/bootstrap_pipeline/corpus/public_general.jsonl. The UI drives the same code in-process (see ui/src/ui/bootstrap.py); the CLI path (./split-brain bootstrap) exists for offline runs.
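The balance and oversample behavior attributed to dataset.py can be illustrated as follows. This is a sketch under assumptions: the repetition rule (floor of generated-novel count over label count, at least 1) is a guess at what "repeated up to the generated-novel count" means, and the function names are hypothetical rather than the ones in dataset.py.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, label)


def oversample_factor_sketch(n_labels: int, n_generated_novel: int) -> int:
    """How many times to repeat each human label (hypothetical rule)."""
    if n_labels == 0:
        return 0
    return max(1, n_generated_novel // n_labels)


def build_training_set(novel_prompts: Sequence[str],
                       general_corpus: Sequence[str],
                       human_labels: Sequence[Example],
                       seed: int = 0) -> List[Example]:
    """Balance novel/general 50/50, then fold in repeated human labels."""
    rng = random.Random(seed)
    n = min(len(novel_prompts), len(general_corpus))
    novel = [(p, "novel") for p in rng.sample(list(novel_prompts), n)]
    general = [(p, "general") for p in rng.sample(list(general_corpus), n)]
    factor = oversample_factor_sketch(len(human_labels), len(novel_prompts))
    examples = novel + general + list(human_labels) * factor
    rng.shuffle(examples)
    return examples
```

Seeding the random generator keeps the sampled general set and the shuffle order reproducible between runs over the same staged documents.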

Decision policy

Let p = P(novel | prompt) from the classifier softmax.

if   p >= 1 - τ:    backend = scalarlm         # confidently novel
elif p <= τ:        backend = claude           # confidently general
else:               backend = scalarlm         # uncertain → safe default

Default τ = 0.4 (so the "confidently general" zone is p ≤ 0.4 and the "confidently novel" zone is p ≥ 0.6; the 0.4–0.6 uncertain band routes to ScalarLM). The uncertain default is asymmetric in service of the system's dominant invariant: proprietary traffic must never reach Claude.

Routing a general prompt to ScalarLM costs degraded answer quality (ScalarLM is narrower).
Routing an IP-containing prompt to Claude costs an IP leak.

These two failures are not symmetric, and we accept the cheaper one. A future revision can revisit this once classifier calibration is trustworthy enough that the uncertain band is rare — until then, the safe default is ScalarLM.

τ is hot-reloadable via the CLASSIFIER_THRESHOLD environment variable on the router (the classifier itself returns raw probabilities; thresholding lives in the router). τ controls the width of the uncertain band, not which side of the band routes where. The uncertain-band destination is a code-level decision, not a config knob: flipping it back to Claude would silently violate the IP invariant.
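The policy above can be written as a small runnable function. This is a minimal sketch, not the router's code: the Backend enum, the function name, and the returned zone string are hypothetical. It keeps the branch order of the pseudocode so behavior matches for any τ.

```python
from enum import Enum
from typing import Tuple


class Backend(str, Enum):
    CLAUDE = "claude"
    SCALARLM = "scalarlm"


def choose_backend(p_novel: float, tau: float = 0.4) -> Tuple[Backend, str]:
    """Map P(novel | prompt) to a backend plus the zone it fell in."""
    if not 0.0 <= p_novel <= 1.0:
        raise ValueError("p_novel must be a probability in [0, 1]")
    if p_novel >= 1.0 - tau:
        return Backend.SCALARLM, "novel"      # confidently novel
    if p_novel <= tau:
        return Backend.CLAUDE, "general"      # confidently general
    # Uncertain band: hard-coded to the safe default, deliberately not a knob.
    return Backend.SCALARLM, "uncertain"
```

Only the confidently general zone ever selects Claude; the zone string is returned separately so an audit log could distinguish "novel" from "uncertain" even though both route to ScalarLM.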

Latency budget

p50: ~15 ms (single warm classify)
p99: ~50 ms
Hard timeout (router-side): CLASSIFIER_TIMEOUT_MS, default 2000 ms. On timeout or any classifier error the router fails closed (503) — it does not fall back to Claude, since a routing failure must never let proprietary content reach Claude (the IP invariant).

The timeout is generous because the Anthropic ingress fans out one classify call per payload span (Claude Code sends many), each CPU-bound; a tight 100 ms cap timed those out and 503'd legitimate traffic. Inference is batch-size 1 per call; we do not batch across requests.
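The fail-closed behavior can be sketched as a thin wrapper around whatever function performs the classify call. The wrapper name and the injected `classify` callable are assumptions for illustration; the point is that every failure path yields 503 and none yields a Claude fallback.

```python
from typing import Callable, Tuple, Union


def classify_or_fail_closed(
    classify: Callable[[str, float], float],
    text: str,
    timeout_ms: int = 2000,
) -> Tuple[int, Union[float, str]]:
    """Return (200, p_novel) on success, or (503, reason) on any failure.

    `classify(text, timeout_seconds)` is a hypothetical callable that performs
    the POST /classify round trip and raises on timeout or transport error.
    """
    try:
        p_novel = classify(text, timeout_ms / 1000.0)
    except Exception as exc:  # timeout, connection error, malformed payload
        return 503, f"classifier unavailable: {type(exc).__name__}"
    if not isinstance(p_novel, float) or not 0.0 <= p_novel <= 1.0:
        return 503, "classifier returned an invalid probability"
    return 200, p_novel
```

Treating an out-of-range probability as a failure follows the same reasoning as the timeout case: when the routing signal cannot be trusted, the request is rejected rather than routed.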

API

The classifier is its own HTTP service. Two endpoints:

POST /classify
Content-Type: application/json

{
  "text": "<prompt to classify>"        # accepts large input; the model
}                                        # only reads the first ~8192 chars

200 OK
{
  "label": "general" | "novel",
  "p_novel": 0.0..1.0,
  "model_version": "minilm/sentence-transformers/all-MiniLM-L6-v2+head-trained"
}

POST /reload      # re-read head.safetensors from the PVC (called by the
                  # UI after a bootstrap finishes); 200 on success.

The router caps each span at 8000 chars before calling /classify, and the model truncates to ~8192 chars internally, so oversized agentic spans (whole file reads) neither leak signal nor trip input-size limits. model_version reports whether a trained head is loaded (vs random/missing).

model_version is logged by the router so we can attribute decisions to a specific checkpoint when reviewing eval data. The router records label, p_novel, and model_version in the per-request audit log; that log is the source of truth for the retraining loop and must contain enough information to reconstruct the decision the classifier made for any historical request.
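A router-side client for this API might look like the following sketch, using only the standard library. The helper names and the base URL in the usage note are hypothetical; the 8000-character cap and the three recorded fields come from the text above.

```python
import json
import urllib.request

MAX_SPAN_CHARS = 8000  # router-side cap applied before calling /classify


def build_classify_request(base_url: str, span: str) -> urllib.request.Request:
    """Build the POST /classify request for one payload span."""
    body = json.dumps({"text": span[:MAX_SPAN_CHARS]}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def audit_fields(response_body: bytes) -> dict:
    """Extract the fields the router records for each classifier decision."""
    payload = json.loads(response_body)
    if payload["label"] not in ("general", "novel"):
        raise ValueError("unexpected label: %r" % (payload["label"],))
    return {
        "label": payload["label"],
        "p_novel": float(payload["p_novel"]),
        "model_version": str(payload["model_version"]),
    }
```

A caller would send the request with `urllib.request.urlopen(req, timeout=2.0)` and pass the response bytes to `audit_fields`; a base URL such as `http://classifier:8080` is a placeholder, not a documented address.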

Retraining loop (Planned)

Status: design, not yet implemented. Today retraining is the manual flywheel: label real traffic in the UI → it's oversampled into the next bootstrap → Train → /reload. The automated, outcome-labeled loop below (daily CronJob, LLM-judge, shadow routing, ScalarLM release coupling) is the intended evolution; none of it ships yet. The one hard invariant that the current code already enforces is § 2 — no classifier/training step ever calls Claude.

The classifier must keep pace with ScalarLM as ScalarLM is fine-tuned and self-improves. We achieve this with three properties.

1. Outcome-based labels, not document-based

We do not assume a prompt is "novel" because it derives from a proprietary document. We assume it's novel because ScalarLM answered it well — measured from production data. The retraining set is built from the router's per-request audit log (see router.md), which contains the full prompt, full response, chosen backend, classifier label/confidence/version, and any user feedback signal. The label fed into the next training run is "the better backend," derived from:

User feedback when present (explicit thumbs, accept/reject, follow-up patterns).
Heuristic outcome signals on the response (refusal phrases, "I don't know" templates, very short or very repetitive outputs).
Shadow-routed comparisons (see below).

This is the most important property of the loop. A document-based classifier learns the training corpus; an outcome-based classifier learns the actual capability of whatever ScalarLM is today.
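The heuristic outcome signals listed above (refusal phrases, "I don't know" templates, very short or very repetitive outputs) could be approximated as below. Everything here is an assumption for illustration: the phrase list, the thresholds, and the function name are placeholders that would need tuning against real audit logs.

```python
import re

# Hypothetical refusal templates; a real list would be curated from audit logs.
REFUSAL_PATTERNS = [
    r"\bI don't know\b",
    r"\bI(?: am|'m) not sure\b",
    r"\bI (?:cannot|can't) (?:help|answer)\b",
]


def looks_like_bad_answer(response: str,
                          min_words: int = 5,
                          min_unique_ratio: float = 0.3) -> bool:
    """Flag responses that are very short, refusals, or highly repetitive."""
    words = response.split()
    if len(words) < min_words:
        return True  # very short output
    if any(re.search(p, response, re.IGNORECASE) for p in REFUSAL_PATTERNS):
        return True  # refusal or "I don't know" template
    unique_ratio = len({w.lower() for w in words}) / len(words)
    return unique_ratio < min_unique_ratio  # very repetitive output
```

A signal like this is weak on its own; in the design it is one input alongside user feedback and shadow-routed comparisons when deriving "the better backend."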

2. All LLM-touching steps use ScalarLM

Synthetic prompt generation, paraphrase augmentation, and any LLM-as-judge call run on ScalarLM. Claude is never a judge or a data source for the classifier, because every one of these steps has access to prompts that may contain IP. The CronJob's pod spec mounts no Anthropic credentials and a NetworkPolicy blocks egress to api.anthropic.com.

For LLM-as-judge specifically, we use a separate prompt template on ScalarLM ("which of these two responses better answers the user's question?") and accept the bias that comes from the candidate model also being the judge. We mitigate the bias with a small human-labeled subset that is re-scored on every release to detect judge drift; if the judge's agreement with humans falls below a threshold, the retrain is blocked until the judge prompt or the labeled subset is refreshed.
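The judge-drift gate described above reduces to an agreement rate checked against a threshold. In this sketch the 0.85 default and the function names are assumptions; the source does not state the actual threshold.

```python
from typing import Sequence


def judge_agreement(judge_labels: Sequence[str],
                    human_labels: Sequence[str]) -> float:
    """Fraction of the human-labeled subset where the judge agrees."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("human-labeled subset is empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def retrain_allowed(judge_labels: Sequence[str],
                    human_labels: Sequence[str],
                    min_agreement: float = 0.85) -> bool:
    """Block the retrain when the judge has drifted away from human labels."""
    return judge_agreement(judge_labels, human_labels) >= min_agreement
```

Raising on an empty subset is deliberate: a gate that passes vacuously would hide exactly the drift it is meant to catch.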

3. Release coupling with ScalarLM

A new ScalarLM checkpoint cannot be promoted without:

Regenerating the classifier eval set against the new checkpoint (outcome labels for some held-out prompts may flip: old ScalarLM lost, new ScalarLM wins, the label is now ScalarLM).
Retraining the classifier head on the updated set.
Confirming no regression on either model's eval, including the "definitely-novel" slice that must stay routed to ScalarLM.

The two release pipelines are joined; neither moves alone. Bumping ScalarLM in the Helm chart without a paired classifier image tag is a CI failure.

Shadow routing

For a small sample (default 1%) of confidently general requests — prompts the classifier labels general with p ≤ 0.1 — we additionally send the request to ScalarLM in the background. The response is judged by the ScalarLM-judge prompt described above, and the label is fed back into the training set. This is how we discover prompts that have crossed from "general" into "ScalarLM now handles this better."

We never shadow-route in the other direction. Sending novel prompts to Claude as a shadow would violate the IP invariant — the whole point of the system is that those prompts do not reach Claude under any circumstance.
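The sampling rule can be sketched as a single predicate. The function name and signature are hypothetical; the 1% rate and the p ≤ 0.1 cutoff are the defaults stated above. There is intentionally no counterpart function for shadowing toward Claude.

```python
import random


def should_shadow_to_scalarlm(label: str,
                              p_novel: float,
                              rng: random.Random,
                              sample_rate: float = 0.01,
                              max_p_novel: float = 0.1) -> bool:
    """Decide whether to also send a confidently general request to ScalarLM."""
    if label != "general" or p_novel > max_p_novel:
        return False  # only confidently general traffic is ever shadowed
    return rng.random() < sample_rate
```

Passing the random generator in explicitly makes the sampling reproducible in tests and keeps the eligibility check separate from the coin flip.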

Schedule and promotion

Daily Kubernetes CronJob:

Reads the previous day's audit log from the shared PVC (/var/split-brain/audit/, mounted read-only).
Builds the outcome-labeled training delta.
Retrains the head only (the encoder stays frozen).
Runs eval against the held-out + human-labeled sets.
If F1 improves by ≥ 1 pt with no regressions on the "definitely-novel" slice, publishes a new image tag.
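The final publish check in the steps above could be expressed as follows. This is a sketch under assumptions: "1 pt" is read as 0.01 absolute F1, the regression check is modeled as recall on the "definitely-novel" slice not decreasing, and the EvalReport type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    f1: float                       # overall F1 in [0, 1]
    definitely_novel_recall: float  # share of the slice routed to ScalarLM


def should_publish(current: EvalReport,
                   candidate: EvalReport,
                   min_f1_gain: float = 0.01) -> bool:
    """Publish only if F1 gains enough and the novel slice does not regress."""
    improved = candidate.f1 - current.f1 >= min_f1_gain
    no_regression = (candidate.definitely_novel_recall
                     >= current.definitely_novel_recall)
    return improved and no_regression
```

Publishing a tag is still distinct from promoting it: this gate only decides whether a candidate image is produced for human review.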

Promotion of the new tag to production is not automatic — a human approves the image bump in the chart's values file. Combined with the release-coupling rule, this gives us human gating on both ScalarLM and classifier promotions. The classifier is the only ML system in the routing loop; we keep a human in the release path while we build trust in the eval.

Code layout

classifier/
  pyproject.toml
  src/classifier/
    __init__.py
    app.py              # FastAPI service: /classify, /reload, /healthz
    model.py            # Prediction + heuristic (v0) model
    minilm.py           # MiniLM encoder + MLP head, load_head/predict
    schema.py           # request/response models
    config.py           # env parsing
    bootstrap/          # chunk / generate / dataset / train / eval (see above)
  tests/

The encoder ships inside the image; the trained head (head.safetensors) lives on the PVC at /var/split-brain/heads/, written by the UI bootstrap and loaded on startup / /reload.