Helm charts

We ship the system as a single umbrella Helm chart, split-brain, with one subchart per component. Helm is the only supported deployment path — there are no raw manifests to apply by hand.

Why Helm

  • One helm install (or Argo CD Application) brings the whole system up; one helm upgrade rolls it forward.
  • Per-environment values (dev, staging, prod) live in values-<env>.yaml. The structure is identical so diffs between environments are obvious.
  • helm template gives reviewers a deterministic rendered diff in CI before any change merges.

ScalarLM is not in this chart. It is a separately deployed service with its own upstream Helm chart; we configure the router to talk to it via SCALARLM_BASE_URL and, where ScalarLM requires auth, a credential Secret. The operator can deploy ScalarLM in another namespace, in another cluster, or any other way they like; split-brain only needs the URL.
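
As a rough illustration of that wiring, the router container's env block might look like the fragment below. The Secret name and key are the ones listed in the Secrets section of this document; the file location, the .Values path (which assumes the fragment sits inside the router subchart), and `optional: true` are assumptions, not the chart's confirmed contents.

```yaml
# Hypothetical fragment of the router subchart's templates/deployment.yaml.
# Sketch only; the real template may be laid out differently.
env:
  - name: SCALARLM_BASE_URL
    value: {{ .Values.config.scalarlmBaseUrl | quote }}
  - name: SCALARLM_API_KEY
    valueFrom:
      secretKeyRef:
        name: scalarlm-credentials   # fixed Secret name
        key: SCALARLM_API_KEY
        optional: true               # assumption: Secret is absent when ScalarLM runs without auth
```

Marking the reference optional would let the pod start when no scalarlm-credentials Secret exists, which matches the case where ScalarLM is deployed without auth.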

Chart layout

charts/split-brain/
  Chart.yaml                  # umbrella
  values.yaml                 # baseline defaults
  values-dev.yaml             # per-env overlays (in repo)
  values-staging.yaml
  values-prod.yaml
  templates/
    namespace.yaml
    pvc.yaml                  # split-brain-data (RWX default; dev overlay uses RWO)
    secrets.yaml              # anthropic / scalarlm / cloudflared / google Secrets
    _helpers.tpl              # naming, labels, image refs
  charts/                     # first-party subcharts in this repo
    router/
      Chart.yaml
      values.yaml
      templates/
        deployment.yaml
        service.yaml
        configmap.yaml
        serviceaccount.yaml
        hpa.yaml              # gated on .Values.autoscaling.enabled
        pdb.yaml
        networkpolicy.yaml
        _helpers.tpl
    classifier/
      ...                     # mirrors router/, minus hpa.yaml
    ui/
      ...                     # mirrors router/, minus hpa.yaml
    cloudflared/
      Chart.yaml
      values.yaml
      templates/
        deployment.yaml       # token mode: TUNNEL_TOKEN env, no config file
        serviceaccount.yaml
        pdb.yaml
        networkpolicy.yaml

Only the router subchart ships an hpa.yaml (gated on autoscaling.enabled); classifier and ui scale by fixed replicaCount. cloudflared has no ConfigMap — it runs in token mode and pulls its ingress rules from the Cloudflare dashboard at runtime (see cloudflare-tunnel.md).

All four first-party subcharts live in-repo. They are not published independently — the umbrella is the only release artifact.
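
To make "token mode" concrete, here is a minimal sketch of what the cloudflared container spec could look like. The Secret name and key are the ones this document uses; the file location and the exact args are assumptions based on how cloudflared is commonly run with a tunnel token, not a copy of the chart's template.

```yaml
# Hypothetical fragment of cloudflared/templates/deployment.yaml (token mode).
# The image line is omitted; the chart builds image refs from values.
containers:
  - name: cloudflared
    args: ["tunnel", "--no-autoupdate", "run"]   # no --config flag: nothing is read from a file
    env:
      - name: TUNNEL_TOKEN                       # cloudflared reads the token from this env var
        valueFrom:
          secretKeyRef:
            name: cloudflared-credentials
            key: TUNNEL_TOKEN
```

Because the token identifies the tunnel, hostnames and services can change in the Cloudflare dashboard without a chart upgrade.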

Values structure

The umbrella values.yaml exposes a global block plus one top-level key per subchart. Operators override values like this:

global:
  image:
    registry: ghcr.io/<org>/split-brain
    pullPolicy: IfNotPresent
    pullSecrets: []
  domain: router.example.com    # used by cloudflared ingress rules
  storage:
    pvcName: split-brain-data   # shared PVC for audit log, tokens, labels, usage, settings...
    accessMode: ReadWriteMany   # default; the dev overlay sets ReadWriteOnce (Civo single-node)
    storageClassName: ""        # RWX class (NFS/CephFS/EFS/...), or a block class for RWO
    size: 500Gi                 # audit log dominates; size for your traffic + retention
  secrets:
    # The chart creates the Secrets the subcharts consume.
    # Material lives in a separate *.secrets.yaml file (gitignored);
    # see the Secrets section below for the threat model.
    create: true
    anthropic:
      apiKey: ""                # supplied via values-*.secrets.yaml
    scalarlm:
      apiKey: ""                # optional; only when router.config.scalarlmBaseUrl set
    cloudflared:
      tunnelToken: ""           # base64 token from Cloudflare's "Install connector" panel

router:
  replicaCount: 3
  image:
    repository: router          # joined with global.image.registry
    tag: ""                     # defaults to .Chart.AppVersion
  resources:
    requests: {cpu: 250m, memory: 512Mi}
    limits:   {cpu: 1,    memory: 1Gi}
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  config:
    classifierThreshold: 0.4
    maxInflight: 256
    anthropicModel: claude-sonnet-4-6   # seed default; live default is admin-tunable (settings.json)
    scalarlmBaseUrl: https://scalarlm.internal.example.com/v1
    scalarlmModel: auto   # discover the served model from ScalarLM's /v1/models; or pin an explicit id
  # No per-subchart `secrets:` block. The umbrella's global.secrets
  # (above) is the single source of truth; subcharts reference the
  # generated Secrets by their fixed names (anthropic-api-key,
  # scalarlm-credentials, cloudflared-credentials, google-oauth).

classifier:
  replicaCount: 2
  config:
    keywordsPath: ""            # optional; if set, mount a ConfigMap with the keyword list

ui:
  replicaCount: 2
  config:
    # Cloudflare Access (edge auth) — optional
    cfTeamDomain: ""            # https://<team>.cloudflareaccess.com
    cfJwtAudience: ""
    devMode: false
    # Google sign-in (app-level auth — see docs/google-auth.md). Active once
    # googleClientId is set AND the google-oauth Secret exists.
    googleClientId: ""
    oauthRedirectUrl: "https://split-brain-ui.scalarxlm.com/auth/google/callback"
    allowedDomains: "smasint.com,relational.ai"
    allowedEmails: "[email protected]"
    sessionMaxAgeDays: "30"

cloudflared:
  replicaCount: 2
  tunnel:
    credentialsSecret: cloudflared-credentials
  # No ingress list here — token mode pulls hostnames/services from the
  # Cloudflare dashboard at runtime (see docs/cloudflare-tunnel.md).

Single-writer / RWO note. The base values show router.replicaCount: 3 and ui.replicaCount: 2, but the components that mount the PVC must run 1 replica on a ReadWriteOnce volume — the dev overlay sets replicaCount: 1 for router/classifier/ui. The UI in particular is the single writer of labels.jsonl / limits.json / settings.json, so it must stay at 1 while those live on the PVC. There are no ServiceMonitor / PodMonitor resources in the chart.
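
A sketch of the part of values-dev.yaml that this note describes is shown below. The access mode and replica counts come from the text above; disabling the router's autoscaling is an assumption, added because an enabled HPA with minReplicas: 2 would otherwise scale the router past one replica.

```yaml
# Sketch of the RWO-related overrides in values-dev.yaml (single-node cluster).
global:
  storage:
    accessMode: ReadWriteOnce
router:
  replicaCount: 1
  autoscaling:
    enabled: false   # assumption: keeps the HPA from overriding replicaCount
classifier:
  replicaCount: 1
ui:
  replicaCount: 1
```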

Conventions

  • global.image.registry is joined with each component's image.repository. No component references a full image string.
  • Image tags default to .Chart.AppVersion — bumping the umbrella's appVersion rolls all four images forward as a unit. Operators can pin individual components by setting <component>.image.tag.
  • Subcharts reference Secrets by their fixed names; secret values are never inlined into a workload spec. The chart refuses to render if a referenced Secret name is empty.
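
The registry/repository/tag convention can be sketched as a named template. This is an illustration under assumptions: the helper name example.image is hypothetical, and it takes the calling subchart's root context rather than reproducing whatever signature the chart's own image helper has.

```yaml
{{/*
Sketch: build "<registry>/<repository>:<tag>" for the calling subchart.
Assumes it is included with the subchart's root context, so .Values.image
is already component-specific and .Values.global comes from the umbrella.
*/}}
{{- define "example.image" -}}
{{- $tag := default .Chart.AppVersion .Values.image.tag -}}
{{- printf "%s/%s:%s" .Values.global.image.registry .Values.image.repository $tag -}}
{{- end -}}
```

With the baseline values shown earlier (registry ghcr.io/<org>/split-brain, repository router, empty tag), this would render ghcr.io/<org>/split-brain/router:<appVersion>. One caveat of this sketch: inside a subchart, .Chart refers to the subchart's own Chart.yaml, so it only tracks the umbrella's appVersion if the subchart versions are bumped in lockstep.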

Secrets

The subcharts consume Secrets by fixed name. Where those Secrets come from depends on global.secrets.create, which selects one of two modes:

Chart-creates-Secrets mode (default, create: true)

The chart renders up to four Secret resources from values; the last two are conditional:

Secret name              Keys                         Source value                            Required?
anthropic-api-key        ANTHROPIC_API_KEY            global.secrets.anthropic.apiKey         yes
cloudflared-credentials  TUNNEL_TOKEN                 global.secrets.cloudflared.tunnelToken  yes
scalarlm-credentials     SCALARLM_API_KEY             global.secrets.scalarlm.apiKey          only if router.config.scalarlmBaseUrl is set
google-oauth             clientSecret, sessionSecret  global.secrets.google.*                 only if ui.config.googleClientId is set

The cloudflared token is used verbatim as the TUNNEL_TOKEN env (token mode).

If a required value is empty, the chart refuses to render with a clear message naming the missing field — silent broken-Secret installs are not allowed.

The scalarlm-credentials Secret is created only when its value is non-empty (ScalarLM deployments without auth are valid).
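
The two behaviours just described (fail on a missing required value, skip an optional Secret when its value is empty) map onto Helm's `required` function and a `with` block. The fragment below is a minimal sketch of how templates/secrets.yaml could do this; it is not the chart's actual template, and the error message text is made up.

```yaml
{{- if .Values.global.secrets.create }}
apiVersion: v1
kind: Secret
metadata:
  name: anthropic-api-key
type: Opaque
stringData:
  # `required` aborts rendering with this message when the value is empty.
  ANTHROPIC_API_KEY: {{ required "global.secrets.anthropic.apiKey must be set when global.secrets.create is true" .Values.global.secrets.anthropic.apiKey | quote }}
{{- with .Values.global.secrets.scalarlm.apiKey }}
---
# Rendered only when the ScalarLM key is non-empty.
apiVersion: v1
kind: Secret
metadata:
  name: scalarlm-credentials
type: Opaque
stringData:
  SCALARLM_API_KEY: {{ . | quote }}
{{- end }}
{{- end }}
```

Using stringData keeps the template free of manual base64 encoding; the API server encodes the values on write.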

Where the secret material lives

NEVER commit a values file containing real secret material to git. The convention this repo uses:

values-prod.yaml              # ← committed; structure + non-secret config
values-prod.secrets.yaml      # ← gitignored (*.secrets.yaml); secret values only

Install/upgrade passes both files; Helm deep-merges them with later files winning:

helm upgrade --install split-brain charts/split-brain \
    -f values-prod.yaml \
    -f values-prod.secrets.yaml

*.secrets.yaml and *.secrets.yml are in the repo's .gitignore. Operators store the production secret overlay in their team's credential vault (1Password, AWS Secrets Manager, etc.) and check it out only at deploy time.
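
To illustrate the merge, here is a small example with made-up file contents (three YAML documents: the two inputs and the effective result). Helm merges maps key by key, so the secrets overlay adds to global.secrets rather than replacing it; lists, by contrast, are replaced wholesale by the later file.

```yaml
# values-prod.yaml (committed; illustrative contents)
global:
  domain: router.example.com
  secrets:
    create: true
---
# values-prod.secrets.yaml (gitignored; illustrative contents)
global:
  secrets:
    anthropic:
      apiKey: sk-ant-...
---
# Effective values seen by the templates after both -f files are applied
global:
  domain: router.example.com
  secrets:
    create: true
    anthropic:
      apiKey: sk-ant-...
```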

A minimal values-prod.secrets.yaml:

global:
  secrets:
    anthropic:
      apiKey: sk-ant-...
    cloudflared:
      # Paste the base64 token verbatim from the Cloudflare dashboard
      # (Zero Trust → Networks → Tunnels → <your tunnel> → Install
      # connector → the string after `cloudflared service install`). The
      # chart stores it as the `TUNNEL_TOKEN` env (token mode); cloudflared
      # fetches its ingress rules from Cloudflare at runtime.
      tunnelToken: eyJh...
    scalarlm:
      apiKey: sk-...

Bring-your-own-Secret mode (create: false)

For teams that manage secrets out of band (sealed-secrets, external-secrets-operator, vault-injector, or kubectl create secret manually), set global.secrets.create: false. The chart skips the Secret resources entirely and the subcharts continue to reference the same names as in the table above — the operator is responsible for having created Secrets with those exact names.
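
For example, an operator creating the Anthropic Secret by hand would need a manifest along these lines. The name and key must match the table above; the namespace is an assumption taken from the install command later in this document.

```yaml
# Operator-managed Secret for create: false mode (sketch).
apiVersion: v1
kind: Secret
metadata:
  name: anthropic-api-key      # must match the fixed name the subcharts reference
  namespace: split-brain       # assumption: the namespace the release is installed into
type: Opaque
stringData:
  ANTHROPIC_API_KEY: sk-ant-...
```

Tools such as sealed-secrets or external-secrets-operator would produce an equivalent Secret from their own source resources.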

Router-client bearer tokens are not Helm secrets

The sbk_<…> tokens clients put in Authorization: Bearer … are not Kubernetes Secrets. They are self-served by authenticated users through the UI's Tokens view (see ui.md) and persisted as sha256(token) in one file per token on the shared PVC. This keeps human credential lifecycle (create / list / revoke) out of the operator's path and removes a class of "the team is sharing one key in a chat" failures.

Storage

All durable state (audit log, router tokens, bootstrap source documents, trained heads, labels, limits, settings, usage) lives on a single shared PVC mounted at /var/split-brain/ in each pod that needs it. The chart does not use S3 or any other object store; this is a project-wide choice (see global.storage in the values structure).

Path                         Writers                 Readers
/var/split-brain/audit/      router                  ui (RO)
/var/split-brain/tokens/     ui, router (last-used)  router
/var/split-brain/bootstrap/  ui (upload)             ui (training)
/var/split-brain/heads/      ui (trained head)       classifier (/reload)
/var/split-brain/labels/     ui                      ui (training)
/var/split-brain/limits/     ui (admin)              router (quota)
/var/split-brain/settings/   ui (admin)              router (default model)
/var/split-brain/usage/      router (per-pod)        ui (admin, RO)

The router mounts the whole PVC; the UI mounts the individual subPaths above (some RO) because its root filesystem is read-only.
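
A sketch of what per-subPath mounting could look like in the ui pod spec follows. Only two of the paths are shown; the volume name and the subPath directory names are assumptions that mirror the table above.

```yaml
# Hypothetical fragment of the ui Deployment's pod spec.
containers:
  - name: ui
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: data
        mountPath: /var/split-brain/audit
        subPath: audit          # assumed directory name on the PVC
        readOnly: true          # ui only reads the audit log
      - name: data
        mountPath: /var/split-brain/labels
        subPath: labels         # ui is the writer here, so no readOnly
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: split-brain-data
```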

The PVC's access mode is set by global.storage.accessMode. ReadWriteMany (NFS, CephFS, EFS, …) lets the PVC-mounting components scale horizontally. The current Civo deployment only has ReadWriteOnce block storage, so the dev overlay sets accessMode: ReadWriteOnce and pins router/classifier/ui to a single replica (all scheduled to the same node via WaitForFirstConsumer).
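
As a sketch of how global.storage could feed the claim, the umbrella's pvc.yaml might look roughly like this. It is an illustration under assumptions, not the chart's actual template.

```yaml
# Sketch of templates/pvc.yaml driven by global.storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Values.global.storage.pvcName }}
spec:
  accessModes:
    - {{ .Values.global.storage.accessMode }}
  {{- with .Values.global.storage.storageClassName }}
  storageClassName: {{ . | quote }}
  {{- end }}
  resources:
    requests:
      storage: {{ .Values.global.storage.size }}
```

In this sketch an empty storageClassName omits the field, so the cluster's default StorageClass applies. That is a design choice of the sketch: emitting `storageClassName: ""` explicitly would instead disable dynamic provisioning.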

Backup is the operator's responsibility (PV snapshots or external sync) — the chart does not implement it.

The classifier subchart must not receive anthropic-api-key. A pre-install validation template fails rendering if it is referenced from the classifier subchart's secret map — enforcing the project invariant that no LLM step in classifier retraining ever calls Claude (see classifier.md).

Likewise, a pre-install validation template fails fast with a clear message if any required secret value is missing.
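
The classifier guard described above can be expressed with Helm's `fail` function. The fragment below is a sketch that assumes a hypothetical classifier.secrets map (env-var name to Secret name) and lives in the umbrella's templates; the real values layout and message may differ.

```yaml
{{/*
Sketch of a render-time guard. "classifier.secrets" is a hypothetical map of
env-var name to Secret name, with string values.
*/}}
{{- range $envName, $secretName := (.Values.classifier.secrets | default (dict)) }}
{{- if eq $secretName "anthropic-api-key" }}
{{- fail (printf "classifier.secrets.%s must not reference anthropic-api-key" $envName) }}
{{- end }}
{{- end }}
```

A guard like this runs whenever the chart is rendered, so helm template in CI would catch the violation before any install or upgrade.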

Templating discipline

  • Every resource gets a name from include "split-brain.fullname" and labels from include "split-brain.labels". Both live in the umbrella _helpers.tpl and are reused by subcharts.
  • No string concatenation for image refs — use the split-brain.image helper that takes a component name.
  • Conditional resources (e.g. HorizontalPodAutoscaler) are guarded by a single boolean in values. No multi-level conditional nesting.
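  • Secret resources are templated in exactly one place, the umbrella's templates/secrets.yaml, and only when global.secrets.create is true. Subcharts never template Secrets; they reference them by name.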
  • helm lint --strict and helm template ... | kubeconform run in CI on every PR.
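
The naming, labelling, and single-boolean rules above can be seen together in a sketch of the router's hpa.yaml. The helper names and values keys come from this document; the assumption is that both helpers accept the root context, as in the standard Helm scaffold.

```yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "split-brain.fullname" . }}
  labels:
    {{- include "split-brain.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "split-brain.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
{{- end }}
```

The whole resource sits behind one `if`, so setting autoscaling.enabled to false removes it from the rendered output entirely.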

Release process

Current reality: there is no CI image build or Argo CD in this repo yet. Releases are driven by the ./split-brain CLI — ./split-brain build --remote (builds amd64 images on the build host and pushes them) then ./split-brain deploy <env> (helm upgrade + rollout). See cli.md and deploy.md. The flow below is the target.

  1. Bump appVersion and version in Chart.yaml. SemVer for the chart, image tag for appVersion. Bump both even if only one component changed — the umbrella is the unit of release.
  2. CI builds and pushes all four images at the new tag.
  3. CI publishes the chart to an OCI registry (oci://ghcr.io/<org>/split-brain-charts). We do not publish to the legacy index.yaml repo format.
  4. Argo CD (or helm upgrade) picks up the new chart version per environment, gated by whatever approval policy the environment has.

Chart.lock is committed; subchart upgrades are deliberate PRs, not silent floats.
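
For step 1, the two fields being bumped sit side by side in the umbrella's Chart.yaml. The sketch below uses placeholder version numbers and a made-up description, and leaves out the dependencies list.

```yaml
# charts/split-brain/Chart.yaml (sketch; version numbers are placeholders)
apiVersion: v2
name: split-brain
description: Umbrella chart for split-brain
type: application
version: 0.0.0        # chart version, SemVer
appVersion: "0.0.0"   # image tag used when <component>.image.tag is empty
```

Quoting appVersion keeps YAML from parsing a tag such as 1.10 as a number.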

Local development

helm dep update charts/split-brain
helm template demo charts/split-brain \
  -f charts/split-brain/values-dev.yaml > /tmp/rendered.yaml
kubectl apply --dry-run=server -f /tmp/rendered.yaml

For a real local install:

helm install demo charts/split-brain \
  -n split-brain --create-namespace \
  -f charts/split-brain/values-dev.yaml

The dev overlay points scalarlmBaseUrl at a developer-local ScalarLM (or a shared dev one) so the whole stack can come up on a kind/minikube cluster without provisioning GPUs.

What lives in this repo vs upstream

In this repo                                                                         Upstream / out of repo
Umbrella chart and four first-party subcharts (router, classifier, ui, cloudflared)  ScalarLM (separate Helm chart, separate deployment)
Per-env values overlays (dev/staging/prod)                                           Secret material
CI for lint, template, kubeconform                                                   Argo CD Application manifests (separate infra repo)
                                                                                     Cloudflare DNS records and Access policies