split-brain

Cloudflare tunnel

We expose the router over the internet using cloudflared running as a Kubernetes Deployment. The cluster has no public ingress: all inbound traffic to the router arrives through an outbound TLS connection that cloudflared opens to Cloudflare's edge.

Why a tunnel instead of a LoadBalancer

  • The cluster does not need a public IP, an Ingress controller, or a cloud LB. One fewer attack surface.
  • DDoS, WAF, and bot management run at Cloudflare's edge before the request ever reaches our pods.
  • Tunnel credentials can be revoked without touching DNS or firewall rules.

Trade-off: every request takes an extra hop through Cloudflare's edge. In practice this adds ~5–15 ms at the median, which is acceptable for an LLM endpoint where inference time dominates p50 latency.

One-time setup (operator)

We run cloudflared in token mode (a "remotely-managed" tunnel): the tunnel and its ingress rules live in the Cloudflare dashboard, and a single token encodes the tunnel identity. There is no local credentials JSON and no config.yaml ConfigMap.

In the Cloudflare dashboard (Zero Trust → Networks → Tunnels):

  1. Create a tunnel named split-brain; copy the install token (the base64 string shown after cloudflared service install …).
  2. Add Public Hostnames mapping each external hostname to its in-cluster service, e.g.:
       - split-brain-router.<zone> → http://<release>-router:8080
       - split-brain-ui.<zone> → http://<release>-ui:8080

The token is supplied to the chart as global.secrets.cloudflared.tunnelToken (in a gitignored values-*.secrets.yaml); the umbrella renders it into the cloudflared-credentials Secret under the key TUNNEL_TOKEN, which the Deployment reads as the TUNNEL_TOKEN env. See helm.md § Secrets.

Ingress rules

Ingress rules are not stored in the cluster: cloudflared runs with just tunnel --metrics 0.0.0.0:2000 run and fetches its hostname→service mapping from Cloudflare's control plane at runtime (the dashboard's Public Hostnames). To change what is exposed, edit the tunnel in the dashboard; no redeploy is needed.

Two hostnames are exposed today — the router and the UI — each mapped to its in-cluster Service. The router carries the OpenAI/Anthropic API; the UI is the operator console (gated by Google sign-in, see google-auth.md).

Put a Cloudflare Access application that requires a service token in front of the router hostname (split-brain-router.<zone> in the example above). Clients send both:

  • CF-Access-Client-Id / CF-Access-Client-Secret headers (edge).
  • Authorization: Bearer <key> header (router).

This gives us defense in depth: edge revocation if a token leaks, router-side bearer check independent of Cloudflare.
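A minimal client-side sketch of sending both credentials, using only the Python standard library. The header names are the ones listed above; the function name build_request and any URL passed to it are illustrative assumptions, not part of the repo.

```python
import urllib.request


def build_request(url: str, access_id: str, access_secret: str,
                  api_key: str) -> urllib.request.Request:
    """Build a request that carries both the edge and the router credentials."""
    headers = {
        # Checked by Cloudflare Access at the edge (service token).
        "CF-Access-Client-Id": access_id,
        "CF-Access-Client-Secret": access_secret,
        # Checked by the router itself, independently of Cloudflare.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

A caller would pass the result to urllib.request.urlopen(req, timeout=...) with the router's public hostname. Keeping the three values in separate secrets means an edge token can be rotated without reissuing router keys, and vice versa.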

For service-to-service callers that cannot easily send extra headers, we can scope an Access bypass policy by IP range. Avoid this for human-facing clients.

Replicas and health

We run two cloudflared replicas so a pod restart does not drop the tunnel. Cloudflare maintains independent connections from each replica to the edge and load-balances across them automatically.

The --metrics 0.0.0.0:2000 flag exposes /metrics (Prometheus) and /ready (used by Kubernetes probes).
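For debugging outside the kubelet, the same /ready check can be sketched in a few lines of Python. This assumes the endpoint answers 200 while the replica holds at least one connection to the edge and a non-2xx status otherwise; the function name tunnel_ready and the base URL are hypothetical (for example, a kubectl port-forward to port 2000).

```python
import urllib.error
import urllib.request


def tunnel_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses;
        # refused connections and timeouts also end up here.
        return False
```

Calling tunnel_ready("http://127.0.0.1:2000") against a port-forwarded pod gives a quick yes/no answer per replica.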

Hazard: the tunnel token defines the origin, not the cluster. Any cloudflared that starts with this tunnel's credentials becomes a live replica and Cloudflare load-balances the public hostnames across all of them. If the chart is ever deployed to a second cluster with the same cloudflared-credentials secret, that cluster starts serving (or failing) real production traffic — see the 2026-06-04 wrong-cluster incident in deploy.md. The deploy-time cluster guard exists to prevent exactly this; be deliberate about where the tunnel secret is installed.
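To make the idea of a deploy-time guard concrete, here is a hedged sketch. It is not the guard described in deploy.md: the function names and the allow-list of context names are hypothetical, and the only external command used is kubectl config current-context.

```python
import subprocess


def current_kube_context() -> str:
    """Return the name of the active kubectl context."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def assert_tunnel_cluster(context: str, allowed: set[str]) -> None:
    """Raise unless `context` is a cluster allowed to hold the tunnel secret."""
    if context not in allowed:
        raise RuntimeError(
            f"refusing to install cloudflared-credentials into {context!r}; "
            f"allowed contexts: {sorted(allowed)}"
        )
```

A deploy script could call assert_tunnel_cluster(current_kube_context(), allowed) before rendering the Secret, so a wrong context fails before any cloudflared pod starts with production credentials.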

Egress requirements

cloudflared needs egress to region1.argotunnel.com:7844 and region2.argotunnel.com:7844 (TCP). Both are covered by allowing *.argotunnel.com:7844 and *.cloudflare.com:7844 in the NetworkPolicy. No inbound ports.
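When a tunnel fails to connect, a plain TCP reachability check run from a pod subject to the same NetworkPolicy can help separate policy problems from credential problems. The sketch below is an assumption-labelled helper (can_reach is a hypothetical name); it only tests that a TCP connection opens and says nothing about TLS or tunnel authentication.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_reach("region1.argotunnel.com", 7844) and can_reach("region2.argotunnel.com", 7844) should both return True from inside the cloudflared namespace if the egress rules are in effect.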

Failure modes

Failure                       | Result                           | Recovery
------------------------------|----------------------------------|---------------------------------------------------------------------------
Single cloudflared pod crash  | Other replica carries traffic.   | k8s restart.
Both replicas down            | Hostname returns 530 at the edge. | k8s rolls them.
Cloudflare edge outage        | Hostname unreachable.            | Wait or fail over to a secondary tunnel in another zone (out of scope v1).
Credentials revoked / rotated | Tunnel fails to start.           | Recreate Secret, restart Deployment.

What lives in this repo

  • The cloudflared subchart under charts/split-brain/charts/cloudflared/: Deployment, NetworkPolicy, PodDisruptionBudget, ServiceAccount. No ConfigMap (token mode) and no Dockerfile — the chart pulls the upstream cloudflare/cloudflared image pinned to a specific tag (image.tag in the subchart values), not a wrapper image we build.

What is not in the repo: the tunnel token, the DNS records, the Public Hostname mappings, or any Access policies. Those are operator-managed in the Cloudflare dashboard.