Why I'm Building Another Orchestrator
Everyone tells you not to build your own cluster manager. They're mostly right. Kubernetes exists, it works, it's used all over the place, and knowing it is better for your job prospects. Building another one is the kind of decision that gets you eyerolls in conversation.
So let me undermine that immediately: I'm building one anyway, it's called Orbit, and the reason isn't that Kubernetes is bad. The reason is that we now have something we didn't have in 2014: a *decade of written-down regret*. Borg's authors published what they'd do differently. Omega showed the way out of the single-scheduler trap. Kubernetes ran the experiment at planetary scale and the scar tissue is all in public. You can sit down today, read every retrospective, and design the system you'd build if you got to pick each tradeoff on purpose instead of inheriting it.
That's the whole pitch. Not "Kubernetes is wrong." Just: *we know more now, and almost nobody has gone back to the drawing board with the knowledge in hand.*
This is the first post in a series where I build that system out loud, one subsystem at a time, in the order you'd actually build them. This post is the why. The rest are the how.
Three things Borg got right
Orbit keeps three of Borg's properties. I don't think you can drop any of them and still claim to have improved on it.
The first is utilization. Borg packs machines tightly, over-commits on purpose, and reclaims the difference between what a job reserves and what it actually touches from moment to moment. Around a fifth of the workload runs in that reclaimed gap. At fleet scale the gap is worth a data center.
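As a rough sketch of that reclamation arithmetic: the `Task` type, the per-task safety margin, and the function name below are all hypothetical, chosen for illustration rather than taken from Borg or Orbit. A real system would estimate usage over a window instead of trusting one sample.

```rust
/// One task's CPU accounting, in millicores. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy)]
pub struct Task {
    /// What the job asked for.
    pub reserved_millis: u64,
    /// What it is actually touching right now.
    pub used_millis: u64,
}

/// Capacity that could be lent to lower-priority work: the gap between
/// reservation and usage, minus a safety margin held back per task.
/// Saturating subtraction keeps a task that is over its margin at zero
/// instead of underflowing.
pub fn reclaimable_millis(tasks: &[Task], safety_margin_millis: u64) -> u64 {
    tasks
        .iter()
        .map(|t| {
            t.reserved_millis
                .saturating_sub(t.used_millis)
                .saturating_sub(safety_margin_millis)
        })
        .sum()
}
```

A task reserving 4000 and using 1500 with a 250 margin contributes 2250 to the reclaimable pool. A task using 1900 of its 2000 contributes nothing.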
The second is the cell as a failure domain. A cell runs a few thousand to a few tens of thousands of machines as one unit. Bigger cells would have scaled fine; Google capped them anyway. A bounded failure domain was the goal, not a side effect of some scaling limit.
The third is that the control plane can be down without the cluster going down with it. The Borgmaster dies and the tasks already running keep running. The agent on each machine doesn't start killing containers because it lost its uplink. You want the heart beating while the brain reboots.
Drop any of these and you've gone backwards.
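The third property, the agent that carries on without its uplink, comes down to one rule in the reconcile step. Here is a minimal sketch of that rule, assuming a hypothetical `reconcile` function and `Actions` type that are not part of any real agent's API: when the control plane is unreachable, the last known assignment stands.

```rust
use std::collections::BTreeSet;

/// What the agent should do this tick. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct Actions {
    pub start: Vec<String>,
    pub stop: Vec<String>,
}

/// `desired` is `None` when the control plane cannot be reached.
pub fn reconcile(running: &BTreeSet<String>, desired: Option<&BTreeSet<String>>) -> Actions {
    match desired {
        // No uplink: start nothing, stop nothing. Running tasks keep running.
        None => Actions { start: vec![], stop: vec![] },
        // Uplink present: converge on the desired set.
        Some(want) => Actions {
            start: want.difference(running).cloned().collect(),
            stop: running.difference(want).cloned().collect(),
        },
    }
}
```

The point of the design is that silence from the control plane is never read as an instruction to stop.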
Three things Borg got wrong, and admitted
The same people were honest about what didn't work, which saves me from guessing.
Grouping was the first miss. The job was the only container for related work, so a service built from several jobs had nowhere to say those jobs belonged together. People wrote the relationship into the job names and enforced it with documentation and habit. Nothing in the system understood it.
Networking was the second. Every task shared its machine's IP and carved a port range out of a shared space, which dragged port allocation into the scheduler and made it everybody's problem.
Configuration was the third. BCL grew to something like 230 parameters over its life. It served the power user who wanted a knob for everything and punished the much larger group who wanted to start a web server without first reading the manual.
These are documented regrets, not my hot takes. Repeating them now would be hard to defend.
What Kubernetes did to itself
Kubernetes took Borg's lessons and fixed several of them, then picked up a different set of problems that look avoidable from here.
The expensive one is that the consistent store holds everything. Pod status, node heartbeats, events, the entire high-churn stream that nobody needs to be linearizable, all of it lands in the same etcd that holds the handful of facts that genuinely must be agreed. Put that load on Raft and on watch fan-out, and those two become the ceiling. Clusters rarely fall over from lack of compute. They fall over because they ran out of consensus.
Scaling is pinned to the cluster, so the day you outgrow one you find yourself running a federation, plus the ecosystem of tooling that exists to make several clusters impersonate one. Borg met the same pressure by making the cell bigger.
Configuration drifted back into a mess from the other direction: YAML with templating bolted on, which is how the industry ended up with Helm and Kustomize and a lot of engineers who can debug indentation under deadline. Extension took a parallel path. The CRD-and-operator pattern is genuinely powerful, but once it's the only first-class way to extend anything, every feature outside the core arrives as one more controller you install and babysit. And namespaces spent years dressed up as a security boundary they were never built to be.
None of these is a dumb call on its own, which is what makes them worth studying. They rhyme. Each is the same instinct reached for again: make the system uniform. One store for all state, the cluster as the sole scaling unit, one configuration language, one extension mechanism, one undifferentiated pool of objects addressed by labels. Uniformity is easy to explain and looks clean on a slide. The invoice shows up later, paid by whichever parts of the system never wanted to be uniform.
Avoidable versus inherent
That hands me the distinction the rest of the design runs on. The avoidable problems are almost all over-uniformity. The hard problems are a different kind of thing: tradeoffs, and a tradeoff has no correct answer, only a chosen one.
A few of them never go away regardless of how good the engineering is:
- consistency against churn
- utilization against predictability
- centralization against blast radius
- expressiveness against approachability
No design wins all four. What you can do is settle each one on purpose, put the decision where it actually belongs instead of applying a single answer everywhere, and then spend most of your effort hiding the consequences so an ordinary user never has to think about them. That hiding is the real work, and most of this series is about it.
Most of the architecture follows from that
Commit to placing tradeoffs instead of unifying them and a lot of Orbit stops being a choice. It turns into a consequence.
State splits in two, because the churny health-and-usage data and the must-be-agreed commitments share nothing except that etcd kept them in one box. Orbit holds commitments in a small linearizable ledger and pushes everything that changes constantly into a separate, eventually consistent plane. The firehose never touches consensus. That split is the next post, and the one I'd defend hardest.
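To make the split concrete, here is a toy in-memory sketch. Every type and method name in it is an assumption made for illustration, not Orbit's actual API. A `Vec` stands in for the linearizable ledger and a last-writer-wins map stands in for the eventually consistent plane.

```rust
use std::collections::HashMap;

/// Facts that must be agreed on. Hypothetical variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Commitment {
    Placed { capsule: String, machine: String },
    Released { capsule: String },
}

/// High-churn observations where a stale read is acceptable.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub machine: String,
    pub cpu_used_millis: u64,
}

#[derive(Default)]
pub struct State {
    /// Stand-in for the ledger: an ordered, append-only log.
    pub ledger: Vec<Commitment>,
    /// Stand-in for the telemetry plane: last writer wins per machine.
    pub telemetry: HashMap<String, Observation>,
}

impl State {
    /// The only path that would go through consensus in a real system.
    /// Returns the entry's 1-based position in the log.
    pub fn commit(&mut self, c: Commitment) -> usize {
        self.ledger.push(c);
        self.ledger.len()
    }

    /// Overwrites the previous sample for that machine; never touches the ledger.
    pub fn observe(&mut self, o: Observation) {
        self.telemetry.insert(o.machine.clone(), o);
    }
}
```

A thousand usage samples from one machine leave the ledger exactly as long as it was, which is the property the split is meant to buy.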
Scheduling stops being a single component. Omega's contribution was the observation that several schedulers can share one optimistic view of cluster state, each proposing placements, with collisions resolved at commit. Borg took half that step. Orbit treats running many schedulers at once as the ordinary case rather than an advanced configuration.
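One simple way to resolve collisions at commit is a per-machine version check, sketched below. This is an illustration of optimistic concurrency in general, not Omega's actual transaction mechanism or Orbit's real commit path, and all the names in it are hypothetical.

```rust
use std::collections::HashMap;

/// A placement proposal made against a snapshot of cell state.
pub struct Proposal {
    pub capsule: String,
    pub machine: String,
    pub cpu_millis: u64,
    /// Version of the machine's record the scheduler saw when deciding.
    pub seen_version: u64,
}

#[derive(Debug, PartialEq)]
pub enum CommitError {
    UnknownMachine,
    /// Another scheduler committed first; re-read and retry.
    Conflict,
    InsufficientCpu,
}

pub struct Machine {
    pub free_cpu_millis: u64,
    pub version: u64,
}

#[derive(Default)]
pub struct CellState {
    pub machines: HashMap<String, Machine>,
}

impl CellState {
    /// Compare-and-set per machine: the commit succeeds only if nothing
    /// changed on that machine since the proposer took its snapshot.
    /// Returns the machine's new version on success.
    pub fn commit(&mut self, p: &Proposal) -> Result<u64, CommitError> {
        let m = self
            .machines
            .get_mut(&p.machine)
            .ok_or(CommitError::UnknownMachine)?;
        if m.version != p.seen_version {
            return Err(CommitError::Conflict);
        }
        if m.free_cpu_millis < p.cpu_millis {
            return Err(CommitError::InsufficientCpu);
        }
        m.free_cpu_millis -= p.cpu_millis;
        m.version += 1;
        Ok(m.version)
    }
}
```

Two schedulers that both saw version 7 of the same machine cannot both win. The second gets `Conflict`, re-reads, and discovers on retry whether its placement still fits.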
A few more fall straight out of the lens. Grouping gets to be hierarchical and queryable at the same time, so you're not forced to choose between Borg's job-only world and Kubernetes' bag of labels. Networking keys on a cryptographic identity issued to each workload, which means routing and policy and naming stop caring what address something landed on, and the overlay-and-IPAM machinery becomes unnecessary weight. Sizing moves to a control loop that watches real usage and sets the numbers, because asking every user to hand-pick CPU and memory produces either waste or a pager going off at 3am. Machine learning gets gang scheduling, topology-aware placement, and checkpointing as first-class concerns, because the bin-packing model comes apart the moment the thing deciding throughput is the network between the accelerators rather than the cores on any one box.
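The sizing loop is the easiest of these to sketch. The policy below (a nearest-rank percentile of recent usage, plus headroom, approached in bounded steps) is an assumption picked for illustration, and the function names are hypothetical. It is not a description of how Orbit's right-sizer will actually decide.

```rust
/// Recommend a CPU request (millicores) from recent usage samples.
/// `percentile` and `headroom_pct` are whole percentages, e.g. 90 and 20.
/// Returns `None` when there is no usage data to act on.
pub fn recommend_millis(samples: &[u64], percentile: u64, headroom_pct: u64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // Nearest-rank percentile: ceil(percentile * n / 100), clamped to 1..=n.
    let rank = ((percentile * n + 99) / 100).clamp(1, n);
    let base = sorted[(rank - 1) as usize];
    // Add headroom, rounding up.
    Some((base * (100 + headroom_pct) + 99) / 100)
}

/// Move the current setting toward the target by at most `max_step`
/// per iteration, so one noisy window cannot make the loop thrash.
pub fn step_toward(current: u64, target: u64, max_step: u64) -> u64 {
    if target > current {
        current + (target - current).min(max_step)
    } else {
        current - (current - target).min(max_step)
    }
}
```

The step bound matters as much as the estimate: a right-sizer that jumps straight to each new recommendation turns measurement noise into restarts.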
The boundary between cells stays hard on purpose. A Constellation fails by itself, and the layer above it can suggest where work should go without the authority to order anything. That preserves Borg's containment without sliding into the accidental multi-cluster management Kubernetes leaves you doing by hand.
Rust sits underneath all of it, and the reasons are concrete rather than fashionable: no garbage collector putting a pause in the tail latency, no dangling pointer in the one process you cannot afford to corrupt, and a type system sharp enough that an entire class of distributed-systems bug refuses to compile. Each of those maps to a specific lesson, and they'll keep showing up post after post as the thing quietly doing the work.
The names
Orbit wears a spaceflight vocabulary, partly because the old satellite-operations acronym TT&C (Telemetry, Tracking & Command) lines up almost exactly with what a control plane does, and partly because names you can say out loud make a system easier to argue about. The glossary, so the rest of the series reads cleanly:
- **Constellation**: a set of machines run as one unit and one independent failure domain. Our cell. Target size 10k–50k machines.
- **Ephemeris**: the strongly-consistent commitment ledger. In astronomy an ephemeris tells you where bodies will be; ours records what's been promised. Kept deliberately tiny.
- **Telemetry plane**: the high-churn, eventually-consistent observation plane. Health, usage, liveness. The firehose that stays out of the ledger.
- **Airlock**: admission control and the API gateway. Nothing enters without passing through it.
- **Navigator**: a scheduler. Several specialized ones run at once.
- **Stationkeeper**: the closed-loop right-sizer. Stationkeeping is the small continuous thrust that holds a satellite in its slot. Same job here.
- **Transponder**: a workload's cryptographic identity. **Beacon** is the discovery service built on it.
- **Satellite**: the per-machine agent. Runs containers, enforces limits, reports telemetry, and carries on if the control plane vanishes.
- **Fleet plane**: the thin advisory layer sitting above many Constellations.
The object model runs top to bottom. A **Mission** is a whole application, the thing Borg had no name for. It contains **Modules**, each a set of identical replicas of one binary. Modules schedule as **Capsules**, a resource envelope on a single machine, roughly a pod. Inside a Capsule run **Payloads**, the actual containers. A **Gang** is a set of Capsules that schedule all-or-nothing. That's the whole noun list, and it's enough to read every post that follows.
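The noun list maps onto types fairly directly. In the sketch below the field names are illustrative guesses, not a committed schema, and `place_gang` uses naive first-fit purely to show the all-or-nothing rule. A real Navigator would search harder than this.

```rust
/// One container.
pub struct Payload {
    pub image: String,
}

/// A resource envelope on a single machine, roughly a pod.
pub struct Capsule {
    pub payloads: Vec<Payload>,
    pub cpu_millis: u64,
}

/// Identical replicas of one binary; each replica schedules as one Capsule.
pub struct Module {
    pub name: String,
    pub replicas: u32,
    pub cpu_millis_per_replica: u64,
}

/// A whole application.
pub struct Mission {
    pub name: String,
    pub modules: Vec<Module>,
}

impl Mission {
    /// Number of Capsules the Mission expands to when fully scheduled.
    pub fn capsule_count(&self) -> u32 {
        self.modules.iter().map(|m| m.replicas).sum()
    }
}

/// All-or-nothing placement for a Gang: place every Capsule or none.
/// `free_cpu[i]` is the free CPU on machine `i`. Works on a scratch copy,
/// so a failed attempt leaves the caller's view untouched. First-fit can
/// miss a feasible packing; that limitation is accepted for brevity.
pub fn place_gang(free_cpu: &[u64], gang: &[Capsule]) -> Option<Vec<usize>> {
    let mut scratch = free_cpu.to_vec();
    let mut placement = Vec::with_capacity(gang.len());
    for capsule in gang {
        // `?` aborts the whole gang as soon as one Capsule cannot be placed.
        let i = scratch.iter().position(|&free| free >= capsule.cpu_millis)?;
        scratch[i] -= capsule.cpu_millis;
        placement.push(i);
    }
    Some(placement)
}
```

Three 600-millicore Capsules fit on three machines with 1000 free each, one per machine. On two such machines the function returns `None` rather than placing two of the three.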
The honest part
None of this is free, and a few of the open questions keep me up. Better to name them now than get caught later pretending they were solved.
Keeping the ledger small is the entire plan for making it scale, but some Constellations will see north of 10,000 task arrivals a minute, and the commit path can't quietly become the bottleneck I just spent a section mocking etcd for.
Exact global batch assignment across 50,000 nodes is unproven without serious work on incremental re-solving. I have a shape for it, not a victory.
GPU-memory checkpointing costs more than checkpointing plain CPU state, and if it runs too slow then preemptible training is a marketing line rather than a feature.
And a typed config language is only as good as the message it prints when you get it wrong. Botch that and I've shipped a second thing for people to hate.
Next
Next time we make the first cut: separating cluster state into two planes by what each part actually needs from consistency, and why running the firehose through consensus is the one inheritance Orbit refuses. That's where this stops reading like a manifesto and starts behaving like a system.
That's the why. Thanks for reading. See you in the ledger.