Two Planes, Not One Store

Last post I said the most important decision in Orbit is splitting cluster state into two planes, and then I made you wait for it. Here it is. If you only read one post in this series, read this one, because almost everything else is downstream of this single cut.

The cut is this: some cluster state has to be agreed and durable, and the rest of it is a firehose that nobody needs to be exactly right about. Kubernetes stores both in the same place. Orbit refuses to.

What etcd is actually doing

A Kubernetes control plane keeps its state in etcd, which is a Raft-replicated key-value store. Raft is the right tool for the things that genuinely need consensus. The trouble is what else ends up in there.

Every node sends a heartbeat. Every pod reports status as it starts, gets ready, goes unhealthy, restarts. Controllers write events. The kubelet updates node conditions. All of it lands in etcd, and all of it goes through the same Raft log as the handful of facts that actually need to be linearizable. On top of that, controllers and kubelets set up watches, so every write fans out to everyone who cares, and a lot of them care.

Now picture the cluster getting bigger. The volume of must-be-agreed state grows roughly with the number of distinct things you're running. The volume of churn grows with that times how often each thing changes its mind, which at scale is constantly. The second number dwarfs the first, and they share a Raft group. You don't run out of CPU on the API server. You run out of write throughput on a consensus protocol, most of which is being spent agreeing, durably and in total order, on a CPU-usage number that was stale before it committed.
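To put rough numbers on that imbalance, here is a back-of-envelope sketch. The function names are made up for illustration, and the inputs you feed it are assumptions rather than measurements from any real cluster.

```rust
/// Steady-state churn writes per second if every node reports once per interval.
/// Assumes a non-zero interval and perfectly even reporting.
fn churn_writes_per_sec(nodes: u64, report_interval_secs: u64) -> u64 {
    nodes / report_interval_secs
}

/// Fraction of all log writes that are churn rather than must-be-agreed state.
fn churn_share(churn_per_sec: u64, commitments_per_sec: u64) -> f64 {
    churn_per_sec as f64 / (churn_per_sec + commitments_per_sec) as f64
}
```

With fifty thousand nodes reporting every two seconds, `churn_writes_per_sec(50_000, 2)` is 25,000 writes a second. If the must-be-agreed state changed a hypothetical 250 times a second, `churn_share(25_000, 250)` comes out above 0.99: nearly the whole consensus budget would go to numbers nobody needs agreement on.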

That's the wound. It's not that any one piece is wrong. It's that two kinds of data with completely different needs were put in one box because one box is simpler to explain.

Sort the state by what it needs

So before writing a line of storage code, I sorted Orbit's state into two piles.

The first pile is small. It's the set of things that, if two parts of the system disagree about them, you get a real bug: double-booked machines, two workloads holding the same identity, quota handed out twice. These are promises. A placement is a promise that a capsule lives on a node. A quota grant is a promise that a principal may consume some resources. A lease, an identity binding. There aren't many of these per workload, and they change rarely once made. They need consensus and durability, and they need to be small enough that one Raft group can serve an entire Constellation without breaking a sweat.

The second pile is enormous and disposable. How much CPU a node is using right now. Which capsules are healthy. Memory pressure. GPU temperature. This changes every few seconds on every machine, and being a second or two stale costs you nothing, because by the time you've acted on it the number has moved anyway. It needs to be fast and cheap and sheddable. It does not need consensus, and putting it behind consensus is how you set the whole thing on fire.

The first pile is the **Ephemeris**. The second is the **Telemetry plane**. They are different code, different storage, different guarantees, and they never share a backend.

Two traits that admit what they are

The cleanest way to see the split is to put the two interfaces next to each other. Here's the ledger:

```rust
#[async_trait]
pub trait Ephemeris: Send + Sync {
    /// Append a commitment if the caller's precondition still holds.
    async fn propose(
        &self,
        c: Commitment,
        precondition: Precondition,
    ) -> Result<LogIndex, LedgerError>;

    /// Linearizable read of the active committed state.
    async fn snapshot(&self) -> Result<LedgerSnapshot, LedgerError>;

    /// Tail future commitment-log entries.
    fn watch(&self) -> CommitmentStream;
}
```

And here's the firehose:

```rust
#[async_trait]
pub trait TelemetryPlane: Send + Sync {
    /// Satellites push deltas; the shard layer dedups and aggregates.
    async fn ingest(&self, node: NodeId, delta: NodeStateDelta) -> Result<(), TelemetryError>;

    /// Observe a (cheap, possibly-stale) view for a specific node.
    async fn get_node(&self, node: &NodeId) -> Option<NodeView>;

    /// Schedulers and dashboards observe via queries.
    fn observe(&self, q: ObservationQuery) -> ObservationStream;
}
```

Look at `propose`. It doesn't just take a thing to write, it takes a `Precondition`, and it can come back with `LedgerError::Conflict`. That's the ledger saying "the world changed under you, try again." Every write to the consistent store is a careful, checked, ordered append. We'll spend the whole next-but-one post on that.
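As a rough picture of what "checked append" means, here is a single-process toy, not Orbit's ledger. The precondition shape (`LogLenIs`), the fields on `Conflict`, and the `ToyLedger` type are all assumptions made for the sketch; a real implementation would replicate the log through Raft rather than guard a `Vec` with a mutex.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the real commitment type.
#[derive(Debug, Clone, PartialEq)]
pub struct Commitment(pub String);

pub type LogIndex = u64;

/// Assumed precondition shape: "the log is still exactly this long".
#[derive(Debug, Clone, Copy)]
pub enum Precondition {
    Always,
    LogLenIs(LogIndex),
}

#[derive(Debug, PartialEq)]
pub enum LedgerError {
    /// The fields are an assumption; the post only names the variant.
    Conflict { expected: LogIndex, actual: LogIndex },
}

/// In-memory toy ledger guarded by one lock.
#[derive(Default)]
pub struct ToyLedger {
    log: Mutex<Vec<Commitment>>,
}

impl ToyLedger {
    /// Check and append under the same lock, so nothing can slip in between.
    pub fn propose(&self, c: Commitment, pre: Precondition) -> Result<LogIndex, LedgerError> {
        let mut log = self.log.lock().unwrap();
        let actual = log.len() as LogIndex;
        if let Precondition::LogLenIs(expected) = pre {
            if expected != actual {
                return Err(LedgerError::Conflict { expected, actual });
            }
        }
        log.push(c);
        Ok(actual) // index of the entry just appended
    }
}
```

Two callers that both read an empty log and both propose with `LogLenIs(0)` cannot both win: the second gets `Conflict` and has to look again before retrying.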

Now look at `ingest`. It takes a node and a blob of current state and returns a `Result` whose error case includes this:

```rust
pub enum TelemetryError {
    // ...
    Overloaded { in_flight: usize, limit: usize },
    // ...
}
```

The telemetry plane is allowed to say "I'm too busy, drop it on the floor." There is a wrapper, `BackpressuredTelemetryPlane`, that does exactly that: it holds a fixed number of permits, and when they're gone, `ingest` returns `Overloaded` instead of queuing. A dropped telemetry sample is a non-event, because another one is along in two seconds. Try writing that line for the Ephemeris and you'll feel your stomach drop, which is the correct reaction, and the reason the two live in different files.
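The permit idea fits in a few lines. What follows is a synchronous sketch of the shedding logic only, not the real wrapper: `PermitGate` and `Permit` are hypothetical names, and the `TelemetryError` here carries just the one variant shown above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Debug, PartialEq)]
pub enum TelemetryError {
    Overloaded { in_flight: usize, limit: usize },
}

/// Fixed-permit gate: admit at most `limit` concurrent ingests, shed the rest.
pub struct PermitGate {
    in_flight: AtomicUsize,
    limit: usize,
}

/// Holding a `Permit` means one ingest is in progress; dropping it frees the slot.
pub struct Permit<'a> {
    gate: &'a PermitGate,
}

impl PermitGate {
    pub fn new(limit: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), limit }
    }

    /// Never blocks and never queues: a permit is free now, or the sample is shed.
    pub fn try_acquire(&self) -> Result<Permit<'_>, TelemetryError> {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.limit {
                return Err(TelemetryError::Overloaded { in_flight: current, limit: self.limit });
            }
            // Claim a slot only if nobody changed the count since we read it.
            match self.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Ok(Permit { gate: self }),
                Err(observed) => current = observed,
            }
        }
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        self.gate.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

An `ingest` built on this would call `try_acquire` first and return the error straight to the caller. The important property is the absence of a queue: overload turns into dropped samples, not growing latency.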

The commitment log is deliberately boring

The thing that makes the Ephemeris cheap is that there's so little in it. The entire vocabulary of the consistent store is one enum:

```rust
pub enum Commitment {
    QuotaGrant { principal: Principal, resources: ResourceVec, priority: Priority, until: DateTime<Utc> },
    Placement { capsule: CapsuleId, node: NodeId, envelope: ResourceEnvelope, epoch: Epoch },
    GangPlacement { gang: GangId, members: Vec<(CapsuleId, NodeId)>, epoch: Epoch },
    Lease { holder: TransponderId, node: NodeId, ttl: Duration },
    IdentityBind { capsule: CapsuleId, id: TransponderId },
    ModuleAdjustment { mission: MissionId, module: ModuleName, adjustment: ModuleAdjustment, epoch: Epoch },
    Revocation { what: CommitmentRef },
}
```

That's it. Seven variants, every one a promise. There is no pod status in there. No heartbeat. No CPU number that decays the moment it lands. When a node tells the cluster it's using 3.2 cores, that goes to telemetry and dies there. When a scheduler decides a capsule belongs on a node, that's a Placement, and it goes through Raft, because if we lose it or disagree about it, two workloads end up fighting over the same memory.

Telemetry's payload, by contrast, is fat and fully disposable:

```rust
pub struct NodeStateDelta {
    pub heartbeat: DateTime<Utc>,
    pub usage: ResourceVec,
    pub capsules: Vec<(CapsuleId, CapsuleHealth)>,
    pub pressure: Pressure,
    pub gpu_health: Vec<GpuHealthSample>,
}
```

A node ships one of these on every poll. There can be fifty thousand nodes. None of it is durable past a short window, and that's the point.
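One way to picture "not durable past a short window" is a latest-wins map that forgets. This is a sketch under assumptions, not the telemetry plane's actual storage: `RecentViews` is a hypothetical type, a node is a bare `u32`, the payload is shrunk to one number (cores in use), and the window length is whatever the caller picks.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Latest-wins, in-memory view of node telemetry. Nothing here is persisted.
pub struct RecentViews {
    window: Duration,
    latest: HashMap<u32, (Instant, f64)>, // node id -> (received at, cores in use)
}

impl RecentViews {
    pub fn new(window: Duration) -> Self {
        Self { window, latest: HashMap::new() }
    }

    /// Overwrite whatever we knew about the node; no history is kept.
    pub fn ingest(&mut self, node: u32, cores_used: f64, now: Instant) {
        self.latest.insert(node, (now, cores_used));
    }

    /// Return the sample only if it is still inside the window.
    pub fn get_node(&self, node: u32, now: Instant) -> Option<f64> {
        let (at, cores) = *self.latest.get(&node)?;
        if now.saturating_duration_since(at) <= self.window {
            Some(cores)
        } else {
            None
        }
    }

    /// Drop every sample older than the window; returns how many were evicted.
    pub fn evict_stale(&mut self, now: Instant) -> usize {
        let before = self.latest.len();
        let window = self.window;
        self.latest
            .retain(|_, entry| now.saturating_duration_since(entry.0) <= window);
        before - self.latest.len()
    }
}
```

Losing this map on a crash costs nothing but a few seconds of blindness, since the next round of polls fills it back in. That is the property that lets it live in memory with no log behind it.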

How does scheduling work if half the data is stale?

Here's the obvious objection. A scheduler needs both kinds of state to do its job: it needs the committed placements (consistent) and the live usage (telemetry, stale) to decide where the next capsule goes. If the usage data is a second or two old, won't it make bad decisions?

Sometimes, yes. And that's fine, because of how the commit works.

A scheduler builds a **scheduling view** by joining the two planes: authoritative commitments from the Ephemeris, best-effort observations from telemetry. It's allowed to be stale. It reasons over that stale view, picks a node, and proposes a `Placement` with a precondition that says, roughly, "only if this node still has the room I think it has." If the view was current, the commit lands. If another scheduler grabbed that capacity in the meantime, the precondition fails, the commit bounces, and the scheduler tries again with fresher eyes.

So staleness in the telemetry plane never produces a wrong commitment. The worst it produces is a retry. That's the deal that lets the firehose be sloppy: the consistent store is standing behind it catching anything that slips. The full mechanism is the subject of the scheduling post, but the shape of it is already visible here, in the fact that `propose` takes a precondition and `ingest` doesn't take one at all.
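The stale-view-then-checked-commit loop can be sketched in miniature. Everything below is a simplification made up for illustration: resources are whole cores, the scheduling view is just a map from node to believed-free cores, the precondition is baked into a hypothetical `propose_placement`, and "fresher eyes" means re-reading the toy ledger rather than re-joining both planes.

```rust
use std::collections::HashMap;

type NodeId = u32;

#[derive(Debug, PartialEq)]
pub enum LedgerError {
    Conflict,
}

/// Toy ledger: each node's fixed capacity plus the cores already promised on it.
pub struct ToyLedger {
    capacity: HashMap<NodeId, u32>,
    committed: HashMap<NodeId, u32>,
}

impl ToyLedger {
    pub fn new(capacity: HashMap<NodeId, u32>) -> Self {
        Self { capacity, committed: HashMap::new() }
    }

    /// Authoritative free cores on a node, computed from commitments only.
    pub fn free(&self, node: NodeId) -> u32 {
        let cap = self.capacity.get(&node).copied().unwrap_or(0);
        cap.saturating_sub(self.committed.get(&node).copied().unwrap_or(0))
    }

    /// Commit a placement only if the node still has `cores` free right now.
    pub fn propose_placement(&mut self, node: NodeId, cores: u32) -> Result<(), LedgerError> {
        if self.free(node) < cores {
            return Err(LedgerError::Conflict);
        }
        *self.committed.entry(node).or_insert(0) += cores;
        Ok(())
    }
}

/// Pick the node the (possibly stale) view thinks is emptiest; lowest id wins ties.
fn pick(view: &HashMap<NodeId, u32>, cores: u32) -> Option<NodeId> {
    view.iter()
        .filter(|(_, free)| **free >= cores)
        .max_by_key(|(node, free)| (**free, std::cmp::Reverse(**node)))
        .map(|(node, _)| *node)
}

/// Schedule against a stale view; on conflict, refresh the view and retry.
/// Returns the chosen node and the number of attempts it took.
pub fn schedule(
    ledger: &mut ToyLedger,
    mut view: HashMap<NodeId, u32>,
    cores: u32,
    max_attempts: u32,
) -> Option<(NodeId, u32)> {
    for attempt in 1..=max_attempts {
        let node = pick(&view, cores)?;
        match ledger.propose_placement(node, cores) {
            Ok(()) => return Some((node, attempt)),
            Err(LedgerError::Conflict) => {
                // The view was wrong about this node; re-read every node we know of.
                let nodes: Vec<NodeId> = view.keys().copied().collect();
                for n in nodes {
                    view.insert(n, ledger.free(n));
                }
            }
        }
    }
    None
}
```

If another scheduler has already filled node 1 while the view still shows it empty, the first attempt bounces with `Conflict`, the view is refreshed, and the second attempt lands on node 2. The stale view cost one retry and never a double-booked node.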

What this bought, and what it costs

Splitting the planes is what lets one Raft group serve tens of thousands of machines, because that Raft group is no longer carrying the firehose. It's the difference between the consistent store being a tiny ledger of promises and being a general-purpose database that the whole cluster hammers. Borg figured this out the hard way and aggregated telemetry through a tree of shards rather than a central store; Twine went further and gave each subsystem its own storage. Orbit bakes the separation into the type system, so you cannot accidentally write a heartbeat into the ledger. There's no method for it.

The cost is honesty about a risk I flagged last time. Keeping the Ephemeris tiny is the entire plan for making it scale, and a busy Constellation can still throw a lot of legitimate commitments at it: a burst of placements, a wave of quota grants. The commit path is careful and ordered by design, and careful ordered things have a throughput ceiling. I'll dig into how the ledger holds up under that (`propose`, preconditions, and whether a promises-only log keeps pace with a busy minute), but not yet, because first we need the nouns those commitments are made of. So the next post builds the object model: Mission, Module, Capsule, and a state machine where a forgotten transition is a compile error instead of a 2am page.

That's the cut. Everything after this is just being disciplined about which side of it each piece of state lives on.
