The Commitment Ledger

The Ephemeris is the part of Orbit I was most nervous about, because it's the part that has to be correct. Telemetry can drop a sample and shrug. The ledger cannot drop a placement, can't double-book a node, and can't disagree with itself across replicas. It's also the part I claimed would stay small enough to scale. This post is how `propose` actually works, and why a promises-only log can survive a busy Constellation.

The whole interface is three methods:

#[async_trait]
pub trait Ephemeris: Send + Sync {
    async fn propose(&self, c: Commitment, precondition: Precondition) -> Result<LogIndex, LedgerError>;
    async fn snapshot(&self) -> Result<LedgerSnapshot, LedgerError>;
    fn watch(&self) -> CommitmentStream;
}

propose appends a promise if a condition still holds. snapshot reads the current set of active promises, linearizably. watch tails the log. That's the entire surface a scheduler or an autoscaler ever touches.

Propose is a compare-and-swap on the world

The interesting argument is the second one. A Precondition is the caller's bet about the state of the world, and the ledger only appends the commitment if the bet still holds:

pub enum Precondition {
    None,
    NodeFreeOf { node: NodeId, at_least: ResourceVec },
    GangUnplaced { gang: GangId },
    ModuleAdjustmentVersion { mission: MissionId, module: ModuleName, epoch: Epoch },
}

This is the mechanism the whole optimistic scheduling design rests on, so it's worth being precise about what it means. A scheduler builds a view of the cluster, possibly stale, decides to put a capsule on node n, and proposes a Placement guarded by NodeFreeOf { node: n, at_least: <what the capsule needs> }. The precondition says: only commit this if node n hasn't already had that capacity claimed since I looked.

If the scheduler's view was current, the bet holds and the placement lands. If some other scheduler grabbed that capacity first, the bet fails:

Precondition::NodeFreeOf { node, at_least } => {
    if self.active_commitments().iter().any(|c| active_on_node(c, node, at_least)) {
        return Err(LedgerError::Conflict {
            reason: format!("node {node} no longer has requested resources free: {at_least:?}"),
        });
    }
    Ok(())
}

A Conflict isn't a crash and isn't data loss. It's the ledger telling the caller "you were looking at an old photo." The caller rebuilds its view and tries again. Two schedulers racing for the same machine is therefore a correct, expected event: one wins, the other gets a Conflict and retries. No locks, no leader election among schedulers, no coordination beyond the ledger itself.
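
To make the race concrete, here is a minimal sketch of the propose-then-retry loop. It is a toy under stated assumptions, not Orbit's code: `ToyLedger`, `free_nodes`, and `place_with_retry` are hypothetical names, nodes are either claimed or unclaimed with no resource vector, and everything runs in a single thread so the "race" is just two callers sharing one stale view.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins; the real NodeId and LogIndex types are not shown in the post.
type NodeId = u32;
type LogIndex = u64;

#[derive(Debug, PartialEq)]
enum LedgerError {
    Conflict { reason: String },
}

/// Toy ledger: a node is either unclaimed or claimed, nothing finer.
struct ToyLedger {
    nodes: BTreeSet<NodeId>,
    claimed: BTreeSet<NodeId>,
    next_index: LogIndex,
}

impl ToyLedger {
    fn new(nodes: &[NodeId]) -> Self {
        Self {
            nodes: nodes.iter().copied().collect(),
            claimed: BTreeSet::new(),
            next_index: 0,
        }
    }

    /// What a scheduler reads to build its (soon to be stale) view.
    fn free_nodes(&self) -> Vec<NodeId> {
        self.nodes.difference(&self.claimed).copied().collect()
    }

    /// Append a placement only if `node` is still unclaimed.
    fn propose(&mut self, node: NodeId) -> Result<LogIndex, LedgerError> {
        if !self.nodes.contains(&node) || self.claimed.contains(&node) {
            return Err(LedgerError::Conflict {
                reason: format!("node {node} is no longer free"),
            });
        }
        self.claimed.insert(node);
        let index = self.next_index;
        self.next_index += 1;
        Ok(index)
    }
}

/// Propose against a possibly stale view; on Conflict, rebuild the view and retry.
/// Returns the node, the log index, and how many attempts it took.
fn place_with_retry(
    ledger: &mut ToyLedger,
    mut view: Vec<NodeId>,
    max_attempts: usize,
) -> Option<(NodeId, LogIndex, usize)> {
    for attempt in 1..=max_attempts {
        let node = *view.first()?; // nothing free in our view: give up
        match ledger.propose(node) {
            Ok(index) => return Some((node, index, attempt)),
            // Our photo was old: take a fresh one and try again.
            Err(LedgerError::Conflict { .. }) => view = ledger.free_nodes(),
        }
    }
    None
}
```

If two callers both start from the view `[1, 2]`, the first lands on node 1 in one attempt and the second gets a Conflict, refreshes, and lands on node 2 in two. Neither ever held a lock.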

The conflict check is per-dimension, not all-or-nothing. Two capsules can land on the same node in the same instant as long as they don't contend for the same resource. If one wants CPU and the other wants a GPU, both commits succeed:

fn vectors_overlap(a: &ResourceVec, b: &ResourceVec) -> bool {
    (a.cpu_millis > 0 && b.cpu_millis > 0)
        || (a.memory_bytes > 0 && b.memory_bytes > 0)
        || (a.disk_bytes > 0 && b.disk_bytes > 0)
        || (a.disk_bw_bps > 0 && b.disk_bw_bps > 0)
        || (!a.accelerators.is_empty() && !b.accelerators.is_empty())
}

That falls straight out of the resource model from the object-model post: because the dimensions are independent, contention is per-dimension too.
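
A quick worked example of that check. The `ResourceVec` below is a guess at the shape implied by the field names above (the accelerator list as a `Vec<String>` in particular is an assumption); the function body is the one from the post.

```rust
// Assumed shape of ResourceVec, inferred from the fields the check reads.
#[derive(Debug, Default, Clone)]
struct ResourceVec {
    cpu_millis: u64,
    memory_bytes: u64,
    disk_bytes: u64,
    disk_bw_bps: u64,
    accelerators: Vec<String>, // assumption: accelerators identified by name
}

/// True if both vectors ask for a nonzero amount of any one dimension.
fn vectors_overlap(a: &ResourceVec, b: &ResourceVec) -> bool {
    (a.cpu_millis > 0 && b.cpu_millis > 0)
        || (a.memory_bytes > 0 && b.memory_bytes > 0)
        || (a.disk_bytes > 0 && b.disk_bytes > 0)
        || (a.disk_bw_bps > 0 && b.disk_bw_bps > 0)
        || (!a.accelerators.is_empty() && !b.accelerators.is_empty())
}

/// A request that only touches CPU.
fn cpu_only(millis: u64) -> ResourceVec {
    ResourceVec { cpu_millis: millis, ..Default::default() }
}

/// A request that only touches one accelerator.
fn gpu_only(name: &str) -> ResourceVec {
    ResourceVec { accelerators: vec![name.to_string()], ..Default::default() }
}
```

With these helpers, `vectors_overlap(&cpu_only(500), &gpu_only("gpu0"))` is false, so both placements can commit. Note that as written the check looks only at whether a dimension is requested, not at how much: `cpu_only(100)` and `cpu_only(200)` overlap.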

The ledger refuses to corrupt itself

Preconditions are the caller's optimistic guard. They're advisory in the sense that the caller chose them. But there's a second layer of checks the ledger runs on every commitment no matter what precondition the caller supplied, and those aren't optional. They're the invariants the ledger will not violate even if a buggy scheduler asks it to:

fn check_commitment(&self, commitment: &Commitment) -> Result<(), LedgerError> {
    match commitment {
        Commitment::Placement { capsule, envelope, .. } => {
            if !envelope.is_valid() {
                return Err(LedgerError::Conflict { reason: "placement reservation exceeds limit".into() });
            }
            self.ensure_capsule_unplaced(capsule)
        }
        Commitment::GangPlacement { gang, members, .. } => {
            if self.active_gang(gang) { /* reject: already placed */ }
            for (capsule, _) in members { self.ensure_capsule_unplaced(capsule)?; }
            Ok(())
        }
        Commitment::IdentityBind { capsule, .. } => {
            // reject if this capsule already has an identity binding
        }
        Commitment::ModuleAdjustment { adjustment, .. } => {
            // reject a scale-to-zero-replicas adjustment
        }
        // QuotaGrant, Lease: always structurally fine
        Commitment::Revocation { what } => {
            // reject revoking something that isn't active
        }
    }
}

A capsule can't be placed twice. A gang can't be half-committed and then committed again. A capsule can't hold two identities. A reservation can't exceed its limit. A revocation can't point at a commitment that isn't there. These hold regardless of who's proposing, which means a scheduler bug can produce a rejected proposal but never a corrupt ledger. The consistent store defends its own invariants rather than trusting its callers to, and I'd rather pay for those checks on every write than debug a double-placed capsule in production.
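
The two layers are easier to see side by side in a stripped-down form. This is a sketch with hypothetical names (`ToyLedger`, `NodeUnclaimed`, integer ids), covering only the "a capsule can't be placed twice" invariant, but it shows the property that matters: the invariant check runs even when the caller passes no precondition at all.

```rust
use std::collections::HashMap;

// Hypothetical simplified ids; the real types are not shown in the post.
type CapsuleId = u64;
type NodeId = u32;

#[derive(Debug, PartialEq)]
enum LedgerError {
    Conflict { reason: String },
}

/// The caller's optional bet. `NodeUnclaimed` is a simplified stand-in for NodeFreeOf.
enum Precondition {
    None,
    NodeUnclaimed(NodeId),
}

#[derive(Default)]
struct ToyLedger {
    placements: HashMap<CapsuleId, NodeId>,
}

impl ToyLedger {
    /// Layer 1: the guard the caller chose.
    fn check_precondition(&self, p: &Precondition) -> Result<(), LedgerError> {
        match p {
            Precondition::None => Ok(()),
            Precondition::NodeUnclaimed(node) => {
                if self.placements.values().any(|n| n == node) {
                    return Err(LedgerError::Conflict {
                        reason: format!("node {node} already claimed"),
                    });
                }
                Ok(())
            }
        }
    }

    /// Layer 2: the invariant the ledger enforces no matter what the caller asked for.
    fn ensure_capsule_unplaced(&self, capsule: CapsuleId) -> Result<(), LedgerError> {
        if self.placements.contains_key(&capsule) {
            return Err(LedgerError::Conflict {
                reason: format!("capsule {capsule} is already placed"),
            });
        }
        Ok(())
    }

    fn propose(
        &mut self,
        capsule: CapsuleId,
        node: NodeId,
        precondition: Precondition,
    ) -> Result<(), LedgerError> {
        self.check_precondition(&precondition)?;
        self.ensure_capsule_unplaced(capsule)?; // never skipped
        self.placements.insert(capsule, node);
        Ok(())
    }
}
```

A buggy scheduler that proposes capsule 7 twice with `Precondition::None` gets a Conflict on the second call, and the first placement is left untouched.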

Snapshot reads active state; the log keeps history

The ledger is append-only, but most callers don't want the full history; they want "what's true right now." So a Revocation doesn't delete anything. It marks an index revoked, and snapshot filters:

fn active_commitments(&self) -> Vec<Commitment> {
    self.commitments
        .iter()
        .filter(|(idx, c)| !self.revoked.contains(idx) && !matches!(c, Commitment::Revocation { .. }))
        .map(|(_, c)| c.clone())
        .collect()
}

You get the set of live promises, with the revoked ones and the revocation records themselves filtered out. The append-only history is still there for anyone who needs to replay it. The simulator very much does, but that's a later post. The common read is just the active set.
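
Here is that filter running against a tiny log. The storage layout is an assumption made for the sketch (a `Vec` of index and commitment pairs plus a `HashSet` of revoked indices), as are the `ToyLog` name, the `append` helper, and the two-variant `Commitment`; only the body of `active_commitments` is taken from the post.

```rust
use std::collections::HashSet;

type LogIndex = u64;

// Two variants are enough to show the filter; the real enum has more.
#[derive(Debug, Clone, PartialEq)]
enum Commitment {
    Placement { capsule: u32 },
    Revocation { what: LogIndex },
}

#[derive(Default)]
struct ToyLog {
    commitments: Vec<(LogIndex, Commitment)>, // assumed layout: append-only, never edited
    revoked: HashSet<LogIndex>,
}

impl ToyLog {
    /// Append a record. A Revocation marks its target; it deletes nothing.
    fn append(&mut self, c: Commitment) -> LogIndex {
        let index = self.commitments.len() as LogIndex;
        if let Commitment::Revocation { what } = &c {
            self.revoked.insert(*what);
        }
        self.commitments.push((index, c));
        index
    }

    /// Live promises only: drop revoked entries and the revocation records themselves.
    fn active_commitments(&self) -> Vec<Commitment> {
        self.commitments
            .iter()
            .filter(|(idx, c)| {
                !self.revoked.contains(idx) && !matches!(c, Commitment::Revocation { .. })
            })
            .map(|(_, c)| c.clone())
            .collect()
    }
}
```

Appending two placements and then revoking the first leaves one active commitment and three log entries: the read shrinks, the history does not.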

watch is a thin broadcast of commitments as they land, and it tolerates a slow consumer by skipping ahead rather than blocking the writer:

match rx.recv().await {
    Ok(c) => return Some((c, rx)),
    Err(broadcast::error::RecvError::Lagged(_)) => continue, // fell behind; skip what was dropped and keep reading
    Err(broadcast::error::RecvError::Closed) => return None,
}

A consumer that can't keep up hits the Lagged case and skips ahead to the oldest commitment still buffered. If it needs the ones it missed, it resyncs from a snapshot. Either way it never applies backpressure to the thing committing promises. The ledger's job is to commit, not to wait for its slowest reader.
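
The skip-ahead behaviour can be modelled without an async runtime. The sketch below is a hand-rolled bounded buffer, not tokio's broadcast channel: `Ring`, `Recv`, and `next_commitment` are hypothetical names, and `Empty` stands in for the point where a real receiver would await. It exists only to show why the writer never waits on a slow reader.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum Recv<T> {
    Item(T),
    Lagged(u64), // how many entries the reader missed
    Empty,       // nothing new yet (a real receiver would await here)
}

/// Bounded buffer that overwrites its oldest entry instead of blocking the writer.
struct Ring<T> {
    buf: VecDeque<T>,
    capacity: usize,
    head_seq: u64, // sequence number of buf[0]
}

impl<T: Clone> Ring<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { buf: VecDeque::new(), capacity, head_seq: 0 }
    }

    /// Always succeeds; a full buffer drops its oldest entry.
    fn send(&mut self, item: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
            self.head_seq += 1;
        }
        self.buf.push_back(item);
    }

    /// Read the entry at `cursor`, or report a lag and jump the cursor forward.
    fn recv(&self, cursor: &mut u64) -> Recv<T> {
        if *cursor < self.head_seq {
            let missed = self.head_seq - *cursor;
            *cursor = self.head_seq;
            return Recv::Lagged(missed);
        }
        let offset = (*cursor - self.head_seq) as usize;
        match self.buf.get(offset) {
            Some(item) => {
                *cursor += 1;
                Recv::Item(item.clone())
            }
            None => Recv::Empty,
        }
    }
}

/// Mirrors the match in the post: swallow the lag and keep reading.
fn next_commitment<T: Clone>(ring: &Ring<T>, cursor: &mut u64) -> Option<T> {
    loop {
        match ring.recv(cursor) {
            Recv::Item(item) => return Some(item),
            Recv::Lagged(_) => continue, // cursor already moved to the oldest retained entry
            Recv::Empty => return None,
        }
    }
}
```

With a capacity of two and four sends, a reader that never read anything sees `Lagged(2)` and then picks up at the third entry. All four sends completed without consulting the reader.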

Where Raft comes in

Everything above is from InMemEphemeris, the single-process implementation. It's the reference semantics and it's what the dev setup and the simulator run on. The production ledger wraps the same logic in Raft using `openraft`, and the way the two fit together is the part I'm happiest with.

A write becomes a Raft entry carrying both the commitment and its precondition:

pub struct EphemerisWrite {
    pub commitment: Commitment,
    pub precondition: Precondition,
}

propose submits that through openraft's linearizable client_write path. The precondition is *not* evaluated when the proposal is made; it's evaluated when the committed entry is applied to the state machine, in log order, on every replica. That ordering is what makes the optimism safe under replication. Two proposals for the same node both enter the log. When they apply, the first sees a free node and is Accepted; the second sees the capacity now consumed and is Rejected with a Conflict. Every replica applies the same entries in the same order and reaches the same verdict, so the replicas never disagree about who won the race.

match response.data {
    EphemerisWriteResult::Accepted { index } => Ok(index),
    EphemerisWriteResult::Rejected { error } => Err(error.into()),
}
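
The argument that replicas can't disagree comes down to the apply step being a deterministic function of the log. Here is a sketch of that property with no Raft in it at all: `Replica`, `Write`, `replay`, and `NodeUnclaimed` are hypothetical names, and the choice that a rejected entry still consumes a log index is an assumption of the toy.

```rust
use std::collections::BTreeSet;

type NodeId = u32;

/// Simplified stand-in for the real precondition enum.
enum Precondition {
    None,
    NodeUnclaimed(NodeId),
}

/// What travels through the log: the request plus the bet, unevaluated.
struct Write {
    node: NodeId,
    precondition: Precondition,
}

#[derive(Debug, Clone, PartialEq)]
enum WriteResult {
    Accepted { index: u64 },
    Rejected { reason: String },
}

#[derive(Default)]
struct Replica {
    claimed: BTreeSet<NodeId>,
    applied: u64,
}

impl Replica {
    /// Evaluate the precondition at apply time, against this replica's own state.
    fn apply(&mut self, w: &Write) -> WriteResult {
        let index = self.applied;
        self.applied += 1; // assumption: rejected entries still occupy a log slot
        if let Precondition::NodeUnclaimed(n) = &w.precondition {
            if self.claimed.contains(n) {
                return WriteResult::Rejected {
                    reason: format!("node {n} already claimed"),
                };
            }
        }
        self.claimed.insert(w.node);
        WriteResult::Accepted { index }
    }
}

/// Apply the whole log to a fresh replica and collect the verdicts.
fn replay(log: &[Write]) -> Vec<WriteResult> {
    let mut replica = Replica::default();
    log.iter().map(|w| replica.apply(w)).collect()
}
```

Replaying the same log on two fresh replicas yields identical verdict lists. When two entries race for node 9, the first is Accepted and the second Rejected on both, because neither replica consults anything but the log order and its own state.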

Reads go through ensure_linearizable before touching state, so a snapshot never serves stale committed data even if you happen to ask a node that just lost leadership. And the storage and network are injected rather than baked in, so the same Raft node runs on in-memory stores in a test and on file-backed, crash-recoverable stores in production. The state machine itself is the in-memory ledger, reused verbatim. The replication layer adds durability and agreement; it does not get a second, subtly different copy of the commitment rules to drift out of sync with the first.

The honest part

The bet of this whole design is that the ledger stays small. A promises-only log grows with the number of live placements, quota grants, and identity bindings, not with how often anything changes its mind, which is what kept it off the etcd cliff. But a Constellation can still have a busy minute: a big mission rolling out, or a wave of preemptions rescheduling at once. Every one of those is a real commit through Raft with a precondition check.

Two things keep that honest, and both are real work rather than wishful thinking. Commits batch: openraft pipelines entries, so a burst amortizes the consensus cost rather than paying it per-promise. And conflicts are cheap to retry, because a Conflict is a fast rejection rather than a held lock, so a thundering herd of schedulers racing for the same capacity resolves in a few rounds instead of deadlocking. Whether that holds at fifty thousand nodes with ten thousand arrivals a minute is exactly the kind of claim I don't get to make from a design doc. It's the load test I most want to run, and when I've run it I'll write it up, wins and losses both.

Next: the other plane. Telemetry, where being wrong is allowed, and the trick is making fifty thousand machines' worth of churn cost almost nothing.