Many Schedulers, One Commit Loop
A scheduler is where an orchestrator earns its keep, and it's also where Borg admitted it had painted itself into a corner. The scheduling policy lived in one place, it grew over years to serve every workload Google had, and the paper is candid that the single scheduler became unwieldy. Omega proposed the way out, Borg adopted part of it, and Orbit takes it the whole way: scheduling isn't a component, it's a population.
Why one scheduler is the wrong shape
A latency-sensitive web service and a 5,000-task batch sweep want opposite things from a scheduler. The service wants its replicas spread across failure domains, placed fast, and never starved. The batch sweep wants global throughput: pack as much work as possible, and don't agonize over any single task. An ML training gang wants something else entirely: all of its members placed together on the right network fabric, or none of them placed at all.
You can serve all three from one scheduler, and that's exactly what produces the unwieldy thing. Every new workload class bolts another mode onto the same code path, and the modes interact, and eventually nobody can reason about the whole. Kubernetes lived this too: one default scheduler, then Volcano and YuniKorn and Kueue bolted alongside it for the workloads it served badly.
The alternative is to let several schedulers run at once, each specialized, sharing one view of cluster state and resolving collisions when they commit. Omega called it shared-state optimistic concurrency. The ledger post already built the machinery for it without naming it: that's what the precondition on propose is for.
Navigator claims work and proposes placements
In Orbit a scheduler is a Navigator, and the trait is deliberately small:
#[async_trait::async_trait]
pub trait Navigator: Send + Sync {
    /// Which pending capsules this Navigator owns.
    fn claims(&self, module_name: &ModuleName, appclass: AppClass) -> bool;
    /// Produce placement proposals from the scheduling view.
    async fn plan(&self, view: &SchedView, pending: &[PendingCapsule]) -> Vec<Proposal>;
}
claims is how the population divides the work without a coordinator. Each Navigator looks at a piece of pending work and answers "mine?" The Services Navigator claims latency-sensitive and best-effort capsules; the Batch Navigator claims batch; the Gang Navigator claims gangs. The split is by appclass, so a capsule has exactly one owner and the schedulers don't step on each other's intended work. They only ever race for the same machine, never for the same capsule.
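The "exactly one owner" property can be checked mechanically. Here is a minimal sketch, assuming four appclass variants and a cut-down, synchronous trait called Claims; the variant names, the trait name, and the owners helper are all hypothetical, and the module_name argument of the real claims is left out for brevity:

```rust
// Hypothetical stand-ins; the real AppClass and Navigator live in Orbit.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AppClass {
    LatencySensitive,
    BestEffort,
    Batch,
    Gang,
}

/// Only the ownership half of the Navigator contract; `plan` is omitted.
pub trait Claims {
    fn name(&self) -> &'static str;
    fn claims(&self, appclass: AppClass) -> bool;
}

pub struct Services;
pub struct Batch;
pub struct Gang;

impl Claims for Services {
    fn name(&self) -> &'static str { "services" }
    fn claims(&self, c: AppClass) -> bool {
        matches!(c, AppClass::LatencySensitive | AppClass::BestEffort)
    }
}

impl Claims for Batch {
    fn name(&self) -> &'static str { "batch" }
    fn claims(&self, c: AppClass) -> bool { c == AppClass::Batch }
}

impl Claims for Gang {
    fn name(&self) -> &'static str { "gang" }
    fn claims(&self, c: AppClass) -> bool { c == AppClass::Gang }
}

/// Names of every Navigator that answers "mine" for this appclass.
pub fn owners(navs: &[&dyn Claims], appclass: AppClass) -> Vec<&'static str> {
    navs.iter()
        .filter(|n| n.claims(appclass))
        .map(|n| n.name())
        .collect()
}
```

If owners ever returns zero or two names for some appclass, the partition is broken: either work is orphaned or two Navigators will race for the same capsule.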
plan turns pending work into proposals, and a Proposal knows how to become a ledger write. This is the join between the scheduler and the consistent store:
impl Proposal {
    pub fn to_commitment(&self) -> Commitment { /* Placement or GangPlacement */ }
    pub fn precondition(&self) -> Precondition {
        match self {
            Self::Placement { node_id, envelope, .. } => Precondition::NodeFreeOf {
                node: node_id.clone(),
                at_least: envelope.reservation.clone(),
            },
            Self::Gang { gang_id, .. } => Precondition::GangUnplaced { gang: gang_id.clone() },
        }
    }
}
A placement proposal carries its own optimism. It says "commit me only if this node still has my reservation free." A gang proposal says "commit me only if this gang isn't already placed." The scheduler doesn't lock anything to make those true; it just states the bet, and the ledger settles it.
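To make "the ledger settles it" concrete, here is a toy in-memory ledger. It is a sketch under loud assumptions: ToyLedger, its CPU-only accounting, and the shape of Placement are invented for illustration and are not Orbit's ledger; only the idea of checking a NodeFreeOf precondition at commit time comes from the text above.

```rust
use std::collections::HashMap;

// Simplified, hypothetical stand-ins for Orbit's ledger types.
#[derive(Debug, PartialEq)]
pub enum LedgerError {
    Conflict { node: String },
}

pub enum Precondition {
    NodeFreeOf { node: String, at_least: u64 },
}

pub struct Placement {
    pub capsule: String,
    pub node: String,
    pub cpu: u64,
}

impl Placement {
    /// The bet this placement makes: the node still has room for it.
    pub fn precondition(&self) -> Precondition {
        Precondition::NodeFreeOf { node: self.node.clone(), at_least: self.cpu }
    }
}

#[derive(Default)]
pub struct ToyLedger {
    free_cpu: HashMap<String, u64>, // node -> unreserved CPU
    placed: Vec<String>,            // capsule names, in commit order
}

impl ToyLedger {
    pub fn with_node(mut self, node: &str, cpu: u64) -> Self {
        self.free_cpu.insert(node.to_string(), cpu);
        self
    }

    pub fn free(&self, node: &str) -> u64 {
        self.free_cpu.get(node).copied().unwrap_or(0)
    }

    pub fn placed(&self) -> &[String] {
        &self.placed
    }

    /// Check the precondition and apply the write as one step.
    /// `&mut self` stands in for the ledger serializing commits.
    pub fn propose(&mut self, p: Placement, pre: Precondition) -> Result<(), LedgerError> {
        match pre {
            Precondition::NodeFreeOf { node, at_least } => {
                if self.free(&node) < at_least {
                    return Err(LedgerError::Conflict { node });
                }
            }
        }
        if let Some(free) = self.free_cpu.get_mut(&p.node) {
            *free = free.saturating_sub(p.cpu);
        }
        self.placed.push(p.capsule);
        Ok(())
    }
}
```

Two proposals planned against the same stale view can both target a node with room for only one of them. The first propose succeeds; the second gets Conflict, with no lock ever taken by either scheduler.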
The loop is the same for every Navigator
Every Navigator, whatever its placement strategy, runs in the same engine loop. This is the entire thing:
loop {
    let view = match SchedView::build(&*eph, &*tel).await {
        Ok(v) => v,
        Err(_) => { backoff.increase(); backoff.wait().await; continue; }
    };
    let pending: Vec<_> = view.pending.iter()
        .filter(|p| nav.claims(&p.module_name, p.appclass))
        .cloned().collect();
    // Idle pass: slow the scan rate instead of waiting at the current interval.
    if pending.is_empty() { backoff.increase(); backoff.wait().await; continue; }
    let proposals = nav.plan(&view, &pending).await;
    let mut any_success = false;
    for proposal in proposals {
        match eph.propose(proposal.to_commitment(), proposal.precondition()).await {
            Ok(_) => any_success = true,
            Err(LedgerError::Conflict { .. }) => {} // stale view; reconsidered next pass
            Err(_) => {} // any other ledger error is also skipped this pass
        }
    }
    if any_success { backoff.reset(); } else { backoff.increase(); }
    backoff.wait().await;
}
Build a view by joining the two planes. Filter to the work this Navigator claims. Plan. Propose each result. And here's the line the whole design turns on:
Err(LedgerError::Conflict { .. }) => {} // stale view; reconsidered next pass
A conflict is nothing. It's not logged as an error, it's not retried in a tight loop, it's not a failure. It means another Navigator committed to that node first, the precondition caught it, and this capsule will simply be re-planned on the next pass against a fresher view. Two schedulers racing for the same machine is the expected case, and it resolves correctly because the ledger serializes the commit. The schedulers never talk to each other. They don't need to. The only thing they share is the ledger, and the ledger is the referee.
The backoff makes the loop polite. A pass that committed something resets to the fast interval, because there's clearly work to do. A pass that committed nothing, whether because there was no pending work or because everything conflicted, backs off toward a slower interval, so an idle Constellation isn't spinning a dozen schedulers at full tilt. It's the crudest possible adaptive scan rate, and it's enough.
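The backoff type itself isn't shown in this post. One way it could look, as a sketch: the doubling policy, the min and max bounds, and the interval accessor are assumptions, and the async wait from the loop is left out so the example needs no runtime.

```rust
use std::time::Duration;

/// Hypothetical backoff: doubles toward `max`, snaps back to `min` on progress.
pub struct Backoff {
    min: Duration,
    max: Duration,
    current: Duration,
}

impl Backoff {
    pub fn new(min: Duration, max: Duration) -> Self {
        Self { min, max, current: min }
    }

    /// A pass committed something: return to the fast interval.
    pub fn reset(&mut self) {
        self.current = self.min;
    }

    /// A pass committed nothing: slow down, but never past `max`.
    pub fn increase(&mut self) {
        self.current = (self.current * 2).min(self.max);
    }

    /// How long the loop would sleep before its next pass.
    pub fn interval(&self) -> Duration {
        self.current
    }
}
```

The cap matters as much as the growth: without max, a long idle stretch would leave a Navigator sleeping so long that newly admitted work waits on it.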
What a real Navigator does inside plan
The Services Navigator is the one most people picture when they say "scheduler": greedy feasibility plus scoring, the same hybrid shape Borg used. For each pending capsule it walks the feasible nodes and picks the best by a score where lower is better:
fn score(node: &NodeInfo, pc: &PendingCapsule, envelope: &ResourceEnvelope) -> f64 {
    // average remaining-resource fraction after placing, prefer a tight fit
    let stranded = /* mean over nonzero dims of remaining/available */;
    // more capsules already here -> higher penalty (spread)
    let spread_penalty = node.placed_capsules.len() as f64 * 5.0;
    // soft `prefer` predicates that match -> bonus
    let preference_bonus: f64 = pc.prefer.iter()
        .filter(|(predicate, _)| node.satisfies(predicate))
        .map(|(_, weight)| *weight).sum();
    // node in a failure domain the capsule wants to spread across -> bonus
    let spread_bonus = if pc.spread_domains.contains(&node.failure_domain) { 2.5 } else { 0.0 };
    (stranded + spread_penalty - preference_bonus - spread_bonus).max(0.0)
}
Four forces, and they're the Borg scorer in miniature. Minimize stranded resources, so a node ends up tightly packed rather than littered with unusable slivers. Penalize piling capsules onto one machine, which spreads load. Reward the soft preferences the workload expressed. Reward landing in the failure domains the workload asked to spread across. Hard require predicates aren't in the score at all. They filter the node out before scoring, because a requirement that doesn't hold makes the node infeasible, full stop. That hard-versus-soft split came from the object model, and here's where it pays off.
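The filter-then-score order is easy to see in a stripped-down form. This is a sketch, not the Services Navigator: labels stand in for predicates, CPU is the only resource dimension, the stranded term is one plausible reading of "remaining over available", and Node, Capsule, feasible, and pick are hypothetical names.

```rust
// Hypothetical, simplified types: string labels stand in for predicates.
pub struct Node {
    pub id: &'static str,
    pub labels: Vec<&'static str>,
    pub free_cpu: u64,
    pub placed: usize,
}

pub struct Capsule {
    pub cpu: u64,
    pub require: Vec<&'static str>,       // hard: all must match
    pub prefer: Vec<(&'static str, f64)>, // soft: (label, weight)
}

/// Hard constraints: any failure makes the node infeasible.
fn feasible(n: &Node, c: &Capsule) -> bool {
    n.free_cpu >= c.cpu && c.require.iter().all(|r| n.labels.contains(r))
}

/// Soft score, lower is better. Weights are illustrative only.
fn score(n: &Node, c: &Capsule) -> f64 {
    let stranded = if n.free_cpu == 0 {
        0.0
    } else {
        n.free_cpu.saturating_sub(c.cpu) as f64 / n.free_cpu as f64
    };
    let spread_penalty = n.placed as f64 * 5.0;
    let bonus: f64 = c.prefer.iter()
        .filter(|(label, _)| n.labels.contains(label))
        .map(|(_, weight)| *weight)
        .sum();
    (stranded + spread_penalty - bonus).max(0.0)
}

/// Filter on hard constraints first, then take the lowest soft score.
pub fn pick<'a>(nodes: &'a [Node], c: &Capsule) -> Option<&'a Node> {
    nodes.iter()
        .filter(|n| feasible(n, c))
        .min_by(|a, b| score(a, c).total_cmp(&score(b, c)))
}
```

A node that would score best on tightness never gets scored at all if it fails a requirement, and a matching preference can outweigh a tighter fit elsewhere. Those are the two behaviours the hard-versus-soft split buys.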
One subtlety that's easy to miss: a single plan pass reserves capacity locally as it goes. When it places a capsule on a node, it adds that reservation to its working copy of the node before scoring the next capsule:
node.usage = node.usage.add(&envelope.reservation);
node.placed_capsules.push(pc.capsule_id.clone());
So one Navigator won't oversubscribe a node within its own pass: it tracks what it's already promised. The ledger's precondition handles contention between Navigators; this handles contention within one. There's a property test that hammers this with random node capacities and capsule sizes and asserts no node ever gets planned past its CPU capacity. Two layers of safety, one optimistic and one local, and neither requires a lock.
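A minimal sketch of that local bookkeeping, assuming CPU-only nodes and first-fit placement in place of the real scoring; Node, Capsule, and plan_pass are hypothetical names, not Orbit's:

```rust
// Hypothetical, CPU-only stand-ins for NodeInfo and PendingCapsule.
#[derive(Clone)]
pub struct Node {
    pub id: u32,
    pub capacity: u64,
    pub usage: u64,
}

pub struct Capsule {
    pub id: u32,
    pub cpu: u64,
}

/// One greedy pass over `pending`.
/// Returns (capsule id, node id) proposals plus the working copy of the nodes.
pub fn plan_pass(nodes: &[Node], pending: &[Capsule]) -> (Vec<(u32, u32)>, Vec<Node>) {
    // Work on a copy; the shared view is never mutated.
    let mut working: Vec<Node> = nodes.to_vec();
    let mut proposals = Vec::new();
    for c in pending {
        // First fit keeps the sketch short; the real pass scores feasible nodes.
        let slot = working
            .iter_mut()
            .find(|n| n.capacity.saturating_sub(n.usage) >= c.cpu);
        if let Some(n) = slot {
            // Reserve locally before looking at the next capsule.
            n.usage += c.cpu;
            proposals.push((c.id, n.id));
        }
        // No fit: the capsule stays pending for a later pass.
    }
    (proposals, working)
}
```

Without the n.usage update, three 3,000-unit capsules would all be proposed onto the same 4,000-unit node, and the ledger would have to reject two of them. With it, the pass only proposes what it can account for itself.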
What's deliberately not here yet
Borg made greedy scheduling fast at scale with three tricks, and I want to be straight that Orbit has the shape for them but not all the muscle. Score caching: don't rescore a node whose state didn't change. Equivalence classes: run feasibility and scoring for one capsule and reuse the result for a whole group of identical replicas instead of all N. Relaxed randomization: examine machines in random order and stop once you've found enough feasible ones, rather than scoring the entire Constellation. Borg reported that a from-scratch schedule of a cell took a few hundred seconds with these techniques and had not finished after more than three days with them disabled, so they aren't optional at fifty thousand nodes.
The PlacementEngine trait exists precisely so these can slot in as building blocks without touching the engine loop or the Navigator contract. The greedy engine I've shown is correct and property-tested; it is not yet the version that's been proven to schedule a full Constellation in seconds. That's real work I haven't finished. The batch case, which needs global throughput optimization rather than greedy one-at-a-time placement, is harder still: hard enough that it gets called out as an open risk in its own right and shares a later post with the ML scheduler.
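Of the three Borg techniques, equivalence classes are the easiest to sketch in isolation. What follows is purely illustrative and not something Orbit implements today: ClassKey, the label-based require list, and feasible_by_class are hypothetical, and the sketch covers only the feasibility half.

```rust
use std::collections::HashMap;

// Hypothetical types: capsules with the same key are interchangeable for placement.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct ClassKey {
    pub cpu: u64,
    pub require: Vec<String>,
}

pub struct Capsule {
    pub id: u32,
    pub key: ClassKey,
}

pub struct Node {
    pub id: u32,
    pub free_cpu: u64,
    pub labels: Vec<String>,
}

/// Feasible node ids per capsule, evaluating feasibility once per class.
/// Also returns how many (class, node) checks were actually run.
pub fn feasible_by_class(
    nodes: &[Node],
    pending: &[Capsule],
) -> (HashMap<u32, Vec<u32>>, usize) {
    let mut per_class: HashMap<&ClassKey, Vec<u32>> = HashMap::new();
    let mut checks = 0;
    let mut out = HashMap::new();
    for c in pending {
        // Only the first capsule of each class pays for the node walk.
        let ids = per_class.entry(&c.key).or_insert_with(|| {
            checks += nodes.len();
            nodes.iter()
                .filter(|n| {
                    n.free_cpu >= c.key.cpu
                        && c.key.require.iter().all(|r| n.labels.contains(r))
                })
                .map(|n| n.id)
                .collect()
        });
        out.insert(c.id, ids.clone());
    }
    (out, checks)
}
```

With five identical replicas and one odd capsule on three nodes, this runs 6 checks instead of 18. The cached list is only a starting point, though: as a pass reserves capacity locally, entries still have to be rechecked against the working copy of the node.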
What I'm confident in is the architecture. The hard part of multi-scheduler systems isn't writing a second scheduler; it's making two schedulers safe to run against the same cluster without a distributed lock manager. Optimistic commit against a small consistent ledger is that safety, and it's the same five lines of conflict-handling whether you have one Navigator or ten. The next post steps back up to the front door, admission control, where work has to prove it's allowed in before any Navigator ever sees it.