The Airlock

Every system needs a front door, and the front door is where you decide what "allowed in" means before anything inside has to care. In Orbit that's the Airlock. Nothing enters a Constellation without passing it: the client proves who they are, the mission proves it's well-formed, and the request proves there's quota to pay for it. Get the front door right and the schedulers, the ledger, and the Satellites all get to assume their inputs are already sane.

The name is doing real work. An airlock is the one place where you check pressure before opening the inner door, and it's a chokepoint on purpose. So is this one.

What admission actually checks

The contract is one trait. The method that matters is admit:

#[async_trait]
pub trait Airlock: Send + Sync {
    async fn admit(&self, client_svid: &Svid, mission: &Mission) -> Result<Admission, AdmissionError>;
    async fn reject(&self, mission_id: &MissionId) -> Result<(), AdmissionError>;
    async fn verify(&self, svid: &Svid) -> Result<Principal, AdmissionError>;
    async fn admitted_missions(&self) -> Result<Vec<Mission>, AdmissionError>;
    async fn adjust_module(&self, mission: &MissionId, module: &ModuleName,
                           adjustment: ModuleAdjustment, expected_epoch: Epoch)
        -> Result<LogIndex, AdmissionError>;
    // ...
}

admit takes the caller's cryptographic identity (an Svid, which gets its own post later) and a mission, and either returns an Admission receipt or one of a small set of refusals:

pub enum AdmissionError {
    Unauthorized(String),
    SchemaError(String),
    InvalidMission(String),
    QuotaExceeded(ResourceVec, ResourceVec),
    Overloaded { in_flight: usize, limit: usize },
    DuplicateModule(String),
    GangTooLarge(u32, u32),
    // ...
}

Each of those is a different reason the door stays shut, and the order they're checked in is the interesting part.

The admit path, in order

Here's the spine of admission, lightly trimmed. Read it top to bottom, because the sequence is the design:

async fn admit_verified_principal(&self, principal: Principal, client: TransponderId, mission: &Mission)
    -> Result<Admission, AdmissionError>
{
    // 1. shed load before doing any work
    let _permit = AdmissionPermit::acquire(&self.in_flight_admissions, self.config.admission_limit)?;

    // 2. validate the mission's shape
    Self::validate(mission)?;

    let _op = self.op_lock.lock().await;
    let needs = Self::compute_needs(mission)?;

    // 3. idempotency: already admitted?
    if let Some(record) = self.admitted.read().unwrap().get(&mission.id) {
        if record.principal != principal {
            return Err(AdmissionError::Unauthorized(/* owned by someone else */));
        }
        return Ok(record.admission.clone());
    }

    // 4. charge quota
    self.quota.write().unwrap().get_mut(&principal)
        .ok_or(/* no quota */)?
        .try_consume(&needs)?;

    // 5. record the grant durably
    let grant_index = self.eph.propose(quota_grant, Precondition::None).await /* with rollback */;

    // 6. open the network path
    self.install_dataplane_policies(&dataplane_policies).await /* with rollback */;

    // 7. remember it
    self.admitted.write().unwrap().insert(mission.id.clone(), record);
    Ok(admission)
}

Load shedding comes first, before any real work, which is the same instinct as the telemetry plane's backpressure. The airlock holds a count of in-flight admissions and refuses new ones past a limit:

fn acquire(counter: &AtomicUsize, limit: usize) -> Result<Self, AdmissionError> {
    let mut current = counter.load(Ordering::Acquire);
    loop {
        if current >= limit {
            return Err(AdmissionError::Overloaded { in_flight: current, limit });
        }
        match counter.compare_exchange_weak(current, current + 1, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return Ok(Self { counter }),
            Err(actual) => current = actual,
        }
    }
}

This is Borg's lesson about admission control as a load-shedding valve, made concrete. When the plane is under pressure, the airlock turns work away at the door with a clear Overloaded rather than letting it pile up inside and take everything down. A rejected admission is a client's problem to retry. An overwhelmed control plane is everyone's problem.
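The acquire loop above only takes a slot. Something also has to give it back. Here is a minimal, self-contained sketch of the whole permit, assuming it is an RAII guard that decrements the counter when dropped (the `_permit` binding in the admit path suggests that, but the `Drop` impl is not shown in this post, so treat it as my assumption). The standalone `Overloaded` struct is a stand-in for the `AdmissionError::Overloaded` variant.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for `AdmissionError::Overloaded` (hypothetical standalone type).
#[derive(Debug, PartialEq)]
pub struct Overloaded {
    pub in_flight: usize,
    pub limit: usize,
}

/// Holding a permit means one in-flight slot is reserved.
pub struct AdmissionPermit<'a> {
    counter: &'a AtomicUsize,
}

impl<'a> AdmissionPermit<'a> {
    pub fn acquire(counter: &'a AtomicUsize, limit: usize) -> Result<Self, Overloaded> {
        let mut current = counter.load(Ordering::Acquire);
        loop {
            if current >= limit {
                // At capacity: refuse without touching the counter.
                return Err(Overloaded { in_flight: current, limit });
            }
            match counter.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Ok(Self { counter }),
                // Lost a race (or a spurious failure): retry with the fresh value.
                Err(actual) => current = actual,
            }
        }
    }
}

impl Drop for AdmissionPermit<'_> {
    fn drop(&mut self) {
        // Assumption: the slot is returned on every exit path,
        // including early returns via `?`.
        self.counter.fetch_sub(1, Ordering::AcqRel);
    }
}
```

With a guard like this, every early return in the admit path releases its slot without any explicit bookkeeping, which is what keeps the limit honest when later steps fail.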

Idempotency is an availability story

Step three is the one I want to dwell on, because it's load-bearing for the four-nines claim. If a mission is already admitted, admission doesn't grant it a second time. It returns the same receipt, as long as it's the same principal asking:

if let Some(record) = admitted.get(&mission.id) {
    if record.principal != principal {
        return Err(AdmissionError::Unauthorized(/* ... */));
    }
    return Ok(record.admission.clone());
}

Picture a client that submits a mission, the admission succeeds, and then the network drops the response on the way back. The client has no idea whether it worked. The only safe thing it can do is resubmit. In a system where admit isn't idempotent, that resubmission double-charges quota and double-grants, and now you've got a phantom workload nobody asked for. Here, resubmitting returns the receipt from the first attempt and changes nothing. The client can retry as many times as it likes.

This is the whole reason the design leans on declarative desired state and idempotent commitments: a failed client can safely resubmit, and at scale clients fail constantly. Make retries safe and a huge class of operational pain just evaporates. Make them unsafe and you spend your nines cleaning up duplicates.
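To make the retry story concrete, here is a toy, in-memory sketch of an idempotent admit. It is not the Orbit implementation: `ToyAirlock`, `Receipt`, `Refusal`, string-typed principals and mission IDs, and a single integer of quota are all simplifications I'm assuming for illustration.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Receipt {
    pub grant_index: u64,
}

#[derive(Debug, PartialEq)]
pub enum Refusal {
    Unauthorized,
    NoQuota,
    QuotaExceeded,
}

struct Record {
    principal: String,
    receipt: Receipt,
}

/// Hypothetical stand-in for the airlock's admitted set and quota table.
pub struct ToyAirlock {
    admitted: HashMap<String, Record>,
    quota_left: HashMap<String, u32>,
    next_index: u64,
}

impl ToyAirlock {
    pub fn new(quotas: &[(&str, u32)]) -> Self {
        Self {
            admitted: HashMap::new(),
            quota_left: quotas.iter().map(|(p, q)| (p.to_string(), *q)).collect(),
            next_index: 0,
        }
    }

    pub fn admit(&mut self, principal: &str, mission_id: &str, cost: u32) -> Result<Receipt, Refusal> {
        // Idempotency check comes before any charge.
        if let Some(record) = self.admitted.get(mission_id) {
            if record.principal != principal {
                return Err(Refusal::Unauthorized);
            }
            return Ok(record.receipt.clone());
        }
        // First time we see this mission: charge quota exactly once.
        let left = self.quota_left.get_mut(principal).ok_or(Refusal::NoQuota)?;
        if *left < cost {
            return Err(Refusal::QuotaExceeded);
        }
        *left -= cost;
        let receipt = Receipt { grant_index: self.next_index };
        self.next_index += 1;
        self.admitted.insert(
            mission_id.to_string(),
            Record { principal: principal.to_string(), receipt: receipt.clone() },
        );
        Ok(receipt)
    }

    pub fn quota_left(&self, principal: &str) -> Option<u32> {
        self.quota_left.get(principal).copied()
    }
}
```

Calling `admit("alice", "m-1", 4)` twice in this sketch returns the same receipt and leaves Alice's quota charged once; a different principal resubmitting `m-1` is refused and pays nothing.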

Unwinding when a step fails

The middle steps each touch a different subsystem (quota in memory, the grant in the ledger, the policy in the dataplane), and any of them can fail. Admission has to be all-or-nothing, so each step knows how to undo the ones before it. If the ledger write fails, the quota gets released:

let grant_index = match self.eph.propose(grant, Precondition::None).await {
    Ok(index) => index,
    Err(error) => {
        if let Some(p) = self.quota.write().unwrap().get_mut(&principal) {
            p.release(&needs);
        }
        return Err(AdmissionError::Internal(error.to_string()));
    }
};

And if the dataplane policy install fails after the grant landed, it revokes the grant and releases the quota:

if let Err(error) = self.install_dataplane_policies(&dataplane_policies).await {
    self.eph.propose(Commitment::Revocation { what: CommitmentRef { index: grant_index } },
                     Precondition::None).await.ok();
    if let Some(p) = self.quota.write().unwrap().get_mut(&principal) {
        p.release(&needs);
    }
    return Err(error);
}

This is a saga, and it's only tractable because the ledger's commitments are reversible by construction. A QuotaGrant is undone by a Revocation pointing at its index; that's exactly what the ledger post built, and here's where it earns its place. Rejecting an admitted mission later runs the same compensations: revoke the grant, tear down the policies, release the quota. There's no special cleanup code; reject is just admit's saga played backward.
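The all-or-nothing property is easier to trust when you can break each step on purpose. Below is a small synchronous sketch with failure injection. `World`, `Entry`, and the `fail_*` flags are hypothetical test scaffolding of mine, and the append-only `Vec` is only a stand-in for the ledger, under the assumption that undoing a grant means appending a revocation rather than deleting anything.

```rust
#[derive(Debug, PartialEq)]
pub enum Entry {
    Grant { units: u32 },
    Revocation { index: usize },
}

/// Hypothetical stand-in for quota, ledger, and dataplane in one place.
#[derive(Default)]
pub struct World {
    pub quota_used: u32,
    pub ledger: Vec<Entry>,
    pub policies_installed: bool,
    pub fail_ledger: bool,    // failure injection
    pub fail_dataplane: bool, // failure injection
}

impl World {
    fn propose(&mut self, entry: Entry) -> Result<usize, String> {
        if self.fail_ledger {
            return Err("ledger unavailable".into());
        }
        self.ledger.push(entry);
        Ok(self.ledger.len() - 1)
    }

    fn install_policies(&mut self) -> Result<(), String> {
        if self.fail_dataplane {
            return Err("dataplane unavailable".into());
        }
        self.policies_installed = true;
        Ok(())
    }
}

pub fn admit(world: &mut World, units: u32) -> Result<usize, String> {
    // Step 1: charge quota.
    world.quota_used += units;

    // Step 2: record the grant; on failure, undo step 1.
    let index = match world.propose(Entry::Grant { units }) {
        Ok(index) => index,
        Err(error) => {
            world.quota_used -= units;
            return Err(error);
        }
    };

    // Step 3: install policies; on failure, undo step 2, then step 1.
    if let Err(error) = world.install_policies() {
        // Best-effort compensation, mirroring the `.ok()` in the snippet above.
        let _ = world.propose(Entry::Revocation { index });
        world.quota_used -= units;
        return Err(error);
    }
    Ok(index)
}
```

Flipping `fail_dataplane` leaves a grant followed by its revocation in the ledger and zero quota consumed, which is the state a reader of the ledger should be able to reconstruct after a failed admission.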

The airlock holds desired state

One method on that trait is quieter than the others and matters more than it looks: admitted_missions. The airlock remembers everything it has admitted, and that admitted set *is* the Constellation's desired state. The scheduling post showed Navigators running run_with_airlock; this is the other end of that wire. The Navigator asks the airlock "what should be running?", diffs it against what the ledger says is running, and schedules the gap. The airlock is the boundary between "a human asked for this" and "the system is making it true."
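The "diff and schedule the gap" step is a set difference at heart. A minimal sketch, assuming missions are identified by plain strings and that "running" is already available as a set (both simplifications; `missions_to_schedule` is a hypothetical name, not the Navigator's API):

```rust
use std::collections::BTreeSet;

/// Missions the airlock has admitted but the ledger does not yet show as running.
/// A `BTreeSet` keeps the result in a stable, sorted order.
pub fn missions_to_schedule(
    desired: &BTreeSet<String>,
    running: &BTreeSet<String>,
) -> Vec<String> {
    desired.difference(running).cloned().collect()
}
```

Anything in `running` but not in `desired` is deliberately ignored here; how the real system handles that side of the diff isn't covered in this post.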

adjust_module is the same door for a different caller. When the autoscaler (the subject of a future post) decides a module needs more memory or fewer replicas, it doesn't poke the ledger directly. It goes through the airlock, which validates the target and checks the expected epoch so two adjustments can't silently clobber each other. Everything that mutates desired state comes through the one door, which is the only way the load-shedding and the validation stay meaningful.
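The expected-epoch check is ordinary optimistic concurrency. Here is a minimal sketch of the idea, assuming an epoch is a counter bumped on every successful adjustment. `DesiredState`, `ModuleState`, and `AdjustError` are hypothetical names, and unlike the real `adjust_module` (which returns a `LogIndex`) this toy returns the new epoch.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum AdjustError {
    UnknownModule,
    StaleEpoch { expected: u64, actual: u64 },
}

pub struct ModuleState {
    pub replicas: u32,
    pub epoch: u64,
}

pub struct DesiredState {
    modules: HashMap<String, ModuleState>,
}

impl DesiredState {
    pub fn with_module(name: &str, replicas: u32) -> Self {
        let mut modules = HashMap::new();
        modules.insert(name.to_string(), ModuleState { replicas, epoch: 0 });
        Self { modules }
    }

    pub fn replicas(&self, module: &str) -> Option<u32> {
        self.modules.get(module).map(|m| m.replicas)
    }

    /// Apply the change only if the caller saw the latest epoch.
    pub fn adjust_replicas(
        &mut self,
        module: &str,
        replicas: u32,
        expected_epoch: u64,
    ) -> Result<u64, AdjustError> {
        let state = self.modules.get_mut(module).ok_or(AdjustError::UnknownModule)?;
        if state.epoch != expected_epoch {
            // Someone else adjusted this module first; the caller must re-read and retry.
            return Err(AdjustError::StaleEpoch {
                expected: expected_epoch,
                actual: state.epoch,
            });
        }
        state.replicas = replicas;
        state.epoch += 1;
        Ok(state.epoch)
    }
}
```

Two callers that both read epoch 0 can't both win: the second gets `StaleEpoch` and has to look at the new state before deciding whether its adjustment still makes sense.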

The other half: the gateway and its certificates

Admission is the logical front door. The airlock is also the *physical* one: it terminates TLS for the public Constellation API, and that turns certificate lifecycle into part of its correctness surface rather than an ops chore bolted on the side. The gateway speaks ACME (Let's Encrypt for public endpoints, a private ACME provider for internal ones), and the rules it has to honor are unforgiving in the specific way security things are. Certificates have to be valid for every DNS name the API answers on. Private keys stay server-side and never leave. Renewal has to happen before the cert enters its safety window, not after it has expired. And a *failed* renewal has to keep serving the last good certificate rather than dropping the gateway, because "my cert renewal flaked" should never become "my control plane is unreachable."
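Two of those rules, renew early and keep the last good certificate on failure, fit in a few lines. This is a sketch under assumptions, not the gateway's code: `Cert`, `CertSlot`, and `renew_lead` are hypothetical, the ACME exchange is abstracted into a closure, and I'm assuming renewal is triggered once the remaining lifetime drops to a lead time chosen to be longer than the safety window.

```rust
use std::time::{Duration, SystemTime};

#[derive(Clone, Debug, PartialEq)]
pub struct Cert {
    pub serial: u64, // hypothetical identifier, just to tell certs apart
    pub not_after: SystemTime,
}

pub struct CertSlot {
    current: Cert,
    renew_lead: Duration,
}

impl CertSlot {
    pub fn new(current: Cert, renew_lead: Duration) -> Self {
        Self { current, renew_lead }
    }

    /// True once the remaining lifetime is at or below the lead time.
    pub fn needs_renewal(&self, now: SystemTime) -> bool {
        now + self.renew_lead >= self.current.not_after
    }

    /// Try to renew if due; always return the certificate to serve.
    pub fn renew_with<F>(&mut self, now: SystemTime, issue: F) -> &Cert
    where
        F: FnOnce() -> Result<Cert, String>,
    {
        if self.needs_renewal(now) {
            if let Ok(fresh) = issue() {
                self.current = fresh;
            }
            // On failure, fall through: the last good certificate keeps serving.
        }
        &self.current
    }
}
```

The important property is that `renew_with` has no error return at all. A failed issuance is something to log and retry, never a reason to stop answering.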

That's a lot of fiddly correctness, and the reason it lives in the airlock is the same reason admission does: it's the one place every request already passes through, so it's the right place to enforce the things that must hold for all traffic. I'll save the identity machinery underneath it, how an Svid is issued and verified in the first place, for the identity post, because it deserves its own.

The honest part

The in-memory airlock serializes its mutating operations behind a single op_lock. That keeps the quota-charge / ledger-write / policy-install saga atomic and easy to reason about, and it is also exactly the kind of thing that becomes a contention point under a burst of admissions. Admission isn't on the hot path the way scheduling is: you admit a mission once and run it for a long time. So I think a serialized door is the right starting trade, but it's a trade, and a Constellation getting hammered with new missions would feel it.

The other open edge is that quota here is per-principal within one Constellation. Global quota, the kind a team consumes across a whole fleet of Constellations, lives in the advisory Fleet plane, and that's the last architectural post in the series. For now the door checks the quota it can see, which is the quota for the Constellation it guards.
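For a sense of what "the quota it can see" might look like, here is a minimal per-principal sketch with a component-wise check. The fields of `ResourceVec` (`cpu_millis`, `memory_mib`) are my invention, and reading the two values carried by `QuotaExceeded` as "requested" and "available" is an assumption; the post doesn't say what they hold.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ResourceVec {
    pub cpu_millis: u64, // hypothetical dimension
    pub memory_mib: u64, // hypothetical dimension
}

impl ResourceVec {
    /// Every dimension must fit; one oversized component fails the whole request.
    fn fits_within(&self, other: &ResourceVec) -> bool {
        self.cpu_millis <= other.cpu_millis && self.memory_mib <= other.memory_mib
    }
}

#[derive(Debug, PartialEq)]
pub struct QuotaExceeded {
    pub requested: ResourceVec,
    pub available: ResourceVec,
}

pub struct PrincipalQuota {
    limit: ResourceVec,
    used: ResourceVec,
}

impl PrincipalQuota {
    pub fn new(limit: ResourceVec) -> Self {
        Self { limit, used: ResourceVec { cpu_millis: 0, memory_mib: 0 } }
    }

    pub fn available(&self) -> ResourceVec {
        ResourceVec {
            cpu_millis: self.limit.cpu_millis.saturating_sub(self.used.cpu_millis),
            memory_mib: self.limit.memory_mib.saturating_sub(self.used.memory_mib),
        }
    }

    pub fn try_consume(&mut self, needs: &ResourceVec) -> Result<(), QuotaExceeded> {
        let available = self.available();
        if !needs.fits_within(&available) {
            return Err(QuotaExceeded { requested: *needs, available });
        }
        self.used.cpu_millis += needs.cpu_millis;
        self.used.memory_mib += needs.memory_mib;
        Ok(())
    }

    /// Compensation used by the unwinding path; saturates rather than underflowing.
    pub fn release(&mut self, needs: &ResourceVec) {
        self.used.cpu_millis = self.used.cpu_millis.saturating_sub(needs.cpu_millis);
        self.used.memory_mib = self.used.memory_mib.saturating_sub(needs.memory_mib);
    }
}
```

A fleet-wide quota would need the same check against state no single Constellation owns, which is why it belongs to a different plane.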

Next we go through the door and out to the machines, to the agent that runs the containers and, crucially, keeps running them when everything I've described so far falls over.