The Firehose

Share

The telemetry plane is the easy half of the state split to describe and the easy half to get quietly wrong. It carries everything the cluster observes about itself: usage, health, pressure, liveness, GPU temperature. It's high-volume, it's allowed to be stale, and the entire engineering problem is making fifty thousand machines' worth of churn cost almost nothing while still being useful to a scheduler. Borg solved this with a trick. Orbit turns the trick into a layer.

Full state in, diffs out

A Satellite reports its whole state on every poll. Not "here's what changed", here's everything, the complete NodeStateDelta:

pub struct NodeStateDelta {
    pub heartbeat: DateTime<Utc>,
    pub usage: ResourceVec,
    pub capsules: Vec<(CapsuleId, CapsuleHealth)>,
    pub pressure: Pressure,
    pub gpu_health: Vec<GpuHealthSample>,
}

Reporting full state every time sounds wasteful, and naively it would be. The payoff is resilience: a fresh report fully describes the node, so a shard that restarts or a consumer that just connected doesn't need a history to catch up. It reconstructs the world from the next round of reports. There's no replay, no "please re-send everything since sequence 4012," no special recovery path. The recovery path is the normal path.

The waste gets removed one layer up, by a diff-aggregating shard. It keeps the last view it saw for each node, and when a report comes in identical to the last one, it forwards nothing:

async fn ingest(&self, node: NodeId, delta: NodeStateDelta) -> Result<(), TelemetryError> {
    let current = view_from_delta(node.clone(), delta)?;
    let previous = self.state.read().unwrap().get(&node).cloned();

    if previous.as_ref() == Some(&current) {
        return Ok(()); // unchanged, nothing to propagate
    }

    self.upstream.ingest(node.clone(), delta_from_view(&current)).await?;
    // record new state + the diff we forwarded
}

Full state from the edge, diffs toward the center. This is Borg's link-shard idea generalized into an architectural layer: the leaves are resilient because they're stateless about history, and the interior is cheap because most reports don't change anything. A node sitting at steady load forwards exactly nothing upward until something actually moves. Stack these shards into a tree and a Constellation of fifty thousand machines produces a trickle at the root, not a flood, even though every machine is reporting constantly.

The plane is allowed to say no

Telemetry's superpower, the thing that justifies keeping it out of the ledger, is that it can drop data under load without anyone getting hurt. That's not an accident of the implementation; it's a wrapper you opt into:

pub struct BackpressuredTelemetryPlane {
    inner: Arc<dyn TelemetryPlane>,
    permits: Arc<Semaphore>,
    limit: usize,
}

#[async_trait]
impl TelemetryPlane for BackpressuredTelemetryPlane {
    async fn ingest(&self, node: NodeId, delta: NodeStateDelta) -> Result<(), TelemetryError> {
        let permit = self.permits.clone().try_acquire_owned()
            .map_err(|_| TelemetryError::Overloaded { in_flight: self.limit, limit: self.limit })?;
        let result = self.inner.ingest(node, delta).await;
        drop(permit);
        result
    }
    // get_node / observe pass straight through
}

It holds a fixed number of permits. When they're all out, ingest returns Overloaded immediately instead of queueing work it can't keep up with. The Satellite shrugs and tries again on its next poll with newer data anyway, so a shed sample isn't a gap, it's a slightly coarser sampling rate during a spike. A telemetry plane that sheds load is doing its job. A consensus log that shed load would be a disaster. That asymmetry is the entire reason these two planes are different code, and it's nice to see it show up as one plane having a try_acquire and the other not.

Reads are cheap and explicitly stale

Consumers read telemetry two ways. A point lookup, get_node, and a filtered stream, observe:

pub struct ObservationQuery {
    pub nodes: Option<Vec<NodeId>>,
    pub min_usage: Option<ResourceVec>,
}

You can ask for specific nodes, or for every node above some usage threshold, which is the query an autoscaler or a "what's hot right now" dashboard wants. Neither read touches consensus, and both are explicitly allowed to be a little behind.

There's a wrinkle that matters for how controllers actually use this. A scheduler or autoscaler runs a tight loop, and you don't want each pass making blocking network calls to a telemetry service scattered across the Constellation. So controllers keep a local CachedTelemetryPlane and refresh it once at the top of each pass:

pub async fn refresh_from(&self, client: &GrpcTelemetryClient, query: ObservationQuery)
    -> Result<(), TelemetryError>
{
    let views = client.observe_remote(query).await?;
    let mut state = self.state.write().unwrap();
    state.clear();
    for view in views {
        state.insert(view.node.clone(), view);
    }
    Ok(())
}

Pull a fresh batch, then run the whole control loop against an in-memory snapshot through the ordinary synchronous observe interface. The loop never blocks on the network mid-decision, and the staleness is bounded by how often it refreshes, which is a knob the controller owns rather than a property of the transport. This is the "possibly stale view" the scheduling design keeps promising is safe, and it's safe because, as the ledger post covered, the commit catches any decision made on an out-of-date view.

GPU health is a first-class citizen here

The one place telemetry gets genuinely rich is accelerators, and that's deliberate, because the ML story later in the series depends on it. A GPU isn't healthy-or-not the way a process is; it degrades in ways that matter long before it dies. So a GPU sample carries the things that actually predict a bad training run:

pub struct GpuHealthSample {
    pub identity: GpuIdentity,
    pub capsule: Option<CapsuleId>,
    pub health_score: u8,
    pub state: GpuHealthState, // Unknown | Healthy | Degraded | Unhealthy
    pub ecc_single_bit_errors: u64,
    pub ecc_double_bit_errors: u64,
    pub retired_pages: u64,
    pub xid_events: Vec<u32>,
    pub temperature_celsius: Option<f64>,
    pub thermal_throttle_active: bool,
    pub nvlink_error_count: u64,
    pub memory_used_bytes: Option<u64>,
    pub memory_total_bytes: Option<u64>,
    // ...
}

Double-bit ECC errors, retired memory pages, Xid events, NVLink errors, thermal throttling. A card throwing ECC errors and thermal-throttling is still "up" by any liveness check, and it will quietly halve the throughput of a job pinned to it. Surfacing this as telemetry is what lets a scheduler avoid a sick GPU and an operator find the one bad card in a rack of them.

The identity field is a small detail I care about more than its size suggests:

pub struct GpuIdentity {
    pub uuid: String,
    pub mig_uuid: Option<String>,
    pub pci_bus_id: Option<String>,
    pub product_name: Option<String>,
}

A GPU is identified by its stable UUID, not by /dev/nvidia3, because Linux device numbering shuffles across reboots and driver reloads, and "GPU 3 is sick" is useless if GPU 3 is a different physical card after the next reboot. The MIG UUID is in there too, because a card sliced into MIG instances has health per slice. Even though telemetry is the sloppy plane, the data in it is validated on the way in, finite temperatures, non-zero PCIe link widths, used memory never exceeding total, because sloppy-about-freshness shouldn't mean sloppy-about-shape.

What telemetry doesn't do

Worth being clear about the boundary. The telemetry plane observes; it doesn't decide. It carries a heartbeat timestamp on every view, but it doesn't itself declare a node dead. Liveness is a judgment a controller makes by looking at how stale a heartbeat has gotten, and the consequence of that judgment, evicting a capsule, rescheduling work, is a *commitment*, so it goes through the ledger, not telemetry. The plane's whole job is to be a cheap, current-ish, sheddable view of reality. Acting on that view is somebody else's job, on the other side of the state split.

That somebody is next. We've got two planes of state; now we need the thing that reads both, decides where work should go, and races its peers to commit the answer. The scheduler.