The Address Space Manager: A Deep Dive
I've been knee-deep in building a distributed storage system and wanted to talk about a component that doesn't get much love in the literature: The address space manager. It's one of those pieces that sits quietly in the middle of everything, just doing its job so you never wonder about it, until it goes wrong.
The Problem We're Solving
Imagine you have a distributed storage system with multiple physical storage devices; could be SSDs, could be CXL-attached persistent memory, could be anything. Your application wants to read and write bytes at logical addresses like it's one big contiguous chunk of storage. Simple, right?
Except, it's not simple at all. Those bytes need to live on some physical medium. And what if you add a new device to the cluster, or need to remove one for maintenance? The application shouldn't care; it just wants to read 1000 bytes from address 0xdce8000 and get the right data back.
That's what an address space manager does: It maintains the mapping between what applications think they're talking to and where the data actually lives.
If you understand virtual memory, this should feel familiar. Your CPU's MMU translates virtual addresses to physical page frames. We're doing something similar, just at a different layer of the stack.
The Core Abstraction: Segments
The fundamental unit here is a segment. A segment is a contiguous range of logical addresses that maps to a contiguous range on a specific physical device. Think of it like a page table entry, but for storage.
class Segment:
    def __init__(self, logical_start, logical_end, device_id, phys_offset):
        self.logical_start = logical_start
        self.logical_end = logical_end
        self.device_id = device_id
        self.physical_offset = phys_offset
        self.flags = SegmentFlags()

A few things to note here. First off, it's Python because it's common and short enough for a blog post; you probably don't want to wade through a lot of Zig or Rust here. Next, the logical_end is exclusive, which makes the math cleaner when calculating sizes and checking containment. This is the half-open interval pattern that you see everywhere in systems programming literature. If you've ever dealt with off-by-one errors in range calculations, you know why this convention exists.
The flags track operational state:
class SegmentFlags:
    def __init__(self):
        self.active = True
        self.migrating = False
        self.read_only = False

You may wonder why not just use an enum here, with a single state instead of separate migrating and read_only flags. But because migration is a process with multiple phases, the states aren't mutually exclusive: the source remains active for reads while you copy data to the target, and only after the migration completes do you flip the flags and deactivate the source. More on this later.
Segment Operations
Before I get into resolution, here are a few helper methods that will serve us well as we go. They handle common operations that every segment needs, and they sit in the hot path.
    # More methods on Segment:
    def size(self):
        return self.logical_end - self.logical_start

    def contains(self, offset):
        return self.logical_start <= offset < self.logical_end

    def overlaps(self, start, end):
        return start < self.logical_end and end > self.logical_start

The overlaps helper is worth pausing on. With half-open intervals, two ranges overlap if and only if each starts before the other ends. This is one of those patterns that looks obvious once you see it, but getting it wrong causes subtle bugs that only manifest with specific offset combinations.
As an aside, I've seen systems where someone wrote start <= self.logical_end instead of start < self.logical_end, and it worked fine until there were two segments that were exactly adjacent. Then suddenly reads at the boundary would match both segments. If you think debugging is fun, you've probably never run into this sort of case.
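To make that concrete, here's a minimal sketch of the failure mode with two exactly adjacent segments; the offsets are made up for illustration:

# Two adjacent segments: [0, 4096) on device 0, [4096, 8192) on device 1.
a = Segment(0, 4096, device_id=0, phys_offset=0)
b = Segment(4096, 8192, device_id=1, phys_offset=0)

query_start, query_end = 4096, 4608   # a read that starts exactly at the boundary

# Correct half-open check: only b matches.
assert not a.overlaps(query_start, query_end)
assert b.overlaps(query_start, query_end)

# Buggy variant with <=: a *also* matches, because 4096 <= a.logical_end.
def overlaps_buggy(seg, start, end):
    return start <= seg.logical_end and end > seg.logical_start

assert overlaps_buggy(a, query_start, query_end)   # false positive at the boundary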
The Resolution Algorithm
The most common operation in an address space manager is the actual resolution algorithm. It happens on every read and every write. It might look something like this:
def resolve(self, offset, length):
    """Translate logical address range to physical locations.

    Returns a list of PhysicalRange objects."""
    if offset >= self.logical_capacity:
        raise OutOfBoundsError()
    end_offset = offset + length
    if end_offset > self.logical_capacity:
        raise RangeExceedsCapacityError()

    ranges = []
    current_offset = offset
    for segment in self.segments:
        if not segment.flags.active:
            continue
        if segment.overlaps(offset, end_offset):
            range_start = max(current_offset, segment.logical_start)
            range_end = min(end_offset, segment.logical_end)
            if range_start >= range_end:
                continue
            segment_offset = range_start - segment.logical_start
            range_length = range_end - range_start
            ranges.append(PhysicalRange(
                device_id=segment.device_id,
                offset=segment.physical_offset + segment_offset,
                length=range_length
            ))
            current_offset = range_end
            if current_offset >= end_offset:
                break
    if not ranges:
        raise NoSegmentFoundError()
    return ranges

Notice that this resolution function can return multiple physical ranges. If a read spans two segments (say, crossing from one device to another), you get back two ranges that you need to read from separately. The caller has to handle this and take whatever action makes sense for it.
You do need to be careful about:
- Ranges that span segment boundaries
- Segments that aren't active (migrations in progress, etc.)
- Gaps in the logical address space (sparse allocation)
The linear scan through segments might make those of you who are computer science students, or early on in your career, wince. I can hear you now: "O(n) per lookup? Really?" First off, shut up; secondly, fair point. In practice, you can use a more sophisticated data structure here, say an interval tree... but for systems with a reasonable number of segments (hundreds, not millions), n is small, and so who gives a shit? The linear scan is easier to reason about.
Tirade over, let's continue.
Concurrency, Getting the Locks Right
Here's where real systems get interesting... As you can imagine, in a distributed storage system, reads are happening all the time, many concurrently, and writes to the segment table only happen when you're adding/removing devices or migrating data. For simplicity, let's use a lock.
class AddressSpaceManager:
    def __init__(self):
        self.lock = RWLock()
        self.segments = []
        self.devices = []
        self.logical_capacity = 0

    def resolve(self, offset, length):
        with self.lock.read():
            ...  # resolution logic from above

    def allocate_segment(self, device_id, size):
        with self.lock.write():
            ...  # allocation logic

Why a reader-writer lock specifically? Because the access pattern leans this way. Resolution happens constantly, potentially thousands of times a second. Segment allocation? Maybe a few times a day. If you use a regular mutex, every resolution blocks every other resolution, which wastes time and IOPS. With an RWLock, readers can run concurrently, and writers acquire exclusive access when they need it.
So resolution only blocks when the segment table is being updated.
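Python's standard library doesn't ship a reader-writer lock, so if you want to run the snippets above you'll need something like this minimal, writer-unfair sketch (a real system would reach for a battle-tested implementation instead):

import threading
from contextlib import contextmanager

class RWLock:
    """Minimal reader-writer lock: many concurrent readers, exclusive writers."""
    def __init__(self):
        self._readers = 0
        self._readers_lock = threading.Lock()   # protects the reader count
        self._writer_lock = threading.Lock()    # held while any reader or a writer is active

    @contextmanager
    def read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()      # first reader blocks writers
        try:
            yield
        finally:
            with self._readers_lock:
                self._readers -= 1
                if self._readers == 0:
                    self._writer_lock.release()  # last reader lets writers in

    @contextmanager
    def write(self):
        with self._writer_lock:
            yield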
But there is still something hiding in the weeds that we need to know about: When you call resolve(), you get back a list of PhysicalRange objects. But what happens if someone starts a migration right after you release the read lock? You're now holding references to physical locations that might be getting copied somewhere else!
The truth is, not our problem. It's the caller's responsibility to complete the I/O operation before the migration completes. In practice, this works because:
- Migrations are rare and take time to copy gigabytes of data
- Individual I/O operations are fast, typically in the microsecond to millisecond range
- The migration process waits for in-flight I/O to drain before finalizing
If you ever find yourself building something like this, you could add a reference counting scheme to tackle in-flight operations. Or you might decide the race condition is acceptable given your access patterns. There are no right answers, because it's a tradeoff, and engineering is all about tradeoffs, after all.
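If you do go the reference-counting route, a minimal sketch might look like the following; the names (InflightTracker, drain) are mine for illustration, and a real version would want timeouts and per-segment granularity:

import threading

class InflightTracker:
    """Counts in-flight I/O so a migration can wait for it to drain."""
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def begin_io(self):
        with self._cond:
            self._count += 1

    def end_io(self):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def drain(self):
        # Called by the migration path before finalizing.
        with self._cond:
            while self._count > 0:
                self._cond.wait()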
Device Management
Before we can allocate segments, we need to know about the devices in our system. The address space manager maintains a registry of devices with their capacities:
def add_device(self, device_id, capacity, numa_node=-1):
    with self.lock.write():
        for device in self.devices:
            if device.device_id == device_id:
                raise DeviceAlreadyExistsError()
        self.devices.append(DeviceInfo(
            device_id=device_id,
            capacity=capacity,
            numa_node=numa_node
        ))

The numa_node parameter hints at something important in real systems: not all devices are an equal distance from all CPUs. A NUMA-aware allocator would try to place data close to the threads that access it most frequently, but that's a whole other can of worms to explore... one day.
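Just to gesture at what that could look like, here's a hypothetical helper that prefers devices on a caller-supplied NUMA node and falls back to anything with room; it's a sketch, not how any particular allocator actually decides:

def pick_device(self, size, preferred_numa_node=None):
    # Assumes the caller already holds the appropriate lock.
    # Prefer a device on the caller's NUMA node, else any device with room.
    candidates = [d for d in self.devices
                  if self.calculate_device_used(d.device_id) + size <= d.capacity]
    if preferred_numa_node is not None:
        local = [d for d in candidates if d.numa_node == preferred_numa_node]
        if local:
            candidates = local
    if not candidates:
        raise InsufficientCapacityError()
    # Least-used first, as a crude form of load balancing.
    return min(candidates, key=lambda d: self.calculate_device_used(d.device_id))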
Removing a device is a bit more involved. You can't just delete it from the list; there might be segments living on it:
def remove_device(self, device_id):
    with self.lock.write():
        for segment in self.segments:
            if segment.device_id == device_id and segment.flags.active:
                raise DeviceHasActiveSegmentsError()
        self.devices = [d for d in self.devices if d.device_id != device_id]

This is where the migration logic comes in. If you want to decommission a device, you first migrate all its segments to other devices, then remove it. The DeviceHasActiveSegmentsError acts as a safety interlock, preventing accidental data loss.
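Tying that together, a decommission flow might look roughly like this sketch, using the migration calls from the next section; copy_segment_data is a hypothetical stand-in for the chunked data copy, and target selection is deliberately naive:

def decommission_device(asm, device_id, target_device_id):
    # Migrate every active segment off the device, then remove it.
    while True:
        victims = [s for s in asm.segments
                   if s.device_id == device_id and s.flags.active]
        if not victims:
            break
        for segment in victims:
            handle = asm.begin_migration(segment.logical_start, target_device_id)
            copy_segment_data(asm, handle)   # hypothetical: chunked copy, outside the lock
            asm.complete_migration(handle)
    asm.remove_device(device_id)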
Migration, It Must Be Done
Adding capacity is easy. You register a new device, allocate a segment on it, and the new logical address space is immediately available. Boom, done.
Removing capacity? Well, that's where it gets tricky.
You can't just yank a device out; there could be data on it! You need to move that data first, and in case it's not clear, that's what a migration is. In our case it has multiple states it must go through, so let's dig in.
- Migration begins. The source is marked active=true, migrating=true, read_only=true; the target is created with active=false, migrating=true, read_only=false.
- Data is then copied from source to target. During this time, reads still go to the source, no new writes land on the source, and bytes get copied in chunks.
- Migration completes, and we atomically flip the flags: source becomes active=false, migrating=false; target becomes active=true, migrating=false.
The key thing to remember in this migration, is that source and target segments cover the same logical address range. During migration, they coexist in the segment table. The source is active (for reads) but read-only. The target is inactive (not participating in resolution) but being filled with data.
Let's see some code:
def begin_migration(self, offset, target_dev):
    with self.lock.write():
        source_idx = self.find_segment_containing(offset)
        if source_idx is None:
            raise SegmentNotFoundError()
        source = self.segments[source_idx]
        if source.flags.migrating:
            raise MigrationInProgressError()

        target_capacity = self.get_device_capacity(target_dev)
        target_used = self.calculate_device_used(target_dev)
        if target_used + source.size() > target_capacity:
            raise InsufficientCapacityError()

        source.flags.migrating = True
        source.flags.read_only = True

        target = Segment(
            logical_start=source.logical_start,
            logical_end=source.logical_end,
            device_id=target_dev,
            phys_offset=target_used,
        )
        target.flags.active = False
        target.flags.migrating = True
        self.segments.append(target)

        return MigrationHandle(source_idx, target_dev, len(self.segments) - 1)

The actual data copy happens outside the lock, orchestrated by whoever holds the MigrationHandle. This is important! We don't want to stop the world for address resolution while we copy gigabytes of data from host to host. Blocking all I/O is not acceptable.
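For completeness, the copy loop driven by the handle might look something like this sketch (it's the copy_segment_data stand-in from the decommission example earlier); read_physical and write_physical are hypothetical device I/O helpers, and the chunk size is arbitrary:

CHUNK_SIZE = 4 * 1024 * 1024  # copy in 4 MiB chunks, outside the segment-table lock

def copy_segment_data(asm, handle):
    source = asm.segments[handle.source_segment_index]
    target = asm.segments[handle.target_segment_index]
    copied = 0
    total = source.size()
    while copied < total:
        n = min(CHUNK_SIZE, total - copied)
        # Hypothetical device I/O helpers; your storage layer supplies the real ones.
        data = read_physical(source.device_id, source.physical_offset + copied, n)
        write_physical(target.device_id, target.physical_offset + copied, n, data)
        copied += n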
Once the copy is complete:
def complete_migration(self, handle):
    with self.lock.write():
        source = self.segments[handle.source_segment_index]
        target = self.segments[handle.target_segment_index]
        target.flags.active = True
        target.flags.migrating = False
        source.flags.active = False
        source.flags.migrating = False

After this, any new resolve() calls will find the target segment instead of the source. The source is now a zombie; it can be garbage collected once you're sure no in-flight operations are still using it.
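Garbage collecting those zombies can be as simple as the following sketch, assuming you have something like the in-flight tracker from earlier to tell you when it's safe:

def reap_zombie_segments(self):
    # Call this only after in-flight I/O has drained (e.g. inflight.drain()).
    # Note: this compacts the list, so any outstanding MigrationHandle indices go stale.
    with self.lock.write():
        self.segments = [s for s in self.segments
                         if s.flags.active or s.flags.migrating]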
House Cleaning: Things I Glossed Over
While this post presents a functional address space manager, it's not suited for production. Here are some things I glossed over:
- Persistence: If the system crashes, you need to recover the segment table. In the real implementation I've built in Zig, segment metadata changes get written to a log before taking effect. On recovery, you replay the log and reconstruct the segment table. (A minimal sketch of the idea follows this list.)
- O(log n) lookups: The linear scan works fine for small segment counts, but if you're managing thousands of segments, you want an interval tree or similar data structure. A red-black tree where each node stores the maximum endpoint in its subtree gives you O(log n) point queries and O(log n + k) range queries, where k is the number of overlapping segments.
- NUMA awareness: When you have multiple memory devices, some are closer to certain CPUs than others. A smart allocator would try to place segments on devices that are local to the threads accessing them. This gets particularly important with CXL-attached memory where remote accesses might add 100+ nanoseconds of latency, best case.
- Fragmentation: After enough migrations, you might end up with a Swiss cheese of allocated and free space on each device. The segment model as presented assumes dense packing on each device, but in reality you need free space tracking and potentially compaction within devices.
- Sparse address spaces: The current design allocates segments contiguously at the end of the logical address space. A more sophisticated system might support gaps, allocating at arbitrary offsets and tracking free ranges. This is useful if you want to reserve address space for future expansion, or if you're exposing the logical addresses to applications that expect specific layouts.
- Observability: For load balancing and capacity planning, you'll want to know how full each device is and how hot each segment is. The implementation has hooks for this, but the actual metrics collection and decision-making is a whole separate subsystem with its own complexity.
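On the persistence point, here's a minimal sketch of what logging segment-table changes might look like in the Python model; the record format and function names are illustrative, not the Zig implementation's actual format:

import json

def log_segment_add(log_file, segment):
    # Append the mutation before applying it to the in-memory table.
    record = {
        "op": "add_segment",
        "logical_start": segment.logical_start,
        "logical_end": segment.logical_end,
        "device_id": segment.device_id,
        "physical_offset": segment.physical_offset,
    }
    log_file.write(json.dumps(record) + "\n")
    log_file.flush()  # a real write-ahead log would also fsync here

def replay_log(asm, log_path):
    # On recovery, rebuild the segment table from the log.
    # A real recovery would also restore the device registry and logical_capacity.
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "add_segment":
                asm.segments.append(Segment(
                    record["logical_start"],
                    record["logical_end"],
                    record["device_id"],
                    record["physical_offset"],
                ))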
Wrapping Up
The address space manager is one of the many types of software that align with this truism: The hard part isn't making it work, it's making it work correctly. In our case today, that means concurrent reads, failures, and operational changes.
Next time, I might dig into the erasure coding layer that sits beneath this, and that's where things get really fun.