NUMA-Aware Allocation: Making Memory Local Again
I've been building a storage system on CXL-attached persistent memory, and one component that took more iteration than expected was the NUMA-aware allocator. The concept is simple--bind memory to specific NUMA nodes--but the implementation has some sharp edges that aren't obvious until you're debugging mysterious latency spikes in production.
Let me walk you through what I've learned.
Philosophy
The first decision was whether to use libnuma or go straight to using syscalls. I went with direct system calls.
const SYS_mbind = 237;
const SYS_get_mempolicy = 239;

The reasoning comes down to my philosophy building Basalt: libnuma is a fine library, but it is another dependency, and I don't want many of them. The syscall interface is straightforward, and Zig's std.os.linux.syscall6() compiles down to a single syscall instruction with register setup. No PLT indirection, no dynamic linking concerns.
The tradeoff is portability. These numbers are for x86_64; when I get around to testing on ARM hardware, I'll need new ones, so I'll need a table of sorts that returns the right values for each architecture. But for now, this is cleaner.
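When that day comes, the "table" will probably be nothing more than a comptime switch on the target architecture. Here's a rough sketch of the same two constants, selected per architecture; the aarch64 numbers are my reading of the kernel's generic syscall table, so treat them as an assumption to verify against your own headers:

const builtin = @import("builtin");

// Same constants as above, but chosen at compile time per architecture.
// The x86_64 numbers match the ones used in this post; the aarch64 ones
// come from the kernel's generic unistd table (verify before trusting).
const SYS_mbind: usize = switch (builtin.cpu.arch) {
    .x86_64 => 237,
    .aarch64 => 235,
    else => @compileError("NUMA syscall numbers not defined for this architecture"),
};
const SYS_get_mempolicy: usize = switch (builtin.cpu.arch) {
    .x86_64 => 239,
    .aarch64 => 236,
    else => @compileError("NUMA syscall numbers not defined for this architecture"),
};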
Semantics That Matter
The mbind() signature has more going on than it looks like, so let's get acquainted:
fn mbind(
addr: *anyopaque,
len: usize,
mode: c_int, // MPOL_BIND, MPOL_PREFERRED, etc.
nodemask: *const c_ulong,
maxnode: c_ulong,
flags: c_uint // MPOL_MF_STRICT, MPOL_MF_MOVE, etc.
) c_int;

The nodemask encoding is a bitmask where bit N represents node N, passed as a pointer to an array of unsigned long. When binding to a single node, you'd think maxnode should be the node ID. It's not; it's the number of valid bits in the mask, which is node_id + 1:
const node: u32 = @intCast(self.numa_node);
var nodemask: c_ulong = @as(c_ulong, 1) << @intCast(node);
const maxnode: c_ulong = node + 1;

If you get maxnode wrong, the kernel ignores bits beyond that position. Set it too low and your binding silently becomes a no-op for higher-numbered nodes. On a multi-node system the penalty scales with distance: the more hops between the accessing CPU and the node that actually holds the page, the higher the latency. So beware.
The flags parameter is where the real behaviour differences live:
- MPOL_MF_STRICT means: if any page in the range can't be placed on the requested node, fail. This is what you want when binding matters for correctness.
- MPOL_MF_MOVE means you're actually migrating existing pages to the target node. Without this, mbind() only affects future page faults.
- MPOL_MF_MOVE_ALL is like MPOL_MF_MOVE but also moves pages shared with other processes. It requires CAP_SYS_NICE.
I use MPOL_MF_STRICT without MPOL_MF_MOVE because the allocator is called immediately after getting memory from the backing allocator, before any pages are faulted in. There's nothing to move yet: the pointer hasn't been returned, so nobody can have written to it. The binding sets policy for when those pages do get faulted in, which happens on first access.
If you're retrofitting NUMA awareness onto existing allocations, say a buffer that has already been touched and that you want to migrate, you'd need MPOL_MF_MOVE. But that's a different operation with different performance characteristics (page migration isn't free).
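To make the flag discussion concrete, here is roughly what a bindToNode() built on the raw syscall looks like. This is a sketch rather than Basalt's exact code: the MPOL_* values are the ones from <linux/mempolicy.h>, and depending on your Zig version, syscall6() may expect a std.os.linux.SYS enum value rather than a bare integer.

const std = @import("std");
const linux = std.os.linux;

// Values from <linux/mempolicy.h>.
const MPOL_BIND: usize = 2;
const MPOL_MF_STRICT: usize = 1 << 0;

fn bindToNode(self: *NumaAllocator, ptr: [*]u8, len: usize) !void {
    const node: u6 = @intCast(self.numa_node);
    const nodemask: c_ulong = @as(c_ulong, 1) << node;
    const maxnode: c_ulong = @as(c_ulong, node) + 1;

    // Strict bind, no migration: this only sets policy, which takes effect
    // when the pages are faulted in on first access.
    const rc = linux.syscall6(
        SYS_mbind,
        @intFromPtr(ptr),
        len,
        MPOL_BIND,
        @intFromPtr(&nodemask),
        maxnode,
        MPOL_MF_STRICT,
    );
    if (rc != 0) return error.NumaBindFailed;
}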
The Allocator Composition Pattern
Zig's allocator interface is built for wrapping. You implement a vtable with four functions:
pub fn allocator(self: *NumaAllocator) std.mem.Allocator {
return .{
.ptr = self,
.vtable = &.{
.alloc = alloc,
.resize = resize,
.free = free,
.remap = remap,
},
};
}

The ptr field carries your state; the vtable points to your implementations of those functions. Every allocation goes through this indirection, so the vtable pointer should be comptime-known when possible. In this case it is.
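For reference, the state behind that ptr field doesn't need to be much. Something like the following, where the exact field names are my guesses based on how they're used in the rest of this post (a backing allocator plus a target node, with a negative node meaning "don't bind"):

pub const NumaAllocator = struct {
    backing_allocator: std.mem.Allocator,
    numa_node: i32, // negative disables binding

    pub fn init(backing: std.mem.Allocator, numa_node: i32) NumaAllocator {
        return .{ .backing_allocator = backing, .numa_node = numa_node };
    }
};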
The implementation wraps a backing allocator:
fn alloc(
ctx: *anyopaque,
len: usize,
ptr_align: std.mem.Alignment,
ret_addr: usize
) ?[*]u8 {
const self: *NumaAllocator = @ptrCast(@alignCast(ctx));
const ptr = self.backing_allocator.rawAlloc(len, ptr_align, ret_addr) orelse return null;
if (self.numa_node >= 0) {
self.bindToNode(ptr, len) catch {};
}
return ptr;
}

The catch {} is doing more work than it looks like; let's talk about it.
Graceful Degradation is a Feature
When mbind() fails, we swallow the error and return the memory anyway. This looks like sloppy error handling, but it's intentional:
self.bindToNode(ptr, len) catch |err| {
// Binding failed - continue anyway
std.debug.assert(err == error.NumaBindFailed);
};

The scenarios where mbind() fails:
1. Non-NUMA system: single-socket machines or VMs without NUMA emulation return ENOSYS or similar.
2. The requested node doesn't exist on the system.
3. The node doesn't have enough free pages.
4. Missing privileges: some environments require permissions you don't have.
For cases 1 and 4, failing the allocation would make the code unusable on development machines. I'm not going to require a multi-socket server to run the test suite. The allocator should work everywhere, with NUMA binding as an optimization when available.
For cases 2 and 3, you might want different behaviour. If you explicitly asked for node 5 and it doesn't exist, maybe that's a configuration error that should surface. But in practice, the topology detection layer catches these before we get here. By the time we're allocating, the node ID has already been validated.
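That validation isn't shown here, but it doesn't take much. A hypothetical check in the spirit of the topology layer, treating the sysfs node directories as the source of truth:

// A node ID is only usable if sysfs exposes a directory for it.
fn nodeExists(node: u32) bool {
    var buf: [64]u8 = undefined;
    const path = std.fmt.bufPrint(&buf, "/sys/devices/system/node/node{d}", .{node}) catch return false;
    std.fs.accessAbsolute(path, .{}) catch return false;
    return true;
}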
The assert in the code is more of a personal style that documents my expectation: binding failures should only produce NumaBindFailed, not something else. If I see a different error in debug builds, that's a bug in my understanding.
Resize: Subtle Differences
Allocation is the straightforward path. Resize is a bit more involved:
fn resize(
ctx: *anyopaque,
buf: []u8,
buf_align: std.mem.Alignment,
new_len: usize,
ret_addr: usize
) bool {
const self: *NumaAllocator = @ptrCast(@alignCast(ctx));
if (!self.backing_allocator.rawResize(buf, buf_align, new_len, ret_addr)) {
return false;
}
if (new_len > buf.len and self.numa_node >= 0) {
self.bindToNode(buf.ptr, new_len) catch {};
}
return true;
}

When we grow the region, the backing allocator might extend it in place, or it might fail (requiring the caller to alloc-copy-free). If it extends in place, the new pages don't inherit the mbind policy of the original region. Each page gets its policy applied at fault time, based on whatever policy was set for that virtual address range.
So we rebind the entire new length. This is redundant for pages that were already bound, but mbind() on already-bound pages is essentially a no-op; the kernel checks the policy, sees it matches, and returns. The alternative would be tracking which byte ranges are new and only binding those, which adds complexity for no real gain.
The remap case is worse: the backing allocator might move the entire region to a new virtual address. When that happens, we absolutely need to rebind, because the new virtual addresses have no policy set at all.
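The remap wrapper isn't shown above, but it follows the same shape as resize. A sketch, assuming the backing allocator is driven through rawRemap() the same way rawResize() is used earlier:

fn remap(
    ctx: *anyopaque,
    buf: []u8,
    buf_align: std.mem.Alignment,
    new_len: usize,
    ret_addr: usize,
) ?[*]u8 {
    const self: *NumaAllocator = @ptrCast(@alignCast(ctx));
    const new_ptr = self.backing_allocator.rawRemap(buf, buf_align, new_len, ret_addr) orelse return null;
    // The region may now live at a fresh virtual address with no policy
    // attached, so rebind the whole range.
    if (self.numa_node >= 0) {
        self.bindToNode(new_ptr, new_len) catch {};
    }
    return new_ptr;
}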
Verifying Binding Actually Worked
The thing about mbind() is that it can return success even when your memory isn't where you think it is. The syscall sets policy, not placement. Actual placement happens at page fault time, and the kernel can still fall back if the preferred node is under memory pressure.
To verify placement, you need get_mempolicy with MPOL_F_ADDR:
pub fn queryNodeBinding(ptr: [*]u8) !i32 {
var mode: c_int = 0;
var nodemask: c_ulong = 0;
const result = linux.syscall5(
SYS_get_mempolicy,
@intFromPtr(&mode),
@intFromPtr(&nodemask),
64,
@intFromPtr(ptr),
MPOL_F_NODE | MPOL_F_ADDR,
);
if (result != 0) {
return error.GetMempolicyFailed;
}
return @intCast(mode);
}

The combination of MPOL_F_NODE | MPOL_F_ADDR says "tell me which node this specific address is on." With MPOL_F_ADDR alone, you get the policy governing that address, not the actual placement. Add MPOL_F_NODE and you get the node ID where the page physically resides, but only if the page is faulted in. Query an unfaulted page and you get the policy, not the placement.
This is how the test suite validates cross-node binding on multi-socket systems:
// Allocate on node 0, verify it's there
var alloc0 = NumaAllocator.init(backing, node0);
const mem0 = try alloc0.allocator().alloc(u8, 4096);
@memset(mem0, 0); // Force page fault
const bound0 = try NumaAllocator.queryNodeBinding(mem0.ptr);
try std.testing.expectEqual(node0, bound0);
// Allocate on node 1, verify it's there
var alloc1 = NumaAllocator.init(backing, node1);
const mem1 = try alloc1.allocator().alloc(u8, 4096);
@memset(mem1, 0);
const bound1 = try NumaAllocator.queryNodeBinding(mem1.ptr);
try std.testing.expectEqual(node1, bound1);
// Confirm they're actually different
try std.testing.expect(bound0 != bound1);

Without the @memset call, the pages aren't faulted in and the query returns policy, not placement.
Testing Without NUMA Hardware
Most development happens on laptops and single-socket workstations. How do you test NUMA code without NUMA hardware?
You have two options: either don't, or use QEMU's NUMA emulation. Spin up a VM with a fake NUMA topology:
qemu-system-x86_64 -smp 4 -m 4G \
-numa node,nodeid=0,cpus=0-1,mem=2G \
-numa node,nodeid=1,cpus=2-3,mem=2G

The emulated NUMA has no real latency differences (it's all the same RAM), but mbind() and get_mempolicy() work correctly. This is how CI validates the binding logic on commodity hardware.
What this Doesn't Handle
A few things I've punted on, for now at least:
Transparent Huge Pages (THP) interact weirdly with mbind(), as best I can figure out. The kernel might merge your 4KB pages into 2MB THPs, and the THP might span nodes differently than your policy intended. For the storage system, I disable THP for the relevant memory regions. If you need huge pages, you want explicit hugetlbfs allocations with upfront NUMA binding, not runtime mbind on anonymous memory.
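Disabling THP per region is a single madvise() call. A sketch in the same raw-syscall spirit as the rest of the allocator; MADV_NOHUGEPAGE is 15 in the kernel headers and 28 is the x86_64 madvise syscall number, both of which you should treat as assumptions to verify:

const SYS_madvise = 28; // x86_64
const MADV_NOHUGEPAGE: usize = 15;

// Ask the kernel not to back this range with transparent huge pages.
// The address must be page-aligned; the hint applies to whole pages.
fn disableThp(ptr: [*]u8, len: usize) !void {
    const rc = linux.syscall3(SYS_madvise, @intFromPtr(ptr), len, MADV_NOHUGEPAGE);
    if (rc != 0) return error.MadviseFailed;
}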
Interleaved allocation (MPOL_INTERLEAVE) stripes pages across nodes round-robin. This is useful for data structures accessed by threads on multiple nodes, where you want to average out the latency rather than optimize for any single accessor. The current implementation doesn't expose this, but it's a straightforward change.
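If I did expose it, it would amount to accepting a set of nodes instead of one and passing MPOL_INTERLEAVE (3 in the kernel headers) as the mode. A hypothetical sketch:

const MPOL_INTERLEAVE: usize = 3;

// Hypothetical variant of bindToNode: stripe the range across several nodes.
fn interleaveAcrossNodes(ptr: [*]u8, len: usize, nodes: []const u6) !void {
    var nodemask: c_ulong = 0;
    var max_node: c_ulong = 0;
    for (nodes) |n| {
        nodemask |= @as(c_ulong, 1) << n;
        max_node = @max(max_node, @as(c_ulong, n));
    }
    const maxnode = max_node + 1; // number of valid bits in the mask

    const rc = linux.syscall6(
        SYS_mbind,
        @intFromPtr(ptr),
        len,
        MPOL_INTERLEAVE,
        @intFromPtr(&nodemask),
        maxnode,
        0, // no flags: policy only, applied at fault time
    );
    if (rc != 0) return error.NumaBindFailed;
}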
NUMA binding is half the story. You also need to pin CPU threads to the same node as their memory, aka CPU affinity. The storage engine handles this at a higher level: worker thread pools are per-node, with each pool's allocator bound to the corresponding NUMA node. The allocator doesn't manage CPU affinity; it just manages memory placement.
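For completeness, pinning a worker thread to a node's CPUs is one more raw syscall. A sketch, not Basalt's code: the CPU list for a node would come from /sys/devices/system/node/nodeN/cpulist, and 203 is the x86_64 sched_setaffinity number.

const SYS_sched_setaffinity = 203; // x86_64

// Pin the calling thread to the given CPUs (e.g. the CPUs of one NUMA node).
fn pinToCpus(cpus: []const u32) !void {
    var mask = [_]u64{0} ** 16; // bitmask with room for 1024 CPUs
    for (cpus) |cpu| {
        mask[cpu / 64] |= @as(u64, 1) << @intCast(cpu % 64);
    }
    const rc = linux.syscall3(
        SYS_sched_setaffinity,
        0, // 0 selects the calling thread
        mask.len * @sizeOf(u64),
        @intFromPtr(&mask),
    );
    if (rc != 0) return error.AffinityFailed;
}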
In Closing...
The NUMA-aware allocator is a little over 200 lines of Zig, and one notable thing about it is that more lines are dedicated to assertions and edge cases than to the happy path. That's the right ratio, in my opinion, for something that runs on every allocation in a latency-sensitive system.