When Your Database Lives in CPU Cache (Because Why Not?)
Remember when we thought putting databases in RAM was revolutionary? Great performance, quaint times.
WILD (Within-cache Incredibly Lightweight Database) looks at your fancy in-memory database and says "hold my beer." Instead of optimizing for those sluggish 100-nanosecond RAM accesses like some kind of peasant, WILD stores the entire database in CPU L3 cache, achieving 3-nanosecond read times that make even Redis look like it's running on a potato.
This is either the future of database architecture or the result of someone spending way too much time reading CPU architecture references... Maybe both? I guess we'll find out! (Hint: don't use something like this. Why? It'll become clearer as you keep reading.)
BTW, here's the code if you want to follow along later.
Goals
Every good project has goals, so why shouldn't something silly like putting your entire database in L3 cache?
Sub-microsecond latencies
While normal databases are out here measuring operations in milliseconds like it's 1995, WILD operates in nanoseconds. That's faster than the time it takes for a photon to travel from your CPU to your monitor. We're talking about database operations that complete before your kernel even knows they started.
But why does this matter?
Truth is, it doesn't unless you're doing some very specific things. Consider high frequency trading systems, or if you want your IoT toaster to make decisions faster than you can think "I want toast."
Cache-resident storage
Rather than treating the CPU cache as an optimization layer, WILD makes it the primary storage medium. The entire database lives in L3 cache, with careful attention to cache line alignment and access patterns.
But why does this matter?
Modern CPUs have a lot of onboard memory in the form of caches... the one I'm typing on right now has 96MB of it. That's enough to hold (in theory) 1.57 million 64-byte data records. So let's put that real estate to work, instead of letting the OS decide to put random crap there as it deems necessary for "optimization."
NUMA optimization
WILD tries to understand your CPU's topology and place data structures in NUMA-local memory domains, because reaching across NUMA zones costs more time than I want to spend on a query.
That's why that matters.
Zero-copy operations
Traditional databases spend more time copying data around than an office worker with a broken printer. When a record, key and data together, fits exactly in one 64-byte cache line, you don't need to serialize, deserialize, or copy anything just to access the data.
If you know just how much time other databases spend on data marshaling, you'll understand why this is important if you want to go fast...if you can swing it (most can't).
What if...
This is the "hey, just build stuff" goal. Modern CPUs are walking around with dozens of MB of high-speed cache, sometimes over 100MB, and most software treats it like an implementation detail. WILD is here to demonstrate that your CPU's cache is basically a tiny, incredibly fast SSD that no one bothered to format.
Anyway, enough with the bad attempt at being humorous, let's just get into the details...
Understanding CPU Caches
To appreciate WILD's architecture, as silly as it is, we need to understand the CPU cache hierarchy and why it matters for performance.
Modern computers have multi-level memory, each level trading capacity for speed:
- CPU Registers: ~1 KB space, <1ns access time
- L1 Cache: 32KB to 64KB, 1ns access time (per core)
- L2 Cache: 256KB to 1MB, 3-10ns access time (per core)
- L3 Cache: 8MB to 144MB, 15-50ns access time (shared)
- Main memory: 8GB to 128GB with 100-300ns access time
- SSD: 1TB to 8TB with 10-100 microsecond access time
- HDD: 1TB to 24TB with 1-10 millisecond access time
Cache Line Fundamentals
CPUs don't fetch individual bytes; they fetch entire cache lines, typically 64 bytes long. This is the smallest unit of data transfer between main memory and CPU cache. When you access a single byte, the entire 64-byte cache line containing that byte is fetched.
CPUs fetch entire cache lines because programs typically access nearby memory locations (stack frames, contiguous chunks of memory, etc.), so pulling in the surrounding bytes usually pays off. The downside: when multiple cores modify the same cache line, it bounces between their L1 caches, possibly causing significant slowdowns; this is where the term thrashing comes into play.
Memory addresses are divided into:
- Tag: identifies which block of memory occupies the cache line
- Index: selects which cache set to look in (for set-associative caches)
- Offset: the byte position within the 64-byte line
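To make those three pieces concrete, here's a toy decomposition (illustrative only, not WILD code) for 64-byte lines and a configurable number of sets:

```zig
const std = @import("std");

const line_size: usize = 64;

fn offsetInLine(addr: usize) usize {
    return addr % line_size; // byte position within the 64-byte line
}

fn setIndex(addr: usize, num_sets: usize) usize {
    return (addr / line_size) % num_sets; // which cache set the line maps to
}

fn tag(addr: usize, num_sets: usize) usize {
    return addr / line_size / num_sets; // identifies the block within that set
}

test "split an address into tag, index, and offset" {
    const addr: usize = 0x7ffd_1234_5678;
    try std.testing.expectEqual(@as(usize, 0x38), offsetInLine(addr)); // 0x78 % 64 == 56
    _ = setIndex(addr, 8192);
    _ = tag(addr, 8192);
}
```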
This layout has real performance implications: for instance, when two or more threads access different variables that happen to live in the same cache line, they generate expensive coherency traffic even though they're not actually sharing data.
Arranging frequently accessed data within cache line boundaries improves performance; for example, placing a struct's hot fields in the first 64 bytes.
The key thing to remember is that CPUs don't fetch individual bytes; they always work in 64-byte chunks, which makes memory layout design critical for performance.
Cache Coherency and NUMA
In multi-core systems, each core has its own L1/L2 caches but shares L3 cache. This creates complex coherency protocols:
MESI protocol
Each cache line exists in one of four states:
- Modified: This core has the only valid copy and has modified it. Must write back to memory before evicting the data.
- Exclusive: This core has the only valid copy, but it's unmodified (matches memory), so it can transition to Modified without communicating over the bus.
- Shared: Multiple cores have valid, unmodified copies. All copies match memory.
- Invalid: Cache line is invalid or empty.
MESI is core to how CPU cores can safely cache shared data while maintaining the illusion of a single coherent memory system.
False Sharing
Even though CPU cores aren't sharing actual data, they're forced to share the cache line containing that data. When one core modifies its variable, the entire cache line gets invalidated on other cores, forcing expensive reloads.
So let's say you have some code:
```zig
const SharedData = struct {
    counter_a: u64, // Core 0 frequently modifies this
    counter_b: u64, // Core 1 frequently modifies this
};
var shared_data: SharedData = .{ .counter_a = 0, .counter_b = 0 };
```
When Core 0 writes to `counter_a`, the cache line bounces to Core 0's L1 cache. When Core 1 writes to `counter_b`, the cache line bounces to Core 1's L1 cache.
This ping-ponging can reduce performance by a significant margin.
You can detect false sharing by watching for high cache miss rates despite decent spatial locality, by noticing that adding more threads makes performance worse, or by checking CPU performance counters for excessive cache line transfers.
But there are some simple fixes if false sharing is hitting you: add padding between the two fields so that sizeof(counter_a) + padding == cache_line_size (i.e., 64), use thread-local storage, or switch to different data structures.
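A minimal sketch of the padding fix in Zig, reusing the counters from the example above (this version uses align(64) rather than a manual padding array, but the effect is the same):

```zig
const std = @import("std");

const PaddedCounters = struct {
    counter_a: u64 align(64), // Core 0's counter gets its own cache line
    counter_b: u64 align(64), // Core 1's counter starts on the next line
};

comptime {
    std.debug.assert(@sizeOf(PaddedCounters) == 128); // two full lines, no ping-pong
}
```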
The reason false sharing is so insidious is that it's mostly invisible to single-threaded code but destroys parallel performance.
Cache Line Bouncing
This one is pretty simple: if data sits in one core's L1 cache and another core accesses and modifies data in that cache line, the line has to transfer from the first core to the second. When many updates (think counters, as above) happen on different cores, the amount of this traffic goes up, and your performance goes down.
The main thing to keep in mind, as far as WILD is concerned, is that in its code I try to carefully design data structures to avoid these problems, and use NUMA-aware placement to avoid the penalty of crossing NUMA boundaries.
NUMA Awareness and Cache Optimization
NUMA stands for "Non-Uniform Memory Access." In modern multi-socket systems, each CPU socket has its own local memory, and accessing remote memory (attached to other sockets) incurs significant latency penalties:
- Local Memory Access: 100ns
- Remote Memory Access: 200-300ns (2-3x slower)
- Remote Cache Access: 150-200ns
WILD's NUMA Strategy
In WILD's source code you'll find `cache_topology.zig`, where we try to detect what the NUMA hierarchy looks like... On Linux you can read `/sys/devices/system/cpu` to understand physical vs. logical (hyperthreaded) cores, L3 cache domains and their CPU associations, cache sizes and shared-CPU lists, and NUMA node boundaries.
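For example, a minimal sketch of pulling a couple of those values out of sysfs might look like this (the paths and helper name are illustrative; WILD's real detection lives in `cache_topology.zig`):

```zig
const std = @import("std");

// Read a single sysfs value into a caller-provided buffer and trim the trailing newline.
fn readSysfsValue(path: []const u8, buf: []u8) ![]const u8 {
    const file = try std.fs.openFileAbsolute(path, .{});
    defer file.close();
    const n = try file.read(buf);
    return std.mem.trim(u8, buf[0..n], " \n");
}

pub fn main() !void {
    var size_buf: [32]u8 = undefined;
    var line_buf: [32]u8 = undefined;
    // index3 is usually the unified L3 cache; check the `level` and `type` files to be sure.
    const l3_size = try readSysfsValue("/sys/devices/system/cpu/cpu0/cache/index3/size", &size_buf);
    const line_size = try readSysfsValue("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", &line_buf);
    std.debug.print("L3 size: {s}, cache line: {s} bytes\n", .{ l3_size, line_size });
}
```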
So when I create new data structures, they're placed in memory local to the NUMA domain that will access them most frequently. This isn't very well tested because I don't actually have a multi-socket system here, but please file a bug if you try and it doesn't work!
You can use the CPUID instruction on x86_64 architectures (Intel/AMD) to detect the actual cache line size, and I do that because knowing it is critical for aligning data correctly. Even on systems with non-standard cache line sizes (16, 32, 128 bytes), WILD can adapt its record structure accordingly.
SMT (Hyperthreading) Awareness
WILD discriminates between actual cores and SMT siblings... Physical cores get preferred placement for shards, and SMT siblings are only used for additional parallelism when needed. Load balancing considers physical core topology to avoid resource contention. This prevents situations where hyperthreaded cores compete for the same execution units while other physical cores remain idle.
Architecture and Performance Enablers
Alright, let's not be too long-winded here; let's do a quick drive-by of how WILD is built and do some explaining (read the code if you want to understand completely)... Let's go!
Cache Line Aligned Records
The `CacheLineRecord` structure in `flat_hash_storage.zig` is carefully designed to be aligned on cache line boundaries:
```zig
pub const CacheLineRecord = extern struct {
    metadata: u32, // 4 bytes: valid flag + length + reserved
    key: u64,      // 8 bytes: hash key
    data: [52]u8,  // 52 bytes: data payload
};
```
Fields appear in the order they're checked: `metadata` is read first for validity, then `key` is compared to catch hash collisions, and `data` is only touched last, after a match on a read. This minimizes the number of memory accesses needed for typical operations, and if you do the math, u32 + u64 + [52]u8 == 64 bytes, exactly one cache line.
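As a sketch of that read path (assuming, hypothetically, that bit 0 of `metadata` is the valid flag; the real layout is in `flat_hash_storage.zig`):

```zig
// Check the cheap fields first; only a match gives you a reason to touch `data`.
fn matchesKey(record: *const CacheLineRecord, key: u64) bool {
    if ((record.metadata & 1) == 0) return false; // not a valid record
    return record.key == key;                     // hash collision check
}
```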
Flat Hash Table with Linear Probing
WILD uses a simple but highly optimized hash table design with a power-of-2 capacity, which enables a bitwise AND instead of a modulo when calculating the shard a record lands in (roughly 3 cycles vs. 30). When two hash keys collide, WILD simply checks the next cache line. This has excellent cache locality compared to chaining or cuckoo hashing.
There are no tombstones; deleted records simply leave an empty slot that can be immediately reused. This avoids the complexity and performance overhead of table reorganizations.
The Flat Hash uses Wyhash for string keys and a MurmurHash3 finalizer for avalanche properties. It's also a single hash table spanning every slot that can hold data (remembering the power-of-2 rule: capacity steps down to the largest power of 2 that fits in your L3 cache).
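Here's a rough sketch of the slot-selection and probing idea, using the `CacheLineRecord` from above (a hypothetical helper, not WILD's exact code, and it again assumes bit 0 of `metadata` is the valid flag):

```zig
// Capacity must be a power of 2 so that `& mask` replaces `% capacity`.
fn findSlot(records: []const CacheLineRecord, key: u64) ?usize {
    const mask = records.len - 1;
    var slot = @as(usize, @truncate(key)) & mask; // bitwise AND instead of modulo
    var probes: usize = 0;
    while (probes < records.len) : (probes += 1) {
        const r = &records[slot];
        if ((r.metadata & 1) == 0 or r.key == key) return slot; // empty slot or key match
        slot = (slot + 1) & mask; // collision: simply check the next cache line
    }
    return null; // every slot is occupied by other keys
}
```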
Static Memory Management
WILD isn't the first database to do something like this; the approach was made pretty famous by TigerBeetle... The idea is that you allocate all the memory your program will ever use at startup, after which no more allocations are possible. WILD uses a similar sort of static allocator, backed by an arena allocator which is itself backed by a page allocator. The exact reason that stack was selected is that it performed the best; YMMV.
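Here's a minimal sketch of the pattern (the record type and sizes are illustrative, not WILD's actual allocator code): everything is carved out of one arena at startup and never freed piecemeal.

```zig
const std = @import("std");

const Record = struct { key: u64, data: [56]u8 }; // illustrative 64-byte record

pub fn main() !void {
    // Arena allocator backed by the page allocator; released in one shot at shutdown.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    const slot_count: usize = 1 << 20; // power of 2, sized to fit the L3 budget
    const table = try arena.allocator().alloc(Record, slot_count);
    @memset(table, std.mem.zeroes(Record));

    // After this point the database only ever touches `table`; no further allocations.
    std.debug.print("reserved {d} slots ({d} MiB)\n", .{ table.len, (table.len * @sizeOf(Record)) / (1024 * 1024) });
}
```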
Batch Operation Optimization
Very little to do here because individual operations are already extremely fast, but WILD does try to eke out what little extra performance it can by reading data in batches. You don't really need complex batching logic in an architecture like this: all records are already cache-line optimized and they represent all storage in the system, so simply looping over the keys and reading the results out is good enough, and that's what WILD does.
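In code, the "batch" really is just a loop. A hypothetical sketch, reusing the `findSlot` helper from the hash table section above (and assuming `keys` and `out` have the same length):

```zig
fn readBatch(records: []const CacheLineRecord, keys: []const u64, out: [][52]u8) usize {
    var hits: usize = 0;
    for (keys, out) |key, *dst| {
        const slot = findSlot(records, key) orelse continue;
        const r = &records[slot];
        if ((r.metadata & 1) == 1 and r.key == key) { // findSlot may stop on an empty slot (a miss)
            dst.* = r.data;
            hits += 1;
        }
    }
    return hits;
}
```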
Zero system calls
WILD never makes system calls during normal operations. That means no:
- malloc/free calls (see, Static Memory Management above)
- File I/O (this isn't your father's durable database, it's a rather silly one, after all)
- Network operations
- Locks or synchronization primitives (we just don't need them)
This eliminates kernel transition overhead and makes latency completely predictable: it scales only with the size of the data.
What WILD is NOT
WILD is a demonstration, so it is:
- Not Durable (data only exists in CPU cache and is lost if it's evicted from the cache or the program terminates)
- Not ACID Compliant (no transactions, consistency guarantees, or crash recovery)
- Not Distributed (Single-node architecture, so no replication, sharding, consensus...)
- Not General Purpose (Optimized for very specific types of data storage cases)
- Capacity Constrained (Limited by L3 cache size)
- Platform Specific (Only runs on x86_64 Linux)
Despite the limitations, WILD's techniques have broad applicability for things like write-ahead logs in databases, buffer pool management, index structures, query processing optimization in query engines, memory-mapped storage, and I'm sure some other use cases.
Conclusion
WILD is a demonstration that extraordinary database performance is possible when we design around CPU cache hierarchies rather than treating them as an afterthought. By achieving 3-nanosecond read latencies and 615.5 million operations per second (Ryzen 9 7800X3D CPU, 96MB L3 cache, Linux 6.16.8), WILD challenges fundamental assumptions about the performance ceiling for data storage systems.
Remember, this is an experiment, but the techniques in the code are things all modern databases should have a handle on as they relate to their own storage systems.
Why leave performance on the table?