How to Lose Data on Purpose and Still Access It Later
Everyone tells you not to build your own distributed storage system. "Just use S3," they'll say. "Use Ceph. Use MinIO. Real engineers don't reinvent wheels."
They're wrong. Real engineers build whatever interests them, with expectations grounded in reality. So let's explore what I'm up to: building Basalt, a distributed storage system I'm designing to replace how I use Ceph in my home datacenter, aiming to make it simpler to operate and a better target for development. Along the way, I'll share every wall I hit, every "aha" moment, and every piece of context I had to gather to move forward.
If you're the kind of developer who learns by doing...even when you don't know how you'll reach the finish line...this is for you.
Just Copy Everything
My first attempt at durability was embarrassingly simple: Store every file on three different servers. If one dies, you've got two copies left. Simple. Reliable. Done.
And it works! For about five minutes, until your data grows, and you do the math.
Then the millennium happened, years more passed, and one day I found myself with...
10 TiB of data.
Three-way replication meant I needed 30 TiB of raw storage. My storage costs tripled. Here's the thing that really got me: I was storing video...large, cold files that people rarely accessed, but needed to be there when they did. I was paying 3x for durability on data that just... sat there.
So, how do the big players (Facebook, etc.) store over 400 petabytes? There's no way they accept this kind of overhead.
That's when I first heard the term erasure coding.
Math Instead of Copies
The pitch sounds too good to be true: Instead of copying your data three times, you use math to generate parity data. With the right architecture, you can lose multiple chunks of data and still reconstruct everything, but with way less storage overhead.
The specific configuration that we'll explore in this post is Reed-Solomon encoding with k=4 and m=2:
- Split your data into 4 chunks (this is `k`)
- Compute 2 parity chunks (this is `m`)
- Store all 6 chunks on different servers
- Lose any 2 and you can still recover everything
The overhead? Just 50% (6 chunks to store 4 chunks worth of data). Compare that to 200% for three-way replication. Same fault tolerance, fraction of the cost.
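Just to make the arithmetic concrete, here's a quick sanity check of those overhead numbers (nothing Basalt-specific, just the ratios):

```python
# Raw storage overhead: extra bytes stored per byte of user data.

def replication_overhead(copies: int) -> float:
    return copies - 1                    # 3 copies -> 200% overhead

def erasure_overhead(k: int, m: int) -> float:
    return (k + m) / k - 1               # k=4, m=2 -> 50% overhead

print(replication_overhead(3))           # 2.0 -> 30 TiB of raw storage for 10 TiB of data
print(erasure_overhead(4, 2))            # 0.5 -> 15 TiB of raw storage for 10 TiB of data
```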
Sounds good, eh? Well how the fsck does it work?
The First Wall: A Galois what?
Every tutorial I found started with something like "Reed-Solomon operates over a Galois Field (say 2^8)" and then immediately lost me in polynomial arithmetic and irreducible generators.
I spent a frustrating weekend reading papers that assumed I had a PhD in maths. Spoiler: I don't. So, I took a step back and asked a question: "What problem is this actually solving?"
When you're doing erasure coding, you're doing a lot of matrix math...multiplying, dividing, adding numbers together. Normal integer math has a fatal flaw: numbers grow. Multiply two bytes together and you might get a number that doesn't fit in a byte anymore. Divide two integers and you might get a fraction.
A Galois Field, specifically GF(2^8), is a number system where none of that happens.
It has exactly 256 values (0-255, fitting neatly into a single byte). When you add, subtract, multiply, or divide any two values, the result is always another value in that same set. No overflow/underflow/fractions.
The operations look a bit weird:
- Addition is XOR. Literally just `a ^ b`, which also means subtraction is XOR...addition and subtraction are the same operation.
- Multiplication is weird. It involves treating bytes as polynomials and doing modular arithmetic. But you can just pre-compute a 256x256 lookup table, or use log/antilog tables that reduce multiplication to two lookups and an addition.
The truth is that you don't need to know why the math works (though it's pretty elegant once you understand it, if you're into that). I just needed to understand what it gives me: a way to do matrix math on bytes without everything exploding.
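To make that concrete, here's a minimal sketch of GF(2^8) arithmetic in Python. The 0x11D primitive polynomial is a common choice for Reed-Solomon implementations (my assumption here, not something specific to any library), and real code would precompute log/antilog tables instead of looping on every multiply.

```python
def gf_add(a: int, b: int) -> int:
    # Addition (and subtraction) in GF(2^8) is just XOR.
    return a ^ b

def gf_mul(a: int, b: int) -> int:
    # Shift-and-XOR ("peasant") multiplication, reducing by the primitive
    # polynomial whenever the intermediate value would overflow 8 bits.
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D        # low 8 bits of the 0x11D polynomial
    return result

print(gf_add(0x53, 0xCA))    # 153 (0x99): no carries, no overflow
print(gf_mul(0x53, 0xCA))    # still fits in a byte, always
```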
Splitting Data
Once I accepted GF(2^8) as my number system, the work of actually encoding became pretty easy.
Picture a matrix, 6 rows, 4 columns. The top 4 rows are just an identity matrix (1s on the diagonal, 0s elsewhere). The bottom 2 rows contain carefully chosen values from our Galois Field. This is the encoding matrix.
To encode data:
- Take your data, split it into 4 equal chunks (padding if necessary)
- Treat each chunk as a vector of bytes
- Multiply the encoding matrix by your data vector
- The result is 6 chunks: Your original 4 data chunks unchanged, plus 2 parity chunks
The identity matrix rows are why your data chunks come out unchanged...multiplying by identity gives you back your input. The parity rows mix all 4 data chunks together in a specific reversible way.
If you know any 4 of the 6 values, and you know the matrix that produced them, you can solve for the original inputs, recovering the missing 2 values.
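Here's a rough sketch of that matrix-times-data step, with the gf_mul helper from the previous snippet repeated so it runs on its own. The two parity rows are the "all ones" and "powers of two" style rows you'll see in RAID-6 write-ups; treat them as illustrative...a production encoder builds the matrix more carefully (more on Cauchy vs Vandermonde below).

```python
K, M = 4, 2

def gf_mul(a: int, b: int) -> int:
    # Same GF(2^8) multiply as the earlier sketch (0x11D polynomial).
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return result

# 6x4 encoding matrix: identity on top (data passes through unchanged),
# two parity rows on the bottom (illustrative, not a verified Cauchy matrix).
ENCODING_MATRIX = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 2, 4, 8],
]

def encode(data_chunks: list[bytes]) -> list[bytearray]:
    """Multiply the encoding matrix by the 4 data chunks, one byte position at a time."""
    chunk_len = len(data_chunks[0])
    out = [bytearray(chunk_len) for _ in range(K + M)]
    for row in range(K + M):
        for col in range(K):
            coeff = ENCODING_MATRIX[row][col]
            if coeff == 0:
                continue
            for i in range(chunk_len):
                out[row][i] ^= gf_mul(coeff, data_chunks[col][i])  # "add" is XOR
    return out

chunks = encode([b"AAAA", b"BBBB", b"CCCC", b"DDDD"])
print([bytes(c) for c in chunks])   # first four are the inputs, last two are parity
```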
Great...I hit a wall, and I was able to figure my way over it.
Reconstruction Isn't Just Encoding Backwards
My encoder worked. I could split data into 6 chunks. But when I tried to simulate a failure...pretending that 2 chunks were gone...I realized I had no idea how to actually get them back.
My intuition was: I have 4 surviving chunks, and I just need to recover the 4 original data chunks, so it's really just solving a system of equations. As always, the details took some figuring out, but here's what worked:
- Identify which chunks survived. Say chunks 0, 1, 4, and 5 survived (two data, two parity).
- Pull out the corresponding rows from the encoding matrix. If chunk 0 survived, take row 0. If chunk 4 survived, take row 4. You end up with a 4x4 matrix.
- Invert that matrix using Gaussian elimination, but in GF(2^8) space. Standard linear algebra, just within our weird arithmetic.
- Multiply the inverted matrix by the surviving chunks. The result is the original 4 data chunks.
The encoding matrix is designed so that any 4 rows form an invertible 4x4 matrix. This property is called being maximum distance separable, and it's why Cauchy matrices are preferred over Vandermonde...Cauchy matrices guarantee this property; Vandermonde matrices sometimes don't.
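And here's roughly what reconstruction looks like, assuming the gf_mul helper and ENCODING_MATRIX from the encoding sketch above. The matrix inversion is a deliberately naive Gaussian elimination over GF(2^8)...fine for a 4x4 matrix in a blog post, not something you'd ship.

```python
def gf_inv(a: int) -> int:
    # Naive multiplicative inverse: brute-force the 255 nonzero values.
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

def gf_matrix_invert(m: list[list[int]]) -> list[list[int]]:
    """Gaussian elimination over GF(2^8): augment with identity, reduce, read off the inverse."""
    n = len(m)
    aug = [row[:] + [1 if i == j else 0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)   # find a usable pivot row
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(x, inv) for x in aug[col]]                # scale the pivot to 1
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                # Eliminate this column from every other row ("subtract" is XOR).
                aug[r] = [x ^ gf_mul(factor, y) for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def decode(surviving: dict[int, bytes], k: int = 4) -> list[bytearray]:
    """Recover the k data chunks from any k surviving chunks (keyed by chunk index)."""
    rows = sorted(surviving)[:k]                        # e.g. chunks 0, 1, 4, 5 survived
    inv = gf_matrix_invert([ENCODING_MATRIX[r] for r in rows])
    chunk_len = len(surviving[rows[0]])
    data = [bytearray(chunk_len) for _ in range(k)]
    for out_row in range(k):
        for j, r in enumerate(rows):
            coeff = inv[out_row][j]
            for i in range(chunk_len):
                data[out_row][i] ^= gf_mul(coeff, surviving[r][i])
    return data

# Pretend chunks 2 and 3 were lost:
survivors = {0: chunks[0], 1: chunks[1], 4: chunks[4], 5: chunks[5]}
print([bytes(d) for d in decode(survivors)])   # back to AAAA, BBBB, CCCC, DDDD
```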
When reconstruction finally worked on my system, I literally celebrated. Four chunks in, two artificial failures, four data chunks out...bit for bit identical to the originals.
One Server is Easy, Six is a Nightmare
I had encoding and decoding working on a single machine. Now came the actual distributed part.
My naive plan: hash the object ID, use that to pick 6 servers, store one chunk on each. What could go wrong?
In case it's not obvious, everything. Everything could go wrong.
Problem 1: Server changes scramble everything
Add a seventh server and suddenly the hash function might send chunks to completely different locations. Remove a server and it's chaos. The tight coupling between the hash result and server identity means any topology change is bad.
Problem 2: You can't track millions of objects individually
Where does object abc123 live? You'd need a massive lookup table. What happens when chunk 3 of that object moves during recovery? Now your lookup table needs distributed transactions.
Problem 3: Failure domains matter
If your hash function puts chunk 0 and chunk 1 on the same physical rack, and that rack loses power, you've lost 2 chunks. For some objects, that's game over; you can only tolerate 2 failures.
More and more, it's looking like I need an indirection layer to solve these problems.
Placement Groups: A Helpful Abstraction
The solution came from reading about how Ceph handles this, and it's elegantly simple: Don't map objects directly to servers. Instead:
- Objects map to Placement Groups via consistent hashing
- PGs map to servers via a deterministic algorithm
So instead of tracking millions of objects, you track thousands of placement groups. Each PG is responsible for a slice of the hash space and contains many objects.
For my k=4, m=2 setup, each PG maps to exactly 6 servers...one for each chunk. When an object comes in, I:
- Hash the object ID to get a PG number
- Look up which 6 servers own that PG
- Encode the object into 6 chunks
- Send chunk N to server N in that PG's server list
When a server dies, I don't think about individual objects, I think about which PGs had a chunk on that server. Those PGs are now "degraded"...they're missing a chunk but still functional (because we can tolerate up to 2 missing chunks).
The mapping from PGs to servers uses a deterministic algorithm...given the same inputs (cluster topology, PG ID), everyone computes the same result. No central lookup service is needed. Clients can figure out where to send data themselves.
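Here's a minimal sketch of that two-level mapping. The PG-to-server step uses rendezvous (highest-random-weight) hashing as a stand-in for whichever deterministic algorithm you prefer...the point is just that any client with the same server list computes the same answer, with no lookup service.

```python
import hashlib

NUM_PGS = 256          # far fewer PGs than objects; tune for your cluster size
CHUNKS = 6             # k=4 data + m=2 parity

def pg_for_object(object_id: str) -> int:
    # Objects map to placement groups by hashing the object ID.
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PGS

def servers_for_pg(pg: int, servers: list[str], count: int = CHUNKS) -> list[str]:
    # Rendezvous hashing: score every (pg, server) pair, take the top `count`.
    # Deterministic, so every client computes the same placement.
    def score(server: str) -> int:
        return int.from_bytes(
            hashlib.sha256(f"{pg}:{server}".encode()).digest()[:8], "big"
        )
    return sorted(servers, key=score, reverse=True)[:count]

servers = [f"node{i}" for i in range(1, 7)]
pg = pg_for_object("abc123")
placement = servers_for_pg(pg, servers)
print(pg, placement)   # chunk N of the object goes to placement[N]
```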
Failure Domains
Remember how I said putting 2 chunks on the same rack is dangerous? The placement algorithm needs to be smarter than random. My cluster has 6 servers across 3 racks (2 servers per rack). If I randomly assign chunks, there's a decent chance two chunks land in the same rack. If that rack loses power, two chunks disappear, and some objects are at risk of being lost.
When computing which servers own a PG, the algorithm ensures chunks land in different failure domains. With k=4, m=2, I need 6 failure domains minimum...but that means I actually need 6 racks if I want to survive any rack failure.
For my small setup, I only have 3 racks, so I settled on host-level failure domains (each server is its own domain). Not as robust as rack-level, but it means a single server dying never takes out more than one chunk.
The formula I learned: You need at least k + m failure domains for erasure coding, plus ideally some extra headroom for maintenance and recovery.
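Here's a sketch of how failure-domain awareness can bolt onto the placement step: group servers by domain, then never pick two servers from the same domain. The names and scoring are my own illustration, not Basalt's actual algorithm.

```python
import hashlib
from collections import defaultdict

def pick_across_domains(pg: int, servers: dict[str, str], count: int = 6) -> list[str]:
    """servers maps server name -> failure domain (host, rack, ...).
    Pick `count` servers, at most one per failure domain."""
    def score(server: str) -> int:
        return int.from_bytes(hashlib.sha256(f"{pg}:{server}".encode()).digest()[:8], "big")

    by_domain: dict[str, list[str]] = defaultdict(list)
    for server, domain in servers.items():
        by_domain[domain].append(server)

    # Best-scoring server per domain, then the best `count` domains overall.
    candidates = [max(members, key=score) for members in by_domain.values()]
    if len(candidates) < count:
        raise ValueError("need at least k + m failure domains")
    return sorted(candidates, key=score, reverse=True)[:count]

# Host-level domains: each server is its own domain, so 6 servers is enough.
cluster = {f"node{i}": f"node{i}" for i in range(1, 7)}
print(pick_across_domains(pg=42, servers=cluster))
```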
The Write Path
With placement figured out, I next needed to actually write data durably. The question becomes: When do I tell the client that a write succeeded?
Attempt 1: Ack after the primary receives data
This is fast, but dangerous. If the primary dies before forwarding chunks, then what happens? Your data is gone, but the client thinks the write succeeded.
Attempt 2: Ack after all 6 servers confirm
Safe, but slow. One slow server and every write waits.
What I settled on was to ack after k+1 (5) servers confirm, but require all 6 for a clean status.
This gives me good speed (one slow server doesn't block) while maintaining safety (5 confirmations means even if 2 fail, 3 have the data). The PG stays in a degraded state until all 6 confirm, triggering background repair.
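In pseudo-async Python, that policy looks something like this. The send_chunk helper is a hypothetical stand-in for the real RPC, and the 5-of-6 threshold is hard-coded for k=4, m=2.

```python
import asyncio

K, M = 4, 2
ACK_THRESHOLD = K + 1               # 5 of 6 before telling the client "ok"

async def send_chunk(server: str, chunk: bytes) -> str:
    # Hypothetical network call; the sleep stands in for a real chunk write.
    await asyncio.sleep(0.01)
    return server

async def write_object(servers: list[str], chunks: list[bytes]) -> str:
    tasks = [asyncio.create_task(send_chunk(s, c)) for s, c in zip(servers, chunks)]
    acked = 0
    for finished in asyncio.as_completed(tasks):
        await finished
        acked += 1
        if acked == ACK_THRESHOLD:
            # Safe to acknowledge the client now; the PG stays "degraded" until
            # the stragglers confirm (real code keeps awaiting them in the background).
            break
    return "acked" if acked >= ACK_THRESHOLD else "failed"

print(asyncio.run(write_object([f"node{i}" for i in range(6)], [b"x"] * 6)))
```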
But there's a subtle problem here that I didn't figure out until much later...
Partial Writes Are Evil
Imagine that a write is in progress: The primary has encoded the data and is sending chunks to all 6 servers. Chunks 0, 1, and 4 write successfully. Then the primary crashes.
What's the state of the world? Three servers have the new data. Three have the old data. And here's the nightmare: you can't reconstruct either version.
With k=4, I need any 4 chunks to be consistent. But I have 3 new and 3 old...neither set is big enough. The data isn't just lost, it's corrupt.
This is what's known as the "RAID write hole", and it's a fundamental problem with erasure coding. Replication doesn't have this issue, but with EC, chunks are interdependent.
The solution to this problem is well known though, and this isn't my first rodeo. You basically have two flavors of the same idea... journaling or write-ahead logging:
- Before writing chunks, log your intent somewhere durable
- Write all chunks
- Mark the write complete
- If you crash mid-write, replay the journal/WAL: Either complete the write or roll it back
I integrated this with my Raft consensus layer such that the intent to write gets committed to the Raft log before any chunks are written. If the primary fails, the new leader can see the in-progress write and either complete or abort it.
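Conceptually (and heavily simplified...no Raft here, just a local intent log as a stand-in), the sequence looks like this:

```python
class IntentLog:
    """Stand-in for the durable intent record; in Basalt this rides in the Raft log."""
    def __init__(self) -> None:
        self.entries: dict[str, str] = {}       # object_id -> "pending" | "complete"

    def record_intent(self, object_id: str) -> None:
        self.entries[object_id] = "pending"

    def mark_complete(self, object_id: str) -> None:
        self.entries[object_id] = "complete"

    def pending(self) -> list[str]:
        return [o for o, s in self.entries.items() if s == "pending"]

def store_chunk(server: str, object_id: str, chunk: bytes) -> None:
    pass  # hypothetical: actually ships the chunk to `server`

def write(log: IntentLog, object_id: str, placement: list[str], chunks: list[bytes]) -> None:
    log.record_intent(object_id)                 # 1. durable intent BEFORE any chunk is written
    for server, chunk in zip(placement, chunks):
        store_chunk(server, object_id, chunk)    # 2. write all six chunks
    log.mark_complete(object_id)                 # 3. only now does the write "count"

def recover(log: IntentLog) -> None:
    # 4. after a crash, anything still pending is re-driven to completion or rolled
    #    back; it's never left half-written with no record that it happened.
    for object_id in log.pending():
        print(f"replaying in-flight write for {object_id}")
```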
A State Machine for Survival
A placement group isn't just "working" or "broken." It moves through states that describe its health and what's happening to it:
- Creating: PG is being set up, not ready for data
- Peering: Servers are talking, figuring out who has what
- Active: PG can serve reads and writes
- Clean: Active AND all 6 chunks are where they should be
- Degraded: Active but missing some chunks (1 or 2 servers down)
- Recovering: Actively rebuilding missing chunks
- Incomplete: Not enough chunks survive to serve data (cross your fingers and toes you never hit this state)
The lifecycle is essentially: Creating -> Peering -> Active+Clean -> (server dies) Active+Degraded -> (recovery starts) Active+Recovering -> (recovery completes) Active+Clean.
Degraded doesn't mean broken. A PG with 4 or 5 chunks can still serve reads and writes. It's at higher risk, another failure could push it to incomplete, but it's functional.
Monitoring these states becomes my primary operational tool. A healthy cluster is one where most PGs are Active+Clean, with occasional brief trips through Degraded during maintenance or failure.
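Sketched as code, the states and the transitions described above look something like this. The exact names and the set of legal transitions are my own shorthand, not a spec.

```python
from enum import Enum, auto

class PGState(Enum):
    CREATING = auto()
    PEERING = auto()
    ACTIVE_CLEAN = auto()
    ACTIVE_DEGRADED = auto()
    ACTIVE_RECOVERING = auto()
    INCOMPLETE = auto()

# Transitions from the lifecycle above; anything outside this table is a bug worth alerting on.
TRANSITIONS = {
    PGState.CREATING:          {PGState.PEERING},
    PGState.PEERING:           {PGState.ACTIVE_CLEAN, PGState.ACTIVE_DEGRADED},
    PGState.ACTIVE_CLEAN:      {PGState.ACTIVE_DEGRADED},                        # a server died
    PGState.ACTIVE_DEGRADED:   {PGState.ACTIVE_RECOVERING, PGState.INCOMPLETE},  # repair or another failure
    PGState.ACTIVE_RECOVERING: {PGState.ACTIVE_CLEAN, PGState.INCOMPLETE},
    PGState.INCOMPLETE:        set(),                                            # cross your fingers
}

def transition(current: PGState, new: PGState) -> PGState:
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```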
Rebuilding After Failure
All of a sudden, server 3 dies. Several placement groups had their chunk N stored there. Now what?
Well, first you need to detect that server 3 died. Heartbeats help here, of course. If a server misses a bunch of them in a row, it's presumed dead. But recovery doesn't start right away, because...
Wait and see...
Maybe the server just rebooted. Maybe there was a network blip. If I immediately start copying terabytes of data, and the server comes back 30 seconds later, I just wasted a ton of bandwidth and I/O.
When you look around the industry, you'll see different wait times, but most tend to be in the 10-15 minute range before declaring a failure permanent and starting a recovery operation. During this window, placement groups are degraded, but functional.
So your server has been dead for 12 minutes, and you decide that's enough. Now, you pick a replacement. The missing chunks will need to be placed on a new server. The algorithm considers: Failure domain (can't pick a server in the same domain as existing chunks), available capacity (pick servers with more free space), and current load (don't overload servers already doing lots of recovery).
Next, we reconstruct the missing chunk. This is where erasure coding's tradeoffs show up. To reconstruct one missing chunk, I need to read k chunks (in my case, 4). That's 4x bandwidth amplification; rebuilding 1 GiB of missing data requires transferring 4 GiB across the network.
For comparison, replication just copies 1 GiB to a new location. But replication also stored 3x the data to begin with, so it had 3 GiB sitting around.
Once the reconstructed chunk is written to the replacement server and confirmed, the placement group transitions from recovering back to clean. Life goes on.
Serving Data While Limping
A client wants to read an object. Chunk 2 is on a dead server. What do we do? We have some options:
- Option 1: Fail the read. This is safe, but terrible for users trying to access that data
- Option 2: Reconstruct on the fly. Read chunks 0, 1, 3, 4 (or any 4 available), decode to get chunk 2's contents, return the data.
Option 2 is the no-brainer here for me, but it comes with some caveats:
- Latency increases. Instead of reading 1 chunk from 1 server, you're reading 4 chunks from 4 servers, waiting for all of them, then doing CPU-intensive decoding
- Bandwidth amplifies. Transferring 4x the data for every read
- Failure risk compounds. If another server is just slow (not dead), you end up waiting on the slowest of 4 servers
A trick we can use is to speculatively read from all 6 servers, take the first 4 responses, cancel the rest. If all servers are healthy, you get the fastest 4. If one is dead, you barely notice. If two are dead, you still succeed (but barely).
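Here's a sketch of that speculative-read trick with asyncio: fire all six fetches, keep the first four that come back, cancel the stragglers. fetch_chunk is a hypothetical stand-in for the real RPC.

```python
import asyncio
import random

K = 4

async def fetch_chunk(server: str, object_id: str) -> tuple[str, bytes]:
    # Hypothetical RPC; the random sleep stands in for variable server latency.
    await asyncio.sleep(random.uniform(0.01, 0.1))
    return server, b"chunk-bytes"

async def speculative_read(servers: list[str], object_id: str) -> list[tuple[str, bytes]]:
    tasks = [asyncio.create_task(fetch_chunk(s, object_id)) for s in servers]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
        if len(results) == K:                # fastest four win
            break
    for t in tasks:
        t.cancel()                           # stop wasting the slow servers' time
    return results                           # feed these into decode()

fastest = asyncio.run(speculative_read([f"node{i}" for i in range(6)], "abc123"))
print([server for server, _ in fastest])
```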
Degraded performance is acceptable for rare situations. If your cluster is frequently degraded, you have a bigger problem you should investigate...like not enough servers or too many failures.
Bandwidth Amplification (Just to be clear)
I've touched on this before, but I want to lay the maths out once so I don't keep repeating myself.
With k=4, m=2, every chunk I need to rebuild will require reading 4 chunks. If a server with 1TiB of data dies, I need to read 4 TiB across the network to rebuild its lost chunks.
In Facebook's F4 paper, they note about 200 TiB per day of recovery traffic...around 10-20% of their total network bandwidth...just for repair operations. Compare that with replication, where rebuilding that dead 1 TiB server would only mean copying 1 TiB of data.
So erasure coding saves storage at the cost of recovery bandwidth. For most use cases, this is a good tradeoff; storage is a constant cost, recovery is (hopefully) rare. But it's worth understanding this if your set of circumstances is different for some very real reason.
Some alternatives to erasure coding do exist which are more efficient. While I was working at Microsoft, I learned about Local Reconstruction Codes that add extra "local" parity to reduce reconstruction bandwidth usage. For my project, the added complexity wasn't worth it...k=4, m=2 is plenty efficient for my scale.
What Breaks at 3 AM?
After running Ceph (which Basalt closely resembles) for a while, I've learned some hard lessons.
Scrub your data
Disks aren't always benevolent... sometimes it's best to think of them as malevolent... They will sometimes return bad data without reporting any error. Cosmic rays flip bits, latent sector errors are a thing, etc. The only way to know your data is intact is to periodically read it all and verify the checksums still match. This is called "scrubbing": a background process that reads every chunk and verifies its integrity.
Without scrubbing, errors quietly accumulate in your data...maybe a lot of them. Still small compared to the whole dataset, and not spread evenly across the cluster. Ignore it long enough and eventually you'll be screwed.
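A scrub pass is conceptually tiny...read each chunk, hash it, compare against the checksum recorded at write time. The paths and the checksum store here are made up for illustration.

```python
import hashlib
from pathlib import Path

def scrub(chunk_dir: Path, expected: dict[str, str]) -> list[str]:
    """Read every chunk under chunk_dir, recompute its SHA-256, return the ones that don't match."""
    corrupted = []
    for chunk_path in sorted(chunk_dir.glob("*.chunk")):
        digest = hashlib.sha256(chunk_path.read_bytes()).hexdigest()
        if digest != expected.get(chunk_path.name):
            corrupted.append(chunk_path.name)   # flag for repair via reconstruction
    return corrupted

# `expected` would come from the metadata written alongside each chunk at write time.
```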
Test your recovery path
It's not enough to just have recovery code available to run; you need to test it regularly. This isn't a new technique, but I have my own "chaos"-type system where I'll shut down a node for 20 minutes, then bring it back up, to verify the system recovers correctly.
Monitor Placement Group states religiously
The number of degraded PGs is the health indicator for your cluster. Zero is good. One or two during maintenance is fine. Ten is concerning, and so on.
I have alerts that fire if 1% of placement groups are degraded for more than an hour. That's usually enough time for transient issues to resolve, but short enough to catch real problems.
Capacity planning matters more than you think
When a server dies, recovery needs somewhere to put the rebuilt chunks. If all your servers are 95% full, there's nowhere to put the data. Recovery stalls, placement groups stay degraded, and with one more failure, you might be in deep trouble.
I usually aim to keep 20% of my available capacity free, and I keep extra cold spare disks available if my used space approaches that limit.
Is It Worth It?
"Just use S3" would have been easier, probably cheaper, counting my time, and definitely lower risk.
But I learn more by building than by just using someone else's code. I now understand what REALLY happens when I put an object into storage. I know most of the ways...if not all the ways...it can go wrong, and what that looks like. I can debug latency issues because I have physical access to all the gear and knowledge of the code paths, and I can make informed decisions about consistency vs availability because I've implemented both.
If you're building a storage system because you need a storage system... use an existing one. If you're building it because you want to understand distributed systems at a deep level... there's no better teacher than writing the code yourself.
The glass isn't easy to eat, but the view from the other side of the abyss is worth it.