How modern databases keep their cool
So I've been thinking a lot about database internals over the past few weeks. Like, how do these things actually work under the hood? I mean, we're all writing code that hammers these poor databases with concurrent requests, and somehow they don't explode into a million conflicting pieces. Pretty wild, eh? Let's figure out why.
What problem are we solving?
Okay, so think about this: You've got a database with users hitting it from all directions. Some are reading data, others are writing data--often to the same records! Without some kind of smart concurrency system, we'd have complete chaos...eventually.
Traditionally, we'd probably just throw locks at the problem. But locks are... well, they're not great for hot code; if someone is reading a record and someone else wants to write to it, the writer has to wait. If someone's writing and others want to read...EVERYONE WAITS!
What if we could just... not do that?
What if readers never had to wait for writers, and writers never blocked readers? Sounds great, eh?
So there are a few ways to solve this, but today let's focus on multiversion concurrency control (MVCC).
MVCC allows us to basically say: "Hey, why don't we just keep multiple versions of each piece of data around? Then readers can see the version that was valid when they started reading, and writers can create new versions without messing with anyone's view of the world!"
Let's think about MVCC first
Instead of having one authoritative value for each key in our database, we essentially have a timeline of values. Each time a transaction changes a value, it doesn't overwrite the old one--it adds a new version with a timestamp.
When you start a transaction, you get a "read timestamp" that essentially defines which slice of reality you can see. You can only see versions of data that were committed up to your timestamp. It's like each transaction gets its own time machine, viewing the database at a specific moment in time.
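To make that concrete (with made-up values), say the key "color" has three committed versions:

ts=3  -> "red"
ts=7  -> "blue"
ts=12 -> "green"

A transaction that began with read timestamp 9 sees "blue"--the newest version at or before 9--and it keeps seeing "blue" even if the ts=12 version commits while it's still running.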
If you've never thought about storing data like this, it's probably a little mind-bending, but it's kind of neat.
Let's build it!
I don't know about you, but I learn best by doing. So let's do.
This won't win any awards, or be particularly useful beyond demonstration, but that's ok; we'll build a toy MVCC store in Go, and do it bit by bit.
First, we need some basic types to represent our versioned data:
type Timestamp int64
type VersionedValue struct {
Value string
Timestamp Timestamp
}
type KeyHistory struct {
Versions []VersionedValue
}
So each value has a timestamp attached to it (it can be anything so long as it's monotonically increasing), and each key has a history of different versions. So far, not complicated.
Our actual store needs to manage this versioned data, plus keep track of the current time:
type Store struct {
mu sync.RWMutex
data map[string]*KeyHistory
clock Timestamp
}
And now we need some notion of a transaction, of course, which will track its own read timestamp and buffer its writes until it's ready to commit:
type Transaction struct {
store *Store
readTS Timestamp
writeBuffer map[string]string
committed bool
}
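One bit of plumbing before we go on: the snippets coming up lean on a few helpers--NewStore(), store.Begin(), and a nextTimestamp() method--so here's a minimal sketch of them now. The important details are that Begin snapshots the current clock as the transaction's read timestamp, and nextTimestamp assumes the store's lock is already held (which it will be, in Commit):

func NewStore() *Store {
	return &Store{data: make(map[string]*KeyHistory)}
}

// nextTimestamp ticks the logical clock. Callers must already hold
// s.mu, which Commit does.
func (s *Store) nextTimestamp() Timestamp {
	s.clock++
	return s.clock
}

// Begin snapshots the current clock as the read timestamp, so the
// transaction sees everything committed so far and nothing after.
func (s *Store) Begin() *Transaction {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return &Transaction{
		store:       s,
		readTS:      s.clock,
		writeBuffer: make(map[string]string),
	}
}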
Alright, now if a transaction wants to read a key, it should see the most recent version of that key that was committed before the transaction started.
func (s *Store) Read(key string, ts Timestamp) (string, bool) {
s.mu.RLock()
defer s.mu.RUnlock()
history, exists := s.data[key]
if !exists {
return "", false
}
// Walk backwards through versions to find the most recent one
// visible to this transaction
for i := len(history.Versions) - 1; i >= 0; i-- {
candidate := history.Versions[i]
if candidate.Timestamp <= ts {
return candidate.Value, true
}
}
return "", false
}
This illustrates the core of MVCC. We scan backward through the versions until we find the one that was created at or before our read timestamp--that's the version we can see. We start from the end because we don't want the oldest version every time; we want the newest version we're allowed to see.
When a transaction wants to read, it first checks its own write buffer (in case it has modified this key itself) and then falls back to the store:
func (tx *Transaction) Read(key string) (string, bool) {
if val, ok := tx.writeBuffer[key]; ok {
return val, true
}
return tx.store.Read(key, tx.readTS)
}
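This also gives us read-your-own-writes for free: a transaction sees its own uncommitted changes, while everyone else keeps seeing the committed snapshot. A quick (made-up) example:

tx := store.Begin()
tx.Write("color", "teal")
fmt.Println(tx.Read("color")) // "teal true" -- served from the write buffer
// Other transactions still see only the last committed version.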
Writing is super easy during the transaction--we just put the value into our local buffer:
func (tx *Transaction) Write(key, value string) {
tx.writeBuffer[key] = value
}
But the commit process? That's a little more interesting. We need to check for conflicts and then actually apply our changes if everything looks good:
func (tx *Transaction) Commit() error {
if tx.committed {
return fmt.Errorf("transaction already committed")
}
tx.store.mu.Lock()
defer tx.store.mu.Unlock()
// Conflict detection
for key := range tx.writeBuffer {
history, exists := tx.store.data[key]
if !exists {
continue
}
latest := history.Versions[len(history.Versions)-1]
if latest.Timestamp > tx.readTS {
return fmt.Errorf("write-write conflict on key '%s'", key)
}
}
commitTS := tx.store.nextTimestamp()
for key, val := range tx.writeBuffer {
history, exists := tx.store.data[key]
if !exists {
history = &KeyHistory{}
tx.store.data[key] = history
}
history.Versions = append(history.Versions, VersionedValue{
Value: val,
Timestamp: commitTS,
})
}
tx.committed = true
return nil
}
Wait, what's happening in that conflict detection section?
Basically, we're saying: "If any key I'm trying to write has been modified by another transaction after I started (i.e., its latest version has a timestamp greater than my read timestamp), then I need to abort because I might be working with outdated information."
This is actually implementing a specific isolation level called snapshot isolation. It prevents most conflicts but allows some weird anomalies that stricter isolation levels would prevent--the classic one is write skew, where two transactions each read overlapping data but write to different keys, producing a combined result no serial ordering could. But that's a whole other deep dive!
After checking for conflicts, we assign a commit timestamp to our transaction and then create a new version for all the keys we modified, all with the same commit timestamp. This keeps our transaction's changes atomic--they all appear at the same logical time.
Seeing it in action
Let's write a quick main() function so we can test this out:
func main() {
store := NewStore()
tx1 := store.Begin()
tx1.Write("foo", "bar")
err := tx1.Commit()
if err != nil {
fmt.Println("tx commit error:", err)
} else {
fmt.Println("tx1 committed")
}
tx2 := store.Begin()
val, ok := tx2.Read("foo")
if ok {
fmt.Println("tx2 reads foo: %s\n", val)
} else {
fmt.Println("tx2 reads foo: not found")
}
// An exercise for you: Write some code that triggers
// a conflict, to see it in action.
}
When we run this, it behaves just like you'd expect a transactional database to behave. Each transaction sees a consistent snapshot of the database as of when it started. Transactions that would create conflicts get rejected. It's pretty cool to see this working in ~100 lines of Go.
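And remember that snapshot-isolation caveat from the commit section? Here's the classic anomaly it allows--write skew--played out against our store. The "on-call doctors" scenario and the key names here are made up for illustration:

func demoWriteSkew(store *Store) {
	// Invariant we'd like to keep: at least one doctor stays on call.
	setup := store.Begin()
	setup.Write("alice", "oncall")
	setup.Write("bob", "oncall")
	setup.Commit()

	tx1 := store.Begin()
	tx2 := store.Begin()

	// tx1 reasons: "bob is still on call, so alice can leave"
	if v, _ := tx1.Read("bob"); v == "oncall" {
		tx1.Write("alice", "offcall")
	}
	// tx2 reasons: "alice is still on call, so bob can leave"
	if v, _ := tx2.Read("alice"); v == "oncall" {
		tx2.Write("bob", "offcall")
	}

	// The write sets don't overlap, so conflict detection passes for
	// both--and now nobody is on call, an outcome no serial execution
	// of tx1 and tx2 could produce.
	fmt.Println(tx1.Commit(), tx2.Commit()) // <nil> <nil>
}

Serializable isolation would reject one of these commits; snapshot isolation happily accepts both.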
There's more (that's missing)...
Now, I have to admit that this implementation is super basic compared to what real databases do. Here are some things I skipped:
- Garbage collection: We're just keeping all versions forever, which would obviously explode in a real system. PostgreSQL has its VACUUM process, and other DBs have similar mechanisms to clean up old versions that no transaction can see anymore (see the sketch after this list).
- Efficient storage: Storing complete copies of values for each version is wasteful. Real systems are much smarter about how they store versions.
- Range queries: I only implemented point lookups (get a single key). Range queries (give me all keys between X and Y) are way more complex with versioned data.
- Indexes: Adding indexes to an MVCC system is another whole challenge that I ignored.
- Serializable isolation: My implementation only gives snapshot isolation, but sometimes you need stricter guarantees.
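To give you a taste of the first one, here's a naive garbage-collection pass for our toy store. It assumes we track the oldest read timestamp still in use (say, the minimum readTS across live transactions)--which our store doesn't actually do--so treat it as a sketch:

// GC prunes versions that no current or future transaction can see.
// oldestActiveTS is assumed to be the minimum readTS among live
// transactions (tracking that is left out of our toy store).
func (s *Store) GC(oldestActiveTS Timestamp) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, history := range s.data {
		// Keep the newest version at or before oldestActiveTS (the
		// oldest reader still needs it); everything before that index
		// is invisible to everyone, forever.
		keep := 0
		for i, v := range history.Versions {
			if v.Timestamp <= oldestActiveTS {
				keep = i
			}
		}
		history.Versions = history.Versions[keep:]
	}
}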
How Real Databases™ do it
I think it's fascinating how different databases implement MVCC in their own ways:
- PostgreSQL keeps multiple physical rows in its tables for different versions of the same logical row. When you UPDATE a row, it doesn't actually update anything--it inserts a new version and marks the old one as expired.
- MySQL InnoDB uses an "undo log" approach. The actual table stores the latest version, but old versions can be reconstructed from the undo log entries when needed.
- CockroachDB, being a distributed database, has to solve all kinds of extra problems, like clock skew between nodes. It uses a hybrid logical clock that combines wall-clock time with logical counters.
Final Thoughts
I've got to say, implementing even this toy version of MVCC for this post made the concept a lot more approachable. There's something about actually coding up an idea that makes all the abstract concepts suddenly become concrete.
MVCC is one of those elegant ideas that sounds almost too good to be true. "Just keep multiple versions of everything, and all your concurrency problems go away!" Of course, the devil is in the details, and real implementations have to solve all sorts of thorny problems that I glossed over.
But the core of the idea is powerful--by maintaining a history of values rather than a single current value, we can give each transaction its own consistent view of the database without excessive locking. That's pretty darn cool.
If you want to play with this code yourself, it's all in this repo. I also took a stab at serializable isolation, and if you're interested, it's in this branch.
Let me know what you think!