Deep dive: TrueTime
Out of the gate, what is TrueTime?
It's the clock API at the heart of Google Spanner, giving it the ability to do something no distributed database had done before...globally consistent transactions without a centralized timestamp authority.
Not only was it a breakthrough for its time, it's almost offensively elegant. If you can't know the exact time, just be honest about how wrong you might be, and then wait it out.
Why is time such a mess?
Here's a fundamental problem with time:
You have two machines. Machine A writes some data. Machine B reads it a moment later. Did B see A's write? Well, that depends on what "a moment later" means exactly. Usually, it means machines A and B need to agree, down to the millisecond, on what time it is.
Oftentimes, systems don't agree.
Every computer has a quartz oscillator that keeps time, and it drifts... a few hundred microseconds per second on a good day. NTP sorta solves this by synchronizing to reference servers, but on a local network you're looking at ~10ms accuracy, and across the public internet it's more like 100+ ms. The worst part about NTP, for the purposes of distributed databases and other distributed systems, is that it gives you no formal uncertainty bound. Whatever time NTP tells you it is, the honest reading is "it's around that time." And when a sync does happen, your clock can jump forward or backward, which can have real consequences.
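To make that concrete, here's a toy Go sketch (the skews and delays are made-up numbers, not anything measured) showing how a few milliseconds of disagreement can make a read that really happened after a write look like it came first:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical skews: machine A's clock runs 3ms fast, machine B's runs 2ms slow.
	skewA := 3 * time.Millisecond
	skewB := -2 * time.Millisecond

	// In real time, A writes first and B reads 1ms later.
	realWrite := time.Now()
	realRead := realWrite.Add(1 * time.Millisecond)

	// What each machine actually records using its own (skewed) clock.
	stampA := realWrite.Add(skewA)
	stampB := realRead.Add(skewB)

	// The read really happened after the write, but the timestamps say otherwise.
	fmt.Println("read timestamp > write timestamp?", stampB.After(stampA)) // prints: false
}
```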
The road to Spanner
Spanner didn't materialize out of thin air. From the stories I've heard and read, Google spent about a decade running face-first into the limitations of its own systems.
Bigtable gave Google a scalable distributed key-value store, but it had no cross-row transactions and only eventually consistent replication. Applications that needed strong consistency had to work around these problems themselves.
The Spanner paper notes that this was a source of "frequent complaints", which, if you've worked at a company where engineers have to hand-roll consistency guarantees on top of a store that doesn't provide it, is perhaps the most obvious statement of the decade.
Megastore came next, adding semi-relational data modeling and synchronous replication on top of Bigtable. Over 300 Google applications used it, including Gmail and the Android Market. But its write throughput wasn't great...we're talking single-digit writes per second per Paxos group, because it forced all writes through conflict resolution with no long-lived leaders. Read performance was fine. Write performance was a pain.
Percolator added cross-shard ACID transactions on top of Bigtable using two-phase commit, but it relied on a centralized timestamp oracle that had to be contacted twice per transaction. A single point of coordination for every write, globally...can you see where this is going?
So around 2007, Google started building what would become Spanner. It began as a Bigtable-like versioned key-value store and gradually evolved into a temporal multi-versioned database. The paper describes a "slow realization" that they needed to stop treating database features as afterthoughts. The first real production customer was F1, a rewrite of Google's advertising backend that had been running on manually sharded MySQL. The last resharding of that MySQL cluster had consumed over two years of engineering time. Two years, just to reshard.
F1 went live on Spanner in early 2011 with five replicas across the US, and by early 2012 the entire AdWords database was on it. At the time, that was Google's money machine running on this database.
TrueTime: be honest about uncertainty
Ok, so here's where it gets clever... TrueTime's API has 3 methods:
- TT.now(): Returns an interval [earliest, latest]
- TT.after(t): Returns true if t has definitely passed
- TT.before(t): Returns true if t has definitely not arrived yet
That's it, the whole API. But notice what TT.now() does differently from every other clock API you've ever used: it doesn't give you a single timestamp, it gives you a range. It says "I don't know exactly what time it is, but I guarantee the real time is somewhere in this window."
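To make the shape of that contract concrete, here's a rough Go sketch of what the interface could look like. The method names follow the paper; the `Interval` and `Clock` types are my own stand-ins, not anything you can actually call:

```go
package truetime

import "time"

// Interval is a window guaranteed to contain the true current time.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock mirrors the three methods described in the Spanner paper.
type Clock interface {
	// Now returns [Earliest, Latest], a bound on the true time.
	Now() Interval
	// After reports whether t has definitely passed.
	After(t time.Time) bool
	// Before reports whether t has definitely not arrived yet.
	Before(t time.Time) bool
}

// After and Before fall out of Now:
//   After(t)  == Now().Earliest.After(t)  // even the earliest possible time is past t
//   Before(t) == Now().Latest.Before(t)   // even the latest possible time is before t
```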
The width of that window, called epsilon, is typically between 1 and 7 milliseconds, averaging about 4ms. That's not a guess, that's bounded by physics.
GPS receivers and atomic clocks
Each Google datacenter has a fleet of time master machines. At the time of the paper, the majority were likely GPS masters: machines with GPS receivers and dedicated antennas, which advertise uncertainty near zero (GPS is accurate to nanoseconds). The minority were what Google calls Armageddon masters: machines equipped with atomic clocks (these may be more common today). The atomic clocks are the last resort if GPS is down...which has happened a couple of times.
The daemon and Marzullo's algorithm
Every machine in a Spanner deployment runs a timeslave daemon that polls a variety of time masters (from nearby and distant datacenters) every 30 seconds. The daemon uses a variant of Marzullo's algorithm to detect liars: it collects confidence intervals from multiple sources, computes their intersection, and throws out any source whose interval doesn't overlap with the majority.
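As a rough illustration (and only that, it's not the real daemon), the liar-detection step might look something like this in Go, with `Interval`, `rejectOutliers`, and `intersect` as made-up names:

```go
package truetime

import "time"

// Interval is one time master's claimed bound on the true time.
type Interval struct {
	Earliest, Latest time.Time
}

// overlaps reports whether two intervals share at least one instant.
func overlaps(a, b Interval) bool {
	return !a.Latest.Before(b.Earliest) && !b.Latest.Before(a.Earliest)
}

// rejectOutliers keeps only the sources whose intervals overlap a majority of
// the others, i.e. it throws out apparent liars.
func rejectOutliers(sources []Interval) []Interval {
	var kept []Interval
	for _, s := range sources {
		agree := 0
		for _, other := range sources {
			if overlaps(s, other) {
				agree++
			}
		}
		if agree > len(sources)/2 { // a source always agrees with itself
			kept = append(kept, s)
		}
	}
	return kept
}

// intersect narrows the surviving intervals down to their common window,
// which becomes the daemon's new [earliest, latest].
func intersect(sources []Interval) Interval {
	out := sources[0]
	for _, s := range sources[1:] {
		if s.Earliest.After(out.Earliest) {
			out.Earliest = s.Earliest
		}
		if s.Latest.Before(out.Latest) {
			out.Latest = s.Latest
		}
	}
	return out
}
```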
Between synchronization events, local clock drift creates a sawtooth pattern in epsilon. At an assumed 200 microseconds per second of drift over a 30-second poll interval, you accumulate up to 6ms of uncertainty, plus about 1ms for communication delay. So epsilon swings from roughly 1ms (right after a sync) to about 7ms (right before the next one). By 2023, Google reported TrueTime provides less than 1 millisecond of uncertainty at the 99th percentile. The hardware and the ops around it have gotten that much better over the years.
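That sawtooth is just arithmetic. A minimal sketch, assuming the paper's rough figures (a ~1ms floor and a 200µs/s worst-case drift rate, both assumptions rather than measurements):

```go
package truetime

import "time"

// epsilonAt sketches the sawtooth: the uncertainty right after a sync plus the
// worst-case local drift accumulated since then.
func epsilonAt(sinceSync time.Duration) time.Duration {
	const base = 1 * time.Millisecond             // communication delay + master uncertainty
	const driftPerSecond = 200 * time.Microsecond // assumed worst-case local drift rate
	drift := time.Duration(sinceSync.Seconds() * float64(driftPerSecond))
	return base + drift // ~1ms just after a sync, ~7ms at the end of a 30s poll interval
}
```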
Another neat thing from the paper: machines whose clocks drift beyond the worst-case bound get evicted from the cluster entirely. And it notes that bad CPUs are six times more likely than bad clocks...the clocks were the more reliable part.
Commit-wait: How it integrates
Here's the elegant part...Spanner uses TrueTime to guarantee external consistency...meaning if transaction T1 finishes before T2 starts in real time, then T1's timestamp is less than T2's timestamp. Every observer, everywhere, sees the same order, and that order matches wall-clock reality.
There are two rules:
Start rule: When a transaction is ready to commit, the coordinator assigns a commit timestamp s that is at least TT.now().latest. Since the real time is guaranteed to be less than or equal to latest, this means s is at or above the actual current time.
Commit-wait rule: The coordinator doesn't make the committed data visible until TT.after(s) returns true...meaning TT.now().earliest > s. At that point, the commit timestamp is definitely in the past. No machine anywhere in the system could possibly see a now that's before s.
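Putting the two rules together, a commit path might look roughly like this in Go. `TrueTime`, `commit`, and `replicate` are illustrative stand-ins for the clock sketch above and the Paxos round, not Spanner's actual code:

```go
package spanner

import "time"

// TrueTime is the clock shape from the earlier sketch: Now bounds the true
// time, After reports whether t has definitely passed.
type TrueTime interface {
	Now() (earliest, latest time.Time)
	After(t time.Time) bool
}

// commit illustrates the two timestamp rules.
func commit(tt TrueTime, replicate func() error) (time.Time, error) {
	// Start rule: pick a commit timestamp s no earlier than TT.now().latest,
	// so s is at or above the true current time.
	_, latest := tt.Now()
	s := latest

	// Paxos replication happens here; the commit-wait clock is already
	// running, so the two costs overlap.
	if err := replicate(); err != nil {
		return time.Time{}, err
	}

	// Commit-wait rule: don't make the data visible until s is definitely in
	// the past, i.e. until TT.after(s) holds.
	for !tt.After(s) {
		time.Sleep(100 * time.Microsecond) // polling is just for the sketch
	}
	return s, nil
}
```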
The proof of external consistency falls out almost trivially:
- Commit-wait guarantees T1's timestamp is in the past when T1 finishes
- By assumption, T1 finishes before T2 starts (in real time)
- The start rule guarantees T2's timestamp is at or above the current time when T2 starts
- Therefore, s1 < s2. Always. Globally.
The expected wait is about 2 * epsilon, so maybe around 10ms when the paper was written, but probably closer to 2ms today. The saving grace is that this wait overlaps with Paxos replication latency: you're already waiting for a majority of replicas to acknowledge the write. Google's own blog notes that "the vast majority of the time, Spanner's commit-wait step involves no waiting" because the Paxos round takes longer than the uncertainty interval.
Said another way, you pay for the strongest consistency guarantee in distributed systems, and the cost is usually zero additional latency.
It's not all rainbows and sunshine...
You can't really build TrueTime to perform as well as it does inside Google's datacenters. You most certainly don't have their private network, their GPS and atomic-clock time masters, or the Marzullo-based machinery and ops discipline wrapped around them.
The commit-wait latency is real, but modest, and overlaps with Paxos latency most of the time. For write-heavy workloads it matters, though. For workloads like Google's analytics business where 99.9% of transactions are reads, it basically doesn't matter.
In closing...
TrueTime's innovation was recognizing that clock uncertainty, when explicitly bounded, becomes a tool instead of a problem. Every other system tries to minimize uncertainty and hope for the best. TrueTime quantifies it, exposes it as a first-class API primitive, and then uses it as the foundation for the strongest consistency guarantee in distributed systems.
The approach works because physics constrains the uncertainty to single-digit milliseconds, small enough that the commit-wait cost is tolerable, especially when overlapped with consensus latency. Time has only made it better: as of 2023, reports put the uncertainty below a millisecond at the 99th percentile, with Spanner serving 6 billion queries per second across 17+ exabytes of data.
That's pretty darn cool.