
Notes on Concurrency and Distributed Systems

A network of computers is no different from a network of humans with respect to correctness, atomicity, consistency, availability, and partition tolerance. For computers, this is true even within a single machine, between cores, L* caches, RAM, and disk.

To reason about whether some approach is correct/atomic/consistent/…, I’ve found it can be helpful to translate the problem into the domain of people before the invention of the telegraph. The content and format of the data are mostly irrelevant, except for information designed to deal with being distributed (vector clocks, e.g.). Clocks are not synchronized, not even within some tolerance; all times are local only. All actors are geographically remote from each other, and the only way to communicate is by physical letters, sent by pony express. Some of the things that can happen:

- A letter can be lost in transit.
- A letter can be delayed arbitrarily long.
- Letters can be delivered out of order.
- A letter can be delivered more than once.
- A letter can arrive mangled or partly illegible.
- The recipient may never reply, and you can’t tell why.

Protocols like TCP and devices like ECC RAM are designed to handle some of these issues, so we don’t normally worry about receiving a mangled message. Duplicates are harder to rule out, though: some messaging systems may deliver the same message more than once.
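The earlier parenthetical mentions vector clocks as information designed specifically to cope with being distributed. A minimal dict-based sketch (the API here is assumed, not from these notes): each actor keeps a counter per actor, and every letter carries the sender’s current clock.

```python
def vc_increment(clock, actor):
    """Tick the local counter before a local event or a send."""
    clock = dict(clock)
    clock[actor] = clock.get(actor, 0) + 1
    return clock

def vc_merge(local, received, actor):
    """On receiving a letter: take the element-wise max, then tick locally."""
    merged = {a: max(local.get(a, 0), received.get(a, 0))
              for a in set(local) | set(received)}
    return vc_increment(merged, actor)

def vc_happened_before(a, b):
    """True iff the event stamped a causally precedes the event stamped b."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

# Alice writes a letter; Bob receives it and then writes his own.
alice = vc_increment({}, "alice")       # {"alice": 1}
bob = vc_merge({}, alice, "bob")        # {"alice": 1, "bob": 1}
print(vc_happened_before(alice, bob))   # True: Alice's event precedes Bob's
```

The point is that causality is carried in the letters themselves, so no synchronized wall clocks are needed; if neither clock happened-before the other, the events are concurrent.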

Let’s say you send a command to Alice: “Do X and return an ack/nack.” If you don’t get a reply, you have no idea whether X was actually done: the command may have been lost on the way to her, or her reply may have been lost on the way back. After Alice sends a reply, she likewise has no idea whether you received it. And even if you do get an ack, X might have been undone by someone else after Alice sent the ack but before you received it.
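One standard way out of the first ambiguity is to make “Do X” idempotent and retry until acked. A hedged sketch (all names here are illustrative, not from these notes): Alice keeps a dedup table keyed by command id, so resending the same letter is always safe.

```python
class Alice:
    """The receiver: performs X at most once per command id."""
    def __init__(self):
        self.done = {}                     # command id -> result (dedup table)

    def handle(self, cmd_id):
        if cmd_id not in self.done:
            self.done[cmd_id] = "X done"   # perform X exactly once
        return ("ack", cmd_id)             # re-acking a duplicate is harmless

def send_with_retry(alice, cmd_id, lose_reply_first_time=True):
    """On timeout we cannot tell whether the command or the ack was lost,
    so we resend the *same* command id until an ack arrives."""
    attempts = 0
    while True:
        attempts += 1
        reply = alice.handle(cmd_id)
        if lose_reply_first_time and attempts == 1:
            continue                       # simulate the ack lost in the mail
        return reply, attempts

alice = Alice()
reply, attempts = send_with_retry(alice, "cmd-1")
print(reply, attempts, len(alice.done))    # ('ack', 'cmd-1') 2 1 — X ran once
```

Note that the retry turns at-most-once delivery into at-least-once delivery, and the dedup table turns at-least-once into effectively-once; neither side ever needs to know which letter was lost.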

Because of the long delivery time, it’s much more obvious that any information can be outdated by the time it is received. That’s also true of memory and database accesses in a computer; we just don’t think about it most of the time. Protocols like Paxos still work in this human problem domain, because they were designed with exactly these kinds of limitations in mind.
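To make that concrete, here is a hedged, in-memory sketch of single-decree Paxos (this is an assumed simplification, not production code). Everything is plain request/response, and a refused letter looks the same as a lost one to the proposer, which is exactly the pony-express condition.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted = None      # (number, value) accepted so far, if any

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return None               # refusal and a lost letter look identical

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return None

def propose(acceptors, n, value):
    """Run one proposal round against whichever acceptors reply."""
    majority = len(acceptors) // 2 + 1
    promises = [r for a in acceptors if (r := a.prepare(n))]
    if len(promises) < majority:
        return None
    # Must adopt the highest-numbered value already accepted, if any,
    # so a later round cannot overwrite an earlier chosen value.
    prior = [acc for _, acc in promises if acc]
    if prior:
        value = max(prior)[1]
    acks = [a.accept(n, value) for a in acceptors]
    if sum(r == "accepted" for r in acks) >= majority:
        return value              # chosen by a majority
    return None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="X"))   # prints X
print(propose(acceptors, n=2, value="Y"))   # still prints X: round 2 adopts it
```

The second round illustrates the point about outdated information: a proposer who arrives late with value “Y” discovers during the prepare phase that “X” was already accepted, and is forced to adopt it rather than overwrite it.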