
Dan Donahue

Musician. Traveler. Programmer.


There's been a funny shift since the invention of the computer. As processing speeds increased, we developers convinced business folks that they could have system consistency at the millisecond level. It took a while, but we convinced them that they could expect this. The problem is that in doing so, we've made our jobs harder and our code more complex as we've struggled to deliver on the guarantee we made. We've now come full circle: business experts think in terms of absolute, up-to-the-millisecond consistency, and we have to convince them to loosen that constraint. Good work, developers!

The irony in all this is that industries had figured out how to do their business without 100% consistency long before the computer. In fact, it was essential.

Consider making a withdrawal at your bank. You go to the teller window (or an ATM, but I want to stress that this was solved before computers) and request a withdrawal. When you request money from your account, you expect that money in your hand before you leave the bank. Banks know this. They cannot wait until every outstanding transaction you've made has come in before they accept or deny your request. Imagine you mailed your rent check three days ago, and once it posts it will put you in the red. The bank teller won't tell you "OK, we'll give you your money... in a week, after we're sure that any outstanding transactions have processed. And by the way, we're going to lock your account until then to ensure that it's in a completely up-to-date state before we give you the money you asked for." That wouldn't make you happy and it wouldn't be good business for the bank.

Instead, banks have accepted that your balance can never be trusted as fully up to date, or consistent, and have built a policy around the possibility that you may overdraw your account. In fact, they've turned it into a money-making opportunity for themselves. They'll let you overdraw your account and they'll charge you a fee for doing so. Everyone wins. You get your money as you expect AND it's good for the bank when you overdraw.
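To make that policy concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Account` class, the `withdraw` and `post_check` method names, and the fee amount are made up, not how any real bank works. The thing to notice is that nothing locks, waits, or rejects. A late-arriving check just posts, and the overdraft rule handles the consequences.

```python
from dataclasses import dataclass, field
from decimal import Decimal

# Hypothetical fee, chosen only for illustration.
OVERDRAFT_FEE = Decimal("35.00")


@dataclass
class Account:
    balance: Decimal = Decimal("0")
    fees_charged: list = field(default_factory=list)

    def _post(self, signed_amount: Decimal) -> None:
        """Post a transaction immediately; never wait for other transactions."""
        was_negative = self.balance < 0
        self.balance += signed_amount
        # Assumed policy: charge the fee once, when a posting pushes the
        # account from non-negative into the red.
        if self.balance < 0 and not was_negative:
            self.balance -= OVERDRAFT_FEE
            self.fees_charged.append(OVERDRAFT_FEE)

    def withdraw(self, amount: Decimal) -> None:
        """The customer at the teller window gets their money right now."""
        self._post(-amount)

    def post_check(self, amount: Decimal) -> None:
        """A check mailed days ago finally arrives and is posted."""
        self._post(-amount)
```

With a starting balance of 500, a withdrawal of 200 followed by a rent check of 400 leaves the account overdrawn and records one fee, and at no point did the teller have to say "come back in a week."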

The point here is that time can't be trusted. Just because you paid a bill yesterday does not mean that transaction made it into your account before the withdrawal you're trying to make right now. That's an important concept to remember when working in a distributed system.

The usual response to that is "so why build a distributed system? Build a monolithic system and keep everything consistent." But relaxing the "always consistent" state of your system can actually make your code much simpler. It places more emphasis on the business rules. As usual, the technology is easy. Understanding the ins and outs of the domain is hard. Developers, for reasons I can't rationalize, have flipped this on its head and are now reaping what they've sown.

This is also a more realistic picture of the domain. Remember - THIS IS HOW THE DOMAIN WORKED BEFORE THE COMPUTER. Trying to bend the rules of the domain to take "advantage" (and I put that in quotes very intentionally) of absolute consistency both changes the domain rules AND makes the code more complex. Not good.

Taking an example from the computer age, when you're looking at a screen of data, it's already stale. Literally a millisecond after you started reading this blog post, I could have gone in and fixed a typo, and what you're looking at would no longer be the absolute latest and most correct data. That's just reality. It only gets more important when it's data you have to make decisions on, like whether to buy or sell a stock, or whether to approve or reject a transaction. The point is to embrace this staleness, not fight it. And by the way, this is the case whether you have a huge "always consistent" system or a distributed system.

The point here is that race conditions don't exist in business. Something should not pass or fail because events occurred out of the expected order or because they happened a millisecond sooner or later than expected. If we are planning to pay a bill with funds from a check we deposited and that check hasn't cleared yet, or doesn't clear at all, don't we still owe the counterparty money for the bill? What do we do in this case? I doubt the counterparty will be happy if we just tell them "error - invalid state". You have to delve in and understand the domain policy around this situation. Banking has been around long enough that there are probably well-established rules for this. Consider that bills were being paid and deposits were being made long before we had to worry about the order of instructions being processed by a CPU. A horse was probably taking cash from one institution to another. There's a lot of inconsistent state and lag there. I'm sure they figured out how to proceed.
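Here is a rough Python sketch of what "no race conditions" can look like in code. The event names (`bill_due`, `check_deposited`, `check_bounced`) and the settlement policy are hypothetical, invented for this example rather than taken from real banking rules. The idea is that every event is simply recorded, and the outcome is decided by policy over everything we know, so the order of arrival cannot produce an "invalid state".

```python
from dataclasses import dataclass
from decimal import Decimal
from itertools import permutations


@dataclass(frozen=True)
class Event:
    kind: str        # "bill_due", "check_deposited", or "check_bounced"
    amount: Decimal


def settle(events):
    """Fold events, in whatever order they arrived, into (funds, still_owed)."""
    funds = Decimal("0")
    owed = Decimal("0")
    for event in events:
        if event.kind == "bill_due":
            owed += event.amount
        elif event.kind == "check_deposited":
            funds += event.amount
        elif event.kind == "check_bounced":
            funds -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    # Assumed policy: pay what the funds allow; the rest is still owed.
    paid = min(owed, max(funds, Decimal("0")))
    return funds - paid, owed - paid


def outcomes(events):
    """Collect the distinct results over every possible arrival order."""
    return {settle(ordering) for ordering in permutations(events)}
```

Calling `outcomes` on a bill, a deposit, and a bounce of the same check gives a set with a single result: the bill is still owed, no matter which event showed up first. A real system would have a much richer policy, but the shape is the same: the domain rule decides, not the timing.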

In any case, if you want to hear this from someone more intelligent than me, check out this blog post by Udi Dahan: Race Conditions Don't Exist