A Blog about Business, Enterprise Architecture and Software Design by the Founders of ZenModeler

Everything Fails All The Time: Cure Or Prevent Errors In Your Design?

In computing, « everything fails all the time », as Werner Vogels, Amazon’s CTO, likes to say; or, to put it more bluntly: « shit happens ». Instead of preventing errors at any cost, it can sometimes be more worthwhile to cure them, that is, to know what to do when an error occurs. And not only from a technical point of view, but first and foremost from a business point of view.

This gives me the opportunity to look at these « errors » and to show that, instead of avoiding them, it is sometimes better to think globally about the role they play in the system.

To illustrate this view, I’d like to describe the inner workings of Starbucks coffee shops, which are known not for the quality of their coffee but for the sheer number of orders they can process every hour (otherwise known as throughput). It is a good opportunity to illustrate a number of important principles of distributed computing with an example from everyday life.

How does an order at Starbucks get processed?

Here is a user story and a typical scenario:

  • User Story:
    • As a client
    • I want to place an order
    • So as to enjoy my coffee as quickly as possible
  • Scenario:

What I find most interesting is the asynchrony introduced when the cashier passes the order to the barista even though the financial transaction isn’t finished yet. But in that case, what happens if an error occurs during the process?

  • In the case of non-payment (not enough cash, credit card blocked, etc.): the coffee is simply thrown away or given to another client who happens to have placed the same order.
  • In the case of a messed-up order (the cashier misunderstood my order, the barista misunderstood, etc.): the order is simply redone.
  • Etc.
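The handoff described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the timings and the rule "an unpaid drink is discarded" are hypothetical and simply mirror the non-payment case above; they are not a description of how Starbucks actually operates.

```python
import asyncio


async def barista(orders, ready):
    """Make drinks as orders arrive, without waiting for payment."""
    while True:
        order_id, drink = await orders.get()
        await asyncio.sleep(0.02)  # hypothetical preparation time
        ready[order_id] = drink
        orders.task_done()


async def take_payment(funds_ok):
    """Simulate the payment step; funds_ok says whether it succeeds."""
    await asyncio.sleep(0.01)  # hypothetical payment time
    return funds_ok


async def serve(customers):
    """customers is a list of (drink, funds_ok) pairs (hypothetical input)."""
    orders = asyncio.Queue()
    ready = {}
    paid = {}
    worker = asyncio.create_task(barista(orders, ready))

    for order_id, (drink, funds_ok) in enumerate(customers):
        # Hand the order to the barista first, then take the payment:
        # the drink is being made while the client is still paying.
        orders.put_nowait((order_id, drink))
        paid[order_id] = await take_payment(funds_ok)

    await orders.join()  # wait until every drink has been made
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)

    served = [drink for i, drink in ready.items() if paid[i]]
    discarded = [drink for i, drink in ready.items() if not paid[i]]
    return served, discarded


if __name__ == "__main__":
    print(asyncio.run(serve([("latte", True), ("mocha", False)])))
```

The unpaid mocha is still prepared and then discarded: that is the price accepted in exchange for never making the barista wait for a payment.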

Possible mechanisms for handling an error

  • « Keep moving on »: when the consequence of the error is negligible (a wasted coffee, for example)
  • « Try again »: useful especially when one of the links in the chain is momentarily unavailable. Example from everyday life: the cashier repeats the order several times until the barista acknowledges it.
  • « Compensate »: what does my bank do when it has wrongly debited my account for a service? It credits my account with the corresponding amount to compensate for the mistake.
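As a rough sketch, the three mechanisms above could look like this in Python. Every name here (`TransientError`, `retry`, `Account`, `compensate`, `keep_moving_on`) is hypothetical and chosen for illustration only; a real system would add logging, back-off and persistence.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for a failure that may succeed on retry."""


def keep_moving_on(wasted, drink):
    """« Keep moving on »: note the negligible loss and carry on."""
    wasted.append(drink)


def retry(action, attempts=3, delay=0.0):
    """« Try again »: call action until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


class Account:
    """Minimal bank account used to illustrate compensation."""

    def __init__(self, balance):
        self.balance = balance

    def debit(self, amount):
        self.balance -= amount

    def credit(self, amount):
        self.balance += amount


def compensate(account, wrongly_debited):
    """« Compensate »: undo a wrongful debit with an equal credit."""
    account.credit(wrongly_debited)
```

Note that compensation does not erase the original debit; it adds a second operation that cancels its effect, which is exactly what the bank does in the example above.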

Benefits

The asynchrony introduced after the order is placed increases the utilization of each participant. The barista does not need to wait for the payment to finish before starting to make the coffee. I’ve even noticed, during rush hour at Starbucks, an intermediary taking orders in advance, playing the role of dispatcher between the two processing lines: payment and preparation.

The consequence of an error in either processing chain is negligible compared with the huge gain in the throughput of the system. The important point is that, instead of focusing on avoiding an « error » at all costs, errors are considered part of the system, with follow-up actions planned in case they arise.
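A back-of-the-envelope calculation makes the trade-off concrete. The figures below (30 seconds to pay, 60 seconds to prepare a drink, 2% of drinks wasted) are purely hypothetical assumptions, not measurements from any real shop.

```python
def orders_per_hour(seconds_per_order):
    """Convert a time per order into an hourly throughput."""
    return 3600 / seconds_per_order


PAYMENT_S = 30       # hypothetical time to take a payment
PREPARATION_S = 60   # hypothetical time to prepare a drink
ERROR_RATE = 0.02    # hypothetical share of drinks thrown away or redone

# Payment first, then preparation: each order occupies the whole chain.
sequential = orders_per_hour(PAYMENT_S + PREPARATION_S)

# Payment and preparation overlap: the slower step sets the pace.
pipelined = orders_per_hour(max(PAYMENT_S, PREPARATION_S))

# Even after subtracting wasted drinks, the overlap still wins.
useful = pipelined * (1 - ERROR_RATE)

if __name__ == "__main__":
    print(sequential, pipelined, round(useful, 1))
```

With these assumed numbers, the sequential process handles 40 orders per hour and the overlapping one 60, of which roughly 58.8 are useful once the wasted drinks are deducted.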

The way Starbucks works is a good illustration: instead of putting in place mechanisms that avoid all errors by introducing a centralizing element (no coffee gets made unless the payment is processed first), it is sometimes better to accept errors in order to increase the utilization of each actor in the system. Every potential error is then judged by its consequences.

This way of working also brings into play the principle of the correlation identifier, that is, how I correlate two processes that run independently so that the overall processing can be completed, like the client receiving his coffee after paying for it in our example. It is then necessary to correlate the client, his payment and his order. This is done in different ways depending on the Starbucks shop and the country. The correlation identifier can thus be:

  • The name of the ordered coffee: especially in Europe
  • The client’s name: I have experienced this mostly in the United States. It should be noted that this reinforces the homely aspect of Starbucks, because I much prefer to hear my first name called out instead of an impersonal “who ordered the double espresso macchiato?”
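Here is a minimal sketch of a correlation identifier at work, assuming the client’s name is used as the identifier. The `PickupCounter` class and its methods are hypothetical names invented for this illustration.

```python
class PickupCounter:
    """Joins two independent steps (payment, preparation) by correlation id."""

    def __init__(self):
        self._paid = set()   # correlation ids whose payment is complete
        self._ready = {}     # correlation id -> finished drink

    def payment_done(self, correlation_id):
        """Record a completed payment; return the drink if it is ready."""
        self._paid.add(correlation_id)
        return self._try_deliver(correlation_id)

    def drink_ready(self, correlation_id, drink):
        """Record a finished drink; return it if it is already paid for."""
        self._ready[correlation_id] = drink
        return self._try_deliver(correlation_id)

    def _try_deliver(self, correlation_id):
        # Deliver only when both independent steps have reported in.
        if correlation_id in self._paid and correlation_id in self._ready:
            self._paid.discard(correlation_id)
            return self._ready.pop(correlation_id)
        return None
```

Whichever step finishes second triggers the delivery, so the two lines never need to know about each other; they only need to agree on the identifier.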

Conclusion

I like this Starbucks example because transposing a mechanism into the real world, with the intuition that goes with it, is an excellent way to design or understand something that could otherwise remain obscure. It also illustrates mechanisms used by distributed and asynchronous systems, such as asynchronous handoff, error handling and correlation identifiers.

Ultimately, it is above all an opportunity to show how handling errors globally, across the entire processing chain, enables an optimization that leads to considerable gains in throughput.

This article was inspired by Gregor Hohpe’s earlier article « Starbucks Does Not Use Two-Phase Commit ». I also recommend another article by Gregor, « Let’s Have a Conversation », in which he compares the exchanges between two systems to a conversation between two people.