STAR

TLA+ – The Die Hard Problem


 

Abstract

The “Die Hard” problem epitomizes the core dilemma of distributed computing: do you factor the elements of the system from a set of highly visible and intentional actions, supported by a complex “behavior” (a.k.a stateful), or do you let elements of the system CRUD each other? This post attempts to give an answer to that question.

___________

A couple of years ago, Dr. Lamport developed  an example based on the “Die Hard” problem:

In the movie Die Hard 3, the heroes must solve the problem of obtaining exactly 4 gallons of water using a 5 gallon jug, a 3 gallon jug, and a water faucet. We now apply TLA+ and the TLC model checker to solve this problem.

Here is how TLA+ creates a specification of the solution:

The initial predicate enumerates the start states (both jugs are empty)

Init =∆ ∧ big = 0 
∧ small = 0

 

The coupling here is obvious, as the “state” of the system is defined as type property values.

Then the next-state action predicates enumerates the possible actions on the system:

Next =∆
∨ FillSmall
∨ FillBig
∨ EmptySmall
∨ EmptyBig
∨ Small2Big
∨ Big2Small

 

Then each action is specified in terms the “end state” reached once the action is applied to the system. For instance:

FillSmall =∆ ∧ small′ = 3
∧ big′ = big

FillBig =∆ ∧ big′ = 5
∧ small′ = small

EmptyBig =∆ ∧ big′ = 0
∧ small′ = small

… and so on

 

The solution is then found ad absurdum by expressing that the system should not be able to reach a “state” of 4 gallon in the big jug. The tool detects that’s a possible state (as an error condition) and the error trace provides the solution. 

IMHO, TLA+ is creating an unnecessary coupling between Action, States and Types by stating that: “a state is an assignment of values to a variable“. This is congruent to Marius Eriksen’s position which sees “state as a formula” (If you have not seen his presentation on Functional Programming @Twitter, I highly recommend it). 

The way I would approach solving the problem would be by creating the following state machine:

TLA+ - The Die Hard Problem

Fig. 1 – The Die Hard Problem State Machine

The state labels are encoded as follows: [jug_size][full,empty,partial][jug_size][full,empty,partial], for instance one of the possible initial states is BigEmptySmallEmpty (so are BigFullSmallEmpty and BigEmptySmallFull).

The key advantage of that approach is that the actions, the states and the types (jug) are as decoupled as they can possibly be: the same state machine would be common to jugs of any size. What is important here, as we create the solution specification, is that every action can be associated to one ore more transitions which connects two states. These actions express an intent to change the state of the system, however, it is ultimately the system and not the action which decides the outcome, i.e. the resulting state. It is the type which assign specific values to variables and maintain the mapping between states and assignments. In other words, IMHO, the relationship between State and Assignment is not an IS-A relationship but an association.

Fig. 2 – Proposed Metamodel

That is a subtle, but very important disctinction, especially in the context of distributed systems. One should never expect that “a computer, which you didn’t even know existed, would be in the position to set the state of your system”. That is the essence of distributed systems: elements of a system request an action to be performed on another element, they are never in the position to assign property values.

To illustrate that point, let’s look at a very interesting action: Big2Small (empty the big jug into the small one).  When this action is invoked (while the system is in the state BigPartialSmallEmpty), there are four possible outcomes:

  • BigEmptySmallFull
  • BigPartialSmallFull
  • BigPartialSmallPartial (not represented on the diagram)
  • BigEmptySmallPartial

The action Big2Small should not decide on the outcome (how could it? it would mean that an external actor would have absolute knowledge of the system), it only expresses the intent. The state decides whether the intent is receivable, the type realizes the intent.  

Furthermore, there is no need to create individual “states” that would reflect the different values the big jug could have when it is partially filled, i.e. [1..4] because they are strictly equivalent from an action/transition perspective. In other words, I argue that a state is defined by its action (associated to the incoming and outgoing transitions), not the type property values.  In the present case, the significant “state” of the system here is “partial”.

How does the state machine “computes” the values of each jug? 

TLA+ - The Die Hard Problem

Fig. 3 – Computing the Solution

Given an initial state you travel a path as a series of (action,state), the values of the type are computed along the way and represented as a tuple (small_jug_value,big_jug_value). The path highlighted in green describes the sequence that leads to the solution (Big jug contains 4 gallons).

To prove my point (unfortunately, not as in a “mathematical proof”) let’s ask the same question with the following problem: given two jugs of 11 and 4 gallons, produce 1 gallon of water. Not surprisingly, the same state machine applies and the solutions comes just as easily, albeit with a few more iterations:

TLA+ - The Die Hard Problem

Fig. 4 – The state machine (states and actions) is an invariant for the the two jug class of systems where Vjug1 < Vjug2 and these numbers are “relatively prime” numbers

What’s very interesting about this problem is that it epitomizes the core dilemma of distributed computing: do you factor the elements of the system from a set of highly visible and intentional actions, supported by a complex “behavior” (a.k.a stateful), or do you let elements of the system CRUD each other? (i.e. allow the elements of the system to expose operations such as “PUT the Big Jug’s content to 4 gallon” and you respond with success or an error whether it is possible or not). IMHO, that is why formalisms like TLA+ or Petri Nets are so important because the formalisms software engineer use in their day job have no semantic to represent states (and a fortiori, their relationship to actions). From here, a number of interesting questions emerge then:

  • how much do you need to share them (beyond the list of possible actions) with the other systems? 
  • how do the “states” relate to these actions? and hence what is the most effective definition of a state?
  • how does the public interface relate to the inner implementation of the system?

Dr. Lamport provides a critique of the state machine approach here. However, my word of caution is, and has always been, beware of simplistic reifications… when answering these questions. The proposed metamodel (above) allows for a simple replacement of the type without changing the actions or the states. Creating a metamodel where the state machine has the potential to be an invariant is a very desirable property in distributed systems because it allows you to reason very effectively on whether other elements of the systems could be impacted by a given change in one of the elements. Changes in properties are much easier to deal with (and much less likely to impact other elements) when compared to the state machine. Changes in a distributed system can quickly become a runaway system, that’s why it is so critical to establish a foundation that manages change in the most effective way possible. 

Now, it could well be that both TLA+ semantics and the ones I suggest here are strictly equivalent, but it does not appear to me, at first glance. My goal is to create the TLA+ metamodel to clarify whether there is indeed an equivalence or not.

I wrote a follow up to this post (State Machines and Computation) that illustrates the proposed metamodel and avoids the reification of (control) state into assignments of variables. 

Leave a Reply

Your email address will not be published. Required fields are marked *

39 − = 34