I don't think there is an industry more pompous than the Information industry. Yet lately the hype has gone up a notch: "I" no longer stands for Information but for Intelligence. I can already see the who's who of Gartner's shiny new IT magic quadrant!

But who could be so stupid as to believe that Artificial Intelligence is around the corner?

You want some tangible proof? Just look at Google: the company that mastered Information now touts its "Intelligence" in just about every network packet.

What happened this week is beyond me, and a sad reminder that we, as humans, need to understand what we are so "intelligently" designing.

I have run my blog ebpml.org since 2001. I am proud it is one of the oldest blogs talking about Service Oriented Architecture. I never engaged in SEO for myself or anyone else. I am a scientist: I respect people's ideas and work, and I provide references (i.e. links) religiously. This is not true for everyone; recently someone took one of my posts to build half of a conference presentation without any acknowledgement. I have no respect whatsoever for that kind of behavior. His name is not even worth mentioning here. Strangely enough, Google's algorithms don't pick up on such despicable behavior. They might even reward it...

I just received the following response from Google. They had sent me an email complaining about a suspicious link on my blog (the name of the page clearly pointed at a hack, and the link was not even there), so after checking that the link was indeed absent, I asked Google for a "reconsideration request":

 

 

Yes, that was not good enough. The fact that the link was not there was not enough to return to a clean status: I was "violating" Google's rules, and 15 years of work needed to be thrown into darkness "to preserve the quality of Google's Search Engine".

A few days earlier, I had also received a panicked email from a web site I reference, for the simple reason that I use its content for free, per its licensing policy.

 

 

And this is what happened to Jordi Cabot, who runs a popular and very useful blog on Model-Driven Engineering:

 

What's next? Self-driving cars texting the drivers up the road to "get out of the way, or else"?

So here is what I have to say, if there is anyone from Google who reads this post:

Dear Google,

Do you think that, as software engineers, we can believe for more than a nanosecond that humans should have to take any action to facilitate the work of your algorithms?

Do you think, as humans, it is fair that you tell us that "something is wrong with our online presence and you'll censor it if we don't change it"? When you say "violated", are you referring to some kind of crime? (bonus question: are you some kind of sociopath?)

While we are at it, I also wanted to thank you for giving us a glimpse of the world to come: we will no longer have to deal with senseless "processes" or even cruel "dictators", we will have to deal with your "intelligent" algorithms.

Congratulations, you've just earned the #1 spot in the "superficial intelligence" magic quadrant. Your depth of vision and market leadership are simply in a class of their own.

Sincerely,

Jean-Jacques Dubray

P.S.: I never used Bing before, but it is now my default search engine. Bing Maps are actually a lot better than Google Maps.

 

 

State Machines and Computation

Abstract

The goal of this post is to start developing the metamodel of computing, i.e. when you write an algorithm, or any piece of code, what are the fundamental elements that you need to build and assemble to achieve the desired goal. The post itself was inspired by, and heavily relies on, Dr. Lamport's article "Computation and State Machines".

The major issue discussed here is the concept of state. The general view of "state" in our industry is "an assignment of variables". "Control state" (the state that enables or disables actions) is thus reified as a variable assignment. That definition appears erroneous, and the reification unnecessary. We suggest that assignments of variables are just that, and are inherently controlled by a "Type", while (control) state can be more effectively associated with ranges of values of specific properties. From there, we derive a simple metamodel for computing where State, Type and Action are clearly articulated, without reification, and provide a DSL to illustrate it.

_________

I have always wondered about the relationship between Mathematics and Computing. This spectacular paper from Dr. Lamport, Computation and State Machines, provides clear insights into that question. Anyone who writes software should read it.

The core thesis of the paper, and the foundation of TLA+, is that an expression (in a programming language) can be viewed as a transition from one state to another. I am at odds with that thesis and its main corollary:

a state is an assignment of values to all possible variables.

It is difficult to disagree with such a statement; however, there is an example in the paper, the Factorial function, that illustrates well the point I am trying to make.

Beyond the definition of state, which appears more as an axiom, Dr. Lamport uses that example to also ask several important questions:

we must decide what constitutes an [action]. How many [actions] are performed in executing a statement such as f = i ∗ f? Does a single [action] change the value of f to the product of the current values of i and f? Are the evaluation of i ∗ f and the assignment to f two [actions]? Are there three [actions]: the reading of i, the reading of f, and the assignment to f?

[Since] there is no standard way to describe a relation on the set of states, [should we] use a transition predicate, which is [defined as] a Boolean-valued formula N whose free variables are primed and/or unprimed state variables?

 
Fig. 1 - The Factorial Algorithm (a- Program, b- Corresponding State Machine, c- TLA+ implementation)

 

Dr. Lamport uses labels (Fig. 1a) to align the computation of the algorithm with the states of the corresponding state machine (incidentally, I have a new appreciation for the infamous goto statement...). 

Fig. 1b is my representation of the algorithm's state machine. Nothing out of the ordinary: an initialize action bootstraps the process and the transitions operate until the test automatically transitions to done (i.e. when i < 2). The multiply action performs the computation:

            f = i-- * f; 

The question that comes to mind is, assuming a state to be "an assignment of values to all possible variables": what state are you in at the end of that computation? When Dr. Lamport translates the factorial algorithm into a TLA+ definition (Fig. 1c), he is forced to introduce a new variable (pc) whose possible values are exactly the set of states of the corresponding state machine (test, mult and done). In one of the simplest cases possible, the definition of a state (as an assignment of values) fails to be correct, unless you introduce a variable whose possible values are the states themselves. I believe that is a reification, and that reification leads to a definition of state that is incorrect. If that definition is incorrect, the key implication is that an expression is not a transition. I would actually argue that an expression is an action.
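To make the reification concrete, here is a minimal Python sketch (my illustration, not Dr. Lamport's code) of the factorial algorithm encoded the TLA+ way, with an explicit pc variable whose values are exactly the control states of Fig. 1b:

# A TLA+-style encoding of factorial in Python: if a "state" is an assignment
# of values to all variables, control flow has to be reified as an extra
# variable (pc) whose values are the control states of Fig. 1b.
def next_state(state):
    """Compute the successor of a state; a state is a dict {f, i, pc}."""
    f, i, pc = state["f"], state["i"], state["pc"]
    if pc == "test":
        # the test either enables the multiplication or terminates
        return {"f": f, "i": i, "pc": "mult" if i > 1 else "done"}
    if pc == "mult":
        # a single "action": f = i * f ; i = i - 1
        return {"f": f * i, "i": i - 1, "pc": "test"}
    return state  # pc == "done": nothing left to do

state = {"f": 1, "i": 7, "pc": "test"}
while state["pc"] != "done":
    state = next_state(state)
print(state["f"])  # 5040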

In the TLA+ example, we can see that variables (i.e. types), states, actions and transitions are fully coupled. This is an unwanted and unnecessary coupling. It may well be mathematically correct (i.e. it does not violate the semantics of mathematics), but it fails to represent the semantics of states, types, actions and transitions.

As I suggested in my previous post, the semantics of the system should include a "type" which is solely responsible for deciding which state the state machine is in once an action has been applied, and new values have been assigned to the variables.

Fig. 2 - Proposed Metamodel of Computation (BOLT notation)

With these kinds of semantics, the factorial algorithm could be represented as:

/// Types
type Factorial { 
     f : int  
     i : int  
     map{ default:done, mult: i>1}
}  

/// States
state mult {
   multiply
}
state done { 
   initialize
}

/// Actions
action initialize { 
      f' = 1
      i' = 7
} 
action multiply {
     f' = f*i
     i' = i-1
}

The system is by default in the state "done", which allows a single action: initialize (which, in the real world, would be parameterized). When the system is in the state "mult", the action multiply can be applied until the system reaches the state "done" again, at which point it can only be reinitialized. Incidentally, I prefer the notion of a default state to that of an init state.

You may also have noticed that the synthetic state "test" is no longer part of the definition of the system. I strongly question the fact that "test" could be a state, since its meaning is quite different from "done" or "mult". If it is not a state, then why can this value be assigned to the same variable as "done" and "mult"?

This new (meta)model no longer requires that "IF ... THEN ... ELSE" statements "be added to mathematics", as was necessary when Dr. Lamport designed TLA+. As a matter of fact, they can be re-interpreted, mathematically, as "ranges": the predicate i > 1 simply means that when i is in the range [2, ∞), we are in the "mult" state. In other words, ranges are the very fabric of "state".
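For comparison, here is a minimal Python sketch (my own names and encoding, mirroring the DSL above) of the proposed metamodel: actions only propose new property values, the type maps those values to a control state, and the control state decides which actions are currently enabled:

# Actions propose new property values; the Type decides which control state
# those values map to (the map{ default:done, mult: i>1 } clause); the control
# state decides which actions can be applied.
ACTIONS = {
    "initialize": lambda f, i: {"f": 1, "i": 7},
    "multiply":   lambda f, i: {"f": f * i, "i": i - 1},
}

ENABLED = {            # control state -> actions that can be applied
    "mult": {"multiply"},
    "done": {"initialize"},
}

def factorial_state(values):
    """The Type's mapping from property ranges to control states."""
    return "mult" if values["i"] > 1 else "done"

values, state = {"f": 1, "i": 7}, "done"
for request in ["initialize"] + ["multiply"] * 10:
    if request not in ENABLED[state]:
        continue                        # the intent is not receivable
    values = ACTIONS[request](**values)
    state = factorial_state(values)     # the Type assigns the control state
print(values["f"], state)               # 5040 done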

The next question is then: what is the best (hence lowest) coupling between actions and types? In other words, are actions only coupled to a subset of the type's properties, or are they requests to type operations? For instance, to align with Abstract Data Types, I could have written the system's description as:

/// Types
type Factorial { 
     f : int  
     i : int  
     map{ default:done, mult: i>1}
 operation _multiply {
        f' = f*i
        i' = i -1
     }
 operation _initialize { 
        f' = 1
        i' = 7
     } 
}  

/// States
state mult {
   multiply 
}
state done { 
   initialize 
}

/// Actions
action initialize { 
      _initialize
} 
action multiply {
     _multiply
}

I prefer the former, which is far better aligned with Functional Programming and provides the lowest coupling with a type (since I could reuse actions over different types which share the same set of properties). The resulting property values are passed to the type (as a request to change its state), which validates a set of constraints (not represented here) before it decides to accept (or refuse) the new values.

With the exponential growth of Compute, Network and Data resources (e.g. Cloud, Mobile, IoT...) the complexity of distributed systems is about to increase by an order of magnitude or more. I suggest that it may be time to realign the foundation of computing towards a proper articulation of its core concepts (Fig.3): State, Type, Action and Relationship (STAR). As Dr. Lamport suggests:

Computer Science should be about concepts, not languages.

Fig. 3 The Double-Symmetry of the core concepts of Computation
jdubray
01/02/15

TLA+ - The Die Hard Problem

 

Abstract

The "Die Hard" problem epitomizes the core dilemma of distributed computing: do you factor the elements of the system from a set of highly visible and intentional actions, supported by a complex "behavior" (a.k.a stateful), or do you let elements of the system CRUD each other? (i.e. allow the elements of the system to expose operations such as "PUT the Big Jug's content to 4 gallon" and you respond with success or an error whether it is possible or not). This post attempts to give an answer to that question.

___________

A couple of years ago, Dr. Lamport developed an example based on the "Die Hard" problem:

In the movie Die Hard 3, the heroes must solve the problem of obtaining exactly 4 gallons of water using a 5 gallon jug, a 3 gallon jug, and a water faucet. We now apply TLA+ and the TLC model checker to solve this problem.

Here is how TLA+ creates a specification of the solution:

The initial predicate enumerates the start states (both jugs are empty)

Init ≜ ∧ big = 0
       ∧ small = 0

 

The coupling here is obvious, as the "state" of the system is defined as type property values.

Then the next-state action predicate enumerates the possible actions on the system:

Next ≜
       ∨ FillSmall
       ∨ FillBig
       ∨ EmptySmall
       ∨ EmptyBig
       ∨ Small2Big
       ∨ Big2Small

 

Then each action is specified in terms of the "end state" reached once the action is applied to the system. For instance:

FillSmall ≜     ∧ small′ = 3
                ∧ big′ = big

FillBig ≜       ∧ big′ = 5
                ∧ small′ = small

EmptyBig ≜      ∧ big′ = 0
                ∧ small′ = small

... and so on

 

The solution is then found ad absurdum by expressing that the system should not be able to reach a "state" of 4 gallons in the big jug. The tool detects that this is a possible state (as an error condition) and the error trace provides the solution.
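The TLC run itself is not reproduced here, but the same ad absurdum search can be sketched in a few lines of Python (my own illustration, not the TLA+ tooling): assert the "invariant" big ≠ 4 and report the shortest counterexample trace, which is the solution:

from collections import deque

BIG_CAP, SMALL_CAP = 5, 3      # jug capacities

def successors(big, small):
    """Every state reachable in one action, labeled by the action name."""
    yield "FillBig",    (BIG_CAP, small)
    yield "FillSmall",  (big, SMALL_CAP)
    yield "EmptyBig",   (0, small)
    yield "EmptySmall", (big, 0)
    pour = min(big, SMALL_CAP - small)            # Big2Small
    yield "Big2Small",  (big - pour, small + pour)
    pour = min(small, BIG_CAP - big)              # Small2Big
    yield "Small2Big",  (big + pour, small - pour)

# Breadth-first search for a violation of the "invariant" big != 4;
# the shortest counterexample trace is the solution to the puzzle.
queue, seen = deque([((0, 0), [])]), {(0, 0)}
while queue:
    (big, small), trace = queue.popleft()
    if big == 4:
        print(" -> ".join(f"{name}{state}" for name, state in trace))
        break
    for name, nxt in successors(big, small):
        if nxt not in seen:
            seen.add(nxt)
            queue.append((nxt, trace + [(name, nxt)]))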

IMHO, TLA+ creates an unnecessary coupling between Actions, States and Types by stating that "a state is an assignment of values to a variable". This is congruent with Marius Eriksen's position, which sees "state as a formula" (if you have not seen his presentation on Functional Programming @Twitter, I highly recommend it).

The way I would approach solving the problem is by creating the following state machine:


Fig. 1 - The Die Hard Problem State Machine

The state labels are encoded as follows: [jug_size][full,empty,partial][jug_size][full,empty,partial]; for instance, one of the possible initial states is BigEmptySmallEmpty (so are BigFullSmallEmpty and BigEmptySmallFull).

The key advantage of that approach is that the actions, the states and the types (jug) are as decoupled as they can possibly be: the same state machine would be common to jugs of any size. What is important here, as we create the solution specification, is that every action can be associated with one or more transitions, each of which connects two states. These actions express an intent to change the state of the system; however, it is ultimately the system, not the action, which decides the outcome, i.e. the resulting state. It is the type which assigns specific values to variables and maintains the mapping between states and assignments. In other words, IMHO, the relationship between State and Assignment is not an IS-A relationship but an association.
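As a rough illustration of that association (my own sketch, not part of TLA+), the type's mapping from assignments to control-state labels could look like this, parameterized by the jug capacities so that the same labels work for any pair of jugs:

def control_state(big, small, big_cap, small_cap):
    """Map an assignment of jug values to a control-state label.
    The labels are the same for any pair of capacities: the type maintains
    the mapping, the state machine itself never sees the numbers."""
    def level(value, cap):
        if value == 0:
            return "Empty"
        return "Full" if value == cap else "Partial"
    return f"Big{level(big, big_cap)}Small{level(small, small_cap)}"

print(control_state(0, 0, 5, 3))    # BigEmptySmallEmpty
print(control_state(4, 3, 5, 3))    # BigPartialSmallFull
print(control_state(1, 0, 11, 4))   # BigPartialSmallEmpty (11 and 4 gallon jugs)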

Fig. 2 - Proposed Metamodel

That is a subtle, but very important, distinction, especially in the context of distributed systems. One should never expect that "a computer, which you didn't even know existed, would be in the position to set the state of your system". That is the essence of distributed systems: elements of a system request an action to be performed on another element; they are never in the position to assign property values.

To illustrate that point, let's look at a very interesting action: Big2Small (empty the big jug into the small one). When this action is invoked (while the system is in the state BigPartialSmallEmpty), there are four possible outcomes:

  • BigEmptySmallFull
  • BigPartialSmallFull
  • BigPartialSmallPartial (not represented on the diagram)
  • BigEmptySmallPartial

The action Big2Small should not decide on the outcome (how could it? it would mean that an external actor has absolute knowledge of the system); it only expresses the intent. The state decides whether the intent is receivable; the type realizes the intent.

Furthermore, there is no need to create individual "states" that would reflect the different values the big jug could have when it is partially filled, i.e. [1..4], because they are strictly equivalent from an action/transition perspective. In other words, I argue that a state is defined by its actions (associated with the incoming and outgoing transitions), not by the type's property values. In the present case, the significant "state" of the system is "partial".

How does the state machine "compute" the values of each jug?


Fig. 3 - Computing the Solution

Given an initial state, you travel a path as a series of (action, state) pairs; the values of the type are computed along the way and represented as a tuple (small_jug_value, big_jug_value). The path highlighted in green describes the sequence that leads to the solution (the big jug contains 4 gallons).

To prove my point (unfortunately, not as in a "mathematical proof"), let's ask the same question for the following problem: given two jugs of 11 and 4 gallons, produce 1 gallon of water. Not surprisingly, the same state machine applies, and the solution comes just as easily, albeit with a few more iterations:


Fig. 4 - The state machine (states and actions) is an invariant for the two-jug class of systems where Vjug1 < Vjug2 and the two volumes are relatively prime

What's very interesting about this problem is that it epitomizes the core dilemma of distributed computing: do you factor the elements of the system from a set of highly visible and intentional actions, supported by a complex "behavior" (a.k.a. stateful), or do you let the elements of the system CRUD each other (i.e. allow them to expose operations such as "PUT the Big Jug's content to 4 gallons" and respond with success or an error depending on whether it is possible or not)? IMHO, that is why formalisms like TLA+ or Petri Nets are so important: the formalisms software engineers use in their day jobs have no semantics to represent states (and, a fortiori, their relationship to actions). A number of interesting questions emerge:

  • how much do you then need to share (beyond the list of possible actions) with the other systems?
  • how do the "states" relate to these actions? and hence what is the most effective definition of a state?
  • how does the public interface relate to the inner implementation of the system?

Dr. Lamport provides a critique of the state machine approach here. However, my word of caution is, and has always been: beware of simplistic reifications... when answering these questions. The proposed metamodel (above) allows for a simple replacement of the type without changing the actions or the states. Creating a metamodel where the state machine has the potential to be an invariant is a very desirable property in distributed systems, because it allows you to reason very effectively about whether other elements of the system could be impacted by a given change in one of the elements. Changes in properties are much easier to deal with (and much less likely to impact other elements) than changes to the state machine. A distributed system under change can quickly become a runaway system; that is why it is so critical to establish a foundation that manages change in the most effective way possible.

Now, it could well be that the TLA+ semantics and the ones I suggest here are strictly equivalent, but it does not appear so to me, at first glance. My goal is to create the TLA+ metamodel to clarify whether there is indeed an equivalence or not.

I wrote a follow-up to this post (State Machines and Computation) that illustrates the proposed metamodel and avoids the reification of (control) state into assignments of variables.

jdubray
12/31/14

TLA*


How do you interact with Dr. Leslie Lamport, recipient of the 2013 Turing Award, when you are not a Computer Scientist? The answer is that you probably don't. Dr. Lamport did reply to me yesterday, encouraging me to "clarify my thoughts and submit an article to a proper scientific journal". I wish I could; I have not written a refereed paper in more than 15 years and I don't see that changing any time soon, most likely ever.

So I'll just try to clarify my thought process and post it here.

A couple of weeks ago, a friend of mine told me that he was impressed by TLA+ and that his team, like Amazon, had been successful using it on their project. I feel sorry not to have come across the work of Dr. Lamport earlier; this is such an important foundation for Software Engineering and programming languages:

TLA stands for the Temporal Logic of Actions, but it has become a shorthand for referring to the TLA+ specification language and the PlusCal algorithm language, together with their associated tools.

TLA+ is based on the idea that the best way to describe things formally is with simple mathematics, and that a specification language should contain as little as possible beyond what is needed to write simple mathematics precisely. TLA+ is especially well suited for writing high-level specifications of concurrent and distributed systems.

PlusCal is an algorithm language that, at first glance, looks like a typical tiny toy programming language. However, a PlusCal expression can be any TLA+ expression, which means anything that can be expressed with mathematics.

The best introduction I found so far on TLA+ was written by Stephan Merz (INRIA). TLA+ comes with a toolbox and a full set of documentation as well.

As you can imagine, sentences like "TLA+ is especially well suited for writing high-level specifications of concurrent and distributed systems" resonate immediately with my interests.

The way I parse and analyze new elements of knowledge is by constructing their metamodel. In 2003, I was part of the WS-CDL (Web Services Choreography Description Language) working group, and several people in the group felt that we should create a specification that was in essence aligning itself with π-calculus. The discussions were difficult, and I felt there was quite a bit of hand waving in the π-calculus camp. It was not a stupid proposal by any means, but it seemed wildly incomplete: its semantics were too limited to deliver a "choreography definition language". A couple of years earlier, Intalio had shoehorned the π-calculus theory into BPML (which later became WSCI), and I felt that had been quite a disaster. So I set out to create a representation of the π-calculus semantics, which ended the discussions immediately. The semantics of Robin Milner's theory were indeed too limited and could not account for many scenarios. In the end, the theory, though related, was never designed to support choreography or process definitions. Serendipity is such a romantic concept...

In the end, a theory is just that: a model, and without creating an objective representation of its semantics, you can pretty much argue anything you want, such as "which means anything that can be expressed with mathematics" or the foundational predicate of TLA+ which states that "a state is an assignment of values to a variable".

Let me say first that I really like TLA+ and that I wish I could be 1/10th of a fraction of someone capable of discussing with Dr. Lamport, but I am not and will never be.

So this is how I think, verbatim, and here is a digest, my digest, of what I wish I could discuss with Dr. Lamport.

First, I start by looking at how well something like TLA+ aligns with what I call the M4 layer (mainly because I feel that UML is an M3-layer technology, and as such there can be more than one M3 layer). To my great surprise, TLA+ is fully STAR compliant. It has a precise set of semantics for Action, State (Assignment), Relationship (Tuples) and Type (Sets and Records). This is the first time I have seen a conceptual model so complete. It's hard to contain my excitement.

Second, I try to construct the metamodel. In the present case, it is a bit more difficult because the documentation is simply too hard for me to interpret (and somewhat inconsistent; that's what a metamodel surfaces instantly). The high-level view of the metamodel states that a digital system exposes a collection of behaviors; behaviors are represented as a graph of states and events (a.k.a. steps). The specification of the digital system uses sets, functions and logic operators (using the BOLT notation):


I claimed earlier that the metamodel was already inconsistent at the highest level, because when you analyze the description Dr. Lamport makes and the semantics of TLA+, they seem to diverge (but again, I am barely scratching the surface here):


For instance, as Dr. Lamport reduces a state to a mere variable assignment, it is not clear how a state relates to an expression from a TLA+ semantics point of view (next-state action).

So, until I am able to create the metamodel with enough precision, I'd like to discuss what I feel are some important points, such as "a state is an assignment of values to a variable", because I feel strongly that it is not. Yes, as a first-approximation model, it would be hard to disagree, say when you take the example of an hour clock. But when you take a close look at the first metamodel (above), one can also infer that a state can be uniquely identified by the actions that lead to it and the actions that are enabled when you are in that particular state. I prefer that definition because it is type independent (please see my post on the "Die Hard" problem for more details). I do not view "hr = 1" and "hr = 2" ... as different "states" of the clock, because from an action perspective they are strictly equivalent. I understand that there is some confusion in Software Engineering as to what state is (most software engineers would define state as the snapshot of a type instance), but that definition would be independent of the transitions into and out of that state.

This is how Dr. Lamport describes the Hour Clock system:

There is a basic problem with using a state machine as a specification. Consider an hour clock, a clock that displays only the hour. Ignoring the actual time that elapses between ticks, an hour clock is trivially specified by a state machine that increments the hour with each step

and this is how this system is modeled with TLA+:

This example is quite approachable and illustrates well the semantics of states and actions:

  • there are 24 possible initial states for the Hour Clock (HC), as defined by the initial predicate HCini
  • the next-action predicate (HCnxt) defines a single action, which transitions from a given state to the next
  • the allowed states and their temporal relationships are specified by the HCsafe expression (start from any of the allowed initial states and apply the next action forever); a small sketch follows below
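Here is that reading of the hour clock as a small Python sketch (my own illustration; it follows the post's count of 24 allowed hours rather than Dr. Lamport's original text):

import random

HOURS = range(1, 25)          # the 24 allowed values of hr, per the reading above

def hc_ini():
    """HCini: any of the allowed hours is a valid initial state."""
    return random.choice(list(HOURS))

def hc_nxt(hr):
    """HCnxt: the single next-state action, incrementing the hour."""
    return 1 if hr == 24 else hr + 1

hr = hc_ini()
for _ in range(5):            # HCsafe: keep applying the next action
    hr = hc_nxt(hr)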

Now, I don't ever expect Dr. Lamport to agree with me, but I believe that his definition of a state, "a state is an assignment of values to a variable", is physically incorrect, though mathematically sound. It introduces a strong coupling between states, types and actions. Yes, indeed, the ultimate result of the action will be to increment a counter, but does that define a state? It's a bit like saying the state of an electron is solely defined by the energy it took to reach that state. Not quite. Computer Scientists have a permanent bias towards the reification of the conceptual view (states and relationships) into the physical view (actions and types). I believe that bias is wrong and must be avoided: states and relationships must never be expressed in terms of actions and types (and vice versa). That reification is the root of all problems in Software Engineering.

States are specific entities (uniquely identified by a label, not by properties): states are not variable assignments, though they can be reduced to them. The engine of a car can reach the "started" state once the starter is properly powered, or when someone pushes the car fast enough. And yes, I can observe that each time the engine is in the started state, RPM > 0, but is every property assignment a state? Yes, it does look like somebody PUT an RPM value, but isn't that a mere observation? When I set my car's color to "blue", did I change its "state"? That's why I use the word "reduce": the perspective of a variable assignment does not properly reflect the metamodel of how states, types and actions relate. In particular, it couples (if not pushes) the responsibility of deciding whether a change of state has occurred into the action, which, IMHO, is the wrong way to look at distributed systems. What happens when you reify the concept of state behind the concept of type? You end up writing arguments like this one:

However, a naive state-machine specification of the hour clock asserts that the hour changes with every step, while an hour-minute clock changes the hour only on every 60th step. This problem is solved by requiring invariance under stuttering. The specification of the hour clock must allow any finite number of stuttering steps, ones that leave the hour unchanged, between successive changes to the hour. Steps of the hour-minute clock that change only the minute are then stuttering steps allowed by the hour clock's specification

Of course, my limited knowledge of TLA+ does not entitle me to critique Dr. Lamport's work, but it is difficult not to look at a more accurate representation of the state machine of a clock and see, for instance, that a "Daylight Saving Time" action would reach the "ticked" state just like a normal "tick" action (in other words, the clock should reach the "ticked" state in both cases, though an observer of the "hour clock" might see no "change in state" if they simply look at the variable assignment).

I would also want to see in the TLA+ description a "started" state which can transition to a "stopped" state, which prevents the clock from further honoring requests to "tick". I would also question why the "states" of a clock would change based on its precision; they only do when you couple types and states.
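Here is a small sketch, in Python rather than TLA+, of the decoupled clock I have in mind; the names (started, stopped, daylight_saving) are mine:

# The clock's control states ("started"/"stopped") are independent of the
# precision of the displayed value, and a Daylight Saving adjustment is just
# another action whose outcome, seen only through the variable assignment,
# may be indistinguishable from a normal tick.
class Clock:
    def __init__(self):
        self.hr = 1
        self.state = "stopped"

    def start(self):
        self.state = "started"

    def stop(self):
        self.state = "stopped"          # further "tick" requests are refused

    def tick(self):
        if self.state != "started":
            return                      # the intent is not receivable
        self.hr = 1 if self.hr == 24 else self.hr + 1

    def daylight_saving(self, offset):
        if self.state != "started":
            return
        self.hr = (self.hr - 1 + offset) % 24 + 1

clock = Clock()
clock.tick()                  # ignored: the clock is stopped
clock.start()
clock.tick()                  # hr == 2
clock.daylight_saving(1)      # hr == 3: to an observer who only watches the
                              # assignment, this looks exactly like a tick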

So, I don't mean to be overly condescending, because I understand this is exactly how it looks, but I feel that, one more time, the reification of the conceptual view (states) into the physical view (types) is failing. If Dr. Lamport were a physicist, and not a mathematician, he might appreciate the argument that Physics has four fundamental concepts: Light, Mass, Energy and Force. Not 5, not 2: 4. Optics has been very successful at demonstrating that one can deliver a lot of valuable insights by ignoring Mass, Energy and Force, but in the end you hit a wall: you cannot satisfactorily describe a system by considering light alone and reifying all the other concepts as mere "assignments" (such as refractive indices).

The most constraining wall that we are hitting with this kind of reification is that we cannot define a common structure for problem statements. If the proper view of "state" is brought into perspective, one can then define a problem as:

  • either a missing transition between two states
  • or an unwanted (but possible) transition between two states

Dr. Lamport was kind enough to evaluate this definition and immediately spotted that it was incomplete. First he noted that this definition was limited to "engineering" problems. He also added that one condition was glaringly missing:

An incorrect behavior can also be the result of an incorrect initial state (this is rare) and as the result of an incorrect or missing liveness/fairness property. The latter as a distinct class of errors is often overlooked because one simply assumes a sufficiently strong fairness assumption, so that failure to satisfy a liveness property must be due to one or more missing state transitions. For example, consider a two-process program in which one process can loop forever and the second one can perform a single action, and correctness requires the second process's action to be performed. Without a fairness condition to ensure that the second process takes a step, then there can be a behavior in which only the first process takes steps--a "problem" that is neither of the two you claim to be the only ones.

Why is formalizing the problem specification important? Because, except for obvious problems, software is built from a problem specification (a.k.a. requirements, user stories, ...). TLA+ has created the foundation for a formal solution specification, on which we could build to create a formal problem specification. Once we have both a formal problem and a formal solution specification, we open the door to formal verification as well.
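As a hypothetical sketch of what a structured problem specification could look like (the names and structure are mine, not TLA+), combining the two transition-based problem classes above with the two classes Dr. Lamport added:

from dataclasses import dataclass
from enum import Enum
from typing import Optional

# A sketch (the names are mine) of a structured problem specification built
# from the classes of problems listed above, plus the two Dr. Lamport added.
class ProblemKind(Enum):
    MISSING_TRANSITION = "a missing transition between two states"
    UNWANTED_TRANSITION = "an unwanted (but possible) transition between two states"
    INCORRECT_INITIAL_STATE = "an incorrect initial state"
    MISSING_FAIRNESS = "an incorrect or missing liveness/fairness property"

@dataclass
class Problem:
    kind: ProblemKind
    from_state: Optional[str] = None    # transition endpoints, when relevant
    to_state: Optional[str] = None

# e.g. "the engine never reaches 'started' from 'stopped'"
problem = Problem(ProblemKind.MISSING_TRANSITION, "stopped", "started")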


I understand that I am playing way out of my league here, but I cannot help thinking that reifying the conceptual view into the physical one is a monumental mistake; in particular, it makes it more difficult for a problem specification formalism (aligned with TLA+) to emerge.


Abstract

Distributed Computing is at its core an MNP problem: you have M clients invoking N remote operations, each of which can have P versions. That's a big problem to have; this post attempts to explain why.

Traditional software engineers are unaware of that problem because they work in a monolithic world where there is one client (the Web app) and one version of each operation, tightly coupled to the Web app. That's hard enough, but it's infinitely simpler than the MNP problem. The problem is slowly becoming mainstream because today M is always bigger than one: with mobile devices, every business has at least 3-6 types of consumers, not to mention APIs, where clients can easily run in the hundreds, thousands or even millions for the "lucky" ones. When M > 1, versioning starts creeping up, as not all clients are able to move in lockstep with each change in the operations. Unless you are Google or Facebook and have the superpower to set P to any value you'd like (even P=1), you'll face the MNP [1] problem sooner than you think. You could also try the "DON'T" approach, but I doubt that this is even remotely possible.

_______

I built my first distributed computing system in 1987, using an Apple IIe and a handful of A/D and D/A converters to communicate with PID controllers. By 1989, I had transformed it into a commercial product using shiny new PCs (running DOS/Turbo Pascal) and HPIB (the HP Interface Bus) to talk to many devices. In the early 90s, I switched to NeXT to build realtime process control systems for the semiconductor industry. By 1995, I was building systems that were processing continuous streams of data of about 15 kb/s. That doesn't sound like much, but with my team at HRL we were able to build the first system that could control the fabrication of complex semiconductor devices at the physical level (composition, temperature and thickness, down to the atomic layer, using RHEED oscillations). The control algorithms were so sophisticated that a professor from CU had demonstrated that, theoretically, they could not be written.

Moore's law helping, we are now capable of building systems that can process, in realtime, tens if not hundreds of megabytes of data per second. That being said, from an interaction perspective things have not changed much. All interactions are either events (something you might be interested in just happened here) or actions (could you please do this for me). If you want to get really fancy, you might consider implementing trans-actions (i.e. invoking two or more actions whose outcome is coordinated).

So when I read comments like this, it's hard not to reflect on the last 25+ years:

The problem in distributed computing is that vendors try to make it easy for one system to talk to another, and a lot of very good software engineers believe that if two systems can call each other, then it does not matter whether you have two or one hundred. After all, distributed computing is only about encoding the semantics of an event or an action into a string and having another system decode those semantics and apply the proper response to that event or action. Who cares about these "details", right?

I was very unlucky in my career because I developed software for NeXT computers and their wonderful OS: NeXTStep (which now lives on as OS X). Imagine, we had Interface Builder, Reflection APIs, Dynamic Typing and Collections in 1989, CORBA/OLE in 1991, the first O/R mapping tool (EOF) in 1993, WebObjects (the first dynamic HTML app server) in 1995... Everything else at the time looked like it was a decade behind. Where else other than on a NeXT computer would the Web have picked its birthplace? Heck, even distributed computing in NeXTStep was easy, because Objective-C made it possible to create non-breaking interfaces, a small detail (little did I know) that would occupy a good chunk of the past 25 years.

What most people don't see is that Distributed Computing is, at its core, an MNP problem: you have M clients invoking N remote operations, each of which can have P versions. That's a big problem to have. Traditional software engineers are unaware of it because they work in a monolithic world where there is one client (the Web app) and one version of each operation. That's hard enough, but it's infinitely simpler than the MNP problem. The problem is slowly becoming mainstream because today M is always larger than one: every company has at least 3-6 consumers, not to mention APIs, where clients can easily be in the hundreds, thousands or even millions for the "lucky" ones. When M > 1, versioning starts creeping up, as not all clients can move in lockstep with each change in the operations.

Note that I have somewhat over-simplified the problem already, because for consistency reasons the N operations nearly always impact more than one system of record, so we should really be talking about the MNPT problem, where T is the number of systems of record involved in the response of a given operation (again, in a monolithic system T=1).

So for the most part software engineers, when they are tasked to build 100 operations, end up building 100 operations. But when M, P and T are greater than one, things change a bit.

If we now assume 10 consumers, 100 operations, 10 versions and 10 systems of record, we already have a system with 100,000 potential communication paths. Yes, 100,000. No, not lines of code: 100,000 possible communication paths (10,000 when you have only one system of record, which, again, is more the exception than the norm). Good luck writing code for that kind of system, especially if you pay no attention to interface or endpoint details! Imagine changing one detail somewhere and regression testing it. Compare that to the 100 possible paths of the corresponding monolithic architecture (I don't factor in the systems of record here because app developers are never involved in "integration"). Yep, you guessed it: most software engineers have never dealt with that problem and, from what I have seen, they generally don't know how to deal with it. I am not saying that all paths will be exercised, of course; what I am saying is that if you pay no attention to the details of how clients, operations and versioning are implemented, your code will exercise far more paths than necessary, multiplying the level of effort involved in building and maintaining the system.
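The arithmetic is trivial to check; the path count is simply the product of the four dimensions:

# The path count is just the product M * N * P * T.
M, N, P, T = 10, 100, 10, 10   # consumers, operations, versions, systems of record
print(M * N * P * T)           # 100000 potential communication paths
print(M * N * P * 1)           # 10000 with a single system of record
print(1 * N * 1 * 1)           # 100 for the corresponding monolith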

For instance, to deal with change "effectively", developers tend to adopt an "immutable" versioning strategy: you need to change an operation because a new consumer has slightly different requirements? No problem, fork it, and voila, V+1 (I have seen teams at V+25, with all 25 versions in production). As you can see, having M consumers arbitrarily talking to P versions of N operations will very quickly get out of hand when you need to maintain that kind of code base. Imagine what bug fixing looks like in a 100x25 code space.

Others look at "minimizing" endpoints (actually the same team tried that too, I discovered that they had up to 60 operations hidden behind a single endpoint, an "implementation detail"). Sure enough that argument can be supported by basic math, as long as N is low, you would stay out of trouble. In the early days of REST, the RESTafarians had even argued that the interface was "uniform", N = 4. Done! problem solved! The reality is that you can't control N to that level, you can't "hide" operations because this is the unit of work a consumer needs to invoke and you must to control access to. Yes of course, you should try to minimize N, this is why I often recommend to create "intentional interfaces", an interface where operations express "intents", otherwise you quickly end up supporting all kinds of variants for a growing number of clients (a.k.a Vertical Slice Architecture). This is especially true if you do not decouple properly the User Interface from the Intent. In that case, each consumer could potentially lead to its own operation.

Of course, over time, as your organization grows, you should expect the information system to become more fragmented and T to increase as well. As new systems of record are added, many operations will need to be retrofitted to achieve consistency, and the more operations you have, the merrier.

I guess by now most people would understand the big picture (that is, unless you are emotionally or financially attached to a magic bullet). I would personally support any approach where:

  • operations are explicit (machine readable contracts, generated clients)
  • an interface can change while remaining compatible (when possible) with existing consumers which are not immediately interested in consuming that change (that should limit the number of versions in production to 3 or less)
  • an architecture where operations are not hard wired to the systems of record since you want to be in the position to 
    • control how changes from the systems of record propagate to the operation's contract
    • add new systems of record in the future without asking your clients to change anything

If your approach fails to meet one of these criteria, you will most likely feel excruciating pain even in relatively small settings (10 consumers/100 operations/10 versions/10 back-ends is kind of the minimum in the Mobile/Cloud/SaaS era). It won't be instant, it might take 18 to 24 months, but you will end up in the 100,000-communication-path pit, and once you end up there, you cannot refactor yourself out of it. This is a runaway system: the more communication paths you have, the less you are in a position to change them, and the more new communication paths you open. I am not saying that if you follow these criteria it will be easy, but it will be easier.

Whatever you choose to do, you must make sure you never end up there. In any case, I suggest you carefully monitor the number of communication paths your system uses, in relation to the maximum theoretical number. You should quickly get a sense of whether you are losing control of your code base.

Let's try to put this on a diagram to make it clearer. In essence, there are three approaches to distributed computing. I am not saying that one wins all the time; all I am trying to say is that they are not interchangeable and they solve different kinds of problems:

  1. Consumer-System of Record (a.k.a Point-to-Point)
  2. Event-Driven (this pattern is more often used in system-to-system communication than consumer-to-system of record)
  3. Service Oriented

Not represented on this diagram is the versioning dimension, which, again, would act as a multiplier of all the communication paths between the elements of the distributed system. 

You are free to call integration points "microservices", and you are free to claim that this is not "point-to-point", but in essence, each time a consumer communicates directly with a system of record, you are using a "point-to-point" integration pattern, wiring consumers to the integration points of the system of record. That is the root of all problems in distributed computing where the number of components is greater than two. That is why RPC fails; that is why CORBA or EJB were too "brittle": any change in the system impacted too many communication paths. And to minimize impact, you respond to change by creating more integration points (or microservices, if you like that term better), hence more communication paths. That is the definition of a runaway system.

Beyond the brittleness of the component interfaces, the point-to-point integration pattern pushes the responsibility for consistency to either the consumer or the systems of record themselves (which typically use events, for obvious coupling reasons). This compounds the problem because it creates the need to duplicate, or at least spread, the consistency logic across the edges of the system (consumers and systems of record). Personally, I prefer a "Service-Oriented" approach, where the consistency responsibility is given to a third-party component exposing an intentional and versionable interface, because that logic does not belong to the consumer or the system of record.

Over the last 15 years or so, there has been a strong tension between the (cool) customer-facing (Web) apps and the (boring) back-office systems of record. I best illustrate that tension by talking about the CAR theorem, which states that we cannot build distributed systems with Consistency, Agility and a Rich data model (highly Relational); we have to pick two. The boring guys like me are tasked to keep the back office consistent, period. The cool kids build customer-facing apps; they need to be agile and offer the customer as much data as required (products, transaction history, support, ...). When you have 100,000 communication paths to deal with (or even a fraction thereof), you cannot be agile and consistent at the same time, and I am not even talking about how rich/relational your data model is.

We have reached a point where the very concept of a monolith no longer exists (simply because M > 1), and with IoT just around the corner, I view the "Microservices" movement as something very positive for SOA, because for the first time application architecture has no other choice than to adopt integration architecture principles and expose proper integration points, intentionally and not as an afterthought. This gives us a unique opportunity to unify both. Today, every organization feels the pain of dealing with several "apps" (smartphones and tablets, iOS and Android) and several versions of these apps running in the wild. They need to acknowledge the MNP(T) problem to its full extent. All I know is that the worst way to deal with that problem is to use a point-to-point approach; I hope the math makes that clear.

[1] The "MNP" acronym was chosen for its similarity to the "Multihop Network Programming" protocol

 









