Matti Grönroos
Availability metrics have their roots in the mainframe architecture of decades ago: display terminals were cabled directly to the mainframe, and the applications were simple and straightforward. In such a world, the users' perception of availability was pretty much the same as the mainframe's technical availability.
However, the world has changed, and it keeps changing. Server hardware is almost immortal, and hardware uptime tells you almost nothing about the service the users actually experience. Availability has become an increasingly unclear concept year by year as value chains grow more complex.
Let us start with a simple example. A user sits at his or her workstation and presses a button. If nothing happens for a while, the user concludes that the service provider is lousy and walks to the coffee machine.
But is the case that simple? Let us strip the example down to the bones and look only at the most central components in the value chain:
Even in this oversimplified model, the transaction passes through different components about twenty times. One would need a very good crystal ball to be absolutely sure that the issue lies in a component within the service provider's responsibility.
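As a rough illustration of why the chain matters, availability in a serial chain multiplies. The sketch below is not the author's calculation; the per-component figure of 99.9% is an assumption chosen for illustration. Even when each of the roughly twenty components is individually quite reliable, the end-to-end figure seen by the user is noticeably lower:

```python
import math

# Illustrative assumption: ~20 components in series, each 99.9% available.
# (Both numbers are assumptions for this sketch, not figures from the text.)
component_availability = 0.999
n_components = 20

# In a serial chain, end-to-end availability is the product of the parts.
chain = math.prod([component_availability] * n_components)

print(f"Per component: {100 * component_availability:.1f}%")
print(f"End to end:    {100 * chain:.2f}%")  # roughly 98%
```

Note how twenty "three nines" components in series already eat almost a whole day of user-visible availability per month.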
The concept of availability is somewhat analogous to a spoiled egg: If the egg box contains a spoiled egg, what is rotten: The egg itself, the box of eggs, the entire shopping cart, the supermarket or the mall hosting the supermarket? There is no single answer to this, just as there is no single answer to what the availability of a complex value network is.
If availability is to be measured and reported with unambiguous metrics, the procedure and the principles must be agreed very clearly, so that both sides of the table share a common view. This is extremely important if quality penalties are to be based on the availability metrics. The calculation method must be reasonable and take the criticality of the systems into account: there is no reason to pay penalties for insignificant incidents. The aim should always be a dashboard that produces reasonable information and useful metrics.
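One way to make such agreed principles unambiguous is to encode them. The sketch below is illustrative only: the incident records, the criticality weights, and the significance threshold are assumptions invented for this example, not anything specified in the text. It weights downtime by an agreed per-system criticality and excludes insignificant incidents from the penalty calculation:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    system: str
    downtime_min: float
    criticality: float  # agreed weight, e.g. 1.0 = business critical

PERIOD_MIN = 30 * 24 * 60          # one 30-day reporting period, in minutes
SIGNIFICANCE_THRESHOLD_MIN = 5.0   # agreed cutoff: shorter incidents carry no penalty

def weighted_availability(incidents: list[Incident]) -> float:
    """Availability over the period, downtime weighted by criticality.

    Incidents below the agreed significance threshold are excluded,
    so no penalties accrue for them.
    """
    weighted_downtime = sum(
        i.downtime_min * i.criticality
        for i in incidents
        if i.downtime_min >= SIGNIFICANCE_THRESHOLD_MIN
    )
    return 100.0 * (1.0 - weighted_downtime / PERIOD_MIN)

# Hypothetical incidents for one reporting period:
incidents = [
    Incident("X", 60.0, 1.0),   # critical system, one hour down
    Incident("Y", 3.0, 0.5),    # below threshold: ignored entirely
]
print(round(weighted_availability(incidents), 3))
```

The point is not this particular formula but that both parties can read the same few lines and agree, before any penalty dispute arises, on exactly what counts.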
Let us look at a chart reflecting the Value Network thinking. We have two systems, X and Y. There are three offices, A, B, and C, with a total of 1200 users. System X runs on a 4-way cluster where one member can be down without the users noticing any performance degradation.
The traditional model of availability calculation, based on server availability, gives the following results:
If point 4 is down, the case is more interesting. All these availability figures are logical:
None of these figures is absolutely wrong or absolutely correct.
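The point can be made concrete with a small sketch. The numbers below are illustrative assumptions, since the original chart is not reproduced here: 1200 users in total, and an outage that cuts one office of 200 users off from system X for four hours in a 30-day period while the servers themselves stay up. Depending on which definition the parties have agreed, three different and equally logical figures emerge:

```python
TOTAL_USERS = 1200      # offices A, B, and C combined
AFFECTED_USERS = 200    # assumption: the outage cuts off one office
PERIOD_H = 30 * 24.0    # 30-day reporting period, in hours
OUTAGE_H = 4.0          # assumption: duration of the outage

# 1) Server-centric view: the servers never went down.
server_availability = 100.0

# 2) The affected office's view: the service was simply gone for four hours.
office_availability = 100.0 * (1 - OUTAGE_H / PERIOD_H)

# 3) User-minute-weighted view across the whole user base.
lost_user_hours = AFFECTED_USERS * OUTAGE_H
total_user_hours = TOTAL_USERS * PERIOD_H
user_weighted = 100.0 * (1 - lost_user_hours / total_user_hours)

print(server_availability)                # the provider's favourite number
print(round(office_availability, 2))      # what office B would report
print(round(user_weighted, 3))            # a compromise across all users
```

All three numbers describe the same outage, and none of them is wrong; this is exactly why the calculation model has to be agreed in advance rather than argued about afterwards.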
Matters are not made any easier by the fact that internal systems may have a considerable number of integrations with each other, as well as integrations to services outside the organization.