Matti Grönroos
Recovering from a service break involves more than restarting the server. The restart itself takes only seconds or minutes. However, the end-users may see a much longer outage, even hours.
A service break is not always a simple on-off question. If the performance of the system collapses, the users see the case as an outage even if every component in the value chain is up and running.
If service breaks are subject to quality penalties, both contract parties are protected by writing an exact definition of a service break into the contract: the impact of criticality, the criteria for the start of a service break, and the criteria for the end of recovery.
Recovering from a service break is a value chain, too.
Apart from rare physical faults, the recovery of the infrastructure is a quick action. The multi-nine availability figures are based on this. However, this is only a small fraction of the downtime seen by the end-users.
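To make the multi-nine figures concrete, the following minimal sketch converts an availability percentage into the yearly downtime it allows. The function name and the 365-day year are assumptions made for illustration; real contracts define their own measurement period.

```python
# Minutes in a non-leap year; a contract may use a different measurement period.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.1f} minutes of downtime per year")
```

A three-nine target (99.9 %) allows roughly 526 minutes, or under nine hours, of infrastructure downtime per year. The figure says nothing about how long the end-users wait after the infrastructure is back.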
Only when the infrastructure has been restored does the recovery of the information management systems, the information and the applications begin. A large database not designed for quick recovery might have to do a considerable amount of work to recover. The elapsed time for this may be tens of times longer than the time needed for the infrastructure restart. This part of the recovery usually lies beyond the responsibility of the service provider. Instead, the time to recover depends on the database design.
Before allowing users to log in, it may be necessary to verify the success of the recovery and the integrity of the critical data. This is a step that is also part of the outage experienced by the end-users, although it may be the responsibility of the customer organization itself.
The recovery time of the end-users should not be underestimated either. They may have taken their coffee break, or they may have begun another task. Some of the work might need to be done again. That is why it may take time to reach the business-as-usual state again.
Another pitfall related to irrelevant figures may be hidden in service contracts: counting only the downtime minutes.
Let us compare two scenarios. The upper one is subject to one long infrastructure-related service break and the lower one to three short ones.
The paradox is that the upper case may trigger the sanction statement of the contract, but the lower one does not. However, the end-users see the lower one as a much worse case because the long post-infrastructure recovery time is repeated after every break. Once again, we deliver watermelons.
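The paradox can be sketched with a few lines of arithmetic. All figures below are hypothetical and chosen only for illustration: a 30-minute sanction threshold on counted downtime, and a 90-minute post-infrastructure recovery (database recovery, verification, users getting back to work) that follows every break regardless of its length.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical figures, not taken from any real contract.
SANCTION_THRESHOLD_MIN = 30   # penalty applies if counted downtime exceeds this
POST_INFRA_RECOVERY_MIN = 90  # recovery work that follows each infrastructure break


@dataclass
class Scenario:
    name: str
    infra_breaks_min: List[float]  # duration of each infrastructure break


def contract_downtime(scenario: Scenario) -> float:
    """Downtime as the contract counts it: infrastructure minutes only."""
    return sum(scenario.infra_breaks_min)


def user_downtime(scenario: Scenario) -> float:
    """Downtime as the end-users see it: each break plus its recovery tail."""
    return sum(b + POST_INFRA_RECOVERY_MIN for b in scenario.infra_breaks_min)


def triggers_sanction(scenario: Scenario) -> bool:
    return contract_downtime(scenario) > SANCTION_THRESHOLD_MIN


upper = Scenario("one long break", [40])
lower = Scenario("three short breaks", [5, 5, 5])

if __name__ == "__main__":
    for s in (upper, lower):
        print(s.name, contract_downtime(s), user_downtime(s), triggers_sanction(s))
```

With these assumed numbers, the upper scenario counts 40 contract minutes and triggers the sanction, while the end-users lose 130 minutes. The lower scenario counts only 15 contract minutes and escapes the sanction, although the end-users lose 285 minutes.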
Only a few service contracts place availability and performance side by side, even though there are good reasons to do so. The lack of performance is visible to the user in the response time, which is somewhat subjective. For the service recipient, it appears as limited throughput (events per unit of time), which is more objective. In addition, the responsibility for performance is spread over several players. That is why it is easiest to sweep it under the carpet.
When the pain limit is reached, the end-users cannot use the system in a sensible way. In such a situation, it is a matter of taste whether the service is down or not. That is why severely degraded performance and a total service break are equal from the end-users' perspective. Since measuring the response time in an objective way is difficult, it is easy to skip as a troublesome topic.
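One way to put availability and performance side by side is to count every measurement interval in which the response time exceeds an agreed pain limit as downtime. The sketch below assumes one response-time sample per minute, with `None` marking a minute in which the service did not respond at all. The 8-second pain limit and the sample values are hypothetical.

```python
from typing import List, Optional

PAIN_LIMIT_S = 8.0  # hypothetical threshold above which the system is unusable


def hard_down_minutes(samples: List[Optional[float]]) -> int:
    """Minutes in which the service gave no response at all."""
    return sum(1 for rt in samples if rt is None)


def effective_down_minutes(samples: List[Optional[float]], pain_limit_s: float) -> int:
    """Minutes the end-users experience as an outage: no response, or too slow."""
    return sum(1 for rt in samples if rt is None or rt > pain_limit_s)


# One sample per minute: response time in seconds, or None for no response.
samples = [0.8, 1.1, None, None, 12.0, 15.5, 9.0, 1.2]

if __name__ == "__main__":
    print("hard down:", hard_down_minutes(samples))
    print("effectively down:", effective_down_minutes(samples, PAIN_LIMIT_S))
```

In this made-up series the service is strictly down for two minutes, but the end-users cannot work sensibly for five. Where the pain limit lies, and how the response time is sampled, would have to be agreed between the contract parties.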