
Matti Grönroos

RTO and RPO - Recovery Targets

Traditionally, IT application development projects do not put much focus on the recovery capabilities of the system being built. Instead, the attitude is to wait and see at the end whether the result is acceptable. If yes, go live. If not, go live.

It is possible to approach the question more systematically and to set recovery targets alongside the functional requirements. Two targets are commonly used: RTO and RPO, the Recovery Time Objective and the Recovery Point Objective.

As explained elsewhere on our website, service breaks attract a great deal of interest. Far less attention is paid to recovering from failures, even though the recovery time has much more impact on the company's business than the service break itself. Perhaps recovery planning is simply less sexy than writing code with the most modern tools.

When mainframes dominated the world, restarting the infrastructure was a time-consuming exercise. Almost as soon as it was complete, the card readers started to eat card decks, and production was up and running. The overwhelming interest in server uptime and downtime dates back to this era. Its popularity is, of course, also due to monitoring applications, practically all of which can collect this information.

When the company's business depends on its IT systems, the entire value chain must be designed for quick recovery. That is why it is essential to define reasonable values for two recovery targets:

RTO (Recovery Time Objective): the maximum acceptable time from the start of a service break until the service is restored.

RPO (Recovery Point Objective): the maximum acceptable loss of data, expressed as the time span between the last recoverable state of the data and the failure.

The business case is straightforward: the shorter the RTO and RPO, the bigger the budget needed.
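
To make the two targets concrete, the following minimal Python sketch compares the outcome of a single incident against agreed targets. All names (`RecoveryTargets`, `Incident`, `evaluate`) and the sample timestamps are hypothetical and serve only as illustration. The sketch assumes that the achieved recovery point is measured as the time between the last recoverable copy of the data and the failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rto: timedelta  # maximum acceptable time to restore the service
    rpo: timedelta  # maximum acceptable data loss, expressed as time


@dataclass(frozen=True)
class Incident:
    last_recoverable_copy: datetime  # e.g. latest usable backup (assumption)
    failure: datetime                # agreed start of the service break
    restored: datetime               # agreed end of the service break


def evaluate(incident: Incident, targets: RecoveryTargets) -> dict:
    """Compare the achieved recovery time and data loss window with the targets."""
    recovery_time = incident.restored - incident.failure
    data_loss_window = incident.failure - incident.last_recoverable_copy
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= targets.rto,
        "rpo_met": data_loss_window <= targets.rpo,
    }


# Hypothetical targets and incident data.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
incident = Incident(
    last_recoverable_copy=datetime(2024, 1, 10, 2, 0),  # nightly backup
    failure=datetime(2024, 1, 10, 9, 30),
    restored=datetime(2024, 1, 10, 12, 0),
)
result = evaluate(incident, targets)
print(result["rto_met"], result["rpo_met"])  # prints: True False
```

In this made-up case the service is back within the RTO, but with a nightly backup as the only recoverable copy the RPO of one hour is missed by a wide margin. Tightening the RPO would call for more frequent copies of the data, which is one way the shorter targets translate into a bigger budget.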

If these targets are to be included as SLA metrics in the service agreement, it is necessary to agree on the exact criteria for the start and end times of a service break; see Service Break.

The start time of a service break can be defined in more than one way. Often, the downtime clock starts when the service provider receives a notification from end users, or when the service provider's monitoring system detects the failure. In addition, it may be necessary to discuss whether the troubleshooting done before an unplanned system restart is counted as Incident Management or as part of the service break.
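
The effect of the chosen start criterion on the measured downtime can be illustrated with a short Python sketch. The event names and timestamps are invented for the example, and `actual_failure` is an added assumption for comparison, since in practice the true moment of failure is often not known exactly.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for one service break.
events = {
    "actual_failure": datetime(2024, 1, 10, 9, 30),
    "monitoring_alert": datetime(2024, 1, 10, 9, 37),
    "user_notification": datetime(2024, 1, 10, 9, 55),
    "service_restored": datetime(2024, 1, 10, 12, 0),
}


def downtime(events: dict, start_criterion: str,
             end_criterion: str = "service_restored") -> timedelta:
    """Length of the service break under the given start and end criteria."""
    return events[end_criterion] - events[start_criterion]


# The same incident yields a different downtime for each start criterion.
for criterion in ("actual_failure", "monitoring_alert", "user_notification"):
    print(f"{criterion}: {downtime(events, criterion)}")
```

The same incident lasts 2 h 30 min, 2 h 23 min or 2 h 5 min depending on the criterion, which is why the definition needs to be fixed in the agreement before the metric is used.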

The end time has more options:

Each of these, and a few additional ones, is a valid criterion on some grounds. It is essential to reach a common understanding of the criteria. A service provider that delivers only plain infrastructure services will most likely not want to take responsibility for the chain beyond point 4. Nevertheless, it is good to appoint a body that is responsible for ensuring the success of the entire recovery.