Vihreä Ruusu - RTO and RPO - Recovery Targets

Matti Grönroos

RTO and RPO - Recovery Targets

Traditionally, IT application development projects pay little attention to the recovery capabilities of the system being built. Instead, the attitude is: let us see at the end whether the result is acceptable or not. If yes, go live. If not, go live.

It is possible to approach the question more systematically and to set recovery targets in addition to the functional requirements. Two targets are available: RTO and RPO, the Recovery Time Objective and the Recovery Point Objective.

As our website explains, service breaks are very interesting things. Less attention is paid to recovery from failures, even though the recovery time has much more impact on the company's business case than the service breaks themselves. Perhaps recovery is less sexy than writing code with the most modern tools.

When mainframes dominated the world, restarting the infrastructure was a time-consuming exercise. Almost as soon as it was complete, card readers started to eat card decks, and production was up and running. The overwhelming interest in server uptime and downtime dates back to this era. Of course, its popularity also owes something to monitoring applications, all of which are capable of collecting this information.

When the company's business depends on its IT systems, the entire value chain must be designed for quick recovery. That is why it is essential to define reasonable values for two recovery targets:

RTO: Recovery Time Objective, expressing the maximum duration of the service break. This target has quite significant cost implications. Recovering a 10-gigabyte database in 10 hours is a totally different exercise from recovering 100 gigabytes in one hour. Therefore, the RTO is a business case, not only a technical challenge. The system, the database, and the infrastructure must all be designed to meet the RTO target. None of them alone can carry the responsibility for tight RTO targets.
RPO: Recovery Point Objective, expressing the maximum amount of data the organization can tolerate losing. The RPO is usually expressed as an interval: how many hours of work it is acceptable to lose because the data is lost. Most modern database systems support a zero RPO for committed transactions, but such techniques typically increase the recovery time. It is possible to find a working balance between the RPO and the RTO with skillful database design and co-operation with the infrastructure experts. Note: with daily backups alone, the RPO cannot be shorter than 24 hours.
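The arithmetic behind these two targets can be sketched in a few lines of Python. This is a simplified illustration only: the function names are hypothetical, the restore rate ignores everything except copying the data back (detection, decision-making, and application restart also consume RTO time), and the RPO check assumes periodic backups are the only protection in place.

```python
def required_restore_rate_gb_per_hour(database_size_gb: float, rto_hours: float) -> float:
    """Sustained restore throughput needed to bring the data back within the RTO.

    Assumption: the whole RTO window is available for the restore itself.
    """
    if rto_hours <= 0:
        raise ValueError("RTO must be positive")
    return database_size_gb / rto_hours


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """Worst case with backups only: the failure hits just before the next
    backup, so everything written since the previous backup is lost."""
    return backup_interval_hours


def backup_schedule_meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """True if the worst-case loss of the backup schedule fits within the RPO."""
    return worst_case_data_loss_hours(backup_interval_hours) <= rpo_hours


# The two cases from the text: 10 GB in 10 hours versus 100 GB in one hour.
relaxed_rate = required_restore_rate_gb_per_hour(10, 10)   # 1.0 GB/h
tight_rate = required_restore_rate_gb_per_hour(100, 1)     # 100.0 GB/h

# Daily backups against a 4-hour RPO and against a 24-hour RPO.
daily_meets_4h = backup_schedule_meets_rpo(24, 4)          # False
daily_meets_24h = backup_schedule_meets_rpo(24, 24)        # True
```

The hundredfold difference in required throughput between the two cases is what makes the RTO a budget question as much as a technical one.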

The business case is straightforward: the shorter the RTO and the RPO are, the bigger the budget needed.

If these targets are to be included as SLA metrics in the service agreement, it is necessary to agree on the exact criteria for the start and end times of the service break, see Service Break.

There is no single definition of the start time of a service break. Often, the downtime clock starts when the service provider receives a notification from end-users, or when the service provider's monitoring system detects the break. In addition, it may be necessary to discuss whether troubleshooting before an unplanned system restart is counted as Incident Management or as part of the service break.
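How much the choice of start time matters can be shown with a small Python sketch. All names here (`ServiceBreak`, `break_start`, `meets_rto`, and the convention labels) are hypothetical and not taken from any SLA tooling; the end time is simply assumed to be whatever end point the parties have agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ServiceBreak:
    monitoring_detected_at: Optional[datetime]  # provider's monitoring noticed the break
    user_reported_at: Optional[datetime]        # provider received the end-user notification
    service_restored_at: datetime               # agreed end point of the break


def break_start(b: ServiceBreak, convention: str) -> datetime:
    """Pick the start of the break according to the agreed convention."""
    if convention == "monitoring":
        start = b.monitoring_detected_at
    elif convention == "user_report":
        start = b.user_reported_at
    elif convention == "earliest":
        known = [t for t in (b.monitoring_detected_at, b.user_reported_at) if t is not None]
        start = min(known) if known else None
    else:
        raise ValueError(f"unknown convention: {convention}")
    if start is None:
        raise ValueError(f"no timestamp recorded for convention: {convention}")
    return start


def meets_rto(b: ServiceBreak, convention: str, rto: timedelta) -> bool:
    """True if the break, measured with the given convention, stayed within the RTO."""
    return b.service_restored_at - break_start(b, convention) <= rto


# One and the same incident, measured against a 4-hour RTO.
incident = ServiceBreak(
    monitoring_detected_at=datetime(2024, 3, 4, 8, 0),
    user_reported_at=datetime(2024, 3, 4, 8, 40),
    service_restored_at=datetime(2024, 3, 4, 12, 20),
)
rto = timedelta(hours=4)

by_monitoring = meets_rto(incident, "monitoring", rto)    # 4 h 20 min: RTO missed
by_user_report = meets_rto(incident, "user_report", rto)  # 3 h 40 min: RTO met
```

The same incident breaches the target under one convention and meets it under the other, which is why the criteria belong in the agreement itself.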

The end time has more options:

Each of these, and a few additional ones, is a valid criterion on some grounds. It is essential to reach a common understanding of the criteria. A service provider delivering plain infrastructure services only will most likely not want to take responsibility for the chain beyond point 4. Nevertheless, it is good to find a body whose responsibility is to ensure the success of the entire recovery.