By Toddy Mladenov
June 4, 2014 03:07 AM EDT
No company wishes for downtime, but as we have seen lately, it is something we hear about quite often. Interestingly, very few enterprises, especially in the Small and Medium Business area, spend enough time working out good procedures for recovering their IT systems and applications. Recovery procedures should always be driven by business needs, and this is where a lot of IT departments fail. As a result, recovery turns into a reactive procedure that is triggered by the issue, results in chaotic recovery activities and ends with a post-mortem but no improvements afterwards. Putting more initial thought into the Business Impact Analysis (BIA) is a prerequisite for good recovery procedures, and defining the two main characteristics - RTO and RPO - is a crucial part of this process.
Let's start with the first one - Recovery Time Objective (RTO). RTO is defined as the duration of time within which the system or service must be restored after a disruption in order to avoid unacceptable consequences related to a break in business continuity. The first thing to keep in mind about RTO is that it is an objective - a target that you may not be able to achieve every time. There are certain activities you need to perform during this time, and their duration may vary. At a high level they are grouped into:
- Recognizing that there is a disruption - this may depend on your level of monitoring or lack of it and may involve manual checking of each system or service that participates in the business process
- Troubleshooting and identifying the failing system and/or service - this will depend on the level of diagnostics you have implemented and may also involve different people or teams
- Fixing the issue - depending on the root cause this can range from something as simple as rebooting the system to something as complex as requiring code changes or even ordering new hardware
- Testing the fix - last but not least you need to make sure that the fix actually resolves the issue
In all four of those activities the human factor is the most variable part. People need to be notified and updated, and they need time to understand the issue, troubleshoot, code etc. The more automation you provide, the less the human factor affects the recovery time.
Once the system or service is brought back into operation, though, you need to determine the state of the data. This is where the next characteristic becomes important - Recovery Point Objective (RPO). RPO is defined as the maximum period for which data might be lost from the system due to a disruption without major impact on business continuity. Although this is also an objective, you need to be more careful with this one. There are a few things to think about here:
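To make the four activities a bit more concrete, here is a minimal Python sketch that adds up an estimate for each one and compares the total to the RTO. All durations, the 2-hour RTO and the function names are hypothetical values chosen only for illustration; your own estimates should come from your BIA and from past incidents.

```python
from datetime import timedelta

# Hypothetical estimates for the four recovery activities (assumptions, not measurements).
phases = {
    "recognize disruption": timedelta(minutes=10),
    "troubleshoot": timedelta(minutes=30),
    "fix": timedelta(minutes=45),
    "test the fix": timedelta(minutes=15),
}

rto = timedelta(hours=2)  # assumed Recovery Time Objective


def estimated_recovery_time(phases):
    """Sum the estimated duration of every recovery activity."""
    return sum(phases.values(), timedelta())


def meets_rto(phases, rto):
    """True if the estimated recovery time fits within the RTO."""
    return estimated_recovery_time(phases) <= rto


print(estimated_recovery_time(phases))  # 1:40:00
print(meets_rto(phases, rto))           # True
```

Because the human factor makes each estimate variable, it may be more useful to run such a check with pessimistic durations than with average ones.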
- Is data loss acceptable at all? In a lot of cases the answer is no, but there are situations in which you can tolerate loss of data.
- How to recover the data? Does it require copying, shipping backup tapes or manual entry of the data?
- How long will it take to recover the data? The two extremes range from the few seconds required to repoint the system to a replica of the data on another server to the time needed to request an off-site backup copy of the data
- How to test that the data is recovered? This can vary from automated tests to manual tests
Depending on your RPO, the time to recover business operations for your system may vary.
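One way to reason about RPO is to compare it with the gap between the last recoverable copy of the data and the moment of the disruption. The sketch below assumes that everything written after the last good backup or replica is lost; the timestamps, the 4-hour RPO and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy, disruption):
    """Span of data lost if we restore from the last good copy (assumed model)."""
    return disruption - last_good_copy


def within_rpo(last_good_copy, disruption, rpo):
    """True if the data loss window does not exceed the RPO."""
    return data_loss_window(last_good_copy, disruption) <= rpo


# Hypothetical example: nightly backup at 02:00, disruption at 09:30.
last_backup = datetime(2014, 6, 4, 2, 0)
outage = datetime(2014, 6, 4, 9, 30)
rpo = timedelta(hours=4)  # assumed Recovery Point Objective

print(data_loss_window(last_backup, outage))  # 7:30:00
print(within_rpo(last_backup, outage, rpo))   # False
```

In this example a nightly backup alone would not satisfy a 4-hour RPO, which suggests either more frequent backups or a replica of the data.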
When thinking about Business Continuity (BC) you need to think about both components - recovering the operation of the system or service (RTO) and recovering the data to a point at which it is usable for the business (RPO). Both of those actions need to take less time than the Maximum Tolerable Downtime (MTD) as we defined it in Determining the Cost of Downtime. In general, though, you should set your RTO and RPO so that you have a buffer of time for unexpected issues that may occur during recovery.
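As a rough illustration of the buffer idea, the following Python sketch checks whether recovering the service and recovering the data, plus a safety buffer, still fit within the MTD. It assumes the two recovery steps happen one after the other; the function name and all numbers are hypothetical.

```python
from datetime import timedelta


def fits_within_mtd(service_recovery, data_recovery, mtd, buffer):
    """True if service recovery + data recovery + buffer fit within the MTD.

    Assumes the two recovery steps run sequentially, not in parallel.
    """
    return service_recovery + data_recovery + buffer <= mtd


# Hypothetical numbers for illustration only.
service_recovery = timedelta(hours=2)  # time to restore the service
data_recovery = timedelta(hours=1)     # time to restore the data
mtd = timedelta(hours=4)               # Maximum Tolerable Downtime

print(fits_within_mtd(service_recovery, data_recovery, mtd, timedelta(minutes=30)))  # True
print(fits_within_mtd(service_recovery, data_recovery, mtd, timedelta(minutes=90)))  # False
```

If the check fails with the buffer you consider safe, the options are to shorten the recovery activities, for example through more automation, or to revisit the MTD with the business.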
This post was first published on our company's blog.