Predictive Systems recently conducted a disaster recovery roundtable to address this critical business strategy. We asked experts in three distinct areas (network design and engineering, security, and storage) for their thoughts and recommendations about business continuity and disaster recovery planning. In this article, our experts tackle some tough questions facing American businesses.
----------
Roundtable Experts
Peter Browne - Vice President, Predictive Systems' Global Integrity business unit. Peter has over 30 years of experience as an information security practitioner, consultant, speaker, and author. His background includes disaster recovery responsibility for major firms such as First Union, GE, and Motorola.
Ernie Hassell - Managing Consultant, Information Security. Ernie has over 15 years of IT service experience, most recently leading the disaster recovery practice at Lucent Technology Worldwide Services.
Tim Rooker - Managing Consultant, Network Design and Engineering. Tim has over 14 years of experience in network communications with companies such as BellSouth, MCI, and ING North America. He has also performed disaster recovery planning and design for BellSouth.
Scott Wilson - Vice President, Network and Systems Management Strategy. Scott has been in the network management field for six years, delivering large-scale network management solutions to a variety of customers including large Internet service providers. Scott has extensive experience in storage solutions.
----------
Moderator: With regard to business continuity and disaster recovery, what are the biggest lessons American businesses have learned from the September 11th disaster?
Hassell: One lesson is that there was a real gap in knowledge transfer at different levels of many organizations affected by the 9/11 disaster. For example, at the executive management level, let's say CFO types, it was critically important to have arrangements for disbursing funds and for the other processes that keep the wheels of a business turning. In many instances, things broke down because there was no plan in place for when the CFO wasn't around. There was no transfer of critical knowledge.
Another lesson was that the end points of many companies' infrastructures weren't really known. In other words, disaster recovery planning wasn't fully done. Companies didn't know how many nodes or desktops they had, or which applications they ran, and that resulted in quite a bit of chaos.
Browne: Generally, people are able to plan for things they know. Nobody had any experience with anything like the situation on September 11th. The 30-year rule comes into play. What do I mean by that? If you look at most people's careers, they don't last more than 30 years in business, so they didn't have any knowledge base for this type of disaster. According to Gartner, of the things that go wrong, some 40% are operator error, 12% hardware, and 40% application failure. Only 5% of things that go wrong are true disasters. Major snowstorms, hurricanes, tornadoes, and tsunamis are considered disasters, but quite a few people have experienced them. This disaster wasn't in anybody's frame of reference.
Wilson: Storage is a hot topic in New York City right now. I've heard many stories of companies that had documented disaster recovery processes, only to find that nobody knew where the document was. Another big problem, with regard to data, was that many key employees had data stored on their laptops or desktops that was never replicated anywhere. A significant amount of data was lost. Many disaster recovery plans focused on power outages in a central data center, but never considered the loss of an entire facility, or dispersed data that had never been collected into a central location in the first place.
Rooker: As far as lessons learned, I believe many people now consider disaster recovery an insurance policy. That's significant. Before 9/11, a disaster was considered to be a hardware outage or something similar. Nobody could have predicted a scenario of this magnitude. Everything is literally being rewritten as we sit here.
Moderator: It appears that business continuity/disaster recovery is a business state of mind. I've uncovered three prevalent scenarios in my research. The first is, "Don't bother me with this; it's so remote it will never happen." Another is, "Sure, we'll take care of that," though nobody is actually paying attention to the details. The third is, "We've got the latest and greatest systems in place, and we've spent a fortune on them." What can you share about these scenarios from your experience?
Browne: We worked with a large financial services company on a business continuity plan. The first thing we did was look at their products and services, and we uncovered some interesting attributes. First, we found that certain services didn't make any money. Once we identified the critical services that had financial meaning, we went off-site and did planning for three days. This doesn't fit any of your scenarios, since it seems to be one of the correct ways to tackle the problem.
One interesting point here. On the evening of the first day, we were talking about how to deal with emergencies and how to move employees to alternate sites if a given physical site was out of commission. At 6:00 PM that night, a call came in saying that the headquarters was burning. Actually, there was a fire two floors above their office and the place was drenched and filled with smoke. The client actually had to exercise their disaster recovery plan as it was being created.
Moderator: How did they do?
Browne: They actually did very well, because it was fresh in their minds. We got people moved to alternate locations, and things went very smoothly. This made the executive team believers in both the need for the process and the power behind it. To this day, the company is very proactive about business continuity and disaster recovery.
I know of hundreds of cases where more important things were on the front burner and companies looked at business continuity planning as a discretionary audit. Basically, they're paying lip service, and this seems to be the state of mind for many businesses today. So your second scenario is more prevalent than we'd like to admit.
Moderator: Is this because you can't immediately justify the whole disaster recovery process in terms of the bottom line?
Browne: That's right.
Rooker: It's almost out of sight, out of mind. A specific painful experience, like the one Peter described, has a significant impact on a business culture. But for many companies without direct experience, it's just an abstract exercise. Unfortunately, that's a dangerous mindset.
Hassell: A positive point coming out of this horrible incident is that insurance firms are clamping down on companies, making sure they have disaster recovery plans in place to cover business interruption. They simply won't write policies anymore without one, since the losses were so tremendous. Insurance companies want to mitigate their risks, and one way to do that is to require a disaster recovery plan.
Moderator: Insurance companies may be the police that demand that a disaster recovery plan is in place. But, who blows the whistle if the plan is inadequate?
Hassell: I think insurance companies are viewing themselves as stakeholders that should be involved in the process through onsite involvement. If done correctly, the process will actually blow the whistle.
Moderator: Does the level of planning differ depending upon the type of company? For example, would a plan for a financial services company be significantly different from a mail order company?
Rooker: Plans do differ based on the type of business. In the financial services industry, for example, what is the level of outage that the business can endure? For financial services, the answer is, none at all. The disaster recovery plan has to be rock solid. This requires a lot of effort, a lot of money, and a lot of due diligence.
Moderator: So it's really about the pain and suffering of financial loss?
Rooker: That's the beginning. But with a financial services company, you also have to look at perception. It may be a little less tangible, but it's no less catastrophic to the company if people perceive it as unstable. Perception plays a major role in companies such as financial services, high-tech security, and so on. A complete outage will do irreparable damage. Degraded service, on the other hand, where the company limps along until things are fully restored, might do in the short term. If the public and regulatory agencies will accept that, then the pain and suffering threshold is better known.
Browne: In any type of business, there are pain points based on the kind of business, the transaction velocity, and things like that. If a company is doing real-time or near-real-time transactions, the pain threshold is lower. It doesn't matter what kind of business you're in; it depends on whether your business is subject to choke points or funnel points.
Moderator: This may be a naïve question, but can disaster recovery planning and implementation be measured in terms of return on investment (ROI)?
Browne: Absolutely. That's what sells it in most cases. Doing the risk analysis that says, "If this happens, then you lose that," is critical. There is direct loss, and then there is indirect loss: the cost of recovering, the cost of rebuilding, and so on. And there's consequential loss, such as lost revenue, loss of reputation, or loss of business. Forty percent of companies go out of business in less than five years after a disaster. Those statistics are immutable. Our core competence is the ability to measure and quantify risk.
Moderator: I've always thought of ROI as a cause-and-effect situation. You can say we spent X dollars to achieve this. But you can't say we saved X dollars until a disaster occurs. If the plan is never called into play, isn't it considered a loss?
Hassell: That shortsightedness is exactly the sort of trap companies fall into. However, if executed properly, a disaster recovery plan should identify things on the back end that were not visible on the front end: details such as knowing where the equipment is, having an up-to-date asset list, or, as Peter pointed out, which parts of the business aren't making money. These are things many companies haven't accurately captured. A good plan will catch them and save countless dollars in losses when a disaster occurs. On the back end, you get a streamlined process and a better handle on what you have to do. That's the name of the game.
Browne: This is a form of business process engineering. Businesses that go through the process find themselves more efficient. By the way, what is disaster recovery planning? Asset management and communication. Guess what? Those are sweet spots. You've got to get better at both to be really good at what you do.
Moderator: When a disaster recovery planning team comes forward and says it needs X amount of dollars for redundancies, remote locations, and so on to safeguard the business, that becomes very expensive. I guess what I'm saying is, businesses don't mind doing the planning, but they faint over the cost of actually implementing it.
Wilson: Exactly. You wind up looking at a variety of scenarios that a company just can't afford to enact. That brings us back to the notion of how much pain and suffering the company can endure and still remain in business. That's where you draw the line.
Rooker: There are solutions, and then there are less expensive solutions. If you know what the potential loss is, then you know how much money you can afford to spend preventing it. That's called making an informed business decision. In this instance, you can determine what form of redundancy is important to keep the business viable.
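Rooker's point that a known potential loss sets a ceiling on prevention spending is commonly formalized as annualized loss expectancy (ALE = single loss expectancy × annualized rate of occurrence). The sketch below is a minimal illustration of that standard technique, not the panel's own method: it sums Browne's three loss categories (direct, indirect, consequential) into the single-loss figure. The class and function names, dollar amounts, occurrence rate, and the fraction of risk a mitigation removes are all hypothetical inputs.

```python
from dataclasses import dataclass


@dataclass
class LossScenario:
    """One disruption scenario from a risk analysis (all inputs are estimates)."""
    name: str
    direct_loss: float         # e.g. destroyed equipment and facilities
    indirect_loss: float       # cost of recovering and rebuilding
    consequential_loss: float  # lost revenue, reputation, or business
    annual_rate: float         # expected occurrences per year (ARO)

    def single_loss(self) -> float:
        # Single loss expectancy (SLE): total cost of one occurrence.
        return self.direct_loss + self.indirect_loss + self.consequential_loss

    def annualized_loss(self) -> float:
        # Annualized loss expectancy: ALE = SLE * ARO.
        return self.single_loss() * self.annual_rate


def spending_ceiling(scenario: LossScenario, risk_reduction: float) -> float:
    """Annual amount a mitigation could cost before it stops paying for itself.

    risk_reduction is the assumed fraction (0..1) of the ALE the mitigation removes.
    """
    if not 0.0 <= risk_reduction <= 1.0:
        raise ValueError("risk_reduction must be between 0 and 1")
    return scenario.annualized_loss() * risk_reduction


# Hypothetical figures, for illustration only.
hq_loss = LossScenario("loss of headquarters", 2_000_000, 1_500_000, 6_500_000, 0.02)
print(hq_loss.single_loss())  # 10000000
print(spending_ceiling(hq_loss, 0.8))
```

With these assumed inputs, the ALE is $200,000 a year, so a mitigation that removes 80% of that risk would be worth up to about $160,000 a year. Anything above that ceiling is the "can't afford to enact" territory Wilson describes.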
Moderator: Consider a small company that needs a disaster recovery plan, where the person responsible for it is just that: one person. Is disaster recovery planning too big for one person to handle?
Browne: I consulted for a small company with exactly that dilemma. And that probably answers the question. They saw the need for an outside eye to catch the flaws in the logic to ensure that every aspect had been considered. I guess putting all your eggs in one basket is something a company, even a small one, should not do.
Wilson: Sometimes it takes a third party to catch the obvious. A consultant will see it as his or her obligation to alert a company to various problems, because it's the consultant's reputation on the line. That's a healthy thing. And even large companies can benefit from outside assistance.
Moderator: The business continuity process appears to follow four major steps: first, a company has to outline the scenarios that would interrupt business; second, plan reasonable continuity measures to withstand those scenarios; third, test them; and last, put the plans into place once they're found to be adequate. Do I have it right?
Hassell: The DRI (Disaster Recovery Institute) is the grandparent of disaster recovery planning. It's the official body that oversees certification in this kind of planning. They've laid out a solution portfolio that addresses this assessment process. First comes risk assessment and business impact analysis. Strategic planning is then the result of that analysis. Third, the organization must get buy-in for plan development and implementation. The last step is plan maintenance.
Moderator: Is there any particular part of the process you'd like to comment on?
Hassell: As a precursor to entering this process, it is critically important, and of great benefit to a consultant, to have a disaster recovery policy in place. Things get far more difficult if that policy doesn't exist. But if it doesn't, a consultant can go in to help ratchet up the need, and then strive to be the champion. Even a simple three-line policy gets the ball rolling.
Controlled risk needs to be separated from uncontrolled risk. Controls need to be in place in the physical sense. For example, a guard outside the door and security cameras are physical controls that mitigate risk. The anthrax threat, by contrast, is an uncontrolled risk. Security and risk analysis go hand in hand.
Rooker: He's absolutely correct. With regard to the network, you apply analysis starting with components internal to the network. At the core, looking at its physical nature, you may have hardware failures, application failures, and so on. Then you work your way out to the next level, which may be the infrastructure, building security, and things like that. Peter may have a statistic on this, but a very high portion of outages involve the last mile of a network. One example is the local exchange carrier carrying the signal to the building. You could have an internal network that's rock solid, but the local exchange carrier facilities serving your location may provide only a single feed to the building. Another area of concern is redundancy and diversity at the central office serving your location. If you've got construction up the street and a construction worker is using a jackhammer, guess what? You're history. You have to determine how far out you need to take the analysis.
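Rooker's last-mile analysis comes down to asking which facilities every path into the building depends on. A minimal sketch, assuming each circuit can be described as the list of facilities it passes through (the function name and all facility names below are hypothetical): any facility shared by every circuit is a single point of failure, like a single feed from the local exchange carrier or a shared central office.

```python
from functools import reduce


def single_points_of_failure(circuits: dict[str, list[str]]) -> set[str]:
    """Return the facilities that every circuit into a site depends on.

    Each circuit is described (hypothetically) as the ordered list of
    facilities it passes through. A facility shared by all circuits is a
    single point of failure: if it is cut, every circuit goes down together.
    """
    if not circuits:
        return set()
    paths = [set(path) for path in circuits.values()]
    return reduce(set.intersection, paths)


# Hypothetical site: two carriers, but both feeds share the riser and one central office.
circuits = {
    "primary (carrier A)": ["building riser", "street conduit 1", "central office 1", "carrier A core"],
    "backup (carrier B)": ["building riser", "street conduit 2", "central office 1", "carrier B core"],
}
print(sorted(single_points_of_failure(circuits)))  # ['building riser', 'central office 1']
```

The check is only as good as the facility data behind it: two carriers may share a conduit or right-of-way without that showing up in the circuit records, which is why the answer depends on how far out you take the analysis.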
Browne: My experience has been that a majority of outages have been last mile outages.
Rooker: One extreme example of such an outage is the story of a farmer in Alabama who, while burying a cow with a backhoe, hit a fiber link and took out a majority of the East Coast. We have seen the enemy, and it is a backhoe!
Moderator: In terms of storage, companies can opt to outsource it to a storage service provider to ensure that data is kept in a different physical location, or use such a provider as a redundant facility. But there are caveats to that. Many of these providers are going belly up and giving three days' notice that you have to get your data out or it's toast.
Wilson: A leading hosting provider is in exactly that situation, having filed for Chapter 11 last month. You have to ask the fundamental question: "How critical is that data to my business, and am I going to trust it to somebody else?" Most companies aren't that comfortable with this particular market. If the data is less mission critical, or you can't afford to build one of these solutions yourself, you still have to do due diligence to ensure that the storage service provider is technically and financially viable. Then you have to ask yourself what your contingency plan is if they do go out of business. These days, companies are requiring contract clauses that guarantee conversion facilities to another provider in the event that the provider goes out of business. It's sad, but we have to anticipate that these companies might go belly up. This requires cash reserves on the provider's part to ensure that the work gets done. We're back to the concept of an insurance policy.
Moderator: Is there anything else we can highlight here with regard to storage or other services?
Hassell: Internet hosting centers are a similar example. An e-business that uses an outsourced hosting facility would be crazy not to build a contingency clause into its contract with the hosting center.
Browne: This applies any time you outsource a key component of your business. There needs to be an audit mentality applied to these situations. Contracts need to acknowledge back-up plans and the like.
Moderator: Are there any final comments any of you would like to make?
Browne: Recovery planning is increasingly part of the business life cycle. When you think about recoverability at the very beginning of a life cycle -- the requirements, architecture, design, construction, testing, implementation, and the post-mortem -- then a company is doing the right level of risk management. This is what we help companies do. Recoverability is a piece of the overall business concept.
One key point: as companies move into the so-called e-commerce space, speed is a critical variable. It used to be that you could do disaster recovery in 72 hours and satisfy the auditors. In recent years, it has shifted to 24 hours, then down to four hours. Today it may be that you have to recover in minutes. If companies understand this point and apply the right risk management to it, then they can identify the right technology to recover successfully.
Rooker: Buy-in from a company's management team is extremely important. Its importance in disaster recovery planning and implementation can't be overstated.
Wilson: Companies forget to look at the entire business process in their disaster recovery planning. You have to look at all your partners, suppliers, distribution points or channels, whatever. You can have your own systems covered, but if a critical piece of your business lies in someone else's shop, you've got to ensure they're covered as well. The plan needs to be comprehensive even outside your own space. A case in point: I don't think anybody in upper Manhattan expected to lose phone service from an event downtown. But they did. This will probably spawn other technologies, including voice over IP as redundant communications lines. Phone service was one service that was taken for granted.
Hassell: Recovery time, which relates to the pain threshold concept discussed earlier, brings financial impact into all of this. Each company is going to have to look at resumption, recovery, and restoration and place them on a scale of dollars and time.
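Hassell's scale of dollars and time, together with Browne's progression from 72-hour to 24-hour to 4-hour to minutes-scale recovery, can be sketched as picking the recovery option with the lowest expected annual cost: what the capability costs to maintain plus what the business expects to lose while it is down. This is a simplified illustration under stated assumptions, not a method the panel described. The option names, prices, hourly outage cost, and disaster frequency are hypothetical, and losses are assumed to grow linearly with downtime, whereas a real business impact analysis would likely show them accelerating.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    recovery_hours: float  # time to resume critical operations after a disaster
    annual_cost: float     # yearly cost of maintaining this capability


def expected_annual_cost(option: RecoveryOption, outage_cost_per_hour: float,
                         disasters_per_year: float) -> float:
    # Simplifying assumption: losses grow linearly with downtime.
    expected_outage_loss = disasters_per_year * option.recovery_hours * outage_cost_per_hour
    return option.annual_cost + expected_outage_loss


def cheapest_option(options: list[RecoveryOption], outage_cost_per_hour: float,
                    disasters_per_year: float) -> RecoveryOption:
    # Pick the option that minimizes capability cost plus expected outage loss.
    return min(options, key=lambda o: expected_annual_cost(o, outage_cost_per_hour, disasters_per_year))


# Hypothetical tiers echoing the 72-hour, 24-hour, 4-hour, and minutes targets.
options = [
    RecoveryOption("cold site, 72 h", 72, 50_000),
    RecoveryOption("warm site, 24 h", 24, 150_000),
    RecoveryOption("hot site, 4 h", 4, 400_000),
    RecoveryOption("mirrored, minutes", 0.25, 1_200_000),
]

for hourly_loss in (50_000, 2_000_000, 10_000_000):
    best = cheapest_option(options, hourly_loss, disasters_per_year=0.05)
    print(f"${hourly_loss:,}/hour of outage -> {best.name}")
```

With these assumed inputs, the warm site is cheapest at $50,000 per hour of outage, the hot site at $2,000,000, and the mirrored configuration at $10,000,000. The higher the transaction velocity, the shorter the recovery time worth paying for, which matches Browne's point that real-time businesses have a lower pain threshold.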
© Copyright 2002 Predictive Systems, Inc. All Rights Reserved.