Thursday, November 7, 2013

IT Contingency Planning

James E. Gilbert
UMUC
August 2, 2013

Abstract
Modern organizations increasingly rely on information technology (IT) to conduct their daily activities.  As a result, ensuring the resiliency of this asset has become a critical component for most enterprises.  From hurricanes and power outages to cyberattacks, public agencies and private businesses alike face a myriad of threats from both manmade and natural causes.  To mitigate these risks, it is imperative that organizations devise an appropriate contingency plan that incorporates backups and safeguards for IT infrastructure.   The following paper outlines the various planning steps, recovery operations and testing requirements necessary to ensure a successful business continuity plan with a 24-month proposal to adequately test the preparations.  Although maintaining a comprehensive contingency plan requires a significant expenditure of personnel, equipment, and production costs, not developing a backup often proves far more costly.

Introduction
According to a study conducted by McGladrey and Pullen LLP, 43% of companies that experience a disruptive event lasting 10 days never reopen.  Another 51% of firms continue to operate for up to two years following a major data outage, with only 6% of businesses surviving in the long term (Tittel & Korelc, 2013).  Given the necessity for business continuity as well as the increased dependence on IT systems and services, ensuring the availability of these resources has become paramount in the contingency planning cycle.  While business continuity planning should be tailored to each organization, a number of similarities exist across all plans, and the Disaster Recovery Institute International (DRII) has identified these common tasks (Vacca, 2009).  These steps include planning activities such as conducting a business impact analysis (BIA) and risk assessment.  Organizations must also determine recovery options by identifying relevant risks, selecting appropriate strategies, and developing a comprehensive contingency plan.  Finally, continuity operations must incorporate a verification component.  This includes personnel training, periodic testing, and maintenance of the plan as changes in the organizational mission or structure occur.  Each step of the contingency planning process is important to the overall success of the enterprise.  Ensuring continuity throughout a disaster requires that appropriate resources be allocated to critical systems within an organization, which necessitates a strong commitment from a firm’s senior management.

Planning
The first step in designing a relevant business contingency arrangement is planning.  This stage requires an organization to weigh the returns from any proposed safeguards.  The frequency and severity of an outage should be assessed when determining the amount of resources to devote to this process.  Applying these considerations to IT resources may be difficult for some firms.  It is often hard to assess the exact level of impact an intrusion can have on a firm, because cyberattacks range from amateur denial-of-service (DoS) attacks to advanced persistent threats (APTs) perpetrated by nation states.  Moreover, the rate of occurrence of cyberattacks is difficult to estimate for organizations that have never experienced one.  In these instances, it is the responsibility of the cybersecurity professional to make a convincing case for the incorporation of IT resources into the business continuity plan.  This often requires computer security personnel to demonstrate the anticipated return on investment (ROI) that adequate planning will provide an organization (UMUC, 2013).

According to the DRII, this stage should include both a BIA and a risk assessment.  The BIA assesses the potential toll an outage can take on a critical business area.  Conducting this analysis requires stakeholders to identify key value drivers within the firm.  These are the elements within an organization deemed most critical to long-term operations.  Examples of value drivers include components such as intellectual property or data operations (Vacca, 2009).  The amount of resources a firm allocates to system restoration depends on the level of impact an outage is anticipated to have on daily operations.  The BIA assists managers in designing a hierarchy to determine which activities or areas should be reestablished first (Slater, 2012).

The second component in the planning stage is a risk assessment.  This step requires enterprises to perform an objective analysis of probable and possible risks that could affect daily operations.  The assessment should cover the types of disasters or outages historically encountered, considering both the anticipated frequency of occurrence and the impact each incident is expected to have on the organization.  With this data, managers can then make an educated decision on how much investment is required to mitigate the impact from potential outages.
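
The weighing of frequency against impact described above can be made concrete with a short calculation.  The sketch below is illustrative only: the risk names, annual frequencies, and dollar impacts are hypothetical figures rather than data from any cited source, and expected annual loss (frequency multiplied by impact) is one common way to rank risks, not a method prescribed by the DRII.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    annual_frequency: float  # expected occurrences per year (hypothetical)
    impact_usd: float        # estimated loss per occurrence (hypothetical)

    @property
    def expected_annual_loss(self) -> float:
        # Frequency times impact gives a rough yearly exposure figure.
        return self.annual_frequency * self.impact_usd


def rank_risks(risks):
    """Return risks sorted from highest to lowest expected annual loss."""
    return sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)


# Hypothetical risk register, for illustration only.
RISKS = [
    Risk("Regional power outage", annual_frequency=0.5, impact_usd=40_000),
    Risk("Hurricane", annual_frequency=0.1, impact_usd=500_000),
    Risk("Denial-of-service attack", annual_frequency=2.0, impact_usd=15_000),
]

if __name__ == "__main__":
    for risk in rank_risks(RISKS):
        print(f"{risk.name}: ${risk.expected_annual_loss:,.0f} per year")
```

Under these assumed figures, a rare but severe event outranks a frequent but minor one, which is the kind of comparison the risk assessment is meant to surface before money is committed.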

Recovery Operations
The second major component of the contingency planning process that the DRII identified deals with recovery strategies.  This includes identifying continuity options based on various scenarios, selecting the strategy most applicable to an organization’s needs, and developing a continuity plan based on this data (Vacca, 2009).  Although the details vary greatly depending on the incident, the general theme should always focus on communication.  Contingency plans must specify how organizations transmit information in the event of an emergency as well as how employees will talk to each other if normal communication channels are broken.  While some companies may value IT resources most and other firms rely more heavily on supply chain logistics, every contingency arrangement should be planned and coordinated with business, security, and IT managers working together to ensure continuity of operations (Slater, 2012).

As IT resources become increasingly important in the modern business community, the number and types of disasters that an organization may encounter have risen significantly.  Where past disasters included natural occurrences such as hurricanes or floods, enterprises today must also consider outages to their networks caused by manmade sources.  This increase in the number of potential outages has led to the creation of a variety of third-party service providers.  Modern enterprises no longer have to create contingency plans from scratch.  A number of companies offer specialized continuity planning software, while others provide turnkey arrangements to facilitate backup operations.  From data centers to mobile recovery services, Gartner estimates this area represents a $3 to $4 billion industry (Collett, 2007).  Although organizations considering outsourcing this area have a number of options, the primary consideration for IT resources “…requires that the company install backup and recovery systems to override any type of crisis in support of physical and digital security” (UMUC, 2013, p. 8).

Physical
The physical security aspect of a contingency plan includes ensuring that an alternate offsite location is available in the event of an emergency.  This includes not only physical office space but also the IT resources necessary to continue operations during an outage.  From servers and networks to data backups, this component provides a means of ensuring parallel operations.  Backup sites can range from physical locations with minimal infrastructure to sites that fully imitate current operations.  From locations owned and operated by the enterprise to reciprocal agreements with similar firms, organizations have a number of recovery options available.  Like many aspects of business continuity, the level of physical preparedness is often dictated by financial considerations.

Physical backup locations generally fall into three main categories ranging from basic to advanced: cold sites, warm sites, and hot sites (Swanson, Bowen, Phillips, & Gallup, 2010).  Cold sites are facilities with the lowest level of preparation and accordingly are often the least expensive to maintain.  These locations usually have minimal infrastructure in place beyond electricity and environmental controls.  As a result, cold sites require the longest lead time to set up and become fully operational.  The next type of backup facility is a warm site.  These locations have more preparations in place than cold sites and as such are also more expensive to maintain.  Warm sites are usually partially furnished with some or all IT resources and telecommunication equipment already in place.  Accordingly, these facilities require less time to activate than cold sites.  The last category of physical backup locations is a hot site.  “Hot sites are facilities appropriately sized to support system requirements and configured with the necessary system hardware, supporting infrastructure, and support personnel” (Swanson et al., 2010, p. 22).  These locations require the least amount of time to become active, with some maintaining a full-time staff.  As a result, hot sites represent the most expensive scenario for most organizations.

Digital
The second major component in assessing recovery options revolves around digital security considerations.  Although infrastructure and personnel are critical aspects in contingency planning, business continuity must also take into account data backups.  Inherent in this process is a multitude of questions and technologies.  Similar to physical security planning, this area is also heavily influenced by cost considerations (UMUC, 2013).

Depending on mission requirements, enterprises may choose any number of methods to back up digital media, databases, or proprietary data.  Decisions on how often data is backed up and to what extent should be guided by the critical nature of the information.  Organizational policy should be clear in dictating the frequency and scope of information archiving.  Additional considerations should include the location of media, the frequency of data rotation, and the method of transmitting data to an offsite location.  The National Institute of Standards and Technology (NIST) issues Federal Information Processing Standards Publication (FIPS) 199, Standards for Security Categorization of Federal Information and Information Systems.  FIPS 199 defines low, moderate, and high impact levels, and NIST’s contingency planning guidance ties its recommended recovery strategies to the level of impact an outage is anticipated to have on an organization.  NIST recommends tape backups and a cold site for low-impact systems.  Outages anticipated to have a moderate effect on daily operations should be mitigated with optical backups and WAN/VLAN replication as well as a cold or warm site.  Finally, NIST recommends a backup strategy that includes mirrored systems and a hot site location for severe disruptions to an organization’s most mission-critical systems (Swanson et al., 2010).
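
The pairing of impact level and recovery strategy summarized above lends itself to a simple lookup.  The sketch below encodes only the three pairings stated in this paper; the dictionary and function names are hypothetical, the labels "low", "moderate", and "high" are the FIPS 199 impact levels, and the NIST guidance itself remains the authoritative source for the full table.

```python
# Recovery strategies keyed by FIPS 199 impact level, as summarized in the
# paragraph above. The structure and names are illustrative assumptions.
RECOVERY_STRATEGIES = {
    "low": {"backup": ["tape backup"], "site": ["cold site"]},
    "moderate": {
        "backup": ["optical backup", "WAN/VLAN replication"],
        "site": ["cold site", "warm site"],
    },
    "high": {"backup": ["mirrored systems"], "site": ["hot site"]},
}


def recommend(impact_level: str) -> dict:
    """Look up the backup methods and site types for an impact level."""
    key = impact_level.strip().lower()
    if key not in RECOVERY_STRATEGIES:
        raise ValueError(f"Unknown impact level: {impact_level!r}")
    return RECOVERY_STRATEGIES[key]


if __name__ == "__main__":
    for level in ("low", "moderate", "high"):
        options = recommend(level)
        print(f"{level}: {', '.join(options['backup'])}; {' or '.join(options['site'])}")
```

Keeping the mapping in one table makes it easy for policy authors to review and for system owners to query once a system has been categorized.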

As more organizations choose to back up their critical data, the number of companies providing data archiving has increased in turn.  From data centers providing cloud storage to commercial vendors offering full-service transportation and restoration services, modern organizations have a number of alternatives to choose from.  Enterprises that retain third-party providers should weigh a variety of criteria.  Geographic location could become an issue if the vendor is close enough to the customer to also be affected by an outage.  Other deciding factors should include the accessibility of the stored data, security of the archived media, environmental considerations and, of course, cost (Swanson et al., 2010).

Testing Requirements
The third major category the DRII associates with business continuity is the verification, maintenance, and personnel training associated with a disaster recovery plan.  Testing contingency preparations is an important component in this process.  Ensuring relevant personnel are adequately trained for their role during an outage helps guarantee a smooth operation during an actual event.  Additionally, a business continuity plan should be thought of as a living document.  Enterprises should periodically reassess and update contingency plans as mission requirements or organizational structure change.  Finally, verifying the accuracy and capability of a plan also provides an additional measure of preparedness prior to an actual incident (Vacca, 2009).

Tabletop and Functional Exercises
According to NIST, the two main evaluations are tabletop and functional exercises (Grance, Nolan, Burke, Dudley, White & Good, 2006).  Tabletop exercises are discussion-based activities where participants role-play their responsibilities during a simulated emergency.  These types of evaluations are usually conducted in an informal classroom setting with personnel discussing their roles and actions during an outage.  A facilitator guides participants through one or more scenarios in an attempt to meet previously defined objectives.  Depending on the number of scenarios and the detail involved, tabletop exercises can last anywhere from two to eight hours.  This type of evaluation represents the most cost-effective means of testing the viability of a business continuity plan.  Tabletop tests provide a forum for team members to demonstrate their emergency knowledge and give managers the ability to review contingency plans for errors, missing information, or inconsistencies (Kirvan, 2009).

The other commonly utilized validation activity is a functional exercise.  This evaluation is also scenario-driven, but instead of being discussion-based, functional exercises employ a simulated operational environment.  These types of evaluations are designed to test various aspects of an IT plan, including personnel, procedures, or equipment.  Components to test can include recovery site operations, backup systems, and any third-party continuity services (Kirvan, 2009).  Functional or simulated exercises can vary in size and scope and can cover a single component or a full-scale evaluation of an enterprise.  As a result, these tests can last anywhere from several hours to several days and often represent the most costly and time-consuming of the continuity evaluation tools (Grance et al., 2006).  Although they require a significant expenditure of resources, functional exercises are also one of the most effective methods of testing a disaster recovery plan prior to an actual event.

Alternate Testing
Although tabletop and functional exercises are the two most commonly utilized methods of evaluation, the commercial vendor Search Disaster Recovery also recommends a variety of alternate tests, including plan reviews, orientation tests, and drills (Kirvan, 2009).  In a plan review, participants discuss the proposed business continuity plan in an informal setting.  This step is similar to a tabletop exercise, albeit without a scenario.  Orientation tests introduce participants to the contingency plan and help orient new staff to the disaster recovery policies and procedures of an organization.  Testing time for this evaluation can be as little as an hour, and it should be considered a component of the employee training curriculum.  Finally, drills provide an impromptu method of testing staff on established emergency procedures.  These types of evaluations provide training under realistic conditions and are routinely used for response to natural disasters.

24-Month Testing Plan
Testing the validity of a continuity plan encompasses a number of different exercises.  With a variety of activities available to an organization, the key is to incorporate annual testing into the overall disaster recovery process.  From drills to full-scale events, each activity possesses both merits in the form of preparation and drawbacks in the form of time and financial expenditures.  Finding a balance between an adequate amount of testing and a sufficient level of resource allocation is often the primary difficulty for organizations.  In addition to the actual amount of time needed to conduct the exercise, a far greater amount of time is necessary for “preparation and execution, funding, careful planning and a structured process from pre-test through test and post-test evaluation” (Kirvan, 2009).  Optimally, the financial considerations of any continuity plan should be based on organizational needs, including the “…maximum tolerable period of disruption and recovery time from which the specific measures will be based on” (Pinta, 2011, p. 57).  To determine the amount of money that should be spent on contingency planning and preparations, enterprises must consider factors such as the maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO).  For most organizations, the longer an outage lasts, the more costly it becomes.  As a result, firms must balance the costs necessary to recover from an emergency with the cost of disruption to daily operations.  Plotting these two costs on a graph allows managers to visualize the optimal cost balance point that should be allocated to business continuity planning (Swanson et al., 2010).  In its Special Publication 800-53, NIST requires federal agencies to test contingency plans on an annual basis at a minimum (Grance et al., 2006).  This provides a solid starting point for the continuity planning cycle.
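
The cost balance described above can be sketched numerically.  In the example below, both cost curves are hypothetical assumptions chosen for illustration: disruption cost is assumed to grow linearly with outage length, and recovery cost is assumed to fall as the tolerated recovery time grows.  The balance point is taken here as the recovery time with the lowest combined cost, which is one reasonable reading of the graph; real curves would come from the BIA.

```python
def disruption_cost(hours: float, loss_per_hour: float = 5_000.0) -> float:
    """Assumed linear business loss for an outage lasting `hours`."""
    return loss_per_hour * hours


def recovery_cost(hours: float, base_cost: float = 480_000.0) -> float:
    """Assumed cost of a solution that restores service within `hours`.

    Faster recovery (smaller `hours`) is modeled as more expensive.
    """
    return base_cost / hours


def balance_point(candidate_hours):
    """Return the candidate recovery time with the lowest combined cost."""
    return min(
        candidate_hours,
        key=lambda h: disruption_cost(h) + recovery_cost(h),
    )


if __name__ == "__main__":
    candidates = range(1, 73)  # recovery times from 1 to 72 hours
    best = balance_point(candidates)
    total = disruption_cost(best) + recovery_cost(best)
    print(f"Lowest combined cost at {best} hours: ${total:,.0f}")
```

With these assumed curves, spending more to recover faster than the balance point costs more than the disruption it avoids, while spending less leaves the organization exposed to larger outage losses.  The resulting recovery time would still need to fall within the MTD.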

Full-scale and Functional Testing
Full-scale tests, which represent the most comprehensive assessment tool, also require the greatest amount of testing and planning time.  These exercises typically last anywhere from two to eight hours, but require a minimum of four months to plan.  Full-scale tests are also expensive and may be disruptive to daily operational activities (Kirvan, 2009).  As a result, a comprehensive test of all IT systems should take place every one to two years.  The exercise should encompass all aspects of a business continuity plan from evacuating the primary site to activating the backup location.  All IT and communication resources should be evaluated during this process to include “…settings of backup policy, data replication, high availability systems, active and passive devices, local mirror of systems and/or data and use of disk protection technology such as RAID technology” (Pinta, 2011, p. 61).  Due to the cost and time necessary to execute this type of plan, organizations should also consider smaller scale functional tests.  These events exercise only a portion of the continuity operation and as such may be planned in as little as three months.  The actual testing usually lasts two to four hours and causes less disruption to an organization’s daily activities (Kirvan, 2009). 

Drills, Orientation and Tabletop Testing
In addition to full-scale and functional exercises, organizations should also consider limited training events that require less planning and can be executed frequently throughout the year.  Orientation tests should be given to all new personnel in order to provide a solid foundation in an organization’s continuity operations and often require only a month to plan and an hour to deliver.  Drills on the most likely emergency scenarios should be conducted quarterly.  This includes exercises such as tornado or earthquake tests, fire drills, and communication plans.  Testing time for these events can be as little as 10 minutes with a planning cycle of one month.  Lastly, tabletop tests should be incorporated into an organization’s contingency preparations to refine the overall continuity plan.  These events should be conducted just prior to a functional or full-scale test every one to two years.  The planning cycle for these events ranges from two to three months, and they can be executed in approximately three hours depending on the size of the organization and the scope of the plan (Kirvan, 2009).  Integrating smaller scale exercises into an enterprise’s planning process allows for more frequent tests.  This in turn gives managers more opportunities to identify weaknesses in the continuity plan and provides employees more opportunities to practice their assigned duties in the event of an emergency.
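
The cadence described above can be laid out as a 24-month calendar.  The sketch below is one possible arrangement, not a prescribed schedule: the choice of months is a hypothetical assumption, with drills placed quarterly, a tabletop test placed one month before each larger exercise, a functional test in year one, and a full-scale test in year two.  Orientation tests are omitted because they are tied to hiring rather than to the calendar.

```python
from collections import defaultdict


def build_schedule(months: int = 24) -> dict:
    """Map month numbers (1-based) to the exercises planned for that month."""
    schedule = defaultdict(list)

    # Quarterly drills: months 3, 6, 9, and so on.
    for month in range(3, months + 1, 3):
        schedule[month].append("drill")

    # Hypothetical placement of the larger exercises, one per year,
    # each preceded by a tabletop test in the prior month.
    major_exercises = {12: "functional test", 24: "full-scale test"}
    for month, exercise in major_exercises.items():
        if month <= months:
            schedule[month - 1].append("tabletop test")
            schedule[month].append(exercise)

    return dict(sorted(schedule.items()))


if __name__ == "__main__":
    for month, events in build_schedule().items():
        print(f"Month {month:2d}: {', '.join(events)}")
```

Because the full-scale test needs at least four months of planning and the functional test about three, preparation for each would begin well before the month in which it appears.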

Conclusion
As organizations increasingly rely on IT resources for daily operations, the number and variety of potential risks have risen significantly.  Modern enterprises must consider the impact a network outage would have on their business as well as the effects of traditional natural and manmade disasters.  Perhaps now more than ever, companies and agencies alike must ensure they have adequate disaster recovery and contingency plans in place prior to an actual emergency.  A business continuity plan should be tailored to meet an organization’s specific mission and requirements.  Threats and critical assets should be objectively identified using tools such as business impact analyses and risk assessments.  These evaluations can then be used to develop a contingency plan and the training and testing requirements necessary to maintain the emergency preparations.  Finally, a business continuity plan will only succeed if adequate resources, personnel, and time are allocated to the practice.  This requires support from senior management throughout the entire contingency planning process.

References
Collett, S. (2007). Evaluating business continuity services. CSO Security and Risk. Retrieved

Grance, T., Nolan, T., Burke, K., Dudley, R., White, G., & Good, T. (2006). Guide to test, training, and exercise programs for IT plans and capabilities. NIST. Retrieved from http://csrc.nist.gov/publications/nistpubs/800-84/SP800-84.pdf

Kirvan, P. (2009). Business continuity and disaster recovery testing templates. Search Disaster

Pinta, J. J. (2011). Disaster recovery planning as part of business continuity management. Agris Online Papers in Economics & Informatics, 3(4), 55-61.

Slater, D. (2010). Business continuity and disaster recovery planning: The basics. CSO

Swanson, M., Bowen, P., Phillips, A. W., & Gallup, D. (2010). Contingency planning for federal

Tittel, E., & Korelc, J. (2013). Understanding the need for business continuity management and

University of Maryland University College (UMUC). (2013). Module 11: Service restoration and business continuity. CSEC 650: Cybercrime Investigation and Digital Forensics. Retrieved from http://tychousa1.umuc.edu

Vacca, J. R. (2009). Computer and information security. Burlington, MA: Morgan Kaufmann Publishers.

