
Friday, January 29, 2016

Network Performance Optimization (Fault Tolerance)

Introduction

As far as network administration goes, nothing is more important than fault tolerance and disaster recovery. First and foremost, it is the responsibility of the network administrator to safeguard the data held on the servers and to ensure that, when requested, this data is ready to go.
Because both fault tolerance and disaster recovery are such an important part of network administration, they are well represented on the CompTIA Network+ exam. In that light, this chapter is important in terms of both real-world application and the exam itself.



What Is Uptime?

All devices on the network, from routers to cabling, and especially servers, must have one prime underlying trait: availability. Networks play such a vital role in the operation of businesses that their availability must be measured in dollars. The failure of a single desktop PC affects the productivity of a single user. The failure of an entire network affects the productivity of the entire company and potentially the company’s clients as well. A network failure might have an even larger impact than that as new e-commerce customers look somewhere else for products, and existing customers start to wonder about the site’s reliability.
Every minute that a network is not running can potentially cost an organization money. The exact amount depends on the role that the server performs and how long it is unavailable. For example, if a small departmental server supporting 10 people goes down for one hour, this might not be a big deal. If the server that runs the company’s e-commerce website goes down for even 10 minutes, it can cost hundreds of thousands of dollars in lost orders.
The importance of data availability varies between networks, but it dictates to what extent a server/network implements fault tolerance measures. The projected capability for a network or network component to weather failure is defined as a number or percentage. The fact that no solution is labeled as providing 100 percent availability indicates that no matter how well we protect our networks, some aspect of the configuration will fail sooner or later.
So how expensive is failure? In terms of equipment replacement costs, it's not that high. In terms of how much it costs to actually fix the problem, it is a little more expensive. The actual cost of downtime is the biggest factor. For businesses, downtime impacts the functionality and productivity of operations. The longer the downtime, the greater the business loss.
Assuming that you know you can never really obtain 100% uptime, what should you aim for? Consider this. If you were responsible for a server system that was available 99.5% of the time, you might be satisfied. But if you realized that you would also have 43.8 hours of downtime each year—that’s one full workweek and a little overtime—you might not be so smug. Table 8.1 compares various levels of downtime.



Table 8.1  Levels of Availability and Related Downtime

Level of Availability        Availability %    Downtime Per Year
Commercial availability      99.5%             43.8 hours
High availability            99.9%             8.8 hours
Fault-resilient clusters     99.99%            53 minutes
Fault-tolerant               99.999%           5 minutes
Continuous                   100%              0
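To see where the downtime figures in Table 8.1 come from, here is a short Python sketch that derives annual downtime from an availability percentage and, purely as an illustration, multiplies it by an assumed $1,000-per-hour business impact (the same figure used in the example that follows).

```python
# Sketch: derive annual downtime from an availability percentage.
# The $1,000/hour cost is an illustrative assumption, not a benchmark.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a system is expected to be unavailable."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

def annual_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Estimated yearly cost of downtime at a given hourly business impact."""
    return annual_downtime_hours(availability_percent) * cost_per_hour

for label, pct in [("Commercial availability", 99.5),
                   ("High availability", 99.9),
                   ("Fault-resilient clusters", 99.99),
                   ("Fault-tolerant", 99.999)]:
    hours = annual_downtime_hours(pct)
    print(f"{label:25s} {pct:7.3f}%  ~{hours:6.2f} h/yr  "
          f"~${annual_downtime_cost(pct, 1000):,.0f}/yr at $1,000/h")
```

Running this reproduces the 43.8-hour and 8.8-hour figures in the table (the two highest levels work out to roughly 53 and 5 minutes).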

These figures make it simple to justify spending money on implementing fault tolerance measures. Just remember that even to reach the definition of commercial availability, you will need to have a range of measures in place. After the commercial availability level, the strategies that take you to each subsequent level are likely to be increasingly expensive, even though they might be easy to justify.
For example, if you estimate that each hour of server downtime will cost the company $1,000, the elimination of 35 hours of downtime (from 43.8 hours for commercial availability to 8.8 hours for high availability) justifies some serious expenditure on technology. Although this first jump is an easily justifiable one, subsequent levels might not be so easy to sell. Working on the same basis, moving from high availability to fault-resilient clusters equates to less than $10,000 of avoided downtime, but the equipment, software, and skills required to move to the next level will far exceed this figure. In other words, increasing fault tolerance is a law of diminishing returns. As your need to reduce the possibility of downtime increases, so does the investment required to achieve this goal.
The role played by the network administrator in all of this can be somewhat challenging. In some respects, you must function as if you are selling insurance. Informing management of the risks and potential outcomes of downtime can seem a little sensational, but the reality is that the information must be provided if you are to avoid post-event questions about why management was not made aware of the risks. At the same time, you need a realistic evaluation of exactly which risks are present, along with a realistic estimate of the amount of downtime each failure might bring.



The Risks

Having established that you need to guard against equipment failure, you can now look at which pieces of equipment are more liable to fail than others. In terms of component failure, the hard disk is responsible for 50 percent of all system downtime. With this in mind, it should come as no surprise that hard disks have garnered the most attention when it comes to fault tolerance. Redundant Array of Inexpensive Disks (RAID), which is discussed in detail in this chapter, is a set of standards that allows servers to cope with the failure of one or more hard disks.





In fault tolerance, RAID is only half the story. Measures are in place to cope with failures of most other components as well. In some cases, fault tolerance is an elegant solution, and in others, it is a simple case of duplication. We'll start our discussion by looking at RAID, and then we'll move on to other fault tolerance measures.

Fault Tolerance

As far as computers are concerned, fault tolerance refers to the capability of the computer system or network to provide continued data availability in the event of hardware failure. Every component within a server, from the CPU fan to the power supply, has a chance of failure. Some components such as processors rarely fail, whereas hard disk failures are well documented.
Almost every component has fault tolerance measures. These measures typically require redundant hardware components that can easily or automatically take over when a hardware failure occurs.
Of all the components inside computer systems, the hard disks require the most redundancy. Not only are hard disk failures more common than for any other component, but hard disks also hold the data, without which there would be little need for a network.








Disk-Level Fault Tolerance

Deciding to have hard disk fault tolerance on the server is the first step; the second is deciding which fault tolerance strategy to use. Hard disk fault tolerance is implemented according to different RAID levels. Each RAID level offers differing amounts of data protection and performance. The RAID level appropriate for a given situation depends on the importance placed on the data, the difficulty of replacing that data, and the associated costs of a respective RAID implementation. Often, the costs of data loss and replacement outweigh the costs associated with implementing a strong RAID fault tolerance solution. RAID can be deployed through dedicated hardware, which is more costly, or it can be software-based. Today's network operating systems, such as UNIX and Windows server products, have built-in support for RAID.

RAID 0: Stripe Set Without Parity
Although it's given RAID status, RAID 0 does not actually provide any fault tolerance. In fact, using RAID 0 might even be less fault-tolerant than storing all your data on a single hard disk.
RAID 0 combines unused disk space on two or more hard drives into a single logical volume, with data written to equally sized stripes across all the disks. Using multiple disks, reads and writes are performed simultaneously across all drives. This means that disk access is faster, making the performance of RAID 0 better than other RAID solutions and significantly better than a single hard disk. The downside of RAID 0 is that if any disk in the array fails, the data is lost and must be restored from backup.
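As a purely illustrative sketch of the striping idea (the stripe size, drive count, and data are arbitrary, and real RAID 0 is handled by the controller or OS driver), the following Python snippet deals fixed-size stripes out to the drives in turn:

```python
# Illustrative only: round-robin striping of data blocks across an array.
# A real RAID 0 implementation works at the stripe level in the controller
# or OS driver; this just shows how the data ends up spread across disks.

def stripe_blocks(data: bytes, num_disks: int, stripe_size: int = 4):
    """Split data into stripes and deal them out to the disks in turn."""
    disks = [bytearray() for _ in range(num_disks)]
    stripes = [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]
    for index, chunk in enumerate(stripes):
        disks[index % num_disks].extend(chunk)
    return disks

for i, contents in enumerate(stripe_blocks(b"ABCDEFGHIJKLMNOP", 3)):
    print(f"Disk {i}: {bytes(contents)}")
```

Because every file is spread across all the drives, losing any one drive loses part of every file, which is why the array as a whole fails when a single disk does.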



Because of its lack of fault tolerance, RAID 0 is rarely implemented. 



[Figure: RAID 0 striping data across Disk 0, Disk 1, and Disk 2]



RAID 1
One of the more common RAID implementations is RAID 1. RAID 1 requires two hard disks and uses disk mirroring to provide fault tolerance. When information is written to the hard disk, it is automatically and simultaneously written to the second hard disk. Both of the hard disks in the mirrored configuration use the same hard disk controller; the partitions used on the hard disks need to be approximately the same size to establish the mirror. In the mirrored configuration, if the primary disk were to fail, the second mirrored disk would contain all the required information, and there would be little disruption to data availability. RAID 1 ensures that the server will continue operating in the case of primary disk failure.
A RAID 1 solution has some key advantages. First, it is cheap in terms of cost per megabyte of storage, because only two hard disks are required to provide fault tolerance. Second, no additional software is required to establish RAID 1, because modern network operating systems have built-in support for it. Third, unlike RAID levels that use striping, which often cannot include a boot or system partition in the fault tolerance solution, a mirror can protect these partitions. Finally, RAID 1 offers load balancing over multiple disks, which increases read performance over that of a single disk. Write performance, however, is not improved.
Because of its advantages, RAID 1 is well suited as an entry-level RAID solution, but it has a few significant shortcomings that exclude its use in many environments. It has limited storage capacity: two 100GB hard drives provide only 100GB of storage space. Organizations with large data storage needs can exceed a mirrored solution's capacity in very short order. RAID 1 also has a single point of failure, the hard disk controller. If it were to fail, the data would be inaccessible on either drive.

An extension of RAID 1 is disk duplexing. Disk duplexing is the same as mirroring, with the exception of one key detail: it places the hard disks on separate hard disk controllers, eliminating the single point of failure.



RAID 5
RAID 5, also known as disk striping with parity, uses distributed parity to write information across all disks in the array. Unlike the striping used in RAID 0, RAID 5 includes parity information in the striping, which provides fault tolerance. This parity information is used to re-create the data in the event of a failure. RAID 5 requires a minimum of three disks, with the equivalent of a single disk being used for the parity information. This means that if you have three 40GB hard disks, you have 80GB of storage space, with the other 40GB used for parity. To increase storage space in a RAID 5 array, you need only add another disk to the array. Depending on the sophistication of the RAID setup you are using, the RAID controller will either incorporate the new drive into the array automatically, or you will need to rebuild the array and restore the data from backup.
Many factors have made RAID 5 a very popular fault-tolerant design. RAID 5 can continue to function in the event of a single drive failure: if a hard disk in the array were to fail, the parity information is used to re-create the missing data, and the array continues to function on the remaining drives. The read performance of RAID 5 is also improved over that of a single disk.
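The parity that RAID 5 relies on is essentially a bytewise XOR of the data stripes. The following simplified Python sketch (two data "disks" and one parity block, ignoring the rotation of parity across drives that a real array performs) shows why a single lost block can be rebuilt from the survivors:

```python
# Simplified illustration of RAID 5-style parity: XOR of the data blocks.
# Real arrays rotate parity across disks and work on fixed-size stripes;
# this sketch only shows why one lost block is recoverable.

def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

disk0 = b"AAAA"
disk1 = b"BBBB"
parity = xor_blocks(disk0, disk1)        # written to the parity stripe

# Disk 1 fails: rebuild its contents from the surviving data plus parity.
rebuilt_disk1 = xor_blocks(disk0, parity)
assert rebuilt_disk1 == disk1
print("Recovered:", rebuilt_disk1)
```

The same XOR property is what makes rebuilds expensive: every surviving disk must be read to regenerate each missing block.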





The RAID 5 solution has only a few drawbacks:
. The costs of implementing RAID 5 are initially higher than those of other fault tolerance measures because it requires a minimum of three hard disks. Given the costs of hard disks today, this is a minor concern. However, when it comes to implementing a RAID 5 solution, hardware RAID 5 is more expensive than a software-based RAID 5 solution.
. RAID 5 suffers from poor write performance because the parity has to be calculated and then written across several disks. The performance lag is minimal, however, and doesn't make a noticeable difference on the network.
. When a new disk is placed in a failed RAID 5 array, there is a regeneration period while the data is being rebuilt on the new drive. This process requires extensive resources from the server.




RAID 10
Sometimes RAID levels are combined to take advantage of the best of each. One such strategy is RAID 10, which combines RAID levels 1 and 0. In this configuration, four disks are required. As you might expect, the configuration consists of a mirrored stripe set. To some extent, RAID 10 takes advantage of the performance capability of a stripe set while offering the fault tolerance of a mirrored solution. As well as having the benefits of each, though, RAID 10 inherits the shortcomings of each strategy. In this case, the high overhead and decreased write performance are the disadvantages.
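The capacity trade-offs of the levels discussed above can be summarized in a small helper. This is only a sketch that assumes identically sized disks; the formulas simply restate the descriptions in this section (RAID 0 uses all space, RAID 1 and 10 lose half to mirroring, and RAID 5 loses one disk's worth to parity).

```python
def usable_capacity_gb(level: str, disks: int, disk_gb: float) -> float:
    """Usable space for an array of identically sized disks (sketch)."""
    if level == "RAID 0":
        return disks * disk_gb                  # striping only, no redundancy
    if level == "RAID 1":
        if disks != 2:
            raise ValueError("RAID 1 mirrors exactly two disks")
        return disk_gb                          # 50 percent overhead
    if level == "RAID 5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_gb            # one disk's worth holds parity
    if level == "RAID 10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, minimum four")
        return (disks // 2) * disk_gb           # mirrored stripe set
    raise ValueError(f"unknown level: {level}")

print(usable_capacity_gb("RAID 5", 3, 40))   # 80.0, matching the example above
print(usable_capacity_gb("RAID 1", 2, 100))  # 100.0, matching the mirroring example
```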




Table 8.2  Summary of RAID Levels

RAID 0 (disk striping). Advantages: increased read and write performance; can be implemented with two or more disks. Disadvantages: does not offer any fault tolerance. Required disks: two or more.

RAID 1 (disk mirroring). Advantages: provides fault tolerance; can also be used with separate disk controllers, reducing the single point of failure (this is called disk duplexing). Disadvantages: 50 percent overhead and poor write performance. Required disks: two.

RAID 5 (disk striping with distributed parity). Advantages: can recover from a single disk failure; increased read performance over a single disk; disks can be added to the array to increase storage capacity. Disadvantages: may slow down the network during regeneration time, and write performance may suffer. Required disks: minimum of three.

RAID 10 (striping with mirrored volumes). Advantages: increased performance with striping; offers mirrored fault tolerance. Disadvantages: high overhead, as with mirroring. Required disks: four.

Server and Services Fault Tolerance

In addition to providing fault tolerance for individual hardware components, some organizations go the extra mile and include the entire server in the fault-tolerant design. Such a design keeps servers and the services they provide up and running. When it comes to server fault tolerance, two key strategies are commonly employed: standby servers and server clustering.

Standby Servers
Standby servers are a fault tolerance measure in which a second server is configured identically to the first one. The second server can be stored remotely or locally and set up in a failover configuration. In a failover configuration, the secondary server is connected to the primary and is ready to take over the server functions at a moment's notice. If the secondary server detects that the primary has failed, it automatically cuts in. Network users will not notice the transition, because little or no disruption in data availability occurs.
The primary server communicates with the secondary server by issuing special notification messages called heartbeats. If the secondary server stops receiving the heartbeat messages, it assumes that the primary has failed and takes over the primary server configuration.
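Conceptually, the standby server's side of the heartbeat mechanism is a timeout loop: if no heartbeat arrives within some threshold, the standby assumes the primary role. The sketch below is illustrative only; the class, the five-second timeout, and the takeover action are invented for the example and are not tied to any particular failover product.

```python
import time

HEARTBEAT_TIMEOUT = 5.0   # seconds of silence before assuming the primary is down (illustrative)

class StandbyServer:
    """Conceptual sketch of a standby server watching for primary heartbeats."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False              # True once this node has taken over

    def on_heartbeat(self):
        """Record the arrival of a heartbeat message from the primary."""
        self.last_heartbeat = time.monotonic()

    def check_primary(self):
        """Assume the primary role if the heartbeat has gone silent too long."""
        if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.active = True
            print("Heartbeat lost: standby assuming the primary server configuration")

standby = StandbyServer()
standby.on_heartbeat()              # heartbeats arrive while the primary is healthy
standby.check_primary()             # within the timeout, so nothing happens

standby.last_heartbeat -= 10        # simulate ten seconds of silence from the primary
standby.check_primary()             # timeout exceeded: the standby takes over
```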



Server Clustering
Companies that want maximum data availability and that have the funds to pay for it can choose to use server clustering. As the name suggests, server clustering involves grouping servers for the purposes of fault tolerance and load balancing. In this configuration, other servers in the cluster can compensate for the failure of a single server. The failed server has no impact on the network, and the end users have no idea that a server has failed.
The clear advantage of server clusters is that they offer the highest level of fault tolerance and data availability. The disadvantage is equally clear: cost. The cost of buying a single server can be a huge investment for many organizations; having to buy duplicate servers is far too costly. In addition to the hardware costs, additional costs can be associated with recruiting administrators who have the skills to configure and maintain complex server clusters. Clustering provides the following advantages:
. Increased performance: More servers equals more processing power. The servers in a cluster can provide levels of performance beyond the scope of a single system by combining resources and processing power.
. Load balancing: Rather than having individual servers perform specific roles, a cluster can perform a number of roles, assigning the appropriate resources in the best places. This approach maximizes the power of the systems by allocating tasks based on which server in the cluster can best service the request (a minimal sketch of this idea follows the list).
. Failover: Because the servers in the cluster are in constant contact with each other, they can detect and cope with the failure of an individual system. How transparent the failover is to users depends on the clustering software, the type of failure, and the capability of the application software being used to cope with the failure.
. Scalability: The capability to add servers to the cluster offers a degree of scalability that is simply not possible in a single-server scenario. It is worth mentioning, though, that clustering on PC platforms is still in its relative infancy, and the number of machines that can be included in a cluster is still limited.
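As a minimal illustration of the load-balancing point above (the node names and request labels are made up for the example), a cluster scheduler can simply hand each incoming request to whichever node is currently carrying the least work:

```python
# Conceptual sketch of cluster load balancing: send each request to the
# node with the least active work. Names and counts are illustrative.

servers = {"node-a": 0, "node-b": 0, "node-c": 0}   # active requests per node

def assign(request_id: str) -> str:
    """Pick the least-loaded node and record the assignment."""
    node = min(servers, key=servers.get)
    servers[node] += 1
    return node

for r in ["req-1", "req-2", "req-3", "req-4"]:
    print(r, "->", assign(r))
```

Real clustering software uses far richer health and capacity signals, but the principle of steering work toward the node best able to service it is the same.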

To make server clustering happen, you need certain ingredients: servers, storage devices, network links, and software that makes the cluster work.






Link Redundancy

Although a failed network card might not actually stop the server or a system, it might as well. A network server that cannot be reached on the network results in server downtime. Although the chances of a failed network card are relatively low, attempts to reduce the occurrence of downtime have led to the development of a strategy that provides fault tolerance for network connections.
Through a process called adapter teaming, groups of network cards are configured to act as a single unit. The teaming capability is achieved through software, either as a function of the network card driver or through specific application software. Adapter teaming is not yet widely implemented, but the benefits it offers are many, so it is likely to become a more common sight. The result of adapter teaming is increased bandwidth, fault tolerance, and the ability to manage network traffic more effectively. These features fall into three categories:
. Adapter fault tolerance: The basic configuration enables one network card to be configured as the primary device and others as secondary. If the primary adapter fails, one of the other cards can take its place without the need for intervention. When the original card is replaced, it resumes the role of primary controller.
. Adapter load balancing: Because software controls the network adapters, workloads can be distributed evenly among the cards so that each link is used to a similar degree. This distribution allows for a more responsive server because one card is not overworked while another is underworked.
. Link aggregation: This provides vastly improved performance by allowing more than one network card's bandwidth to be aggregated, or combined, into a single connection. For example, through link aggregation, four 100Mbps network cards can provide a total of 400Mbps of bandwidth.
Link aggregation requires that both the network adapters and the switch being used support it. In 1999, the IEEE ratified the 802.3ad standard for link aggregation, allowing compatible products to be produced.



Using Uninterruptible Power Supplies

No discussion of fault tolerance can be complete without a look at power-related issues and the mechanisms used to combat them. When you're designing a fault-tolerant system, your planning should definitely include uninterruptible power supplies (UPSs). A UPS serves many functions and is a major part of server consideration and implementation.
On a basic level, a UPS is a box that holds a battery and built-in charging circuit. During times of good power, the battery is recharged; when the UPS is needed, it's ready to provide power to the server. Most often, the UPS is required to provide enough power to give the administrator time to shut down the server in an orderly fashion, preventing any potential data loss from a dirty shutdown.
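As a rough back-of-the-envelope sketch, UPS runtime is simply the energy stored in the battery divided by the power the load draws, discounted for inverter efficiency. The battery and load figures below are illustrative assumptions, not vendor specifications, and real runtime drops further at high discharge rates.

```python
def ups_runtime_minutes(battery_voltage: float, battery_amp_hours: float,
                        load_watts: float, inverter_efficiency: float = 0.85) -> float:
    """Very rough runtime estimate: stored energy (Wh) divided by load (W), in minutes."""
    stored_wh = battery_voltage * battery_amp_hours * inverter_efficiency
    return stored_wh / load_watts * 60

# Example: a 12 V, 9 Ah battery feeding a 300 W server load (illustrative numbers).
print(f"{ups_runtime_minutes(12, 9, 300):.1f} minutes")   # roughly 18 minutes
```

Even a rough estimate like this makes the point that a typical UPS buys minutes, not hours, which is why its main job is an orderly shutdown rather than continued operation.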

Why Use a UPS?
Organizations of all shapes and sizes need UPSs as part of their fault tolerance strategies. A UPS is as important as any other fault tolerance measure. Three key reasons make a UPS necessary:
. Data availability: The goal of any fault tolerance measure is data availability. A UPS ensures access to the server in the event of a power failure, or at least for as long as it takes to save a file.
. Protection from data loss: Fluctuations in power or a sudden power-down can damage the data on the server system. In addition, many servers take full advantage of caching, and a sudden loss of power could cause the loss of all information held in cache.
. Protection from hardware damage: Constant power fluctuations or sudden power-downs can damage hardware components within a computer. Damaged hardware can lead to reduced data availability while the hardware is being repaired.


Power Threats
In addition to keeping a server functioning long enough to safely shut it down, a UPS safeguards a server from inconsistent power. This inconsistent power can take many forms. A UPS protects a system from the following power-related threats:
. Blackout: A total failure of the power supplied to the server.
. Spike: A spike is a very short (usually less than a second) but very intense increase in voltage. Spikes can do irreparable damage to any kind of equipment,  especially computers.



. Surge: Compared to a spike, a surge is a considerably longer (sometimes many seconds) but usually less intense increase in power. Surges can also damage your computer equipment.
. Sag: A sag is a short-term voltage drop (the opposite of a spike). This type of voltage drop can cause a server to reboot.
. Brownout: A brownout is a drop in voltage that usually lasts more than a few minutes.

Many of these power-related threats can occur without your knowledge; if you don't have a UPS, you cannot prepare for them. For the cost, it is worth buying a UPS, if for no other reason than to sleep better at night.

Disaster Recovery

Even the most fault-tolerant networks will fail, which is an unfortunate fact. When those costly and carefully implemented fault tolerance strategies fail, you are left with disaster recovery.
Disaster recovery can take many forms. In addition to disasters such as fire, flood, and theft, many other potential business disruptions can fall under the banner of disaster recovery. For example, the failure of the electrical supply to your city block might interrupt the business functions. Such an event, although not a disaster per se, might invoke the disaster recovery methods.
The cornerstone of every disaster recovery strategy is the preservation and recoverability of data. When talking about preservation and recoverability, we are talking about backups. When we are talking about backups, we are likely talking about tape backups. Implementing a regular backup schedule can save you a lot of grief when fault tolerance fails or when you need to recover a file that has been accidentally deleted. When it comes time to design a backup schedule, three key types of backups are used—full, differential, and incremental.

Full Backup

The preferred method of backup is the full backup method, which copies all files and directories from the hard disk to the backup media. There are a few reasons why doing a full backup is not always possible. First among them is likely the time involved in performing a full backup.







Depending on the amount of data to be backed up, full backups can take an extremely long time and can use extensive system resources. Depending on the configuration of the backup hardware, this can slow down the network considerably. In addition, some environments have more data than can fit on a single tape. This makes doing a full backup awkward, because someone might need to be there to change the tapes.
The main advantage of full backups is that a single tape or tape set holds all the data you need backed up. In the event of a failure, a single tape might be all that is needed to get all data and system information back. The upshot of all this is that any disruption to the network is greatly reduced.
Unfortunately, its strength can also be its weakness. A single tape holding an organization’s data can be a security risk. If the tape were to fall into the wrong hands, all the data could be restored on another computer. Using passwords on tape backups and using a secure offsite and onsite location can minimize the security risk.

Differential  Backup

Companies that just don’t have enough time to complete a full backup daily can make use of the differential backup. Differential backups are faster than a full backup, because they back up only the data that has changed since the last full backup. This means that if you do a full backup on a Saturday and a differential backup on the following Wednesday, only the data that has changed since Saturday is backed up. Restoring the differential backup requires the last full backup and the latest differential backup.
Differential backups know what files have changed since the last full backup because they use a setting called the archive bit. The archive bit flags files that have changed or have been created and identifies them as ones that need to be backed up. Full backups do not concern themselves with the archive bit, because all files are backed up, regardless of date. A full backup, however, does clear the archive bit after data has been backed up to avoid future confusion. Differential backups take notice of the archive bit and use it to determine which files have changed. The differential backup does not reset the archive bit information.






Incremental Backup

Some companies have a finite amount of time they can allocate to backup procedures. Such organizations are likely to use incremental backups in their backup strategy. Incremental backups save only the files that have changed since the last full or incremental backup. Like differential backups, incremental backups use the archive bit to determine which files have changed since the last full or incremental backup. Unlike differentials, however, incremental backups clear the archive bit, so files that have not changed are not backed up.




The faster backup time of incremental backups comes at a price—the amount of time required to restore. Recovering from a failure with incremental backups requires numerous tapes—all the incremental tapes and the most recent full backup. For example, if you had a full backup from Sunday and an incremental for Monday, Tuesday, and Wednesday, you would need four tapes to restore the data. Keep in mind that each tape in the rotation is an additional step in the restore process and an additional failure point. One damaged incremental tape, and you will be unable to restore the data. Table 8.3 summarizes the various backup strategies.
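The restore-time trade-off between the two strategies can be made concrete with a small sketch. The schedule below is hypothetical (full backup Sunday, daily backups Monday through Wednesday, restore needed Thursday morning); only the tape-selection logic matters.

```python
# Sketch: which tapes a Thursday-morning restore needs, given a full backup
# on Sunday and daily backups Monday-Wednesday. Labels are hypothetical.

week = ["Sun-FULL", "Mon", "Tue", "Wed"]

def restore_set(strategy: str) -> list[str]:
    full = week[0]
    if strategy == "differential":
        # Last full plus only the most recent differential.
        return [full, week[-1] + "-DIFF"]
    if strategy == "incremental":
        # Last full plus every incremental since then, in order.
        return [full] + [day + "-INC" for day in week[1:]]
    raise ValueError(strategy)

print(restore_set("differential"))  # ['Sun-FULL', 'Wed-DIFF'] -> two tapes
print(restore_set("incremental"))   # ['Sun-FULL', 'Mon-INC', 'Tue-INC', 'Wed-INC'] -> four tapes
```

The four-tape incremental restore matches the example above, and each extra tape in the chain is one more place where a restore can fail.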



Table 8.3  Backup Strategies

Full backup. Advantages: backs up all data on a single tape or tape set; restoring data requires the fewest tapes. Disadvantages: depending on the amount of data, full backups can take a long time. Data backed up: all files and directories. Archive bit: does not use the archive bit, but resets it after data has been backed up.

Differential backup. Advantages: faster backups than a full backup. Disadvantages: uses more tapes than a full backup; the restore process takes longer than a full backup. Data backed up: all files and directories that have changed since the last full backup. Archive bit: uses the archive bit to determine which files have changed, but does not reset it.

Incremental backup. Advantages: faster backup times. Disadvantages: requires multiple tapes; restoring data takes more time than the other backup methods. Data backed up: the files and directories that have changed since the last full or incremental backup. Archive bit: uses the archive bit to determine which files have changed, and resets it.


Tape Rotations

After you select a backup type, you are ready to choose a backup rotation. Several backup rotation strategies are in use: some good, some bad, and some really bad. The most common, and perhaps the best, rotation strategy is grandfather, father, son (GFS).
The GFS backup rotation is the most widely used—and for good reason. For example, a GFS rotation may require 12 tapes: four tapes for daily backups (son), five tapes for weekly backups (father), and three tapes for monthly backups (grandfather).
Using this rotation schedule, you can recover data from days, weeks, or months earlier. Some network administrators choose to add tapes to the monthly rotation so that they can retrieve data even further back, sometimes up to a year. In most organizations, however, data that is a week old is out of date, let alone six months or a year.
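One common way to lay out a 12-tape GFS schedule is by calendar rule: daily (son) tapes Monday through Thursday, weekly (father) tapes on Fridays, and a monthly (grandfather) tape on the last Friday of the month. The Python sketch below encodes that convention; it is one possible layout rather than a fixed standard, and organizations vary the details.

```python
import calendar
from datetime import date

def gfs_tape_for(d: date) -> str:
    """Pick a grandfather-father-son tape label for a date (illustrative convention)."""
    if d.weekday() >= 5:                                   # Saturday/Sunday: no backup in this sketch
        return "no backup scheduled"
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    is_friday = d.weekday() == 4
    if is_friday and (days_in_month - d.day) < 7:          # last Friday of the month
        return f"Monthly-{(d.month - 1) % 3 + 1}"          # three grandfather tapes, reused quarterly
    if is_friday:
        return f"Weekly-{(d.day - 1) // 7 + 1}"            # up to five father tapes
    return f"Daily-{d.weekday() + 1}"                      # four son tapes (Monday-Thursday)

for day in range(25, 30):                                  # the last week of January 2016
    d = date(2016, 1, day)
    print(d, d.strftime("%a"), gfs_tape_for(d))
```

Run against the last week of January 2016, this prints four daily tapes followed by a monthly tape on Friday the 29th, which is exactly the 4/5/3 tape split described above.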

Backup Best Practices

Many details go into making a backup strategy a success. The following are issues to consider as part of your backup plan:
. Offsite storage: Consider storing backup tapes offsite so that in the event of a disaster in a building, a current set of tapes is still available offsite.
The offsite tapes should be as current as any onsite and should be secure.



. Label tapes: The goal is to restore the data as quickly as possible, and trying to find the tape you need can be difficult if it isn’t marked.
Furthermore, this can prevent you from recording over a tape you need.
. New tapes: Like old cassette tapes, the tape cartridges used for backups wear out over time. One strategy used to prevent this from becoming a problem is to periodically introduce new tapes into the rotation schedule.
. Verify backups: Never assume that the backup was successful. Seasoned administrators know that checking backup logs and performing periodic test restores are part of the backup process.
. Cleaning: You need to clean the tape drive occasionally. If the inside gets dirty, backups can fail.




Hot and Cold Spares

The impact that a failed component has on a system or network depends largely on the predisaster preparation and on the recovery strategies used. Hot and cold spares represent a strategy for recovering from failed components.

Hot Spares and Hot Swapping
Hot spares allow system administrators to quickly recover from component failure. In a common use, a hot spare enables a RAID system to automatically fail over to a spare hard drive should one of the other drives in the RAID array fail. A hot spare does not require any manual intervention. Instead, a redundant drive resides in the system at all times, just waiting to take over if another drive fails. The hot spare drive takes over automatically, leaving the failed drive to be removed later. Even though hot-spare technology adds an extra level of protection to your system, after a drive has failed and the hot spare has been used, the situation should be remedied as soon as possible.
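Conceptually, the hot-spare behavior amounts to: detect the failed member, promote the spare automatically, and flag the array for attention until the failed drive is replaced. The sketch below is illustrative only; the class and method names are invented for the example, and a real controller does this in firmware or the OS driver.

```python
# Conceptual sketch of hot-spare failover in a RAID set. Names are
# illustrative; real controllers handle this in firmware or the driver.

class RaidSet:
    def __init__(self, members, hot_spare):
        self.members = list(members)     # active drives
        self.hot_spare = hot_spare       # idle drive waiting to take over

    def drive_failed(self, drive):
        """Promote the hot spare automatically; no manual intervention needed."""
        self.members.remove(drive)
        if self.hot_spare is not None:
            print(f"{drive} failed: rebuilding onto hot spare {self.hot_spare}")
            self.members.append(self.hot_spare)
            self.hot_spare = None        # spare consumed; replace the failed drive soon
        else:
            print(f"{drive} failed and no spare is available: array is degraded")

array = RaidSet(["disk0", "disk1", "disk2"], hot_spare="disk3")
array.drive_failed("disk1")
print("Active members:", array.members)
```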
Hot swapping is the ability to replace a failed component while the system is running. Perhaps the most commonly identified hot-swap component is the hard drive. In certain RAID configurations, when a hard drive crashes, hot swapping allows you to simply take the failed drive out of the server and install a new one.



The benefits of hot swapping are very clear in that it allows a failed component to be recognized and replaced without compromising system availability. Depending on the system's configuration, the new hardware normally is recognized automatically by both the current hardware and the operating system. Nowadays, most internal and external RAID subsystems support the hot-swapping feature. Some hot-swappable components include power supplies and hard disks.

Cold Spares and Cold Swapping
The term cold spare refers to a component, such as a hard disk, that resides within a computer system but requires manual intervention in case of component failure. A hot spare engages automatically, but a cold spare might require configuration settings or some other action to engage it. A cold spare configuration typically requires a reboot of the system.
The term cold spare has also been used to refer to a redundant component that is stored outside the actual system but is kept in case of component failure. To replace the failed component with a cold spare, you need to power down the system.
Cold swapping refers to replacing components only after the system is completely powered off. This strategy is by far the least attractive for servers, because the services provided by the server are unavailable for the duration of the cold-swap procedure. Modern systems have come a long way to ensure that cold swapping is a rare occurrence. For some situations and for some components, however, cold swapping is the only method to replace a failed component. The only real defense against having to shut down the server is to have redundant components reside in the system.




Hot, Warm, and Cold Sites

A disaster recovery plan might include the provision for a recovery site that can be brought into play quickly. These sites fall into three categories: hot, warm, and cold. The need for each of these types of sites depends largely on the business you are in and the funds available. Disaster recovery sites represent the ultimate in precautions for organizations that really need them. As a result, they don't come cheap.
The basic concept of a disaster recovery site is that it can provide a base from which the company can be operated during a disaster. The disaster recovery site normally is not intended to provide a desk for every employee. It’s intended more as a means to allow key personnel to continue the core business functions.
In general, a cold recovery site is a site that can be up and operational in a relatively short amount of time, such as a day or two. Provision of services, such as telephone lines and power, is taken care of, and the basic office furniture might be in place. But there is unlikely to be any computer equipment, even though the building might well have a network infrastructure and a room ready to act as a server room. In most cases, cold sites provide the physical location and basic services.
Cold sites are useful if you have some forewarning of a potential problem. Generally speaking, cold sites are used by organizations that can weather the storm for a day or two before they get back up and running. If you are the regional office of a major company, it might be possible to have one of the other divisions take care of business until you are ready to go. But if you are the one and only office in the company, you might need something a little hotter.
For organizations with the dollars and the desire, hot recovery sites represent the ultimate in fault tolerance strategies. Like cold recovery sites, hot sites are designed to provide only enough facilities to continue the core business function, but hot recovery sites are set up to be ready to go at a moment's notice.
A hot recovery site includes phone systems with the phone lines already connected. Data networks also are in place, with any necessary routers and switches plugged in and turned on. Desks have desktop PCs installed and waiting, and server areas are replete with the necessary hardware to support business-critical functions. In other words, within a few hours, the hot site can become a fully functioning element of an organization.
The issue that confronts potential hot-recovery site users is simply that of cost. Office space is expensive in the best of times, but having space sitting idle 99.9 percent of the time can seem like a tremendously poor use of money. A very popular strategy to get around this problem is to use space provided in a disaster recovery facility, which is basically a building, maintained by a third-party company, in which various businesses rent space. Space is usually apportioned according to how much each company pays.




Sitting between the hot and cold recovery sites is the warm site. A warm site typically has computers but is not configured and ready to go. This means that data might need to be updated or other manual interventions might need to be performed before the network is again operational. The time it takes to get a warm site operational lands right in the middle of the other two options, as does the cost.
