Introduction
As far as network administration goes, nothing is more important than fault tolerance and disaster recovery. First and foremost, it is the responsibility of the network administrator to safeguard the data held on the servers and to ensure that, when requested, this data is ready to go.
Because both fault tolerance and disaster recovery are such an important part of network administration, they are well represented on the CompTIA Network+ exam. In that light, this chapter is important in terms of both real-world application and the exam itself.
What Is Uptime?
All devices on the network,
from routers to cabling, and especially servers,
must have one prime underlying trait: availability.
Networks play such a vital role in the operation of businesses that
their availability must be measured in dollars. The failure of a single desktop
PC affects the productivity of a single user.
The failure of an entire network affects the productivity of the entire
company and potentially the company’s clients
as well. A network failure might have an even larger impact than that as new e-commerce customers
look somewhere else for
products, and existing customers start
to wonder about
the site’s reliability.
Every minute that a network
is not running can potentially cost an organization money. The exact
amount depends on the role that the server performs
and how long it is unavailable. For example, if a small departmental server supporting 10 people goes down for one hour, this might not be a big deal. If the
server that runs the company’s e-commerce
website goes down for even 10 minutes, it can cost hundreds of thousands of dollars in lost orders.
The importance of data availability varies between networks, but it dictates to what extent a server/network implements fault tolerance measures. The projected capability for a network or network component to weather failure is defined as a number or percentage. The fact that no solution is labeled as providing 100 percent availability indicates that no matter how well we protect our networks, some aspect of the configuration will fail sooner or later.
So how expensive is failure? In terms of equipment replacement costs, it's not that high. In terms of how much it costs to actually fix the problem, it is a little more expensive. The actual cost of downtime is the biggest factor. For businesses, downtime impacts the functionality and productivity of operations. The longer the downtime, the greater the business loss.
Assuming that you know you can never really obtain 100% uptime,
what should you aim for? Consider
this. If you were responsible for a server
system that was available 99.5% of the time, you
might be satisfied. But if you realized that you would also have 43.8 hours of
downtime each year—that’s one full
workweek and a little overtime—you might not be so smug. Table 8.1 compares various levels of downtime.
Table 8.1 Levels of Availability and Related Downtime
Level of Availability | Availability % | Downtime Per Year
Commercial availability | 99.5% | 43.8 hours
High availability | 99.9% | 8.8 hours
Fault-resilient clusters | 99.99% | 53 minutes
Fault-tolerant | 99.999% | 5 minutes
Continuous | 100% | 0
These figures make it simple to justify spending money on implementing fault tolerance measures. Just remember that even to reach the definition of commercial availability, you will need to have a range of measures in place. After the commercial availability level, the strategies that take you to each subsequent level are likely to be increasingly expensive, even though they might be easy to justify.
For example, if you estimate that each hour of server downtime will cost the company $1,000, the elimination of 35 hours of downtime (from 43.8 hours for commercial availability to 8.8 hours for high availability) justifies some serious expenditure on technology. Although this first jump is an easily justifiable one, subsequent levels might not be so easy to sell. Working on the same basis, moving from high availability to fault-resilient clusters saves less than $10,000, but the equipment, software, and skills required to move to that next level will far exceed this figure. In other words, increasing fault tolerance is a law of diminishing returns. As your need to reduce the possibility of downtime increases, so does the investment required to achieve this goal.
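To make the arithmetic behind these trade-offs concrete, the following Python sketch converts an availability percentage into hours of downtime per year and estimates the savings of moving between the levels in Table 8.1. The $1,000-per-hour figure is simply the example value used above, and the function and variable names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_hours(availability_pct: float) -> float:
    """Annual downtime, in hours, for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

levels = {                      # availability levels from Table 8.1
    "Commercial availability": 99.5,
    "High availability": 99.9,
    "Fault-resilient clusters": 99.99,
    "Fault-tolerant": 99.999,
}
COST_PER_HOUR = 1_000           # example downtime cost from the text

previous = None
for name, pct in levels.items():
    hours = downtime_hours(pct)
    print(f"{name}: {hours:.1f} hours of downtime per year")
    if previous is not None:
        print(f"  Saves roughly ${(previous - hours) * COST_PER_HOUR:,.0f} "
              f"per year over the previous level")
    previous = hours
```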
The role played by the network administrator in all of this can be somewhat challenging. In some respects, you must function as if you are selling insurance. Informing management of the risks and potential outcomes of downtime can seem a little sensational, but the reality is that the information must be provided if you are to avoid post-event questions about why management was not made aware of the risks. At the same time, a realistic evaluation of exactly what risks are present is needed, along with a realistic evaluation of the amount of downtime each failure might bring.
The Risks
Having established that you need to guard against equipment failure, you can now look at which pieces of equipment are more liable to fail than others. In terms of component failure, the hard disk is responsible for 50 percent of all system downtime. With this in mind, it should come as no surprise that hard disks have garnered the most attention when it comes to fault tolerance. Redundant array of inexpensive disks (RAID), which is discussed in detail in this chapter, is a set of standards that allows servers to cope with the failure of one or more hard disks.
In fault tolerance, RAID is only half the story. Measures are in place to cope with failures of most other components as well. In some cases, fault tolerance is an elegant solution, and in others, it is a simple case of duplication. We'll start our discussion by looking at RAID, and then we'll move on to other fault tolerance measures.
Fault Tolerance
As far as computers are concerned, fault tolerance refers to the capability of the computer system or network to provide continued data availability in the event of hardware failure. Every component within a server, from the CPU fan to the power supply, has a chance of failure. Some components, such as processors, rarely fail, whereas hard disk failures are well documented.
Almost every component has fault tolerance measures. These measures typically require redundant hardware components that can easily or automatically take over when a hardware failure occurs.
Of all the components inside computer systems, the hard disks require the most redundancy. Not only are hard disk failures more common than for any other component, but they also hold the data, without which there would be little need for a network.
Disk-Level Fault Tolerance
Deciding to have hard disk fault tolerance on the server is the first step; the second is deciding which fault tolerance strategy to use. Hard disk fault tolerance is implemented according to different RAID levels. Each RAID level offers differing amounts of data protection and performance. The RAID level appropriate for a given situation depends on the importance placed on the data, the difficulty of replacing that data, and the associated costs of a respective RAID implementation. Often, the costs of data loss and replacement outweigh the costs associated with implementing a strong RAID fault tolerance solution. RAID can be deployed through dedicated hardware, which is more costly, or it can be software-based. Today's network operating systems, such as UNIX and Windows server products, have built-in support for RAID.
RAID 0: Stripe Set Without Parity
Although it's given RAID status, RAID 0 does not actually provide any fault tolerance. In fact, using RAID 0 might even be less fault-tolerant than storing all your data on a single hard disk.
RAID 0 combines unused disk space on two or more hard drives into a
single logical volume, with data written to equally sized stripes across all
the disks. Using multiple disks, reads and writes are performed simultaneously
across all drives. This means that disk access is faster, making the performance of RAID 0 better than other RAID solutions and significantly better than a single hard disk.
The downside of RAID 0 is that if any disk in the array fails, the data is lost
and must be restored from backup.
Because of its
lack of fault tolerance, RAID 0 is rarely implemented.
(Figure: data striped across Disk 0, Disk 1, and Disk 2 in a RAID 0 array.)
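As a rough illustration of striping, the sketch below (plain Python, with an arbitrary stripe size and three notional disks) deals equally sized chunks of data out across the disks in round-robin fashion. It is a simplified model of the idea, not of a real RAID controller.

```python
STRIPE_SIZE = 4   # bytes per stripe; real arrays use much larger stripes
DISK_COUNT = 3    # Disk 0, Disk 1, Disk 2

def stripe(data: bytes) -> list[list[bytes]]:
    """Split data into stripes and distribute them round-robin across disks."""
    disks = [[] for _ in range(DISK_COUNT)]
    for i in range(0, len(data), STRIPE_SIZE):
        disks[(i // STRIPE_SIZE) % DISK_COUNT].append(data[i:i + STRIPE_SIZE])
    return disks

for disk_id, chunks in enumerate(stripe(b"ABCDEFGHIJKLMNOPQRSTUVWX")):
    print(f"Disk {disk_id}: {chunks}")
# Reads and writes touch all three disks in parallel, but losing any one
# disk loses a third of every file: there is no redundancy.
```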
RAID 1
One of the more common RAID implementations is RAID 1. RAID 1 requires two hard disks and uses disk mirroring to provide fault tolerance. When information is written to the first hard disk, it is automatically and simultaneously written to the second hard disk. Both of the hard disks in the mirrored configuration use the same hard disk controller, and the partitions used on the disks need to be approximately the same size to establish the mirror. In the mirrored configuration, if the primary disk were to fail, the second mirrored disk would contain all the required information, and there would be little disruption to data availability. RAID 1 ensures that the server will continue operating in the case of primary disk failure.
A RAID 1 solution has some key advantages. First, it is cheap in terms of cost per megabyte of storage, because only two hard disks are required to provide fault tolerance. Second, no additional software is required to establish RAID 1, because modern network operating systems have built-in support for it. Third, unlike RAID levels that rely on striping, a RAID 1 mirror can typically include the boot or system partition in the fault tolerance solution. Finally, RAID 1 offers load balancing over multiple disks, which increases read performance over that of a single disk. Write performance, however, is not improved.
Because of its advantages, RAID 1 is well suited as an entry-level RAID solution, but it has a few significant shortcomings that exclude its use in many environments. It has limited storage capacity: two 100GB hard drives provide only 100GB of storage space. Organizations with large data storage needs can exceed a mirrored solution's capacity in very short order. RAID 1 also has a single point of failure, the hard disk controller. If the controller were to fail, the data on both drives would be inaccessible.
An extension of RAID 1 is disk duplexing. Disk duplexing is the same as mirroring, with the exception of one key detail: it places the hard disks on separate hard disk controllers, eliminating the single point of failure.
RAID 5
RAID 5, also known as disk striping with parity, uses distributed parity to write information across all disks in the array. Unlike the striping used in RAID 0, RAID 5 includes parity information in the striping, which provides fault tolerance. This parity information is used to re-create the data in the event of a failure. RAID 5 requires a minimum of three disks, with the equivalent of a single disk being used for the parity information. This means that if you have three 40GB hard disks, you have 80GB of storage space, with the other 40GB used for parity. To increase storage space in a RAID 5 array, you need only add another disk to the array. Depending on the sophistication of the RAID setup you are using, the RAID controller will be able to incorporate the new drive into the array automatically, or you will need to rebuild the array and restore the data from backup.
Many factors have made RAID 5 a very popular fault-tolerant design. RAID 5 can continue to function in the event of a single drive failure. If a hard disk in the array were to fail, the parity information would be used to re-create the missing data, and the array would continue to function with the remaining drives. The read performance of RAID 5 is also improved over that of a single disk.
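At its simplest, the parity that makes this recovery possible is a bytewise XOR of the corresponding stripes on the other disks. The short sketch below (illustrative Python, not a real RAID implementation) rebuilds a lost stripe from the survivors and shows why three 40GB disks yield 80GB of usable space.

```python
from functools import reduce

def xor_parity(stripes: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized stripes."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*stripes))

stripe_a = b"ORDERS05"                        # data stripe on disk 1
stripe_b = b"INVOICE7"                        # data stripe on disk 2
parity   = xor_parity([stripe_a, stripe_b])   # parity stripe on disk 3

# Disk 2 fails: rebuild its stripe from the remaining data and the parity.
rebuilt = xor_parity([stripe_a, parity])
assert rebuilt == stripe_b

# Capacity: with n equally sized disks, one disk's worth of space holds parity.
n, disk_gb = 3, 40
print(f"Usable space: {(n - 1) * disk_gb}GB of {n * disk_gb}GB total")
```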
The RAID 5 solution has only a few
drawbacks:
. The costs of implementing RAID 5 are initially higher than those of other fault tolerance measures, because it requires a minimum of three hard disks. Given the cost of hard disks today, this is a minor concern. Note also that a hardware-based RAID 5 implementation is more expensive than a software-based one.
. RAID 5 suffers from poor write performance because the parity has to be calculated and then written across several disks. The performance lag is minimal, though, and doesn't make a noticeable difference on the network.
. When a new disk is placed in a failed RAID 5 array, there is a regeneration period while the data is rebuilt on the new drive. This process requires extensive resources from the server.
RAID 10
Sometimes RAID levels are combined to take advantage of the best of each. One such strategy is RAID 10, which combines RAID levels 1 and 0. In this configuration, four disks are required. As you might expect, the configuration consists of a mirrored stripe set. To some extent, RAID 10 takes advantage of the performance capability of a stripe set while offering the fault tolerance of a mirrored solution. As well as having the benefits of each, though, RAID 10 inherits the shortcomings of each strategy. In this case, the high overhead and decreased write performance are the disadvantages.
Table 8.2 Summary of RAID Levels
RAID Level | Description | Advantages | Disadvantages | Required Disks
RAID 0 | Disk striping | Increased read and write performance. RAID 0 can be implemented with two or more disks. | Does not offer any fault tolerance. | Two or more
RAID 1 | Disk mirroring | Provides fault tolerance. Can also be used with separate disk controllers, reducing the single point of failure (called disk duplexing). | Has 50 percent overhead and suffers from poor write performance. | Two
RAID 5 | Disk striping with distributed parity | Can recover from a single disk failure. Increased read performance over a single disk. Disks can be added to the array to increase storage capacity. | May slow down the network during regeneration time, and write performance may suffer. | Minimum of three
RAID 10 | Striping with mirrored volumes | Increased performance with striping. Offers mirrored fault tolerance. | High overhead, as with mirroring. | Four
Server and Services Fault Tolerance
In addition to providing fault tolerance for individual hardware components, some organizations go the extra mile and include the entire server in the fault-tolerant design. Such a design keeps servers and the services they provide up and running. When it comes to server fault tolerance, two key strategies are commonly employed: standby servers and server clustering.
Standby Servers
Standby servers are a fault tolerance measure in which a second server is configured identically to the first one. The second server can be stored remotely or locally and set up in a failover configuration. In a failover configuration, the secondary server is connected to the primary and is ready to take over the server functions at a moment's notice. If the secondary server detects that the primary has failed, it automatically cuts in. Network users will not notice the transition, because little or no disruption in data availability occurs.
The primary server communicates with the secondary server by issuing special notification messages called heartbeats. If the secondary server stops receiving the heartbeat messages, it assumes that the primary has died and takes over the primary server configuration.
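Conceptually, the secondary server is just watching a timer that each heartbeat resets. The sketch below is a single-process Python illustration of that logic; the class name, the three-second timeout, and the takeover message are invented for the example rather than taken from any particular failover product.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before the secondary takes over

class SecondaryServer:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def receive_heartbeat(self) -> None:
        """Called each time a heartbeat message arrives from the primary."""
        self.last_heartbeat = time.monotonic()

    def check_primary(self) -> None:
        """If the heartbeats stop, assume the primary has died and cut in."""
        silence = time.monotonic() - self.last_heartbeat
        if not self.active and silence > HEARTBEAT_TIMEOUT:
            self.active = True
            print("No heartbeat from primary; assuming the primary configuration.")

secondary = SecondaryServer()
secondary.receive_heartbeat()   # the primary is alive
time.sleep(3.5)                 # the primary goes silent past the timeout
secondary.check_primary()       # the secondary cuts in
```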
Server Clustering
Companies that want maximum data availability and that have the funds
to pay for it can choose to use server clustering. As the name suggests, server clustering
involves grouping servers
for the purposes
of fault tolerance and load balancing. In this configuration, other
servers in the cluster can compensate for the failure of a single server. The failed server has no impact on
the network, and the end users have no idea that a server
has failed.
The clear advantage of server clusters is that they offer the highest level of fault tolerance and data availability. The disadvantage is equally clear: cost. The cost of buying a single server can be a huge investment for many organizations; having to buy duplicate servers is far too costly. Beyond the hardware itself, additional costs can be associated with recruiting administrators who have the skills to configure and maintain complex server clusters. Clustering provides the following advantages:
. Increased performance: More servers
equals more processing power. The servers in a cluster can provide levels of
performance beyond the scope of a single system by combining
resources and processing power.
. Load balancing: Rather than having
individual servers perform specific roles, a cluster
can perform a number of roles, assigning
the appropriate resources in
the best places. This approach maximizes the power of the systems by allocating tasks based on which server in the cluster can best
service the
request.
. Failover: Because the servers in the cluster are in constant contact with each other, they can detect and cope with the failure of an individual system. How transparent the failover is to users depends on the clustering software, the type of failure, and the capability of the application software being used to cope with the failure.
. Scalability: The capability to add
servers to the cluster offers a degree of scalability that is simply not
possible in a single-server scenario. It is worth mentioning, though, that
clustering on PC platforms is still in its relative infancy, and the number of machines that can be included in a
cluster is still limited.
To make server clustering happen, you need certain ingredients: servers, storage devices, network links, and software that makes the cluster work.
Link Redundancy
Although a failed network card might not actually stop the server or a system, it might as well. A network server that cannot be used on the network results in server downtime. Although the chances of a failed network card are relatively low, attempts to reduce the occurrence of downtime have led to the development of a strategy that provides fault tolerance for network connections. Through a process called adapter teaming, groups of network cards are configured to act as a single unit.
The teaming capability is achieved through software, either as a function of the network card driver or through specific application software. The process of adapter teaming is not yet widely implemented, but the benefits it offers are many, so it's likely to become a more common sight. The result of adapter teaming is increased bandwidth, fault tolerance, and the ability to manage network traffic more effectively. These features are broken into three sections:
. Adapter fault tolerance: The basic configuration enables one network card to be configured as the primary device and others as secondary. If the primary adapter fails, one of the other cards can take its place without the need for intervention. When the original card is replaced, it resumes the role of primary controller.
. Adapter load balancing: Because
software controls the network adapters, workloads can be distributed evenly
among the cards so that each link is used to a similar degree. This
distribution allows for a more responsive server because one card is not
overworked while another is underworked.
. Link aggregation: This provides vastly improved performance by allowing more than one network card's bandwidth to be aggregated, or combined, into a single connection. For example, through link aggregation, four 100Mbps network cards can provide a total of 400Mbps of bandwidth.
Link aggregation requires that both the network
adapters and the switch
being used support it. In 1999, the IEEE ratified the 802.3ad standard for link aggregation, allowing
compatible products to be produced.
Using Uninterruptible Power Supplies
No discussion of fault tolerance can be complete without a look at power-related issues and the mechanisms used to combat them. When you're designing a fault-tolerant system, your planning should definitely include uninterruptible power supplies (UPSs). A UPS serves many functions and is a major part of server consideration and implementation.
On a basic level, a UPS is a box that holds a battery and a built-in charging circuit. During times of good power, the battery is recharged; when the UPS is needed, it's ready to provide power to the server. Most often, the UPS is required to provide enough power to give the administrator time to shut down the server in an orderly fashion, preventing any potential data loss from a dirty shutdown.
Why Use a UPS?
Organizations of all shapes and sizes need UPSs as part of their fault
tolerance strategies. A UPS is as important as any other fault tolerance
measure. Three key reasons make a UPS necessary:
. Data availability: The goal of any fault tolerance measure is data availability. A UPS ensures access to the server in the event of a power failure, or at least for as long as it takes to save a file.
. Protection from data loss: Fluctuations in power or a sudden power-down can damage the data on the server system. In addition, many servers take full advantage of caching, and a sudden loss of power could cause the loss of all information held in cache.
. Protection from hardware damage: Constant power fluctuations or sudden power-downs can damage hardware components within a computer. Damaged hardware can lead to reduced data availability while the hardware is being repaired.
Power Threats
In addition to keeping a server functioning long enough to safely shut it
down, a UPS safeguards a server
from inconsistent power. This inconsistent power
can take many forms. A UPS protects a system from the following
power-related threats:
. Blackout: A total
failure of the power supplied
to the server.
. Spike: A spike
is a very short (usually
less than a second) but very intense increase in voltage. Spikes can
do irreparable damage to any kind of equipment, especially
computers.
. Surge:
Compared to a spike, a surge is a considerably longer (sometimes many seconds)
but usually less intense increase
in power. Surges
can also damage your computer equipment.
. Sag: A sag is a short-term voltage drop (the opposite of a spike).
This type of voltage
drop can cause a server to reboot.
. Brownout: A brownout
is a drop in voltage
that usually lasts
more than a few minutes.
Many of these power-related threats can occur without your knowledge; if
you don’t have a UPS, you cannot
prepare for them. For the cost, it is worth buying a UPS, if for no other reason than to sleep
better at night.
Disaster Recovery
Even the most fault-tolerant networks will fail, which is an unfortunate
fact. When those costly and carefully
implemented fault tolerance
strategies fail, you are
left with disaster recovery.
Disaster recovery can take many forms. In addition to disasters such as
fire, flood, and theft, many other potential business disruptions can fall
under the banner of disaster recovery. For
example, the failure of the electrical supply to your city block might
interrupt the business functions. Such an event, although not a disaster per se, might
invoke the disaster
recovery methods.
The cornerstone of every disaster recovery strategy is the preservation
and recoverability of data. When talking about preservation and recoverability,
we are talking about backups. When we are talking about backups, we are likely
talking about tape backups. Implementing a regular backup schedule can save you
a lot of grief when fault tolerance fails or when you need to recover a file
that has been accidentally deleted. When it comes time to design a backup
schedule, three key types of backups are used—full, differential, and incremental.
Full Backup
The preferred method of backup is the full backup method, which copies all files
and directories from the hard disk to the backup media. There are a few reasons why doing a full backup is not
always possible. First among them is likely the time involved in performing a
full backup.
Depending on the amount of data to be backed up, full backups can take an extremely long time and can use extensive system resources. Depending on the configuration of the backup hardware, this can slow down the network considerably. In addition, some environments have more data than can fit on a single tape. This makes doing a full backup awkward, because someone might need to be there to change the tapes.
The main advantage of full backups is that a single tape or tape set
holds all the data you need backed up. In the event of a failure,
a single tape might be all that is needed to get all data and system
information back. The upshot of all this is that any disruption to the network
is greatly reduced.
Unfortunately, its strength can also be its weakness. A single tape
holding an organization’s data can
be a security risk. If the tape were to fall into the wrong hands, all the data
could be restored on another computer. Using
passwords on tape backups and using a secure offsite and onsite location can
minimize the security risk.
Differential Backup
Companies that just don’t have
enough time to complete a full backup daily can make use of the differential
backup. Differential backups are faster than a full backup, because
they back up only the data that has changed since the last full backup. This
means that if you do a full backup on a Saturday and a differential backup on
the following Wednesday, only the
data that has changed since Saturday is backed up. Restoring the differential
backup requires the last full backup and the latest differential backup.
Differential backups know which files have changed since the last full backup because they use a setting called the archive bit. The archive bit flags files that have changed or have been created and identifies them as ones that need to be backed up. Full backups do not concern themselves with the archive bit, because all files are backed up regardless. A full backup, however, does clear the archive bit after data has been backed up to avoid future confusion. Differential backups take notice of the archive bit and use it to determine which files have changed. The differential backup does not reset the archive bit information.
Incremental Backup
Some companies have a finite amount of time they can allocate to backup procedures. Such organizations are likely to use incremental backups in their backup strategy. Incremental backups save only the files that have changed since the last full or incremental backup. Like differential backups, incremental backups use the archive bit to determine which files have changed since the last full or incremental backup. Unlike differentials, however, incremental backups clear the archive bit, so files that have not changed are not backed up.
The faster backup
time of incremental backups comes at a price—the amount of time
required to restore. Recovering from a failure with incremental backups
requires numerous tapes—all the incremental tapes and the most recent full
backup. For example, if you had a full backup from Sunday and an incremental
for Monday, Tuesday, and Wednesday, you
would need four tapes to restore the data. Keep in mind that each tape in the
rotation is an additional step in the restore
process and an additional failure
point. One damaged
incremental tape, and you will
be unable to restore the data. Table 8.3
summarizes the various backup strategies.
Table 8.3 Backup Strategies
Backup Type | Advantages | Disadvantages | Data Backed Up | Archive Bit
Full | Backs up all data on a single tape or tape set. Restoring data requires the fewest tapes. | Depending on the amount of data, full backups can take a long time. | All files and directories are backed up. | Does not use the archive bit, but resets it after data has been backed up.
Differential | Faster backups than a full backup. | Uses more tapes than a full backup. The restore process takes longer than a full backup. | All files and directories that have changed since the last full backup. | Uses the archive bit to determine the files that have changed, but does not reset it.
Incremental | Faster backup times. | Requires multiple tapes; restoring data takes more time than the other backup methods. | The files and directories that have changed since the last full or incremental backup. | Uses the archive bit to determine the files that have changed, and resets it.
Tape Rotations
After you select a backup type, you are ready to choose a backup rotation. Several backup rotation strategies are in use: some good, some bad, and some really bad. The most common, and perhaps the best, rotation strategy is grandfather, father, son (GFS).
The GFS backup rotation is the most widely used, and for good reason. For example, a GFS rotation may require 12 tapes: four tapes for daily backups (son), five tapes for weekly backups (father), and three tapes for monthly backups (grandfather). Using this rotation schedule, you can recover data from days, weeks, or months earlier. Some network administrators choose to add tapes to the monthly rotation so that they can retrieve data even further back, sometimes up to a year. In most organizations, however, data that is a week old is out of date, let alone six months or a year.
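One way to picture the 12-tape example is to derive the tape label for any given date. The Python sketch below is only illustrative (real rotation schemes vary from site to site): it assumes the four daily "son" tapes cover Monday through Thursday, a weekly "father" tape is written each Friday, and a monthly "grandfather" tape replaces the weekly on the last Friday of the month.

```python
import calendar
from datetime import date

def gfs_tape(day: date) -> str:
    """Tape label for one day in a 12-tape GFS rotation (4 son, 5 father, 3 grandfather)."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    last_friday = max(d for d in range(1, days_in_month + 1)
                      if date(day.year, day.month, d).weekday() == 4)
    if day.weekday() == 4:                                    # Friday
        if day.day == last_friday:
            return f"Grandfather-{(day.month - 1) % 3 + 1}"   # monthly tape
        return f"Father-{((day.day - 1) // 7) % 5 + 1}"       # weekly tape
    if day.weekday() < 4:                                     # Monday through Thursday
        return f"Son-{day.weekday() + 1}"                     # daily tape
    return "No backup scheduled (weekend)"

for d in range(1, 9):
    day = date(2024, 3, d)
    print(day.strftime("%a %Y-%m-%d"), "->", gfs_tape(day))
```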
Backup Best Practices
Many details go into making a backup strategy a success. The following
are issues to consider as part of your backup plan:
. Offsite storage: Consider storing
backup tapes offsite so that in the event of
a disaster in a building, a current set of tapes
is still available offsite.
The offsite
tapes should be as current
as any onsite and should
be secure.
. Label tapes: The goal is
to restore the data as quickly as possible, and trying to find the tape you
need can be difficult if it isn’t marked.
Furthermore, this can prevent
you from recording
over a tape you need.
. New tapes: Like old cassette tapes, the tape cartridges used for backups wear out over time. One strategy used to prevent this from becoming a problem is to periodically introduce new tapes into the rotation schedule.
. Verify backups: Never assume that the backup
was successful. Seasoned administrators know that checking backup logs and performing periodic test restores are part of the backup process.
. Cleaning: You need to clean the tape drive
occasionally. If the inside gets dirty, backups can fail.
Hot and Cold Spares
The impact that a failed component has on a system or network depends largely on the predisaster preparation and on the recovery strategies used. Hot and cold spares represent a strategy for recovering from failed components.
Hot Spares and Hot Swapping
Hot spares allow system administrators to quickly recover from component failure. In a common use, a hot spare enables a RAID system to automatically fail over to a spare hard drive should one of the other drives in the RAID array fail. A hot spare does not require any manual intervention. Instead, a redundant drive resides in the system at all times, just waiting to take over if another drive fails. The hot spare drive takes over automatically, leaving the failed drive to be removed later. Even though hot-spare technology adds an extra level of protection to your system, after a drive has failed and the hot spare has been used, the situation should be remedied as soon as possible.
Hot swapping is the ability to replace a failed component while the
system is running. Perhaps the most commonly identified hot-swap component is
the hard drive. In certain RAID configurations, when a hard drive crashes, hot
swapping allows you to simply
take the failed
drive out of the server
and install a new one.
The benefits of hot swapping are very clear in that it allows a failed component to be recognized and replaced without compromising system availability. Depending on the system's configuration, the new hardware normally is recognized automatically by both the current hardware and the operating system. Nowadays, most internal and external RAID subsystems support the hot-swapping feature. Some hot-swappable components include power supplies and hard disks.
Cold Spares and Cold Swapping
The term cold spare refers to a component, such as a hard disk, that resides within a computer system but requires manual intervention in case of component failure. A hot spare engages automatically, but a cold spare might require configuration settings or some other action to engage it. A cold spare configuration typically requires a reboot of the system.
The term cold spare has also been used to refer to a redundant component that is stored outside the actual system but is kept on hand in case of component failure. To replace the failed component with a cold spare, you need to power down the system.
Cold swapping refers to replacing components only after the system is completely powered off. This strategy is by far the least attractive for servers, because the services provided by the server are unavailable for the duration of the cold-swap procedure. Modern systems have come a long way to ensure that cold swapping is a rare occurrence. For some situations and for some components, however, cold swapping is the only method to replace a failed component. The only real defense against having to shut down the server is to have redundant components reside in the system.
Hot, Warm, and Cold Sites
A disaster recovery plan might include the provision for a recovery site that can be brought into play quickly. These sites fall into three categories: hot, warm, and cold. The need for each of these types of sites depends largely on the business you are in and the funds available. Disaster recovery sites represent the ultimate in precautions for organizations that really need them. As a result, they don't come cheap.
The basic concept of a disaster recovery site is that it can provide a
base from which the company
can be operated during a disaster.
The disaster recovery
site normally is not intended to provide a desk for every employee. It’s intended more as a means
to allow key personnel to continue the core business
functions.
In general, a cold recovery site is a site that can be up and operational in a relatively short amount of time, such as a day or two. Provision of services, such as telephone lines and power, is taken care of, and the basic office furniture might be in place. But there is unlikely to be any computer equipment, even though the building might well have a network infrastructure and a room ready to act as a server room. In most cases, cold sites provide the physical location and basic services.
Cold sites are useful if you have some forewarning of a potential
problem. Generally speaking, cold sites are used by organizations that can
weather the storm for a day or two before they get back up and running. If you
are the regional office of a major
company, it might be possible to have one of the other
divisions take care of business until you are ready to go. But if you are the
one and only office in the company, you
might need something a little hotter.
For organizations with the dollars and the desire, hot recovery sites represent the ultimate in fault tolerance strategies. Like cold recovery sites, hot sites are designed to provide only enough facilities to continue the core business function, but hot recovery sites are set up to be ready to go at a moment's notice.
A hot recovery site includes phone systems with the phone lines already connected. Data networks also are in place, with any necessary routers and switches plugged in and turned on. Desks have desktop PCs installed and waiting, and server areas are replete with the necessary hardware to support business-critical functions. In other words, within a few hours, the hot site can become a fully functioning element of an organization.
The issue that confronts potential hot-recovery site users is simply that of cost. Office space is expensive in the best of times, but having space sitting idle 99.9 percent of the time can seem like a tremendously poor use of money. A very popular strategy to get around this problem is to use space provided in a disaster recovery facility, which is basically a building, maintained by a third-party company, in which various businesses rent space. Space is usually apportioned according to how much each company pays.
Sitting between the hot and cold recovery sites is the warm site. A warm site typically has computers, but they are not configured and ready to go. This means that data might need to be updated or other manual interventions might need to be performed before the network is again operational. The time it takes to get a warm site operational lands right in the middle of the other two options, as does the cost.