History
RAID of all
types used to be confined to the realms of
enterprise servers,
mission critical IT equipment and very specialised applications.
They were exclusively SCSI-based solutions, and while RAID arrays
could be created in software, that was a cost-cutting shortcut no
serious business considered. Hardware SCSI RAID solutions with
copious quantities of cache (for a RAID card) were the norm.
Outside of the larger enterprises' SAN and NAS mass storage setups,
smaller businesses relied on RAID more for redundancy (RAID
1/mirroring) than speed (RAID 0/striping).
Then RAID
reached IDE and, later, SATA. Motherboard manufacturers
started incorporating “soft” RAID chips into boards
targeted at the SOHO market, bringing RAID 0, RAID 1 - and quite
commonly combinations of the two - to the masses. Newer versions
of Microsoft Windows - like XP Pro - even supported completely
software controlled RAID (under certain conditions) without an
onboard/PCI Highpoint or Promise type controller chip.
Manufacturers of high-end PCs, like Poweroid
in the UK, first started offering RAID in SOHO systems a few years
ago. Other system integrators and VARs jumped onto the RAID
bandwagon to differentiate their performance products from their
run of the mill beige boxes.
These SOHO
RAID solutions were limited to striping and mirroring, and the
general consensus among erudite consumers was that striping two
hard disks into a larger volume meant more speed, while using one
disk as a mirror provided protection against data loss. These
generalisations are largely myths. In our experience - and
the stats we’ve collected
from our customers - those with RAID 1 are marginally more likely
to lose data than those without any RAID at all. The even more
startling fact to emerge from our stats was that those with RAID 0
are six times more likely to suffer data loss than customers with
no RAID array in their PCs. We
examine these curious findings here.
Does
technology's Go-Faster Stripe actually add speed?
The issue with RAID 0 has always been that splitting data across
two hard disks roughly doubles the chances of data loss via hard
disk failure. It is the logical downside of a single striped
volume spanning two physical drives: if either disk fails, no
data is recoverable. The risk rises as more drives are added to
the array. Don't let the "R" in RAID mislead you - in a RAID
0 configuration there is no redundancy.
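To put rough numbers on that rising risk, here is a minimal Python
sketch. It assumes every disk fails independently and with the same
probability over the period you care about, which is a
simplification, and the 2% figure in the demo is purely
illustrative. The function name `array_failure_probability` is our
own invention for this example.

```python
def array_failure_probability(p_disk: float, n_disks: int) -> float:
    """Chance that a RAID 0 array loses data, i.e. at least one disk fails.

    Assumes each disk fails independently with the same probability
    p_disk over the period of interest (an illustrative simplification).
    """
    if not 0.0 <= p_disk <= 1.0:
        raise ValueError("p_disk must be between 0 and 1")
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    # The array survives only if every single disk survives.
    return 1.0 - (1.0 - p_disk) ** n_disks


if __name__ == "__main__":
    p = 0.02  # hypothetical per-disk failure probability, for illustration
    for n in (1, 2, 4):
        print(f"{n} disk(s): {array_failure_probability(p, n):.2%}")
```

Whatever per-disk figure you plug in, the result only ever goes up
as disks are added to the stripe.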
Given the higher risk of data loss, RAID 0's only attraction
remained its perceived speed advantage, and faster speeds are always
welcome. It had long been the gripe of video editors that standard
hard disks weren't fast enough for video. When technology
improvements like higher spindle speeds, larger caches, Tagged
Command Queuing (TCQ) etc. brought phenomenal speed increases to
the storage arena, video editors complained that speeds still
weren't sufficient for advanced video work and the handling of
higher quality footage - like 10 bit video. Video isn't the
only application that takes all the speed that's thrown at it and
asks for more. Lots of other applications could use more speed
from the IDE subsystem. Since storage speed improvements just
haven't matched improvements in other areas like CPUs and GPUs -
and disk reading and writing is still the bottleneck in most
modern PCs - SIs, VARs and even home PC users have been flocking
to the technological amphetamine that is RAID 0 instead of
spending some time learning how
they can optimise their hard disk performance.
What made
striping even more attractive was the scalability of the
technology. In theory throughput keeps getting faster and faster
just by adding more drives to an array.
But very
little of that is true. RAID 0 does not always make for more
speed. In
fact striping may not make the blindest bit of difference to the
speed of the average home PC!
Reputable technical websites like StorageReview have often
commented that the gains of RAID 0 over a single hard disk are
minimal at best. Test after test by some of the most respected
technical websites has shown that RAID 0 does not significantly
improve desktop performance. Not even with the far higher risk of
a four disk RAID 0 array. Seriously! Very few home users are
aware of these technical studies, and those who are very
often pooh-pooh the idea that the RAID configuration they spent a
lot of money on is not actually running faster.
If the
claim that RAID 0 is not all it's cracked up to be sounds
illogical then it's worth taking the time to read the reviews. A
search in Google should lead you to them. Except for a few limited
high I/O activities like video editing - and the typical application
benchmark - the speed gains are almost non-existent. For
the average PC user RAID 0 is as useful as a rear spoiler on an
800cc car. It looks good, it sounds impressive but it don't do
nuffin'.
RAID 0
has been stripped (apologies for the pun) of its only virtue, speed.
If striping increases the risk of data loss but provides no speed
gains worth writing home about - why is it still so popular? Well,
myths are not easily dispelled. Marketing gumph designed to sell
mobos with RAID still boasts about massive speed gains to be achieved.
Hard disk retailers would rather sell two disks than one. And
there's always product differentiation - our PC is faster
because it has RAID. Users are rarely told about the other risks
because risk warnings don't shift stock.
Other risks? Yes, there are other risks. Isn't drive failure
considered a fairly unlikely mishap? The other risks
are even bigger monsters and tilt the risk-reward ratio
firmly towards not using RAID. But hardly anyone seems to know
about these risks so they don't get discussed often. More
later...
RAID 1 -
Give up the ghost?
The other
ubiquitous RAID is the mirrored array offering redundancy and the
safety of an up-to-the-minute
backup drive to take over seamlessly
in the event of a drive failure. So the "R" in RAID does
justify its presence in a RAID
1 array.
Does RAID
1 do what it says on the tin? Yes, it does. It protects against one
of the limitations of the RAID 0 array - hard disk failure. Just
to be petty we'll elaborate: it protects against one hard disk
failing.
There is
one other issue with RAID 1, and we'll go so far as to say it's a
risk. RAID 1 users are less likely to take regular backups or
ghosts/images of their system. A certain complacency seems to set
in when there's the perceived security of a mirrored drive.
There's the assumption that, come what may, a backup exists.
Except, of course, that it is not a backup.
The same
feature that provides the protection can also be the user's
downfall. RAID 1 maintains a faithful copy on the second disk of
everything that's on the first. Warts and all. Mistakes
made, files irrecoverably deleted, virus damage, shredded files
etc. are all duplicated on the second disk. Users tend to forget
that RAID 1 does not protect against errors; it protects only
against one disk going faulty.
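A toy model makes the point. The Python sketch below is not how any
real controller works; the `MirroredVolume` class and its methods
are hypothetical names made up for illustration. It simply applies
every write and delete to both disks, which is the essence of
mirroring.

```python
from typing import Dict, List, Optional


class MirroredVolume:
    """Toy two-disk RAID 1 model: every operation is applied to both disks."""

    def __init__(self) -> None:
        # Each disk is a dict of filename -> contents; None marks a dead disk.
        self.disks: List[Optional[Dict[str, str]]] = [{}, {}]

    def write(self, name: str, data: str) -> None:
        for disk in self.disks:
            if disk is not None:
                disk[name] = data

    def delete(self, name: str) -> None:
        # The mirror faithfully repeats the deletion on every working disk.
        for disk in self.disks:
            if disk is not None:
                disk.pop(name, None)

    def fail_disk(self, index: int) -> None:
        self.disks[index] = None

    def read(self, name: str) -> Optional[str]:
        for disk in self.disks:
            if disk is not None and name in disk:
                return disk[name]
        return None


if __name__ == "__main__":
    survivor = MirroredVolume()
    survivor.write("accounts.xls", "this year's figures")
    survivor.fail_disk(0)                 # hardware failure: mirror saves the day
    print(survivor.read("accounts.xls"))

    victim = MirroredVolume()
    victim.write("accounts.xls", "this year's figures")
    victim.delete("accounts.xls")         # user error: gone from both disks
    print(victim.read("accounts.xls"))
```

The first case is the one RAID 1 was built for. The second is the
one a separate backup is for.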
There are
other instances where the RAID 1 insurance can provide very little
protection, including in disasters caused by fire, theft or
vandalism. RAID 1 is no substitute for regular backups.
Even better, don't give up the ghosting till you have
proper disaster
recovery planning in place.
Remember
the other risks referred to in the discussion of RAID 0? Most of
those risks do also apply to RAID 1, and they are covered in the
next section.
Risky
Array of Independent Disks
RAID 1
does not protect against the unlikely eventuality of both drives
failing together. What are the odds of that happening? Modern
disks are very reliable; wouldn't it be almost unheard of for two
drives to die at the same time? No, it's not! Hard drive failure
results not just from faulty manufacturing or wear and tear.
Drives can fail as a result of other components being faulty. Such
disasters can and often will take both drives.
Further,
the reliability of the modern hard disk is exaggerated. The quoted
Mean Time Between Failures (MTBF) for the average modern IDE drive
is about 1,000,000 hours. The MTBF for a system with 2 disks, A
and B, striped is 1/(1/MTBF A + 1/MTBF B), or
500,000 hours - around 57 years of continuous running. That's
almost invincible! In the real world
however, we see approximately 2% of disks go faulty in the first
24 months. That gives the two-drive user roughly a 4% chance of a
disk failure over the same period. The extra risks from the RAID
controller failing and from external faults like defective PSUs,
power surges, shock damage etc. then have to be added on top. If
scientifically done this may push the
chances of a failure up to 10.27% or 8.43% or some other
"exact" figure depending on how the stats are compiled
...but it will be higher than 4%.
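For anyone who wants to check the arithmetic, here is a short Python
sketch. `series_mtbf` applies the MTBF formula quoted above, while
`simple_estimate` and `compounded_estimate` (hypothetical helper
names of our own) compare the back-of-envelope 2 x 2% figure with
the slightly lower one you get by assuming the two disks fail
independently. Neither accounts for controllers, PSUs or surges.

```python
def series_mtbf(*mtbf_hours: float) -> float:
    """Combined MTBF of a striped set, where any one disk failing kills the array."""
    if not mtbf_hours or any(m <= 0 for m in mtbf_hours):
        raise ValueError("supply one or more positive MTBF values")
    # Failure rates (1/MTBF) add up; the combined MTBF is the reciprocal.
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


def simple_estimate(p_disk: float, n_disks: int) -> float:
    """Back-of-envelope estimate: just add the per-disk failure chances."""
    return p_disk * n_disks


def compounded_estimate(p_disk: float, n_disks: int) -> float:
    """Chance of at least one failure, assuming disks fail independently."""
    return 1.0 - (1.0 - p_disk) ** n_disks


if __name__ == "__main__":
    mtbf = series_mtbf(1_000_000, 1_000_000)
    print(f"Quoted MTBF, two striped disks: {mtbf:,.0f} hours")
    print(f"Simple estimate:     {simple_estimate(0.02, 2):.2%}")
    print(f"Compounded estimate: {compounded_estimate(0.02, 2):.2%}")
```

The two estimates differ only in the second decimal place at these
failure rates, which is why simply doubling the figure is good
enough for our purposes.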
So far
we've seen that the risks of a drive failing are a lot higher than
MTBF figures suggest. But the biggest risks are not hardware
failures.
By far
the largest number of PCs (using RAID) that are returned as faulty
have perfectly working disks, controllers with no fault, PSUs
pumping out the right voltages to the right places etc. Yet the
user has lost all data and the Windows installation to boot (not
another pun?!).
Why? From
our survey of a sample of our customers here's how it tends to
happen:
The first
and foremost risk is that the RAID BIOS loses the information it
stores to track the allocation of the drives. We've seen this
caused by all manner of software, particularly anti-virus programs.
Caught in time, a simple recreation of the array (see last page)
resolves the problem in over 90% of the cases.
BIOS
changes, flashing the BIOS, resetting the BIOS, updating firmware
etc. can cause an array to fail. BIOS changes don't only happen
when someone hits Delete to enter setup. Software can make changes
to the BIOS too.
Disk
managers, hard disk utilities, imaging and partitioning software
etc. can often confuse a RAID array.
Reinstalling
operating systems on top of existing installations or trying to
repair a Windows installation by reinstalling the OS can cause
problems.
And the
#1 cause of data loss - drum roll here - is user error. Very often
users panic at the "insert boot disk" message. Panic
causes users to make errors in recovering their PC to a fully
working state. Staying cool is the key.
Protection
and Recovery
In a
nuclear attack the accepted advice is to crawl under a table,
stick your head between your knees, and kiss your ass goodbye.
Slightly more helpful for RAID problems are these guidelines:
Right at
the start and before installing the operating system it is worth
playing around with the RAID BIOS, creating, deleting and
rebuilding arrays. Different makes of RAID BIOSes have different
setup screens. It pays to be very familiar with these screens and
their options - including with allocating disks to arrays, and
repairing arrays. Not to forget getting in and out of the RAID
BIOS setup screen. Make notes of these screens, take screenshots,
and keep them handy. When disaster strikes the last thing you'll
want is uncertainty about a certain option. "What happens if
I hit this key?" is a question you want to know the answer to
before the problem occurs.
Handle with
kid gloves. Trying to recover from a RAID problem is not for
the faint-hearted. Just one or two wrong keystrokes could cause
complete data loss. Any user attempts at data
recovery can cause further damage to the data on the drive(s)
and reduce the chances of a third party data recovery expert being
able to help.
So be careful.
At the first
signs of a problem, analyse the situation before doing anything
drastic like rebuilding/repairing the RAID array, reallocating
disks to the array or deleting the array and re-creating it. Is
one of the hard disks dead? Or is it just a loose cable or a
faulty power connector? Once external issues like
connections and cables have been excluded, the next step is
to enter the RAID BIOS. For onboard RAID controllers the
motherboard manual usually provides instructions on accessing the
RAID BIOS; often instructions like "Press Control + F to enter
RAID BIOS" flash past on the screen during POST.
Controller
faults: If the controller has died the chances are that none of
the data on the drive has been affected. Replacing the controller
should allow for complete recovery of all data.
If the
array has been lost it is still usually possible to recover all
the data provided the steps of creating the original array are
followed meticulously, especially with regard to choosing the
RAID type and allocating the drives to the array.
RAID
arrays are sensitive beasts. RAID was designed for a server
environment where any software that is installed on the system is
A) generally server grade software and validated for use on that
configuration and B) installed by professional IT personnel. RAID
is not tolerant of user error, and it is far from tolerant enough
to take the kind of software/game/freeware/pirated software abuse
hurled at the average home PC.
What about
Intel's ICH6R Southbridge with its new Matrix RAID combining RAID
0 and RAID 1 in a two disk array? It's a great idea and initial
impressions are good, but as it's a relatively new technology we'll
have to wait to see how users get on with it. What about using
RAID 0+1 or RAID 1+0 (despite the apparent similarities they are
not the same thing at all)? Or RAID
2 with its funny Hamming codes? Or RAID 5 with its distributed
parity? These solutions dig deeper and deeper into enterprise
territory so they're fodder for another day.