RAID - not such a clever idea for a home PC

Poweroid are specialists in the manufacture of Quiet PCs, Video Editing Workstations and Dual Processor machines

This article examines the reasons why RAID may not be a good idea for the home PC. It explores the main issues users have with RAID and looks at some of the solutions.

History

RAID of all types used to be confined to the realms of enterprise servers, mission-critical IT equipment and very specialised applications. It was exclusively SCSI-based, and while RAID arrays could be created via software, that was a piece of penny-pinching no serious business considered. Hardware SCSI RAID solutions with copious quantities of cache (for a RAID card) were the norm. Outside the larger enterprises’ SAN and NAS mass storage setups, smaller businesses relied on RAID more for redundancy (RAID 1/mirroring) than speed (RAID 0/striping).

Then RAID reached IDE and, later, SATA. Motherboard manufacturers started incorporating “soft” RAID chips into boards targeted at the SOHO market, bringing RAID 0, RAID 1 - and quite commonly combinations of the two - to the masses. Newer versions of Microsoft Windows - like XP Pro - even supported completely software-controlled RAID (under certain conditions) without an onboard/PCI Highpoint or Promise type controller chip. Manufacturers of high-end PCs, like Poweroid in the UK, first started offering RAID in SOHO systems a few years ago. Other system integrators and VARs jumped onto the RAID bandwagon to differentiate their performance products from their run-of-the-mill beige boxes.

These SOHO RAID solutions were limited to striping and mirroring and the general consensus among erudite consumers was that striping two hard disks into a larger volume meant more speed while using one disk as a mirror provided a protection against data loss. These generalisations are largely myths. From our experience - and the stats we’ve collected from our customers - those with RAID 1 are marginally more likely to lose data than those without any RAID at all. The even more startling fact to emerge from our stats was that those with RAID 0 are six times more likely to suffer data loss than customers with no RAID array in their PCs. We examine these curious findings here.

Does technology's Go-Faster Stripe actually add speed?

The issue with RAID 0 has always been that splitting data across two hard disks inevitably resulted in doubling the chances of data loss via hard disk failure. It is the logical downside to a single striped volume spanning two physical drives that if either disk fails no data is recoverable. The risk rises as more drives are added to the array. Don't let the "R" in RAID mislead - in a RAID 0 configuration there is no redundancy.
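The lack of redundancy can be illustrated with a toy model. The following is a minimal sketch, assuming data is dealt out in fixed-size chunks to alternating disks; the function names and the 4-byte chunk size are illustrative assumptions and do not describe how any particular controller works.

```python
def stripe(data: bytes, n_disks: int = 2, chunk: int = 4) -> list:
    """Deal fixed-size chunks of data out to n_disks in round-robin order."""
    disks = [[] for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks].append(data[i:i + chunk])
    return disks


def read_back(disks: list) -> bytes:
    """Reassemble the volume; raise if any member disk is missing (None)."""
    if any(d is None for d in disks):
        raise IOError("a member disk has failed: the striped volume is unreadable")
    out = []
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


data = b"holiday photos and tax returns"
disks = stripe(data)
assert read_back(disks) == data   # both disks healthy: data comes back intact

disks[1] = None                   # simulate one drive dying
try:
    read_back(disks)
except IOError as err:
    print(err)
```

Every other chunk of the data lives on the failed drive, so the surviving disk holds only fragments. Adding more disks to the model gives each file more drives to depend on.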

With its higher risk of data loss, RAID 0’s only attraction remained its perceived speed, and faster speeds are always welcome. It had long been the gripe of video editors that standard hard disks weren't fast enough for video. When technology improvements like higher spindle speeds, larger caches, Tagged Command Queuing (TCQ) etc. brought phenomenal speed increases to the storage arena, video editors complained that speeds still weren't sufficient for advanced video work and the handling of higher quality footage - like 10 bit video. Video isn't the only application that takes all the speed that's thrown at it and asks for more. Lots of other applications could use more speed from the IDE subsystem. Since storage speed improvements just haven't matched improvements in other areas like CPUs and GPUs - and disk reading and writing is still the bottleneck in most modern PCs - SIs, VARs and even home PC users have been flocking to the technological amphetamine that is RAID 0 instead of spending some time learning how they can optimise their hard disk performance.

What made striping even more attractive was the scalability of the technology. In theory throughput keeps getting faster and faster just by adding more drives to an array.

But very little of that is true. RAID 0 does not always make for more speed. In fact striping may not make the blindest bit of difference to the speed of the average home PC!

Reputable technical websites like StorageReview have often commented that gains on RAID 0 vs a single hard disk are minimal at best. Test after test by some of the most reputable technical websites has shown that RAID 0 does not significantly improve desktop performance. Not even with the far higher risk of a four disk RAID 0 array. Seriously! Very few home users tend to be aware of all these technical studies, and those who are very often pooh-pooh the idea that the RAID configuration they spent a lot of money on is not actually running faster.

If the claim that RAID 0 is not all it's cracked up to be sounds illogical then it's worth taking the time to read the reviews. A search in Google should lead you to them. Except for a few limited high I/O activities like video editing - and the typical application benchmark - the speed gains are almost non-existent. For the average PC user RAID 0 is as useful as a rear spoiler on an 800cc car. It looks good, it sounds impressive but it don't do nuffin'.

RAID 0 has been striped (pun apologies) of its only virtue, speed. If striping increases the risk of data loss but provides no speed gains worth writing home about - why is it still so popular? Well, myths are not easily dispelled. Marketing gumph designed to sell mobos with RAID still boasts about the massive speed gains to be achieved. Hard disk retailers would rather sell two disks than one. And there's always product differentiation - our PC is faster because it has RAID. Users are rarely told the other risks because risk warnings don't shift stock.

Other risks? Yes, there are other risks. Drive failure is generally considered an unlikely mishap; the other risks are even bigger monsters and tip the risk-reward ratio firmly against using RAID. But hardly anyone seems to know about these risks so they don't get discussed often. More on these later.

RAID 1 - Give up the ghost?

The other ubiquitous RAID is the mirrored array, offering redundancy and the safety of an up-to-the-minute duplicate drive to take over seamlessly in the event of a drive failure. So the "R" in RAID does justify its presence in a RAID 1 array.

Does RAID 1 do what it says on the tin? Yes, it does. It protects from one of the limitations in the RAID 0 array - hard disk failure. Just to be petty we'll elaborate: It protects from one hard disk failing.

There is one other issue with RAID 1, and we'll go so far as to say it's a risk. RAID 1 users are less likely to take regular backups or ghosts/images of their system. A certain complacency seems to set in when there's the perceived security of a mirrored drive. There's the assumption that come what may ...a backup exists. Except, of course, that it is not a backup.

The same feature that provides the protection can also be the user’s downfall. RAID 1 maintains a faithful copy on the second disk of everything that’s on the first. Warts and all. Mistakes made, files irrecoverably deleted, virus caused issues, shredding etc are all duplicated on the second disk. Users tend to forget that RAID 1 does not protect against errors, it protects only against one disk going faulty.
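The difference between a mirror and a backup can be shown with a toy model. This is a minimal sketch, assuming each disk is just a dictionary of file names to contents; the `Mirror` class and its methods are hypothetical and stand in for no real controller or driver.

```python
import copy


class Mirror:
    """Toy RAID 1: every write or delete goes to all working disks at once."""

    def __init__(self):
        self.disks = [{}, {}]          # None marks a failed disk

    def _alive(self):
        return [d for d in self.disks if d is not None]

    def write(self, name, content):
        for d in self._alive():
            d[name] = content

    def delete(self, name):
        for d in self._alive():
            d.pop(name, None)

    def fail(self, index):
        self.disks[index] = None       # hardware failure of one member

    def read(self, name):
        alive = self._alive()
        if not alive:
            raise IOError("both disks have failed")
        return alive[0].get(name)


array = Mirror()
array.write("thesis.doc", "final draft")
backup = copy.deepcopy(array.disks[0])   # a separate backup taken at this point

array.fail(0)                            # the one event the mirror covers
print(array.read("thesis.doc"))          # still readable from the second disk

array.delete("thesis.doc")               # user error, applied to every copy
print(array.read("thesis.doc"))          # gone from the array
print(backup.get("thesis.doc"))          # the backup still has the file
```

The mirror survives the disk failure but reproduces the deletion; only the copy taken separately and earlier still holds the file.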

There are other instances where the RAID 1 insurance can provide very little protection, including in disasters caused by fire, theft or vandalism. RAID 1 is no substitute for regular backups. Even better, don't give up the ghosting till you have proper disaster recovery planning in place.

Remember the other risks referred to in the discussion of RAID 0? Most of those risks do also apply to RAID 1, and they are covered in the next section.

Risky Array of Independent Disks

RAID 1 does not protect against the unlikely eventuality of both drives failing together. What are the odds of that happening? Modern disks are very reliable; wouldn't it be almost unheard of for two drives to die at the same time? No, it's not! Hard drive failure results not just from faulty manufacturing or wear and tear. Drives can fail as a result of other components being faulty. Such disasters can and often will take both drives.

Further, the reliability of the modern hard disk is exaggerated. The quoted Mean Time Between Failures (MTBF) for the average modern IDE drive is about 1,000,000 hours. The MTBF for a system with 2 disks, A and B, striped is 1 / (1/MTBF_A + 1/MTBF_B), or 500,000 hours. That is some 57 years of continuous running - almost invincible! In the real world, however, we see approximately 2% of disks go faulty in the first 24 months. That would give the two drive user roughly a 4% chance of a disk failure over the same period. The extra risks from the RAID controller failing, external faults like defective PSUs, power surges, shock damage etc. can be added on top. If scientifically done this may push the chances of a failure up to 10.27% or 8.43% or some other "exact" figure depending on how the stats are compiled ...but it will be higher than 4%.
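The arithmetic can be checked with a short sketch. It assumes the drives fail independently of one another, which is an assumption of this example only; shared faults such as a defective PSU would make the real figure worse. The function names are illustrative.

```python
def series_mtbf(*mtbf_hours: float) -> float:
    """MTBF of a set where any single failure counts: 1 / sum(1 / MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


def any_failure_probability(p_single: float, n_disks: int) -> float:
    """Chance that at least one of n independent disks fails in the period."""
    return 1.0 - (1.0 - p_single) ** n_disks


# Two drives, each with a quoted MTBF of 1,000,000 hours
print(round(series_mtbf(1_000_000, 1_000_000)))

# Observed 2% failure rate per disk over the first 24 months
for n in (1, 2, 4):
    pct = any_failure_probability(0.02, n) * 100
    print(f"{n} disk(s): {pct:.2f}% chance of at least one failure")
```

Under these assumptions the two-drive figure works out to 3.96%, which rounds to the 4% quoted above, and a four-drive array comes to about 7.8%.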

So far we've seen that the risks of a drive failing are a lot higher than MTBF figures suggest. But the biggest risks are not hardware failures.

By far the largest number of PCs (using RAID) that are returned as faulty have perfectly working disks, controllers with no fault, PSUs pumping out the right voltages to the right places etc. Yet the user has lost all data and the Windows installation to boot (not another pun?!).

Why? From our survey of a sample of our customers here's how it tends to happen:

The first and foremost risk is that the RAID BIOS loses the information it stores to track the allocation of the drives. We've seen this caused by all manner of software, particularly anti-virus programs. Caught in time, a simple recreation of the array (see Protection and Recovery below) resolves the problem in over 90% of cases.

BIOS changes, flashing the BIOS, resetting the BIOS, updating firmware etc can cause an array to fail. BIOS changes happen not just by hitting delete to enter setup. Software can make changes to the BIOS.

Disk managers, hard disk utilities, imaging and partitioning software etc. can often confuse a RAID array.

Reinstalling operating systems on top of existing installations or trying to repair a Windows installation by reinstalling the OS can cause problems.

And the #1 cause of data loss - drum roll here - is user error. Very often users panic at the "insert boot disk" message. Panic causes users to make errors in recovering their PC to a fully working state. Staying cool is the key.

Protection and Recovery

In a nuclear attack the accepted advice is to crawl under a table, stick your head between your knees, and kiss your ass goodbye. Slightly more helpful for RAID problems are these guidelines:

Right at the start and before installing the operating system it is worth playing around with the RAID BIOS, creating, deleting and rebuilding arrays. Different makes of RAID BIOSes have different setup screens. It pays to be very familiar with these screens and their options - including with allocating disks to arrays, and repairing arrays. Not to forget getting in and out of the RAID BIOS setup screen. Make notes of these screens, take screenshots, and keep them handy. When disaster strikes the last thing you'll want is uncertainty about a certain option. "What happens if I hit this key?" is a question you want to know the answer to before the problem occurs.

Handle with kid gloves. Trying to recover from a RAID problem is not for the faint-hearted. Just one or two wrong keystrokes could cause complete data loss. Any user attempts at data recovery can cause further damage to the data on the drive/s and reduce the chances of a third party data recovery expert being able to help. So be careful.

At the first signs of a problem analyse the situation prior to doing anything drastic like re-building/repairing the RAID array, reallocating disks to the array or deleting the array and re-creating it. Is one of the hard disks dead? Or is it just a matter of a cable getting loose/faulty power connector? After external issues like connections and cables have been excluded the next step would be to enter the RAID BIOS. For onboard RAID controllers the motherboard usually provides instructions on accessing the RAID BIOS; often instructions like "Press Control + F to enter RAID BIOS" flash past the screen during POST.

Controller faults: If the controller has died the chances are that none of the data on the drive has been affected. Replacing the controller should allow for complete recovery of all data.

If the array has been lost it is still usually possible to recover all the data provided the steps of creating the original array are followed meticulously, especially with regards to choosing the RAID type and allocating the drives to the array.
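Recreating an array exactly as it was first built is far easier if the original settings were written down. The following is a minimal sketch of one way to keep such notes in a file. The `ArrayNotes` record and all of its field names are hypothetical, and the stripe size field is an assumption about what a given RAID BIOS asks for; record whatever options your own setup screens actually show.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ArrayNotes:
    """Hypothetical record of the choices made when the array was created."""
    raid_level: str                        # e.g. "RAID 0" or "RAID 1"
    disk_order: list                       # drives in the order they were allocated
    stripe_size_kb: Optional[int] = None   # assumed field; only relevant to striping
    bios_key: str = ""                     # key combination shown during POST


def save_notes(notes: ArrayNotes, path: str) -> None:
    """Write the notes to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(asdict(notes), fh, indent=2)


def load_notes(path: str) -> ArrayNotes:
    """Read the notes back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return ArrayNotes(**json.load(fh))


notes = ArrayNotes("RAID 0", ["SATA port 1", "SATA port 2"],
                   stripe_size_kb=64, bios_key="Ctrl+F")
path = os.path.join(tempfile.gettempdir(), "raid_notes.json")
save_notes(notes, path)
assert load_notes(path) == notes   # the record survives a round trip
```

Keep the file, or a printout of it, somewhere other than the array it describes; notes stored on the lost volume are no help when the volume is gone.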

RAID arrays are sensitive beasts. RAID was designed for a server environment where any software that is installed on the system is A) generally server grade software and validated for use on that configuration and B) installed by professional IT personnel. RAID is not tolerant of user error, and it is far from tolerant enough to take the kind of software/game/freeware/pirated software abuse hurled at the average home PC.

What about Intel's ICH6R Southbridge with its new Matrix RAID combining RAID 0 and RAID 1 in a two disk array? It's a great idea and initial impressions are good, but as it's a relatively new technology we'll have to wait to see how users get on with it. What about using RAID 0+1 or RAID 1+0 (despite the apparent similarities they are not the same thing at all)? Or RAID 2 with its funny Hamming codes? Or RAID 5 with its distributed parity? These solutions dig deeper and deeper into enterprise territory so they're fodder for another day.