The purpose of this post is to confirm the confidence I have in RAID technology as expressed in the earlier post "06-RAID". It is occasioned by my recent plans to write a very different piece. As they say, "it's always darkest before the dawn".
Warning Signs
Summers can get pretty hot here in Auckland (average temperature 24 degrees Celsius, or 75 Fahrenheit to you Northies, with 99% humidity) so it's no simple matter to keep a computer cool.
Some time ago (more than a year) I found that I had to re-boot my systems much more often on hot days than on cool days.
Since then it occurred to me that these seemingly unrelated faults might be the result of systems overheating.
Google "computer heat-related faults" and you will find a surprisingly large and diverse range of symptoms: slower system performance, programs not running when clicked, and web browsers locking up are all listed, and they were the sort of problems I was having that would prompt a reboot. The solution, it appeared, would be to add a few more internal 12-volt fans.
I did ask my wife if we could turn her sewing room into an air-conditioned computer room with racks and a raised floor - but she said "No". Women - go figure.
Instead I added more and more internal fans to my 2 boxes, and while this seemed to alleviate the faults, it also made the room a lot noisier.
In an effort to reduce the office noise levels I had a cabinet built to house both of my tower units, and I included several ventilation holes that I hoped would be sufficient.
The result was actually a large desk-shaped oven. The multiple 6-inch vent holes weren't nearly enough to extract the trapped heat, so it just got hotter and hotter. I had to take off the cabinet doors and even the PC covers and then direct a large office fan at them to bring the operating temperature down to a safe level. In the end, for a time, the office was hotter and noisier than ever. (More about this in our next article.)
I mention my heat problems because for a long time I attributed all of my "unattributable" faults to it. In particular, the one that bugged me the most was the loss of my RAID array.
The first time I noticed it was a bit of a shock, but after some investigation and a few reboots it became clear that the array hadn't disappeared; it had only failed to mount after a reboot.
The boot drive ("C:\") is a separate physical HDD, so the system boots up fine; it just has no "D:\" drive.
No error message was produced and in fact, the report from the HighPoint RAID management system told me that the array was "Normal" apart from the number 2 disk running a bit hot.

In order to recover the folders I had to power down the box and reboot at least once, and sometimes more, to get the virtual "D:\" drive back on line.
I sent off a note to the HighPoint group via their support web page and got an email back saying their support guy was on vacation, with an alternate address to contact. No reply came back from the alternate address.
I used the support page to request an update on the status of my fault report, and shortly thereafter I got an email saying that my trouble ticket had been updated - I logged back in to the support site only to find that the "update" they had notified me about was my own query asking them for an update.
Around this time I'd been posting my problem to my various forums, and one kind reader wrote to point out that if I really wanted to back up a 1.5TB RAID array, I'd need a 1.5TB backup disk to do it. He was right of course, but it was a depressing kind of right - there is a good measure of fault tolerance built into the RAID software, but it is fault "tolerance", not fault "proof". If you lose a second disk before you can replace the first failure, you will lose the array.
The lack of progress on this issue and a growing sense of frustration with the supplier drove me to consider an article on "the failure of RAID technology and its suppliers" - fortunately a lucky-unlucky break intervened.
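To make the "tolerance, not proof" point concrete, here is a minimal sketch of the XOR parity idea that RAID 5 is built on. It is an illustration only, not HighPoint's implementation: the three data blocks and the `xor_blocks` helper are made up for the example, and a real controller also stripes data and rotates parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks (one per data disk) plus one parity block.
data = [b"disk", b"fail", b"safe"]
parity = xor_blocks(data)

# Lose any ONE block: XOR-ing the survivors with the parity rebuilds it.
lost = 1
survivors = [block for i, block in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[lost]
```

With a single parity block there is only one equation per stripe, so it can solve for one missing block. Lose two blocks and there are two unknowns, which is why a second failure before the rebuild finishes takes the whole array with it.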
"We had to destroy the village in order to save it"
Have you ever had an intermittent fault on a system that you couldn't pin-point, but you knew it was in a particular subsystem, so you just whacked the subsystem with a hammer to get the whole thing replaced?
Fortunately for me, fate held the hammer this time.
As disk failures go, the "head crash" (see Wikipedia) has to be one of the most dramatic. It's a catastrophic hardware fault that occurs when a read-write head (which works like the needle on a turntable) comes into contact with the surface of the disk's platter, which is spinning around at 7,200 RPM.
On February 1st a noise that sounded very much like a high-speed dentist drill came screaming from my PC - checking the RAID Management page I could see that the number 2 drive had indeed failed. (I've put a sound file on my website if you want to hear it: Baracuda Head Crash.wav)
Having secured a replacement drive (a Western Digital), I had a go at getting it integrated into my system, but the first attempt failed miserably ("no available drive found"). After figuring out that the drive had to be formatted first, it took only a minute to install, and then another 8 hours to rebuild the drive back into the array, restoring my system to peak performance.

Since February 1, and with additional system cooling modifications, both servers have been running well, although I still can't close the cabinet doors. My confidence in RAID technology is solidified, and I'm very happy to re-recommend a RAID 5 solution for any situation that requires a large logical drive for optimum disk utilization and data protection with a lower cost of ownership profile than simply doubling the number of disks.
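The cost argument comes down to simple arithmetic, sketched below. The disk count and size are hypothetical figures picked for the example, not my actual configuration, and the two helper functions are mine rather than anything from a RAID tool.

```python
def raid5_usable_gb(disks, size_gb):
    """RAID 5 gives up one disk's worth of space to parity."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (disks - 1) * size_gb


def mirror_usable_gb(disks, size_gb):
    """Mirroring (doubling the disks) gives up half the space."""
    if disks < 2 or disks % 2:
        raise ValueError("mirroring needs an even number of disks")
    return (disks // 2) * size_gb


# Hypothetical example: four 500 GB disks.
raid5 = raid5_usable_gb(4, 500)     # 1500 GB usable
mirror = mirror_usable_gb(4, 500)   # 1000 GB usable
print(raid5, mirror)
```

For the same four disks, RAID 5 leaves three disks' worth of usable space where mirroring leaves two, and the gap widens as the array grows.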
One down, One to go
Bolstered by the success with the disk failure, I pushed ahead for a solution to the disappearing drive problem and sent another email off to HighPoint. I got a note back from them directing me to their Chinese website to download the latest drivers, BIOS, and web management tools. It sounded like a fob-off to get me off their backs, but those basic steps, even if they never seem to work, must be undertaken in order to move on to the next step.
On March 1 I found the driver, BIOS, and application files on their website, and they were indeed different from the ones I'd obtained earlier from their US website. (Why didn't they just update the US files themselves?)
I installed the 4 new files and I guess it must have worked -- 4 or 5 reboots since the install and not a missing drive in sight!
I won't claim a final victory, however. As with any currently accepted scientific theory, it's only true until it's not.