Monday, March 24, 2008

Virtualization Is Hot, But Don't Get Burned

Virtualization is getting a lot of buzz right now, and for good reason. I think it offers some real benefits in a variety of uses. For some time now I've been trying to write an article for the Alpha VAR newsletter covering the benefits most relevant to that audience, but I keep getting pulled away. Rather than keep waiting, I want to share one important point I was going to make - especially since it just bit me, even though I knew about it in advance.

One of the really useful things about virtualization is that your entire VM is usually just 2-5 files. This makes it easy to back up an entire system or move it to a different host system.

It also makes it really easy to get yourself into trouble. Of those few files, one of them is typically the virtual disk for the virtual machine. That one file represents the entire hard drive, so any corruption to it means none of your VM is usable.

Normally, the benefits far outweigh that downside because the VM is so easy to back up. And everyone backs up, right?

Well, last week, in the midst of all of the scurrying here to release Alpha Five Platinum, one of our customer service reps' machines died. It wouldn't power on in the morning. After swapping the power supply, power switch and power cord, and some other troubleshooting, we determined that the motherboard was fried. The disk was still good though, so we were in pretty good shape.

For all the reasons that I haven't written about in that elusive article yet, James' PC was a perfect candidate for virtualization. We created a new VM, put his old physical disk into another system and fired up Ghost on both to clone the physical disk to the new virtual disk.

There are a number of ways to go from physical machine to virtual, but we have had a 100% success rate with this method. That was, until last week anyway. After cloning and firing up the VM, the Windows XP install went into the endless blue screen, reboot, blue screen cycle. When we finally got a screen shot of the blue screen, we saw that there was a hardware conflict. Something about the failed physical system was too different from the virtual hardware.

After a bunch of trial and error (and blue screens), we found a system in our QA lab that was apparently close enough to the original hardware to get XP booted. Now, with a running Windows instance, we could use VMware Converter to do the physical-to-virtual (P2V) conversion, since it is much better at dealing with hardware subtleties.

Finally, after running overnight, VMware Converter gave us a usable VM for James. We had already lost a lot of valuable time and needed to get back to work on Alpha Five, so we made a conscious decision to take a calculated risk and not set up proper snapshotting and backups for this new VM.

And in case you can't guess what happened next, the hard drive in the host system for James' VM began to fail.

We gambled. We knew we were gambling. We lost.

Only a few sectors on the drive are damaged. The host needs a new disk, but fortunately only one file is affected. That one file is James' 28GB virtual disk file. Any attempt to copy it, move it, read it or repair it results in a CRC error.

As I write this, I'm waiting for yet another repair attempt to complete. The last one said it found and fixed everything, but then the move to the replacement hard drive failed at 96%. I hope this one works and if the timer on the progress meter that I've been staring at for entirely too long is accurate, I'll know in another 8 minutes.

If we don't get this virtual disk file repaired and usable, we still have the old physical disk to go back to. In that case, James will have magically jumped back about a week in time and lost all the work he's done since. Well, not everything - we do have backups of his documents, email, etc. - but getting this VM repaired is still the best way to go.

So all the old warnings about backing up that have historically applied to physical machines apply to virtual machines as well. With the corruption of one single file, we're potentially looking at the loss of nearly 30GB, not to mention all of the time lost today when we could have simply copied a single file from a backup archive if we had gone ahead and set that up last week.
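
For what it's worth, here is a rough sketch of the kind of backup step we skipped. It is not what we actually run here, just a minimal illustration in Python under a few assumptions: the VM is powered off or suspended while its files are copied, all of its files sit in one folder, and the backup destination is on a different physical disk. The function names `sha256_of` and `backup_vm` are made up for this example. It copies each file and then compares checksums, so a bad copy is caught at backup time rather than at restore time.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte virtual disk never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_vm(vm_dir, backup_dir):
    """Copy every file in vm_dir to backup_dir and verify each copy by checksum.

    Assumes the VM is not running while the copy takes place.
    Returns a dict mapping file name to its SHA-256 hex digest.
    Raises IOError if a copied file does not match its source.
    """
    vm_dir, backup_dir = Path(vm_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    checksums = {}
    for src in sorted(vm_dir.iterdir()):
        if not src.is_file():
            continue  # this sketch only handles a flat folder of VM files
        dst = backup_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        src_hash = sha256_of(src)
        if sha256_of(dst) != src_hash:
            raise IOError(f"backup copy of {src.name} does not match the original")
        checksums[src.name] = src_hash
    return checksums
```

Calling something like `backup_vm("D:/VMs/james-xp", "E:/Backups/james-xp")` (both paths are hypothetical) would copy the handful of VM files and return their checksums, which could be saved alongside the backup and rechecked later to spot a copy that has gone bad.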
