[coyotos-dev] RAID sucks
Jonathan S. Shapiro
shap at eros-os.com
Sun Jul 22 21:13:28 EDT 2007
We have suffered a minor setback. Not too bad, now that we have it fully
diagnosed, but it was good for two nights of lost sleep, and it is a
cautionary tale worth sharing.
The marketing droids would have you believe that RAID is an answer to
all of your storage wrongs. Bullshit. The most common cause of drive
failure in modern enclosures is heat. When multiple drives share an
enclosure, they tend to fail at about the same time. We experienced this
on Friday night, when two drives died within about three hours of each
other.
The drives took with them our primary office server, which implements
our source code repository, our primary backup storage, and our in-house
wiki. Fortunately, I am a Very Paranoid Fucker(TM), so we will be fully
recovered by Wednesday.
We have two RAID servers in our office. Each is equipped with 2TB of
disk storage, configured as RAID 1 (mirroring), so effectively 1TB each.
The primary
machine, "office.eros-os.com", provides our Wiki, our issue tracking
system, our SCM repository, and our backup storage. We back up our
laptops at least once a day using an rsync script that does a full
system backup but uses delta tricks, so each backup costs only about as
much storage as an incremental one. These go to a directory tree on
office that keeps a
history of the last 7 backups for any given client machine. [Since the
repository is append-only, this gives us a pretty robust scheme for
that.]
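
For readers who want to try the same idea, here is a minimal sketch of one
common way to get full-looking snapshots at incremental storage cost. It
assumes rsync's --link-dest option, which hard-links unchanged files to the
previous snapshot; the post does not say which "delta trick" the actual
script uses. The function name, host name, and paths are hypothetical.

```python
from pathlib import PurePosixPath


def build_rsync_command(source, snapshot_root, new_name, previous_name=None):
    """Build the rsync argv for one snapshot (hypothetical helper).

    source        -- what to back up, e.g. "laptop:/" (made-up example)
    snapshot_root -- absolute directory holding one subdirectory per snapshot
    new_name      -- name of the snapshot directory to create
    previous_name -- name of the most recent snapshot, or None for the first
    """
    root = PurePosixPath(snapshot_root)
    # -a preserves permissions, times, and symlinks; --delete makes the new
    # snapshot mirror the source exactly.
    cmd = ["rsync", "-a", "--delete"]
    if previous_name is not None:
        # Files unchanged since the previous snapshot become hard links to
        # it, so only changed files consume new disk space.
        cmd.append(f"--link-dest={root / previous_name}")
    cmd += [source, f"{root / new_name}/"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass it to subprocess.run()
    # once the paths are correct for your own setup.
    print(" ".join(build_rsync_command(
        "laptop:/", "/backups/laptop", "2007-07-20", "2007-07-19")))
```

Each snapshot directory then looks like a complete copy of the client, and
deleting an old snapshot only frees the files that no newer snapshot still
links to.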
The second server, "doomsday", does a complete snapshot of office three
times a day. It keeps 252 snapshots, which amounts to 84 days, or
roughly three months.
These also use the rsync incremental trick, so the storage required
isn't that large. Doomsday is not externally accessible at all. The
*only* thing it runs is the second-tier backups. I named it "doomsday"
because I expected that doomsday would happen before we needed it for
recovery purposes.
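
The retention numbers above are easy to check, and the pruning step can be
sketched in a few lines. This is an illustration only: it assumes snapshot
directory names sort chronologically (for example, timestamps), which the
post does not state, and the function and constant names are made up.

```python
SNAPSHOTS_PER_DAY = 3
SNAPSHOTS_KEPT = 252

# 252 snapshots at three per day cover 84 days, i.e. 12 weeks.
DAYS_OF_HISTORY = SNAPSHOTS_KEPT // SNAPSHOTS_PER_DAY


def snapshots_to_prune(names, keep):
    """Return the oldest snapshot names beyond the newest `keep` ones.

    Assumes the names sort chronologically (hypothetical naming scheme).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(names)
    excess = len(ordered) - keep
    # Guard against a negative slice bound when there is nothing to prune.
    return ordered[:excess] if excess > 0 else []


if __name__ == "__main__":
    print(DAYS_OF_HISTORY)
    print(snapshots_to_prune(
        ["2007-07-19T08", "2007-07-19T16", "2007-07-20T00", "2007-07-20T08"],
        keep=3))
```

The same function covers the 7-deep laptop history on office by calling it
with keep=7.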
On Friday 7/20, two drives on "office" started experiencing ECC errors
within about three hours of each other. Of course, they turned out to be
the two halves of a mirror. Bye bye RAID. Thankfully, the doomsday
machine is fine, and we will be able to recover fully.
The limiting factor in recovery would be funny if it wasn't so
irritating. Once half the drives are gone, you may as well upgrade all
of them. Unfortunately, all of our local computer stores are sold out of
750GB SATA drives. They should arrive on Tuesday, and we'll rebuild
Tuesday night.
Anyway, for those of you looking at backup solutions, I recommend this
type of configuration. The doomsday machine was less than $2000 to
build, and it just saved the company. Servers are now cheaper than tape.
Food for thought.
shap