Mike Johnson wrote:
> Um, wow. You have to do all that [DRBD] to fail-over Cyrus? Ick. This is why maildir is so nice. Between IMAP/POP and SMTP, it's actually why maildir was created. Keep your spools on an NFS system and you can have multiple IMAP servers with simply an IP level load balancer and you're set. One of the IMAP servers dies? No big deal. The same can be said/done with SMTP. Both can easily scale to multiple systems. This relies on a reliable NFS system, but those aren't too expensive.
Well, keep in mind that this buys you more than just failover of Cyrus. It also provides that "reliable NFS" system you describe, in that the data is safely mirrored between two fully redundant computers (which may also provide their own redundancy against hardware failure). The maildir (usually read: qmail) setup you describe above only works when you have three or more servers: at least two front-end mail servers plus the NFS server behind them. That's usually not a problem, but it just pushes the "redundant single data store" problem farther back in the mail system. Something still has to provide a single, redundant copy of the data. It could very well be DRBD serving up NFS from the 3rd (and now 4th) machine in your picture. :) Although at that point, unless load is a concern at the qmail level, you might as well collapse those four machines into a simple pair.
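The reason maildir plays well with a shared NFS spool is its delivery convention: each message is its own file, written under tmp/ and then moved into new/ in one step, so there is no shared mailbox file to lock. A minimal Python sketch of that convention, assuming a local directory stands in for the NFS spool; the function name `deliver_to_maildir` is hypothetical and this is not qmail's own code. Whether rename() is atomic over an NFS mount depends on the NFS server, so treat that as an assumption as well.

```python
import itertools
import os
import socket
import tempfile
import time

# Per-process delivery counter, so two deliveries within one clock tick
# (or on a clock with coarse resolution) still get distinct file names.
_delivery_counter = itertools.count()


def deliver_to_maildir(maildir: str, message: bytes) -> str:
    """Deliver one message using the Maildir tmp/ -> new/ convention.

    The message is written and fsync'd under tmp/, then renamed into new/.
    Readers only look in new/ and cur/, so they never see a half-written
    file, and no mailbox-wide lock is taken.
    """
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Unique name built from time, PID, a counter and the hostname. The
    # hostname is what keeps several servers sharing one spool from colliding.
    unique = "{}.P{}Q{}.{}".format(
        time.time_ns(), os.getpid(), next(_delivery_counter), socket.gethostname()
    )
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)

    with open(tmp_path, "xb") as f:  # "x": refuse to overwrite an existing file
        f.write(message)
        f.flush()
        os.fsync(f.fileno())

    os.rename(tmp_path, new_path)  # the "commit" step: the message appears whole
    return new_path


if __name__ == "__main__":
    spool = tempfile.mkdtemp()  # stands in for the shared NFS spool
    print(deliver_to_maildir(spool, b"Subject: test\n\nhello\n"))
```

Because every delivery touches only its own uniquely named file, any number of SMTP or IMAP front ends can work on the same spool behind a simple IP load balancer, which is exactly the property the quoted setup depends on.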
> On DRBD, what happens if the gigabit link between the systems fails? Does it scrag your filesystem?
Nope, though the file systems will most likely go out of sync, depending on the circumstances. If you have additional heartbeat paths for failover monitoring (a null-modem serial cable between the boxes is highly recommended, as is monitoring over the front-end Ethernet interfaces), then the secondary will realize that only the gig-e link has failed. It will receive no further updates of the file system until you repair that link. Once the link is repaired, a "fast" checksum-based resync restores sync between the two boxes, so you don't have to copy the entire block device over again.
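The saving in that resync comes from remembering which blocks changed while the link was down, so only those need to cross the wire afterwards. A toy sketch under that assumption: the class and method names are hypothetical, and it models DRBD's out-of-sync bitmap with a Python set rather than reproducing DRBD's actual resync or checksum logic.

```python
class MirroredDevice:
    """Toy two-node mirror that tracks out-of-sync blocks while disconnected.

    Hypothetical model, not DRBD's implementation: DRBD keeps a bitmap of
    out-of-sync regions; a Python set of block numbers plays that role here.
    """

    def __init__(self, num_blocks: int) -> None:
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks
        self.link_up = True
        self.out_of_sync = set()  # block numbers written while the link was down

    def disconnect(self) -> None:
        """Simulate the gig-e replication link failing."""
        self.link_up = False

    def write(self, block: int, data: bytes) -> None:
        self.primary[block] = data
        if self.link_up:
            self.secondary[block] = data  # normal case: mirror every write
        else:
            self.out_of_sync.add(block)  # remember it for the later resync

    def reconnect(self) -> int:
        """Restore the link and copy only the out-of-sync blocks.

        Returns the number of blocks copied to the secondary.
        """
        self.link_up = True
        for block in sorted(self.out_of_sync):
            self.secondary[block] = self.primary[block]
        copied = len(self.out_of_sync)
        self.out_of_sync.clear()
        return copied


if __name__ == "__main__":
    dev = MirroredDevice(num_blocks=100_000)
    dev.write(0, b"superblock")
    dev.disconnect()
    for n in (10, 10, 42):  # block 10 is written twice but copied once
        dev.write(n, b"mail")
    print(dev.reconnect(), "of", len(dev.primary), "blocks resynced")
```

The cost of the resync scales with how much was written during the outage, not with the size of the device, which is why a brief gig-e outage is cheap to recover from.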
If the gig-e link fails and it is your *only* way of knowing that the other system is up, the secondary node will shoot the primary in the head, mount its own copy of the block device, and carry on with life. Now, without STONITH (i.e., a way to remotely power off the other machine), you're possibly in for some trouble: you'd end up with a split-brain scenario, with both nodes mounting and writing to their own copies, but that only happens if you have a seriously poorly configured HA setup. :)
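Those failover cases boil down to a small decision rule. A hypothetical sketch, assuming the secondary knows which heartbeat paths (for example the serial cable, the front-end Ethernet, and the gig-e link) still hear the peer; the path names, the `Action` values, and the `decide` function are illustrative only, not Heartbeat's configuration or algorithm.

```python
from enum import Enum
from typing import Mapping


class Action(Enum):
    STAY_SECONDARY = "peer is alive; keep running as secondary"
    FENCE_THEN_TAKEOVER = "STONITH the peer, then mount the local copy"
    TAKEOVER_UNFENCED = "take over without fencing: split-brain risk"


def decide(heartbeats: Mapping[str, bool], have_stonith: bool) -> Action:
    """Choose the secondary's next step.

    heartbeats maps each monitoring path (hypothetical names such as
    "serial", "eth0", "gige") to whether the peer is still heard on it.
    """
    if any(heartbeats.values()):
        # Some path still hears the peer, so it is up and only a link
        # (perhaps the replication link) is down. Taking over now would
        # leave two primaries.
        return Action.STAY_SECONDARY
    # Every path is silent. The peer may be dead, or merely cut off.
    if have_stonith:
        # Power it off first so it cannot keep writing to its copy.
        return Action.FENCE_THEN_TAKEOVER
    return Action.TAKEOVER_UNFENCED


if __name__ == "__main__":
    # Only the gig-e link died; serial and front-end Ethernet still work.
    print(decide({"serial": True, "eth0": True, "gige": False}, have_stonith=True))
    # The gig-e link was the only path, and it is gone, with no STONITH.
    print(decide({"gige": False}, have_stonith=False))
```

The extra heartbeat paths are what move a gig-e failure into the first branch; with a single path, the only safe takeover is the fenced one.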
So in short, it's not quite as bad as you've made it out to be, Mike. That said, I'll gladly concede that a mail system in general is not the most convenient thing to make redundant with just two boxen.
--
TriLUG mailing list : http://www.trilug.org/mailman/listinfo/trilug
TriLUG Organizational FAQ : http://trilug.org/faq/
TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
TriLUG PGP Keyring : http://trilug.org/~chrish/trilug.asc
