Mike Hogsett wrote to SpamAssassin:

> Recently I installed five machines with SpamAssassin (spamd) and use
> the DNS round robin method to do dumb load balancing between them.
>
> I finally finished altering all my users' .procmailrc files to include
> the '-d hostname.tld' argument to spamc so that they would use one of
> these 5 machines.
>
> Over this last weekend we experienced a huge influx of spam late at
> night. We had nearly 14,000 SMTP connections in a single hour. Most
> of these were rejected due to either of 'User unknown' or 'Domain of
> sender address <> does not resolve|exist'. But each of the remaining
> messages all needed to be piped through one of the 5 spamds. At some
> point one of these spamds became wedged. Many spamcs finally gave up
> attempting to connect to spamd due to this wedged spamd.
>
> What do others do to create fault tolerance in their spamassassin
> installations?

In your case, I think it's more about efficient tuning than fault tolerance. Why exactly did spamd become "wedged"? What do you mean by that? Did you exhaust some resource on the machine (VM, CPU, disk, sockets, etc.)? For example, I've often seen systems configured with too many child processes allowed, so when the load hits, they keep forking until they start eating swap, and then, under that kind of load, things tend to spiral downward really quickly.

If you want maximal performance, increase the efficiency of your underlying operating system by tuning OS parameters to allow an optimal number of open filehandles, mbufs, etc. Add as much RAM as you can afford, strip out *all* unnecessary processes, and *then* do performance testing to determine the maximum connection limit you can sustain on an individual machine. Configure the hard limit lower than that, with enough padding to account for growing rulesets and transient load spikes from anything else that might be running on the machine. You'd probably much rather have mail queue up than have the server tempfail or stop responding.

By optimizing performance, you typically also optimize stability. Once you've done that, troubleshoot each separate issue individually, and change your configuration so the issue can't occur again.

If the level of performance you obtain isn't sufficient for your needs, you can improve it with reactive methods. I.e., monitor each box, and if it fails (note that defining "failure" is not a trivial task for complex systems), collect forensics and attempt to restart it, or at least page an admin who should have the brains to fix it.

This kind of general strategy works for N systems. Of course it works best for N=0, but most of us need more servers than that. :-) If you do have N > 1 servers, you can add fault tolerance, as you have, with simple round-robin strategies.
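
To make the "monitor each box" idea a little more concrete, here is a minimal sketch of a probe you could run from cron. It assumes spamd listens on its default port (783) and answers a PING request with a "SPAMD/<version> 0 PONG" line, which is my reading of the spamd protocol document; check that against your version. The function names are made up for illustration. The point of sending PING rather than just opening a socket is that a wedged spamd may still accept connections and then never answer, so the probe treats a timeout as a failure too.

```python
import socket

SPAMD_PORT = 783  # spamd's default port; adjust if yours differs


def reply_is_pong(reply: bytes) -> bool:
    """True if the first line of `reply` looks like 'SPAMD/<ver> 0 PONG'."""
    first_line = reply.split(b"\r\n", 1)[0].split()
    return (
        len(first_line) == 3
        and first_line[0].startswith(b"SPAMD/")
        and first_line[1] == b"0"
        and first_line[2] == b"PONG"
    )


def spamd_is_responsive(host, port=SPAMD_PORT, timeout=5.0,
                        connect=socket.create_connection):
    """Send PING and wait up to `timeout` seconds for PONG.

    A refused connection, a timeout, or an unexpected reply all count as
    a failure, so a spamd that accepts connections but never answers is
    reported as down. `connect` is injectable only to make testing easy.
    """
    try:
        with connect((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING SPAMC/1.2\r\n")
            reply = sock.recv(128)
    except OSError:  # covers refused connections, DNS errors and timeouts
        return False
    return reply_is_pong(reply)


def unresponsive_hosts(hosts, probe=spamd_is_responsive):
    """Return the subset of `hosts` that failed the probe, in order."""
    return [host for host in hosts if not probe(host)]
```

You would call unresponsive_hosts() with the real names of the machines behind your round robin, and hand anything it returns to whatever restarts spamd or pages the admin. This only answers "does it respond at all"; it says nothing about why a box wedged, so you still want to collect forensics before restarting.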

You probably won't get *much* better than round robin without spending a lot more money on HA clustering configurations, but you can see some significant gains by simply remembering that each machine is still critical to production. Administer each as if it were your only mail gateway. There are plenty of setups with only *one* spamd server that fail infrequently enough to make it impractical to deploy multiple machines.

- Ryan

-- 
Ryan Thompson <[EMAIL PROTECTED]>

SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4

Tel: 306-664-3600  Fax: 306-244-7037  Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW)  North America
