Re: Fault Tolerance

Ryan Thompson 1 Jul 2004 00:10:11 -0000

Mike Hogsett wrote to SpamAssassin:

> Recently I installed five machines with SpamAssassin (spamd) and use
> the DNS round robin method to do dumb load balancing between them.
>
> I finally finished altering all my users' .procmailrc files to include
> the '-d hostname.tld' argument to spamc so that they would use one of
> these 5 machines.
>
> Over this last weekend we experienced a huge influx of spam late at
> night.  We had nearly 14,000 SMTP connections in a single hour.  Most
> of these were rejected due to either of 'User unknown' or 'Domain of
> sender address <> does not resolve|exist'.  But each of the remaining
> messages all needed to be piped through one of the 5 spamds.  At some
> point one of these spamds became wedged.  Many spamcs finally gave up
> attempting to connect to spamd due to this wedged spamd.
>
> What do others do to create fault tolerance in their spamassassin
> installations?


In your case, I think it's more about efficient tuning than fault
tolerance. Why exactly did spamd become "wedged"? What do you mean by
that? Did you exhaust some resource on the machine (VM, CPU, disk,
sockets, etc.)? For example, I've often seen systems configured with too
many child processes allowed, so when the load hits, they keep forking
until they start eating swap, and then, under that kind of load, things
tend to spiral downward really quickly.

If you want maximal performance, increase the efficiency of your
underlying operating system by tuning OS parameters to allow an optimal
number of open filehandles, mbufs, etc. Add as much RAM as you can
afford, strip out *all* unnecessary processes, and *then* do performance
testing to determine the maximum connection limit you can sustain on an
individual machine. Configure the hard limit lower than that, with enough
padding to account for growing rulesets and transient load spikes from
anything else that might be running on the machine. You'd probably much
rather have mail queue up than have the server tempfail or stop
responding.
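
As a rough illustration of the sizing arithmetic above, here is a minimal
Python sketch. The function names and all of the numbers (RAM size, memory
reserved for the OS, per-child footprint, 25% headroom) are hypothetical
placeholders, not measurements; substitute figures from your own load
testing. The result would be the kind of value you might hand to spamd's
max-children option (-m).

```python
def max_children(total_ram_mb, reserved_mb, per_child_mb):
    """Largest number of children that fits in RAM without touching swap.

    reserved_mb is memory set aside for the OS and anything else on the box.
    """
    if per_child_mb <= 0:
        raise ValueError("per_child_mb must be positive")
    usable = total_ram_mb - reserved_mb
    return max(0, usable // per_child_mb)


def hard_limit(measured_max, headroom=0.25):
    """Configured limit: the measured sustainable maximum minus some padding.

    headroom is the fraction held back for ruleset growth and load spikes.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return max(1, int(measured_max * (1 - headroom)))


# Hypothetical box: 1024 MB RAM, 256 MB reserved, about 30 MB per child.
ceiling = max_children(1024, 256, 30)   # 768 // 30 = 25
limit = hard_limit(ceiling, 0.25)       # int(18.75) = 18
print(ceiling, limit)
```

In practice the ceiling should come from load testing rather than from
memory arithmetic alone, since CPU or sockets may run out first.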

By optimizing performance, you typically also optimize stability. Once
you've done that, troubleshoot each separate issue individually, and
change your configuration so the issue can't occur again.

If the level of performance you obtain isn't sufficient for your needs,
you can improve it with reactive methods. I.e., monitor each box, and if
it fails (note that defining "failure" is not a trivial task for complex
systems), collect forensics and attempt to restart it, or at least page
an admin who should have the brains to fix it.
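
One possible shape for such a monitor, sketched in Python under a few
assumptions: that your spamd answers the protocol's PING command with a
"SPAMD/<version> 0 PONG" status line, that it listens on the default port
783, and that "failure" means nothing more than "no PONG within the
timeout". The function names and the timeout are illustrative only. A
wedged daemon may still accept connections, which is why the sketch waits
for an actual reply rather than just a successful connect.

```python
import socket


def is_pong(status_line):
    """True if a status line looks like 'SPAMD/<version> 0 PONG'."""
    parts = status_line.strip().split()
    return (
        len(parts) >= 3
        and parts[0].startswith("SPAMD/")
        and parts[1] == "0"
        and parts[2] == "PONG"
    )


def probe(host, port=783, timeout=5.0):
    """Send PING to spamd; return True only if a PONG arrives in time.

    Any connection error or timeout is treated as a failed probe.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING SPAMC/1.2\r\n")
            reply = sock.recv(128)
    except OSError:  # refused, unreachable, or timed out
        return False
    return is_pong(reply.decode("ascii", "replace"))
```

A cron job or supervisor could call probe() for each box and, after a few
consecutive failures, save forensics, attempt a restart, and page someone.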

This kind of general strategy works for N systems. Of course it works
best for N=0, but most of us need more servers than that. :-)

If you do have N > 1 servers, you can add fault tolerance as you have
with simple round-robin strategies. You probably won't get *much* better
than that without spending a lot more money on HA clustering
configurations, but you can have some significant gains by simply
remembering that each machine is still critical to production.
Administer each as if it were your only mail gateway.
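
If plain DNS round robin keeps handing clients a dead box, one cheap
improvement is to have whatever calls spamc skip hosts that fail a health
check. The Python sketch below shows only the selection logic; the name
first_healthy, the host names, and the is_up callback are hypothetical, and
this is not a description of how spamc itself picks a server.

```python
def first_healthy(hosts, start, is_up):
    """Walk the host list round-robin from index `start`.

    Returns the first host for which is_up(host) is true, or None if
    every host fails the check (the caller should then queue the mail).
    """
    n = len(hosts)
    for i in range(n):
        host = hosts[(start + i) % n]
        if is_up(host):
            return host
    return None


# Hypothetical pool; pretend sa2 is the wedged machine.
pool = ["sa1.example.com", "sa2.example.com", "sa3.example.com"]
down = {"sa2.example.com"}
chosen = first_healthy(pool, 1, lambda h: h not in down)
print(chosen)  # sa3.example.com
```

The chosen name would then be passed to spamc with -d, and the starting
index advanced for each message so the load still spreads across the pool.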

There are plenty of setups with only *one* spamd server that fail
infrequently enough to make it impractical to deploy multiple machines.

- Ryan

-- 
  Ryan Thompson <[EMAIL PROTECTED]>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America
