Hi,

On Wed, 30 Jun 2004 18:10:23 -0600 (CST) Ryan Thompson <[EMAIL PROTECTED]> 
wrote:

> Mike Hogsett wrote to SpamAssassin:
> 
> > Recently I installed five machines with SpamAssassin (spamd) and use
> > the DNS round robin method to do dumb load balancing between them.
> >
> > I finally finished altering all my users' .procmailrc files to include
> > the '-d hostname.tld' argument to spamc so that they would use one of
> > these 5 machines.
> >
> > Over this last weekend we experienced a huge influx of spam late at
> > night.  We had nearly 14,000 SMTP connections in a single hour.  Most
> > of these were rejected due to either of 'User unknown' or 'Domain of
> > sender address <> does not resolve|exist'.  But each of the remaining
> > messages all needed to be piped through one of the 5 spamds.  At some
> > point one of these spamds became wedged.  Many spamcs finally gave up
> > attempting to connect to spamd due to this wedged spamd.
> >
> > What do others do to create fault tolerance in their spamassassin
> > installations?
> 
> In your case, I think it's more about efficient tuning than fault
> tolerance. Why exactly did spamd become ``wedged''? What do you mean by
> that? Did you exhaust some resource on the machine (VM, CPU, disk,
> sockets, etc?) For example, I've often seen systems configured with too
> many child processes allowed, so when the load hits, they keep forking
> until they start eating swap, and then, under that kind of load, things
> tend to spiral downward really quickly.

Optimizing your code and configuration buys you nothing if you burn out a
network card, power supply, or disk. Don't mistake efficiency for
availability. I'm not arguing against optimization; it's just that it's
the solution to a different problem.

> If you do have N > 1 server, you can add fault tolerance as you have
> with simple round-robin strategies. You probably won't get *much* better
> than that without spending a lot more money in HA clustering
> configurations, but you can have some significant gains by simply
> remembering that each machine is still critical to production.
> Administer each as if it were your only mail gateway.

Round-robin DNS is a poor man's load balancing. With normal DNS and one
server down, 100% of incoming requests fail, as opposed to 1/N with
RRDNS. Getting back to 100% means either recovering the failed system,
moving the failed system's IP address/traffic, or changing DNS and
waiting for TTLs to expire. The mean time to recover (MTTR) is generally
large except in the case of moving IP addresses, and even then, moving
IP addresses may be a complex or impossible task. Pray the recovery
process is documented before you get that 3am page...
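To put rough numbers on that 1/N figure, here's a back-of-the-envelope
sketch (plain Python; the server counts are hypothetical, and it assumes
clients pick an A record uniformly at random, ignoring retries and DNS
caching):

```python
# Toy model of request loss during an outage under round-robin DNS.
# Assumption: clients choose one of n_servers A records at random,
# so requests landing on a dead server simply fail.

def failed_fraction(n_servers: int, n_down: int = 1) -> float:
    """Fraction of requests that hit a dead server under RRDNS."""
    return n_down / n_servers

# Single server (normal DNS): everything fails while it's down.
print(failed_fraction(1))   # 1.0 -- 100% of requests fail
# Five spamd boxes behind RRDNS, one wedged:
print(failed_fraction(5))   # 0.2 -- 1/N of requests fail
```

The point isn't the arithmetic; it's that RRDNS degrades gracefully
while a single server fails completely.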

A better, lower-cost solution is LVS
(http://www.linuxvirtualserver.org/) or, if you can afford it, a pair of
hardware load balancers. Cisco Local Directors are obsolete but fairly
reliable and will probably handle anything you throw at them unless
you're handling Slashdot-level traffic.

General suggestions for improving system availability:

- Eliminate single points of failure (SPOFs). Ideally, this means having
at least two of everything. Unless you have deep pockets, you probably
can't do this. Fine. Get reliability and cost numbers for your equipment
and figure out where you can most effectively add redundancy. Buying a
second datacenter is probably not an option; buying a second router
might be.

You may not have to spend any money to reduce SPOFs. Example: my last
employer had primary and secondary DNS servers plugged into the same
switch. Solution: configure a port and move a cable. Oh, and make sure
they aren't plugged into the same UPS or PDU.

- Beware of common-cause failure. Translation: diversify! If you get all
your hardware from Dell and one day they issue an advisory about their
PERC3 RAID controllers, all your hardware is at risk from the same
endemic fault. All the equipment in the same cabinet is at risk if one
piece shorts and catches fire. And it's not just hardware, software,
or location. If that incompetent but politically-connected summer intern
has root on all the machines whose names start with 'b', they're at risk
from him making the same configuration mistake on each one.

Common-cause failures are hard to reduce, especially if you enjoy the
economy that comes from standardizing on an OS or hardware vendor. If
you do diversify, you may reduce common-cause failure but increase the
risk of human error, since your operators have to maintain two types of
equipment instead of just one. And again, you probably can't afford a
second live datacenter unless you're an airline or phone company. (This
is a separate issue from building a disaster recovery site or other
business continuity planning, where you may actually have a second
datacenter for emergencies.)

- Look at more than just the computer. This is similar to security
analysis - keeping patched, using strong encryption, and using software
with a good security pedigree are all good practices but none of them
matter if the attacker has physical access to the machine. Consider
HVAC, power, fire protection, weather, etc. How often does your site
lose power and for how long? How long can you run without room cooling
and what's your plan for shedding load (powering down unnecessary
equipment)?

The trick is understanding how to model the risk to your systems so you
can make intelligent decisions about where to spend your availability
dollar. Understanding and controlling your system architecture and
operation are key, as is having good historical data. A little fault
tree analysis (FTA) and/or event tree analysis (ETA) can point out
subtle interactions between systems, often problems that can be resolved
with simple changes in configuration or operation. Obviously you can
spend as much or as little effort on this as you want; at some point you
need to decide what's good enough and what risks are simply not worth
defending against.
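As an illustration of the kind of arithmetic a fault tree leads to,
here's a sketch of series vs. parallel availability (the per-component
numbers are invented for illustration). It shows why the shared-switch
SPOF in the DNS-server example above matters: a redundant pair is only
as available as whatever both halves depend on.

```python
from functools import reduce

def series(*avail):
    """All components must be up: availabilities multiply."""
    return reduce(lambda a, b: a * b, avail)

def parallel(*avail):
    """At least one must be up: 1 minus the product of failure probs
    (assumes failures are independent -- exactly what common-cause
    failure violates)."""
    return 1 - reduce(lambda a, b: a * b, (1 - a for a in avail))

server = 0.99    # hypothetical per-server availability
switch = 0.999   # hypothetical switch availability

# Two redundant servers, but both plugged into the same switch:
pair = parallel(server, server)   # 0.9999
print(series(pair, switch))       # ~0.9989 -- the shared switch dominates
# Move a cable: each server on its own switch, then combine:
print(parallel(series(server, switch), series(server, switch)))  # ~0.99988
```

Even this toy model shows the "configure a port and move a cable" fix
buying about an order of magnitude in unavailability.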

> There are plenty of setups with only *one* spamd server that fail
> infrequently enough to make it impractical to deploy multiple machines.

Failure probability is only half the story[1]; the other half is failure
consequences. Balancing cost, risk, and benefit is a black art.
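The classical definition in the footnote below can be turned into a toy
comparison of whether a second spamd box pays for itself. All of the
figures here are invented purely for illustration:

```python
def annual_risk(p_failure_per_year: float, cost_per_failure: float) -> float:
    """Classical risk: probability of an event times its consequences."""
    return p_failure_per_year * cost_per_failure

# Hypothetical numbers only:
p_single = 0.5        # one box: expect a serious outage every other year
p_pair   = 0.05       # two boxes: much rarer that both are down at once
outage_cost = 10_000  # lost mail, cleanup time, angry users

second_box = 2_000    # annualized cost of the redundant machine
savings = annual_risk(p_single, outage_cost) - annual_risk(p_pair, outage_cost)
print(savings, second_box)   # 4500.0 vs 2000 -- redundancy wins here
```

With different inputs the answer flips, which is exactly the "black
art": the hard part is estimating the probabilities and consequences,
not the multiplication.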

-- Bob

[1] The classical definition of risk is the product of an event's
probability and its consequences. There are other exacerbating or
compensating factors, but the classical definition is suitable for this
case. It'd be a different story if the risk was people's kids being set
on fire.
