Bob,

Thank you for your extensive response. It gives me much to think about.
- Michael Hogsett

> Hi,
>
> On Wed, 30 Jun 2004 18:10:23 -0600 (CST), Ryan Thompson
> <[EMAIL PROTECTED]> wrote:
>
> > Mike Hogsett wrote to SpamAssassin:
> >
> > > Recently I installed five machines with SpamAssassin (spamd) and
> > > use the DNS round-robin method to do dumb load balancing between
> > > them.
> > >
> > > I finally finished altering all my users' .procmailrc files to
> > > include the '-d hostname.tld' argument to spamc so that they would
> > > use one of these 5 machines.
> > >
> > > Over this last weekend we experienced a huge influx of spam late
> > > at night. We had nearly 14,000 SMTP connections in a single hour.
> > > Most of these were rejected due to either 'User unknown' or
> > > 'Domain of sender address <> does not resolve|exist'. But each of
> > > the remaining messages needed to be piped through one of the 5
> > > spamds. At some point one of these spamds became wedged. Many
> > > spamcs finally gave up attempting to connect to spamd due to this
> > > wedged spamd.
> > >
> > > What do others do to create fault tolerance in their SpamAssassin
> > > installations?
> >
> > In your case, I think it's more about efficient tuning than fault
> > tolerance. Why exactly did spamd become ``wedged''? What do you mean
> > by that? Did you exhaust some resource on the machine (VM, CPU,
> > disk, sockets, etc.)? For example, I've often seen systems
> > configured with too many child processes allowed, so when the load
> > hits, they keep forking until they start eating swap, and then,
> > under that kind of load, things tend to spiral downward really
> > quickly.
>
> Optimizing your code and configuration buys you nothing if you burn a
> network card, power supply, or disk. Don't mistake availability for
> efficiency. I'm not arguing against optimization; it's just that it's
> the solution to a different problem.
>
> > If you do have N > 1 servers, you can add fault tolerance as you
> > have with simple round-robin strategies.
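The failure mode Mike describes above is a client hanging on one wedged server that DNS happened to hand it. One cheap mitigation is client-side failover: try each server in a pool with a short connect timeout and fall through to the next. A minimal sketch (host names are placeholders, and this illustrates the general technique rather than spamc's actual retry behavior):

```python
import socket

# Hypothetical spamd pool; these host names are illustrative only.
SPAMD_HOSTS = ["spamd1.example.org", "spamd2.example.org", "spamd3.example.org"]
SPAMD_PORT = 783          # spamd's default TCP port
CONNECT_TIMEOUT = 5.0     # give up on a wedged host quickly

def connect_with_failover(hosts, port, timeout=CONNECT_TIMEOUT):
    """Try each host in turn and return an open socket to the first
    one that accepts a connection, instead of hanging on a wedged
    server. Raises ConnectionError if every host is unreachable."""
    last_error = None
    for host in hosts:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:   # refused, timed out, unresolvable, ...
            last_error = err
    raise ConnectionError(f"no spamd reachable: {last_error}")
```

The timeout is the important knob: without it, one unresponsive server still stalls every client that happens to pick it, which is exactly the RRDNS failure mode discussed below.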
> > You probably won't get *much* better than that without spending a
> > lot more money on HA clustering configurations, but you can see
> > some significant gains by simply remembering that each machine is
> > still critical to production. Administer each as if it were your
> > only mail gateway.
>
> Round-robin DNS is the poor man's load balancing. With normal DNS and
> one server down, 100% of incoming requests fail, as opposed to 1/N
> with RRDNS. 100% recovery means either recovering the failed system,
> moving the failed system's IP address/traffic, or changing DNS and
> waiting for TTLs to expire. The mean time to recover (MTTR) is
> generally large except in the case of moving IP addresses, and even
> then, moving IP addresses may be a complex or impossible task. Pray
> the recovery process is documented before you get that 3am page...
>
> A better, lower-cost solution is LVS
> (http://www.linuxvirtualserver.org/) or, if you can afford it, a pair
> of hardware load balancers. Cisco Local Directors are obsolete but
> fairly reliable and will probably handle anything you throw at them
> unless you're handling Slashdot-level traffic.
>
> General suggestions for improving system availability:
>
> - Eliminate single points of failure (SPOFs). Ideally, this means
>   having at least two of everything. Unless you have deep pockets,
>   you probably can't do this. Fine. Get reliability and cost numbers
>   for your equipment and figure out where you can most effectively
>   add redundancy. Buying a second datacenter is probably not an
>   option; buying a second router might be.
>
>   You may not have to spend any money to reduce SPOFs. Example: my
>   last employer had primary and secondary DNS servers plugged into
>   the same switch. Solution: configure a port and move a cable. Oh,
>   and make sure they aren't plugged into the same UPS or PDU.
>
> - Beware of common-cause failure. Translation: diversify!
>   If you get all your hardware from Dell and one day they issue an
>   advisory about their PERC3 RAID controllers, all your hardware is
>   at risk from the same endemic fault. All the equipment in the same
>   cabinet is at risk if one piece shorts and catches fire. And it's
>   not just hardware, software, or location. If that incompetent but
>   politically connected summer intern has root on all the machines
>   whose names start with 'b', they're all at risk from him making the
>   same configuration mistake on each one.
>
>   Common-cause failures are hard to reduce, especially if you enjoy
>   the economy that comes from standardizing on an OS or hardware
>   vendor. If you do diversify, you may reduce common-cause failure
>   but increase the risk due to human error, since your operators have
>   to maintain two types of equipment instead of just one. And again,
>   you probably can't afford a second live datacenter unless you're an
>   airline or phone company (this is a separate issue from building a
>   disaster recovery site or other business continuity planning, where
>   you may actually have a second datacenter for emergencies).
>
> - Look at more than just the computer. This is similar to security
>   analysis: keeping patched, using strong encryption, and using
>   software with a good security pedigree are all good practices, but
>   none of them matter if the attacker has physical access to the
>   machine. Consider HVAC, power, fire protection, weather, etc. How
>   often does your site lose power, and for how long? How long can you
>   run without room cooling, and what's your plan for shedding load
>   (powering down unnecessary equipment)?
>
> The trick is understanding how to model the risk to your systems so
> you can make intelligent decisions about where to spend your
> availability dollar. Understanding and controlling your system
> architecture and operation are key, as is having good historical
> data.
> A little fault tree analysis (FTA) and/or event tree analysis (ETA)
> can point out subtle interactions between systems, often problems
> that can be resolved with simple changes in configuration or
> operation. Obviously you can spend as much or as little effort on
> this as you want; at some point you need to decide what's good enough
> and what risks are simply not worth defending against.
>
> > There are plenty of setups with only *one* spamd server that fail
> > infrequently enough to make it impractical to deploy multiple
> > machines.
>
> Failure probability is only half the story[1]; the other half is
> failure consequences. Balancing cost, risk, and benefit is a black
> art.
>
> -- Bob
>
> [1] The classical definition of risk is the product of an event's
> probability and its consequences. There are other exacerbating or
> compensating factors, but the classical definition is suitable for
> this case. It'd be a different story if the risk was people's kids
> being set on fire.
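Bob's footnote (risk = probability × consequences) is easy to turn into a back-of-the-envelope ranking when deciding where the availability dollar goes. A toy sketch, with all component names and numbers invented purely for illustration:

```python
# Toy risk ranking using the classical definition from Bob's footnote:
# risk = probability x consequence. All figures below are made up.
components = {
    # name: (annual failure probability, consequence in outage-hours)
    "single mail gateway": (0.20, 48.0),
    "spamd pool (1 of 5)": (0.50, 0.5),   # fails often, but RRDNS limits damage
    "core switch":         (0.05, 24.0),
    "UPS":                 (0.10, 4.0),
}

def risk(prob, consequence):
    """Classical risk: expected loss = probability times consequence."""
    return prob * consequence

# Rank by expected outage-hours per year; spend redundancy money at the top.
ranking = sorted(components.items(),
                 key=lambda item: risk(*item[1]),
                 reverse=True)

for name, (p, c) in ranking:
    print(f"{name:20s} expected outage-hours/yr: {risk(p, c):.2f}")
```

With these invented numbers, the single mail gateway dominates (0.20 × 48 = 9.6 expected outage-hours/year) even though the spamd pool fails far more often, which is exactly Bob's point: high-probability/low-consequence failures can matter less than rare, expensive ones.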
