Bob,

Thank you for your extensive response. It gives me much to think about.
- Michael Hogsett

> Hi,
>
> On Wed, 30 Jun 2004 18:10:23 -0600 (CST), Ryan Thompson
> <[EMAIL PROTECTED]> wrote:
>
> > Mike Hogsett wrote to SpamAssassin:
> >
> > > Recently I installed five machines with SpamAssassin (spamd) and
> > > use the DNS round-robin method to do dumb load balancing between
> > > them.
> > >
> > > I finally finished altering all my users' .procmailrc files to
> > > include the '-d hostname.tld' argument to spamc so that they would
> > > use one of these 5 machines.
> > >
> > > Over this last weekend we experienced a huge influx of spam late
> > > at night. We had nearly 14,000 SMTP connections in a single hour.
> > > Most of these were rejected due to either 'User unknown' or
> > > 'Domain of sender address <> does not resolve|exist'. But each of
> > > the remaining messages needed to be piped through one of the 5
> > > spamds. At some point one of these spamds became wedged. Many
> > > spamcs finally gave up attempting to connect to spamd due to this
> > > wedged spamd.
> > >
> > > What do others do to create fault tolerance in their SpamAssassin
> > > installations?
> >
> > In your case, I think it's more about efficient tuning than fault
> > tolerance. Why exactly did spamd become ``wedged''? What do you mean
> > by that? Did you exhaust some resource on the machine (VM, CPU,
> > disk, sockets, etc.)? For example, I've often seen systems
> > configured with too many child processes allowed, so when the load
> > hits, they keep forking until they start eating swap, and then,
> > under that kind of load, things tend to spiral downward really
> > quickly.
>
> Optimizing your code and configuration buys you nothing if you burn a
> network card, power supply, or disk. Don't mistake availability for
> efficiency. I'm not arguing against optimization; it's just that it's
> the solution to a different problem.
>
> > If you do have N > 1 servers, you can add fault tolerance as you
> > have with simple round-robin strategies.
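The failure mode Mike describes above is a client hanging on one wedged server that DNS happened to hand it. One cheap mitigation is client-side failover: try each server in a pool with a short connect timeout and fall through to the next. A minimal sketch (host names are placeholders, and this illustrates the general technique rather than spamc's actual retry behavior):

```python
import socket

# Hypothetical spamd pool; these host names are illustrative only.
SPAMD_HOSTS = ["spamd1.example.org", "spamd2.example.org", "spamd3.example.org"]
SPAMD_PORT = 783          # spamd's default TCP port
CONNECT_TIMEOUT = 5.0     # give up on a wedged host quickly

def connect_with_failover(hosts, port, timeout=CONNECT_TIMEOUT):
    """Try each host in turn and return an open socket to the first
    one that accepts a connection, instead of hanging on a wedged
    server. Raises ConnectionError if every host is unreachable."""
    last_error = None
    for host in hosts:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:   # refused, timed out, unresolvable, ...
            last_error = err
    raise ConnectionError(f"no spamd reachable: {last_error}")
```

The timeout is the important knob: without it, one unresponsive server still stalls every client that happens to pick it, which is exactly the RRDNS failure mode discussed below.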
> > You probably won't get *much* better than that without spending a
> > lot more money on HA clustering configurations, but you can see
> > some significant gains by simply remembering that each machine is
> > still critical to production. Administer each as if it were your
> > only mail gateway.
>
> Round-robin DNS is the poor man's load balancing. With normal DNS and
> one server down, 100% of incoming requests fail, as opposed to 1/N
> with RRDNS. 100% recovery means either recovering the failed system,
> moving the failed system's IP address/traffic, or changing DNS and
> waiting for TTLs to expire. The mean time to recover (MTTR) is
> generally large except in the case of moving IP addresses, and even
> then, moving IP addresses may be a complex or impossible task. Pray
> the recovery process is documented before you get that 3am page...
>
> A better, lower-cost solution is LVS
> (http://www.linuxvirtualserver.org/) or, if you can afford it, a pair
> of hardware load balancers. Cisco Local Directors are obsolete but
> fairly reliable and will probably handle anything you throw at them
> unless you're handling Slashdot-level traffic.
>
> General suggestions for improving system availability:
>
> - Eliminate single points of failure (SPOFs). Ideally, this means
>   having at least two of everything. Unless you have deep pockets,
>   you probably can't do this. Fine. Get reliability and cost numbers
>   for your equipment and figure out where you can most effectively
>   add redundancy. Buying a second datacenter is probably not an
>   option; buying a second router might be.
>
>   You may not have to spend any money to reduce SPOFs. Example: my
>   last employer had primary and secondary DNS servers plugged into
>   the same switch. Solution: configure a port and move a cable. Oh,
>   and make sure they aren't plugged into the same UPS or PDU.
>
> - Beware of common-cause failure. Translation: diversify!
>   If you get all your hardware from Dell and one day they issue an
>   advisory about their PERC3 RAID controllers, all your hardware is
>   at risk from the same endemic fault. All the equipment in the same
>   cabinet is at risk if one piece shorts and catches fire. And it's
>   not just hardware, software, or location. If that incompetent but
>   politically connected summer intern has root on all the machines
>   whose names start with 'b', they're all at risk from him making the
>   same configuration mistake on each one.
>
>   Common-cause failures are hard to reduce, especially if you enjoy
>   the economy that comes from standardizing on an OS or hardware
>   vendor. If you do diversify, you may reduce common-cause failure
>   but increase the risk due to human error, since your operators have
>   to maintain two types of equipment instead of just one. And again,
>   you probably can't afford a second live datacenter unless you're an
>   airline or phone company (this is a separate issue from building a
>   disaster recovery site or other business continuity planning, where
>   you may actually have a second datacenter for emergencies).
>
> - Look at more than just the computer. This is similar to security
>   analysis: keeping patched, using strong encryption, and using
>   software with a good security pedigree are all good practices, but
>   none of them matter if the attacker has physical access to the
>   machine. Consider HVAC, power, fire protection, weather, etc. How
>   often does your site lose power, and for how long? How long can you
>   run without room cooling, and what's your plan for shedding load
>   (powering down unnecessary equipment)?
>
> The trick is understanding how to model the risk to your systems so
> you can make intelligent decisions about where to spend your
> availability dollar. Understanding and controlling your system
> architecture and operation are key, as is having good historical
> data.
> A little fault tree analysis (FTA) and/or event tree analysis (ETA)
> can point out subtle interactions between systems, often problems
> that can be resolved with simple changes in configuration or
> operation. Obviously you can spend as much or as little effort on
> this as you want; at some point you need to decide what's good enough
> and what risks are simply not worth defending against.
>
> > There are plenty of setups with only *one* spamd server that fail
> > infrequently enough to make it impractical to deploy multiple
> > machines.
>
> Failure probability is only half the story[1]; the other half is
> failure consequences. Balancing cost, risk, and benefit is a black
> art.
>
> -- Bob
>
> [1] The classical definition of risk is the product of an event's
> probability and its consequences. There are other exacerbating or
> compensating factors, but the classical definition is suitable for
> this case. It'd be a different story if the risk was people's kids
> being set on fire.
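Bob's footnote (risk = probability × consequences) is easy to turn into a back-of-the-envelope ranking when deciding where the availability dollar goes. A toy sketch, with all component names and numbers invented purely for illustration:

```python
# Toy risk ranking using the classical definition from Bob's footnote:
# risk = probability x consequence. All figures below are made up.
components = {
    # name: (annual failure probability, consequence in outage-hours)
    "single mail gateway": (0.20, 48.0),
    "spamd pool (1 of 5)": (0.50, 0.5),   # fails often, but RRDNS limits damage
    "core switch":         (0.05, 24.0),
    "UPS":                 (0.10, 4.0),
}

def risk(prob, consequence):
    """Classical risk: expected loss = probability times consequence."""
    return prob * consequence

# Rank by expected outage-hours per year; spend redundancy money at the top.
ranking = sorted(components.items(),
                 key=lambda item: risk(*item[1]),
                 reverse=True)

for name, (p, c) in ranking:
    print(f"{name:20s} expected outage-hours/yr: {risk(p, c):.2f}")
```

With these invented numbers, the single mail gateway dominates (0.20 × 48 = 9.6 expected outage-hours/year) even though the spamd pool fails far more often, which is exactly Bob's point: high-probability/low-consequence failures can matter less than rare, expensive ones.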
