Hi,

On Wed, 30 Jun 2004 18:10:23 -0600 (CST), Ryan Thompson <[EMAIL PROTECTED]> wrote:
> Mike Hogsett wrote to SpamAssassin:
>
> > Recently I installed five machines with SpamAssassin (spamd) and use
> > the DNS round robin method to do dumb load balancing between them.
> >
> > I finally finished altering all my users' .procmailrc files to include
> > the '-d hostname.tld' argument to spamc so that they would use one of
> > these 5 machines.
> >
> > Over this last weekend we experienced a huge influx of spam late at
> > night. We had nearly 14,000 SMTP connections in a single hour. Most of
> > these were rejected due to either 'User unknown' or 'Domain of sender
> > address <> does not resolve|exist'. But each of the remaining messages
> > needed to be piped through one of the 5 spamds. At some point one of
> > these spamds became wedged. Many spamcs finally gave up attempting to
> > connect to spamd because of this wedged spamd.
> >
> > What do others do to create fault tolerance in their spamassassin
> > installations?
>
> In your case, I think it's more about efficient tuning than fault
> tolerance. Why exactly did spamd become ``wedged''? What do you mean by
> that? Did you exhaust some resource on the machine (VM, CPU, disk,
> sockets, etc.)? For example, I've often seen systems configured to allow
> too many child processes, so when the load hits, they keep forking until
> they start eating swap, and then, under that kind of load, things tend
> to spiral downward really quickly.

Optimizing your code and configuration buys you nothing if you burn out a
network card, power supply, or disk. Don't mistake availability for
efficiency. I'm not arguing against optimization; it's just the solution
to a different problem.

> If you do have N > 1 servers, you can add fault tolerance as you have
> with simple round-robin strategies.
> You probably won't get *much* better than that without spending a lot
> more money on HA clustering configurations, but you can see some
> significant gains simply by remembering that each machine is still
> critical to production. Administer each as if it were your only mail
> gateway.

Round-robin DNS is the poor man's load balancing. With ordinary DNS and
one server down, 100% of incoming requests fail, as opposed to 1/N with
RRDNS. Full recovery means either repairing the failed system, moving the
failed system's IP address/traffic, or changing DNS and waiting for TTLs
to expire. The mean time to recover (MTTR) is generally large except in
the case of moving IP addresses, and even then, moving IP addresses may
be a complex or impossible task. Pray the recovery process is documented
before you get that 3am page...

A better, lower-cost solution is LVS (http://www.linuxvirtualserver.org/)
or, if you can afford it, a pair of hardware load balancers. Cisco Local
Directors are obsolete but fairly reliable and will probably handle
anything you throw at them unless you're handling Slashdot-level traffic.

General suggestions for improving system availability:

- Eliminate single points of failure (SPOFs). Ideally, this means having
  at least two of everything. Unless you have deep pockets, you probably
  can't do this. Fine. Get reliability and cost numbers for your
  equipment and figure out where you can most effectively add redundancy.
  Buying a second datacenter is probably not an option; buying a second
  router might be. You may not have to spend any money to reduce SPOFs.
  Example: my last employer had primary and secondary DNS servers plugged
  into the same switch. Solution: configure a port and move a cable. Oh,
  and make sure they aren't plugged into the same UPS or PDU.

- Beware of common-cause failure. Translation: diversify!
  If you get all your hardware from Dell and one day they issue an
  advisory about their PERC3 RAID controllers, all your hardware is at
  risk from the same endemic fault. All the equipment in the same cabinet
  is at risk if one piece shorts out and catches fire. And it's not just
  hardware, software, or location: if that incompetent but
  politically-connected summer intern has root on all the machines whose
  names start with 'b', they're all at risk from him making the same
  configuration mistake on each one. Common-cause failures are hard to
  reduce, especially if you enjoy the economy that comes from
  standardizing on one OS or hardware vendor. If you do diversify, you
  may reduce common-cause failure but increase the risk of human error,
  since your operators have to maintain two types of equipment instead of
  just one. And again, you probably can't afford a second live datacenter
  unless you're an airline or a phone company. (This is a separate issue
  from building a disaster recovery site or other business continuity
  planning, where you may actually have a second datacenter for
  emergencies.)

- Look at more than just the computer. This is similar to security
  analysis: keeping patched, using strong encryption, and using software
  with a good security pedigree are all good practices, but none of them
  matter if the attacker has physical access to the machine. Consider
  HVAC, power, fire protection, weather, etc. How often does your site
  lose power, and for how long? How long can you run without room
  cooling, and what's your plan for shedding load (powering down
  unnecessary equipment)?

The trick is understanding how to model the risk to your systems so you
can make intelligent decisions about where to spend your availability
dollar. Understanding and controlling your system architecture and
operation are key, as is having good historical data.
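Two of the points above are easy to put numbers on: the 1/N failure
fraction under round-robin DNS, and the availability gain from removing a
SPOF by running redundant components in parallel. A minimal sketch, with
made-up illustrative figures and the (optimistic) assumption that
failures are independent, i.e. no common-cause failure:

```python
def rrdns_failed_fraction(n_servers: int, n_down: int) -> float:
    """Fraction of requests that hit a dead server under round-robin DNS,
    assuming clients pick uniformly among all advertised addresses."""
    if n_servers <= 0 or not 0 <= n_down <= n_servers:
        raise ValueError("need n_servers > 0 and 0 <= n_down <= n_servers")
    return n_down / n_servers

def parallel_availability(availabilities) -> float:
    """Availability of redundant components in parallel: the service is
    down only if *every* component is down (independence assumed)."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# One spamd down out of five: 1/5 of requests fail until DNS is fixed,
# versus 100% with a single server.
print(rrdns_failed_fraction(5, 1))  # -> 0.2
print(rrdns_failed_fraction(1, 1))  # -> 1.0

# Two independent 99%-available servers vs. one: roughly 0.9999 vs. 0.99.
print(parallel_availability([0.99, 0.99]))
```

Note that the parallel formula is exactly what common-cause failure
breaks: if both servers share a switch, UPS, or RAID firmware bug, the
downtimes are correlated and the real number is worse than the formula
suggests.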
A little fault tree analysis (FTA) and/or event tree analysis (ETA) can
point out subtle interactions between systems, often problems that can be
resolved with simple changes in configuration or operation. Obviously you
can spend as much or as little effort on this as you want; at some point
you need to decide what's good enough and which risks are simply not
worth defending against.

> There are plenty of setups with only *one* spamd server that fail
> infrequently enough to make it impractical to deploy multiple machines.

Failure probability is only half the story[1]; the other half is failure
consequences. Balancing cost, risk, and benefit is a black art.

-- Bob

[1] The classical definition of risk is the product of an event's
probability and its consequences. There are other exacerbating or
compensating factors, but the classical definition is suitable for this
case. It'd be a different story if the risk were people's kids being set
on fire.
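As a postscript, the footnote's classical risk definition is a one-liner,
and it is enough to compare architectures. The probabilities and dollar
costs below are made-up assumptions, not figures from this thread:

```python
def risk(probability: float, consequence: float) -> float:
    """Classical risk: probability of an event times its consequence
    (here, expected annual loss in dollars)."""
    return probability * consequence

# Hypothetical comparison: a total outage of a single mail gateway
# (2%/year, $50,000 of damage) versus losing one of five RRDNS servers
# (10%/year, $2,000 of degraded service).
print(risk(0.02, 50_000))  # expected annual loss, roughly 1000
print(risk(0.10, 2_000))   # roughly 200
```

This is why low-probability events can still dominate the budget: the
rarer failure mode carries the larger expected loss here, so that is
where the availability dollar should go first.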
