Inserting my 2 cents here, since that is all it is worth.
To back up what Matt said, let me relate a similar problem that hit a major IT security products vendor about a year and a half ago.

At about 6:15 AM PT on a weekday, in the middle of a normal busy week, their content filtering servers began to become unresponsive. At first it was intermittent and hard to pinpoint, but within about 45 minutes they stopped responding completely. Their appliances then did what they were designed to do under the default configuration: fail safe, that is, block all access if the content filtering server does not respond. All one had to do was log onto the appliance and change the fail-safe setting from block to allow.

But this is where the fun (not) began. Hundreds or more libraries, both public and private, as well as schools, use those appliances and that content filtering service. Guess what? They are bound by law to have content filtering in place, meaning they could not turn the fail-safe off. Companies, schools, and libraries started screaming bloody murder and demanded a resolution an hour ago. The content filtering service was finally restored at about 2:30 PM, if I recall correctly.

So, what happened? This is a big company, and it should have things in place to prevent this, right? They did, as much as anyone would expect them to. They had 4 servers. The servers were fine and still running. There were no software changes, and in fact their tests showed the servers were still responding. They were housed at a location with multiple Internet connections, and all tests showed those connections were up and working. Power was flowing, and all UPSs, as well as the generator, were fine. Finally, after about 2 hours, the problem was found: my understanding is that a single module in an enterprise router had failed, but in a way that was hard to find. Once it was found, the hardware vendor sent a replacement part by courier.
My understanding is that it cost them well over 10 grand to eliminate that one single point of failure, and that was just for the hardware. It just goes to prove once again that in IT, 80% of the result comes from 20% of the cost. The remaining 20% of the result is what costs the other 80%.

John T

From: Message Sniffer Community [mailto:[EMAIL PROTECTED] On Behalf Of Matt
Sent: Friday, May 18, 2007 9:44 PM
To: Message Sniffer Community
Subject: [sniffer] Re: Appriver issue

I have something that I would also like to clear up. When I indicated that AppRiver had removed its contact page, it likely just wasn't operating at the time that I was attempting to access it. Considering their issues, it would not be a surprise to see other problems like this as a side effect, but it seemed suspicious since their home page was working and their contact page was not. I did note that it was working by the time it was pointed out that it was up.

In no way did I ever believe that Pete or Sniffer had any direct involvement in the system that created these problems, and in no way should this reflect badly on Pete or Sniffer as far as I am concerned.

I was slightly miffed after getting off the phone with them, where their reaction quite clearly indicated that they were aware of the issue. I suggested that they take their servers offline due to the issues that were being caused, but I was probably barking up the wrong tree. The servers weren't taken offline for another hour or so, or maybe that is when the delivery servers caught up with the queued E-mail destined for my client. I'm not sure why they didn't act on this sooner. When you have a loop, it is important to stop it, and their multi-homing made it difficult for others to block. One user received about 500 copies of the same message (and also called them), and there were other examples that we saw which were much more limited. I do hope that they didn't choose to introduce new software at 11 a.m. ET on the busiest E-mail day of the week, and that this was only when the problems surfaced...

Everyone who deals with significant volumes of E-mail has issues from time to time, and I wouldn't draw conclusions about AppRiver based on just this one circumstance. I would imagine that it is hard to plan for how to deal with a broad-scale looping issue, and I'm sure this was a learning experience for them.

Matt