Inserting my 2 cents here since that is all that it is worth.

 

In backing up what Matt said, let me relate a similar example of a problem
that occurred a year and a half ago to a major IT security products vendor:

 

At about 6:15 AM PT on a weekday in the middle of a normal busy week, their
content filtering servers began to become unresponsive. At first, it was
intermittent and hard to pinpoint, but within about 45 minutes they stopped
responding completely. Their appliances then did what they were designed to
do by default configuration: fail safe, that is, block all access if the
content filtering server does not respond. All one had to do was log onto
the appliance and change the fail-safe setting from block to allow. But this
is where the fun (not) began. Hundreds or more libraries, both public and
private, as well as schools, use those appliances and that content filtering
service. Guess what? They are bound by law to have content filtering in
place, meaning they could not turn the fail safe off. Companies, schools,
and libraries started screaming bloody murder and demanded a resolution an
hour ago. The content filtering service was finally restored at about
2:30 PM, if I recall correctly.
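
For anyone who has not worked with these appliances, the fail-safe choice
described above can be sketched roughly as follows. This is only an
illustration in Python, not the vendor's code: the names FailMode, decide,
and lookup are hypothetical, and I am assuming an unresponsive filtering
server shows up as a TimeoutError.

```python
from enum import Enum


class FailMode(Enum):
    BLOCK = "block"  # fail closed: no verdict means no access
    ALLOW = "allow"  # fail open: no verdict means let traffic through


def decide(url, lookup, fail_mode=FailMode.BLOCK):
    """Return True if the request for `url` may pass.

    `lookup` is a hypothetical callable that asks the content filtering
    server for a verdict (True = allowed) and raises TimeoutError when
    the server does not respond.
    """
    try:
        return lookup(url)
    except TimeoutError:
        # The filtering server is unreachable, so fall back to the
        # configured fail-safe behavior.
        return fail_mode is FailMode.ALLOW
```

With the default of FailMode.BLOCK, every request is refused while the
filtering server is down, which is what the appliances did. Switching to
FailMode.ALLOW restores access but removes the filtering that the schools
and libraries were required to keep.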

 

So, what happened? I mean this is a big company and it should have things in
place to prevent this. Right?

 

They did. As much as anyone would expect them to.

 

They had 4 servers. The servers were fine; they were still running. There
were no software changes, and in fact their tests showed the servers were
still responding. They were at a location with multiple internet
connections, and all tests showed those connections were up and working.
Power was flowing fine, and all the UPSs as well as the generator were
fine. Finally, after about 2 hours, the problem was found: my
understanding is that a single module in an enterprise router had failed,
but in a way that was hard to detect. Once it was found, the hardware
vendor sent a replacement part by courier.

 

My understanding is that it cost them well over 10 grand to eliminate that
one single point of failure. And that was just for the hardware.

 

Just goes to prove once again that in IT, 80% of the result comes from 20%
of the cost. The remaining 20% of the result is what costs the other 80%.

 

John T

 

From: Message Sniffer Community [mailto:[EMAIL PROTECTED] On Behalf
Of Matt
Sent: Friday, May 18, 2007 9:44 PM
To: Message Sniffer Community
Subject: [sniffer] Re: Appriver issue

 

I have something that I would also like to clear up.

When I indicated that AppRiver had removed its contact page, it likely just
wasn't operating at the time that I was attempting to access it.
Considering their issues, it would not be a surprise to see other problems
like this caused, but it seemed suspicious since their home page was working
and their contact page was not.  By the time it was pointed out that the
page was up, I did note that it was working again.

In no way did I ever believe that Pete or Sniffer had any direct involvement
in the system that created these problems, and in no way should this reflect
badly on Pete or Sniffer as far as I am concerned.

I was slightly miffed after getting off the phone with them where their
reaction quite clearly indicated that they were aware of the issue.  I
suggested that they take their servers off-line due to the issues that were
being caused, but I was probably barking up the wrong tree.  The servers
weren't taken off-line for another hour or so, or maybe this is when the
delivery servers caught up with the queued E-mail destined for my client.
I'm not sure why they didn't act on this sooner.  When you have a loop, it
is important to stop it, and their multi-homing made it difficult for others
to block.  One user received about 500 copies of the same message (and also
called them), and there were other examples that we saw which were much more
limited.  I do hope that they didn't choose to introduce new software at 11
a.m. ET on the busiest E-mail day of the week, and that this was only when
the problems surfaced...

Everyone who deals with significant volumes of E-mail has issues from time
to time, and I wouldn't draw conclusions about AppRiver based on just this
one circumstance.  I would imagine that it is hard to plan for how to deal
with a broad-scale looping issue, and I'm sure this was a learning
experience for them.

Matt




 
