Hello Matt,


Saturday, May 19, 2007, 12:44:25 AM, you wrote:


<snip/>


> I was slightly miffed after getting off the phone with them, when their reaction quite clearly indicated that they were aware of the issue.  I suggested that they take their servers offline because of the problems being caused, but I was probably barking up the wrong tree.  The servers weren't taken offline for another hour or so, or maybe that is when the delivery servers caught up with the queued E-mail destined for my client.  I'm not sure why they didn't act on this sooner.  When you have a loop, it is important to stop it, and their multi-homing made it difficult for others to block.


The response time was actually nearly immediate, but the effects lingered for a while due to a number of factors. I was one of the folks who detected the onset of this event, by noticing packet loss. The entire combined technical team was engaged in the problem within minutes (single digits) of detection.


The event presented as a DoS attack because of the heavy traffic and its effects. After a short analysis (again, single-digit minutes) we identified the true problem, and the appropriate team immediately began correcting the issue.


Shutting down the servers was neither necessary nor a viable solution (though I'm sure it was considered). Even if that choice had been made, the effects would have been the same due to the size of the system. That is, it would have taken as long to shut down the servers as it did to correct the software, and then there would have been significantly more collateral damage as a result.


Given the circumstances, the best choice was made, and I'm amazed at how quickly the entire team was able to become positively involved (and coordinated) in solving the problem, mitigating damage, and recovering normal operations -- all while handling an understandably huge inrush of support calls.
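(As an aside on the loop itself: the usual way a mail system breaks a loop is to count the Received headers a message has accumulated and refuse it once a hop limit is exceeded. Here's a rough sketch in Python, purely illustrative -- the hop limit of 25 and the names are assumptions of mine, not anyone's actual code or configuration.)

    import email

    MAX_HOPS = 25  # assumed hop limit; real MTAs make this configurable

    def looks_like_mail_loop(raw_message: bytes) -> bool:
        # Each relay adds one Received header, so an unusually high count
        # almost always means the message is circling between servers.
        msg = email.message_from_bytes(raw_message)
        hops = len(msg.get_all("Received") or [])
        return hops > MAX_HOPS

(In practice a check like this runs in the MTA before a message is queued again, which is part of why stopping a loop tends to mean correcting the software rather than pulling servers offline.)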


<snip/>


> Everyone who deals with significant volumes of E-mail has issues from time to time, and I wouldn't draw conclusions about AppRiver based on just this one circumstance.  I would imagine that it is hard to plan for how to deal with a broad-scale looping issue, and I'm sure this was a learning experience for them.


Clearly, and thanks for that!


There have already been a number of procedural changes and new tools developed as a result of this event; the investigation is ongoing; and additional system changes will be forthcoming to make these kinds of events far less likely and to harden subsystems against their effects, whether they are caused unintentionally (as this one was) or otherwise.


_M


-- 

Pete McNeil

Chief Scientist,

Arm Research Labs, LLC.

#############################################################
This message is sent to you because you are subscribed to
  the mailing list <sniffer@sortmonster.com>.
To unsubscribe, E-mail to: <[EMAIL PROTECTED]>
To switch to the DIGEST mode, E-mail to <[EMAIL PROTECTED]>
To switch to the INDEX mode, E-mail to <[EMAIL PROTECTED]>
Send administrative queries to  <[EMAIL PROTECTED]>


