Last Thursday, the routing ADs notified the routing WGs that there had been a problem with IETF mail. The problem caused some mail messages to IETF lists to be dropped. It was said that all senders whose messages had been dropped were notified individually.
A general alert to the WGs was thought to be unnecessary. However, the clarification below is less definite about whether the discovery of the lost messages and the notification of the senders was "foolproof".

So anyone who thinks that they sent mail last Thursday should look at the mailing list archive. Anyone who sent mail to ANY mailing list served by the IETF Mailman server should look at the appropriate archive.

--Sandy

________________________________________
From: [email protected] [[email protected]] on behalf of Glen [[email protected]]
Sent: Thursday, August 29, 2013 7:39 PM
To: Glen Barney
Subject: Mailman service interruption yesterday

Greetings:

Yesterday the IETF experienced a 5.5-hour service outage on its Mailman list processor. In this failure, Mailman started dropping messages, rather than bouncing them, as it should have done. This was problematic, because there was not an immediate indication that something was wrong, nor was there any automated or manual way that this outage could have been detected quickly.

Fortunately, Pete Resnick noted that several of his emails were not being delivered, and notified the emergency alert server at that time.

Following Pete's alert, Steve Young, the IETF system administrator, logged in to the system, and was able to analyze the problem and resolve it. He was able to restore service to Mailman and verify that mail was now flowing directly.

Steve reported that the cause of the outage seems to have been an incorrect permissions setting on a directory. We are still investigating whether the broken setting was due to software changes or a filesystem problem, and will continue to take steps to ensure that the current system remains healthy and operational.

Because of the nature of this outage, Steve was not, unfortunately, able to recover the lost email messages themselves. But he did spend several hours processing the mail logs for the outage period, and sent notifications to all the senders that he was able to locate in those logs, asking those individuals to resend their email messages. While not a foolproof response, this was a good course of action to take in an attempt to recover the lost messages.

Service has been up and running continuously since 2100 PDT yesterday, and no further outages have been observed. No other services were impacted or interrupted during this time.

Although not specifically mentioned in the release notes, we attribute this to the older version of Mailman currently in use. We note that we are deploying new IETF servers with the latest versions of Linux and Mailman (and other support software) to our colocation facilities next week, and hope to migrate the IETF to those servers soon. This should increase reliability and - we hope! - resolve whatever bug may have caused Mailman to drop these emails.

In addition, it has come to my attention that no notification was sent out generally following this outage. I was out sick today, at a string of medical appointments, but I failed to make clear to AMS that someone else would need to send a notification in my absence. I apologize for the delay, therefore, in getting this notification sent out.

Thank you for your patience. As always, if there are any questions, please let me know.

Glen Barney
IT Director
AMS (IETF Secretariat)

_______________________________________________
sidr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/sidr
