Last Thursday, the routing ADs notified the routing wgs that there had been a 
problem with IETF mail.  The problem caused some mail messages to IETF lists to 
be dropped.  It was said that all senders whose messages had been dropped were 
notified individually.

A general alert to the wgs was thought to be unnecessary.  However, the 
clarification below is less definite: it describes the discovery of the lost 
messages and the notification of the senders as not "foolproof".

So anyone who thinks that they sent mail last Thursday should look at the 
mailing list archive.  Anyone who sent mail to ANY mailing list served by the 
IETF mailman server should look at the appropriate archive.

--Sandy

________________________________________
From: [email protected] [[email protected]] on behalf of Glen 
[[email protected]]
Sent: Thursday, August 29, 2013 7:39 PM
To: Glen Barney
Subject: Mailman service interruption yesterday

Greetings:

Yesterday the IETF experienced a 5.5-hour service outage on its
Mailman list processor.

In this failure, Mailman started dropping messages, rather than
bouncing them, as it should have done.  This was problematic, because
there was not an immediate indication that something was wrong, nor
was there any automated or manual way that this outage could have been
detected quickly.  Fortunately, Pete Resnick noted that several of
his emails were not being delivered, and notified the emergency alert
server at that time.

Following Pete's alert, Steve Young, the IETF system administrator,
logged in to the system, and was able to analyze the problem and
effect resolution.  He was able to restore service to Mailman and
verify that mail was now flowing correctly.

Steve reported that the cause of the outage seems to have been an
incorrect permissions setting on a directory.  We are still
investigating whether the broken setting was due to software changes
or a filesystem problem, and will continue to take steps to ensure
that the current system remains healthy and operational.

Because of the nature of this outage, Steve was not, unfortunately,
able to recover the lost email messages themselves.  But he did spend
several hours processing the mail logs for the outage period, and
sent notifications to all the senders that he was able to locate in
those logs, asking those individuals to resend their email messages.
While not a foolproof response, this was a good course of action to
take in an attempt to recover the lost messages.

Service has been up and running continuously since 2100 PDT yesterday,
and no further outages have been observed.  No other services were
impacted or interrupted during this time.

Although the bug is not specifically mentioned in the release notes, we
attribute the failure to the older version of Mailman currently in use.
We note that
we are deploying new IETF servers with the latest versions of Linux
and Mailman (and other support software) to our colocation facilities
next week, and hope to migrate the IETF to those servers soon.  This
should increase reliability and - we hope! - resolve whatever bug may
have caused Mailman to drop these emails.

In addition, it has come to my attention that no notification was sent
out generally following this outage.  I was out sick today, at a
string of medical appointments, and I failed to make clear to AMS that
someone else would need to send a notification in my absence.  I
apologize for the delay, therefore, in getting this notification sent
out.

Thank you for your patience.  As always, if there are any questions,
please let me know.

Glen Barney
IT Director
AMS (IETF Secretariat)
_______________________________________________
sidr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/sidr
