Re: After network outage postfix found not running

2021-12-23 Thread jdebert
On Thu, 23 Dec 2021 17:16:10 -0700
Bob Proulx  wrote:

> Wietse Venema wrote:
> > Postfix was only the messenger of bad news. It does not
> > spontaneously self-destruct.  
> 
> I have always found Postfix to be extremely reliable and robust.
> Which was why this happening on two different systems was such an
> oddity.
> 
> Bob

From my own observations on debian:

systemd's default config does not wait for the network before starting
postfix and will not retry. If it is actually set up to wait, then
systemd is ignoring that bit.

--


Re: After network outage postfix found not running

2021-12-23 Thread Bob Proulx
Viktor Dukhovni wrote:
> Could a watchdog timer have killed master(8) if it were suspended
> long enough?

Seems plausible.  I could see something in the code timing out since
things would be blocked waiting for I/O for so long.a

> Demi Marie Obenour:
> > My intuition is that either some timeout somewhere got hit, or that
> > some I/O failed (rather than being queued forever) and caused an error
> > paging in some code.  That would cause Postfix to die with SIGBUS.
> 
> If the file system was unavailable, then yes, failure to page in
> some code would be fatal.

This is a good brainstorm.  I wasn't thinking about the swap side of
memory.  It seems very plausible to me that a paged out block might
have been needed.  And that might have timed out and been reported as
a an I/O failure.  Which would have killed the process.  Or possibly
the reverse.  The system may have tried to page out a block and the
writing of that block may have timed out as well.

> > Do you have Postfix set to automatically be restarted if it crashes?

No.  Postfix is very reliable and robust.  It has never been needed.

And I think I will resist the urge to add automated restarting of
postfix now too.  Because this was a very unusual situation.  I know
we always fight the last war.  I doubt this will be a repeating
problem.  But it would add a layer of snag that another admin might
not be expecting.

Plus I have now learned that if the network is offline for any
significant time then all affected systems should be rebooted as a
precautionary.  And a reboot is always okay.  Systems reboot just
fine.

Instead I think I will add a watchdog of some sort that would
automatically detect this type of network attached storage outage and
then automatically reboot the system if it detects that it is
recovering from such a state.  That's harder to do.  But it solves the
problem for the entire system globally.

> I expect that the restart would fail for the same reason as you
> describe above.

I would expect that it would block waiting for I/O and simply wait to
start.  It would stack up as another process that increases the load
average.  And then eventually when the disk request was serviced then
it would continue and start then.

Thank you everyone for brainstorming along with me.  It's a good
learning experience.  And I think I know I need a way to detect that
the network attached block storage has been offline too long and that
the system when recovered from that needs to be rebooted.

Bob


Re: After network outage postfix found not running

2021-12-23 Thread Bob Proulx
Wietse Venema wrote:
> Postfix was only the messenger of bad news. It does not
> spontaneously self-destruct.

I have always found Postfix to be extremely reliable and robust.
Which was why this happening on two different systems was such an oddity.

Bob


Re: After network outage postfix found not running

2021-12-23 Thread Bob Proulx
Matus UHLAR - fantomas wrote:
> it's still possible that:
> - postfix was killed by e.g. OOM killer, in which case it could not log that.

I disable the OOM with vm.overcommit_memory = 2 so that particular
thing won't be it.

> - the logs were lost because of systemd's log limits

That is possible.  The two failing systems were ones running systemd.
I am not a fan.  I am looking at rsyslog logging.

> there are multiple lined of postfix/master.
> 
> it also could be systemd restarting postfix and giving up after some time

I don't believe systemd will try to restart postfix.

Good ideas though.  Thank you for brainstorming along with me.

Bob


Re: After network outage postfix found not running

2021-12-23 Thread Viktor Dukhovni
Could a watchdog timer have killed master(8) if it were suspended
long enough?

> On 23 Dec 2021, at 1:57 pm, Wietse Venema  wrote:
> 
>> My intuition is that either some timeout somewhere got hit, or that
>> some I/O failed (rather than being queued forever) and caused an error
>> paging in some code.  That would cause Postfix to die with SIGBUS.
> 
> If the file system was unavailable, then yes, failure to page in 
> some code would be fatal.

-- 
Viktor.



Re: After network outage postfix found not running

2021-12-23 Thread Wietse Venema
Demi Marie Obenour:
> My intuition is that either some timeout somewhere got hit, or that
> some I/O failed (rather than being queued forever) and caused an error
> paging in some code.  That would cause Postfix to die with SIGBUS.

If the file system was unavailable, then yes, failure to page in 
some code would be fatal.

> Do you have Postfix set to automatically be restarted if it crashes?

I expect that the restart would fail for the same reason as you
describe above.

Wietse


Re: After network outage postfix found not running

2021-12-23 Thread Matus UHLAR - fantomas

Bob Proulx:
> Any ideas on why postfix would not be running after such an event on
> two of the systems but okay on the others?



Wietse Venema wrote:

LOGS. Postfix logs a sh*load, including processes that fail to
start. If the systems were unable to record this in LOGS, then you
will never know.


On 22.12.21 21:41, Bob Proulx wrote:

I guess we will never know then.  Because I showed the relevant logs.
I would have showed more but the large message was rejected due to
size.  But there wasn't anything more clueful than the logs I showed.

It's not terribly important.  It was just an oddity.  Because Postfix
is so very reliable that it was unusual to see on two systems it had
stopped.  But again it is very unusual to have the root file system
blocking for so long.


it's still possible that:
- postfix was killed by e.g. OOM killer, in which case it could not log
that.
- the logs were lost because of systemd's log limits

there are multiple lined of postfix/master.

it also could be systemd restarting postfix and giving up after some time

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
BSE = Mad Cow Desease ... BSA = Mad Software Producents Desease


Re: After network outage postfix found not running

2021-12-23 Thread Wietse Venema
Bob Proulx:
> Wietse Venema wrote:
> > Bob Proulx:
> > > Any ideas on why postfix would not be running after such an event on
> > > two of the systems but okay on the others?
> > 
> > LOGS. Postfix logs a sh*load, including processes that fail to
> > start. If the systems were unable to record this in LOGS, then you
> > will never know.
> 
> I guess we will never know then.  Because I showed the relevant logs.
> I would have showed more but the large message was rejected due to
> size.  But there wasn't anything more clueful than the logs I showed.

Postfix was only the messenger of bad news. It does not
spontaneously self-destruct.

Wietse


Re: After network outage postfix found not running

2021-12-22 Thread Demi Marie Obenour
On 12/22/21 8:26 PM, Bob Proulx wrote:
> Everything is good so no stress about anything here but I am poking at
> the log files with a stick after a strange incident.  Perhaps this
> tripped over some problem that discussing it might either enlighten me
> or perhaps unlikely improve things.  Who knows?
> 
> The GNU Savannah software forge had a network outage in the data
> center lasting about eight hours.  It was the dark of the night and
> things were fixed quickly once the admins woke up at sunrise and went
> to the data center to fix things.  Due to the timing this was an
> unusually long network outage event.
> 
> I would like to describe four of the VMs of interest here.  Two were
> okay after networking returned.  But two were found afterward without
> postfix running.  I am curious if the why is somehow useful or
> interesting to know.
> 
> All of the systems have their root block storage on a ceph network
> attached storage pool.  Which of course meant that the root file
> system was unavailable for the full time of the eight hour outage.
> Therefore if some bit of file data is cached and not expired then the
> Linux kernel can service the request.  If not and if it needs to read
> the data then it attempts a network read and of course blocks waiting
> for network I/O.
> 
> Of course cron jobs were still running.  And stacking up processes
> blocked on I/O waiting.  One server achieved a load average of 520 and
> was perfectly fine recovering after networking was restored.  Another
> reached a load of 68 but afterward the postfix daemons were found to
> be not running.  In summary: Two of the four had no discernible
> failures.  Two of the four were found with postfix not running
> afterward.  Postfix seems to have been the only noticed failure.
> Which I found rather unusual.  And perhaps noteworthy.  But this is a
> very unusual situation where the root file system is unavailable for
> an extended period of time.
> 
> I am simply reviewing things afterward now.  Trying to understand and
> perhaps improve things.  Since two of them failed.  But again there is
> no stress here.  Everything is all good now.  And this is a highly
> unusual system event.  Because those two were found without postfix
> running I rebooted all of the servers subsequently as a preventative
> maintenance action.  Even though others seemed perfectly okay
> afterward.  Because almost certainly there would be other as yet not
> found problems.
> 
> Any ideas on why postfix would not be running after such an event on
> two of the systems but okay on the others?
> 
> Bob

My intuition is that either some timeout somewhere got hit, or that
some I/O failed (rather than being queued forever) and caused an error
paging in some code.  That would cause Postfix to die with SIGBUS.

Do you have Postfix set to automatically be restarted if it crashes?

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

OpenPGP_0xB288B55FFF9C22C1.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


Re: After network outage postfix found not running

2021-12-22 Thread Bob Proulx
Wietse Venema wrote:
> Bob Proulx:
> > Any ideas on why postfix would not be running after such an event on
> > two of the systems but okay on the others?
> 
> LOGS. Postfix logs a sh*load, including processes that fail to
> start. If the systems were unable to record this in LOGS, then you
> will never know.

I guess we will never know then.  Because I showed the relevant logs.
I would have showed more but the large message was rejected due to
size.  But there wasn't anything more clueful than the logs I showed.

It's not terribly important.  It was just an oddity.  Because Postfix
is so very reliable that it was unusual to see on two systems it had
stopped.  But again it is very unusual to have the root file system
blocking for so long.

Bob


Re: After network outage postfix found not running

2021-12-22 Thread Wietse Venema
Bob Proulx:
> Any ideas on why postfix would not be running after such an event on
> two of the systems but okay on the others?

LOGS. Postfix logs a sh*load, including processes that fail to
start. If the systems were unable to record this in LOGS, then you
will never know.

Wietse


After network outage postfix found not running

2021-12-22 Thread Bob Proulx
Everything is good so no stress about anything here but I am poking at
the log files with a stick after a strange incident.  Perhaps this
tripped over some problem that discussing it might either enlighten me
or perhaps unlikely improve things.  Who knows?

The GNU Savannah software forge had a network outage in the data
center lasting about eight hours.  It was the dark of the night and
things were fixed quickly once the admins woke up at sunrise and went
to the data center to fix things.  Due to the timing this was an
unusually long network outage event.

I would like to describe four of the VMs of interest here.  Two were
okay after networking returned.  But two were found afterward without
postfix running.  I am curious if the why is somehow useful or
interesting to know.

All of the systems have their root block storage on a ceph network
attached storage pool.  Which of course meant that the root file
system was unavailable for the full time of the eight hour outage.
Therefore if some bit of file data is cached and not expired then the
Linux kernel can service the request.  If not and if it needs to read
the data then it attempts a network read and of course blocks waiting
for network I/O.

Of course cron jobs were still running.  And stacking up processes
blocked on I/O waiting.  One server achieved a load average of 520 and
was perfectly fine recovering after networking was restored.  Another
reached a load of 68 but afterward the postfix daemons were found to
be not running.  In summary: Two of the four had no discernible
failures.  Two of the four were found with postfix not running
afterward.  Postfix seems to have been the only noticed failure.
Which I found rather unusual.  And perhaps noteworthy.  But this is a
very unusual situation where the root file system is unavailable for
an extended period of time.

I am simply reviewing things afterward now.  Trying to understand and
perhaps improve things.  Since two of them failed.  But again there is
no stress here.  Everything is all good now.  And this is a highly
unusual system event.  Because those two were found without postfix
running I rebooted all of the servers subsequently as a preventative
maintenance action.  Even though others seemed perfectly okay
afterward.  Because almost certainly there would be other as yet not
found problems.

Any ideas on why postfix would not be running after such an event on
two of the systems but okay on the others?

Bob

I am abbreviating everything because it was too large for the mailing
list on the first sending of this message. :-)

Here is one of the config files for the two where postfix was found
not running.  I'll put the long details below.  The other two are
similar.  This is on an older Trisquel OS.  Trisquel is a fork of
Ubuntu.  So basically think Ubuntu here.

root@vcs1:~# postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
append_dot_mydomain = no
biff = no
canonical_maps = hash:/etc/postfix/canonical
compatibility_level = 2
inet_interfaces = loopback-only
inet_protocols = ipv4
mailbox_size_limit = 0
masquerade_domains = savannah.gnu.org
masquerade_exceptions = root
mydestination = $myhostname, localhost.$mydomain, localhost
myhostname = vcs1.savannah.gnu.org
mynetworks = 127.0.0.0/8 [:::127.0.0.0]/104 [::1]/128
myorigin = /etc/mailname
non_smtpd_milters = unix:/var/run/opendkim/opendkim.sock
readme_directory = no
recipient_delimiter = +
relayhost =
smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache
smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU)
smtpd_milters = unix:/var/run/opendkim/opendkim.sock
smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated 
defer_unauth_destination
smtpd_tls_cert_file = /etc/ssl/certs/ssl-cert-snakeoil.pem
smtpd_tls_key_file = /etc/ssl/private/ssl-cert-snakeoil.key
smtpd_tls_session_cache_database = btree:${data_directory}/smtpd_scache
smtpd_use_tls = yes

And here is the smaller of the log events.  Timezone is US/Eastern for both.

...everything is fine...
Dec 20 01:45:45 vcs1 postfix/qmgr[1996]: A25F0A908C: removed
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: 
read timeout for service public/pickup
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: 
read timeout for service public/pickup
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: 
read timeout for service public/pickup
...I trimmed out 27 identical lines just to reduce this...
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: 
read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event: 
read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event: 
read timeout for service public/pickup
Dec 20