Package: systemd
Version: 241-5

Ever since the switch to systemd we have been experiencing significant problems 
with our diskless nodes: if the NFS connection is dropped for any reason 
(NFS server reboot, network router state reset, etc.), there is a high chance 
the nodes will enter an unrecoverable state and require a hard reset 
(power off, power on).

While we've been working around this for a while and assumed it was just a 
Debian quirk, I was able to obtain the following trace from the console of a 
hung system today:

[820689.313769] nfs: server 192.168.1.1 not responding, still trying
[820693.530338] nfs: server 192.168.1.1 not responding, still trying
[820693.530451] nfs: server 192.168.1.1 not responding, still trying
[820696.994677] nfs: server 192.168.1.1 not responding, still trying
[820697.218891] nfs: server 192.168.1.1 not responding, still trying
[820697.698918] nfs: server 192.168.1.1 not responding, still trying
[820698.106834] nfs: server 192.168.1.1 not responding, still trying
[820721.177609] nfs: server 192.168.1.1 not responding, still trying
[820725.466102] nfs: server 192.168.1.1 not responding, still trying
[820818.681006] watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [systemd-logind:273]
[820932.960202] INFO: task openvpn:5096 blocked for more than 120 seconds.
[820937.889046] nfs: server 192.168.1.1 OK
[820937.889226] nfs: server 192.168.1.1 OK
[820937.889374] nfs: server 192.168.1.1 OK
[820937.889381] nfs: server 192.168.1.1 OK
[820937.889448] nfs: server 192.168.1.1 OK
[820937.889503] nfs: server 192.168.1.1 OK
[820937.889574] nfs: server 192.168.1.1 OK
[820937.889665] nfs: server 192.168.1.1 OK
[820937.889670] nfs: server 192.168.1.1 OK
[820937.889674] nfs: server 192.168.1.1 OK
[820937.903880] systemd-journald[171]: Failed to open system journal: Permission denied
[820938.083071] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
[820938.111157] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
[820938.124774] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
[820938.131244] systemd[1]: systemd-journald.service: Unit entered failed state.
[820938.131418] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[820938.144754] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=6/ABRT
[820938.170807] systemd[1]: systemd-udevd.service: Failed to kill control group /system.slice/systemd-udevd.service, ignoring: Permission denied
[820938.177666] systemd[1]: systemd-udevd.service: Unit entered failed state.
[820938.177798] systemd[1]: systemd-udevd.service: Failed with result 'watchdog'.
[820938.189036] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.

This fairly clearly places the blame somewhere in systemd, which makes sense, 
as our older non-systemd machines recover perfectly fine from even extended 
NFS server failures. At a minimum, the systemd watchdog should probably be 
disabled while the NFS server is unavailable.
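
For reference, the failing services above (systemd-journald and systemd-udevd) 
both use systemd's service watchdog (WatchdogSec=), so the timeout can be 
raised or disabled with a drop-in override rather than by patching the units. 
The following is only a sketch of that workaround; the drop-in file name and 
the 10-minute value are my own assumptions and would need tuning to match how 
long our NFS outages actually last:

# /etc/systemd/system/systemd-journald.service.d/watchdog.conf
# (can be created with: systemctl edit systemd-journald.service)
[Service]
# Assumed value: longer than a typical NFS outage on our network.
# WatchdogSec=0 would disable the watchdog for this service entirely.
WatchdogSec=10min

The same drop-in can be applied to systemd-udevd.service, followed by a 
systemctl daemon-reload and a restart of the affected services. This only 
papers over the symptom, of course: the underlying problem of the watchdog 
firing while the root filesystem is hung on NFS would remain.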
