Re: So it seems "umount -f /nfs/mount" still doesn't work.....
At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods" wrote: Subject: So it seems "umount -f /nfs/mount" still doesn't work. > Curiously the kernel now does something I didn't quite expect when one tries to reboot a system with a stuck mount. I was able to see this as I was running a kernel that verbosely documents all its shutdown unmounts and detaches. In prior times I had reached for the power switch. At first it just hangs: lilbit# reboot -q [ 1131744.8297338] syncing disks... 3 3 done [ 1131744.9797408] unmounting 0xc1f27000 /more/work (more.local:/work)... [ 1131744.9907053] ok [ 1131744.9907053] unmounting 0xc1f24000 /more/archive (more.local:/archive)... [ 1131745.0004431] ok [ 1131745.0004431] unmounting 0xc1f21000 /more/home (more.local:/home)... [ 1131745.0097426] ok [ 1131745.0097426] unmounting 0xc1f1f000 /once/build (once.local:/build)... [ 1131745.0097426] ok [ 1131745.0210854] unmounting 0xc1f1b000 /future/build (future.local:/build)... [ 1131745.0210854] ok [ 1131745.0304676] unmounting 0xc1f11000 /building/build (building.local:/build)... this is me hitting ^T to try to see what's going on [ 1131753.2800902] load: 0.52 cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k [ 1132107.6651517] load: 0.48 cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k [ 1133247.8436109] load: 0.48 cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k then I hit ^C and immediately it proceeded ^C[ 1133249.3636755] unmounting 0xc1f0f000 /proc (procfs)... [ 1133249.3636755] ok [ 1133249.3636755] unmounting 0xc1f0d000 /dev/pts (ptyfs)... [ 1133249.3788641] unmounting 0xc1ecb000 /kern (kernfs)... [ 1133249.3843127] ok [ 1133249.3843127] unmounting 0xc1ec9000 /cache (/dev/wd1a)... [ 1133249.7636916] ok [ 1133249.7636916] unmounting 0xc1ec6000 /home (/dev/wd0g)... [ 1133249.7736976] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)... [ 1133250.0737098] unmounting 0xc1ab1000 /var (/dev/wd0e)... [ 1133250.1537121] unmounting 0xc1804000 / (/dev/wd0a)... [ 1133251.0337515] unmounting 0xc1f11000 /building/build (building.local:/build)... [ 1133251.0469644] unmounting 0xc1f0d000 /dev/pts (ptyfs)... [ 1133251.0469644] unmounting 0xc1ec6000 /home (/dev/wd0g)... [ 1133251.0579007] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)... [ 1133251.0637673] unmounting 0xc1ab1000 /var (/dev/wd0e)... [ 1133251.0637673] unmounting 0xc1804000 / (/dev/wd0a)... [ 1133251.0750403] sd0: detached [ 1133251.0750403] scsibus0: detached [ 1133251.0750403] gpio1: detached [ 1133251.0853614] sysbeep0: detached [ 1133251.0853614] midi0: detached [ 1133251.0853614] wd1: detached [ 1133251.0949369] uhub0: detached [ 1133251.0949369] com1: detached [ 1133251.0949369] usb0: detached [ 1133251.1045456] gpio0: detached [ 1133251.1045456] ohci0: detached [ 1133251.1045456] pchb0: detached [ 1133251.1151702] unmounting 0xc1f11000 /building/build (building.local:/build)... [ 1133251.1151702] unmounting 0xc1f0d000 /dev/pts (ptyfs)... [ 1133251.1279509] unmounting 0xc1ec6000 /home (/dev/wd0g)... [ 1133251.1279509] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)... [ 1133251.1393918] unmounting 0xc1ab1000 /var (/dev/wd0e)... [ 1133251.1448739] unmounting 0xc1804000 / (/dev/wd0a)... [ 1133251.1448739] forcefully unmounting /building/build (building.local:/build)... [ 1133251.1587138] forceful unmount of /building/build failed with error -3 [ 1133251.1653872] rebooting... So it seems there's some contention between the internal attempt to unmount the stuck NFS filesystem(s), and the reboot system call itself, but if the reboot command is interrupted, then the kernel can get on with its shutdown procedures, and eventually it actually forces the unmount of the stuck NFS filesystem. Another interesting thing to note is that /future/build was also stuck as future.local is offline at this time. However that's the filesystem I tried to clear first by hand with "umount -f /future/build", but that was stuck, apparently in the same call to nfs_reconnect(). It seems it had done enough that when the reboot() triggered unmounting that it could complete the unmount without problems. (The other mounts on more.local and once.local were responding so they unmounted normally.) -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpSyQ4PZfAFq.pgp Description: PGP signature
Re: pg_jobc going negative?
Date:Mon, 29 Jun 2020 23:22:52 +0200 From:Kamil Rytarowski Message-ID: | Ping? This kernel crash is blocking GDB/etc and it is an instant crash. Sorry, been side-tracked, will get to it soon. kre
So it seems "umount -f /nfs/mount" still doesn't work.....
So I've been meaning to test crash(8) for some time now, but had never found the occasion. Well today I found a good reason. I've been re-arranging some servers and ended up leaving some NFS mounts in place when I should not have. This of course causes no end of problems if the nightly cron jobs try to run as various processes can get stuck, e.g. the "find"s that run. The good news is that it is now possible to kill these processes (whereas in older releases they were forever unkillable despite using interruptable mounts). However "umount -f /nfs/mount" is still not working. It appears, from "ps", that it gets stuck in a socket call. # umount -f /future/build [ 2436610.5372572] load: 0.88 cmd: umount 2940 [tstile] 0.00u 0.00s 0% 236k [ 2436611.6273278] load: 0.88 cmd: umount 2940 [tstile] 0.00u 0.00s 0% 236k here I killed the stuck find processes from another window [ 2436643.7493935] load: 1.03 cmd: umount 2940 [nfscn2] 0.00u 0.00s 0% 236k [ 2436645.2894833] load: 1.03 cmd: umount 2940 [nfscn2] 0.00u 0.00s 0% 236k [ 2436788.9486762] load: 1.00 cmd: umount 2940 [nfscn2] 0.00u 0.01s 0% 236k [ 2436794.2290140] load: 1.00 cmd: umount 2940 [nfscn2] 0.00u 0.01s 0% 236k [ 2437106.0289114] load: 1.04 cmd: umount 2940 [nfscn2] 0.00u 0.02s 0% 236k [ 2441119.0864077] load: 1.00 cmd: umount 2940 [nfscn2] 0.00u 0.14s 0% 236k [ 2441394.6442117] load: 1.08 cmd: umount 2940 [nfscn2] 0.00u 0.16s 0% 236k Note also that fstat(1) is entirely useless at finding those stuck processes as it just gets stuck itself: # fstat /future/build [ 2436819.2006111] load: 1.22 cmd: fstat 6722 [tstile] 0.00u 0.00s 0% 440k [ 2436820.8607171] load: 1.22 cmd: fstat 6722 [tstile] 0.00u 0.00s 0% 440k Note as you'll see below this particular example is from a not-very-current current system, but I see the exact same behaviour on a stock 9.0 and on a much more current 9.99.64. So, let's see if we can figure out why "umount -f" isn't forcing things: 12:00 [1.60] # ps alx UID PID PPID CPU PRI NI VSZRSS WCHAN STAT TTY TIME COMMAND 0 0 00 125 0 0 159628 uvm DKl ? 1013:51.76 [system] 0 1 0 671 85 0 3968472 waitIs ? 0:00.01 init 32767 187 10 85 0 3060 1272 select Ss ? 1:05.62 /usr/sbin/mdnsd 0 249 10 85 0 8732 1216 kqueue Ss ? 0:06.69 /usr/sbin/syslogd -s 0 283 10 85 0 26552 1200 nanoslp Ss ? 0:08.00 /usr/pkg/sbin/xenwatchdogd 30 15 19 528 10 85 0 5256480 select Ss ? 1:38.50 /usr/sbin/rwhod -u _rwhod 0 563 10 85 0 46832 1944 select Isl ? 0:00.00 /usr/pkg/sbin/xl create /etc/xen/more.conf 0 574 10 85 0 26552984 nanoslp Ss ? 0:07.99 /usr/pkg/sbin/xenwatchdogd 30 15 0 591 10 85 0 33000 1864 select Sl ? 63:21.44 /usr/pkg/sbin/xenconsoled --log=guest 12 595 9130 85 0 6216 1228 kqueue I? 0:02.83 qmgr -l -t unix -u 0 599 10 85 0 9504 8712 pause Ss ? 2:51.72 /usr/sbin/ntpd -p /var/run/ntpd.pid -g 1000 609 19060 85 0 13660 7076 select Is ? 0:00.81 xterm -class UXTerm 0 611 1 20 85 0 2344316 kqueue Is ? 0:00.00 /usr/sbin/powerd 0 622 14 85 0 6756 1076 select Is ? 0:00.00 /usr/sbin/sshd 0 650 10 85 0 31720 1884 select I? 0:08.23 /usr/pkg/sbin/xenstored --pid-file /var/run/xenstored.pi 0 658 10 85 0 26512 3044 select Is ? 0:00.10 conserver: master: port 782, 1 local, 0 remote 0 913 10 85 0 2788864 kqueue Is ? 0:11.68 /usr/libexec/postfix/master -w 0 997 10 85 0 2544620 nanoslp Is ? 0:05.55 /usr/sbin/cron 1000 1002 13272 174 85 0 21204 4344 select Ss ? 0:00.15 xterm -class UXTerm 0 1004 10 85 0 2796864 kqueue Ss ? 0:00.02 /usr/sbin/inetd -l 0 1906 10040 85 0 2624684 select I? 0:00.01 rshd -L 0 3812 10040 85 0 2624684 select I? 0:00.01 rshd -L 1000 4007 38120 85 0 26472 20676 select Is ? 0:14.22 xterm -class UXTerm 0 5999 10040 85 0 2620648 select I? 0:00.01 rshd -L 0 8861 6580 85 0 26512 3124 select S? 0:04.76 conserver: group 1: port 65506, 1 console 1000 9003 161310 85 0 13648 1680 select Ss ? 0:02.53 xload 0 13272 1004 699 85 0 6124652 select S? 0:00.01 rshd -L 1000 14064 59990 85 0 21428 7896 select Ss ? 0:00.38 xterm -class UXTerm 0 15270 10040 85 0 6028652 select I? 0:00.01 rshd -L 12 16008 9130 85 0