Re: So it seems "umount -f /nfs/mount" still doesn't work.....

2020-06-30 Thread Greg A. Woods
At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods"  wrote:
Subject: So it seems "umount -f /nfs/mount" still doesn't work.
> 

Curiously the kernel now does something I didn't quite expect when one
tries to reboot a system with a stuck mount.  I was able to see this as
I was running a kernel that verbosely documents all its shutdown
unmounts and detaches.  In prior times I had reached for the power switch.

At first it just hangs:

lilbit# reboot -q
[ 1131744.8297338] syncing disks... 3 3 done
[ 1131744.9797408] unmounting 0xc1f27000 /more/work (more.local:/work)...
[ 1131744.9907053] ok
[ 1131744.9907053] unmounting 0xc1f24000 /more/archive (more.local:/archive)...
[ 1131745.0004431] ok
[ 1131745.0004431] unmounting 0xc1f21000 /more/home (more.local:/home)...
[ 1131745.0097426] ok
[ 1131745.0097426] unmounting 0xc1f1f000 /once/build (once.local:/build)...
[ 1131745.0097426] ok
[ 1131745.0210854] unmounting 0xc1f1b000 /future/build (future.local:/build)...
[ 1131745.0210854] ok
[ 1131745.0304676] unmounting 0xc1f11000 /building/build 
(building.local:/build)...

   this is me hitting ^T to try to see what's going on 

[ 1131753.2800902] load: 0.52  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k
[ 1132107.6651517] load: 0.48  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k
[ 1133247.8436109] load: 0.48  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k

    then I hit ^C and immediately it proceeded 

^C[ 1133249.3636755] unmounting 0xc1f0f000 /proc (procfs)...
[ 1133249.3636755] ok
[ 1133249.3636755] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133249.3788641] unmounting 0xc1ecb000 /kern (kernfs)...
[ 1133249.3843127] ok
[ 1133249.3843127] unmounting 0xc1ec9000 /cache (/dev/wd1a)...
[ 1133249.7636916] ok
[ 1133249.7636916] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133249.7736976] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133250.0737098] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133250.1537121] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.0337515] unmounting 0xc1f11000 /building/build 
(building.local:/build)...
[ 1133251.0469644] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133251.0469644] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133251.0579007] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133251.0637673] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133251.0637673] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.0750403] sd0: detached
[ 1133251.0750403] scsibus0: detached
[ 1133251.0750403] gpio1: detached
[ 1133251.0853614] sysbeep0: detached
[ 1133251.0853614] midi0: detached
[ 1133251.0853614] wd1: detached
[ 1133251.0949369] uhub0: detached
[ 1133251.0949369] com1: detached
[ 1133251.0949369] usb0: detached
[ 1133251.1045456] gpio0: detached
[ 1133251.1045456] ohci0: detached
[ 1133251.1045456] pchb0: detached
[ 1133251.1151702] unmounting 0xc1f11000 /building/build 
(building.local:/build)...
[ 1133251.1151702] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133251.1279509] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133251.1279509] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133251.1393918] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133251.1448739] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.1448739] forcefully unmounting /building/build 
(building.local:/build)...
[ 1133251.1587138] forceful unmount of /building/build failed with error -3
[ 1133251.1653872] rebooting...


So it seems there's some contention between the internal attempt to
unmount the stuck NFS filesystem(s), and the reboot system call itself,
but if the reboot command is interrupted, then the kernel can get on
with its shutdown procedures, and eventually it actually forces the
unmount of the stuck NFS filesystem.

Another interesting thing to note is that /future/build was also stuck
as future.local is offline at this time.  However that's the filesystem
I tried to clear first by hand with "umount -f /future/build", but that
was stuck, apparently in the same call to nfs_reconnect().  It seems it
had done enough that when the reboot() triggered unmounting that it
could complete the unmount without problems.  (The other mounts on
more.local and once.local were responding so they unmounted normally.)

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpSyQ4PZfAFq.pgp
Description: PGP signature


Re: pg_jobc going negative?

2020-06-30 Thread Robert Elz
Date:Mon, 29 Jun 2020 23:22:52 +0200
From:Kamil Rytarowski 
Message-ID:  

  | Ping? This kernel crash is blocking GDB/etc and it is an instant crash.

Sorry, been side-tracked, will get to it soon.

kre



So it seems "umount -f /nfs/mount" still doesn't work.....

2020-06-30 Thread Greg A. Woods
So I've been meaning to test crash(8) for some time now, but had never
found the occasion.

Well today I found a good reason.  I've been re-arranging some servers
and ended up leaving some NFS mounts in place when I should not have.
This of course causes no end of problems if the nightly cron jobs try to
run as various processes can get stuck, e.g. the "find"s that run.  The
good news is that it is now possible to kill these processes (whereas in
older releases they were forever unkillable despite using interruptable
mounts).  However "umount -f /nfs/mount" is still not working.  It
appears, from "ps", that it gets stuck in a socket call.

# umount -f /future/build
[ 2436610.5372572] load: 0.88  cmd: umount 2940 [tstile] 0.00u 0.00s 0% 236k
[ 2436611.6273278] load: 0.88  cmd: umount 2940 [tstile] 0.00u 0.00s 0% 236k

 here I killed the stuck find processes from another window 

[ 2436643.7493935] load: 1.03  cmd: umount 2940 [nfscn2] 0.00u 0.00s 0% 236k
[ 2436645.2894833] load: 1.03  cmd: umount 2940 [nfscn2] 0.00u 0.00s 0% 236k
[ 2436788.9486762] load: 1.00  cmd: umount 2940 [nfscn2] 0.00u 0.01s 0% 236k
[ 2436794.2290140] load: 1.00  cmd: umount 2940 [nfscn2] 0.00u 0.01s 0% 236k
[ 2437106.0289114] load: 1.04  cmd: umount 2940 [nfscn2] 0.00u 0.02s 0% 236k
[ 2441119.0864077] load: 1.00  cmd: umount 2940 [nfscn2] 0.00u 0.14s 0% 236k
[ 2441394.6442117] load: 1.08  cmd: umount 2940 [nfscn2] 0.00u 0.16s 0% 236k


Note also that fstat(1) is entirely useless at finding those stuck
processes as it just gets stuck itself:

# fstat /future/build
[ 2436819.2006111] load: 1.22  cmd: fstat 6722 [tstile] 0.00u 0.00s 0% 440k
[ 2436820.8607171] load: 1.22  cmd: fstat 6722 [tstile] 0.00u 0.00s 0% 440k


Note as you'll see below this particular example is from a
not-very-current current system, but I see the exact same behaviour on a
stock 9.0 and on a much more current 9.99.64.

So, let's see if we can figure out why "umount -f" isn't forcing things:


12:00 [1.60] # ps alx
  UID   PID  PPID  CPU PRI NI   VSZRSS WCHAN   STAT TTY   TIME 
COMMAND
0 0 00 125  0 0 159628 uvm DKl  ?   1013:51.76 
[system]
0 1 0  671  85  0  3968472 waitIs   ?  0:00.01 init 
32767   187 10  85  0  3060   1272 select  Ss   ?  1:05.62 
/usr/sbin/mdnsd 
0   249 10  85  0  8732   1216 kqueue  Ss   ?  0:06.69 
/usr/sbin/syslogd -s 
0   283 10  85  0 26552   1200 nanoslp Ss   ?  0:08.00 
/usr/pkg/sbin/xenwatchdogd 30 15 
   19   528 10  85  0  5256480 select  Ss   ?  1:38.50 
/usr/sbin/rwhod -u _rwhod 
0   563 10  85  0 46832   1944 select  Isl  ?  0:00.00 
/usr/pkg/sbin/xl create /etc/xen/more.conf 
0   574 10  85  0 26552984 nanoslp Ss   ?  0:07.99 
/usr/pkg/sbin/xenwatchdogd 30 15 
0   591 10  85  0 33000   1864 select  Sl   ? 63:21.44 
/usr/pkg/sbin/xenconsoled --log=guest 
   12   595   9130  85  0  6216   1228 kqueue  I?  0:02.83 qmgr 
-l -t unix -u 
0   599 10  85  0  9504   8712 pause   Ss   ?  2:51.72 
/usr/sbin/ntpd -p /var/run/ntpd.pid -g 
 1000   609  19060  85  0 13660   7076 select  Is   ?  0:00.81 
xterm -class UXTerm 
0   611 1   20  85  0  2344316 kqueue  Is   ?  0:00.00 
/usr/sbin/powerd 
0   622 14  85  0  6756   1076 select  Is   ?  0:00.00 
/usr/sbin/sshd 
0   650 10  85  0 31720   1884 select  I?  0:08.23 
/usr/pkg/sbin/xenstored --pid-file /var/run/xenstored.pi
0   658 10  85  0 26512   3044 select  Is   ?  0:00.10 
conserver: master: port 782, 1 local, 0 remote 
0   913 10  85  0  2788864 kqueue  Is   ?  0:11.68 
/usr/libexec/postfix/master -w 
0   997 10  85  0  2544620 nanoslp Is   ?  0:05.55 
/usr/sbin/cron 
 1000  1002 13272  174  85  0 21204   4344 select  Ss   ?  0:00.15 
xterm -class UXTerm 
0  1004 10  85  0  2796864 kqueue  Ss   ?  0:00.02 
/usr/sbin/inetd -l 
0  1906  10040  85  0  2624684 select  I?  0:00.01 rshd 
-L 
0  3812  10040  85  0  2624684 select  I?  0:00.01 rshd 
-L 
 1000  4007  38120  85  0 26472  20676 select  Is   ?  0:14.22 
xterm -class UXTerm 
0  5999  10040  85  0  2620648 select  I?  0:00.01 rshd 
-L 
0  8861   6580  85  0 26512   3124 select  S?  0:04.76 
conserver: group 1: port 65506, 1 console 
 1000  9003 161310  85  0 13648   1680 select  Ss   ?  0:02.53 
xload 
0 13272  1004  699  85  0  6124652 select  S?  0:00.01 rshd 
-L 
 1000 14064  59990  85  0 21428   7896 select  Ss   ?  0:00.38 
xterm -class UXTerm 
0 15270  10040  85  0  6028652 select  I?  0:00.01 rshd 
-L 
   12 16008   9130  85  0