daily CVS update output

2023-11-08 NetBSD source update


Updating src tree:
P src/distrib/sets/makeflist
P src/distrib/sets/makeobsolete
P src/distrib/sets/makeplist
P src/distrib/sets/makesrctars
P src/distrib/sets/makesums
P src/distrib/sets/maketars
P src/doc/CHANGES
P src/etc/Makefile
P src/external/bsd/tcpdump/tcpdump2netbsd
P src/share/man/man4/gpiosim.4
P src/sys/dev/gpio/gpiosim.c
P src/sys/sys/un.h

Updating xsrc tree:


Killing core files:




Updating file list:
-rw-rw-r--  1 srcmastr  netbsd  42966873 Nov  9 03:04 ls-lRA.gz


Re: random lockups (now suspecting zfs)

2023-11-08 Greg Troxel
Stephen Borrill writes:

> On Sat, 4 Nov 2023, Simon Burge wrote:
>> Greg Troxel wrote:
>>
>>> So to me this feels like a locking botch in a rare path in zfs.
>>
>> This appears to be the case.  Chuck Silvers has some understanding of
>> the problem and I'm helping test, but at this stage there isn't a fix
>> available. :/
>
> It's interesting that you see the lockups during pkgsrc builds, i.e. a
> period where there is lots of file creation. We use zfs on backup
> systems that pull in data with rsync. During the initial runs (where
> every file is new) we usually get a couple of lockups, but during day-to-day
> operation (few changes) it is reliable. These are on physical
> and virtual machines running NetBSD 9 with the rule of thumb of 1GB
> RAM per TB of storage obeyed, but no patches besides setting MAXPHYS
> in the module to 32k for Xen.

I just had another problem, on the non-Xen 32GB machine (which has 3.5T
in the pool, only half full).

The machine wasn't really doing much: X running with xfce, a few xterms,
an ssh client, pidgin, and an idle firefox with, I think, 24 tabs.

I found it mostly normal and was using an ssh session, and then switched
to the firefox virtual desktop, which failed to redraw.  I tried to kill
firefox (because firefox hanging is not so strange :-() and found that a
few of the tabs appeared to be stuck in flt_noram and zio_buf.  I think
there might have been a different wchan earlier that was a zfs one but
not zio_buf.

I think it got into this state due to firefox leaking memory (in SIZE but
not RES?).

(So it might be a missing wakeup on flt_noram, but a lock not being
released seems plausible too.  Totally guessing here.)

(As I was composing this message (in tmux on another machine), the
firefox lockup deteriorated into more things hanging and then a total
lockup.  I was unable to ctrl-alt-f1 to get back to the text console.)
The machine still responds to mDNS queries and pings, and sshd answers,
but I see "local version string" and the "remote protocol" line never
appears.

I should try LOCKDEBUG on the package building box (where, if it doesn't
work right, that's much more OK!).



10853   129  9817    0  85  0  2762264 180704 uvnfp2   DEl  ttyp5   7:12.20 (firefox)
10853   994  9817    0   0  0        0      0 -        Z    ttyp5   0:00.00 (firefox)
10853  1867  9817    0  85  0  3423848 723944 uvnfp2   DEl  ttyp5 146:52.32 (firefox)
10853  7407  9817    0  85  0 20169184 355160 flt_nora DEl  ttyp5  63:48.76 (firefox)
10853  7630  9817    0   0  0        0      0 -        Z    ttyp5   0:00.00 (firefox)
10853  8451  9817    0  85  0  2712376 126588 uvnfp2   DEl  ttyp5   7:09.93 (firefox)
10853  8504  9817    0  85  0  2744608 143008 uvnfp2   DEl  ttyp5   9:56.45 (firefox)
10853  9817     1   21 117  0 12939188 948252 zio_buf_ DEl  ttyp5 303:41.53 (firefox)
10853 11066  9817    0  85  0  2821832 225664 >db_     DEl  ttyp5   1:34.01 (firefox)
10853 11769  9817    0  85  0  2849780 232172 uvnfp2   DEl  ttyp5   9:19.27 (firefox)
10853 12055  9817    0  85  0  2832852 144304 uvnfp2   DEl  ttyp5   8:49.22 (firefox)
10853 13075  9817    0  85  0  2782516 193652 plpg     DEl  ttyp5   9:00.21 (firefox)
10853 15399  9817    0  85  0  2822236 249496 uvnfp2   DEl  ttyp5  10:12.41 (firefox)
10853 15991  9817    0  85  0  2775316 187104 uvnfp2   DEl  ttyp5   7:13.63 (firefox)
10853 16033  9817    0   0  0        0      0 -        Z    ttyp5   0:00.00 (firefox)
10853 16877  9817    0  85  0  2731156 148896 uvnfp2   DEl  ttyp5   1:59.22 (firefox)
10853 17275  9817    0   0  0        0      0 -        Z    ttyp5   0:00.00 (firefox)
10853 19768  9817    0  85  0  2760188 152880 uvnfp2   DEl  ttyp5   7:11.17 (firefox)
10853 21618  9817    0   0  0        0      0 -        Z    ttyp5   0:00.00 (firefox)
10853 24342  9817    0  85  0  2737588 148452 uvnfp2   DEl  ttyp5  11:51.61 (firefox)
10853 24956  9817    0  85  0  2981764 336852 uvnfp2   DEl  ttyp5  20:20.13 (firefox)
10853 26368  9817    0  85  0  3164560 240992 uvnfp2   DEl  ttyp5  19:28.72 (firefox)
10853 26981  9817 1123  85  0  3659088 770432 flt_nora DEl  ttyp5  84:09.22 (firefox)
10853 27139  9817    0   0  0        0      0 -        Z    ttyp5   0:00.00 (firefox)
10853 29076  9817 2270  85  0  2975552 261064 flt_nora DEl  ttyp5  88:44.15 (firefox)

top says

Memory: 14G Act, 6989M Inact, 88M Wired, 549M Exec, 13G File, 228M Free
Swap: 40G Total, 348M Used, 40G Free / Pools: 11G Used

so it did get into paging (348M of swap in use).

vmstat -s:
    4096 bytes per page
      16 page colors
 8079588 pages managed
   58419 pages free
 3733123 pages active
 1789074 pages inactive
       1 pages paging
   22427 pages wired
       1 reserve pagedaemon pages
      40 reserve kernel pages
  252503 boot kernel pages
 2817259 kernel pool pages
 2027769 anonymous pages
 3376469 cached file pages
  140490 cached executable pages
    2048 minimum free pages
    2730 target free pages
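The vmstat -s counters are in 4096-byte pages, so converting them gives figures that can be lined up against top's summary line (for example 58419 free pages is about 228M, and 2817259 kernel pool pages is about 11G of pools). A quick Python sketch of that arithmetic; the counts are copied from the output above, and the dictionary keys and helper names are just illustrative labels, not anything vmstat itself produces:

```python
# Convert selected `vmstat -s` page counts (copied from the output above)
# to MiB so they can be compared with top's "Memory:" and "Pools:" lines.
PAGE_SIZE = 4096          # "bytes per page" as reported by vmstat -s
MIB = 1024 * 1024

# Illustrative labels for the vmstat -s counters quoted in the report.
vmstat_pages = {
    "managed": 8079588,
    "free": 58419,
    "active": 3733123,
    "inactive": 1789074,
    "wired": 22427,
    "kernel pool": 2817259,
    "cached file": 3376469,
    "cached executable": 140490,
}


def pages_to_mib(pages: int) -> float:
    """Convert a page count to mebibytes."""
    return pages * PAGE_SIZE / MIB


def summarize() -> dict:
    """Return each counter in MiB, rounded to the nearest MiB."""
    return {name: round(pages_to_mib(n)) for name, n in vmstat_pages.items()}


if __name__ == "__main__":
    for name, mib in summarize().items():
        print(f"{name:>18}: {mib:>7} MiB ({mib / 1024:.1f} GiB)")
```

Rounded this way, free, inactive, wired and cached executable come out at 228, 6989, 88 and 549 MiB, the same numbers top shows.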