Stephen Borrill writes:
> On Sat, 4 Nov 2023, Simon Burge wrote:
>> Greg Troxel wrote:
>>
>>> So to me this feels like a locking botch in a rare path in zfs.
>>
>> This appears to be the case. Chuck Silvers has some understanding of
>> the problem and I'm helping test, but at this stage there isn't a fix
>> available. :/
>
> It's interesting that you see the lockups during pkgsrc builds, i.e. a
> period where there is lots of file creation. We use zfs on backup
> systems that pull in data with rsync. During the initial runs (where
> every file is new) we usually get a couple of lockups, but during day
> to day operation (few changes) it is reliable. These are on physical
> and virtual machines running NetBSD 9 with the rule of thumb of 1GB
> RAM per TB of storage obeyed, but no patches besides setting MAXPHYS
> in the module to 32k for Xen.

I just had another problem, on the non-Xen 32GB machine (which has 3.5T
in the pool, only half full).

The machine wasn't really doing much: X running with xfce, a few xterms,
an ssh client, pidgin, and an idle firefox with, I think, 24 tabs.

Everything seemed mostly normal while I was using an ssh session, and
then I switched to the firefox virtual desktop, which failed to redraw.
I tried to kill firefox (because firefox hanging is not so strange :-()
and found that a few of the tabs appeared to be stuck in flt_noram and
zio_buf. I think there might have been a different wchan earlier that
was zfs-related but not zio_buf.

I think it got into this state due to firefox leaking memory (in SIZE
but not RES?).

(So it might be a missing wakeup on flt_noram, but a lock not being
released seems plausible too. Totally guessing here.)

(As I was composing this message (in tmux on another machine), the
firefox lockup deteriorated to more things hanging and then a total
lockup. I was unable to ctrl-alt-f1 to get back to the text console.
The machine still responds to mdns queries and pings, and sshd accepts
connections, but I see "local version string" and the "remote protocol"
line never appears.)

I should try LOCKDEBUG on the package building box (where it is much
more ok if it doesn't work right!).
10853   129 9817    0  85 0  2762264 180704 uvnfp2   DEl ttyp5   7:12.20 (firefox)
10853   994 9817    0   0 0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853  1867 9817    0  85 0  3423848 723944 uvnfp2   DEl ttyp5 146:52.32 (firefox)
10853  7407 9817    0  85 0 20169184 355160 flt_nora DEl ttyp5  63:48.76 (firefox)
10853  7630 9817    0   0 0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853  8451 9817    0  85 0  2712376 126588 uvnfp2   DEl ttyp5   7:09.93 (firefox)
10853  8504 9817    0  85 0  2744608 143008 uvnfp2   DEl ttyp5   9:56.45 (firefox)
10853  9817    1   21 117 0 12939188 948252 zio_buf_ DEl ttyp5 303:41.53 (firefox)
10853 11066 9817    0  85 0  2821832 225664 >db_     DEl ttyp5   1:34.01 (firefox)
10853 11769 9817    0  85 0  2849780 232172 uvnfp2   DEl ttyp5   9:19.27 (firefox)
10853 12055 9817    0  85 0  2832852 144304 uvnfp2   DEl ttyp5   8:49.22 (firefox)
10853 13075 9817    0  85 0  2782516 193652 plpg     DEl ttyp5   9:00.21 (firefox)
10853 15399 9817    0  85 0  2822236 249496 uvnfp2   DEl ttyp5  10:12.41 (firefox)
10853 15991 9817    0  85 0  2775316 187104 uvnfp2   DEl ttyp5   7:13.63 (firefox)
10853 16033 9817    0   0 0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 16877 9817    0  85 0  2731156 148896 uvnfp2   DEl ttyp5   1:59.22 (firefox)
10853 17275 9817    0   0 0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 19768 9817    0  85 0  2760188 152880 uvnfp2   DEl ttyp5   7:11.17 (firefox)
10853 21618 9817    0   0 0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 24342 9817    0  85 0  2737588 148452 uvnfp2   DEl ttyp5  11:51.61 (firefox)
10853 24956 9817    0  85 0  2981764 336852 uvnfp2   DEl ttyp5  20:20.13 (firefox)
10853 26368 9817    0  85 0  3164560 240992 uvnfp2   DEl ttyp5  19:28.72 (firefox)
10853 26981 9817 1123  85 0  3659088 770432 flt_nora DEl ttyp5  84:09.22 (firefox)
10853 27139 9817    0   0 0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 29076 9817 2270  85 0  2975552 261064 flt_nora DEl ttyp5  88:44.15 (firefox)
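
For anyone who wants a quick summary of where the firefox processes are
sleeping, here is a rough Python sketch that tallies the wait channels.
The (PID, WCHAN) pairs are transcribed by hand from the listing above;
the names PROCS and tally are made up for this illustration, and "-" is
what the zombie (STAT Z) entries show in the WCHAN column.

```python
from collections import Counter

# (PID, WCHAN) pairs copied by hand from the ps listing above.
# "-" is the WCHAN shown for zombie entries (STAT Z).
PROCS = [
    (129, "uvnfp2"), (994, "-"), (1867, "uvnfp2"), (7407, "flt_nora"),
    (7630, "-"), (8451, "uvnfp2"), (8504, "uvnfp2"), (9817, "zio_buf_"),
    (11066, ">db_"), (11769, "uvnfp2"), (12055, "uvnfp2"), (13075, "plpg"),
    (15399, "uvnfp2"), (15991, "uvnfp2"), (16033, "-"), (16877, "uvnfp2"),
    (17275, "-"), (19768, "uvnfp2"), (21618, "-"), (24342, "uvnfp2"),
    (24956, "uvnfp2"), (26368, "uvnfp2"), (26981, "flt_nora"),
    (27139, "-"), (29076, "flt_nora"),
]


def tally(procs):
    """Count how many processes are sleeping on each wait channel."""
    return Counter(wchan for _pid, wchan in procs)


if __name__ == "__main__":
    # Print the wait channels, most common first.
    for wchan, count in tally(PROCS).most_common():
        print(f"{count:3d}  {wchan}")
```

On this listing the tally puts 13 of the 25 entries in uvnfp2, three in
flt_nora, six zombies, and only the parent (PID 9817) in zio_buf_.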

top says:

Memory: 14G Act, 6989M Inact, 88M Wired, 549M Exec, 13G File, 228M Free
Swap: 40G Total, 348M Used, 40G Free / Pools: 11G Used

so it did get into paging.

vmstat -s:

4096 bytes per page
16 page colors
8079588 pages managed
58419 pages free
3733123 pages active
1789074 pages inactive
1 pages paging
22427 pages wired
1 reserve pagedaemon pages
40 reserve kernel pages
252503 boot kernel pages
2817259 kernel pool pages
2027769 anonymous pages
3376469 cached file pages
140490 cached executable pages
2048 minimum free pages
2730 target free pages
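
As a sanity check that the vmstat -s page counts and the top summary
describe the same state, the counts can be converted to MiB using the
4096-byte page size reported above. This is only a small Python sketch;
the dictionary keys and the helper name pages_to_mib are my own labels,
and the numbers are copied from the vmstat -s output.

```python
PAGE_SIZE = 4096  # "4096 bytes per page" from the vmstat -s output

# Page counts copied from the vmstat -s output above.
PAGES = {
    "free": 58419,
    "active": 3733123,
    "inactive": 1789074,
    "wired": 22427,
    "kernel pool": 2817259,
    "cached file": 3376469,
    "cached executable": 140490,
}


def pages_to_mib(count, page_size=PAGE_SIZE):
    """Convert a page count to MiB (1 MiB = 1024 * 1024 bytes)."""
    return count * page_size / (1024 * 1024)


if __name__ == "__main__":
    for name, count in PAGES.items():
        print(f"{name:18s} {pages_to_mib(count):10.1f} MiB")
```

The free, inactive, wired and executable counts come out at roughly
228, 6989, 88 and 549 MiB, which lines up with the Free, Inact, Wired
and Exec figures in the top output.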