Following up on zfs-related lockups from netbsd-users@: It's not just me. Multiple people have posted in previous replies that they see this.
I see processes in flt_noram5, and they remain there persistently after RAM becomes available. I see processes in what I think is zilog (zfs intent log?); ps and ddb unhelpfully truncate these fields. I see processes in zio_buf_? (again unhelpfully truncated). There are some processes in tstile, but as I understand it that just means they are waiting for the same thing something else is waiting for.

After reducing maxvnodes from 1600000 (the default value on a 32GB system) to 500000, lockups are less frequent.

Lockups are provoked by programs, in zfs, doing:

  - reading large amounts of data quickly
  - deleting large numbers of files quickly

I think therefore we have multiple problems:

  - zfs operations should block userland if resources are over threshold (or more than X% over, if there is some background cleanup intended to usually work without blocking)
  - there is a bug with missing wakeups or some other locking problem under memory pressure, that somehow only happens with zfs or pools. (I'm saying pools because zfs allocates massive amounts of pool storage, and that typically does not happen on non-zfs systems.)

Questions:

  - Is anyone seeing lockups on netbsd-10, other than ones they think they can pin on flaky hardware or accessing an odd device?
  - Is there a way in ddb to issue a wakeup on flt_noram5?
  - If I wanted to change the kernel to every so often (30s?) issue a wakeup to flt_noram5, where/how should I do this? Or, should there be something once/second that goes to the next process and wakes it up, as a debug option? Or, why am I wrong to want to do this?
  - Somehow, processes waiting on pools do not get woken up when presumably the pool code was waiting on RAM, and RAM becomes available. Or at least it seems that way. How is this supposed to work?
  - My belief is that even if zfs is piggy, the system should not lock up, and that absent bugs I would be complaining "zfs piggyness leads to paging out stuff and making the system slow" instead. Correct?
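
For anyone who wants to try the maxvnodes reduction described above, here is a minimal sketch. It assumes the knob in question is the kern.maxvnodes sysctl, and 500000 is only the value that made lockups less frequent on one 32GB system, not a tuned recommendation.

```shell
# Show the current vnode limit (assumption: "maxvnodes" above means kern.maxvnodes).
sysctl kern.maxvnodes

# Lower it at runtime; this needs root and does not survive a reboot.
sysctl -w kern.maxvnodes=500000
```

To keep the setting across reboots, the line kern.maxvnodes=500000 can go in /etc/sysctl.conf. This only reduces how often the lockups happen; it does not address the suspected missing wakeup.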