Thanks to riastradh@, this turned out to be caused by the combination of a
(UDP, hard) NFS mount, a mis-configured IPFilter that blocked all but the first
fragment of a fragmented NFS reply (e.g., readdir), and a NetBSD design error
(or so Taylor says): a vnode lock may be held across I/O, in this case,
network I/O.
It should be reproducible with a default NFS mount and a
block in all with frag-body
IPFilter rule and then trying to readdir.
Now, in some cases, the machine in question recovered after the filter rules
were fixed; in others, it didn't, forcing a reboot. This strikes me as a bug
because the same lock-up could just as well have been caused by network
problems instead of IPFilter mis-configuration.
It looks like the operation whose reply was lost sometimes doesn't get
retried. Do we have some weird bug where the arrival of the first fragment
stops the timeout but the blocking of the remaining fragments causes it to
wedge?