Edgar Fuß <[email protected]> writes:

> Thanks to riastradh@, this turned out to be caused by an (UDP, hard)
> NFS mount combined with a mis-configured IPFilter that blocked all but
> the first fragment of a fragmented NFS reply (e.g., readdir) combined
> with a NetBSD design error (or so Taylor says) that a vnode lock may
> be held across I/O, in this case, network I/O.

Holding a vnode lock across I/O seems like a bug to me too. Marking the
vnode as having an in-process operation so others can
lock/read/report-that-status/unlock seems ok. But I'm sure you already
know that vnode locking is hard.

> It looks like the operation to which the reply was lost sometimes
> doesn't get retried. Do we have some weird bug where the first
> fragment arriving stops the timeout but the blocking of the remaining
> fragments causes it to wedge?

Probably not. Fragments sit in the reassembly queue until there's a
complete packet, and then the packet is handed to the stack. So the NFS
code is almost certainly totally unaware of the arrival of the first
fragment.
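
To make the reassembly point concrete, here is a toy sketch in Python.
It is not NetBSD code: the Reassembler class, its method names and the
payloads are all made up for illustration, and it ignores reassembly
timeouts and overlapping fragments. It only shows that the upper layer
(here, the "delivered" list) sees nothing until every fragment of a
packet has arrived.

```python
from dataclasses import dataclass, field


@dataclass
class Reassembler:
    """Toy IP-style reassembly queue (hypothetical, for illustration only)."""
    delivered: list = field(default_factory=list)  # what the upper layer sees
    pending: dict = field(default_factory=dict)    # packet id -> {offset: (data, more)}

    def receive(self, pkt_id, offset, data, more_fragments):
        """Queue one fragment; deliver the packet only once it is complete."""
        frags = self.pending.setdefault(pkt_id, {})
        frags[offset] = (data, more_fragments)
        packet = self._try_assemble(frags)
        if packet is not None:
            del self.pending[pkt_id]
            self.delivered.append(packet)

    @staticmethod
    def _try_assemble(frags):
        """Return the full payload if fragments are contiguous from offset 0
        up to one with more_fragments == False, else None."""
        out = b""
        offset = 0
        while offset in frags:
            data, more = frags[offset]
            out += data
            if not more:
                return out
            if not data:
                return None  # empty non-final fragment: cannot make progress
            offset += len(data)
        return None


r = Reassembler()
r.receive(1, 0, b"readdir-", True)   # first fragment passes the filter
first_only = list(r.delivered)       # still empty: NFS has seen nothing
r.receive(1, 8, b"reply", False)     # remaining fragment arrives
complete = list(r.delivered)         # now holds the whole datagram
```

With the filter dropping everything after the first fragment, the second
receive() never happens, so from the NFS client's point of view the
reply was simply lost and its own retransmit timer is all that matters.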
