On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <[email protected]> wrote:
> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <[email protected]> wrote:
> >
> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <[email protected]> wrote:
> >>
> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
> >> triggered by HDD timeouts.  The timeouts are probably caused by
> >> genuine hardware faults, but they didn't lead to deadlocks in
> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
> >> additional information.  ZFS's stack traces aren't very informative,
> >> and dmesg doesn't show anything besides the usual information about
> >> the disk timeout.  I don't see anything obviously related in the
> >> commit history for that time range, either.
> >>
> >> Has anybody else observed this phenomenon?  Or does anybody have a
> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
> >> inject an error, but not a timeout.  If it did, then I could bisect
> >> the problem.  As it is I can only reproduce it on production servers.
> >
> >
> > What SIM? Timeouts are tricky because they have many sources, some of
> > which are nonlocal...
> >
> > Warner
>
> mpr(4)
>

Is this just a single drive that's acting up, or is the controller
initialized as part of the error recovery?

If a single drive, are there multiple timeouts that happen at the same
time such that we timeout a request while we're waiting for the abort
command we send to the firmware to be acknowledged?

Would you be able to run a kgdb script to see if you're hitting a
situation that I fixed in mpr that would cause I/O to never complete in
this rather odd circumstance? If you can, and if it is, then there's a
change I can MFC :).

Warner
