Some time back, I recall seeing people talking on the lists about some sort of discouragingly common issue with processes getting stuck in tstile waits. (I've tried to scare up the relevant list mail on mail-archive.netbsd.org, so far to no avail.)
I can't recall how long ago that was, nor what version(s) it was under.  But I recently started having one of my machines do something very similar: occasionally (under circumstances largely uncharacterized so far) the machine wedges, with most of userland stuck in apparently permanent tstile waits.  ("Apparently permanent" = I've left the machine alone for as much as multiple hours, without effect.)  The machine in question is still running my 5.2 derivative.

Anything that doesn't leave the kernel still works.  Or, let me be precise: one of the machine's primary functions is packet routing, and that still works, provided it's forwarding between real-hardware interfaces; it is inference on my part that other purely-in-kernel things would work.  But, because the problem seems to be userland processes getting stuck waiting for locks, it would not surprise me for pure-kernel things like packet forwarding to keep working.

Debugging this has been slow, because it wedges only occasionally.  I once had it happen twice in the same day, but, based on unscientific feel, I would say the expected MTBF is about a week.  That really slows down the edit-build-test-debug cycle.

I set the machine up with a serial console; breaking into ddb works (that's how I could tell what the processes were blocked on).  I added debugging code, suitable for calling from ddb, which simply dumps out the state of the turnstiles (a rough sketch of the sort of thing it does appears below, after the process listings).  The machine just recently wedged again, and the results are puzzling enough that I wanted to run them past anyone here with the leisure and inclination to offer suggestions, whether based simply on what I've found or on memories of the "dreaded" tstile issue from the past.

I captured the dump with the machine on the other end of the console serial line.  Between the ps output and my debugging output, I can then track down which process is blocked waiting on which other process.  In today's hang, I found:

Many processes blocked on
    7599   1 3 3  4  d3e682a0           nc  tstile
which is blocked on
    13479  1 3 3  4  d1383a00        multi  tstile
which is blocked on
    557    1 3 0 84  d2251500  mailwrapper  puffsrpl

Many processes blocked on
    17440  1 3 1  4  d1e38a20    bozohttpd  tstile
which is blocked on
    10493  1 3 1  4  d68ef800    bozohttpd  tstile
which is blocked on
    17512  1 3 2  4  d1f3bce0    xferwatch  tstile
which is blocked on
    3985   1 3 1 84  d1c3ba80    bozohttpd  puffsrpl

This "explains" why this is a relatively new thing; this machine has been using a puffs filesystem for only a month and a half or so (since about May 9th).  So I went looking for the puffs backing process in the ps listing, fully expecting to find it stuck in tstile.  I didn't:

    8169   1 3 3 84  d286e840        gitfs  puffsget
    23930  1 3 1 84  d5a4f0e0        gitfs  piperd
    21881  1 3 1 84  d1c3b800        gitfs  select

gitfs is something I wrote that uses puffs to provide a filesystem view of git repos.  There was also

    23176  1 3 1 84  d1f3b7e0          git  puffsrpl

which surprised me; I would not expect a git process to have anything to do with anything under the puffs mount point, and thus it should have no reason to wait on puffsrpl.  But then, I wouldn't expect mailwrapper to, either - the puffs filesystem forms part of my /export/ftp area, which is usually not touched by anything but bozohttpd and ftpd.  It's not a question of trying to page out a dirty page, either (that being the only plausible reason that comes to mind for arm's-length processes to be accessing the puffs filesystem); the puffs mount point is mounted read-only (and synchronous, noexec, nodev, union, though I wouldn't expect those to matter).
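Since I mentioned the debugging code: for concreteness, here is a rough sketch of the sort of ddb-callable dump I mean.  It is not my actual code, and it leans on assumptions about 5.x-era internals (alllwp/l_list, l_wchan, l_wmesg, and in particular that the wait channel of a tstile sleep is the blocking lock itself and that that lock is an adaptive kmutex, so mutex_owner() can name the holder; an rwlock would need different owner extraction):

    /* Sketch only; symbol and field names assume a 5.x-era kernel. */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>
    #include <sys/lwp.h>
    #include <sys/mutex.h>

    void tstile_dump(void);         /* from ddb: "call tstile_dump" */

    void
    tstile_dump(void)
    {
            struct lwp *l, *owner;

            LIST_FOREACH(l, &alllwp, l_list) {
                    if (l->l_wmesg == NULL ||
                        strcmp(l->l_wmesg, "tstile") != 0)
                            continue;
                    /* Assumption: the wchan is the blocking kmutex. */
                    owner = mutex_owner((kmutex_t *)__UNCONST(l->l_wchan));
                    printf("pid %d lwp %d (%s) waits on %p",
                        l->l_proc->p_pid, l->l_lid, l->l_proc->p_comm,
                        l->l_wchan);
                    if (owner != NULL)
                            printf(", held by pid %d (%s)",
                                owner->l_proc->p_pid, owner->l_proc->p_comm);
                    printf("\n");
            }
    }

What I actually dump is the turnstiles themselves, but the useful information comes to the same thing: who waits on what, and who holds it.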
The gitfs process does fork git subprocesses under some circumstances; a filesystem access that triggers that normally hangs until the git subprocess finishes.  I don't know whether process 23176 was forked by gitfs or not; if it was, that could in theory have produced the deadlock (a sketch of the sort of cycle I mean is at the end of this message).

However, there is a second data point.  I have another machine, not exposed to the world, which I was doing some gitfs development on.  I also ran, on that machine, some small games which I then displayed over X connections to my desktop machine.  On a few occasions, when I was doing nothing but playing one of those games, the game process would wedge in a tstile wait partway through a simple motion animation.  While I wasn't keeping careful records (it was only today that I had any reason to think puffs was involved), I think there was no puffs mountpoint active at least some of those times, and, even if there was, it most certainly wasn't being actively accessed and thus wasn't forking git subprocesses.

gitfs uses puffs, but not libpuffs - I can talk about why, if anyone cares, but it would be difficult to keep it from veering into a rant, and I see little point.

I'm going to try to come up with a way to capture additional useful information, but I would welcome any thoughts anyone may have, even if only of the "I suspect you might want to look at the $THING" sort.
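Finally, to make the self-deadlock hypothesis above concrete, here is a minimal sketch of how a puffs file server that forks helpers can wedge itself.  This is purely illustrative - hypothetical code, not gitfs's actual request handling - and the only point is the cycle:

    #include <sys/types.h>
    #include <sys/wait.h>

    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Hypothetical request handler in a puffs file server: serve a
     * read by shelling out to git and waiting for it to finish.
     */
    static void
    serve_by_forking_git(const char *objname)
    {
            pid_t pid;
            int status;

            pid = fork();
            if (pid == -1)
                    return;         /* error handling elided */
            if (pid == 0) {
                    /*
                     * Child: runs git.  If its cwd, GIT_DIR, or any path
                     * it opens resolves to something under the puffs
                     * mount point, that open() becomes a new puffs
                     * request which only the parent (the file server)
                     * can service, so the child parks in puffsrpl.
                     */
                    execlp("git", "git", "cat-file", "blob", objname,
                        (char *)NULL);
                    _exit(127);
            }
            /*
             * Parent: the file server.  It doesn't go back to servicing
             * puffs requests until the child exits - but the child can't
             * exit until its own request is serviced.  That's the cycle;
             * every other process that then touches a vnode locked by
             * one of the stuck operations piles up in tstile behind it.
             */
            while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
                    continue;
    }

Whether process 23176 actually got into that state, I don't know - and, as described above, the hangs on the second machine don't look explainable this way at all.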