Re: Anyone recall the dreaded tstile issue?

2022-07-22 Thread Patrick Welche
On Fri, Jul 22, 2022 at 06:36:19PM +0700, Robert Elz wrote:
> Date:        Fri, 22 Jul 2022 11:24:46 +0100
> From:        Patrick Welche 
> Message-ID:  
> 
>   | Having not seen the dreaded turnstile issue in ages, a NetBSD-9.99.99/amd64
>   | got stuck on shutdown last night with:
> 
> How long did you wait?

Rewind... the tstile and the 100% pgdaemon cpu use happened long before
the shutdown. I am used to waiting a long time for swap to clear on
shutdown after a pbulk run to /tmp. This was different.

Cheers,

Patrick


Re: Anyone recall the dreaded tstile issue?

2022-07-22 Thread Robert Elz
Date:        Fri, 22 Jul 2022 11:24:46 +0100
From:        Patrick Welche 
Message-ID:  

  | Having not seen the dreaded turnstile issue in ages, a NetBSD-9.99.99/amd64
  | got stuck on shutdown last night with:

How long did you wait?

I have seen situations where it takes 10-15 mins to sync everything to
drives (I have plenty of RAM available) - which I think is another
problem - but not this one.

It isn't the case that every time we find something "stuck" on a tstile
wait the system is broken - they're just locks, and sometimes processes
are going to need to wait for one.

In the kind of scenario described, things like sync and halt will need
to wait for all the filesystems to be flushed - if that's going to take
a long time (which it really shouldn't, but that's a different issue)
then it is going to take a long time.

The other day I managed to crash my system (my fault, though what
I did - yanking a USB drive mid-write - shouldn't really cause a crash,
just mangled data) in the middle of the afternoon.   It rebooted easily
enough, wapbl replaying logs kept all the filesystems safe enough (I think
the drive I pulled needs a little more attention, but that's a different
problem) but then I discovered that files I had written about 02:00 in the
morning (more than 12 hours earlier) were all full of zeroes - the data
must have been sitting in RAM all that time, and nothing had bothered to
send it to the drive.   That's not good ...   We also seem to no longer
have the ancient update(8) which used to issue a sync every 30 secs, to
attempt to minimize this kind of problem.
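
For reference, update(8) amounted to little more than the loop below - a
from-memory sketch of the historical behaviour, not the actual sources:

#include <unistd.h>

/*
 * Historical update(8), roughly: ask the kernel to schedule all dirty
 * buffers for write-out every 30 seconds, forever.
 */
int
main(void)
{
        for (;;) {
                sync();         /* schedule dirty buffers for write-out */
                sleep(30);
        }
        /* NOTREACHED */
}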

kre



Re: Anyone recall the dreaded tstile issue?

2022-07-22 Thread Patrick Welche
Having not seen the dreaded turnstile issue in ages, a NetBSD-9.99.99/amd64
got stuck on shutdown last night with:

db{0}> ps
PID    LID  S CPU FLAGS   STRUCT LWP *    NAME     WAIT
4761  4761  3   0     0   83a478576940    sync     tstile
5499  5499  3   1    40   83a486cb2740    halt     tstile
5262  5262  3   0   180   83a4862ecac0    pickup   kqueue
4060  4060  3   0    40   83a486cb2300    halt     tstile
...

Today's very unusual use pattern was that I created several 5GB
files in its large tmpfs /tmp and then moved them to a ZFS partition.

As I recollect, I then did a git log in github/netbsd/src, and nothing
happened. top showed pagedaemon at 100%. ^C in the git xterm eventually
returned, so no more git process.

From memory, doing a bt/a on pgdaemon's *lwp just showed
uvm_pageout
uvm_availmem

Things carried on for a while, and a subsequent git log worked, but
then mutt got stuck in tstile. AFAIR it was the only process in
tstile. I then shutdown -p now. Wondered why not much was happening.
Repeated the shutdown in the same xterm which was still responsive.
ddb then showed me the above. Eventually I hit the power-off switch.

So for me, no PUFFS was involved, but git, tmpfs, ZFS were.


Cheers,

Patrick


Re: Anyone recall the dreaded tstile issue?

2022-07-20 Thread Koning, Paul



> On Jul 19, 2022, at 6:45 PM, Mouse  wrote:
> 
>>> [...kernel coredump...kernel stack traces...]
>>> [...dig up a way to get userland stack traces...] 
> 
>> I did a pile of work on GDB, 10 or more years ago, to add that
>> capability for the non-standard system coredumps we used in
>> EqualLogic.  It's not a simple change, if you want it to be fairly
>> automatic.  Part of it means looking into user mode memory, but
>> another part is loading all the right symbol tables with the right
>> relocations.
> 
> I'd be satisfied with a no-symbols stack trace, the kind of thing you
> could get from a (userland) coredump without the matching binary.  If I
> didn't already have a fair idea what the problem is, I'd be trying to
> add a way to do that.

Good point, though I've found that no-symbols output is painful when shared
libraries are involved, because it makes it very hard to figure out what a
given address actually translates to.

>> Unfortunately I'm not in a position to dig up that code, adapt it to
>> the standard NetBSD kernel dumps, and contribute it.  I could ask for
>> approval to contribute my changes as-is for some interested person to
>> adapt.  If that's interesting, please let me know and I can pursue
>> it.
> 
> I probably am not in a position to do that work at all, and certainly
> not to do it in a way that would be useful to NetBSD.
> 
> I don't know whether anyone else is.  But it does occur to me that
> having the underlying code available would make it more likely that
> someone would pick it up in the future.

That makes sense.  Ok, I'll bring up the question with management.

paul




Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Mouse
>> [...kernel coredump...kernel stack traces...]
>> [...dig up a way to get userland stack traces...] 

> I did a pile of work on GDB, 10 or more years ago, to add that
> capability for the non-standard system coredumps we used in
> EqualLogic.  It's not a simple change, if you want it to be fairly
> automatic.  Part of it means looking into user mode memory, but
> another part is loading all the right symbol tables with the right
> relocations.

I'd be satisfied with a no-symbols stack trace, the kind of thing you
could get from a (userland) coredump without the matching binary.  If I
didn't already have a fair idea what the problem is, I'd be trying to
add a way to do that.

> Unfortunately I'm not in a position to dig up that code, adapt it to
> the standard NetBSD kernel dumps, and contribute it.  I could ask for
> approval to contribute my changes as-is for some interested person to
> adapt.  If that's interesting, please let me know and I can pursue
> it.

I probably am not in a position to do that work at all, and certainly
not to do it in a way that would be useful to NetBSD.

I don't know whether anyone else is.  But it does occur to me that
having the underlying code available would make it more likely that
someone would pick it up in the future.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Koning, Paul



> On Jul 19, 2022, at 9:57 AM, Mouse  wrote:
> 
> I've been exchanging email off-list about this with a few people.  One
> of them remarked that a kernel coredump would help.
> 
> Yesterday it wedged again.  I got a kernel coredump...and, well, as I
> put it in off-list mail:
> 
>>> I now realize I don't know how to coax [process stack traces] out of
>>> a kernel core.  I don't recall hearing of any sort of postmortem
>>> ddb.  I have the corresponding netbsd.gdb, and I found gdb's
>>> target kvm, but I haven't managed to get a stack trace for any
>>> process out of it.
> 
> The response turned out to be exactly the cluesticking I needed to get
> stack traces.

I did a pile of work on GDB, 10 or more years ago, to add that
capability for the non-standard system coredumps we used in EqualLogic.
It's not a simple change, if you want it to be fairly automatic.  Part
of it means looking into user mode memory, but another part is loading
all the right symbol tables with the right relocations.

Unfortunately I'm not in a position to dig up that code, adapt it to
the standard NetBSD kernel dumps, and contribute it.  I could ask for
approval to contribute my changes as-is for some interested person to
adapt.  If that's interesting, please let me know and I can pursue it.

paul




Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Mouse
>> I have found enough time to have a quick look at the code, and I've
>> found a code path that looks like what I was looking for: something
>> that blocks until a git subprocess finishes without handling
>> filesystem requests in the meantime.  I'll need to reorganize the
>> code a little to fix that.

> Doesn't this problem need a deeper fix?  It is annoying that a poorly
> programmed userland process can kill the system.

No more so than a puffs backing process that mounts the filesystem and
then goes into an infinite loop.  This sort of failure mode is one of
the risks of userland-backed filesystems.  It just took me a while to
track it down, largely for complete lack of prior experience with
anything of the sort.

Or consider while (1) { fork(); }.

Also, and not directly related to your remark, I now suspect that if
I'd told ddb to kill the gitfs process(es), the system would have come
unstuck.

I don't see this as fundamentally different from (though admittedly less
severe than) dd if=/dev/urandom of=/dev/mem: if root does something
sloppy, Bad Things can happen.  I think you will have trouble
completely preventing userland from wedging or crashing the system
while still preserving a useful system.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Taylor R Campbell
> Date: Tue, 19 Jul 2022 19:15:54 +
> From: Emmanuel Dreyfus 
> 
> On Tue, Jul 19, 2022 at 01:40:34PM -0400, Mouse wrote:
> > I have found enough time to have a quick look at the code, and I've
> > found a code path that looks like what I was looking for: something
> > that blocks until a git subprocess finishes without handling filesystem
> > requests in the meantime.  I'll need to reorganize the code a little to
> > fix that.
> 
> Doesn't this problem need a deeper fix? It is annoying that a poorly
> programmed userland process can kill the system.

This is why userspace file systems can be risky, even if convenient or
safer in certain dimensions.  Comprehensively eliminating all the
potential deadlocks is difficult.


Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Emmanuel Dreyfus
On Tue, Jul 19, 2022 at 01:40:34PM -0400, Mouse wrote:
> I have found enough time to have a quick look at the code, and I've
> found a code path that looks like what I was looking for: something
> that blocks until a git subprocess finishes without handling filesystem
> requests in the meantime.  I'll need to reorganize the code a little to
> fix that.

Doesn't this problem need a deeper fix? It is annoying that a poorly
programmed userland process can kill the system.

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Mouse
> Is your git process multithreaded, or just forking in the traditional
> manner?

The latter.  I don't thread, at least not in C.

> To check for this, I'd check to see if your process is linked against
> libpthread, just in case something is spawning threads without your
> knowledge.

I can't find any trace of libpthread anywhere in the build procedure
(one .a library of mine is the only explicit library), and strings - on
the executable, piped into egrep thread, finds no hits.

I have found enough time to have a quick look at the code, and I've
found a code path that looks like what I was looking for: something
that blocks until a git subprocess finishes without handling filesystem
requests in the meantime.  I'll need to reorganize the code a little to
fix that.
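
Roughly, the hazardous shape of that code path is the following - invented
names, not the actual gitfs source, just the pattern:

#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration only.  A single-threaded userland-filesystem server that,
 * while handling one request, forks a helper ("git") and then blocks in
 * waitpid() until it exits.  While blocked here it serves no other
 * filesystem requests, so anything that ends up needing the mount in the
 * meantime piles up behind this one request.
 */
static int
run_git_blocking(char *const argv[])
{
        pid_t pid;
        int status;

        switch (pid = fork()) {
        case -1:
                return -1;
        case 0:
                execvp(argv[0], argv);
                _exit(127);
        default:
                break;
        }

        if (waitpid(pid, &status, 0) == -1)     /* blocks the whole server */
                return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

One way to reorganize it is to not block there: keep servicing requests and
reap the child from the main event loop instead (waitpid() with WNOHANG, or
watching the child's output descriptor from the same poll()/select() loop
that watches the puffs descriptor).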

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Brian Buhrow
hello.  Is your git process multithreaded, or just forking in the
traditional manner?  If it's multithreaded, then you'll want stack traces
for each thread.  To check for this, I'd check to see if your process is
linked against libpthread, just in case something is spawning threads
without your knowledge.

-thanks
-Brian



Re: Anyone recall the dreaded tstile issue?

2022-07-19 Thread Mouse
I've been exchanging email off-list about this with a few people.  One
of them remarked that a kernel coredump would help.

Yesterday it wedged again.  I got a kernel coredump...and, well, as I
put it in off-list mail:

>> I now realize I don't know how to coax [process stack traces] out of
>> a kernel core.  I don't recall hearing of any sort of postmortem
>> ddb.  I have the corresponding netbsd.gdb, and I found gdb's
>> target kvm, but I haven't managed to get a stack trace for any
>> process out of it.

The response turned out to be exactly the cluesticking I needed to get
stack traces.

I've now got (kernel) stack traces.  They explain very neatly how
unrelated processes end up in puffsrpl - it's the vnode version of the
memory-pressure theory I mentioned (as implausible) upthread:

#0  0xc04b7beb in mi_switch ()
#1  0xc04b3dbb in sleepq_block ()
#2  0xc048eb0f in cv_wait_sig ()
#3  0xc038b3ea in puffs_msg_wait ()
#4  0xc038b547 in puffs_msg_wait2 ()
#5  0xc038ff40 in puffs_vnop_inactive ()
#6  0xc05281f8 in VOP_INACTIVE ()
#7  0xc051b7bc in vclean ()
#8  0xc051d36a in getcleanvnode ()
#9  0xc051d52e in getnewvnode ()
#10 0xc0404aa3 in ffs_vget ()
#11 0xc03f3a45 in ffs_valloc ()
#12 0xc042f052 in ufs_makeinode ()
#13 0xc04309fa in ufs_create ()
#14 0xc05290af in VOP_CREATE ()
#15 0xc0525df2 in vn_open ()
#16 0xc0521d44 in sys_open ()
#17 0xc05a9fcf in syscall ()
#18 0xc010058e in syscall1 ()

#0  0xc04b7beb in mi_switch ()
#1  0xc04b3dbb in sleepq_block ()
#2  0xc048eb0f in cv_wait_sig ()
#3  0xc038b3ea in puffs_msg_wait ()
#4  0xc038b547 in puffs_msg_wait2 ()
#5  0xc038ff40 in puffs_vnop_inactive ()
#6  0xc05281f8 in VOP_INACTIVE ()
#7  0xc051b7bc in vclean ()
#8  0xc051d36a in getcleanvnode ()
#9  0xc051d52e in getnewvnode ()
#10 0xc0404aa3 in ffs_vget ()
#11 0xc03f3a45 in ffs_valloc ()
#12 0xc042f052 in ufs_makeinode ()
#13 0xc04309fa in ufs_create ()
#14 0xc05290af in VOP_CREATE ()
#15 0xc0525df2 in vn_open ()
#16 0xc0521d44 in sys_open ()
#17 0xc05a9fcf in syscall ()
#18 0xc010058e in syscall1 ()

#0  0xc04b7beb in mi_switch ()
#1  0xc04b3dbb in sleepq_block ()
#2  0xc048eb0f in cv_wait_sig ()
#3  0xc038b3ea in puffs_msg_wait ()
#4  0xc038b547 in puffs_msg_wait2 ()
#5  0xc038ff40 in puffs_vnop_inactive ()
#6  0xc05281f8 in VOP_INACTIVE ()
#7  0xc051b7bc in vclean ()
#8  0xc051d36a in getcleanvnode ()
#9  0xc051d52e in getnewvnode ()
#10 0xc0404aa3 in ffs_vget ()
#11 0xc042d59b in ufs_lookup ()
#12 0xc052917c in VOP_LOOKUP ()
#13 0xc0516ddb in lookup ()
#14 0xc05175c5 in namei ()
#15 0xc05205a6 in sys_access ()
#16 0xc05a9fcf in syscall ()
#17 0xc010058e in syscall1 ()

(Arguments are not shown because I made a stupid mistake; I did not
have a netbsd.gdb available.  But the above traces are informative
enough, to me.)

There was a git process, it was a child of the main gitfs process, and
it was in puffsrpl (it's the last of the above stack traces).

So my best-guess theory now is that I have a codepath somewhere in
gitfs that forks git and waits for it to finish _without_ processing
other puffs requests while waiting.  There should be no such, but I
can't explain this any other way.  The gitfs process is blocked in
select, but that's exactly what I'd expect.

I now would like _userland_ stack traces.  The kernel stack trace for
the main gitfs process is exactly what I'd expect

#0  0xc04b7beb in mi_switch ()
#1  0xc04b3dbb in sleepq_block ()
#2  0xc04e60ed in pollcommon ()
#3  0xc04e639f in sys_poll ()
#4  0xc05a9fcf in syscall ()
#5  0xc010058e in syscall1 ()

but waiting for git to finish could very well be in poll() waiting for
git to print output.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-17 Thread Mouse
> As I remember, and the web can probably confirm, running lockdebug
> under 5.x doesn't work at all!

Well, it works in that a kernel builds, boots, and runs well enough
that I haven't noticed any issues yet.  (amd64 and i386; I haven't
tried others yet.)

Whether it will record useful information in case of another hang, that
is another story.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-17 Thread Brian Buhrow
Hello.   As I remember, and the web can probably confirm, running
lockdebug under 5.x doesn't work at all!
I think you'll find a question on this very point from me some years ago in
our archives.
-thanks
-Brian



re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread matthew green
> I got two off-list mails suggesting LOCKDEBUG.  I tried that today -
> I added LOCKDEBUG and VNODE_LOCKDEBUG both (the latter I found in the
> ALL config when grepping for LOCKDEBUG).
>
> That kernel panicked promptly on boot
>
> panic: vop_read: vp: locked 0, expected 1

just use LOCKDEBUG for now.

this is VNODE_LOCKDEBUG and it might be buggy since no one
uses it, but LOCKDEBUG is quite common.


.mrg.


Re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread Mouse
I got two off-list mails suggesting LOCKDEBUG.  I tried that today -
I added LOCKDEBUG and VNODE_LOCKDEBUG both (the latter I found in the
ALL config when grepping for LOCKDEBUG).

That kernel panicked promptly on boot

panic: vop_read: vp: locked 0, expected 1

with a stack trace that goes breakpoint, panic, VOP_READ, vn_rdwr,
check_exec, execve1, sys_execve, start_init.  I haven't yet dug any
deeper; I don't know whether this is part of my problem (probably not,
I would guess, because of the apparent puffs connection), a bug I
introduced and just never noticed before, or something broken with one
(or both!) of those options.  I'll be investigating to figure out which
of those it is, but, for the moment, LOCKDEBUG is off the table. :(

I'm also going to be using a different machine for at least some of my
testing; rebooting this machine this much is...inconvenient.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread Mouse
One other thing of note.

I just now looked at the record I captured of the last hang.  I looked
for other processes waiting on puffsrpl, on the theory that if there
are two unexpected such, there may be others.

There are:

  557     1  3   0    84   d2251500    mailwrapper   puffsrpl
 3985     1  3   1    84   d1c3ba80    bozohttpd     puffsrpl
19816     1  3   0    84   d23d9360    smtpd         puffsrpl
  557     1  3   0    84   d2251500    mailwrapper   puffsrpl
 3985     1  3   1    84   d1c3ba80    bozohttpd     puffsrpl
23619     1  3   1    84   d68ef080    bozohttpd     puffsrpl
 3698     1  3   0    84   d2eafca0    as            puffsrpl
23176     1  3   1    84   d1f3b7e0    git           puffsrpl
 7871     1  3   1    84   d1e382a0    bozohttpd     puffsrpl

bozohttpd, those do not surprise me.  At least half the hits I get
these days are fetching from the puffs mount (it's
/export/ftp/pub/mouse/git-unpacked; /export/ftp is also the root
directory used by bozohttpd.)

The git and one of the mailwrappers were at the end of the tstile wait
chains I found.  That leaves the other mailwrapper, the smtpd, and the
as.

None of those five would I expect to be going anywhere near the puffs
mount point.  That an as is running does not surprise me; I had a
build-of-the-world going when this happened.  But it shouldn't've been
touching anything outside of /usr/src, the build-into directories, and
/tmp (or moral equivalent - /usr/tmp, or /var/tmp, or some such).

I'm going to have to get stack traces next hang.  I've now set up a
dump partition, so I have a fighting chance of getting a kernel dump.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread Mouse
Oh, and one thing I don't think I've said yet:

Thank you very much, everyone who's even thought about this issue!
I've seen at least two off-list emails already in addition to the
on-list ones.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread Mouse
>> My best guess at the moment is that there is a deadlock loop where
>> something tries to touch the puffs filesystem, the user process
>> forks a child as part of that operation, and the child gets locked
>> up trying to access the puffs filesystem.
> That is possible, [...]

> A more common case, I believe, is [...failing to unlock in an error
> path...]

> The function [...] which causes the problem is no longer active, no
> amount of stack tracing will find it.  The process which called it
> might not even still exist, it might have received the error return,
> and exited.

I find the notion of a nonexistent process holding a lock disturbing,
but of course that's just a human-layer issue.

> Finding this kind of thing requires very careful and thorough code
> reading, analysing every lock, and making sure that lock gets
> released, somewhere, on every possible path after it is taken.

Well...if I wanted to debug that, I would probably grow each lock by
some kind of indication (a PC value, or __FILE__ and __LINE__) of where
it was last locked.  Then, once the culprit lock is found ...
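
Concretely, growing each lock could look something like this - invented
names, not NetBSD's actual LOCKDEBUG machinery, assuming the kernel kmutex
API:

#include <sys/mutex.h>

/*
 * Sketch only: wrap the lock in a structure that remembers where it was
 * last taken, so a wedged lock found post-mortem points back at its last
 * acquirer.
 */
struct dbg_mutex {
        kmutex_t    dm_mtx;
        const char *dm_file;    /* __FILE__ of the last acquirer */
        int         dm_line;    /* __LINE__ of the last acquirer */
};

#define DBG_MUTEX_ENTER(dm) do {                \
        mutex_enter(&(dm)->dm_mtx);             \
        (dm)->dm_file = __FILE__;               \
        (dm)->dm_line = __LINE__;               \
} while (0)

#define DBG_MUTEX_EXIT(dm) do {                 \
        (dm)->dm_file = NULL;                   \
        (dm)->dm_line = 0;                      \
        mutex_exit(&(dm)->dm_mtx);              \
} while (0)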

> The best you can really hope for from examining the wedged system is
> to find which lock is (usually "might be") the instigator of it all.
> That can help narrow the focus of code investigation.

At the moment, I'd be happy with that much.  But the system has only
the one puffs mount, which is off under /export/ftp, not anywhere that
is, for example, expected to be on anyone's $PATH, and all the
X-waiting-on-Y-waiting-on-Z chains end up with someone waiting on
puffsrpl.  And the puffs userland processes show no indication of being
stuck in the "holding a lock that's not getting released" sense.  So
there probably is nothing here that could be caught by, for example,
in-kernel deadlock detection.

I am basically certain it has _something_ to do with the puffs
filesystem, because of the puffsrpl waits and because it started
happening shortly after I added the puffs mount.  The real puzzle, for
me, in this latest hang is why/how the mailwrapper and git processes
ended up waiting for puffsrpl.  I will allocate a piece of disk for a
kernel coredump, so I can do detailed post-mortem on a wedged system.
(The machine's main function is to forward packets; I can't really keep
it in ddb for hours while I pore over details of a lockup.)

I will also add timeouts in the puffs userland code, so that if a
forked git process takes too long, it is nuked, with the access that
led to it returning an error - and, of course, logging all over the
place.
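
Something along these lines, say - an invented helper, not actual gitfs
code, and the real thing would be driven from the event loop rather than
sleeping inline:

#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Give a forked helper a deadline; if it blows past it, kill it and let
 * the filesystem request that spawned it fail (with, say, EIO) instead
 * of hanging forever.
 */
static int
wait_with_deadline(pid_t pid, int deadline_secs, int *status)
{
        pid_t r;
        int waited;

        for (waited = 0; waited < deadline_secs; waited++) {
                r = waitpid(pid, status, WNOHANG);
                if (r == pid)
                        return 0;       /* child exited in time */
                if (r == -1)
                        return -1;      /* waitpid error */
                sleep(1);               /* still running; check again */
        }
        kill(pid, SIGKILL);             /* too slow: nuke it and reap it */
        (void)waitpid(pid, status, 0);
        return -1;
}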

I'm also going to change ddb's ps listing to include PPID; in this last
hang, I would have liked to have known whether the git process were a
child of the gitfs process.

I will also take that "other system" I mentioned, make a puffs mount,
and then start playing that game.  If I can get it to tstile in a
reasonable time frame, it will greatly accelerate debugging this.

> Mouse, start with the code you added ... make sure there are no
> problems like this buried in it somewhere (your own code, and
> everything it calls).

I haven't touched the puffs kernel code.  Of course, that doesn't mean
it doesn't have any such issues, but it makes it seem less likely to
me.  While it doesn't rule out such problems in any of my other
changes, it makes them less likely too; it would have to be a bug that
remained latent until I started using puffs.

> If that ends up finding nothing, then the best course of action might
> be to use a fairly new kernel.

Possibly, but unless a new kernel can be built with 5.2's compiler, I
run right back into the licensing issue that's the reason I froze at
5.2 to begin with.  I'd also have to port at least a few of my kernel
changes to the new kernel.

I may have to resort to that, but I'd much rather avoid it; even if the
licensing turns out to not be an issue, it would be a lot of work.

> I haven't seen a tstile lockup in ages, [...]

I never saw them at all until I started playing with puffs.  The major
reason I'm reluctant to suspect your "lock held by a nonexistent
process" theory (presumably with the culprit somewhere puffs-related)
here is those two processes waiting on puffsrpl which I would not
expect to be touching the puffs mountpoint at all.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread Robert Elz
Date:        Sat, 16 Jul 2022 00:48:59 -0400 (EDT)
From:        Mouse 
Message-ID:  <202207160448.aaa09...@stone.rodents-montreal.org>

  | That's what I was trying to do with my looking at "X is tstiled waiting
  | for Y, who is tstiled waiting for Z, who is..." and looking at the
  | non-tstiled process(es) at the ends of those chains.

That can sometimes help, but this is a difficult issue to debug, as
often the offender is long gone before anyone notices.

  | My best guess at the moment is that there is a deadlock loop where
  | something tries to touch the puffs filesystem, the user process forks a
  | child as part of that operation, and the child gets locked up trying to
  | access the puffs filesystem.

That is possible, as is the case where locking is carried out
improperly (I lock a then try to lock b, you lock b then try to
lock a) - but those are the easier cases to find.

A more common case, I believe, is

func()
{
        lock(something);
        /*
         * do some work
         */
        if (test for something strange) {
                /*
                 * this should not happen
                 */
                return EINVAL;
        }
        /*
         * more stuff
         */
        unlock(something);
        return answer;
}

where I am sure you can see what's missing in this short segment ...  real
code is typically much messier, and the locks not always that explicit,
they can be acquired/released as side effects of other function calls.

The function (func here) which causes the problem is no longer
active; no amount of stack tracing will find it.  The process
which called it might not even still exist, it might have
received the error return, and exited.

Finding this kind of thing requires very careful and thorough
code reading, analysing every lock, and making sure that lock
gets released, somewhere, on every possible path after it is taken.
The best you can really hope for from examining the wedged system
is to find which lock is (usually "might be") the instigator of it all.
That can help narrow the focus of code investigation.

Mouse, start with the code you added ... make sure there are
no problems like this buried in it somewhere (your own code, and
everything it calls).   If that ends up finding nothing, then
the best course of action might be to use a fairly new kernel.
Some very good people (none of whom is me, so I can lavish praise)
have done some very good work in fixing most of the issues we
used to have.  I haven't seen a tstile lockup in ages, and I used
to see them quite often (fortunately mostly ones that affected comparatively
little, but over time, things get more and more clogged, until a
reboot - which can rarely be clean in this state - is required).

kre


Re: Anyone recall the dreaded tstile issue?

2022-07-15 Thread Mouse
> My take away from all the discussion was that the best way to find
> the problem was to look at all the processes that weren't in tstile
> wait and see what they're doing.

That's what I was trying to do with my looking at "X is tstiled waiting
for Y, who is tstiled waiting for Z, who is..." and looking at the
non-tstiled process(es) at the ends of those chains.

In this particular case there were two, both waiting on puffsrpl.

> It sounds like collecting stack traces on the processes that are in
> puffsrpl wait would be a good start and that might give you a clue as
> to what might be getting stuck.
> [...]
> I'm guessing there's some deadlock between puffs and some of the
> other filesystem code on the system.

My best guess at the moment is that there is a deadlock loop where
something tries to touch the puffs filesystem, the user process forks a
child as part of that operation, and the child gets locked up trying to
access the puffs filesystem.

This shouldn't happen, of course.  The children forked by the puffs
backing process should never go anywhere near the puffs mount point,
and, furthermore, even if one puffs request locks up, the rest of the
filesystem should carry on just fine.  The only way that makes any
sense is if something common to all of puffs, or some such, is locked.

But there are at least two processes waiting on puffsrpl in this latest
hang that have no business anywhere near the puffs mountpoint.  So
there is something I don't understand going on.

Stack traces are a good step, yes; thank you for suggesting them.  I'm
also considering recording (by some means that, like an in-kernel ring
buffer, is dumpable from ddb) every puffs request and response, so if
some request is being slow I can at least find out about it.
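
Something like the following, say - invented names, just a sketch of the
shape (a real version would need interlocking or atomic increments, plus a
way to walk the ring from ddb or from a core dump):

#include <sys/types.h>

#define REQTRACE_SLOTS  256             /* must be a power of two */

/*
 * A fixed-size ring of request/response events that lives in the kernel
 * image, so the recent puffs traffic of a wedged system can be read back
 * post-mortem.
 */
struct reqtrace_ent {
        unsigned int    rt_seq;         /* global sequence number */
        int             rt_op;          /* request opcode */
        int             rt_isreply;     /* 0 = request sent, 1 = reply seen */
};

static struct reqtrace_ent reqtrace_ring[REQTRACE_SLOTS];
static unsigned int reqtrace_seq;

static void
reqtrace_record(int op, int isreply)
{
        struct reqtrace_ent *rt =
            &reqtrace_ring[reqtrace_seq & (REQTRACE_SLOTS - 1)];

        rt->rt_seq = reqtrace_seq++;
        rt->rt_op = op;
        rt->rt_isreply = isreply;
}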

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-15 Thread Mouse
>> [...tstile...]
> I also recall getting the situation when working on perfuse, but it
> was limited to processes operating on the puffs filesystem.

My front-runner theory at the moment is that I've got something like
that, but at some point a process holding a much more widely-used lock
joins the puffs-based lockup and everything goes downhill from there.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Anyone recall the dreaded tstile issue?

2022-07-15 Thread Brian Buhrow
hello.  If memory serves correct, this problem was discussed relative to
NetBSD-5 when Andrew Doran was working on the smp improvements to the
kernel.  As manu pointed out, it could be a result of a number of
scenarios.  My take away from all the discussion was that the best way to
find the problem was to look at all the processes that weren't in tstile
wait and see what they're doing.  Everything in tstile wait is basically
waiting in line for another resource that's currently in use.  In other
words, a lot of tstile processes is a symptom of a problem, not the
problem itself.  It sounds like collecting stack traces on the processes
that are in puffsrpl wait would be a good start and that might give you a
clue as to what might be getting stuck.  It also might be a good idea to
get stack traces on all of the kernel threads to see what they're doing.
I'm guessing there's some deadlock between puffs and some of the other
filesystem code on the system.

-thanks
-Brian


Re: Anyone recall the dreaded tstile issue?

2022-07-15 Thread Emmanuel Dreyfus
On Fri, Jul 15, 2022 at 07:46:58PM -0400, Mouse wrote:
> Some time back, I recall seeing people talking on the lists about some
> sort of discouragingly common issue with processes getting stuck in
> tstile waits.  (I've tried to scare up the relevant list mail on
> mail-archive.netbsd.org, so far to no avail.)

There are many scenarios that can doom most of userland into tstile waits.
I often experienced the following:
- snapshot with backing store on a partition that gets full
- I/O to a disk that went out for lunch

I also recall getting the situation when working on perfuse, but it was
limited to processes operating on the puffs filesystem.


-- 
Emmanuel Dreyfus
m...@netbsd.org