At 10:04 AM 4/22/2002 -0400, Charles Lane wrote:
>The ones that I've seen are hanging in the "my_waitpid" code.   Here's
>what seems to be going on:
>
>    --> pipe to/from subprocess, it does its thing and exits
>    --> pipe code picks up exit via termination ast, deletes pipe structs
>    --> my_waitpid called from Perl:
>        doesn't find match to pipe (it was deleted)...
>        does a getjpi and finds a termination mbx
>        tries to read termination mbx, hangs forever

Urk.  I considered the my_waitpid stuff, but had ruled it out because I 
trusted its ability to know who was a pipe subprocess and who wasn't.

>Now, the piping code does *not* set up termination mailboxes...it uses
>LIB$SPAWN to create subprocesses, and LIB$SPAWN does not give you that
>option.
>
>Why?  It looks like LIB$SPAWN is using a termination mailbox
>internally, to trigger the termination AST that we're waiting for.
>
>So if we are successful in opening a channel to the termination mbx and
>grabbing the termination message, we'll mess up whatever code was
>waiting for that message.  

Really?  What prevents two readers from reading the same thing?

>But if we don't grab the termination message,
>we hang forever.
>
>Triggering this problem is timing dependent (I triggered it on a variety
>of the torture tests... a bit of delay here or there could change which
>test was more likely to hang), because it has to occur:
>    (a) AFTER the termination message goes to the piping code, so that
>        the pid is removed from the "open pipes" list.
>and (b) BEFORE the process is finally deleted by VMS, so that getjpi
>        still returns successfully.
>
>Possible action items:
>    keep a list of pipe/subprocess PIDs around to match with waitpid calls
>        (a memory leak if we keep all of them...just the last N perhaps?)
>    get rid of the attempts to grab termination mailboxes
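
To make the "just the last N" idea concrete, here is a minimal sketch of a
fixed-size list of recently exited pipe subprocesses. Everything in it
(RECENT_MAX, struct recent_cache, recent_record, recent_lookup) is a
hypothetical name for illustration, not anything in the existing code, and it
ignores the AST synchronization the real pipe code would need:

```c
#include <assert.h>
#include <stddef.h>

#define RECENT_MAX 8            /* hypothetical value for "the last N" */

struct recent_exit {
    unsigned long pid;          /* pid of a pipe subprocess that has exited */
    int status;                 /* its completion status */
};

struct recent_cache {
    struct recent_exit slot[RECENT_MAX];
    size_t next;                /* slot that the next record overwrites */
    size_t count;               /* number of valid slots, at most RECENT_MAX */
};

static void recent_init(struct recent_cache *c)
{
    assert(c != NULL);
    c->next = 0;
    c->count = 0;
}

/* Would be called where the pipe structs are torn down, so the pid
 * outlives them.  Once the cache is full the oldest entry is overwritten,
 * which is what keeps memory use bounded. */
static void recent_record(struct recent_cache *c, unsigned long pid, int status)
{
    assert(c != NULL);
    c->slot[c->next].pid = pid;
    c->slot[c->next].status = status;
    c->next = (c->next + 1) % RECENT_MAX;
    if (c->count < RECENT_MAX)
        c->count++;
}

/* Returns 1 and stores the status if pid is a recently exited pipe
 * subprocess, else 0.  Searches newest to oldest so that a reused pid
 * matches its most recent exit. */
static int recent_lookup(const struct recent_cache *c, unsigned long pid,
                         int *status)
{
    size_t i;

    assert(c != NULL);
    for (i = 0; i < c->count; i++) {
        size_t idx = (c->next + RECENT_MAX - 1 - i) % RECENT_MAX;
        if (c->slot[idx].pid == pid) {
            if (status != NULL)
                *status = c->slot[idx].status;
            return 1;
        }
    }
    return 0;
}
```

With something like this, my_waitpid could check recent_lookup() first and
return the saved status without ever going near the termination mailbox.
A pid that has already aged out of the list would still fall through to the
old path, so N would need to be chosen with that in mind.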


What about putting a timeout on the $qiow that is reading from the 
termination mailbox?  If it completes with a timeout, we can requeue it, but 
in the meanwhile whatever else was pending should have a chance to fire.
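
As a rough illustration of the read-with-timeout-and-requeue shape, here is a
sketch written as a POSIX analogue rather than with VMS system services:
poll() on a file descriptor stands in for the timed read, and
read_with_retry, timeout_ms and max_tries are made-up names. The cap on
retries is an assumption added for the sketch so it cannot spin forever; the
suggestion above does not specify one.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Wait up to timeout_ms for data on fd; on a timeout, go around and wait
 * again, up to max_tries attempts in total.
 * Returns the number of bytes read (> 0), 0 on end of file, -1 on error,
 * or -2 if every attempt timed out. */
static long read_with_retry(int fd, void *buf, size_t len,
                            int timeout_ms, int max_tries)
{
    int tries;

    assert(buf != NULL && len > 0);
    for (tries = 0; tries < max_tries; tries++) {
        struct pollfd p;
        int rc;

        p.fd = fd;
        p.events = POLLIN;
        p.revents = 0;

        rc = poll(&p, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: counts as one attempt */
            return -1;
        }
        if (rc == 0)
            continue;           /* timed out: the "requeue" step */

        return (long)read(fd, buf, len);
    }
    return -2;                  /* gave up; caller decides what happens next */
}
```

The point of the shape is the gap between attempts: control comes back to the
caller's loop at each timeout instead of sitting in one read that may never
complete.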

We really need to get the pipe torture tests or some subset of them into the 
test suite so we catch these problems when they first arise.
