Craig A. Berry wrote:
> At 10:56 PM -0400 4/20/02, Charles Lane wrote:
> >As for the "hangs", any idea what bit of the torture tests it's hanging
> >on?

> Well, after my patch it hangs in test_pipe.pl after the test labeled
> "TEST7:  close write pipe on beast while reading,  should see 5
> writes, 5 reads+EOF."  I do see the reads, writes, and EOF, but then
> the hang.  At that point, from looking at ANALYZE/SYSTEM --> SHOW
> PROCESS/CHANNEL, the parent has two mailboxes open and busy and  the
> child has a third mailbox open and busy, but the I/O request queue is
> empty for all three mailboxes.  Parent and child are both in LEF
> state.  My guess would be it's sitting in my_waitpid on the
> sys$waitfr(pipe_ef) call but I haven't proven that yet.

I did a build over the weekend; haven't finished with testing yet, but
I'm also seeing hangs in the pipe torture tests.

The ones that I've seen are hanging in the "my_waitpid" code.   Here's
what seems to be going on:

    --> pipe to/from subprocess, it does its thing and exits
    --> pipe code picks up exit via termination ast, deletes pipe structs
    --> my_waitpid called from Perl:
        doesn't find match to pipe (it was deleted)...
        does a getjpi and finds a termination mbx
        tries to read termination mbx, hangs forever

Now, the piping code does *not* set up termination mailboxes...it uses
LIB$SPAWN to create subprocesses, and LIB$SPAWN does not give you that
option.

So why does getjpi find one?  It looks like LIB$SPAWN is using a
termination mailbox internally, to trigger the termination AST that
we're waiting for.

So if we are successful in opening a channel to the termination mbx and
grabbing the termination message, we'll mess up whatever code was
waiting for that message.  But if we don't grab the termination message,
we hang forever.

Triggering this problem is timing dependent (I triggered it on a variety
of the torture tests... a bit of delay here or there could change which
test was more likely to hang), because it has to occur:
    (a) AFTER the termination message goes to the piping code, so that
        the pid is removed from the "open pipes" list.
and (b) BEFORE the process is finally deleted by VMS, so that getjpi
        still returns successfully.

Possible action items:
    keep a list of pipe/subprocess PIDs around to match with waitpid calls
        (a memory leak if we keep all of them...just the last N perhaps?)
    get rid of the attempts to grab termination mailboxes
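
For the first item, here is a rough sketch of what a "last N" list might
look like. This is not the actual vms.c code: RECENT_MAX, recent_remember
and recent_lookup are made-up names, and N = 8 is arbitrary. The idea is
that the pipe code records pid and status when it deletes the pipe
structs, and my_waitpid checks this list before falling back to getjpi
and the termination mailbox. Old entries just get overwritten, so there's
no leak.

```c
#include <assert.h>
#include <stddef.h>

#define RECENT_MAX 8    /* the "last N"; 8 is an arbitrary guess */

/* One remembered pipe subprocess (hypothetical layout). */
struct recent_exit {
    unsigned long pid;      /* subprocess pid */
    int           status;   /* its completion status */
    int           valid;    /* nonzero while the slot holds an unclaimed exit */
};

static struct recent_exit recent[RECENT_MAX];
static size_t recent_next = 0;      /* slot that gets overwritten next */

/*
 * Record an exited pipe subprocess.  Intended to be called at the point
 * where the pipe structs are deleted.  Real code would have to protect
 * this against an AST firing mid-update; that is not shown here.
 */
static void recent_remember(unsigned long pid, int status)
{
    recent[recent_next].pid    = pid;
    recent[recent_next].status = status;
    recent[recent_next].valid  = 1;
    recent_next = (recent_next + 1) % RECENT_MAX;   /* oldest entry is reused */
}

/*
 * Look for pid among the recent exits.  On a match, store the saved
 * status in *status, mark the slot free, and return 1; otherwise return 0
 * so the caller can fall through to its other checks.
 */
static int recent_lookup(unsigned long pid, int *status)
{
    size_t i;

    assert(status != NULL);
    for (i = 0; i < RECENT_MAX; i++) {
        if (recent[i].valid && recent[i].pid == pid) {
            *status = recent[i].status;
            recent[i].valid = 0;    /* a given exit is reported only once */
            return 1;
        }
    }
    return 0;
}
```

A waitpid on a pid that has already aged out of the list would still hit
the original problem, so this narrows the window rather than closing it.
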
--
 Drexel University       \V                    --Chuck Lane
======]---------->--------*------------<-------[===========
     (215) 895-1545     _/ \  Particle Physics
FAX: (215) 895-5934     /\ /~~~~~~~~~~~        [EMAIL PROTECTED]
