At 04:52 PM 4/22/2002 -0400, Charles Lane wrote:
>(it sounded like you were assuming that a read doesn't remove the
>message; it does)

I guess I was assuming that whatever writes the termination message will satisfy an arbitrary number of readers. I'm not sure where I got this impression and it may be completely wrong. Somewhere I could swear I saw an example (possibly in the Apache sources?) of reading a termination mailbox where the reader discarded a message and requeued itself if the pid didn't match (perhaps was a different child of the same parent) or the type of message was anything other than process termination (since the same mbx is used, as you indicate below, for other purposes). This would only work if the writer was willing to satisfy multiple requests since otherwise, as you suggest, whoever got there first would steal the message from the others.

>> It looked to me like the only time we could fail to recognize a pipe
>> subprocess was when my_pclose had already initiated shutdown with either
>> $delprc or $forcex.
>Not true, I was seeing the hang even when the child process exits
>normally.

In which case the process would still be in the midst of getting deleted, right?

>> --- vms/vms.c;-0 Tue Apr 9 14:26:06 2002
>> +++ vms/vms.c Mon Apr 22 12:14:09 2002
>>      memset((void *) &trmmsg, 0, sizeof(trmmsg));
>> -    sts = sys$qiow(0,mbxchan,IO$_READVBLK,&qio_iosb,0,0,
>> +    sts = sys$qiow(0,mbxchan,IO$_READVBLK|IO$M_WRITERCHECK,&qio_iosb,0,0,
>>                     &trmmsg,ACC$K_TERMLEN,0,0,0,0);
>
>it's that WRITERCHECK that's bailing you out here; my bet is that
>the termination message is going back to the parent process, and the
>child is deleted. After that happens, there is no "writer" and the
>read will terminate with an error.

The WRITERCHECK made no discernible difference by itself. I left it in because it seemed like the right thing to do, though if the parent has a read/write channel open then the WRITERCHECK will be ignored. It was definitely the addition of a check for process deletion in progress that kept it from hanging.

>Which isn't that bad a way of getting a "wait for termination" without
>having to do polling, but it still is problematic.
>
>For example (and I have a program that does stuff like this):
>
>   Process A creates a "general purpose message mailbox", opens read/write
>   channel to that mailbox, queues a read to mbx.
>   Process A spawns child B, with the term mbx -> general purpose mbx
>   Process A spawns child C, with the term mbx -> general purpose mbx

Eek! Why is process A using its general purpose mbx as the termination mbx of its children? That will definitely cause hilarity, if not mayhem. It couldn't happen in the piping code since you use lib$spawn instead of sys$creprc.

>In this case, the WRITERCHECK won't help, because A maintains a r/w
>channel to the mailbox. (the write is used to put "internal" messages
>in its input queue).
>
>Perl is waiting for child B, so you queue a read to the mbx; child B
>terminates, but since process A had a read queued first, it gets the
>termination message.
>
>Then A queues another read.
>
>Then child C exits, giving another termination message, or perhaps
>something else writes to the mailbox (a DECnet event, for example).
>
>Since Perl's waitpid queued its read before A's, waitpid will get the
>message and A won't. Hilarity ensues.

It's got to be a bit more complicated than that. Perhaps the writer can satisfy multiple readers but not an arbitrary number of them, i.e., perhaps it knows how many children it has (not how many read channels are open) and what I've done is create a sort of musical chairs situation where one child always gets left without a mailbox message.

>In the above example, the problem is that while a CHILD subprocess has
>a single termination mailbox, that mailbox can also be used by the
>PARENT for many disparate purposes, and when we *can* break into the
>communication between parent and child, we probably shouldn't unless
>we know exactly what we're doing.
>Which is not something one can do
>in a "general purpose" utility routine.

Possibly not. Unless someone can help us out with how termination mailboxes really work, we should probably ditch waitpid's attempt to use them for now.
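
For what it's worth, the ordering problem in the quoted A/B/C example can be sketched as a toy model in C. This is only an illustration, assuming (as that example does) that each message completes the oldest pending read and is then gone. The names toy_mbx, toy_queue_read, toy_write, toy_scenario and the READER_* ids are hypothetical, made up for this sketch; none of it is the real mailbox driver or sys$qio.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: NOT the VMS mailbox driver or any real API. */
enum { MAX_PENDING = 8 };
enum { READER_A = 1, READER_WAITPID = 2 };   /* hypothetical reader ids */

typedef struct {
    int    readers[MAX_PENDING];  /* readers with a read queued, oldest first */
    size_t nreaders;
} toy_mbx;

/* Queue a read on behalf of `reader`. */
static void toy_queue_read(toy_mbx *m, int reader)
{
    assert(m->nreaders < MAX_PENDING);
    m->readers[m->nreaders++] = reader;
}

/* Deliver one message: the oldest pending read completes and is removed,
 * so no other reader ever sees that message.  Returns the id of the reader
 * that received it, or -1 if nobody was waiting (the toy does not buffer). */
static int toy_write(toy_mbx *m)
{
    if (m->nreaders == 0)
        return -1;
    int winner = m->readers[0];
    for (size_t i = 1; i < m->nreaders; i++)
        m->readers[i - 1] = m->readers[i];
    m->nreaders--;
    return winner;
}

/* Replays the quoted sequence; got[0] is the reader that received B's
 * termination message, got[1] the reader that received C's. */
static void toy_scenario(int got[2])
{
    toy_mbx m = { {0}, 0 };
    toy_queue_read(&m, READER_A);        /* A queues its general purpose read */
    toy_queue_read(&m, READER_WAITPID);  /* waitpid queues a read, waiting on B */
    got[0] = toy_write(&m);              /* child B exits */
    toy_queue_read(&m, READER_A);        /* A queues another read */
    got[1] = toy_write(&m);              /* child C exits */
}
```

In this model toy_scenario leaves got[0] == READER_A and got[1] == READER_WAITPID: waitpid never sees B's message and then receives C's instead, which is the mix-up Charles describes. Whether the real writer behaves this way is exactly the open question.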
