Hi Reuti (and others),

> And now the odd thing: the jobscript (with the mpirun) is gone on the
> head node of this parallel job, but all the spawned qrsh processes
> are still there:

I'm glad that someone else can almost reproduce my problem.
Suspecting that my application was not ignoring USR1/USR2, I added a
signal handler to it that simply prints "ignoring SIGUSR*". The shell script
now also has these traps:

trap 'echo script usr1' USR1
trap 'echo script usr2' USR2
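
A quick way to check that traps like these really catch the signals without
ending the script is the sketch below. It is only an illustration: demo_traps
is a hypothetical helper, not part of the actual job script, and it sends the
signals to itself instead of waiting for SGE to deliver them.

```shell
#!/bin/sh
# Sketch only: demo_traps is a hypothetical helper, not the real job script.
demo_traps() {
    # Run in a child shell so that $$ is the PID of the shell holding the traps.
    sh -c '
        trap "echo script usr1" USR1
        trap "echo script usr2" USR2
        kill -USR1 $$   # SIGUSR1 is signal 10 on Linux/x86
        kill -USR2 $$   # SIGUSR2 is signal 12 on Linux/x86
        echo "still running"
    '
}

demo_traps
```

If the traps work, this should print "script usr1", "script usr2" and then
"still running", showing that the shell survives both signals.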

> So in the SGE case: usr1 should be caught by the mpirun (and not
> terminate it), which will notify the daemons to stop each ones child
> processes. This would simulate a real suspend, performed by OpenMPI.

Using qmod -sj to suspend the job (which sends the USR1 warning signal), I see
the same behaviour as before. Interestingly enough, I get two messages:

    mpirun: Forwarding signal 10 to job
    The daemon received a signal 10.

After these messages, only the sge-shepherd and mpirun are alive - the job
and qrsh processes are gone. Some time later, the following message also
appears:

    mpirun: Forwarding signal 12 to job

after which no processes are left *except* the mpirun, which I need to
kill by hand.

In case the configuration is a factor, the cluster machines are running with
a stock SuSE 9.2 (Linux 2.6.8-24-smp and/or 2.6.8-24.16-smp).

The openmpi configuration:
            ./configure \
                --prefix=$OPENMPI_ARCH_PATH \
                --enable-shared \
                --disable-static \
                --disable-mpi-f77 \
                --disable-mpi-f90 \
                --disable-mpi-profile \
                --disable-mpi-cxx

/mark
