On 12.03.2007 at 21:29, Ralph Castain wrote:
But now we are going beyond Mark's initial problem.
Back to that initial problem: suspending a parallel job in SGE leads to this process tree:
19924  1786 19924 S  \_ sge_shepherd-45250 -bg
19926 19924 19926 Ts |   \_ /bin/sh /var/spool/sge/node39/job_scripts/45250
19927 19926 19926 T  |       \_ mpirun -np 4 /home/reuti/mpihello
19928 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node39 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19934 19928 19926 T  |           |   \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 36878 node39 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19929 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node44 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19935 19929 19926 T  |           |   \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 55907 node44 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19930 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node41 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19939 19930 19926 T  |           |   \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 59798 node41 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19931 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node38 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19938 19931 19926 T  |               \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 35136 node38 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19932  1786 19932 S  \_ sge_shepherd-45250 -bg
19933 19932 19933 Ss     \_ /usr/sge/utilbin/lx24-x86/rshd -l
19936 19933 19936 S          \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/node39/active_jobs/45250.1/1.node39 noshell
19937 19936 19937 S              \_ /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_st
19940 19937 19937 R                  \_ /home/reuti/mpihello
The job is still running, and only the master task is stopped. This is by design in SGE, and the parallel library should handle the rest on its own. So I requested the warning signals with -notify in the qsub:
mpirun: Forwarding signal 10 to job
mpirun noticed that job rank 0 with PID 20526 on node node39 exited on signal 10 (User defined signal 1).
[node39:20513] ERROR: A daemon on node node39 failed to start as expected.
[node39:20513] ERROR: There may be more information available from
[node39:20513] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20513] ERROR: If the problem persists, please restart the
[node39:20513] ERROR: Grid Engine PE job
[node39:20513] The daemon received a signal 10.
[node39:20513] ERROR: A daemon on node node42 failed to start as expected.
[node39:20513] ERROR: There may be more information available from
[node39:20513] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20513] ERROR: If the problem persists, please restart the
[node39:20513] ERROR: Grid Engine PE job
[node39:20513] The daemon received a signal 10.
This is what Mark already found. By default, USR1/USR2 terminate the application, so I added a line to my mpihello.c to ignore the signal:

signal(SIGUSR1, SIG_IGN);

(yes, the old-style signal() call should be fine when the handler is only SIG_IGN or SIG_DFL)
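For context, a minimal sketch of what such a test program could look like; the actual mpihello.c is not shown in this thread, so everything apart from the signal(SIGUSR1, SIG_IGN) line is assumed:

/* Minimal sketch of an mpihello.c that ignores SIGUSR1.
 * Only the signal(SIGUSR1, SIG_IGN) line is taken from the thread;
 * the MPI body and the sleep are assumed for illustration. */
#include <mpi.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);

    /* Ignore SGE's -notify warning signal so it does not kill the rank. */
    signal(SIGUSR1, SIG_IGN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);

    /* Keep the job alive long enough to suspend it from SGE. */
    sleep(60);

    MPI_Finalize();
    return 0;
}

Suspending the job with -notify then gives: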
mpirun: Forwarding signal 10 to job
[node39:20765] ERROR: A daemon on node node39 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
[node39:20765] ERROR: A daemon on node node38 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
[node39:20765] ERROR: A daemon on node node40 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
[node39:20765] ERROR: A daemon on node node44 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
And now the odd thing: the job script (the one containing the mpirun) is gone on the head node of this parallel job, but all the processes spawned via qrsh are still there. The job script was:

#!/bin/sh
trap '' usr1    # ignore SGE's notify signal in the job script's shell
export PATH=/home/reuti/local/openmpi-1.2rc3/bin:$PATH
mpirun -np $NSLOTS ~/mpihello

and what is left on the head node of the job is:
20771  1786 20771 S  \_ sge_shepherd-45258 -bg
20772 20771 20772 Ss     \_ /usr/sge/utilbin/lx24-x86/rshd -l
20775 20772 20775 S          \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/node39/active_jobs/45258.1/1.node39 noshell
20776 20775 20776 S              \_ /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_st
20778 20776 20776 R                  \_ /home/reuti/mpihello
So in the SGE case: USR1 should be caught by mpirun (and not terminate it), and mpirun should then tell the daemons to stop each of their child processes. This would simulate a real suspend, performed by Open MPI itself (a rough sketch of the idea follows below). The same might be true for USR2 as a warning of an upcoming SIGKILL, but this is not really necessary, as the kill can also be performed by SGE.
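To make the suggestion concrete, here is a purely hypothetical sketch of what such a launcher-side handler could look like; it is not Open MPI code, and child_pgid stands in for whatever bookkeeping the real mpirun/orted would have to do:

/* Hypothetical sketch only: translating SGE's -notify USR1 warning into a
 * real suspend of the launched tasks. NOT how mpirun/orted are implemented;
 * child_pgid is an assumed variable holding the tasks' process group. */
#include <signal.h>
#include <string.h>
#include <sys/types.h>

static pid_t child_pgid;   /* process group of the launched tasks (assumed) */

static void suspend_children(int sig)
{
    (void)sig;
    kill(-child_pgid, SIGSTOP);   /* forward a real stop instead of dying */
}

static void resume_children(int sig)
{
    (void)sig;
    kill(-child_pgid, SIGCONT);   /* pass the continue on when unsuspended */
}

static void install_suspend_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = suspend_children;
    sigaction(SIGUSR1, &sa, NULL);   /* SGE's -notify suspend warning */

    sa.sa_handler = resume_children;
    sigaction(SIGCONT, &sa, NULL);   /* assumed: a CONT is delivered on unsuspend */
}

The SIGCONT handler is only there on the assumption that the launcher also sees a continue when the SGE job is unsuspended, so it can wake its children again.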
-- Reuti
avoid seeing any signals from your terminal. When you issue a signal, mpirun picks it up and forwards it to your application processes via the ORTE daemons - the ORTE daemons, however, do *not* look at that signal but just pass it along.
As for timing, all we do is pass STOP to the OpenMPI application process - it's up to the local system as to what happens when a "kill -STOP" is issued. It was always my impression that the system stopped process execution immediately under that signal, but with some allowance for the old kernel vs. user space issue.
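For illustration only (plain POSIX behaviour, not Open MPI internals), the effect of a "kill -STOP" can be observed from a parent process with waitpid() and WUNTRACED:

/* Illustration of stock POSIX stop semantics, not Open MPI code:
 * the kernel stops the child on SIGSTOP, and the parent can observe
 * the stop via waitpid() with WUNTRACED. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                  /* child: idle until signalled */
        for (;;)
            pause();
    }

    sleep(1);                        /* give the child a moment to start */
    kill(pid, SIGSTOP);              /* stop it, as a daemon would on suspend */

    int status;
    waitpid(pid, &status, WUNTRACED);
    if (WIFSTOPPED(status))
        printf("child %d stopped by signal %d\n", (int)pid, WSTOPSIG(status));

    kill(pid, SIGCONT);              /* resume ... */
    kill(pid, SIGKILL);              /* ... and clean up for this demo */
    waitpid(pid, &status, 0);
    return 0;
}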
Once all the processes have terminated, mpirun tells the daemons to go ahead and exit. That's the only way the daemons get terminated in this procedure.
Can you tell us something about your system? Is this running under Linux, what kind of OS, how was OpenMPI configured, etc.?

Thanks
Ralph
On 3/12/07 1:26 PM, "Reuti" <re...@staff.uni-marburg.de> wrote:
On 12.03.2007 at 19:55, Ralph Castain wrote:
I'll have to look into it - I suspect this is simply an erroneous message and that no daemon is actually being started.

I'm not entirely sure I understand what's happening, though, in your code. Are you saying that mpirun starts some number of application processes which run merrily along, and then qsub sends out USR1/2 signals followed by STOP and then KILL in an effort to abort the job? So the application processes don't normally terminate, but instead are killed via these signals?
If you specify -notify in SGE with the qsub, then jobs are warned by the sge_shepherd (the parent of the job) during execution, so that they can perform some proper shutdown action before they are really stopped/killed:

for suspend: USR1 -wait-defined-time- STOP
for kill: USR2 -wait-defined-time- KILL
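As an aside, and independent of Open MPI, this is roughly how an application could make use of that notify window (purely illustrative; the cleanup itself is application specific and only hinted at here):

/* Illustrative only: using SGE's -notify grace period in an application.
 * USR1 precedes STOP and USR2 precedes KILL, so a job gets a configurable
 * window to flush or checkpoint before the hard signal arrives. */
#include <signal.h>

static volatile sig_atomic_t suspend_pending = 0;  /* USR1 seen, STOP coming */
static volatile sig_atomic_t kill_pending = 0;     /* USR2 seen, KILL coming */

static void notify_handler(int sig)
{
    if (sig == SIGUSR1)
        suspend_pending = 1;
    else if (sig == SIGUSR2)
        kill_pending = 1;
}

static void install_notify_handlers(void)
{
    /* The old-style signal() call is enough for setting a simple flag. */
    signal(SIGUSR1, notify_handler);
    signal(SIGUSR2, notify_handler);
}

/* The application's main loop would check suspend_pending/kill_pending and,
 * for example, flush output or write a checkpoint before the hard signal
 * arrives after the delay configured in the queue. */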
Worth noting: the signals are sent to the complete process group of the job created by the job script and mpirun, but not to the daemons which are created by the internal qrsh on the slave nodes! That should be ORTE's duty.
Another question: do OpenMPI jobs survive a STOP for some time at all, or will there be timing issues due to communication timeouts?

HTH - Reuti
Just want to ensure I understand the scenario here as that is something obviously unique to GE.

Thanks
Ralph

On 3/12/07 9:42 AM, "Olesen, Mark" <mark.ole...@arvinmeritor.com> wrote:
I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into interesting behaviour when using the qsub -notify option. With -notify, USR1 and USR2 are sent X seconds before sending the STOP and KILL signals, respectively.

When the USR2 signal is sent to the process group with the mpirun process, I receive an error message about not being able to start a daemon:
mpirun: Forwarding signal 12 to job
[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
[dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
The job eventually stops, but the mpirun process itself continues to live (just the ppid changes).

According to orte(1)/Signal Propagation, USR1 and USR2 should be propagated to all processes in the job (which seems to be happening), but why is a daemon start being attempted and the mpirun not being stopped?

/mark