I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully
work anyway -- and RDMA CM) will time out and Bad Things will happen if
you interrupt it in the middle of some network transactions. The
(kernel-imposed) timeout for RDMA CM is pretty short -- on the order of
a minute or
I would expect that you will hit problems with timeouts throughout the
codebase as Jeff mentioned, particularly with network connections.
Having a 'prepare to suspend' signal followed by a 'suspend now'
signal might work since it should provide enough of a window to ready
the application
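
The two-signal idea above can be sketched roughly as follows. This is only an illustration, not anything from Open MPI: the class name SuspendCoordinator, the quiesce/suspend callbacks, and the choice of SIGUSR1 as the 'prepare' signal and SIGTSTP as the 'suspend now' signal are all assumptions made for the example.

```python
import signal


class SuspendCoordinator:
    """Hypothetical two-phase suspend: 'prepare' first, 'suspend now' later."""

    def __init__(self, quiesce, suspend):
        # quiesce: callable that drains or parks in-flight network activity
        # suspend: callable that actually stops the job
        self.quiesce = quiesce
        self.suspend = suspend
        self.prepared = False

    def on_prepare(self, signum=None, frame=None):
        # Phase 1: use the warning window to get into a safe state.
        self.quiesce()
        self.prepared = True

    def on_suspend(self, signum=None, frame=None):
        # Phase 2: if no warning was received, quiesce now as a fallback.
        if not self.prepared:
            self.quiesce()
        self.suspend()
        self.prepared = False


def install(coordinator):
    # Assumed signal mapping; a real scheduler may be configured differently.
    signal.signal(signal.SIGUSR1, coordinator.on_prepare)
    signal.signal(signal.SIGTSTP, coordinator.on_suspend)
```

Whether the window between the two signals is long enough to avoid the connection timeouts mentioned above depends on the transport; the sketch only shows the ordering.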
On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
Well, SGE lets you have it send mpirun SIGUSR1 some number of
minutes before sending the suspend signal.
My point is that the right approach might be to work in the context of
Josh's CR stuff -- he's already got hooks for "do this
Jeff Squyres wrote:
On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
It sounds reasonable to me. I agree with Ralf W about having mpirun
send a STOP to itself - that would seem to solve the problem about
stopping everything.
It would seem, however, that you cannot similarly STOP the daemons or
else you won't be able to CONT the job. I'm not sure how big a
Hello Rolf,
* Rolf Vandevaart wrote on Fri, Dec 05, 2008 at 08:00:42PM CET:
>
> One problem is that with SIGTSTP no longer delivering a stop signal to
> mpirun, one cannot CTRL-Z at their terminal to stop mpirun. I am trying
> to figure out how big a problem that is.
Why not first deal with
We have had requests to be able to suspend/resume MPI jobs within an SGE
environment. SGE sends a signal (which is configurable) to mpirun to
stop the job and another signal to resume it. To support this, I
propose that we add support in ORTE to catch SIGTSTP/SIGCONT and
forward these
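
As a rough illustration of catching SIGTSTP/SIGCONT in a launcher and relaying them, here is a minimal sketch. It is not ORTE code: make_forwarder, install, and the injectable send parameter are hypothetical names for this example, and relaying the signal unchanged straight to a list of local child PIDs is an assumption (in the real system the relay would presumably go through the daemons).

```python
import os
import signal


def make_forwarder(child_pids, send=os.kill):
    """Return a handler that relays the received signal to every child PID."""
    def handler(signum, frame):
        for pid in child_pids:
            # Relay the same signal number that the launcher received.
            send(pid, signum)
    return handler


def install(child_pids):
    # Both SIGTSTP and SIGCONT can be caught; SIGSTOP and SIGKILL cannot.
    handler = make_forwarder(child_pids)
    signal.signal(signal.SIGTSTP, handler)
    signal.signal(signal.SIGCONT, handler)
```

Note that once SIGTSTP is caught this way, the launcher itself no longer stops on CTRL-Z, which is the side effect raised earlier in the thread.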