Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Jeff Squyres
I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully work anyway -- and RDMA CM) will time out and Bad Things will happen if you interrupt it in the middle of some network transactions. The (kernel-imposed) timeout for RDMA CM is pretty short -- on the order of a minute or

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Josh Hursey
I would expect that you will hit problems with timeouts throughout the codebase as Jeff mentioned, particularly with network connections. Having a 'prepare to suspend' signal followed by a 'suspend now' signal might work since it should provide enough of a window to ready the application
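A rough sketch of the two-phase idea, assuming the 'prepare to suspend' notice arrives as SIGUSR1 (the signal mentioned for SGE elsewhere in this thread). `SuspendState`, `run_steps` and the `quiesce` callback are hypothetical names for illustration only, not anything in Open MPI. The handler only records the request; the main loop stops starting new work and then quiesces, so the process is idle before the real suspend signal arrives.

```python
import signal


class SuspendState:
    """Records that a 'prepare to suspend' notice has been received."""

    def __init__(self):
        self.preparing = False

    def on_prepare(self, signum, frame):
        # Keep the handler trivial: only set a flag for the main loop.
        self.preparing = True


def run_steps(state, steps, quiesce):
    """Run work items in order; stop starting new ones once a prepare
    notice has been seen, then call quiesce() (hypothetical hook that
    would, for example, drain in-flight network traffic)."""
    done = []
    for step in steps:
        if state.preparing:
            break
        done.append(step())
    if state.preparing:
        quiesce()
    return done


def install(state):
    # Assumption: SIGUSR1 is the 'prepare to suspend' notice.
    signal.signal(signal.SIGUSR1, state.on_prepare)
```

The gap between the two signals is what gives `quiesce()` time to finish, so how long it must be depends on how long the slowest in-flight operation can take.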

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Jeff Squyres
On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
> Well under SGE it allows you to have SGE send mpirun SIGUSR1 so many minutes before sending the Suspend signal.
My point is that the right approach might be to work in the context of Josh's CR stuff -- he's already got hooks for "do this

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Terry Dontje
Jeff Squyres wrote:
> On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
>> It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem about stopping everything. It would seem, however, that you cannot similarly STOP the

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Jeff Squyres
On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
> It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem about stopping everything. It would seem, however, that you cannot similarly STOP the daemons or else you

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-08 Thread Ralph Castain
It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem about stopping everything. It would seem, however, that you cannot similarly STOP the daemons or else you won't be able to CONT the job. I'm not sure how big a
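The ordering described here could look roughly like the sketch below: relay the suspend request to the daemons first, then have the launcher stop itself. `suspend_job` and the `notify_daemons` callback are hypothetical names, not ORTE functions, and relaying SIGTSTP unchanged is an assumption. The daemons are only told about the request and are not stopped themselves, since they have to stay runnable to relay the later CONT.

```python
import os
import signal


def suspend_job(notify_daemons, self_pid=None, kill=os.kill):
    """Relay the suspend request, then stop the launcher process itself.

    notify_daemons: hypothetical callback that sends a signal number to
    the daemons, which pass it on to the application processes.
    kill: injectable for testing; defaults to os.kill.
    """
    # 1. Relay first. Once the launcher is stopped it cannot relay anything.
    notify_daemons(signal.SIGTSTP)
    # 2. Stop the launcher with SIGSTOP, which cannot be caught or ignored.
    target = os.getpid() if self_pid is None else self_pid
    kill(target, signal.SIGSTOP)
```

When the launcher later receives SIGCONT it resumes on its own, and a SIGCONT handler can relay the resume the same way.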

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-06 Thread Ralf Wildenhues
Hello Rolf,
* Rolf Vandevaart wrote on Fri, Dec 05, 2008 at 08:00:42PM CET:
> One problem is that with SIGTSTP no longer delivering a stop signal to mpirun, one cannot CTRL-Z at their terminal to stop mpirun. I am trying to figure out how big a problem that is.
Why not first deal with

[OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-05 Thread Rolf Vandevaart
We have had requests to be able to suspend/resume MPI jobs within an SGE environment. SGE sends a signal (which is configurable) to mpirun to stop the job and another signal to resume it. To support this, I propose that we add support in the ORTE to catch SIGTSTP/SIGCONT and forward these
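A minimal sketch of the catch-and-forward idea, written in Python for brevity rather than as ORTE code. `forward_signal`, `install_forwarders` and the list of child pids are hypothetical, and delivering the same signal number that was caught is an assumption; the proposal may well deliver something else to the application processes.

```python
import os
import signal

# Signals the launcher catches and passes on instead of acting on itself.
FORWARDED = (signal.SIGTSTP, signal.SIGCONT)


def forward_signal(signum, child_pids, kill=os.kill):
    """Send signum to every child pid; return the pids that were signalled.

    kill is injectable for testing; it defaults to os.kill.
    """
    if signum not in FORWARDED:
        return []
    delivered = []
    for pid in child_pids:
        try:
            kill(pid, signum)
            delivered.append(pid)
        except ProcessLookupError:
            # The child has already exited; nothing to forward to.
            pass
    return delivered


def install_forwarders(child_pids):
    """Install handlers so that SIGTSTP/SIGCONT are forwarded to children."""
    def handler(signum, frame):
        forward_signal(signum, child_pids)

    for signum in FORWARDED:
        signal.signal(signum, handler)
```

Note that once a handler is installed for SIGTSTP, the default stop action no longer applies to the launcher itself, which is the CTRL-Z concern raised in the replies.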