We have not implemented a flag for it, and it would not be hard to add one. The 
catch with adding a new option is that we would then need to support it even if 
it turns out to be unnecessary, and we are careful not to add too much extra 
code. That is why I will do more research first and decide whether it is really 
needed.

I searched Google for TCP suspend issues and found that some developers say it 
is safe if the processes are suspended while they are at a quiescent point.

So if in-flight messages are processed before the suspend takes effect, which 
should be the case with the freezer cgroup subsystem, then it should be safe to 
handle this without adding a new flag.

See: http://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt
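For concreteness, here is a minimal sketch of that freezer interface (not GE code; it assumes a cgroup v1 freezer hierarchy mounted at /sys/fs/cgroup/freezer and root privileges, and the cgroup name "ge_job" is made up for illustration). It skips itself when those assumptions do not hold:

```shell
#!/bin/sh
# Minimal sketch of the cgroup v1 freezer interface (not GE code).
# Assumptions: freezer hierarchy mounted at /sys/fs/cgroup/freezer and
# root privileges; the cgroup name "ge_job" is made up for illustration.
FREEZER=/sys/fs/cgroup/freezer/ge_job

if [ "$(id -u)" -ne 0 ] || [ ! -d /sys/fs/cgroup/freezer ]; then
    echo "freezer not available (needs root and a mounted freezer); skipping"
    result=skipped
else
    mkdir -p "$FREEZER"

    sleep 300 &                       # stand-in for one task of a job
    task=$!
    echo "$task" > "$FREEZER/tasks"   # move the task into the cgroup

    # Freeze: the kernel stops every task in the cgroup at a safe point,
    # which is the quiescent-point behaviour described above.
    echo FROZEN > "$FREEZER/freezer.state"
    cat "$FREEZER/freezer.state"      # FREEZING while in progress, then FROZEN

    # Thaw: the tasks resume where they left off.
    echo THAWED > "$FREEZER/freezer.state"

    kill "$task"                      # clean up the stand-in task
    wait "$task" 2>/dev/null          # reap it so the cgroup empties
    rmdir "$FREEZER" 2>/dev/null
    result=done
fi
```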

(Rayson added cgroups support in GE 2011.11 U1. While cgroups is Linux-only, 
Linux runs most of the clusters, at least for small- to medium-scale HPC.)

IBM also planned to use Containers/Cgroups in IBM BlueWaters (before IBM 
cancelled the project in 2011) to perform checkpointing and restart.

https://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_smith.pdf

 -Ron




________________________________
From: Joseph A. Farran <[email protected]>
To: Ron Chen <[email protected]> 
Cc: Rayson Ho <[email protected]>; "[email protected]" 
<[email protected]> 
Sent: Monday, June 11, 2012 11:53 PM
Subject: Re: [gridengine users] PE Job Suspend / Resume


Thanks, Ron, for the details and explanation.

I will test NAMD as indicated and will get back with the results. In the 
meantime, a couple of questions:

Is there a flag in OGE to specify whether a job is suspendable or not, like 
the flag that tells whether a job can be checkpointed?

I would argue that regardless of whether a job is suspendable, parallel jobs 
should be allowed to be suspended, and if they die, so be it.

We do have codes that were written with MPI without any kind of 
message-passing error checking, so yes, these types of codes will usually die 
if suspended, depending on the in-flight messages at the time. But my general 
feeling is that suspending a job is a last resort, and it is better to take a 
chance on suspending a job than to kill it outright.

We are using OpenMPI 1.4.4, and I believe that starting with v1.3.1 OpenMPI 
works with OGE to stop/resume correctly, no? Maybe I am missing something. Or 
maybe OGE handles OpenMPI and MPICH2 just fine, but the issue is with other 
types of message-passing programs like NAMD?

Best,
Joseph



On 6/11/2012 6:00 PM, Ron Chen wrote:

Hi Joseph,

Only a few people have asked for this feature in the past, and Sun (I think it 
was Andy) told us that suspending PE jobs can cause issues, so the code was 
never changed in the original Grid Engine or in OGS/GE.

To help us (and also you) understand the behaviour of suspending PE jobs, we 
need to do some manual testing. Can you run a small NAMD job that spans 2 or 
more nodes, and then on each node:

- run ps to look for the PIDs of the NAMD processes of that job
- prepare a kill -STOP command for each one
- when you are finished typing all those kill -STOP commands, press ENTER on 
  all the nodes with as little delay as you can

Then wait for a while, maybe 15+ minutes or longer, send the CONT signal to 
resume the tasks, and see if NAMD continues to run. Let us know the result.

So you are manually suspending the PE job by hand. As mentioned by others, TCP 
timeouts can be an issue, and in fact some checkpoint/restart libraries do not 
support TCP socket connections.

 -Ron
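The manual test above can be sketched on a single node like this (a background sleep stands in for one NAMD process; this is an illustration, not GE code):

```shell
#!/bin/sh
# Single-node sketch of the manual suspend/resume test. A background
# sleep stands in for one NAMD process; on a real run you would collect
# the NAMD PIDs with ps and signal each of them.
sleep 60 &
pid=$!

kill -STOP "$pid"        # suspend the process
sleep 1                  # give the signal time to take effect
stop_state=$(ps -o stat= -p "$pid" | tr -d ' ' | cut -c1)
echo "after STOP: $stop_state"   # T = stopped

kill -CONT "$pid"        # resume it
sleep 1
cont_state=$(ps -o stat= -p "$pid" | tr -d ' ' | cut -c1)
echo "after CONT: $cont_state"   # S = sleeping again

kill "$pid"              # clean up the stand-in process
```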
----- Original Message -----
From: Joseph Farran <[email protected]>
To: Rayson Ho <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Monday, June 11, 2012 5:17 PM
Subject: Re: [gridengine users] PE Job Suspend / Resume

Thanks for the clarification. This is a NAMD run, so I am launching it via 
"charmrun" and not mpirun. If the OGE code suspends via rank 0, I would think 
that charmrun and/or any other parallel job would suspend as well, no? I will 
try an mpirun job next to see if it behaves differently and suspends correctly 
or not.

Joseph

On 06/11/2012 01:32 PM, Rayson Ho wrote:
> Clarify... rank 0 in the previous email = the parallel job launcher
> (e.g. mpirun) process - usually running on the rank 0 machine.
>
> A few years ago, we added code to allow every process to get the
> suspend signal (only for the tight-integration case), but Sun at that
> time did not integrate it into the tree, so we will need to start the
> discussion again and see if it really is a good idea to suspend
> parallel jobs.
>
> Rayson
>
> On Mon, Jun 11, 2012 at 4:21 PM, Rayson Ho <[email protected]> wrote:
>> Only rank 0 of the job is suspended, if I recall correctly - it was
>> designed specifically because not all parallel jobs are able to handle
>> suspend/restart correctly - for example, you can get TCP timeouts and
>> things like those.
>>
>> Rayson
>>
>> On Mon, Jun 11, 2012 at 3:53 PM, Joseph Farran <[email protected]> wrote:
>>> Hi.
>>>
>>> With the help of this group, I've been able to make good progress on
>>> setting up OGE 2011.11 with our cluster.
>>>
>>> I am testing the Suspend & Resume features, and it works great for
>>> serial jobs, but I am not able to get parallel jobs suspended.
>>>
>>> I created a simple Parallel Environment (PE) called mpi and submitted
>>> a NAMD job to it, and it runs just fine. I then tried suspending it
>>> using the qmon 'suspend' button; it says that it suspended the job,
>>> and qstat also confirms that the job is suspended with the 's' flag.
>>> However, looking at the nodes on which NAMD is running, NAMD
>>> continues to run.
>>>
>>> What am I missing with respect to being able to suspend PE jobs,
>>> since it works for serial jobs?
>>>
>>> Joseph
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users

