As far as I know, OGS/GE behaves like SGE 6.2u5 here. We did not integrate the 
changes related to suspending PE jobs because Sun believed it was not safe.

So, I believe, at most we suspend and resume only the mpirun process. Note that 
Open MPI's runtime, called ORTE, can suspend remote tasks IIRC.
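
To illustrate what "suspending only the launcher" amounts to, here is a 
minimal Python sketch. It is not Grid Engine code: the helper names are 
hypothetical and the sleeping child is only a stand-in for mpirun. SIGSTOP 
is delivered to one pid on the local host, so tasks started on other hosts 
keep running unless the MPI runtime itself forwards the request.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Stop a single process; its children and remote tasks are not touched."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Continue a process that was previously stopped with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)


def wait_until_stopped(pid):
    """Block until a direct child changes state; True if it is now stopped."""
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def demo():
    # Stand-in for the launcher process (assumption: a child that just sleeps).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        suspend(child.pid)
        stopped = wait_until_stopped(child.pid)
        resume(child.pid)
        return stopped
    finally:
        # Clean up the stand-in process regardless of the outcome.
        child.kill()
        child.wait()
```

Note that wait_until_stopped() only works for direct children of the calling 
process, which is enough for this demonstration.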

The freezer cgroup in the kernel was developed to help batch job management 
systems, so if the system supports it, we should use it IMHO.
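
For readers who have not used it, a minimal Python sketch of driving the 
freezer subsystem through its cgroup v1 file interface (the "tasks" and 
"freezer.state" files described in the kernel's freezer-subsystem.txt). 
This is not how GE does it; the mount point, the group name "job_42" and 
the helper names are assumptions for illustration, and a real run needs 
root and an existing group directory.

```python
import os

# Assumed cgroup v1 mount point; it differs between distributions.
FREEZER_ROOT = "/sys/fs/cgroup/freezer"


def _write(path, text):
    with open(path, "w") as f:
        f.write(text)


def add_task(group, pid, root=FREEZER_ROOT):
    """Attach one task to the group by writing its pid to the tasks file."""
    _write(os.path.join(root, group, "tasks"), str(pid))


def set_state(group, state, root=FREEZER_ROOT):
    """Request FROZEN or THAWED for every task in the group."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    _write(os.path.join(root, group, "freezer.state"), state)


def get_state(group, root=FREEZER_ROOT):
    """Read the current state; a real kernel may report FREEZING for a while."""
    with open(os.path.join(root, group, "freezer.state")) as f:
        return f.read().strip()
```

Usage would be along the lines of add_task("job_42", pid) for each process 
of the job, then set_state("job_42", "FROZEN") to suspend and 
set_state("job_42", "THAWED") to resume. The root parameter is there only so 
the helpers can be tried against an ordinary directory.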

 -Ron




----- Original Message -----
From: Joseph A. Farran <[email protected]>
To: Ron Chen <[email protected]>
Cc: Rayson Ho <[email protected]>; "[email protected]" 
<[email protected]>
Sent: Tuesday, June 12, 2012 12:58 AM
Subject: Re: [gridengine users] PE Job Suspend / Resume

Yes it makes sense not to introduce new options.

I am not familiar with cgroups, so I need to read up on it.

On the subject of OpenMPI and OGE - does OGE correctly suspend and resume 
programs compiled with OpenMPI using the OpenMPI s/r implementation?

Joseph

On 6/11/2012 9:21 PM, Ron Chen wrote:
> We have not implemented a flag for it, though it would not be hard to add 
> one. The catch with adding a new option is that we then need to support it 
> even if it turns out not to be needed, and we are careful not to add too 
> much extra code. That is why I will do more research first and then decide 
> whether it is really needed.
>
> I searched Google for TCP suspend issues and found that some developers say 
> it is safe if the processes are suspended when they are at a quiescent 
> point.
>
> So if in-flight messages are processed first before suspending, which should 
> be the case for the freezer cgroup subsystem, then it should be safe to 
> handle it without adding a new flag.
>
> See: http://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt
>
> (And Rayson added cgroups support in GE 2011.11 U1. While cgroups is Linux 
> only, Linux runs on most clusters, at least those doing small to 
> medium-scale HPC.)
>
> IBM also planned to use Containers/Cgroups in IBM BlueWaters (before IBM 
> cancelled the project in 2011) to perform checkpointing and restart.
>
> https://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_smith.pdf
>
>   -Ron
>

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
