On 11.06.2012 at 22:21, Rayson Ho wrote:

> Only rank 0 of the job is suspended if I recall correctly - it was
> designed specifically because not all parallel jobs are able to handle
> suspend/restart correctly - for example you can get TCP timeouts and
> things like those.

This also just came up on the MPICH2 list. I thought you had put it into OGE, as there was this discussion some time ago:

https://arc.liv.ac.uk/trac/SGE/ticket/577

-- Reuti

> Rayson
>
> On Mon, Jun 11, 2012 at 3:53 PM, Joseph Farran <[email protected]> wrote:
>> Hi.
>>
>> With the help of this group, I've been able to make good progress on setting
>> up OGE 2011.11 with our cluster.
>>
>> I am testing the Suspend & Resume features and it works great for serial
>> jobs but not able to get Parallel jobs suspended.
>>
>> I created a simple Parallel Environment (PE) called mpi and I submitted a
>> NAMD job to it and it runs just fine. I then tried suspending it using
>> qmon 'suspend' button and it says that it suspended the job and qstat also
>> confirms that job is suspended with the 's' flag, however looking at the
>> nodes on which NAMD is running, NAMD continues to run.
>>
>> What am I missing with respect to being able to suspend PE jobs since it
>> works for serial jobs?
>>
>> Joseph
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
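
To illustrate the behaviour described in this thread (the stop signal reaches only one process of the job, so the others keep running), here is a minimal sketch. It does not use Grid Engine or MPI at all: the "ranks" are plain sleeping Python processes on one host, and the function name suspend_only_first is made up for this example. It only assumes a POSIX system where SIGSTOP and waitpid with WUNTRACED are available.

```python
import os
import signal
import subprocess
import sys


def suspend_only_first(n_ranks=2):
    """Start n_ranks independent processes (hypothetical stand-ins for MPI
    ranks), send SIGSTOP to the first one only, and report which are stopped."""
    child_code = "import time; time.sleep(30)"
    procs = [subprocess.Popen([sys.executable, "-c", child_code])
             for _ in range(n_ranks)]
    try:
        # Stop only "rank 0", as the thread says the scheduler does.
        os.kill(procs[0].pid, signal.SIGSTOP)

        # Block until the kernel reports rank 0 as stopped.
        _, status = os.waitpid(procs[0].pid, os.WUNTRACED)
        stopped = [os.WIFSTOPPED(status)]

        # The other ranks never received a signal; a non-blocking check
        # returns pid 0, meaning "no state change", so they are still running.
        for p in procs[1:]:
            pid, status = os.waitpid(p.pid, os.WNOHANG | os.WUNTRACED)
            stopped.append(pid != 0 and os.WIFSTOPPED(status))
        return stopped
    finally:
        # Clean up: SIGKILL also terminates a stopped process.
        for p in procs:
            p.kill()
            p.wait()


if __name__ == "__main__":
    print(suspend_only_first(2))
```

In a real parallel job the remaining ranks usually sit on other nodes, so suspending them would need the signal to be forwarded there by some other means; this sketch only shows why NAMD can keep running while qstat reports the job as suspended.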
