On 13.06.2012 at 11:31, Tina Friedrich wrote:

> OpenMPI can do this as well, feature got added in version 1.3.1 (see
> http://www.open-mpi.de/faq/?category=running#suspend-resume - also has
> instructions on how to use with SGE).

Thanks, I wasn't aware of it. In essence, then, it's a feature of the
parallel library and not of the queuing system. So anyone who needs it may
switch to an MPI library that supports it. The original question doesn't
specify which parallel library is used. The only thing left to adjust in SGE
would then be to suspend the master if a slave gets suspended.

-- Reuti

> Basically, as Erik said, it's a case of sending SIGTSTP to mpirun instead
> of SIGSTOP (SIGTSTP can be trapped); mpirun catches this & forwards it to
> the a.outs as SIGSTOP. Same for resume. We needed to change the
> suspend_method to send SIGTSTP to MPI jobs, rather than SIGSTOP, but that
> was the only change to the queue setup.
>
> We've used this for a couple of years, seems to work fine.
>
> Tina
>
> On 13/06/12 10:11, Erik Soyez wrote:
>> Hi Reuti, that's why it is SIGTSTP, not SIGSTOP. Erik Soyez.
>>
>>
>> On Wed, 13 Jun 2012, Reuti wrote:
>>
>>> On 13.06.2012 at 08:39, Erik Soyez wrote:
>>>
>>>> Rayson, yes, it kind of worked with 6.2u5, but we used it mainly with
>>>> HP-MPI which only needs a SIGTSTP for the master process in order to
>>>> suspend the entire job. Regards, Erik.
>>>
>>> How does this work? Usually SIGSTOP can't be trapped. So, are the
>>> other processes on the slave nodes stopping themselves because some
>>> kind of heartbeat is missing once the master process is already
>>> stopped? Later on, on a SIGCONT, the master process will of course
>>> have to wake them up again by distributing the signal.
>>>
>>>
>>>> On Wed, 13 Jun 2012, Rayson Ho wrote:
>>>>
>>>>> On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez
>>>>> <[email protected]> wrote:
>>>>>> You probably need some kind of cronjob to suspend and unsuspend your
>>>>>> parallel jobs correctly. Or does anyone have a patch for this?
>>>>>
>>>>> Erik,
>>>>>
>>>>> So is/was it really working when you try it with SGE 6.2u5??
>>>>>
>>>>> I have not looked into the code that handles parallel job suspension
>>>>> in detail (we were working on "near-by" code in 2008 and Shannon was
>>>>> also looking into suspending parallel jobs at that time, and thus
>>>>> we just relied on him to debug the code :-D ).
>>>>>
>>>>> However, in order to properly handle the case you mentioned, the
>>>>> qmaster will need to keep track of the number of times subordination
>>>>> happens to a job. And I can already think of issues if the accounting
>>>>> code is not accurate enough.
>>>>>
>>>>> Do you know if other batch systems handle the case you mentioned
>>>>> correctly?
>>>>>
>>>>>
>>>>>> On Tue, 12 Jun 2012, Joseph Farran wrote:
>>>>>>
>>>>>>> Well, for our needs, we *REALLY* need Parallel Job suspension. It's
>>>>>>> not even a choice for us.
>>>>>>>
>>>>>>> If Torque/Maui can do it, I am sure OGE can do it without issues.
>>>>>>>
>>>>>>> Can someone please tell me what patch I need to install to
>>>>>>> un-break / turn-on Parallel job suspension?
>>>>>>>
>>>>>>> If you guys are that paranoid about PE suspension, how about
>>>>>>> adding an on/off flag for this since the code is already there
>>>>>>> and let the admin pick?
>>>>>>>
>>>>>>>
>>>>>>> On 06/12/2012 06:52 AM, Dave Love wrote:
>>>>>>>>
>>>>>>>> "Joseph A. Farran"<[email protected]> writes:
>>>>>>>>
>>>>>>>>> If you guys are taking requests, *please* add suspension and
>>>>>>>>> ignore old Sun recommendation.
>>>>>>>>
>>>>>>>> Support for suspension exists, it's just broken (per the issue
>>>>>>>> Reuti pointed to). The use of | is clearly wrong, but the other
>>>>>>>> bit isn't clear. It's one of the available patches I wanted to
>>>>>>>> understand before applying (and had forgotten about). Can anyone
>>>>>>>> cast more light on it?
>>
>
> --
> Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
> Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
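
For readers following the thread, the forwarding Tina describes (the launcher
traps SIGTSTP and passes SIGSTOP on to the a.outs, and likewise for resume)
can be pictured with a small toy launcher. This is only a sketch under
assumptions, not the actual code of Open MPI or HP-MPI: `ForwardingLauncher`
and its methods are hypothetical names, the workers are local sleeping
processes rather than remote ranks, and POSIX signals are assumed.

```python
import signal
import subprocess
import sys


class ForwardingLauncher:
    """Toy launcher: traps SIGTSTP/SIGCONT and relays them to its workers.

    Hypothetical illustration only; not how any real mpirun is implemented.
    """

    def __init__(self, n_workers):
        # Local sleeping processes stand in for the a.outs of an MPI job.
        self.workers = [
            subprocess.Popen(
                [sys.executable, "-c", "import time; time.sleep(60)"]
            )
            for _ in range(n_workers)
        ]
        # SIGTSTP can be caught (SIGSTOP cannot), so the launcher stays in
        # control of how the suspension is carried out.
        signal.signal(signal.SIGTSTP, self._on_suspend)
        signal.signal(signal.SIGCONT, self._on_resume)

    def _forward(self, sig):
        for worker in self.workers:
            if worker.poll() is None:  # skip workers that already exited
                worker.send_signal(sig)

    def _on_suspend(self, signum, frame):
        # Workers get the untrappable SIGSTOP, so they halt unconditionally.
        self._forward(signal.SIGSTOP)

    def _on_resume(self, signum, frame):
        self._forward(signal.SIGCONT)

    def shutdown(self):
        """Reset the handlers to their defaults and reap all workers."""
        signal.signal(signal.SIGTSTP, signal.SIG_DFL)
        signal.signal(signal.SIGCONT, signal.SIG_DFL)
        for worker in self.workers:
            worker.kill()  # SIGKILL also terminates stopped processes
            worker.wait()
```

In the setup discussed above, the SIGTSTP would come from the queue's
suspend_method rather than from a terminal; sending SIGSTOP to the launcher
directly would stop only the launcher itself, since it has no chance to relay
anything.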
