Hi,

On 13.06.2012 at 11:11, Erik Soyez wrote:

> Hi Reuti, that's why it is SIGTSTP, not SIGSTOP.  Erik Soyez.

Aha, and that signal can then be set as suspend_method in the queue configuration.
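
To illustrate the distinction, here is a minimal sketch in Python. It is not taken from SGE or HP-MPI, and the helper names can_install_handler and catch_tstp_once are made up for this example. It only shows that a process may install its own handler for SIGTSTP (and could use it to pass the suspension on to its peers), whereas the operating system refuses a handler for SIGSTOP.

```python
import os
import signal
import time


def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # SIGSTOP (like SIGKILL) can be neither caught nor ignored.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the old disposition back
    return True


def catch_tstp_once():
    """Send SIGTSTP to ourselves and report whether our handler ran."""
    received = []
    previous = signal.signal(signal.SIGTSTP, lambda s, f: received.append(s))
    try:
        os.kill(os.getpid(), signal.SIGTSTP)
        # Give the interpreter a moment to run the Python-level handler.
        for _ in range(100):
            if received:
                break
            time.sleep(0.01)
    finally:
        if previous is not None:
            signal.signal(signal.SIGTSTP, previous)
    return received == [signal.SIGTSTP]


if __name__ == "__main__":
    print("SIGTSTP can be trapped:", can_install_handler(signal.SIGTSTP))
    print("SIGSTOP can be trapped:", can_install_handler(signal.SIGSTOP))
    print("handler ran on SIGTSTP:", catch_tstp_once())
```

A real master process would do its forwarding inside the handler and then stop itself; that part is left out here because how HP-MPI does it is not described in this thread.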

-- Reuti


> On Wed, 13 Jun 2012, Reuti wrote:
> 
>> On 13.06.2012 at 08:39, Erik Soyez wrote:
>> 
>>> Rayson, yes, it kind of worked with 6.2u5, but we used it mainly with
>>> HP-MPI which only needs a SIGTSTP for the master process in order to
>>> suspend the entire job.  Regards, Erik.
>> 
>> How does this work? Usually SIGSTOP can't be trapped. So, are the other
>> processes on the slave nodes stopping themselves because some kind of
>> heartbeat is missing once the master process is stopped? Later on, on a
>> SIGCONT, the master process will of course have to wake them up again by
>> distributing the signal.
>> 
>> 
>>> On Wed, 13 Jun 2012, Rayson Ho wrote:
>>> 
>>>> On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez
>>>> <[email protected]> wrote:
>>>>> You probably need some kind of cronjob to suspend and unsuspend your
>>>>> parallel jobs correctly.  Or does anyone have a patch for this?
>>>> 
>>>> Erik,
>>>> 
>>>> So is/was it really working when you tried it with SGE 6.2u5?
>>>> 
>>>> I have not looked into the code that handles parallel job suspension
>>>> in detail (we were working on "near-by" code in 2008 and Shannon was
>>>> also looking into suspending parallel jobs at that time, and thus
>>>> we just relied on him to debug the code :-D ).
>>>> 
>>>> However, in order to properly handle the case you mentioned, the
>>>> qmaster will need to keep track of the number of times subordination
>>>> happens to a job. And I can already think of issues if the accounting
>>>> code is not accurate enough.
>>>> 
>>>> Do you know if other batch systems handle the case you mentioned correctly?
>>>> 
>>>> 
>>>>> On Tue, 12 Jun 2012, Joseph Farran wrote:
>>>>> 
>>>>>> Well, for our needs, we *REALLY* need Parallel Job suspension.  It's
>>>>>> not even a choice for us.
>>>>>> 
>>>>>> If Torque/Maui can do it, I am sure OGE can do it without issues.
>>>>>> 
>>>>>> Can someone please tell me what patch I need to install to un-break /
>>>>>> turn-on Parallel job suspension?
>>>>>> 
>>>>>> If you guys are that paranoid about PE suspension, how about adding an
>>>>>> on/off flag for this since the code is already there and let the admin 
>>>>>> pick?
>>>>>> 
>>>>>> 
>>>>>> On 06/12/2012 06:52 AM, Dave Love wrote:
>>>>>>> 
>>>>>>> "Joseph A. Farran"<[email protected]>  writes:
>>>>>>> 
>>>>>>>> If you guys are taking requests, *please* add suspension and ignore old
>>>>>>>> Sun recommendation.
>>>>>>> 
>>>>>>> Support for suspension exists, it's just broken (per the issue Reuti
>>>>>>> pointed to).  The use of | is clearly wrong, but the other bit isn't
>>>>>>> clear.  It's one of the available patches I wanted to understand before
>>>>>>> applying (and had forgotten about).  Can anyone cast more light on it?
> 
> 
> -- 
> Vorstandsvorsitzender/Chairman of the board of management:
> Gerd-Lothar Leonhart
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Philippe Miltin
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

