Re: [gridengine users] SGE-6.2u5: Suspend/resume problem of large parallel jobs.

Reuti Mon, 14 Feb 2011 08:08:32 -0800

On 14.02.2011 at 13:33, Erik Soyez wrote:

> Thanks Reuti for your quick-as-always response!
> 
> Workaround b) is not an option this time because, first, memory/swap/IO
> load will still slow down the (always very urgent) superordinate jobs (I
> expect nice-19 jobs handled by the kernel to still do something, while
> jobs truly suspended by gridengine won't - or am I wrong?)


Correct. Whether you can accept the impact of a job running at nice 19 depends 
on your applications.


> and second
> the nice values (queue priority) are normally not propagated to the
> mpi slave node processes even with tight integration (or am I wrong
> again?).
> 
> Workaround a) is something I thought about, but it would have to be
> implemented as admin prolog/epilog scripts that identify the
> corresponding jobs, and there would still be a race condition due to
> new jobs.  I'm not sure how reliably this can be implemented - by the
> way, will epilog scripts be executed at "qdel"?

Yes, they will be executed. But even then the script would have to check what 
is happening on the other nodes of any parallel job it finds running on the 
local machine.

Maybe an external cron-job does it in a safer way:

- get a list of all jobs in the long queue
- for each job check, whether there is anything running on its used exechosts 
in the standard queue
- suspend/unsuspend as desired; a double suspend/unsuspend shouldn't do any harm
(so the last state needn't be recorded, but you could also check the actual 
state to avoid unnecessary calls to `qmod`)
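
A minimal sketch of the decision step of such a cron job, in Python. It is only
an illustration under assumptions: the function names `plan_actions` and
`apply_actions` are made up here, and how the job-to-host mapping, the list of
hosts busy in the standard queue and the set of suspended jobs are gathered
(e.g. by parsing `qstat` output) is left out. Only `qmod -sj` and `qmod -usj`
are taken as given.

```python
import subprocess


def plan_actions(long_jobs, busy_hosts, suspended):
    """Decide which long-queue jobs to suspend or unsuspend.

    long_jobs  -- dict: job id -> set of exec hosts the job runs on
    busy_hosts -- set of hosts with something running in the standard queue
    suspended  -- set of long-queue job ids that are currently suspended
    (All three inputs are assumed to be collected elsewhere.)
    """
    actions = []
    for job_id, hosts in sorted(long_jobs.items()):
        conflict = bool(hosts & busy_hosts)
        if conflict and job_id not in suspended:
            actions.append(("-sj", job_id))    # suspend the whole job
        elif not conflict and job_id in suspended:
            actions.append(("-usj", job_id))   # unsuspend the whole job
    return actions


def apply_actions(actions, dry_run=True):
    """Print the qmod calls, or run them when dry_run is False."""
    for flag, job_id in actions:
        cmd = ["qmod", flag, str(job_id)]
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Hypothetical example data: job 101 spans two hosts, job 102 one host.
    jobs = {"101": {"node1", "node2"}, "102": {"node3"}}
    apply_actions(plan_actions(jobs, busy_hosts={"node2"}, suspended=set()))
```

Checking the current state first, as done here, only saves calls to `qmod`;
dropping the `suspended` argument and always issuing the call would follow the
simpler variant described in the list.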

In principle it could also be put in the global load sensor.

-- Reuti


> Erik Soyez.
> 
> 
> On Mon, 14 Feb 2011, Reuti wrote:
> 
>> On 14.02.2011 at 12:31, Erik Soyez wrote:
>> 
>>> Good day,
>>> 
>>> we have a major problem with the subordinate queue mechanism in 6.2u5:
>>> 
>>> Setup:      o queues "long" & "standard"
>>>             o queue "long" is subordinate to queue "standard"
>>>             ("subordinate_list  long=1")
>>> 
>>> When a "long" job is spread over two hosts (e.g. 24 slots, 12 each)
>>> it gets suspended by "standard" jobs on one of these hosts as expected.
>>> When a second "standard" job starts on the other host and then the
>>> first "standard" job finishes, the "long" job gets resumed and the
>>> second host is overloaded with two jobs.
>>> 
>>>     o Is this a known problem?
>>>     o Are there any patches available anywhere?
>>>     o Are there any workarounds?
>>>     o Does anybody know, in which older SGE versions this works
>>>       as expected?  (last time we used it was with 6.0u6....)
>> 
>> Confirmed. It seems to happen when a job is running on the
>> machine that didn't trigger the suspension. As long as new jobs
>> are scheduled to the node which triggered the suspension, all is fine.
>> 
>> Workarounds I see are: a) suspend the parallel jobs by hand
>> `qmod -sj ...`, b) run jobs in the long queue with a nice value of 19
>> (queue_config priority) and don't suspend them at all.
> 
> 
> --
> -- 
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Michel Lepert
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


