On 14.02.2011 at 13:33, Erik Soyez wrote:
> Thanks Reuti for your quick-as-always response!
>
> Workaround b) is no option this time, because first memory-/swap-/io
> will still slow down the (always very urgent) superordinate jobs (I
> expect kernel suspended nice-19 jobs to still do something while
> truly gridengine suspended jobs won't - or am I wrong?)
Correct. Whether you can accept the impact of a job running at nice 19
depends on your applications.
> and second
> the nice values (queue priority) are normally not propagated to the
> mpi slave node processes even with tight integration (or am I wrong
> again?).
>
> Workaround a) is something I thought about, but must be implemented
> as admin prolog/epilog scripts that identify the corresponding jobs
> with still some race condition due to new jobs. I'm not sure how
> reliable this can be implemented - by the way, will epilog scripts
> be executed at "qdel"?
Yes, they will be executed. But even then the script would have to check what
is happening on the other nodes of any parallel job it finds on the local
machine besides its own job.
Maybe an external cron-job does it in a safer way:
- get a list of all jobs in the long queue
- for each job, check whether there is anything running on its used exechosts
in the standard queue
- suspend/unsuspend as desired, double suspend/unsuspend shouldn't do any harm
(so the last state needn't be recorded, but you could also check the actual
state to avoid unnecessary calls to `qmod`)
In principle it could also be put in the global load sensor.
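The cron-job steps above could be sketched roughly as follows in Python. This is only an untested outline under several assumptions: that `qstat -u '*' -g t -xml` lists one `job_list` element per job and host with `JB_job_number`, `queue_name` (as "queue@host") and `state` children, that a job suspended via `qmod` carries a lower-case `s` in its state, and that array tasks can be ignored. The function names are made up for illustration; only the queue names "long" and "standard" come from this thread.

```python
import subprocess
import xml.etree.ElementTree as ET

LONG_QUEUE = "long"          # subordinate queue from this thread
STANDARD_QUEUE = "standard"  # superordinate queue from this thread


def parse_qstat_xml(xml_text):
    """Return (job_id, queue, host, state) tuples from qstat XML output.

    Assumes each job_list element has JB_job_number, queue_name
    ("queue@host") and state children; verify against your installation.
    """
    records = []
    for job in ET.fromstring(xml_text).iter("job_list"):
        queue_name = job.findtext("queue_name") or ""
        if "@" not in queue_name:
            continue  # pending jobs have no queue instance yet
        queue, host = queue_name.split("@", 1)
        records.append((job.findtext("JB_job_number"), queue, host,
                        job.findtext("state") or ""))
    return records


def plan_actions(records):
    """Map job id -> qmod flag ("-sj" or "-usj") for long jobs needing a change."""
    # Hosts that currently run anything in the standard queue.
    busy_hosts = {host for _, queue, host, _ in records
                  if queue == STANDARD_QUEUE}
    long_jobs = {}
    for job_id, queue, host, state in records:
        if queue == LONG_QUEUE:
            entry = long_jobs.setdefault(
                job_id, {"hosts": set(), "suspended": False})
            entry["hosts"].add(host)
            # Assumption: "s" in the state marks a qmod-suspended job.
            entry["suspended"] = entry["suspended"] or "s" in state
    actions = {}
    for job_id, entry in long_jobs.items():
        should_suspend = bool(entry["hosts"] & busy_hosts)
        if should_suspend and not entry["suspended"]:
            actions[job_id] = "-sj"
        elif not should_suspend and entry["suspended"]:
            actions[job_id] = "-usj"
    return actions


def run_once():
    """One pass, intended to be called from cron; needs qstat/qmod in PATH."""
    xml_text = subprocess.run(
        ["qstat", "-u", "*", "-g", "t", "-xml"],
        capture_output=True, text=True, check=True).stdout
    for job_id, flag in plan_actions(parse_qstat_xml(xml_text)).items():
        subprocess.run(["qmod", flag, job_id], check=True)
```

Checking the actual state before calling `qmod` is optional, as noted above; dropping the `suspended` test would simply issue a redundant suspend or unsuspend on each run.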
-- Reuti
> Erik Soyez.
>
>
> On Mon, 14 Feb 2011, Reuti wrote:
>
>> On 14.02.2011 at 12:31, Erik Soyez wrote:
>>
>>> Good day,
>>>
>>> we have a major problem with the subordinate queue mechanism in 6.2u5:
>>>
>>> Setup:
>>> o queues "long" & "standard"
>>> o queue "long" is subordinate to queue "standard"
>>>   ("subordinate_list long=1")
>>>
>>> When a "long" job is spread over two hosts (e.g. 24 slots, 12 each)
>>> it gets suspended by "standard" jobs on one of these hosts as expected.
>>> When a second "standard" job starts on the other host and then the
>>> first "standard" job finishes, the "long" job gets resumed and the
>>> second host is overloaded with two jobs.
>>>
>>> o Is this a known problem?
>>> o Are there any patches available anywhere?
>>> o Are there any workarounds?
>>> o Does anybody know, in which older SGE versions this works
>>> as expected? (last time we used it was with 6.0u6....)
>>
>> Confirmed. It seems to happen when a job is running on the
>> machine which didn't trigger the suspension. As long as new jobs
>> are scheduled to the node which triggered the suspension, all is fine.
>>
>> Workarounds I see are: a) suspend the parallel jobs by hand
>> `qmod -sj ...`, b) run jobs in the long queue with a nice value of 19
>> (queue_config priority) and don't suspend them at all.
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users