Thanks Reuti for your quick-as-always response!
Workaround b) is not an option this time, for two reasons. First,
memory, swap and I/O load will still slow down the (always very urgent)
superordinate jobs: I expect jobs that the kernel merely runs at nice 19
to still do some work, while jobs truly suspended by Grid Engine won't
(or am I wrong?). Second, the nice values (queue priority) are normally
not propagated to the MPI slave node processes, even with tight
integration (or am I wrong again?).
Workaround a) is something I thought about, but it would have to be
implemented as admin prolog/epilog scripts that identify the
corresponding jobs, and a race condition with newly started jobs would
remain. I'm not sure how reliably this can be implemented. By the way,
will epilog scripts be executed at "qdel"?
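To make the prolog idea concrete, here is a rough, untested sketch of the
helper such a script might use. The function names and the
SUBORDINATE_QUEUE variable are made up for illustration, and the parsing
assumes the default plain-text `qstat` layout (two header lines, job ID in
the first column). Whether the slave tasks of a parallel job show up under
this queue filter would have to be checked on the real cluster (possibly
with `qstat -g t`), and the race with newly started jobs is not addressed.

```shell
#!/bin/sh
# Sketch only, not a tested prolog. Names below are hypothetical.
# Assumption: the subordinate queue is called "long", as in this thread.
SUBORDINATE_QUEUE=${SUBORDINATE_QUEUE:-long}

# Read qstat output on stdin and print the unique job IDs.
# Assumes two header lines and a numeric job ID in column 1.
job_ids_from_qstat() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# Suspend every running job of the subordinate queue on the given host.
suspend_subordinate_jobs() {
    host=$1
    qstat -s r -u '*' -q "${SUBORDINATE_QUEUE}@${host}" |
        job_ids_from_qstat |
        while read -r id; do
            qmod -sj "$id"
        done
}

# Intended call from a prolog of the "standard" queue (not executed here):
#   suspend_subordinate_jobs "$(hostname)"
```

A matching epilog would need the reverse step with `qmod -usj`, but only
once no other "standard" job is left on any host of the parallel job.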
Erik Soyez.
On Mon, 14 Feb 2011, Reuti wrote:
On 14.02.2011 at 12:31, Erik Soyez wrote:
Good day,
we have a major problem with the subordinate queue mechanism in 6.2u5:
Setup:
  o queues "long" & "standard"
  o queue "long" is subordinate to queue "standard"
    ("subordinate_list long=1")
When a "long" job is spread over two hosts (e.g. 24 slots, 12 each),
it gets suspended by "standard" jobs on one of these hosts, as expected.
When a second "standard" job starts on the other host and then the
first "standard" job finishes, the "long" job gets resumed and the
second host is overloaded with two jobs.
o Is this a known problem?
o Are there any patches available anywhere?
o Are there any workarounds?
o Does anybody know, in which older SGE versions this works
as expected? (last time we used it was with 6.0u6....)
Confirmed. It looks like this happens when a job is running on the
machine that didn't trigger the suspension. As long as new jobs
are scheduled to the node that triggered the suspension, all is fine.
Workarounds I see are: a) suspend the parallel jobs by hand
`qmod -sj ...`, b) run jobs in the long queue with a nice value of 19
(queue_config priority) and don't suspend them at all.
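As a rough sketch, the two workarounds could be wrapped like this. The
`run` helper, the DRY_RUN switch and the function names are illustrative
assumptions and not part of Grid Engine; by default the commands are only
printed, so they can be inspected before touching a live cluster.

```shell
#!/bin/sh
# Sketch only. DRY_RUN and the helper names are hypothetical.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Workaround a): suspend and later resume a parallel job by hand.
suspend_job() { run qmod -sj "$1"; }
resume_job()  { run qmod -usj "$1"; }

# Workaround b): set the queue's "priority" (nice value) attribute.
# Usage: set_queue_nice <queue> <nice>
set_queue_nice() { run qconf -mattr queue priority "$2" "$1"; }
```

For example, `suspend_job 4711` (4711 is a placeholder job ID) prints
`qmod -sj 4711` in dry-run mode, and `set_queue_nice long 19` prints the
corresponding `qconf` call. With workaround b) the "long" entry would also
have to be taken out of the subordinate_list of "standard".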