Thanks Reuti for your quick-as-always response!

Workaround b) is not an option this time. First, memory, swap and I/O
load from the "long" jobs will still slow down the (always very urgent)
superordinate jobs (I expect nice-19 jobs scheduled by the kernel to
still do some work, while jobs truly suspended by Grid Engine won't -
or am I wrong?). Second, the nice values (queue priority) are normally
not propagated to the MPI slave node processes, even with tight
integration (or am I wrong again?).

Workaround a) is something I thought about, but it would have to be
implemented as admin prolog/epilog scripts that identify the
corresponding jobs, and there would still be a race condition with
newly started jobs.  I'm not sure how reliably this can be
implemented - by the way, will epilog scripts be executed on "qdel"?
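
A minimal sketch of the decision such prolog/epilog scripts would have to make, assuming the script already knows which hosts each "long" job occupies, which hosts currently run "standard" jobs, and which "long" jobs it has suspended earlier. The function name and all three inputs are hypothetical; gathering them (for example from qstat output) is left out, and that is exactly where the race condition with newly started jobs would sit. `qmod -sj` is the command from the workaround, and `qmod -usj` is its unsuspend counterpart. The sketch only builds the command lines and does not run them.

```python
def plan_qmod(long_jobs, busy_hosts, suspended):
    """Return the qmod command lines needed to bring "long" jobs in line.

    long_jobs  -- dict: job id -> set of hosts the "long" job runs on
    busy_hosts -- set of hosts that currently run a "standard" job
    suspended  -- set of "long" job ids that are already suspended
    (all three inputs are assumptions of this sketch)
    """
    commands = []
    for job_id, hosts in sorted(long_jobs.items()):
        overlaps = bool(hosts & busy_hosts)
        if overlaps and job_id not in suspended:
            # A "standard" job shares at least one host: suspend the whole job.
            commands.append(["qmod", "-sj", str(job_id)])
        elif not overlaps and job_id in suspended:
            # No shared host is busy any more: the job may be resumed.
            commands.append(["qmod", "-usj", str(job_id)])
    return commands


if __name__ == "__main__":
    jobs = {101: {"hostA", "hostB"}, 102: {"hostC"}}
    for cmd in plan_qmod(jobs, busy_hosts={"hostB"}, suspended=set()):
        print(" ".join(cmd))
```

The prolog of a "standard" job would call this with its own host already included in `busy_hosts`, the epilog with that host removed if no other "standard" job remains there.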

Erik Soyez.


On Mon, 14 Feb 2011, Reuti wrote:

On 14.02.2011 at 12:31, Erik Soyez wrote:

Good day,

we have a major problem with the subordinate queue mechanism in 6.2u5:

Setup:  o queues "long" & "standard"
        o queue "long" is subordinate to queue "standard"
          ("subordinate_list  long=1")

When a "long" job is spread over two hosts (e.g. 24 slots, 12 each)
it gets suspended by "standard" jobs on one of these hosts as expected.
When a second "standard" job starts on the other host and then the
first "standard" job finishes, the "long" job gets resumed and the
second host is overloaded with two jobs.

        o Is this a known problem?
        o Are there any patches available anywhere?
        o Are there any workarounds?
        o Does anybody know, in which older SGE versions this works
          as expected?  (last time we used it was with 6.0u6....)

Confirmed. It seems to happen when a job is running on the machine
which didn't trigger the suspension. As long as new jobs are
scheduled to the node which triggered the suspension, all is fine.
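
To make the expected behaviour concrete, the scenario from the report can be replayed in a small model. This is only a sketch of the rule one would expect from "subordinate_list  long=1" (the "long" job stays suspended while any of its hosts runs at least one "standard" job); it is not Grid Engine's actual implementation, and the host names and function name are made up for illustration.

```python
from collections import defaultdict


def expected_long_state(long_hosts, events):
    """Replay (action, host) events of "standard" jobs.

    Returns the state one would expect for the "long" job after each
    event: "suspended" while any of its hosts runs a "standard" job.
    """
    running = defaultdict(int)  # host -> number of running "standard" jobs
    states = []
    for action, host in events:
        running[host] += 1 if action == "start" else -1
        busy = any(running[h] > 0 for h in long_hosts)
        states.append("suspended" if busy else "running")
    return states


if __name__ == "__main__":
    # The sequence from the report: "standard" job on the first host,
    # a second one on the other host, then the first one finishes.
    events = [("start", "hostA"), ("start", "hostB"), ("end", "hostA")]
    print(expected_long_state({"hostA", "hostB"}, events))
```

Under this rule the "long" job is still expected to be suspended after the third event, whereas the reported behaviour is that it gets resumed at that point.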

Workarounds I see are: a) suspend the parallel jobs by hand with
`qmod -sj ...`, b) run jobs in the long queue with a nice value of 19
(queue_config priority) and don't suspend them at all.


