On 14.02.2011 at 12:31, Erik Soyez wrote:

> Good day,
> 
> we have a major problem with the subordinate queue mechanism in 6.2u5:
> 
> Setup:
>       o queues "long" & "standard"
>       o queue "long" is subordinate to queue "standard"
>               ("subordinate_list  long=1")
> 
> When a "long" job is spread over two hosts (e.g. 24 slots, 12 each)
> it gets suspended by "standard" jobs on one of these hosts as expected.
> When a second "standard" job starts on the other host and then the
> first "standard" job finishes, the "long" job gets resumed and the
> second host is overloaded with two jobs.
> 
>       o Is this a known problem?
>       o Are there any patches available anywhere?
>       o Are there any workarounds?
>       o Does anybody know in which older SGE versions this works
>         as expected?  (last time we used it was with 6.0u6...)

Confirmed. It seems to happen when a job is running on a machine that did
not trigger the suspension. As long as new jobs are scheduled to the node
that triggered the suspension, all is fine.
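
To make the sequence easier to follow, here is a toy model in Python. It is only a sketch of the behaviour described in this thread, not Grid Engine's actual logic: the host names `node1`/`node2`, the function `long_job_state` and the two rule names `"observed"` and `"expected"` are all made up for illustration.

```python
# Toy model of the reported scenario. NOT Grid Engine code: host names,
# function name and both resume rules are illustrative assumptions.

def long_job_state(events, rule):
    """Replay "standard" job events and report the state of the "long" job.

    events: list of (action, host), action is "start" or "finish"
    rule:   "observed" -> resume once the host that triggered the
                          suspension has no "standard" job left
            "expected" -> resume only when no host runs a "standard" job
    """
    standard_jobs = {}    # host -> number of running "standard" jobs
    suspended = False
    trigger_host = None   # host whose "standard" job caused the suspension

    for action, host in events:
        running = standard_jobs.get(host, 0)
        if action == "start":
            standard_jobs[host] = running + 1
            if not suspended:
                suspended, trigger_host = True, host
        elif action == "finish":
            standard_jobs[host] = max(running - 1, 0)
            if suspended:
                if rule == "observed":
                    may_resume = standard_jobs.get(trigger_host, 0) == 0
                elif rule == "expected":
                    may_resume = all(n == 0 for n in standard_jobs.values())
                else:
                    raise ValueError(f"unknown rule: {rule}")
                if may_resume:
                    suspended, trigger_host = False, None
        else:
            raise ValueError(f"unknown action: {action}")
    return "suspended" if suspended else "running"


# The sequence from the report: "standard" job on node1, a second one on
# node2, then the first one finishes while the second is still running.
scenario = [("start", "node1"), ("start", "node2"), ("finish", "node1")]

print(long_job_state(scenario, "observed"))  # running (node2 is overloaded)
print(long_job_state(scenario, "expected"))  # suspended
```

With the `"observed"` rule the "long" job comes back while node2 still runs a "standard" job, which is the overload from the report. If all "standard" jobs land on node1, both rules agree.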

Workarounds I see are: a) suspend the parallel jobs by hand with `qmod -sj ...`,
or b) run jobs in the long queue with a nice value of 19 (the `priority` entry
in the queue configuration) and don't suspend them at all.
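
A rough sketch of what the two workarounds could look like as commands, written as small Python helpers that only build the argument lists and do not run anything. `qmod -sj`/`-usj` and `qconf -rattr` are existing Grid Engine options, but the helper names, the job id `4711` and the choice of `-rattr` to set `priority` and to clear `subordinate_list` are my assumptions; please check them against qmod(1), qconf(1) and queue_conf(5) for your version before using them.

```python
import shlex

# Helpers that only BUILD command lines; nothing is executed here.
# Helper names and the example job id are hypothetical.

def suspend_job_cmd(job_id):
    """Workaround a): suspend the parallel "long" job by hand."""
    return ["qmod", "-sj", str(job_id)]

def unsuspend_job_cmd(job_id):
    """Counterpart to a): resume the job once the hosts are free again."""
    return ["qmod", "-usj", str(job_id)]

def set_queue_priority_cmd(queue="long", nice=19):
    """Workaround b), part 1: run the queue's jobs at nice value 19.
    Assumes `qconf -rattr queue <attr> <value> <queue>` fits your setup."""
    return ["qconf", "-rattr", "queue", "priority", str(nice), queue]

def drop_subordination_cmd(queue="standard"):
    """Workaround b), part 2: stop suspending by clearing the
    subordinate_list of the superordinate queue (assumed: NONE)."""
    return ["qconf", "-rattr", "queue", "subordinate_list", "NONE", queue]

if __name__ == "__main__":
    for cmd in (
        suspend_job_cmd(4711),
        unsuspend_job_cmd(4711),
        set_queue_priority_cmd(),
        drop_subordination_cmd(),
    ):
        # Print a copy-and-paste friendly command line for review.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first instead of running them leaves room to review each one; anyone who wants to execute them could pass the lists to `subprocess.run`.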

-- Reuti 


> Many thanks, Erik Soyez.
> 
> 
> -- 
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Michel Lepert
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

