On Tue, 16 Apr 2013 17:40:12 +0200 Reuti Reuti wrote:

Hi Reuti,
I think I'm starting to understand how it works. :-) (As queuewise preemption doesn't fit my needs, I've moved to slotwise preemption.)

My problem was the complex_values definitions (slots & virtual_free) on the nodes. I used to rely on them, and I saw no reference to them in any example. When I added preemption, I was not aware that once the complex_values limits are reached, preemption is not evaluated. That's what was happening in my previous configuration: only 8 slots were allowed on my node, so new jobs could never start. So I must relax those limits or remove them (at least at host level). Thanks Reuti, without your answers I'd be _more_ lost (if possible).

Conclusion: my current configuration must change if I want to start using preemption. So I've moved the node slots complex to the queue definitions:

  high-queue:
    slots             1,[aracne13=8]
    subordinate_list  slots=4(low-el6:1:sr)

  low-queue:
    slots             4

I submit 16 jobs in low, wait until 4 start, then submit 16 in high: the 4 from low get suspended and 8 from high start. Great!

But I'm facing 2 problems when changing values...

1.-) The subordinate_list slots value must be the same number as the low-queue slots. If not, I get confusing behaviour: I submit 16 jobs in low and none in high, and I get 4 running and 2 suspended:

  475878 0.06387 low abria r 04/17/2013 14:54:23 [email protected] 1
  475879 0.06382 low abria r 04/17/2013 14:54:23 [email protected] 1
  475880 0.06378 low abria r 04/17/2013 14:54:23 [email protected] 1
  475881 0.06373 low abria r 04/17/2013 14:54:23 [email protected] 1
  475882 0.06368 low abria S 04/17/2013 14:54:23 [email protected] 1
  475883 0.06364 low abria S 04/17/2013 14:54:23 [email protected] 1

4 are able to start, and 2 start as suspended... those 2 suspended jobs should never start.

2.-) Requeueing jobs (using transparent_checkpoint). I'm facing the same problem that John had in 2010 (http://www.mentby.com/Group/grid-engine/another-slotwise-preemption-question.html). Suspended (requeued) jobs get rescheduled every scheduler cycle.
It's more or less what happens in case 1: the system detects "empty" slots and tries to push a job there, but then it sees that it's a subordinate slot and suspends (checkpoints and requeues) the job.

In that thread it is said that this should be solved in version 6.2u6. I'm running gridengine-qmaster-2011.11p1-2 (I don't really know which version that corresponds to), so my question is: am I affected by the bug, or have I misconfigured something? I could try adding a slot complex to the queue and sending it to alarm, as you suggested, but if an upgrade solves the issue, I'll go that way. Then I'll have to play with virtual_free, because I will face the same issue as when I defined host slots.

*Reuti, do you know where the docs you mention in the above link can be found now?

http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
http://gridengine.sunsource.net/howto/checkointing.html

TIA,
Arnau
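A toy model of the slot arithmetic in problem 1 may help when reasoning about it. This is only a sketch of how the slotwise threshold in subordinate_list appears to behave from the observations above, not Grid Engine's scheduler code. The function name and its arguments are hypothetical, and the second call assumes a threshold of 4 with 6 low slots, since the exact values used in problem 1 are not given.

```python
def slotwise_state(threshold, high_running, low_running):
    """Return (low_still_running, low_suspended) for a single host.

    Assumed rule (an illustration, not the real scheduler): whenever the
    jobs started in the high queue plus those started in the low queue
    exceed `threshold`, low-queue jobs are suspended until the total is
    back at the threshold or no low-queue job is left to suspend.
    """
    excess = max(0, high_running + low_running - threshold)
    suspended = min(low_running, excess)
    return low_running - suspended, suspended


# Working case: threshold 4, low queue has 4 slots, 8 high jobs start.
print(slotwise_state(4, 8, 4))  # (0, 4)

# Problem 1 (assumed values): threshold 4, low queue offers 6 slots,
# no high jobs. The scheduler fills all 6 slots, then 2 get suspended.
print(slotwise_state(4, 0, 6))  # (4, 2)
```

Under this model, keeping the low-queue slot count equal to the threshold is what prevents low jobs from being started only to be suspended immediately.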
