On Tue, 16 Apr 2013 17:40:12 +0200
Reuti wrote:

Hi Reuti,

I think I'm starting to understand how it works. :-)
(As queue-wise preemption doesn't fit my needs, I've moved to slot-wise
preemption.)
My problem was the node complex_values (slots & virtual_free)
definitions. I used to work based on those, and I saw no reference to
them in any example.
Once I added preemption, I was not aware that when the complex_values
limits are reached, preemption is not evaluated. That's what was
happening in my previous configuration: only 8 slots were allowed on my
node, so new jobs could never start because of that. So I must relax
those limits or remove them (at least at host level). Thanks Reuti,
without your answers I'd be _more_ lost (if possible).
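
In case it helps anyone else, this is roughly what I removed from the
exec host (just a sketch; the virtual_free value is only a placeholder
for what I had there):

  # the consumables were defined on the exec host itself:
  $ qconf -se aracne13
  hostname              aracne13
  complex_values        slots=8,virtual_free=16G   <- this slots=8 was the limit
  ...
  # edit the host and drop (or raise) the slots entry so slot-wise
  # preemption gets a chance to run:
  $ qconf -me aracne13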


Conclusion: my current conf must change if I want to start using
preemption.

So I've moved the node slots complex to the queue definition:

high-queue:
slots                 1,[aracne13=8]
subordinate_list      slots=4(low-el6:1:sr)


low-queue:
slots                   4
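
For anyone finding this thread later, my reading of that
subordinate_list line (from queue_conf(5); please correct me if I got a
field wrong) is:

  subordinate_list      slots=4(low-el6:1:sr)
  #  slots=4  -> threshold: start suspending low jobs once more than 4
  #              slots are in use on the host by both queues together
  #  low-el6  -> the subordinated queue (my low queue)
  #  1        -> sequence number of that queue in the list
  #  sr       -> which job to pick for suspension; as far as I
  #              understand, "sr" = shortest run time, "lr" = longest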


I submit 16 jobs in low, wait till 4 start, submit 16 in high, and the
4 from low get suspended while 8 from high start. Great!
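
(For reference, the test itself was just something along these lines;
sleeper.sh is a dummy sleep script and the queue names here are only
placeholders for my real ones:)

  $ for i in $(seq 1 16); do qsub -q low-el6 sleeper.sh; done
  # wait until qstat shows 4 of them in state "r", then:
  $ for i in $(seq 1 16); do qsub -q high-el6 sleeper.sh; done
  $ qstat -u abria     # the 4 low jobs flip to "S", 8 high jobs start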


But I'm facing 2 problems when changing values...


1.-) The subordinate_list slots value must be the same number as the
low-queue slots. If not, I get confusing behaviour:

I submit 16 jobs in low and none in high, and I get 4 running and 2
suspended:

 475878 0.06387 low        abria        r     04/17/2013 14:54:23 [email protected]      1
 475879 0.06382 low        abria        r     04/17/2013 14:54:23 [email protected]      1
 475880 0.06378 low        abria        r     04/17/2013 14:54:23 [email protected]      1
 475881 0.06373 low        abria        r     04/17/2013 14:54:23 [email protected]      1
 475882 0.06368 low        abria        S     04/17/2013 14:54:23 [email protected]      1
 475883 0.06364 low        abria        S     04/17/2013 14:54:23 [email protected]      1

4 are able to start, and 2 start as suspended...
Those 2 suspended jobs should never have started.
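
To make the confusing case concrete: it happens when the two numbers
differ, e.g. with something like the following (the 6 is just an
example value larger than the threshold):

  low-queue:
  slots                 6

  high-queue:
  subordinate_list      slots=4(low-el6:1:sr)

  # 6 low jobs get dispatched, but as soon as more than 4 slots are in
  # use on the host, the slot-wise rule suspends the extra 2, even
  # though nothing is running in high. At least that's my interpretation.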



2.-) Requeueing jobs (using a transparent checkpoint environment). I'm
facing the same problem John had in 2010:
http://www.mentby.com/Group/grid-engine/another-slotwise-preemption-question.html

Suspended (requeued) jobs get rescheduled every scheduler cycle. It's
more or less what happens in case 1: the system detects "empty" slots
and tries to push a job there, but then it sees that it's a subordinate
slot and suspends (checkpoints and requeues) the job.
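
For context, the low jobs are submitted with the checkpoint
environment, roughly like this (I'm calling it transparent_checkpoint
here as a placeholder for my real name; the "x" in its "when" field is
what turns a suspension into a checkpoint + requeue, if I read
checkpoint(5) correctly):

  # environment created with "qconf -ackpt transparent_checkpoint",
  # with "x" (checkpoint when the job gets suspended) in its "when" field
  $ qsub -ckpt transparent_checkpoint -q low-el6 sleeper.sh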


In that thread it is said that this should be solved in version 6.2u6.
I'm running gridengine-qmaster-2011.11p1-2 (and I don't really know
which GE version that corresponds to), so my question: am I affected by
the bug, or have I misconfigured something?
I could try adding a slots complex to the queue and sending it into
alarm state (as you suggested), but if an upgrade solves the issue,
I'll go that way instead.
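
Roughly what I have in mind (please correct me if the load_thresholds
trick below is not what you meant):

  # check which GE version the packages correspond to; the tools print
  # it in the first line of their help output:
  $ qstat -help | head -1

  # possible workaround: put slots into the queue's load_thresholds so
  # the queue goes into alarm state once they are used up, instead of
  # the scheduler (re)dispatching into slots that would only be
  # suspended again:
  $ qconf -mq low-el6        # (or whichever queue you meant)
    load_thresholds       np_load_avg=1.75,slots=4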


Then I'll have to play with virtual_free, because I will face the same
issue as when I defined the host slots.


*Reuti, do you know where the docs you talk about in the above link can
be found now?

http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf 
http://gridengine.sunsource.net/howto/checkointing.html 

TIA,
Arnau