Hi,

On 14.06.2012 at 17:36, Sabine Kreidl wrote:

> Thanks very much for all the suggestions, and sorry for my late follow-up. I
> have to admit that adding disabled nodes to the AR has its advantages, e.g.
> for maintenance windows. It would be a very nice feature, though, if one
> could specify the desired behavior (respecting disabled nodes vs. omitting
> them from the AR) with an option to qrsub, as a potential RFE? :-)

There is already an RFE for this:

https://arc.liv.ac.uk/trac/SGE/ticket/770

You can extend this if you like.

-- Reuti


> I recently ran into a (new) problem with a waiting AR for a maintenance
> window. Admittedly, this system runs SGE 6.2u3, so this may be a known issue
> that is already resolved in newer versions:
> 
> We have two queues, only one of which, par.q, accepts parallel jobs, i.e. is
> associated with our defined PEs. The AR, submitted via
> qrsub -u XXX,YYY -a 07051000 -e 07091000 -pe openmpi-* 1008
> was granted within par.q (default job runtimes are 10 days, so we still have
> plenty of time).
> All of a sudden, the available slots for all instances of par.q were set to 0
> and no parallel jobs were scheduled anymore. Accordingly, "qstat -g c" showed
> a negative count of available slots in par.q (some parallel jobs were still
> running). Since I suspected the AR, I deleted it, but a master restart was
> necessary before the default 8 cores per queue instance were recognized again.
> 
> Does anyone have experience with such behavior, and maybe some suggestions
> on how to avoid the problem?
> 
> Thanks again and best regards,
> Sabine
> 
> 
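The negative slot count described above can be checked for automatically. What follows is only a minimal sketch, assuming the "qstat -g c" header contains an AVAIL column (as it does in SGE 6.2); the function name negative_avail is made up for illustration and is not part of SGE.

```shell
# Print every cluster queue whose AVAIL value in `qstat -g c` output is negative.
# Assumption: the first line is the header and contains the word AVAIL.
# The header label "CLUSTER QUEUE" is two words, so data rows hold AVAIL
# one field earlier than the header does.
negative_avail() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++)
                if ($i == "AVAIL") col = i - 1
            next
        }
        # Skip the dashed separator line, then report negative values.
        col && $1 !~ /^-+$/ && ($col + 0) < 0 { print $1, $col }
    '
}
```

It reads the summary from standard input, so on a submit host it would be called as "qstat -g c | negative_avail". An empty result means no queue reports a negative count.
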
> On 16.02.2012 at 01:06, Dave Love wrote:
>> William Hay <[email protected]> writes:
>> 
>>> We have a complex associated with every node called status that is
>>> normally set to OK.  When a node has a problem we set it to a
>>> description of the problem instead.   Our JSV ensures jobs always
>>> request status=OK.  With a similar complex you could request status=OK
>>> when making the AR.
>>> 
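The status complex described above could be set up roughly as follows. This is a sketch under assumptions, not a tested recipe: the complex definition line and the host name node01 are placeholders, and the complex name status is taken from the description above.

```shell
# 1. Define the complex. `qconf -mc` opens the complex list in an editor;
#    a line like the following would be added there (assumed values).
#    Columns: name  shortcut  type      relop  requestable  consumable  default  urgency
#             status status   RESTRING  ==     YES          NO          NONE     0

# 2. Mark a healthy execution host (node01 is a placeholder host name).
qconf -mattr exechost complex_values status=OK node01

# 3. Request the complex when creating the reservation, so that hosts whose
#    status is not OK are left out. The PE pattern is quoted so the shell
#    does not expand the wildcard.
qrsub -a 07051000 -e 07091000 -pe 'openmpi-*' 1008 -l status=OK
```

A host with a problem would get a different value in step 2, for example a short description of the fault, so that it no longer matches the status=OK request.
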
>> Yes, I think that's the only solution currently for disabled queues, but
>> I'd guess it's straightforward to avoid them as an option if someone
>> would like to try.  We don't currently use AR, so I haven't looked at
>> it.
>> 
>> 
>>> We also have a script that lists the nodes that aren't OK, along with
>>> their status. This essentially duplicates the functionality of pbsnodes
>>> under Torque. With this available as a permanent way to disable nodes,
>>> we've set queues to enabled at startup and use qmod -d to mean "disabled
>>> until next reboot" only.
>>> 
>> I tag bad nodes with a comment and put them into a "testing" hostgroup
>> with access only for admins (via RQS, which will be ignored for AR for a
>> reason I don't follow).  I think if node user_lists were used instead of
>> the RQS to restrict access, an AR would exclude the bad nodes for
>> non-admins, but I'm not sure.
>> _______________________________________________
>> users mailing list
>> 
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users