Stuart Barkley <[email protected]> writes:

>> The only tricky thing is using the right qsub parameters to get
>> exclusive access to the node, either by submitting a parallel job
>> that uses all the slots, or exclusive=true.  I need to make a
>> specific admin queue across all the nodes.  You may want to worry
>> about reservation and bumping up the priority too.
>
> This is where things do seem to start to get more complex and require
> specific site setup to support.

Yes, but I think the admin queue could be a convention which shouldn't
disrupt existing configs much.  Maybe I should make an example with that
premise.

>> and the guts of my reboot job are just
>>
>>   /usr/bin/sudo /sbin/service sgeexecd softstop
>>   /usr/bin/sudo /sbin/reboot
>
> Also where things can get tricky:  If you stop sgeexecd does the
> script keep running long enough to do the reboot?

`softstop' kills execd, not the shepherd, so the job script keeps running.
It's there to stop admin reports of failed jobs.  (reboot can hang if
filesystems can't be unmounted -- I tried an IPMI reset instead, but that
was unreliable for some reason I didn't debug.)
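
For illustration, here is a minimal sketch of how those two lines might
sit in a job script.  The DRY_RUN switch and the run() helper are
hypothetical additions (not part of my actual job) so the script can be
tried without touching a node; the sudo paths are the ones quoted above
and assume suitable sudoers entries exist.

```shell
#!/bin/sh
# Sketch of a reboot job body.  DRY_RUN and run() are hypothetical
# helpers; by default commands are only printed, not executed.

DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reboot_node() {
    # Stop execd first so the reboot isn't reported as a failed job;
    # the shepherd, and hence this script, keeps running.
    run /usr/bin/sudo /sbin/service sgeexecd softstop
    # Then reboot.  This can hang if filesystems can't be unmounted.
    run /usr/bin/sudo /sbin/reboot
}

reboot_node
```

Setting DRY_RUN=0 in the job's environment would make it execute the
commands for real.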

> Does SGE see the
> job finish before the node reboots?  Does SGE see the jobs die due to
> the reboot (hopefully it doesn't restart the reboot job).

SGE doesn't need to see the job finish.  Obviously the job shouldn't be
restartable.

> We are still in process of developing our compute image.  Every couple
> of weeks (becoming less often) I need to do a global reboot across
> 100+ SGE nodes.

Same here, but not very frequently.  I just submit a reboot job for
each node.
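
As a rough sketch of what "a reboot job for each" could look like, the
script below only prints one qsub command line per node name given as
an argument.  The queue name admin.q and the script name reboot.sh are
hypothetical, and `-l exclusive=true' assumes the site has defined an
`exclusive' complex; adjust or drop those to suit.

```shell
#!/bin/sh
# Print (don't run) a qsub command line for each node named on the
# command line.  admin.q and reboot.sh are hypothetical names.

ADMIN_QUEUE=${ADMIN_QUEUE:-admin.q}
REBOOT_SCRIPT=${REBOOT_SCRIPT:-reboot.sh}

reboot_job_cmd() {
    node=$1
    # -l hostname=...    pin the job to this node
    # -l exclusive=true  needs a site-defined "exclusive" complex
    # -R y               ask for a reservation so the job isn't starved
    # -r n               mark the job as not rerunnable
    echo "qsub -N reboot-$node -q $ADMIN_QUEUE -l hostname=$node -l exclusive=true -R y -r n $REBOOT_SCRIPT"
}

for node in "$@"; do
    reboot_job_cmd "$node"
done
```

Saved under some name of your choosing, its output can be inspected
first and then piped to sh to submit the jobs.  Raising the priority
with `-p' is left out here since that needs operator or manager rights.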

> The following two lines (occur several places) are probably site
> specific.  Is 'compute' a SGE attribute or something to do with a site
> 'genders?' system?

It's in genders <https://computing.llnl.gov/linux/genders.html>.  All it
does here is convert between node numbers and names without having to
hard-wire node name prefixes into the script for different clusters.
You could use node names directly.
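
To show the idea (this is not the code from my script), the sketch
below gets node names either from genders or from a hard-wired prefix.
`nodeattr -n' is the genders query that lists matching nodes one per
line; the prefix `node' and the two-digit numbering in the fallback are
assumptions for the example.

```shell
#!/bin/sh
# Two ways to get a list of node names.  names_from_prefix and its
# naming scheme are hypothetical; names_from_genders needs genders.

names_from_genders() {
    # List all nodes carrying the given genders attribute, one per line.
    nodeattr -n "$1"
}

names_from_prefix() {
    # Build names from a prefix and a range of plain integers,
    # e.g. "node 1 3" gives node01, node02, node03.
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        printf '%s%02d\n' "$prefix" "$i"
        i=$((i + 1))
    done
}
```

With genders installed you would call `names_from_genders compute';
without it, something like `names_from_prefix node 1 100'.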

> I need to study 'limit' a lot more to understand all the functions and
> interactions.

There's not much to the RQS rule.  sge_resource_quota(5) needs
re-writing, but in the meantime Reuti suggests reading the RFE document:
http://arc.liv.ac.uk/repos/darcs/sge/doc/devel/rfe/ResourceQuotaSpecification.html