Stuart Barkley <[email protected]> writes:

>> The only tricky thing is using the right qsub parameters to get
>> exclusive access to the node, either by submitting a parallel job
>> that uses all the slots, or exclusive=true.  I need to make a
>> specific admin queue across all the nodes.  You may want to worry
>> about reservation and bumping up the priority too.
>
> This is where things do seem to start to get more complex and require
> specific site setup to support.
Yes, but I think the admin queue could be a convention which shouldn't
disrupt existing configs much.  Maybe I should make an example with
that premise.

>> and the guts of my reboot job are just
>>
>>   /usr/bin/sudo /sbin/service sgeexecd softstop
>>   /usr/bin/sudo /sbin/reboot
>
> Also where things can get tricky: If you stop sgeexecd does the
> script keep running long enough to do the reboot?

`softstop' kills execd, not the shepherd.  It's there to stop admin
reports of failed jobs.  (reboot can hang if filesystems can't be
unmounted for some reason -- I tried an IPMI reset, but that was
unreliable for some reason I didn't debug.)

> Does SGE see the
> job finish before the node reboots?  Does SGE see the jobs die due to
> the reboot (hopefully it doesn't restart the reboot job).

It doesn't need to see the job finish.  Obviously it shouldn't be
restartable.

> We are still in process of developing our compute image.  Every couple
> of weeks (becoming less often) I need to do a global reboot across
> 100+ SGE nodes.

Same here, but not very frequently.  I just submit a reboot job for
each.

> The following two lines (occur several places) are probably site
> specific.  Is 'compute' a SGE attribute or something to do with a site
> 'genders?' system?

It's in genders <https://computing.llnl.gov/linux/genders.html>.  All
it does here is convert between node numbers and names without having
to hard-wire node name prefixes into the script for different
clusters.  You could use node names directly.

> I need to study 'limit' a lot more to understand all the functions and
> interactions.

There's not much to the RQS rule.  sge_resource_request(5) needs
re-writing, but Reuti suggests the RFE document
http://arc.liv.ac.uk/repos/darcs/sge/doc/devel/rfe/ResourceQuotaSpecification.html
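
For illustration, here is a minimal dry-run sketch of the "submit a
reboot job for each node" step.  The queue name admin.q, the script
name reboot-job.sh and the function names are hypothetical, and
-l exclusive=true assumes the site has defined an "exclusive" complex.
It only prints the qsub commands, so they can be checked before
anything is actually submitted.

```shell
#!/bin/sh
# Dry-run sketch: print one qsub command per node instead of running it.
# ADMIN_QUEUE and REBOOT_SCRIPT are hypothetical names; adjust per site.
ADMIN_QUEUE=${ADMIN_QUEUE:-admin.q}
REBOOT_SCRIPT=${REBOOT_SCRIPT:-reboot-job.sh}

reboot_cmd() {
    # $1: execution host name
    # -q / -l hostname : pin the job to the admin queue on that host
    # -l exclusive=true: assumes a site-defined "exclusive" complex
    # -r n             : the job must not be restarted after the reboot
    # -R y             : ask the scheduler for a reservation
    printf 'qsub -N reboot-%s -q %s -l hostname=%s -l exclusive=true -r n -R y %s\n' \
        "$1" "$ADMIN_QUEUE" "$1" "$REBOOT_SCRIPT"
}

submit_reboots() {
    # Emit one command line per node name given as an argument.
    for node in "$@"; do
        reboot_cmd "$node"
    done
}
```

Calling "submit_reboots node01 node02" prints two command lines; piping
that output to sh would do the real submission.  The node names could
come from genders or be listed by hand.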
