Limiting the slots on exec hosts (some options): http://www.softpanorama.org/HPC/Grid_engine/Resources/slot_limits.shtml
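As a concrete sketch of that first idea (hypothetical host name and slot count; `qconf -aattr` appends an entry to the complex_values list, where `-mattr` would overwrite it):

    # Cap compute-0-1 at 16 slots total, summed over all queues on that host
    qconf -aattr exechost complex_values slots=16 compute-0-1.local

and/or cluster-wide with a resource quota set (`qconf -arqs`), using the host's processor count as a dynamic limit, if I recall the RQS syntax correctly:

    {
       name         slots_per_host
       description  "Cap total slots per exec host at its core count"
       enabled      TRUE
       limit        hosts {*} to slots=$num_proc
    }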
Exclusive complex configuration (2010 post): https://web.archive.org/web/20101027190030/http://wikis.sun.com/display/gridengine62u3/Configuring+Exclusive+Scheduling

In `qconf -mc`, add the complex:

   name        shortcut   type   relop   requestable   consumable   default   urgency
   exclusive   excl       BOOL   EXCL    YES           YES          0         1000

Then run `qconf -me <hostname>` for each host and add 'exclusive=true' to its complex_values line. Adding it to all hosts can be scripted with a for loop ... an exercise for the reader (one possible sketch follows below).
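A minimal sketch of that loop (assuming the complex is named "exclusive" as above; `-aattr` is used so any existing complex_values entries on a host survive):

    # Attach exclusive=true to every exec host in the cluster
    for HOST in $(qconf -sel); do
        qconf -aattr exechost complex_values exclusive=true "$HOST"
    done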
Add '-l excl=true' to your qsub command to use the complex.

Cheers,
-Hugh

On Mar 2, 2016, at 18:25, Michael Stauffer <mgsta...@gmail.com> wrote:

On Tue, Mar 1, 2016 at 6:19 PM, Reuti <re...@staff.uni-marburg.de> wrote:

Hi,

Am 01.03.2016 um 23:44 schrieb Michael Stauffer:

> SoGE 8.1.8
>
> I need to reboot my compute nodes after the glibc patch, and wanted to do so nicely, i.e. wait for each node's jobs to finish before rebooting. I've done this before and it worked, but now my setup is a little more complicated and I changed my reinstall script.
>
> I have a queue for qsub jobs and one for qlogin. Each is assigned a different number of cores per node, so that some nodes always have at least a couple of cores available for qlogin sessions, and some nodes are used only for qsub jobs.
>
> However, my reinstall script (taken from the sge examples, listed below) does its thing by submitting a job that requests all the cores on a node, so it only runs when other jobs have completed. So I created a new queue called reboot.q that is allotted all cores on all nodes. My understanding was that the queues would cooperatively manage resources, so if a node was using, for example, 8 cores for jobs on my qsub queue, then my reboot job requesting 16 cores would wait until those jobs finish.

Did you limit the overall slot count across all queues by a consumable complex on an exechost level ("complex_values slots=8") and/or with an RQS? Otherwise each queue can use the full slot count from its own queue definition (and essentially overload the nodes).

No, at least not knowingly. I should do this for regular usage as well, to avoid overloading. How do I actually do this? From what you say, I don't know how to actually set it up. My queues look like this for 'slots' (e.g. for the qsub queue):

slots                 1,[compute-0-0.local=0],[compute-0-1.local=15], \
                      [compute-0-2.local=15],[compute-0-3.local=15], \
                      [compute-0-4.local=16],[compute-0-5.local=16], \
                      [compute-0-6.local=16],[compute-0-7.local=16], \
                      [compute-0-9.local=16],[compute-0-10.local=16], \
                      [compute-0-11.local=16],[compute-0-12.local=16], \
                      [compute-0-13.local=16],[compute-0-14.local=16], \
                      [compute-0-15.local=16],[compute-0-16.local=16], \
                      [compute-0-17.local=16],[compute-0-18.local=16], \
                      [compute-0-8.local=16],[compute-0-19.local=16], \
                      [compute-0-20.local=16]
complex_values        NONE

Do I do something similar for the complex_values parameter?

> But when I ran my script, all nodes rebooted for reinstall immediately. I guess I don't understand things correctly? Can someone set me straight? How do I do a node reboot only after jobs have finished under these circumstances?

What about attaching the "exclusive" complex (it needs to be defined manually in `qconf -mc`) to each exechost and requesting it when submitting the reboot job? Even one slot would then be enough to get exclusive access to each node.

This sounds great. Can you give me details on how to do this? What values are needed for the complex configuration params? Something like this?

   name        shortcut   type   relop   requestable   consumable   default   urgency
   exclusive   ex         BOOL   ==      YES           NO           0         0

How is it attached to each exechost?

Thanks very much.

-M

--
Reuti

> script:
>
> ME=`hostname`
> EXECHOSTS=`qconf -sel`
>
> for TARGETHOST in $EXECHOSTS; do
>     if [ "$ME" == "$TARGETHOST" ]; then
>         echo "Skipping $ME. This is the submission host"
>     else
>         # Number of cores on the target node, from its exec host definition
>         numprocs=`qconf -se $TARGETHOST | awk '/^processors/ {print $2}'`
>         # Tell Rocks to reinstall the node on its next boot
>         /opt/rocks/bin/rocks set host boot $TARGETHOST action=install
>         # Submit a job requesting every core, so it waits for running jobs
>         qsub -p 1024 -pe unihost $numprocs -binding linear:${numprocs} \
>             -q reboot.q@$TARGETHOST /root/admin/scripts/sge-reboot.qsub
>         echo "Set $TARGETHOST for Reinstallation"
>     fi
> done
>
> Thanks
> -M
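For what it's worth, once the exclusive complex from the reply above is attached to the hosts, the qsub line in that script no longer needs to request every core; a sketch, reusing the script's own queue name and paths:

    # One slot plus exclusive access is enough to wait out running jobs
    qsub -p 1024 -l excl=true -q reboot.q@$TARGETHOST \
        /root/admin/scripts/sge-reboot.qsub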
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users