> I'll let you know what happens. I got a chance to try things out on a
> Xen mimic of the grid, and starting up a new execd does seem to allow one
> to carry on using the resource on which you have orphaned jobs by taking
> out the original execd.
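(For concreteness: one way to take an execd out from under its jobs, and to
bring a fresh one up later, is along the lines below. The hostname is a
placeholder, and the exact start-up incantation depends on how your Grid
Engine was installed, so treat this as a sketch rather than a recipe.)

  # from an admin host: shut the execd on the node down, but do NOT use
  # -kej, so the sge_shepherd processes, and hence the jobs, keep running
  qconf -ke node1.example.org

  # later, on the node itself (as root), start a fresh execd, which will
  # read whatever (possibly modified) configuration now applies to it
  . $SGE_ROOT/default/common/settings.sh
  sge_execd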
A full write-up of my testing can be found at

  http://homepages.ecs.vuw.ac.nz/~kevin/forSGE/Extending_Grid_Engine_Runtimes_with_an_execd_softstop.html

but the salient points follow, to keep things in the thread.

In between the softstop and the restart, replace the execute host's
configuration, which just had these defaults:

  execd_spool_dir  /var/opt/gridengine/default/spool
  gid_range        20000-20100

by creating a local conf for it:

  qconf -mconf localnode

with new values as follows:

  execd_spool_dir  /var/opt/gridengine/default/spool2
  gid_range        20101-20200

The restart even creates the new spool directory.

A qstat still shows the job on that node, with a slot taken:

  # qstat -f -u \*
  queuename                      qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/0/1          0.00     lx24-amd64
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/1/1          0.00     lx24-amd64
        7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

A pstree shows a new execd tree and the orphaned job:

  +-sge_execd---4*[{sge_execd}]
  +-sge_shepherd---sh---sleep

Even altering the configuration to add another slot works (a sketch of the
sort of qconf call involved follows the listings below):

  # qstat -f -u \*
  queuename                      qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/0/1          0.00     lx24-amd64
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/1/2          0.00     lx24-amd64
        7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

Submitting another job to the same queue then sees:

  job-ID  prior   name     user     state submit/start at     queue             slots ja-task-ID
  -----------------------------------------------------------------------------------------------
       7 0.55500 qsub2.sh buckleke     r  06/17/2012 12:00:05 [email protected]      1
       8 0.55500 qsub3.sh buckleke     r  06/17/2012 12:07:05 [email protected]      1

with the pstree showing both:

  +-sge_execd-+-sge_shepherd---sh---sleep
  |           +-4*[{sge_execd}]
  +-sge_shepherd---sh---sleep

and with the Grid Engine now believing that both slots are used:

  # qstat -f -u \*
  queuename                      qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/0/1          0.00     lx24-amd64
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/2/2          0.01     lx24-amd64
        7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1
        8 0.55500 qsub3.sh   buckleke     r     06/17/2012 12:07:05     1
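(For anyone following along, adding the extra slot needs nothing fancier
than a qconf tweak of roughly this form, and the second job was just
another trivial sleeper script; the queue name here is a placeholder for
ours:)

  # raise the slot count on that queue instance from 1 to 2
  qconf -mattr queue slots 2 all.q@node1.example.org

  # and throw another job at it
  qsub -q all.q@node1.example.org qsub3.sh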
Eventually, the newer job stops as normal, yet the qmaster thinks the old
one is still running, even though it has finished:

  # qstat -f -u \*
  queuename                      qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/0/1          0.00     lx24-amd64
  ---------------------------------------------------------------------------------
  [email protected]              BIP   0/1/2          0.00     lx24-amd64
        7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

and the Grid Engine knows nothing about it finishing either:

  # qacct -j 7
  error: job id 7 not found

and nor does the user looking for their job:

  $ qstat
  job-ID  prior   name     user     state submit/start at     queue             slots ja-task-ID
  -----------------------------------------------------------------------------------------------
       7 0.55500 qsub2.sh buckleke     r  06/17/2012 12:00:05 [email protected]      1

even though that job has run its course on the node we mangled, with a
pstree there now only showing:

  +-sge_execd---4*[{sge_execd}]

To get back to the "original" environment, we "softstop" the new execd,
although, with no jobs running on it, we could just stop it. We then modify
the execd's conf back to what it was (in this case, the defaults, so we
could just delete the local config) and start an execd up again.

The system now thinks the job that was orphaned finished when it did
(after 10 minutes):

  qsub_time    Sun Jun 17 11:59:53 2012
  start_time   Sun Jun 17 12:00:05 2012
  end_time     Sun Jun 17 12:10:05 2012

This will get my user out of a major bind, so thanks to all for the
insight and feedback.

Kevin Buckley
ECS, VUW, NZ
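PS. For anyone wanting to reproduce the configuration swap and the later
revert, the qconf incantations were essentially these (again, the hostname
is a placeholder, and your spool paths will differ):

  # inspect the node's current local configuration, if it has one
  qconf -sconf node1.example.org

  # create or edit the local configuration (qconf -aconf if the node has
  # no local conf yet), pointing execd_spool_dir at the second spool area
  # and giving the new execd a disjoint gid_range
  qconf -mconf node1.example.org

  # ... and, to revert afterwards, delete the local configuration so the
  # node falls back to the global defaults, then restart the execd
  qconf -dconf node1.example.org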
